Methodology & Data Lineage
How the Execution Intelligence platform captures, classifies, and quantifies operational data, grouped by topic so you can jump to the section most relevant to your question.
Data Foundations
Data Sources
Inside-out Pyze JS + outside-in Celonis Task Mining
Confidence Model
How we filter events before computing metrics
Every event is assigned a confidence level indicating the reliability of its case association:
| Level | Meaning | Events |
|---|---|---|
| High | Case ID extracted directly from URL/DOM (Pyze JS) | 28,341 |
| Medium-High | Strong swivel-chair attribution (short gap, same user) | 3,554 |
| Medium | Probable swivel-chair attribution | 3,693 |
| Low | Weak attribution (long gap or ambiguous context) | 20,424 |
All agent analyses include High, Medium-High, and Medium confidence events (35,588 events, 63.5%). Low confidence events are excluded to ensure metric accuracy. Exception: the AI Effectiveness (gpteal) analysis includes all confidence levels because we're measuring tool adoption behavior, not case-level accuracy.
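In practice the filter is a simple level check applied before any metric is computed. A minimal sketch, assuming events arrive as a table with a confidence column (the column name and pandas layout are illustrative, not the platform's actual schema):

```python
import pandas as pd

# Confidence levels whose case association is reliable enough for case-level metrics.
ANALYSIS_LEVELS = {"High", "Medium-High", "Medium"}

def filter_for_analysis(events: pd.DataFrame, include_low: bool = False) -> pd.DataFrame:
    """Return the event subset used for metric computation.

    include_low=True mirrors the exception above: the AI Effectiveness (gpteal)
    analysis keeps every confidence level because it measures adoption behavior,
    not case-level accuracy.
    """
    if include_low:
        return events
    return events[events["confidence"].isin(ANALYSIS_LEVELS)]
```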
Handling Time & Cycle Time Definitions
How we separate active work from wait time
Handling time = sum of active dwell time per case per user. Any gap exceeding 5 minutes between consecutive events on the same case is classified as idle/wait time (not active touch). This threshold captures context-switching breaks while excluding brief pauses for reading or thinking.
Cycle time = wall-clock duration from first to last event on a case.
The ratio handling / cycle measures "touch efficiency" — what percentage of elapsed time involves active work.
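A minimal sketch of how the three quantities relate for a single case, assuming its events are sorted by timestamp. The platform sums per-event dwell (total_dwell_ms); this simplified version approximates active dwell from inter-event gaps under the 5-minute threshold:

```python
from datetime import datetime, timedelta

IDLE_THRESHOLD = timedelta(minutes=5)  # gaps longer than this count as idle/wait

def case_metrics(timestamps: list[datetime]) -> dict:
    """Handling time, cycle time, and touch efficiency for one case (illustrative)."""
    cycle = timestamps[-1] - timestamps[0]          # wall clock: first to last event
    handling = timedelta(0)
    for prev, curr in zip(timestamps, timestamps[1:]):
        gap = curr - prev
        if gap <= IDLE_THRESHOLD:                   # active touch
            handling += gap                         # longer gaps are excluded as wait time
    touch_efficiency = handling / cycle if cycle else 0.0
    return {"cycle": cycle, "handling": handling, "touch_efficiency": touch_efficiency}
```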
Savings Projection & Data Hygiene
How we extrapolate pilot observations
Three Savings Tiers
- Pilot Period: Hours observed during the 19-day pilot.
- Annualized (17 users): Pilot hours × (250 working days / 19 pilot days). Assumes representative volume.
- Projected (1,000 users): Annualized hours × (1,000 / 17). Linear projection — does not account for economies/diseconomies of scale.
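The arithmetic behind the three tiers above, with a placeholder pilot-hours figure (not pilot data):

```python
PILOT_DAYS, WORKING_DAYS = 19, 250
PILOT_USERS, SCALED_USERS = 17, 1_000

def savings_tiers(pilot_hours: float) -> dict:
    """Extrapolate pilot-period hours to the annualized and scaled tiers."""
    annualized = pilot_hours * (WORKING_DAYS / PILOT_DAYS)   # 17 users, full year
    projected = annualized * (SCALED_USERS / PILOT_USERS)    # linear scale-up to 1,000 users
    return {"pilot": pilot_hours, "annualized_17": annualized, "projected_1000": projected}

# Example: savings_tiers(100) -> {pilot: 100, annualized_17: ~1,316, projected_1000: ~77,399}
```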
Data Hygiene
Events from personal browsing domains (social media, shopping, personal finance) are filtered from all analyses. All user identities are anonymized to sequential labels (Analyst 01 through Analyst 17). No personally identifiable information is included.
Opportunities vs Benchmarks
Every finding surfaced by an agent is classified as either an Opportunity or a Benchmark. The distinction drives how it flows through the platform — savings totals, lifecycle workflow, and what the team should do with it.
Opportunity
An actionable finding that proposes a specific remediation and has measurable savings in hours and dollars.
Treatment on the platform
- Shows projected savings at pilot, 17-user, and 1,000-user scale
- Receives a composite risk factor and Automation Readiness Score
- Enters the four-state lifecycle (Surfaced → Accepted → Remediating → Remediated)
- Counts toward dashboard savings totals
Example findings
- High-Latency Handoffs — fix: automated task notifications
- Delete-Confirm Loop — fix: bulk delete UX
- gpteal Productivity Uplift — fix: close adoption gap
Benchmark
A measurement or observation that informs strategy but doesn't propose a concrete fix on its own. It provides the context that makes opportunities credible.
Treatment on the platform
- Displays a headline metric instead of a savings number
- Does not enter the lifecycle (no Accept / Remediate workflow)
- Does not count toward dashboard savings totals
- Still carries validation context and ties to the theme it informs
Example findings
- gpteal Adoption Dashboard — 53% adoption rate
- Touch Efficiency Ratio by Case Type — handling/cycle ratio
- Stage-Level Handling Time — effort concentration map
How we classify
An agent's output becomes an Opportunity only if it passes all three tests:
- Specific remediation — the finding points to a concrete fix (RPA bot, UI change, integration, AI agent deployment)
- Measurable savings — we can project hours and dollars recovered at pilot, annual, and scaled volume
- Implementation clarity — a delivery team could take the finding and translate it into a scoped project
If any test fails, the finding is classified as a Benchmark. Benchmarks are intentionally kept out of savings totals so the commitment-ready numbers stay honest, but they're prominently displayed alongside opportunities because the measurement context is what makes an opportunity credible. An opportunity saying "automate narrative drafting" is much stronger when paired with a benchmark showing "users spend 22.8s per narrative interaction"; the two are designed to work together.
Autonomous Agents
Each agent is a self-contained analyzer with its own detection logic, signals, and scoring approach.
Rework
Detects cross-application round-trip patterns
What It Detects
Case-bound round-trip patterns where an analyst leaves the system of record, performs work in a supporting app, and returns — with the round trip repeating multiple times on the same case. These patterns signal missing integrations, UI gaps, or workflow habits that create hidden friction.
Signals It Analyzes
Events are ordered by timestamp within each case, then scanned for A→B→A sequences where A is an instrumented app and B is the "detour" app. Transitions with less than 2 seconds of dwell in the detour app are filtered out as navigation artifacts rather than real work.
How It Scores
Each rework pattern is scored by: frequency × dwell_in_detour_app × cases_affected. Patterns that repeat across many analysts and cases rank highest.
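A minimal sketch of the scan and the ranking formula, assuming each event is an (app, dwell_seconds) pair ordered by timestamp within one case (the data shapes are illustrative):

```python
from collections import defaultdict

MIN_DETOUR_DWELL_S = 2.0  # shorter detours are treated as navigation artifacts

def round_trips(events: list[tuple[str, float]]) -> dict[tuple[str, str], float]:
    """Total detour dwell per (system_of_record, detour_app) pair for one case."""
    trips: dict[tuple[str, str], float] = defaultdict(float)
    for (a1, _), (b, dwell_b), (a2, _) in zip(events, events[1:], events[2:]):
        if a1 == a2 and b != a1 and dwell_b >= MIN_DETOUR_DWELL_S:  # A -> B -> A
            trips[(a1, b)] += dwell_b
    return dict(trips)

def pattern_score(frequency: int, detour_dwell_s: float, cases_affected: int) -> float:
    # Ranking: frequency x dwell in the detour app x cases affected
    return frequency * detour_dwell_s * cases_affected
```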
Example Pattern from the Merck Pilot
Cycle Time
Measures wall-clock case duration and handoff latency
What It Detects
Cases that take longer than they should — through stage-level wait time, inter-user handoff latency, or extended "background" sessions. Distinguishes active work from idle time to separate "we're working on it slowly" from "it's sitting in a queue."
Signals It Analyzes
For each case, computes: total wall-clock duration, per-stage span (using PV stage mapping), and inter-user wait time (from the handoff analysis table). Case types are parsed from page_title to segment by SUSAR / SAE / AE and Initial / Follow-up.
How It Scores
Findings are ranked by: median wait time in the problematic stage, frequency of cases affected, and variance relative to peer cases of the same type. Outlier cases (e.g., 300+ hours idle) are surfaced as specific examples.
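A minimal sketch of the per-case wall-clock computation and outlier flagging, assuming a pandas table with case_id and event_time columns (illustrative names; the 300-hour threshold echoes the outlier example above):

```python
import pandas as pd

def cycle_times(events: pd.DataFrame, outlier_hours: float = 300.0) -> pd.DataFrame:
    """Wall-clock cycle time per case, flagging extreme cases for manual review."""
    spans = events.groupby("case_id")["event_time"].agg(["min", "max"])
    spans["cycle_hours"] = (spans["max"] - spans["min"]).dt.total_seconds() / 3600
    spans["outlier"] = spans["cycle_hours"] >= outlier_hours
    return spans.sort_values("cycle_hours", ascending=False)
```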
Example Pattern from the Merck Pilot
Handling Time
Computes active touch time and peer-benchmarks analysts
What It Detects
Where effort concentrates across PV stages and which analysts handle similar work fastest. Unlike cycle time (wall-clock), handling time captures only active, focused interaction: the time the keyboard or mouse is actively engaged with the case.
Signals It Analyzes
Sums total_dwell_ms per case per user, grouped by PV stage. The 5-minute idle threshold separates active touch from wait time. Compares analysts on the same case types to identify peer benchmarks.
How It Scores
Findings surface: touch efficiency (handling ÷ cycle) by case type, analyst variance (best vs median on like cases), and stage-level effort concentration. A 25th-percentile target is used to project savings from bringing slower analysts to peer pace.
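A minimal sketch of the peer-pace projection, assuming per-analyst averages for one case type (the input shapes are illustrative):

```python
import numpy as np

def peer_pace_savings(handling_hours: dict[str, float],
                      cases_per_analyst: dict[str, int]) -> float:
    """Hours recoverable if every analyst matched the 25th-percentile handling time."""
    target = float(np.percentile(list(handling_hours.values()), 25))  # peer benchmark
    savings = 0.0
    for analyst, avg_hours in handling_hours.items():
        excess = max(avg_hours - target, 0.0)   # only slower-than-benchmark analysts contribute
        savings += excess * cases_per_analyst.get(analyst, 0)
    return savings
```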
Example Pattern from the Merck Pilot
Automation
Mines repetitive, deterministic click sequences
What It Detects
Click sequences that recur across many cases and analysts with deterministic outcomes — the clearest RPA candidates. Focuses on patterns where the same sequence always produces the same result, distinguishing them from judgment-heavy work (which is the AI Discovery agent's territory).
Signals It Analyzes
Extracts activity n-grams (length 2-6) within cases from the event log, and fine-grained pyzeClick sequences from the drilldown. Sequences are scored by frequency, estimated time per execution, and breadth of use across analysts.
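A minimal sketch of the n-gram mining step for one case's ordered activity labels (the counting is straightforward; ranking and time estimation happen downstream):

```python
from collections import Counter

def activity_ngrams(activities: list[str], min_n: int = 2, max_n: int = 6) -> Counter:
    """Count every contiguous activity sequence of length 2-6 within a case."""
    grams: Counter = Counter()
    for n in range(min_n, max_n + 1):
        for i in range(len(activities) - n + 1):
            grams[tuple(activities[i:i + n])] += 1
    return grams
```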
How It Scores
Each candidate is assigned an Automation Readiness Score (0-100) across four weighted factors:
- Pattern Frequency (30%) — volume of repetition
- Decision Complexity (30%) — RPA-simple vs AI-hard
- Data Structure (20%) — structured forms vs unstructured content
- Cross-App Scope (20%) — single app (easy) vs multi-system (harder)
See Automation Readiness Score for full scoring detail.
Example Pattern from the Merck Pilot
AI Discovery
Identifies judgment-heavy work where AI agents can assist
What It Detects
Work that is not a deterministic loop but is judgment-heavy — drafting, classification, coding, translation. The complement to the Automation agent: these are the patterns where RPA fails but AI agents can augment human analysts.
Signals It Analyzes
Flags activities with high dwell + high edits + low clicks (signature of thinking/writing work). Detects cross-app patterns involving Word, Outlook, or Acrobat tied to narrative or assessment stages. Surfaces existing gpteal usage as proof that users are already self-serving AI.
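A minimal sketch of the judgment-work signature check. The thresholds below are placeholders for illustration, not the agent's tuned values:

```python
def looks_like_judgment_work(dwell_ms: int, edits: int, clicks: int) -> bool:
    """High dwell + high edits + low clicks: the signature of thinking/writing work."""
    return dwell_ms >= 60_000 and edits >= 10 and clicks <= 5
```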
How It Scores
Candidates are classified by AI capability type: Summarization (source documents), Drafting (narratives), Classification (case triage), Translation (localized cases), Coding (MedDRA lookups). Each gets its own Automation Readiness Score tuned for AI Agent remediation (lower structure, higher complexity).
Example Pattern from the Merck Pilot
AI Effectiveness
Measures adoption and productivity uplift from GenAI tools
What It Detects
How Merck's existing GenAI tool (gpteal) is being used — who adopts it, how often, on which case types, and whether adoption correlates with measurable productivity gains. Answers "is our AI investment actually landing?" before expanding rollout.
Signals It Analyzes
Filters events where source_app matches gpteal domains (gpteal.merck.com, dtgpteal.merck.com, talkgpteal.merck.com). Uses the events_all view (includes Low confidence) because gpteal usage often appears as swivel-chair events with weaker case linkage. Computes per-analyst adoption tiers, cohort comparisons, and retention signals.
How It Scores
Per-user tier classification (Power / Regular / Light / Minimal / Non-Adopter) based on event count and active days. Cohort comparison between adopters and non-adopters on cases-per-day productivity — the savings opportunity quantifies the gap if non-adopters matched adopter throughput.
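A minimal sketch of the tier classification from event count and active days. The cut-offs below are placeholders; the actual thresholds are calibrated to pilot volumes:

```python
def adoption_tier(event_count: int, active_days: int) -> str:
    """Classify a user's gpteal adoption tier (illustrative thresholds)."""
    if event_count == 0:
        return "Non-Adopter"
    if event_count >= 100 and active_days >= 10:
        return "Power"
    if event_count >= 30 and active_days >= 5:
        return "Regular"
    if event_count >= 10:
        return "Light"
    return "Minimal"
```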
Example Pattern from the Merck Pilot
Scoring Frameworks
Automation Readiness Score
0-100 score quantifying how automation-ready each opportunity is
Every opportunity surfaced by the Automation and AI Discovery agents receives a 0-100 Automation Readiness Score that combines four independently measured factors via a weighted average.
| Factor | Weight | What It Measures | How It's Computed |
|---|---|---|---|
| Pattern Frequency | 30% | How often the pattern repeats | Bucketed by annualized volume: >1,000 hrs/yr = 95, >500 = 80, >100 = 60, else 40 |
| Decision Complexity | 30% | Deterministic vs judgment-heavy | RPA = 90, UX = 75, Integration = 60, AI Agent = 30 |
| Data Structure | 20% | Structured vs unstructured inputs | RPA/UX = 90, Integration = 70, AI Agent = 40 |
| Cross-App Scope | 20% | Single app vs multi-system | Single = 90, cross-app = 60, multi-system = 40 (from finding text) |
Score Bands
- Very High 80-100 — ready to automate with high confidence
- High 60-79 — strong candidate; minor discovery needed
- Medium 40-59 — partial automation possible
- Low <40 — AI-assisted, not fully automated
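A minimal sketch of the weighted average and band mapping, taking the four factor scores (each already bucketed per the table above) as inputs:

```python
WEIGHTS = {"frequency": 0.30, "complexity": 0.30, "structure": 0.20, "scope": 0.20}

def readiness_score(factors: dict[str, float]) -> tuple[float, str]:
    """Combine the four 0-100 factor scores into an Automation Readiness Score and band."""
    score = sum(factors[name] * weight for name, weight in WEIGHTS.items())
    if score >= 80:
        band = "Very High"
    elif score >= 60:
        band = "High"
    elif score >= 40:
        band = "Medium"
    else:
        band = "Low"
    return score, band

# Example: an RPA candidate at >500 hrs/yr in a single app:
# readiness_score({"frequency": 80, "complexity": 90, "structure": 90, "scope": 90}) -> (87.0, "Very High")
```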
Risk Adjustment — Confidence-Weighted Savings
Four-dimension weighting that converts unadjusted savings into business-case-ready numbers
Unadjusted savings estimates answer "what is the theoretical maximum if every opportunity is fully realized?" Risk-adjusted savings answer a more honest question: "what should we reasonably expect given real-world constraints?"
Each opportunity is scored High (1.0) / Medium (0.8) / Low (0.5) across four dimensions. The composite factor multiplies the unadjusted savings.
| Dimension | Weight | High (1.0) | Medium (0.8) | Low (0.5) |
|---|---|---|---|---|
| Detection Confidence | 40% | Strong statistical signal | Clear pattern, limited sample | Suggestive only |
| Implementation Feasibility | 25% | Proven approach | Custom integration work | Novel AI/ML build |
| Adoption Readiness | 20% | Invisible to user | Similar workflow | Significant behavior change |
| Compliance Path | 15% | Light validation | Standard CSV | Full re-validation |
Composite factor = (Detection × 0.40) + (Feasibility × 0.25) + (Adoption × 0.20) + (Compliance × 0.15)
Risk-adjusted annual savings = Unadjusted annual savings × Composite factor
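A minimal sketch of the two formulas, using the dimension weights and level values from the table above:

```python
LEVEL = {"High": 1.0, "Medium": 0.8, "Low": 0.5}
DIM_WEIGHTS = {"detection": 0.40, "feasibility": 0.25, "adoption": 0.20, "compliance": 0.15}

def risk_adjusted(unadjusted_annual: float, ratings: dict[str, str]) -> tuple[float, float]:
    """Return (composite factor, risk-adjusted annual savings)."""
    composite = sum(LEVEL[ratings[dim]] * weight for dim, weight in DIM_WEIGHTS.items())
    return composite, unadjusted_annual * composite

# Example: High detection, Medium feasibility, Medium adoption, Low compliance:
# composite = 1.0*0.40 + 0.8*0.25 + 0.8*0.20 + 0.5*0.15 = 0.835
```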
Operational Framework
Opportunity Lifecycle & Value Realization
Surfaced → Accepted → Remediating → Remediated → Monitored
Discovery is only the first step. Every opportunity moves through a four-state lifecycle (Surfaced, Accepted, Remediating, Remediated), with Declined as an off-ramp and continuous monitoring after remediation, so teams can see what has been acted on, what is in flight, and whether expected savings are actually being captured.
| State | Who Owns It | What the Platform Does |
|---|---|---|
| Surfaced | BA triage | Continues collecting evidence; readiness score updates as data arrives |
| Accepted | BA / Ops lead | Snapshots baseline metrics for later comparison |
| Remediating | Implementation team | Monitors for early behavioral change pre-deployment |
| Remediated | Ops lead / finance | Continuously measures actual hours saved vs projected |
| Declined | Governance | Pattern stays monitored; re-surfaced if material growth |
Value Realization Monitoring
Every Remediated opportunity enters continuous post-implementation monitoring. The platform compares three measurements against the locked baseline:
- Throughput delta — cases per day per analyst. Expected to rise after remediation.
- Handling time delta — active touch time per case. Expected to fall for the targeted pattern.
- Pattern recurrence — does the original rework/loop/handoff pattern still appear?
Actual savings are reported weekly against the projected estimate. If realized value is below forecast after 90 days, the opportunity is flagged — the agent re-analyzes and surfaces any secondary patterns blocking full realization.
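A minimal sketch of the weekly check against the locked baseline. The field names and data shapes are assumptions for illustration, not the platform's actual API:

```python
def realization_check(baseline: dict, current: dict, projected_hours: float,
                      actual_hours: float, days_since_remediation: int) -> dict:
    """Compare post-remediation measurements with the baseline snapshot."""
    return {
        "throughput_delta": current["cases_per_day"] - baseline["cases_per_day"],        # expected to rise
        "handling_delta": current["handling_per_case"] - baseline["handling_per_case"],  # expected to fall
        "pattern_recurring": current["pattern_count"] > 0,                               # original pattern still present?
        "flag_for_reanalysis": days_since_remediation >= 90 and actual_hours < projected_hours,
    }
```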