Methodology & Data Lineage
How the Execution Intelligence platform captures, classifies, and quantifies operational data, grouped by topic so you can jump to the section most relevant to your question.
Data Foundations
Data Sources
Inside-out Pyze JS + outside-in Celonis Task Mining
Confidence Model
How we filter events before computing metrics
Every event is assigned a confidence level indicating the reliability of its case association:
| Level | Meaning | Events |
|---|---|---|
| High | Case ID extracted directly from URL/DOM (Pyze JS) | 28,341 |
| Medium-High | Strong swivel-chair attribution (short gap, same user) | 3,554 |
| Medium | Probable swivel-chair attribution | 3,693 |
| Low | Weak attribution (long gap or ambiguous context) | 20,424 |
All agent analyses include High, Medium-High, and Medium confidence events (35,588 events, 63.5%). Low confidence events are excluded to ensure metric accuracy. Exception: the AI Effectiveness (gpteal) analysis includes all confidence levels because we're measuring tool adoption behavior, not case-level accuracy.
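In practice the filter is a simple level check applied before any metric is computed. A minimal sketch, assuming events arrive as a table with a confidence column (the column name and pandas layout are illustrative, not the platform's actual schema):

```python
import pandas as pd

# Confidence levels whose case association is reliable enough for case-level metrics.
ANALYSIS_LEVELS = {"High", "Medium-High", "Medium"}

def filter_for_analysis(events: pd.DataFrame, include_low: bool = False) -> pd.DataFrame:
    """Return the event subset used for metric computation.

    include_low=True mirrors the exception above: the AI Effectiveness (gpteal)
    analysis keeps every confidence level because it measures adoption behavior,
    not case-level accuracy.
    """
    if include_low:
        return events
    return events[events["confidence"].isin(ANALYSIS_LEVELS)]
```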
Handling Time & Cycle Time Definitions
How we separate active work from wait time
Handling time = sum of active dwell time per case per user. Any gap exceeding 5 minutes between consecutive events on the same case is classified as idle/wait time (not active touch). This threshold captures context-switching breaks while excluding brief pauses for reading or thinking.
Cycle time = wall-clock duration from first to last event on a case.
The ratio handling / cycle measures "touch efficiency" — what percentage of elapsed time involves active work.
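A minimal sketch of how the three quantities relate for a single case, assuming its events are sorted by timestamp. The platform sums per-event dwell (total_dwell_ms); this simplified version approximates active dwell from inter-event gaps under the 5-minute threshold:

```python
from datetime import datetime, timedelta

IDLE_THRESHOLD = timedelta(minutes=5)  # gaps longer than this count as idle/wait

def case_metrics(timestamps: list[datetime]) -> dict:
    """Handling time, cycle time, and touch efficiency for one case (illustrative)."""
    cycle = timestamps[-1] - timestamps[0]          # wall clock: first to last event
    handling = timedelta(0)
    for prev, curr in zip(timestamps, timestamps[1:]):
        gap = curr - prev
        if gap <= IDLE_THRESHOLD:                   # active touch
            handling += gap                         # longer gaps are excluded as wait time
    touch_efficiency = handling / cycle if cycle else 0.0
    return {"cycle": cycle, "handling": handling, "touch_efficiency": touch_efficiency}
```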
Savings Projection & Data Hygiene
How we extrapolate pilot observations
Three Savings Tiers
- Pilot Period: Hours observed during the 19-day pilot.
- Annualized (17 users): Pilot hours × (250 working days / 19 pilot days). Assumes representative volume.
- Projected (1,000 users): Annualized hours × (1,000 / 17). Linear projection — does not account for economies/diseconomies of scale.
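The arithmetic behind the three tiers above, with a placeholder pilot-hours figure (not pilot data):

```python
PILOT_DAYS, WORKING_DAYS = 19, 250
PILOT_USERS, SCALED_USERS = 17, 1_000

def savings_tiers(pilot_hours: float) -> dict:
    """Extrapolate pilot-period hours to the annualized and scaled tiers."""
    annualized = pilot_hours * (WORKING_DAYS / PILOT_DAYS)   # 17 users, full year
    projected = annualized * (SCALED_USERS / PILOT_USERS)    # linear scale-up to 1,000 users
    return {"pilot": pilot_hours, "annualized_17": annualized, "projected_1000": projected}

# Example: savings_tiers(100) -> {pilot: 100, annualized_17: ~1,316, projected_1000: ~77,399}
```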
Data Hygiene
Events from personal browsing domains (social media, shopping, personal finance) are filtered from all analyses. All user identities are anonymized to sequential labels (Analyst 01 through Analyst 17). No personally identifiable information is included.
Opportunities vs Benchmarks
Every finding surfaced by an agent is classified as either an Opportunity or a Benchmark. The distinction drives how it flows through the platform — savings totals, lifecycle workflow, and what the team should do with it.
Opportunity
An actionable finding that proposes a specific remediation and has measurable savings in hours and dollars.
Treatment on the platform
- Shows projected savings at pilot, 17-user, and 1,000-user scale
- Receives a composite risk factor and Automation Readiness Score
- Enters the four-state lifecycle (Surfaced → Accepted → Remediating → Remediated)
- Counts toward dashboard savings totals
Example findings
- High-Latency Handoffs — fix: automated task notifications
- Delete-Confirm Loop — fix: bulk delete UX
- gpteal Productivity Uplift — fix: close adoption gap
Benchmark
A measurement or observation that informs strategy but doesn't propose a concrete fix on its own. It provides the context that makes opportunities credible.
Treatment on the platform
- Displays a headline metric instead of a savings number
- Does not enter the lifecycle (no Accept / Remediate workflow)
- Does not count toward dashboard savings totals
- Still carries validation context and ties to the theme it informs
Example findings
- gpteal Adoption Dashboard — 53% adoption rate
- Touch Efficiency Ratio by Case Type — handling/cycle ratio
- Stage-Level Handling Time — effort concentration map
How we classify
An agent's output becomes an Opportunity only if it passes all three tests:
- Specific remediation — the finding points to a concrete fix (RPA bot, UI change, integration, AI agent deployment)
- Measurable savings — we can project hours and dollars recovered at pilot, annual, and scaled volume
- Implementation clarity — a delivery team could take the finding and translate it into a scoped project
If any test fails, the finding is classified as a Benchmark. Benchmarks are intentionally kept out of savings totals so the commitment-ready numbers stay honest, but they're prominently displayed alongside opportunities because the measurement context is what makes an opportunity credible. An opportunity saying "automate narrative drafting" is much stronger when paired with a benchmark showing "users spend 22.8s per narrative interaction"; the two are designed to work together.
Autonomous Agents
Each agent is a self-contained analyzer with its own detection logic, signals, and scoring approach.
Rework
Detects cross-application round-trip patterns
What It Detects
Case-bound round-trip patterns where an analyst leaves the system of record, performs work in a supporting app, and returns — with the round trip repeating multiple times on the same case. These patterns signal missing integrations, UI gaps, or workflow habits that create hidden friction.
Signals It Analyzes
Events are ordered by timestamp within each case, then scanned for A→B→A sequences where A is an instrumented app and B is the "detour" app. Transitions with less than 2 seconds of dwell in the detour app are filtered out as navigation artifacts rather than real work.
How It Scores
Each rework pattern is scored by: frequency × dwell_in_detour_app × cases_affected. Patterns that repeat across many analysts and cases rank highest.
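A minimal sketch of the scan and the ranking formula, assuming each event is an (app, dwell_seconds) pair ordered by timestamp within one case (the data shapes are illustrative):

```python
from collections import defaultdict

MIN_DETOUR_DWELL_S = 2.0  # shorter detours are treated as navigation artifacts

def round_trips(events: list[tuple[str, float]]) -> dict[tuple[str, str], float]:
    """Total detour dwell per (system_of_record, detour_app) pair for one case."""
    trips: dict[tuple[str, str], float] = defaultdict(float)
    for (a1, _), (b, dwell_b), (a2, _) in zip(events, events[1:], events[2:]):
        if a1 == a2 and b != a1 and dwell_b >= MIN_DETOUR_DWELL_S:  # A -> B -> A
            trips[(a1, b)] += dwell_b
    return dict(trips)

def pattern_score(frequency: int, detour_dwell_s: float, cases_affected: int) -> float:
    # Ranking: frequency x dwell in the detour app x cases affected
    return frequency * detour_dwell_s * cases_affected
```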
Example Pattern from the Merck Pilot
Cycle Time
Measures wall-clock case duration and handoff latency
What It Detects
Cases that take longer than they should — through stage-level wait time, inter-user handoff latency, or extended "background" sessions. Distinguishes active work from idle time to separate "we're working on it slowly" from "it's sitting in a queue."
Signals It Analyzes
For each case, computes: total wall-clock duration, per-stage span (using PV stage mapping), and inter-user wait time (from the handoff analysis table). Case types are parsed from page_title to segment by SUSAR / SAE / AE and Initial / Follow-up.
How It Scores
Findings are ranked by: median wait time in the problematic stage, frequency of cases affected, and variance relative to peer cases of the same type. Outlier cases (e.g., 300+ hours idle) are surfaced as specific examples.
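A minimal sketch of the per-case wall-clock computation and outlier flagging, assuming a pandas table with case_id and event_time columns (illustrative names; the 300-hour threshold echoes the outlier example above):

```python
import pandas as pd

def cycle_times(events: pd.DataFrame, outlier_hours: float = 300.0) -> pd.DataFrame:
    """Wall-clock cycle time per case, flagging extreme cases for manual review."""
    spans = events.groupby("case_id")["event_time"].agg(["min", "max"])
    spans["cycle_hours"] = (spans["max"] - spans["min"]).dt.total_seconds() / 3600
    spans["outlier"] = spans["cycle_hours"] >= outlier_hours
    return spans.sort_values("cycle_hours", ascending=False)
```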
Example Pattern from the Merck Pilot
Handling Time
Computes active touch time and peer-benchmarks analysts
What It Detects
Where effort concentrates across PV stages and which analysts handle similar work fastest. Unlike cycle time (wall-clock), handling time captures only active, focused interaction: the time the keyboard or mouse is actively engaged with the case.
Signals It Analyzes
Sums total_dwell_ms per case per user, grouped by PV stage. The 5-minute idle threshold separates active touch from wait time. Compares analysts on the same case types to identify peer benchmarks.
How It Scores
Findings surface: touch efficiency (handling ÷ cycle) by case type, analyst variance (best vs median on like cases), and stage-level effort concentration. A 25th-percentile target is used to project savings from bringing slower analysts to peer pace.
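A minimal sketch of the peer-pace projection, assuming per-analyst averages for one case type (the input shapes are illustrative):

```python
import numpy as np

def peer_pace_savings(handling_hours: dict[str, float],
                      cases_per_analyst: dict[str, int]) -> float:
    """Hours recoverable if every analyst matched the 25th-percentile handling time."""
    target = float(np.percentile(list(handling_hours.values()), 25))  # peer benchmark
    savings = 0.0
    for analyst, avg_hours in handling_hours.items():
        excess = max(avg_hours - target, 0.0)   # only slower-than-benchmark analysts contribute
        savings += excess * cases_per_analyst.get(analyst, 0)
    return savings
```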
Example Pattern from the Merck Pilot
Automation
Mines repetitive, deterministic click sequences
What It Detects
Click sequences that recur across many cases and analysts with deterministic outcomes — the clearest RPA candidates. Focuses on patterns where the same sequence always produces the same result, distinguishing them from judgment-heavy work (which is the AI Discovery agent's territory).
Signals It Analyzes
Extracts activity n-grams (length 2-6) within cases from the event log, and fine-grained pyzeClick sequences from the drilldown. Sequences are scored by frequency, estimated time per execution, and breadth of use across analysts.
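A minimal sketch of the n-gram mining step for one case's ordered activity labels (the counting is straightforward; ranking and time estimation happen downstream):

```python
from collections import Counter

def activity_ngrams(activities: list[str], min_n: int = 2, max_n: int = 6) -> Counter:
    """Count every contiguous activity sequence of length 2-6 within a case."""
    grams: Counter = Counter()
    for n in range(min_n, max_n + 1):
        for i in range(len(activities) - n + 1):
            grams[tuple(activities[i:i + n])] += 1
    return grams
```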
How It Scores
Each candidate is assigned an Automation Readiness Score (0-100) across four weighted factors:
- Pattern Frequency (30%) — volume of repetition
- Decision Complexity (30%) — RPA-simple vs AI-hard
- Data Structure (20%) — structured forms vs unstructured content
- Cross-App Scope (20%) — single app (easy) vs multi-system (harder)
See Automation Readiness Score for full scoring detail.
Example Pattern from the Merck Pilot
AI Discovery
Identifies judgment-heavy work where AI agents can assist
What It Detects
Work that is not a deterministic loop but is judgment-heavy — drafting, classification, coding, translation. The complement to the Automation agent: these are the patterns where RPA fails but AI agents can augment human analysts.
Signals It Analyzes
Flags activities with high dwell + high edits + low clicks (signature of thinking/writing work). Detects cross-app patterns involving Word, Outlook, or Acrobat tied to narrative or assessment stages. Surfaces existing gpteal usage as proof that users are already self-serving AI.
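A minimal sketch of the judgment-work signature check. The thresholds below are placeholders for illustration, not the agent's tuned values:

```python
def looks_like_judgment_work(dwell_ms: int, edits: int, clicks: int) -> bool:
    """High dwell + high edits + low clicks: the signature of thinking/writing work."""
    return dwell_ms >= 60_000 and edits >= 10 and clicks <= 5
```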
How It Scores
Candidates are classified by AI capability type: Summarization (source documents), Drafting (narratives), Classification (case triage), Translation (localized cases), Coding (MedDRA lookups). Each gets its own Automation Readiness Score tuned for AI Agent remediation (lower structure, higher complexity).
Example Pattern from the Merck Pilot
AI Effectiveness
Measures adoption and productivity uplift from GenAI tools
What It Detects
How Merck's existing GenAI tool (gpteal) is being used — who adopts it, how often, on which case types, and whether adoption correlates with measurable productivity gains. Answers "is our AI investment actually landing?" before expanding rollout.
Signals It Analyzes
Filters events where source_app matches gpteal domains (gpteal.merck.com, dtgpteal.merck.com, talkgpteal.merck.com). Uses the events_all view (includes Low confidence) because gpteal usage often appears as swivel-chair events with weaker case linkage. Computes per-analyst adoption tiers, cohort comparisons, and retention signals.
How It Scores
Per-user tier classification (Power / Regular / Light / Minimal / Non-Adopter) based on event count and active days. Cohort comparison between adopters and non-adopters on cases-per-day productivity — the savings opportunity quantifies the gap if non-adopters matched adopter throughput.
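A minimal sketch of the tier classification from event count and active days. The cut-offs below are placeholders; the actual thresholds are calibrated to pilot volumes:

```python
def adoption_tier(event_count: int, active_days: int) -> str:
    """Classify a user's gpteal adoption tier (illustrative thresholds)."""
    if event_count == 0:
        return "Non-Adopter"
    if event_count >= 100 and active_days >= 10:
        return "Power"
    if event_count >= 30 and active_days >= 5:
        return "Regular"
    if event_count >= 10:
        return "Light"
    return "Minimal"
```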
Example Pattern from the Merck Pilot
Scoring Frameworks
Automation Readiness Score
0-100 score quantifying how automation-ready each opportunity is
Every opportunity surfaced by the Automation and AI Discovery agents receives a 0-100 Automation Readiness Score that combines four independently measured factors via a weighted average.
| Factor | Weight | What It Measures | How It's Computed |
|---|---|---|---|
| Pattern Frequency | 30% | How often the pattern repeats | Bucketed by annualized volume: >1,000 hrs/yr = 95, >500 = 80, >100 = 60, else 40 |
| Decision Complexity | 30% | Deterministic vs judgment-heavy | RPA = 90, UX = 75, Integration = 60, AI Agent = 30 |
| Data Structure | 20% | Structured vs unstructured inputs | RPA/UX = 90, Integration = 70, AI Agent = 40 |
| Cross-App Scope | 20% | Single app vs multi-system | Single = 90, cross-app = 60, multi-system = 40 (from finding text) |
Score Bands
- Very High 80-100 — ready to automate with high confidence
- High 60-79 — strong candidate; minor discovery needed
- Medium 40-59 — partial automation possible
- Low <40 — AI-assisted, not fully automated
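A minimal sketch of the weighted average and band mapping, taking the four factor scores (each already bucketed per the table above) as inputs:

```python
WEIGHTS = {"frequency": 0.30, "complexity": 0.30, "structure": 0.20, "scope": 0.20}

def readiness_score(factors: dict[str, float]) -> tuple[float, str]:
    """Combine the four 0-100 factor scores into an Automation Readiness Score and band."""
    score = sum(factors[name] * weight for name, weight in WEIGHTS.items())
    if score >= 80:
        band = "Very High"
    elif score >= 60:
        band = "High"
    elif score >= 40:
        band = "Medium"
    else:
        band = "Low"
    return score, band

# Example: an RPA candidate at >500 hrs/yr in a single app:
# readiness_score({"frequency": 80, "complexity": 90, "structure": 90, "scope": 90}) -> (87.0, "Very High")
```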
Risk Adjustment — Confidence-Weighted Savings
Four-dimension weighting that converts unadjusted savings into business-case-ready numbers
Unadjusted savings estimates answer "what is the theoretical maximum if every opportunity is fully realized?" Risk-adjusted savings answer a more honest question: "what should we reasonably expect given real-world constraints?"
Each opportunity is scored High (1.0) / Medium (0.8) / Low (0.5) across four dimensions. The composite factor multiplies the unadjusted savings.
| Dimension | Weight | High (1.0) | Medium (0.8) | Low (0.5) |
|---|---|---|---|---|
| Detection Confidence | 40% | Strong statistical signal | Clear pattern, limited sample | Suggestive only |
| Implementation Feasibility | 25% | Proven approach | Custom integration work | Novel AI/ML build |
| Adoption Readiness | 20% | Invisible to user | Similar workflow | Significant behavior change |
| Compliance Path | 15% | Light validation | Standard CSV | Full re-validation |
Composite factor = (Detection × 0.40) + (Feasibility × 0.25) + (Adoption × 0.20) + (Compliance × 0.15)
Risk-adjusted annual savings = Unadjusted annual savings × Composite factor
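A minimal sketch of the two formulas, using the dimension weights and level values from the table above:

```python
LEVEL = {"High": 1.0, "Medium": 0.8, "Low": 0.5}
DIM_WEIGHTS = {"detection": 0.40, "feasibility": 0.25, "adoption": 0.20, "compliance": 0.15}

def risk_adjusted(unadjusted_annual: float, ratings: dict[str, str]) -> tuple[float, float]:
    """Return (composite factor, risk-adjusted annual savings)."""
    composite = sum(LEVEL[ratings[dim]] * weight for dim, weight in DIM_WEIGHTS.items())
    return composite, unadjusted_annual * composite

# Example: High detection, Medium feasibility, Medium adoption, Low compliance:
# composite = 1.0*0.40 + 0.8*0.25 + 0.8*0.20 + 0.5*0.15 = 0.835
```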
Operational Framework
Opportunity Lifecycle & Value Realization
Surfaced → Accepted → Remediating → Remediated → Monitored
Discovery is only the first step. Every opportunity moves through a four-state lifecycle (Surfaced, Accepted, Remediating, Remediated), with Declined as an off-ramp and continuous monitoring after remediation, so teams can see what has been acted on, what is in flight, and whether expected savings are actually being captured.
| State | Who Owns It | What the Platform Does |
|---|---|---|
| Surfaced | BA triage | Continues collecting evidence; readiness score updates as data arrives |
| Accepted | BA / Ops lead | Snapshots baseline metrics for later comparison |
| Remediating | Implementation team | Monitors for early behavioral change pre-deployment |
| Remediated | Ops lead / finance | Continuously measures actual hours saved vs projected |
| Declined | Governance | Pattern stays monitored; re-surfaced if material growth |
Value Realization Monitoring
Every Remediated opportunity enters continuous post-implementation monitoring. The platform compares three measurements against the locked baseline:
- Throughput delta — cases per day per analyst. Expected to rise after remediation.
- Handling time delta — active touch time per case. Expected to fall for the targeted pattern.
- Pattern recurrence — does the original rework/loop/handoff pattern still appear?
Actual savings are reported weekly against the projected estimate. If realized value is below forecast after 90 days, the opportunity is flagged — the agent re-analyzes and surfaces any secondary patterns blocking full realization.
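A minimal sketch of the weekly check against the locked baseline. The field names and data shapes are assumptions for illustration, not the platform's actual API:

```python
def realization_check(baseline: dict, current: dict, projected_hours: float,
                      actual_hours: float, days_since_remediation: int) -> dict:
    """Compare post-remediation measurements with the baseline snapshot."""
    return {
        "throughput_delta": current["cases_per_day"] - baseline["cases_per_day"],        # expected to rise
        "handling_delta": current["handling_per_case"] - baseline["handling_per_case"],  # expected to fall
        "pattern_recurring": current["pattern_count"] > 0,                               # original pattern still present?
        "flag_for_reanalysis": days_since_remediation >= 90 and actual_hours < projected_hours,
    }
```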