AI insights for scorecards

Company: RingCentral

Product: AI Conversation Expert (ACE)

Timeline: 6 months (on and off), 2025

My role: End-to-end design lead, covering research, strategy, and design execution.

Contact center managers were drowning in manual scorecard reviews — slow, biased, and impossible to scale. I led the design of a GenAI-powered calibration system that turned a human bottleneck into an automated, high-trust workflow.

The problem

Journey and context for scorecard calibration and manual review bottleneck

ACE — RingCentral's post-call analysis tool — lets managers define scorecards to evaluate agent performance. Traditionally, every unscored call flag required manual review. As call volumes scaled, this created an unsustainable bottleneck: slow, prone to bias, and impossible to maintain at pace.

My goal was to transform scorecard calibration from a manual chore into an automated, high-trust workflow — and in doing so, leapfrog competitors while driving down cost per resolution.

Legacy scorecard table and modal components in ACE

Navigating the headwinds

Three constraints shaped how I worked throughout this project. Executives had competing priorities, so I learned to arrive with a clear recommendation rather than a menu of options. A distributed team across Spain, India, and the US pushed me toward async-first documentation where every artifact needed to be self-explanatory enough to move decisions without a meeting. And a mid-project reduction in force eliminated my dedicated PM, putting me in direct contact with leadership far more often than planned — raising the stakes on every presentation I ran.

Discovery and tradeoffs

Discovery findings: PRD framing vs real scorecard evaluation use cases

Working closely with Data Science, I found that the PRD's initial framing, a simple pass/fail scorecard evaluation, didn't hold up against real use cases. Scorecard questions were more nuanced than the PRD assumed, so the AI had to understand the manager's underlying intent.

Backend conversations surfaced two key tensions:

1. Cost vs. latency. Evaluating an entire scorecard at once was cheaper but slower. Question-by-question was faster but more expensive. As the first AI experience in the product, minimizing cost was the priority.

2. Targeting vs. coverage. Targeting meant filtering noise so the AI processed only high-relevance inputs. Coverage meant keeping the model fresh as customer needs shifted, especially during fast-moving seasonal promotions.

Design approach

I explored two directions. The first followed the PRD: a scorecard-level evaluation tagging each question as "Strong" or "Needs Work." Testing revealed a critical gap — it diagnosed problems without prescribing solutions. A pass/fail grade without a lesson plan.

First design direction: scorecard-level evaluation following the PRD's objectivity requirements

The second was more aspirational: a prompt-based approach that analyzed the manager's intent at the scorecard or question level. After alignment sessions with Product and Engineering, we landed on a hybrid — weekly, scorecard-level evaluations for the highest-traffic assets, balancing depth of analysis with low latency.

Second design direction: prompt-based question refinement

While the backend was in development, I used the time to modernize the legacy scorecard table and modal components — ensuring the AI features would launch into a polished UI. I also ran "knowledge exchange" sessions with Data Science and Content Design to translate model outputs into an in-product guide: a library of high-performing question examples managers could copy directly into their scorecards.

Incremental UI updates while the scorecard AI backend was in development

When the model was ready, I defined precise interaction details: anchor scrolling between AI suggestions and text fields, and Accept/Undo/Dismiss actions treated as signals for model refinement. I also removed the burden of manual weight calculation with auto-balancing logic, so managers no longer had to keep scorecard weights totaling 100% by hand.

Scorecard design: interaction details and AI suggestions

Scorecard design: auto-balancing and weight calculation
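
To make the auto-balancing idea concrete, here is a minimal sketch of one way such logic could work. It is not the shipped ACE implementation. The function name `rebalance`, the use of whole-number percentages, and the proportional redistribution rule are all assumptions made for illustration. When a manager edits one question's weight, the remaining questions are rescaled in proportion to their current weights, and largest-remainder rounding keeps the total at exactly 100.

```python
def rebalance(weights, edited, new_weight):
    """Set `edited` to `new_weight` and rescale the other questions so the
    weights total exactly 100. Hypothetical helper, integer percentages only."""
    if edited not in weights:
        raise KeyError(edited)
    if not 0 <= new_weight <= 100:
        raise ValueError("weight must be between 0 and 100")

    others = [q for q in weights if q != edited]
    if not others:
        # A single-question scorecard has to carry the full weight.
        return {edited: 100}

    remaining = 100 - new_weight

    # Redistribute in proportion to the current weights; if the other
    # questions all sit at 0, fall back to an even split.
    basis = {q: weights[q] for q in others}
    if sum(basis.values()) == 0:
        basis = {q: 1 for q in others}
    total = sum(basis.values())

    # Largest-remainder rounding: take the integer part of each share,
    # then hand out the leftover points to the largest fractional parts.
    shares, remainders = {}, {}
    for q in others:
        shares[q], remainders[q] = divmod(remaining * basis[q], total)
    leftover = remaining - sum(shares.values())
    ranked = sorted(others, key=lambda q: remainders[q], reverse=True)
    for q in ranked[:leftover]:
        shares[q] += 1

    return {edited: new_weight, **shares}


# Example with made-up question names: raising "greeting" from 40 to 50
# shrinks the other two questions equally.
print(rebalance({"greeting": 40, "empathy": 30, "resolution": 30}, "greeting", 50))
# {'greeting': 50, 'empathy': 25, 'resolution': 25}
```

Integer arithmetic avoids floating-point drift, so the total never lands on 99 or 101 after rounding. A real product would also need rules this sketch leaves out, such as locking individual questions against rebalancing.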

Outcomes

10+: Largest user cohort by session count, confirming habit formation

250%: YoY growth for ACE: 1.2K → 4.8K customers (Q3 2024 → Q4 2025)

2–9×: Broad repeat-use range showing sustained, not novelty, engagement

Post-launch data confirmed the feature had become a daily workflow staple. Consistent scorecard scores paired with faster iteration cycles meant quality and velocity improved together — without adding headcount.
