Core Metrics for Effective AI Product Management: Classical ML & LLMs

Elen Gabrielyan
5 min read · Dec 7, 2023

Introduction

Whether the heart of a product is powered by AI or not, the essence of Product Management remains the same. Yet when technology, specifically Artificial Intelligence (AI) and Machine Learning (ML), takes center stage, it brings unique considerations to a product manager's role. This article serves as a guide, outlining the metrics that matter most for product managers navigating AI- and ML-centric products. The goal is to highlight the aspects of metrics that are especially relevant to AI/ML products; I'll only briefly touch on metrics that apply to products more generally.
To get on the same page, let's briefly cover a few crucial topics before moving on to the product metrics themselves. These topics matter because they directly influence which metrics are significant and which are less so:

  • The role of AI/ML in the product

The function of AI/ML varies, from taking the spotlight as the "main character" to quietly working in the background as a "helper." These roles are diverse, and they shape products differently depending on how the technology is used and what purpose it serves.

AI/ML Usage and Role in Products

1. AI/ML as the Star (Core Product)

Some products are all about AI/ML. Krisp.ai (a voice productivity solution) is an example where ML is the core of the product: the majority of benefits users experience, from the cutting-edge transcription technology to the exceptional noise cancellation feature, are directly fueled by ML models.

2. AI/ML as the Quiet Helper (Enabling Element)

Sometimes, AI/ML operates behind the scenes as a quiet helper. Take Amazon or Netflix, for example. They leverage AI/ML to observe your preferences, predict your potential interests, and suggest content accordingly. Even though you don't see the AI, it's there.

3. Both in One (Hybrid Approaches)

Check out Spotify. AI/ML creates playlists and suggests music (the star), while quietly learning your habits for a seamless experience (the helper). It’s a blend of personalized recommendations and intuitive support.

  • The difference between Classical ML Models and LLMs.

There may be numerous differences among ML/AI models, including factors like model size, operational environment (on-device, on-cloud, etc.), and the diversity of input and output types. However, within the scope of this article, let’s specifically explore only the key differences between Classical ML Models and LLMs.

High-level diagram of the difference between classical ML models and LLMs

1. Classical ML Models

Classical ML models, driven by traditional algorithms, excel in precision for well-defined tasks, offering optimal performance in scenarios where clear input-output relationships are crucial. For instance, an ML model can identify a meeting's type based on the data it was trained on: it takes the meeting details as input and outputs the categorized type (or just a binary label to detect a specific type, such as "it's a 1:1 or it's not").
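To make this concrete, here is a minimal sketch of such a classifier in scikit-learn. The training titles, labels, and feature choice are illustrative assumptions, not Krisp's actual model:

```python
# A minimal sketch of classical ML meeting-type classification,
# trained on a few hypothetical labeled meeting titles.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training examples: meeting title -> meeting type.
titles = [
    "Weekly 1:1 with Anna",
    "Sprint planning - Q4 roadmap",
    "All-hands: company update",
    "1:1 sync with manager",
    "Daily standup",
]
labels = ["one_on_one", "planning", "all_hands", "one_on_one", "standup"]

# TF-IDF features + logistic regression: a clear input -> output mapping.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(titles, labels)

print(model.predict(["Monthly 1:1 with Sam"]))  # e.g. ['one_on_one']
```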

2. Large Language Models

In contrast, LLMs like GPT-4 excel at understanding and producing human-like language, making them handy for tasks that require comprehending text and generating new text from it. In the context of meeting type classification, an LLM can take the meeting details along with a prompt as input and, without any explicit training for this specific task, output the categorized type.
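Here is a hedged sketch of that zero-shot approach using the OpenAI Python SDK. The prompt wording and label set are illustrative assumptions; a production system would add prompt iteration and output validation:

```python
# A sketch of zero-shot meeting-type classification with an LLM.
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

meeting_details = "Title: Weekly sync with Anna. Attendees: 2. Duration: 30 min."

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {
            "role": "user",
            "content": (
                "Classify this meeting as one of: one_on_one, planning, "
                "all_hands, standup. Reply with the label only.\n\n"
                + meeting_details
            ),
        }
    ],
)

print(response.choices[0].message.content)  # e.g. "one_on_one"
```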

Product Metrics for AI/ML Products

Welcome to the heart of our discussion: product metrics for AI/ML products. The examples that follow draw on the concepts covered earlier. Congratulations on reaching this point!

Understanding System Health Across Model Types

Now, let’s delve into the maintenance of our AI/ML models, exploring the metrics that demand monitoring and consideration. All forthcoming examples will be illustrated through Krisp’s product features: transcript creation and meeting notes generation. To provide context, our in-house ML model operates on the device for transcript creation, while a third-party Large Language Model (LLM) is employed for generating meeting notes.

Latency and Throughput Metrics

Achieving optimal user experiences requires a good balance between quick responses and high processing capacity.

Example: For meeting notes generation, we should consider both latency and throughput (a measurement sketch follows the definitions below).

— Latency
Latency refers to the time it takes for a system to respond to a request. In the context of meeting notes generation, it measures how quickly the model can process and provide an output for a given input transcript.

— Throughput
Throughput measures the rate at which a system (for example, GPT-3) can process requests within a given time frame. In meeting notes generation, it assesses the model's capacity to handle a specific volume of requests efficiently.
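As a rough illustration, both can be measured with simple timers around the model call. `generate_notes` below is a hypothetical stand-in for the real request:

```python
# A minimal sketch of measuring latency and throughput for notes generation.
import time

def generate_notes(transcript: str) -> str:
    time.sleep(0.1)  # placeholder for the actual LLM request
    return "summary..."

transcripts = ["meeting 1 ...", "meeting 2 ...", "meeting 3 ..."]

latencies = []
start = time.perf_counter()
for t in transcripts:
    t0 = time.perf_counter()
    generate_notes(t)
    latencies.append(time.perf_counter() - t0)  # per-request latency
elapsed = time.perf_counter() - start

print(f"avg latency: {sum(latencies) / len(latencies):.3f}s")
print(f"throughput:  {len(transcripts) / elapsed:.2f} requests/s")
```

In practice you would track latency percentiles (p50, p95) rather than a single average, since tail latency often drives perceived responsiveness.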

Error Rates

Understanding and managing error rates is crucial for ensuring accurate and reliable outputs.

Example: When evaluating the accuracy of transcription technology, key metrics include Word Error Rate (WER) and Diarization Error Rate (DER), among others. These metrics provide insight into the precision and correctness of the transcription model's output.

— Word Error Rate (WER)
Measures the proportion of word-level errors (substitutions, deletions, and insertions) in the transcribed output relative to the reference transcript; a worked sketch follows this list.

— Diarization Error Rate (DER)
Evaluates the accuracy of speaker diarization, measuring errors in attributing spoken words to the correct speaker.

— Other Error Metrics
Depending on the specific context, additional error rates such as Sentence Error Rate (SER) and Overall Error Rate (OER) can also provide valuable insights into the model’s performance.
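To ground the most common of these, here is a minimal sketch of WER as word-level edit distance; libraries such as jiwer implement the same idea for production use:

```python
# A minimal sketch of Word Error Rate: word-level edit distance
# (substitutions + deletions + insertions) divided by reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i ref words into the first j hyp words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("let us meet at noon", "let us meet at new moon"))  # 0.4
```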

Text Generation Metrics

— Number of Generated Tokens (Words, Sentences, etc.)
The number of tokens the LLM produces as output. Monitoring this helps keep content length, structure, and readability aligned with user expectations.
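As an illustration, token counts for OpenAI-family models can be obtained with the tiktoken library; the example text below is made up:

```python
# A minimal sketch of counting generated tokens with tiktoken
# (assumes the output came from an OpenAI-family model).
import tiktoken

generated_notes = "Action items: 1. Send the Q4 roadmap. 2. Book a follow-up."

enc = tiktoken.encoding_for_model("gpt-4")
token_count = len(enc.encode(generated_notes))

print(f"{token_count} tokens, {len(generated_notes.split())} words")
```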

AI Proxy Metrics for Model Performance

AI proxy metrics capture how users interact with AI/ML models and how satisfied they are with those interactions. While these metrics don't directly assess the internal workings of the model, they provide valuable insights into user engagement, feature adoption, and operational efficiency (a short computation sketch follows the lists below).

User Engagement Metrics

— Click-through Rates (CTR)
The percentage of users who click on AI-generated meeting notes, indicating user interest in the generated content.

— Time Spent Reviewing Meeting Notes
The duration users engage with the AI-generated meeting notes, reflecting the level of interest and involvement in the content.

Feature Adoption and Satisfaction

— Adoption Rates
The proportion of users actively using the Meeting Notes Generation feature within the product.

— User Satisfaction Surveys
Feedback collected from users about their satisfaction with the AI-generated meeting notes.
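As a small illustration, here is how CTR and adoption rate might be computed from raw event counts; the event names and numbers are hypothetical, not Krisp telemetry:

```python
# A hedged sketch of computing CTR and adoption rate from event counts.
events = {
    "notes_shown":   1200,  # users who were shown AI-generated notes
    "notes_clicked":  540,  # users who clicked into the notes
    "active_users":  5000,  # all active users of the product
    "feature_users": 1800,  # users who used notes generation at least once
}

ctr = events["notes_clicked"] / events["notes_shown"]
adoption = events["feature_users"] / events["active_users"]

print(f"CTR: {ctr:.1%}")                 # CTR: 45.0%
print(f"Adoption rate: {adoption:.1%}")  # Adoption rate: 36.0%
```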

In conclusion, successful AI product management relies on tracking key areas such as system health, user engagement, feature adoption, and continuous learning. Metrics like latency, error rates, and user satisfaction offer crucial insights. Other metric types may also be relevant, and a comprehensive understanding of these metrics is vital for delivering a seamless and valuable user experience.
