Praktika delivers ultra-low-latency transcription for global language education with Baseten

<300 ms latency

50% cost savings

company overview

Praktika delivers efficient and engaging learning experiences for millions of students worldwide by bridging the gap between learning apps and human tutors. Their AI avatars teach nine different languages and can instruct in the student’s native language, adapting to each user’s pace, skill level, and learning style.

Praktika delivers personalized learning experiences by combining cutting-edge AI models with a pedagogical foundation designed by its team of innovators.

problems

Praktika’s user experience hinges on its AI avatars acting as close to human tutors as possible. Its users rely on Praktika to create a seamless experience that closely resembles real-life conversations in the language they are practicing. 

To power this experience, Praktika’s inference must be highly performant and reliable across a globally distributed user base. Praktika previously used an inference solution from a cloud vendor, but found it fell short in a few key areas:

  • Latency needed to be lower. Praktika’s previous inference provider was unable to achieve their desired latency of <300 ms for their transcription workload. 

  • Compute wasn’t scalable or flexible enough. Praktika has seen explosive growth and serves millions of users worldwide. They needed a global inference provider that could support autoscaling and the growth of their user base without locking them into inflexible compute reservations.

  • No reliable external inference expertise. Praktika’s previous provider struggled to optimize performance across Praktika’s diverse set of modalities. Continuous iteration and improvement are core to Praktika’s DNA, yet their provider lacked the flexibility to keep pace.

A big portion of our latency was speech-to-text. We wanted to reduce latency without affecting the quality and accuracy of our responses. Beyond that, we also wanted a partner who could help us explore and stay up to date with new model releases and performance optimizations to ensure we always have a superior user experience.
Anton Marin, Co-founder & CTO

solutions

Our forward deployed engineers and model performance team partnered with Praktika to optimize their transcription workload. The Baseten Inference Stack powers the fastest and most accurate Whisper transcription available. We combined our core inference stack with transcription-specific runtime optimizations to reduce latency.

We achieved a number of optimizations for Praktika’s workload:

  • Optimized the transcription runtime to reduce latency. We deployed TensorRT-LLM’s C++ executor and combined it with in-flight batching, which speeds up inference by admitting new requests into the running batch as soon as slots free up (see the sketch after this list). Together these reduced latency considerably.

  • Unlocked elastic, global compute. By utilizing our inference-optimized infrastructure, Praktika gained access to almost limitless compute. This allows them to scale with their viral growth without having to worry about reserving hardware.

  • Deep partnership for all of inference. While the Baseten Inference Stack unlocks performance and infrastructure, our forward deployed engineering team partnered with Praktika to deeply understand their workload and tailor their models to their use case. We continue to work hand in hand with the Praktika team as they push the boundaries of their models.
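
To make in-flight batching concrete, here is a simplified Python sketch of the scheduling idea: the loop admits waiting requests into slots freed by requests that just finished, so no request waits for a full batch to drain. This is an illustration only, not Baseten’s or TensorRT-LLM’s actual scheduler; `run_step` and `request.future` are hypothetical stand-ins.

```python
import asyncio

MAX_BATCH = 8  # illustrative cap on concurrent requests per batch

async def inflight_batching_loop(queue: asyncio.Queue, run_step) -> None:
    """Keep one batch "in flight": requests that finish a decoding step
    exit immediately, and waiting requests are admitted into the freed
    slots, rather than draining the whole batch before accepting new work."""
    active = []  # requests currently being decoded
    while True:
        # Top up the batch with any waiting requests, up to MAX_BATCH.
        while len(active) < MAX_BATCH and not queue.empty():
            active.append(queue.get_nowait())
        if not active:
            # Nothing in flight: block until the next request arrives.
            active.append(await queue.get())
        # Run one decoding step over the whole batch. run_step (hypothetical)
        # returns the requests that completed this step and those still going,
        # so completed requests free their slots before the next iteration.
        finished, active = await run_step(active)
        for request, transcript in finished:
            request.future.set_result(transcript)  # deliver the transcript
```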

Our inference stack allowed Praktika’s engineers to tune their autoscaling settings and runtime for the lowest latency and greatest efficiency at scale. These optimizations cut latency on Praktika’s transcription workload from the previous 1,000-1,500 ms to under 300 ms.
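
As a rough illustration of that tuning, the sketch below patches a deployment’s autoscaling settings over Baseten’s REST API. The endpoint path, field names, and values are assumptions for illustration; consult the current Baseten API docs for the real schema.

```python
import os
import requests

# Hypothetical tuning call: the endpoint and fields are illustrative,
# and the values are examples only, not Praktika's configuration.
resp = requests.patch(
    "https://api.baseten.co/v1/models/MODEL_ID/deployments/production/autoscaling_settings",
    headers={"Authorization": f"Api-Key {os.environ['BASETEN_API_KEY']}"},
    json={
        "min_replica": 2,          # keep warm replicas so p50 latency stays low
        "max_replica": 40,         # headroom for viral traffic spikes
        "autoscaling_window": 60,  # seconds of traffic considered when scaling
        "scale_down_delay": 900,   # seconds to wait before releasing replicas
        "concurrency_target": 4,   # in-flight requests per replica before scaling out
    },
    timeout=30,
)
resp.raise_for_status()
```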

We are very satisfied with the latency we achieved. The user experience feels so much faster. It’s so fast we actually don’t need as many replicas, which helps keep costs down.
Anton Marin, Co-founder & CTO

results

After validating the optimized Whisper performance, Praktika seamlessly shifted their traffic to Baseten with the help of our OpenAI-compatible APIs. While our flexible tech stack brought performance and infrastructure improvements, our deep and transparent partnership gave Praktika confidence that Baseten would be a strong infrastructure provider.
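
For a sense of how light that migration is, the sketch below points the standard OpenAI Python SDK at a Baseten-hosted Whisper deployment. The base URL and model identifier are placeholder assumptions, not Praktika’s actual configuration.

```python
import os
from openai import OpenAI

# Point the standard OpenAI client at the Baseten-hosted deployment.
# The base_url and model name below are placeholders.
client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-MODEL_ID.api.baseten.co/v1",
)

with open("student_audio.wav", "rb") as audio:
    transcript = client.audio.transcriptions.create(
        model="whisper",  # placeholder identifier for the deployed model
        file=audio,
    )

print(transcript.text)
```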

Their optimized Whisper deployment delivers:

  • <300 ms p50 latency

  • 50% cost savings

This performance improvement means Praktika users can enjoy more seamless and lifelike conversations with their Gen AI language tutors.

Baseten has been almost perfect as a technology provider so far. It’s been a pleasure working with them.
Anton Marin, Co-founder & CTO

what's next

Our team continues to partner with Praktika to explore new modalities and models. This deep partnership ensures Praktika can respond in real time to the needs of their users and the rapid pace of the AI market. Praktika recently launched 7 new languages. Check out Praktika to experience real-time voice AI live.