One model handles both speech-to-text and text-to-speech, so no separate ASR or TTS services are needed. A specialist 350M-parameter model classifies intent in 15 ms.
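As a rough illustration of that layout, here is a minimal Python sketch of one conversational turn. Everything named in it is an assumption made for the example: the `SpeechModel` and `IntentClassifier` interfaces, the `VoicePipeline` class, and the `EchoSpeech` and `KeywordIntents` stand-ins are hypothetical and do not reflect any real API. The stand-ins do no actual speech processing; they exist only so the sketch runs.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Protocol


class SpeechModel(Protocol):
    """Hypothetical interface: a single model covering both directions."""

    def transcribe(self, audio: bytes) -> str: ...
    def synthesize(self, text: str) -> bytes: ...


class IntentClassifier(Protocol):
    """Hypothetical interface for the small specialist intent model."""

    def classify(self, text: str) -> str: ...


@dataclass
class VoicePipeline:
    speech: SpeechModel
    intents: IntentClassifier
    handlers: Dict[str, Callable[[str], str]]

    def handle_turn(self, audio: bytes) -> bytes:
        # Speech-to-text with the shared speech model.
        text = self.speech.transcribe(audio)
        # Intent classification with the specialist model.
        intent = self.intents.classify(text)
        # Route to a handler, falling back to a default reply.
        handler = self.handlers.get(
            intent, lambda _text: "Sorry, I did not catch that."
        )
        reply = handler(text)
        # Text-to-speech with the same speech model, no separate TTS service.
        return self.speech.synthesize(reply)


# Stand-ins (assumptions) so the sketch runs without any real model.
class EchoSpeech:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")

    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")


class KeywordIntents:
    def classify(self, text: str) -> str:
        return "weather" if "weather" in text.lower() else "unknown"


pipeline = VoicePipeline(
    speech=EchoSpeech(),
    intents=KeywordIntents(),
    handlers={"weather": lambda _text: "It is sunny."},
)
```

Calling `pipeline.handle_turn(audio)` runs one turn end to end. The point of the design is that the same `speech` object serves both ends of the turn, so the only additional component is the intent classifier.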