Overview
How Realtime agents work and how to deploy them
Realtime agents are conversational AI powered by a large language model and an optional knowledge base, delivered through an avatar and streamed via WebRTC. This page explains the main components and how to deploy agents.
How it works
Conversation flows from user speech through speech-to-text, turn detection, and the LLM, which can query an optional knowledge base. The model's response is converted to speech and rendered by the avatar. The diagram below shows the pipeline; the avatar output is streamed back to the user over the same WebRTC connection.
```mermaid
flowchart LR
    User(["fa:fa-user User"]) --> STT(["fa:fa-microphone STT"])
    STT --> Turn(["fa:fa-exchange-alt Turn Detection"])
    Turn --> LLM(["fa:fa-brain LLM"])
    LLM --> TTS(["fa:fa-volume-up TTS"])
    LLM -.-> Knowledge(["fa:fa-book Knowledge"])
    Knowledge -.-> LLM
    TTS --> Avatar(["fa:fa-circle-user Avatar"])
```
Components
| COMPONENT | FUNCTIONALITY | OPTIONAL | PROVIDERS |
|---|---|---|---|
| Speech to Text (STT) | Transcribes user speech to text | No | D-ID |
| Turn Detection | Detects when the user has finished speaking; manages turn-taking | No | D-ID |
| Large Language Model (LLM) | Generates the agent's responses | Yes | OpenAI |
| Knowledge | Knowledge base (RAG) for contextual answers | Yes | D-ID |
| Text to Speech (TTS) | Converts the agent's text response to speech | Yes | ElevenLabs or Azure |
| Avatar | Renders the speaking avatar to the user | No | D-ID |
Deployment options
**Agent UI (embedded widget):** Use the prebuilt Agent UI with client keys; no backend is required. Embed the widget in your site, and D-ID handles authentication via the client key.
**Backend-managed sessions:** Create a session from your backend (e.g. via the Agent Sessions API), obtain a token, and pass it to the client. The client uses that token, not your API key, to connect to the stream.
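The exact request and response shapes come from the Agent Sessions API reference; the sketch below is only an illustration, assuming a hypothetical `POST /agents/{agentId}/sessions` endpoint that returns a short-lived token your backend forwards to the browser.

```typescript
// Minimal sketch of a backend token exchange. The endpoint path, auth scheme,
// and response fields are assumptions; consult the Agent Sessions API reference
// for the actual contract.
const API_BASE = "https://api.d-id.com";        // assumed base URL
const API_KEY = process.env.DID_API_KEY ?? "";  // server-side secret, never shipped to the client

interface SessionTokenResponse {
  token: string;       // assumed field: short-lived token the client uses to connect
  session_id: string;  // assumed field
}

// Called from your own backend route (e.g. GET /api/agent-session) when a page loads.
export async function createAgentSession(agentId: string): Promise<SessionTokenResponse> {
  const res = await fetch(`${API_BASE}/agents/${agentId}/sessions`, {
    method: "POST",
    headers: {
      Authorization: `Basic ${API_KEY}`, // auth header format is an assumption; use the scheme from your credentials
      "Content-Type": "application/json",
    },
    body: JSON.stringify({}),
  });
  if (!res.ok) {
    throw new Error(`Session creation failed: ${res.status} ${await res.text()}`);
  }
  return (await res.json()) as SessionTokenResponse;
}
```

The browser then connects with the returned token; your D-ID API key never leaves the server.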
**Custom UI with the SDK:** Use the SDK to build your own client UI and control the full experience (layout, UX, integrations).
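A rough sketch of that approach is below. The package name, factory function, option fields, callback names, and methods are assumptions for illustration rather than the SDK's confirmed surface; consult the SDK reference for the actual API.

```typescript
// Sketch of a custom client built on the SDK. All names here are illustrative assumptions.
import { createAgentManager } from "@d-id/client-sdk"; // assumed module and export

async function startCustomClient(agentId: string, clientKey: string) {
  const manager = await createAgentManager(agentId, {
    auth: { type: "key", clientKey },   // assumed auth option
    callbacks: {
      // Attach the incoming WebRTC media to your own <video> element.
      onSrcObjectReady: (stream: MediaStream) => {
        const video = document.querySelector<HTMLVideoElement>("#agent-video");
        if (video) video.srcObject = stream;
      },
      // Render transcripts or agent messages in your own UI.
      onNewMessage: (messages: unknown[]) => console.log(messages),
    },
  });

  await manager.connect();                                  // assumed: open the stream
  await manager.speak({ type: "text", input: "Hello!" });   // assumed: play a scripted line
}
```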
FAQ
**Are client keys safe to expose in the browser?** Client keys can only create sessions, and only on the domains and for the agents you allow. They do not permit other operations (e.g. editing an agent).
**Can I skip the built-in LLM or TTS?** Yes. Simply don't include them in the agent request and send the audio or text chunks directly to us over the WebRTC connection.
**Can I change an agent's configuration after it has been created?** Yes. Update the agent with a PATCH request; new sessions use the updated configuration, while ongoing sessions keep the previous one.
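As a sketch only, assuming a `PATCH /agents/{agentId}` endpoint and an `llm.instructions` field (both illustrative):

```typescript
// Illustrative only: endpoint path, auth scheme, and body fields are assumptions.
async function updateAgentInstructions(agentId: string, instructions: string) {
  const res = await fetch(`https://api.d-id.com/agents/${agentId}`, {
    method: "PATCH",
    headers: {
      Authorization: `Basic ${process.env.DID_API_KEY}`, // assumed auth scheme
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ llm: { instructions } }),     // assumed field shape
  });
  if (!res.ok) throw new Error(`Agent update failed: ${res.status}`);
  // Sessions started after this call use the new configuration;
  // sessions already in progress keep the previous one.
}
```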
**What is the difference between an External Key and a Custom LLM?** External Key: you provide your own API key and D-ID routes requests to OpenAI on your behalf. Custom LLM: you host the model yourself and D-ID sends requests to your endpoint.
**Can I customize the embedded Agent UI?** Yes. You can control both functionality and visual aspects directly in the embed code.
