Overview
Realtime agents are conversational AI powered by a large language model and an optional knowledge base, delivered through an avatar and streamed via WebRTC. This page explains the main components and how to deploy agents.
How it works
A conversation flows from the user's speech through speech-to-text (STT), turn detection, and the LLM, which can query an optional knowledge base. The LLM's response is converted to speech by text-to-speech (TTS), and that speech drives the avatar. The diagram below shows the pipeline; the avatar output is streamed back to the user over WebRTC.
flowchart LR
User(["fa:fa-user User"]) --> STT(["fa:fa-microphone STT"])
STT --> Turn(["fa:fa-exchange-alt Turn Detection"])
Turn --> LLM(["fa:fa-brain LLM"])
LLM --> TTS(["fa:fa-volume-up TTS"])
LLM -.-> Knowledge(["fa:fa-book Knowledge"])
Knowledge -.-> LLM
TTS --> Avatar(["fa:fa-circle-user Avatar"])
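The same pipeline can be read as a chain of function calls. The sketch below is illustrative only: `Stages`, `runTurn`, and every stage name are hypothetical and are not part of the D-ID API. Each stage is a stand-in that passes strings along, so the order of operations and the optional knowledge lookup are easy to follow.

```typescript
// Hypothetical stage signatures for illustration; not the D-ID API.
type Stages = {
  stt: (audio: string) => string;                 // user speech -> transcript
  turnComplete: (transcript: string) => boolean;  // has the user finished speaking?
  retrieve?: (query: string) => string[];         // optional knowledge lookup
  llm: (prompt: string, context: string[]) => string;
  tts: (text: string) => string;                  // reply text -> speech
  avatar: (speech: string) => string;             // speech -> rendered avatar output
};

// Runs one conversational turn. Returns null while the user is still speaking.
function runTurn(stages: Stages, audio: string): string | null {
  const transcript = stages.stt(audio);
  if (!stages.turnComplete(transcript)) return null;

  // The knowledge base is optional: without it the LLM gets an empty context.
  const context = stages.retrieve ? stages.retrieve(transcript) : [];
  const reply = stages.llm(transcript, context);
  return stages.avatar(stages.tts(reply));
}

// Stub stages so the flow can be exercised without any real services.
const demo: Stages = {
  stt: (a) => a.replace("audio:", ""),
  turnComplete: (t) => t.endsWith("?") || t.endsWith("."),
  retrieve: (q) => (q.includes("hours") ? ["Open 9-5"] : []),
  llm: (_prompt, ctx) => (ctx.length > 0 ? `Answer: ${ctx[0]}` : "Answer: unknown"),
  tts: (t) => `speech(${t})`,
  avatar: (s) => `video(${s})`,
};
```

Calling `runTurn(demo, "audio:What are your hours?")` walks every stage in the order shown in the diagram, while an unfinished utterance such as `"audio:What are"` stops at turn detection.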
Components
| COMPONENT | FUNCTIONALITY | OPTIONAL | PROVIDERS |
|---|---|---|---|
| Speech to Text (STT) | Transcribes user speech to text | No | D-ID |
| Turn Detection | Detects when the user has finished speaking; manages turn-taking | No | D-ID |
| Large Language Model (LLM) | Generates the agent's responses | Yes | OpenAI, Google |
| Knowledge | Knowledge base (RAG) for contextual answers | Yes | D-ID |
| Text to Speech (TTS) | Converts the agent's text response to speech | Yes | ElevenLabs, Azure |
| Avatar | Renders the speaking avatar to the user | No | D-ID |
Deployment options
Embed the prebuilt Agent UI widget in your site. No backend is required: D-ID handles authentication via client keys.
Create a session on your backend (for example, with the Agent Sessions API), obtain a token, and pass it to the client. The client then uses the token to connect to the stream.
Use the SDK to build your own client UI and control the full experience (layout, UX, integrations).
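For the backend-issued token option above, the handoff might look like the following minimal sketch. All names and shapes here (`SessionResponse`, `CreateSession`, `issueClientToken`, `connectWithToken`) are assumptions made for illustration; the actual Agent Sessions API request and response fields may differ. The session call and the stream connection are injected as functions, so nothing in the sketch depends on a real endpoint.

```typescript
// Hypothetical response shape; the real Agent Sessions API fields may differ.
interface SessionResponse {
  sessionId: string;
  token: string;
}

// Injected by the backend; this is where server-side credentials would be used.
type CreateSession = (agentId: string) => Promise<SessionResponse>;

// Backend side: create the session and hand only the token to the client.
async function issueClientToken(
  createSession: CreateSession,
  agentId: string
): Promise<{ token: string }> {
  const session = await createSession(agentId);
  return { token: session.token };
}

// Client side: use the token received from the backend to connect to the stream.
// `connect` is a placeholder for whatever the client uses to open the stream.
async function connectWithToken(
  token: string,
  connect: (token: string) => Promise<string>
): Promise<string> {
  if (!token) throw new Error("missing session token");
  return connect(token);
}
```

Keeping session creation on the backend means the client only ever handles the token it needs to connect.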
