Overview

Realtime agents are conversational AI powered by a large language model and an optional knowledge base, delivered through an avatar and streamed via WebRTC. This page explains the main components and how to deploy agents.

How it works

A conversation flows from user speech through speech-to-text (STT), turn detection, and the LLM, which can query an optional knowledge base. The LLM's response is converted to speech by text-to-speech (TTS), and that speech drives the avatar. The diagram below shows the pipeline; the avatar output is streamed back to the user over WebRTC.

flowchart LR
    User(["fa:fa-user User"]) --> STT(["fa:fa-microphone STT"])
    STT --> Turn(["fa:fa-exchange-alt Turn Detection"])
    Turn --> LLM(["fa:fa-brain LLM"])
    LLM --> TTS(["fa:fa-volume-up TTS"])
    LLM -.-> Knowledge(["fa:fa-book Knowledge"])
    Knowledge -.-> LLM
    TTS --> Avatar(["fa:fa-circle-user Avatar"])
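
To make the order of the stages concrete, here is a minimal Python sketch of the same pipeline. It is an illustration only: the `Pipeline` class and every stage function are hypothetical stand-ins, not part of any D-ID SDK, and the real service runs these stages on streaming audio rather than on whole utterances.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical type alias: a knowledge lookup maps a query to retrieved context.
Lookup = Callable[[str], str]


@dataclass
class Pipeline:
    """Toy model of the STT -> Turn Detection -> LLM -> TTS -> Avatar flow."""
    stt: Callable[[bytes], str]
    turn_complete: Callable[[str], bool]
    llm: Callable[[str, Optional[Lookup]], str]
    tts: Callable[[str], bytes]
    avatar: Callable[[bytes], str]
    knowledge: Optional[Lookup] = None  # optional, like the dotted edge in the diagram

    def handle(self, audio: bytes) -> Optional[str]:
        text = self.stt(audio)
        if not self.turn_complete(text):
            return None  # the user has not finished speaking yet
        # The LLM receives the lookup function and decides whether to query it.
        reply = self.llm(text, self.knowledge)
        speech = self.tts(reply)
        return self.avatar(speech)


# Stub stages, for demonstration only.
demo = Pipeline(
    stt=lambda audio: audio.decode("utf-8"),
    turn_complete=lambda text: text.endswith((".", "?")),
    llm=lambda text, lookup: f"echo: {text}" + (f" [{lookup(text)}]" if lookup else ""),
    tts=lambda reply: reply.encode("utf-8"),
    avatar=lambda speech: f"<video frames for {len(speech)} bytes of audio>",
)
```

With these stubs, `demo.handle(b"hello")` returns `None` because the turn is not complete, while `demo.handle(b"hello?")` runs all the way through to the avatar stage.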

Components

COMPONENT	FUNCTIONALITY	OPTIONAL	PROVIDERS
Speech to Text (STT)	Transcribes user speech to text	No	D-ID
Turn Detection	Detects when the user has finished speaking; manages turn-taking	No	D-ID
Large Language Model (LLM)	Generates the agent's responses	Yes	OpenAI
Knowledge	Knowledge base (RAG) for contextual answers	Yes	D-ID
Text to Speech (TTS)	Converts the agent's text response to speech	Yes	ElevenLabs or Azure
Avatar	Renders the speaking avatar to the user	No	D-ID

Deployment options

Embed with Client Keys (backendless)

Embed the prebuilt Agent UI widget in your site and authenticate with client keys. No backend is required; D-ID handles authentication through the client keys.

Backend session and token

Create a session on your backend (e.g. with the Agent Sessions API), obtain a token, and pass it to the client. The client then uses the token to connect to the stream.
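
The division of responsibility in this option can be sketched as follows. This is a simulation under stated assumptions, not the real API: `FakeSessionsApi`, its `create_session` method, and the response fields are all hypothetical, and a real backend would make an HTTPS call instead. What the sketch shows is that the secret API key stays on the server and only the session token reaches the client.

```python
import secrets
from dataclasses import dataclass


@dataclass
class FakeSessionsApi:
    """Hypothetical in-memory stand-in for a sessions API (not the D-ID interface)."""
    api_key: str

    def create_session(self, presented_key: str, agent_id: str) -> dict:
        if presented_key != self.api_key:
            raise PermissionError("invalid API key")
        return {
            "session_id": secrets.token_hex(8),
            "agent_id": agent_id,
            "token": secrets.token_urlsafe(32),  # short-lived credential for the client
        }


def backend_start_session(api: FakeSessionsApi, server_api_key: str, agent_id: str) -> dict:
    """Runs on your backend: creates the session and forwards only what the client needs."""
    session = api.create_session(server_api_key, agent_id)
    # The API key is never included in the payload sent to the browser.
    return {"session_id": session["session_id"], "token": session["token"]}
```

The client would then use the returned token to open the stream; that connection step depends on the SDK and is not modeled here.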

SDK and custom UI

Use the SDK to build your own client UI and control the full experience (layout, UX, integrations).

FAQ

Client keys can only create sessions, and only from the domains and for the agents you allow. They do not permit other operations, such as editing.
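
The scoping rule can be pictured as an allowlist check. The sketch below is a hypothetical model of that rule, assuming the key carries a set of allowed domains and a set of allowed agent IDs; the `ClientKey` type, the `is_allowed` function, and the action names are invented for illustration and do not reflect how D-ID implements the check.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class ClientKey:
    """Hypothetical representation of a client key's scope."""
    allowed_domains: frozenset
    allowed_agents: frozenset


def is_allowed(key: ClientKey, origin: str, agent_id: str, action: str) -> bool:
    # Client keys only ever create sessions; every other action is refused.
    if action != "create_session":
        return False
    host = urlsplit(origin).hostname  # None if the origin has no scheme/host
    return host in key.allowed_domains and agent_id in key.allowed_agents


key = ClientKey(
    allowed_domains=frozenset({"example.com"}),
    allowed_agents=frozenset({"agent-1"}),
)
```

Here a session request from `https://example.com` for `agent-1` passes, while the same request from another origin, for another agent, or for an `edit_agent` action is rejected.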

Yes. Simply omit them from the agent request and send the audio or text chunks directly to D-ID over the WebRTC connection.

Yes. Update the agent with a PATCH request. New sessions use the updated configuration, while ongoing sessions keep the previous one.
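
One way to picture this behavior is that each session takes a snapshot of the agent configuration when it starts. The following sketch models that idea in memory; `AgentStore` and its methods are hypothetical, and the shallow-merge semantics of `patch` are an assumption rather than documented API behavior.

```python
import copy


class AgentStore:
    """Hypothetical in-memory model of an agent whose config can be patched."""

    def __init__(self, config: dict):
        self._config = dict(config)

    def patch(self, changes: dict) -> None:
        # Assumed PATCH semantics: shallow-merge the changed fields.
        self._config.update(changes)

    def start_session(self) -> dict:
        # Each session gets its own copy, so later patches do not affect it.
        return copy.deepcopy(self._config)


store = AgentStore({"voice": "voice-a", "greeting": "Hello"})
ongoing = store.start_session()       # started before the update
store.patch({"voice": "voice-b"})     # update the agent
fresh = store.start_session()         # started after the update
```

After this runs, `ongoing["voice"]` is still `"voice-a"` and `fresh["voice"]` is `"voice-b"`, while both keep the unchanged `greeting`.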

External Key: you supply your own API key, and D-ID routes requests to OpenAI. Custom LLM: you host the model, and D-ID sends requests to your endpoint.

Yes. You can control both the functionality and the visual appearance directly in the embed code.