Overview

Realtime agents are conversational AI agents powered by a large language model (LLM) and an optional knowledge base, presented through an avatar and streamed over WebRTC. This page explains the main components and the available deployment options.

How it works

A conversation flows from the user's speech through speech-to-text (STT), turn detection, and the LLM, which can query an optional knowledge base. The LLM's response is converted to speech (TTS), and that speech drives the avatar. The diagram below shows the pipeline; the avatar output is streamed back to the user over WebRTC.

flowchart LR
    User(["fa:fa-user User"]) --> STT(["fa:fa-microphone STT"])
    STT --> Turn(["fa:fa-exchange-alt Turn Detection"])
    Turn --> LLM(["fa:fa-brain LLM"])
    LLM --> TTS(["fa:fa-volume-up TTS"])
    LLM -.-> Knowledge(["fa:fa-book Knowledge"])
    Knowledge -.-> LLM
    TTS --> Avatar(["fa:fa-circle-user Avatar"])
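
The stage order in the diagram can be sketched as a chain of function calls. This is only an illustration under stated assumptions: every type and function name below is hypothetical and is not part of any D-ID SDK, the stages are synchronous toy stubs (real stages would be asynchronous and streaming), and "audio" is represented as plain text.

```typescript
// Hypothetical types and stubs: none of these names come from the D-ID SDK.
type Audio = { kind: "audio"; text: string };   // stand-in for raw audio
type Frame = { kind: "avatar"; speech: Audio }; // stand-in for rendered avatar output

interface Pipeline {
  stt(input: Audio): string;                      // speech to text
  turnComplete(transcript: string): boolean;      // turn detection
  retrieve?(query: string): string[];             // optional knowledge base
  llm(prompt: string, context: string[]): string; // response generation
  tts(text: string): Audio;                       // text to speech
  avatar(speech: Audio): Frame;                   // avatar rendering
}

// Run one user utterance through the stages in the order the diagram shows.
function handleUtterance(p: Pipeline, input: Audio): Frame | null {
  const transcript = p.stt(input);
  if (!p.turnComplete(transcript)) return null; // user is still speaking
  const context = p.retrieve ? p.retrieve(transcript) : [];
  const reply = p.llm(transcript, context);
  return p.avatar(p.tts(reply));
}

// Toy stage implementations so the sketch runs end to end.
const demo: Pipeline = {
  stt: (a) => a.text,
  turnComplete: (t) => /[.?!]$/.test(t.trim()),
  retrieve: (q) => (q.toLowerCase().includes("hours") ? ["Open 9-5 on weekdays."] : []),
  llm: (prompt, ctx) => (ctx.length > 0 ? ctx.join(" ") : `You said: ${prompt}`),
  tts: (text) => ({ kind: "audio", text }),
  avatar: (speech) => ({ kind: "avatar", speech }),
};

const frame = handleUtterance(demo, { kind: "audio", text: "What are your hours?" });
console.log(frame?.speech.text);
```

Because `retrieve` is optional in the interface, a pipeline without a knowledge base simply passes an empty context to the LLM stage, mirroring the dashed edges in the diagram.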

Components

| Component | Functionality | Optional | Providers |
| --- | --- | --- | --- |
| Speech to Text (STT) | Transcribes user speech to text | No | D-ID |
| Turn Detection | Detects when the user has finished speaking; manages turn-taking | No | D-ID |
| Large Language Model (LLM) | Generates the agent's responses | Yes | OpenAI, Google |
| Knowledge | Knowledge base (RAG) for contextual answers | Yes | D-ID |
| Text to Speech (TTS) | Converts the agent's text response to speech | Yes | ElevenLabs or Azure |
| Avatar | Renders the speaking avatar to the user | No | D-ID |

Deployment options

Embed with Client Keys (backendless)

Embed the prebuilt Agent UI widget in your site using client keys. No backend is required; D-ID handles authentication via the client keys.

Backend session and token

Create a session on your backend (e.g. with the Agent Sessions API), obtain a token, and pass it to the client. The client then uses the token to connect to the stream.
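
The division of responsibilities in this option can be sketched as follows. This is an in-memory simulation, not the Agent Sessions API: `createSession`, `connectToStream`, the `Session` shape, and the token format are all hypothetical, and the sketch assumes the backend holds a secret API key that is never sent to the browser.

```typescript
// Hypothetical sketch: these functions simulate the flow in memory and are
// not the Agent Sessions API.
interface Session {
  id: string;
  token: string;
  agentId: string;
}

const sessions = new Map<string, Session>(); // stands in for the service's session state
let counter = 0;

// Backend side: runs with the secret API key (assumed), which stays on the server.
function createSession(apiKey: string, agentId: string): Session {
  if (apiKey.length === 0) throw new Error("missing API key");
  counter += 1;
  const session: Session = { id: `sess_${counter}`, token: `tok_${counter}`, agentId };
  sessions.set(session.token, session);
  return session;
}

// Client side: receives only the token and uses it to connect to the stream.
function connectToStream(token: string): string {
  const session = sessions.get(token);
  if (!session) throw new Error("invalid token");
  return `connected to ${session.agentId} via ${session.id}`;
}

const { token } = createSession("secret-api-key", "agent_demo");
console.log(connectToStream(token));
```

The point of the pattern is that the client only ever handles the token; anything needed to create sessions stays on the backend.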

SDK and custom UI

Use the SDK to build your own client UI and control the full experience (layout, UX, integrations).

FAQ