Vision
Overview
Vision lets an ElevenLabs-backed agent see the user's camera feed. D-ID samples frames in the background, runs them through a vision model, and feeds the resulting visual context into the ElevenLabs conversation as system updates. The agent then sees the user the way a person on a video call would.
There are two things to configure:
- Enable the vision flag on the D-ID agent.
- Add a visual-awareness instruction to your ElevenLabs agent's system prompt so it uses the context naturally.
Enable vision
Add vision.enabled to the create or update request.
On create
curl -X POST "https://api.d-id.com/v2/agents/integrations/elevenlabs" \
  -H "Authorization: Basic <YOUR KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "preview_name": "Mia",
    "presenter": {
      "type": "expressive",
      "presenter_id": "public_mia_elegant@avt_TJ0Tq5"
    },
    "external_agent": {
      "type": "elevenlabs",
      "agent_id": "<YOUR ELEVENLABS AGENT ID>",
      "secret_id": "<YOUR SECRET ID>"
    },
    "vision": {
      "enabled": true
    }
  }'
On an existing agent
curl -X PATCH "https://api.d-id.com/v2/agents/integrations/elevenlabs/v2_agt_abc123" \
  -H "Authorization: Basic <YOUR KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "vision": {
      "enabled": true
    }
  }'
Parameters
| Parameter | Type | Description | Required |
|---|---|---|---|
| vision.enabled | boolean | Activates the vision pipeline for the agent. | Yes |
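The two curl calls above can also be prepared from a script. The following is a minimal Python sketch using only the standard library. It builds the requests without sending them. The helper names `build_create_request` and `build_enable_vision_request` are hypothetical; the endpoint paths, headers, and body fields are copied from the curl examples above, and nothing here is taken from an official SDK.

```python
import json
import urllib.request

# Base URL copied from the curl examples in this section.
BASE_URL = "https://api.d-id.com/v2/agents/integrations/elevenlabs"


def _json_request(url, method, api_key, body):
    """Prepare a JSON request with the same headers as the curl examples."""
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method=method,
        headers={
            "Authorization": f"Basic {api_key}",
            "Content-Type": "application/json",
        },
    )


def build_create_request(api_key, preview_name, presenter_id,
                         elevenlabs_agent_id, secret_id):
    """Hypothetical helper: create an agent with vision enabled (POST)."""
    body = {
        "preview_name": preview_name,
        "presenter": {"type": "expressive", "presenter_id": presenter_id},
        "external_agent": {
            "type": "elevenlabs",
            "agent_id": elevenlabs_agent_id,
            "secret_id": secret_id,
        },
        "vision": {"enabled": True},
    }
    return _json_request(BASE_URL, "POST", api_key, body)


def build_enable_vision_request(api_key, agent_id):
    """Hypothetical helper: enable vision on an existing agent (PATCH)."""
    body = {"vision": {"enabled": True}}
    return _json_request(f"{BASE_URL}/{agent_id}", "PATCH", api_key, body)


if __name__ == "__main__":
    req = build_enable_vision_request("<YOUR KEY>", "v2_agt_abc123")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To actually send the request: urllib.request.urlopen(req)
```

Keeping request construction separate from sending makes it easy to inspect the payload before it reaches the API.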
Update the ElevenLabs system prompt
D-ID only delivers the visual context as background system messages; how the agent uses that context is determined entirely by your ElevenLabs system prompt. Append the following block to the system prompt of your ElevenLabs agent:
**Visual awareness**: You have a live camera feed and can see the user in real time. Visual observations about the user's appearance, surroundings, and actions are provided to you as background context updates — treat them as what you naturally see, and don't mention any visual analysis system. Use what you see naturally: acknowledge their environment, react to their expressions, or comment on something visible when it fits the conversation. Don't narrate every visual detail unprompted; use it the way a person on a video call would. If you have no visual context and the user asks what you see, say something natural like "Hmm, I can't quite see you right now — is your camera on?"
What this prompt does:
- Tells the agent the visual context is from a real camera feed, not a description it requested.
- Stops the agent from referring to "vision analysis" or "frame descriptions" in user-facing speech.
- Sets a natural cadence: react when relevant, don't narrate constantly.
- Gives the agent a fallback line for when the camera is off or no frames have arrived yet.
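If you keep the ElevenLabs system prompt as text in your own tooling, the append step can be scripted. The following is a minimal Python sketch; `append_visual_awareness` and `VISUAL_AWARENESS_MARKER` are hypothetical names, and the function only manipulates strings. Saving the result back to ElevenLabs is outside its scope. It assumes the block begins with the "Visual awareness" heading shown above and skips the append when that heading is already present, so running it twice does not duplicate the instruction.

```python
# Assumption: the block starts with this heading, as in the text above.
VISUAL_AWARENESS_MARKER = "**Visual awareness**"


def append_visual_awareness(system_prompt: str, block: str) -> str:
    """Return system_prompt with block appended once, after a blank line."""
    block = block.strip()
    if not block.startswith(VISUAL_AWARENESS_MARKER):
        raise ValueError("block should start with the Visual awareness heading")
    if VISUAL_AWARENESS_MARKER in system_prompt:
        # Already present: return the prompt unchanged.
        return system_prompt
    base = system_prompt.rstrip()
    separator = "\n\n" if base else ""
    return f"{base}{separator}{block}\n"


if __name__ == "__main__":
    # Placeholder text; paste the full block from this page in real use.
    block = "**Visual awareness**: <full block from this page>"
    print(append_visual_awareness("You are Mia, a friendly assistant.", block))
```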
