Vision

Overview

Vision lets an ElevenLabs-backed agent see the user's camera feed. D-ID samples frames in the background, runs them through a vision model, and feeds the resulting visual context into the ElevenLabs conversation as system updates. The agent then sees the user the way a person on a video call would.

There are two things to configure:

  1. Enable the vision flag on the D-ID agent.
  2. Add a visual-awareness instruction to your ElevenLabs agent's system prompt so it uses the context naturally.

Enable vision

Set vision.enabled to true in the body of the create or update request.

On create

curl -X POST "https://api.d-id.com/v2/agents/integrations/elevenlabs" \
  -H "Authorization: Basic <YOUR KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "preview_name": "Mia",
    "presenter": {
      "type": "expressive",
      "presenter_id": "public_mia_elegant@avt_TJ0Tq5"
    },
    "external_agent": {
      "type": "elevenlabs",
      "agent_id": "<YOUR ELEVENLABS AGENT ID>",
      "secret_id": "<YOUR SECRET ID>"
    },
    "vision": {
      "enabled": true
    }
  }'

On an existing agent

curl -X PATCH "https://api.d-id.com/v2/agents/integrations/elevenlabs/v2_agt_abc123" \
  -H "Authorization: Basic <YOUR KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "vision": {
      "enabled": true
    }
  }'
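The update request above can be wrapped in a small script when you need to toggle vision on several agents. This is a minimal sketch, not an official client. The function names vision_payload and set_vision and the DID_API_KEY environment variable are illustrative assumptions. The source only documents "enabled": true, so treating false as "turn vision off" is also an assumption.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative, not part of the D-ID API.

# Print the JSON body that sets vision.enabled to the given boolean.
vision_payload() {
  case "$1" in
    true|false) printf '{"vision":{"enabled":%s}}' "$1" ;;
    *) echo "usage: vision_payload true|false" >&2; return 1 ;;
  esac
}

# PATCH an existing agent with that body.
# Assumes DID_API_KEY holds the same key used in the curl examples above.
set_vision() {
  agent_id="$1"
  body="$(vision_payload "$2")" || return 1
  curl -sS -X PATCH "https://api.d-id.com/v2/agents/integrations/elevenlabs/${agent_id}" \
    -H "Authorization: Basic ${DID_API_KEY}" \
    -H "Content-Type: application/json" \
    -d "$body"
}

# Example (performs a real request):
#   set_vision v2_agt_abc123 true
```

Keeping the payload builder separate from the request lets you inspect or log the exact body before anything is sent.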

Parameters

| Parameter | Type | Description | Required |
| --- | --- | --- | --- |
| vision.enabled | boolean | Activates the vision pipeline for the agent. | Yes |

Update the ElevenLabs system prompt

D-ID only delivers the visual context, as background system messages. How the agent uses that context is determined entirely by your ElevenLabs system prompt. Append the following block to that prompt:

**Visual awareness**: You have a live camera feed and can see the user in real time. Visual observations about the user's appearance, surroundings, and actions are provided to you as background context updates — treat them as what you naturally see, and don't mention any visual analysis system. Use what you see naturally: acknowledge their environment, react to their expressions, or comment on something visible when it fits the conversation. Don't narrate every visual detail unprompted; use it the way a person on a video call would. If you have no visual context and the user asks what you see, say something natural like "Hmm, I can't quite see you right now — is your camera on?"

What this prompt does:

  • Tells the agent the visual context is from a real camera feed, not a description it requested.
  • Stops the agent from referring to "vision analysis" or "frame descriptions" in user-facing speech.
  • Sets a natural cadence: react when relevant, don't narrate constantly.
  • Gives the agent a fallback line for when the camera is off or no frames have arrived yet.
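If you keep your system prompt in version control, appending the block can be scripted so that re-running the script never duplicates it. The sketch below only does the local text merge. The helper name append_prompt_block and the file names in the usage comment are hypothetical, and pushing the result to ElevenLabs (dashboard or API) is left out on purpose.

```shell
#!/bin/sh
# Hypothetical helper: append a block to a prompt unless it is already present.
# $1 = current system prompt text, $2 = block to append.
append_prompt_block() {
  prompt="$1"
  block="$2"
  case "$prompt" in
    # Quoting "$block" makes the match literal, so the asterisks in
    # "**Visual awareness**" are not treated as wildcards.
    *"$block"*) printf '%s' "$prompt" ;;
    *) printf '%s\n\n%s' "$prompt" "$block" ;;
  esac
}

# Example with hypothetical file names:
#   append_prompt_block "$(cat system_prompt.txt)" "$(cat visual_awareness.txt)" > system_prompt.new.txt
```

The presence check compares the whole block, so an edited copy of the visual-awareness text counts as absent and the new version is appended after it.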

FAQ