Inquiry About Building a Real-Time AI Co-Host Like in the Demo Video

Dear D-ID Support Team,

I recently watched your demo video (https://m.youtube.com/watch?v=48y1MUiahRA&time_continue=22&embeds_referring_euri=https%3A%2F%2Fdocs.d-id.com%2F) showcasing a real-time AI presenter interacting with a human host during a live session. I’m very impressed by the seamless interaction and fast response times.

I’m currently exploring the possibility of building a similar AI assistant to serve as a co-host during live events or meetings: one that can listen to the human host, process their speech in real time, and respond naturally with synchronized facial animation and voice.

Could you please let me know:
1. Which specific D-ID features or products (e.g., Clips, API, or AI Agents) are used to achieve this kind of interaction?
2. Is real-time voice input and response supported out of the box, or does it require integration with external services (e.g., ASR, LLM, TTS)?
3. What are the recommended system requirements or latency benchmarks for achieving the near real-time response shown in the video?
4. Are there any sample codebases, SDKs, or integration guides available that would help me implement this?
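To make the architecture I have in mind concrete, here is a minimal sketch of the pipeline I am imagining (ASR → LLM → TTS/avatar). Every function below is a hypothetical placeholder of my own invention, not a D-ID API; part of my question is which D-ID product would replace which stage.

```python
# Hypothetical co-host pipeline -- all functions are placeholders,
# not real D-ID (or any vendor) API calls.

def transcribe(audio_chunk: bytes) -> str:
    # Stage 1: ASR -- convert the human host's speech to text.
    # A real system would stream audio to a speech-recognition service.
    return audio_chunk.decode("utf-8")

def generate_reply(transcript: str) -> str:
    # Stage 2: LLM -- decide what the AI co-host should say next.
    return f"Thanks for that point about: {transcript}"

def respond(audio_chunk: bytes) -> str:
    # Stages 3-4: in a real system, the reply text would go to TTS and
    # a lip-synced avatar renderer; here I just return the reply text
    # so the end-to-end flow is visible.
    return generate_reply(transcribe(audio_chunk))
```

My main uncertainty is whether stages 3–4 (TTS plus synchronized facial animation) are handled inside D-ID's real-time offering, or whether I would wire in external ASR/LLM/TTS services myself.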

I would appreciate any guidance or resources you can provide. Thank you in advance for your support.

Best regards,
Claire