Guidance on Streaming Full-Body Avatars (Clips Premium)
Hi D-ID Support team,
I’m building a service where all conversational logic runs on my own backend; I only need D-ID for the visual layer:
- Avatar type: full-body Clips Premium presenter
- Mode: real-time streaming (my backend pushes each user message as it arrives)
- Idle state: while the avatar is not speaking, I overlay a silent "idle" loop so the scene never freezes
The missing piece is a reliable signal for when the streaming avatar actually starts and finishes speaking, so I can fade the idle clip out/in at the right time.
I’ve seen three possible approaches:
1. WebSocket events (speech_start, speech_end) – some docs and forum posts reference them, but I can't establish a wss://…/clips/streams/{id}/events connection for Clips Premium.
2. WebRTC track events (track.onunmute, track.onended) – in my tests they don't fire, because the track arrives unmuted and stays open for the whole session.
3. Polling RTCPeerConnection.getStats() – feasible, but it relies on heuristics (audioLevel / framesDecoded) and is less precise.
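For context, this is roughly what my getStats() fallback looks like today: a small detector that debounces raw audioLevel samples into start/stop callbacks. Everything here (class name, thresholds, poll interval) is my own guesswork, not anything from your docs:

```javascript
// Hypothetical fallback: turn noisy per-sample audioLevel readings from
// getStats() into speech_start / speech_end callbacks. Thresholds are guesses.
class SpeechActivityDetector {
  constructor({ threshold = 0.01, startMs = 150, endMs = 700 } = {}) {
    this.threshold = threshold; // audioLevel above this counts as "voice"
    this.startMs = startMs;     // sustained voice required before speech_start
    this.endMs = endMs;         // sustained silence required before speech_end
    this.speaking = false;
    this.changedAt = null;      // timestamp when samples started disagreeing
    this.onSpeechStart = () => {};
    this.onSpeechEnd = () => {};
  }

  // Feed one sample: audioLevel in [0, 1], now in milliseconds.
  push(audioLevel, now) {
    const voiced = audioLevel > this.threshold;
    if (voiced === this.speaking) {
      this.changedAt = null;    // samples agree with current state; reset timer
      return;
    }
    if (this.changedAt === null) this.changedAt = now;
    const held = now - this.changedAt;
    if (voiced && held >= this.startMs) {
      this.speaking = true;
      this.changedAt = null;
      this.onSpeechStart();     // fade the idle overlay out here
    } else if (!voiced && held >= this.endMs) {
      this.speaking = false;
      this.changedAt = null;
      this.onSpeechEnd();       // fade the idle overlay back in here
    }
  }
}
```

In the browser I poll pc.getStats() roughly every 100 ms, find the inbound-rtp audio report, and call push() with its audioLevel (where the browser exposes it) and timestamp. It works, but tuning the threshold is fragile, which is why I'm asking about an official signal.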
Could you confirm:
1. Which method is officially supported for Clips Premium streams today?
2. If WebSocket events are unavailable for this tier, is polling WebRTC stats the recommended fallback?
3. Any best-practice tips for generating a seamless idle clip (e.g., specific fluent / pad_audio settings) so transitions look smooth?
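For reference, this is roughly the payload I generate the idle clip with today. The presenter ID and script input are placeholders, and the config fields reflect my current (possibly wrong) understanding of fluent and pad_audio, so please correct anything that's off:

```json
{
  "presenter_id": "<my-premium-presenter-id>",
  "script": {
    "type": "text",
    "input": "."
  },
  "config": {
    "fluent": true,
    "pad_audio": 1.0
  }
}
```

The near-empty script input is my attempt at a "breathing only" loop, fluent: true is meant to smooth motion between segments, and the trailing pad_audio is there so the loop point lands on a calm frame.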
Thanks a lot for your guidance.
Best regards,
David.