Cache me if you can - Paris Vibe-a-thon: “Build a One-Person Unicorn” — November 15, 2025
AI Tinkerers - Paris
Hackathon Showcase


Live voice AI companion that analyzes prosody and transcript to produce real-time 0–100 emotional scores and longitudinal timelines.

3 members

🍕 Project Name: Pizza – Your Emotional Voice Companion

Short description (problem, target, solution)
Today, millions of people confide in AI, but almost every system looks only at text and ignores tone – the pauses, stress, and energy that actually reveal how we feel. Pizza is a mental-wellness-oriented voice companion for young adults and knowledge workers who talk to AI about their day. It uses live prosody analysis plus conversational voice agents to track how you sound over time and gives you a gentle “emotional index” after each session – like a mood journal built from your voice, not just your words.


🧩 Core Problem & Target Customer

  • Problem: Existing chatbots are blind to vocal cues; they cannot reliably perceive stress, fatigue, or emotional drift across days. Users get nice answers, but no signal about their own emotional patterns.
  • Target customer:

    • Young adults & professionals who already use AI as a daily confidant.
    • People interested in self-awareness and emotional hygiene, not clinical therapy.

⭐ Main Features

  1. Live voice conversations
    • The user speaks to Pizza in the browser or on mobile.
    • ElevenLabs handles low-latency STT + TTS for a natural voice dialogue.
  2. Real-time prosody analysis (Audio Intelligence)
    • The incoming audio stream is mirrored to our prosody service.
    • We extract pitch, intensity, speech rate, and pause patterns.
    • These features are fused with the transcript and a lightweight LLM sentiment score to compute a per-segment emotional index (0–100) and labels like calm, stressed, low energy.
  3. Emotional timeline & journaling
    • Each session stores: transcript, emotional curve, and a short textual reflection generated by the agent.
    • Users see evolution over time (day vs. day, week vs. week) to spot patterns (e.g., “Mondays after 8pm you’re always more tense”).
  4. Gentle feedback, not therapy
    • Pizza is explicitly a self-care assistant, not a therapist.
    • It reacts to high stress scores with simple grounding suggestions and signposts to professional help instead of trying to “treat” anything.
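
To make the prosody step of feature 2 concrete, here is a minimal sketch of how intensity and pause patterns could be derived from a mono audio buffer. It is not Pizza's actual service: the names `ProsodyFeatures`, `frame_rms`, and `extract_prosody`, the 20 ms frame size, and the silence threshold are all assumptions chosen for illustration, and pitch and speech rate are left out (they would typically come from an audio library such as librosa or torchaudio, as listed under Technologies Used).

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ProsodyFeatures:
    mean_intensity: float   # mean RMS over non-silent frames
    pause_ratio: float      # fraction of frames below the silence threshold
    longest_pause_s: float  # longest run of silent frames, in seconds


def frame_rms(samples: Sequence[float], frame_len: int) -> List[float]:
    """Split the signal into non-overlapping frames and return each frame's RMS."""
    rms = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return rms


def extract_prosody(samples: Sequence[float], sample_rate: int = 16000,
                    frame_ms: int = 20, silence_rms: float = 0.02) -> ProsodyFeatures:
    """Hypothetical feature extractor; samples are floats in [-1, 1]."""
    frame_len = int(sample_rate * frame_ms / 1000)
    rms = frame_rms(samples, frame_len)
    if not rms:
        # Not enough audio for a single frame: treat it as one long pause.
        return ProsodyFeatures(0.0, 1.0, 0.0)
    silent = [r < silence_rms for r in rms]
    voiced = [r for r, is_silent in zip(rms, silent) if not is_silent]
    longest = run = 0
    for is_silent in silent:
        run = run + 1 if is_silent else 0
        longest = max(longest, run)
    return ProsodyFeatures(
        mean_intensity=sum(voiced) / len(voiced) if voiced else 0.0,
        pause_ratio=sum(silent) / len(silent),
        longest_pause_s=longest * frame_ms / 1000,
    )
```

A fixed silence threshold is the simplest possible choice; a real deployment would likely calibrate it per microphone or per user.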

🎬 Demo Flow for Judges

  1. Start a session
    • From the web UI, click “Start talking”. The mic connects to the ElevenLabs Conversational Agent.
  2. Live conversation
    • The user speaks for ~1–2 minutes about their day.
    • On screen, a small emotion bar and waveform update in real time (prosody + sentiment).
  3. Adaptive AI voice
    • As the emotional index rises (stress) or drops (low energy), we send metadata back to the agent so the ElevenLabs voice softens or becomes more upbeat using style controls.
  4. Session summary
    • After ending the call, we show:

      • Emotional graph for the session.
      • 2–3 bullet insights (e.g., “You sped up when talking about work deadlines.”).
      • Overall daily score and comparison with previous sessions.
  5. History view
    • Judges can open the journal view and see several past demo sessions to illustrate longitudinal tracking.
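
The longitudinal comparison behind the history view can be sketched as a simple grouping of past session scores by weekday and time of day, which is enough to surface a pattern like tense Monday evenings. The function name `weekday_profile`, the `(started_at, score)` input shape, and the 8pm cut-off for "evening" are illustrative assumptions, not the project's actual schema.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean
from typing import Dict, Iterable, Tuple

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def weekday_profile(
    sessions: Iterable[Tuple[datetime, float]],
    evening_hour: int = 20,
) -> Dict[Tuple[str, str], float]:
    """Average session score per (weekday, part-of-day) bucket."""
    buckets = defaultdict(list)
    for started_at, score in sessions:
        part = "evening" if started_at.hour >= evening_hour else "day"
        buckets[(WEEKDAYS[started_at.weekday()], part)].append(score)
    return {key: mean(scores) for key, scores in buckets.items()}


# Example: two Monday-evening sessions and one Tuesday-morning session.
history = [
    (datetime(2025, 11, 10, 21, 0), 70),
    (datetime(2025, 11, 17, 20, 30), 80),
    (datetime(2025, 11, 11, 9, 0), 40),
]
profile = weekday_profile(history)
```

Comparing each bucket's average against the user's overall average is one way to decide which buckets are worth turning into an insight.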

🏅 How Pizza Meets the Judging Criteria

1. Execution & Functionality

  • Fully working end-to-end loop: mic → ElevenLabs agent → Pizza backend (analysis) → emotional scoring → UI visualization.
  • Core flows implemented:

    • Start/stop session
    • Live emotion indicator
    • Post-session summary and storage
  • Clear boundaries & guardrails (non-therapeutic, safety checks on content).
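
As a rough illustration of what post-session storage might hold, the sketch below models one session as a transcript, an emotional curve, and a reflection, and serializes it to JSON for a document store. All class and field names are hypothetical, and defining the overall score as a duration-weighted mean of segment indices is an assumption; the write-up does not say how the daily score is computed.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class SegmentScore:
    start_s: float
    end_s: float
    index: int   # per-segment emotional index, 0-100
    label: str   # e.g. "calm", "stressed", "low energy"


@dataclass
class SessionRecord:
    session_id: str
    started_at: str              # ISO 8601 timestamp
    transcript: str
    curve: List[SegmentScore] = field(default_factory=list)
    reflection: str = ""

    def overall_score(self) -> Optional[float]:
        """Duration-weighted mean of segment indices (assumed definition)."""
        total = sum(seg.end_s - seg.start_s for seg in self.curve)
        if total <= 0:
            return None
        weighted = sum((seg.end_s - seg.start_s) * seg.index for seg in self.curve)
        return weighted / total

    def to_json(self) -> str:
        # asdict() recurses into the nested SegmentScore dataclasses.
        return json.dumps(asdict(self))


record = SessionRecord(
    session_id="demo-1",
    started_at="2025-11-15T10:00:00",
    transcript="Today was busy...",
    curve=[SegmentScore(0, 10, 40, "calm"), SegmentScore(10, 40, 80, "stressed")],
    reflection="You sounded more tense when talking about deadlines.",
)
```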

2. AI & Agents Usage

  • ElevenLabs Conversational Agent is the central orchestrator of the voice conversation.
  • We use AI in two distinct layers:
  1. Conversational layer – LLM-driven, empathetic but bounded dialog.
  2. Audio intelligence layer – prosody features + LLM sentiment to build an emotional index.
  • The agent uses the emotional index as state to adapt its responses (slower, more reassuring voice when stress is high; more energetic tone when energy is low).
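
One possible way to fuse the two signals into a single 0–100 index is a weighted blend of a prosody-derived arousal value and text sentiment. This is a sketch under stated assumptions, not the team's scoring logic: the function names, the 0.7 prosody weight, and the label thresholds are invented for illustration. It follows the convention described in the demo flow, where a high index means stress and a low index means low energy.

```python
def emotional_index(arousal: float, sentiment: float,
                    prosody_weight: float = 0.7) -> int:
    """Blend prosody arousal (0..1) and text sentiment (-1..1) into 0-100.

    Higher values mean more stress. The weights are illustrative assumptions.
    """
    arousal = min(max(arousal, 0.0), 1.0)
    sentiment = min(max(sentiment, -1.0), 1.0)
    negativity = (1.0 - sentiment) / 2.0  # maps -1..1 sentiment to 1..0
    raw = prosody_weight * arousal + (1.0 - prosody_weight) * negativity
    return min(max(round(100 * raw), 0), 100)


def label_for(index: int, stressed_at: int = 70, low_energy_at: int = 30) -> str:
    """Map an index to one of the labels named in the write-up (assumed cut-offs)."""
    if index >= stressed_at:
        return "stressed"
    if index <= low_energy_at:
        return "low energy"
    return "calm"
```

Weighting prosody above text reflects the project's premise that tone carries signal the words miss; the exact ratio would need tuning against real sessions.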

3. Scalability (One-Person Model)

  • Backend is a stateless API running on serverless (Cloud Run / Functions).
  • Persistence via managed storage (Firestore / Cloud Storage / Postgres-like DB).
  • All heavy lifting (STT/TTS/streaming) is offloaded to ElevenLabs, meaning one developer can maintain and evolve the product by:

    • Updating the scoring logic.
    • Adjusting prompts and safety rules.
    • Adding new visualizations without touching infra.

4. Problem Clarity & Market Impact

  • Clear, focused problem: help people understand how they feel over time, not just answer their questions.
  • The solution is complementary to existing mental health apps:

    • No diagnosis, no clinical promise.
    • Works as a low-friction daily ritual (talk 5 minutes → get emotional snapshot).
  • Potential impact:

    • Better self-awareness and earlier detection of burnout-like patterns.
    • Can be offered as a white-label SDK for wellness apps and HR wellbeing programs.

5. Demo & Product Narrative

  • Narrative is simple and relatable:

    “You already talk to AI like a friend. Pizza listens not just to your words, but to your voice, and helps you see how you’re really doing over time.”

  • Live demo shows:

    • Real conversation, not a canned script.
    • Emotional indicator moving as the speaker’s tone changes.
    • A clean, easy-to-grasp journal summary at the end.

6. Partner Tech Usage (ElevenLabs, Google Cloud, etc.)

  • ElevenLabs

    • Conversational / Realtime API for streaming audio & low-latency transcription.
    • TTS with expressive styles to adapt voice tone.
    • Agents used as the central conversational loop and integration point.
  • Google Cloud / Others

    • Vertex / Gemini (or equivalent) for lightweight text sentiment and intent classification.
    • Cloud Run or Functions to host backend APIs.
    • Firestore / Cloud Storage for session and user history.
    • Optional: BigQuery for aggregated analytics over many sessions (future).

🧠 Autonomous Agent Logic & Failure Modes

Agent logic (high level):

  1. Receive transcribed user utterance + prosody features.
  2. Update emotional state: state.emotion_index, state.label.
  3. Decide response style:
    • High stress → shorter, slower, more validating replies.
    • Neutral → open questions, reflective listening.
    • Low energy → slightly more upbeat, but still soft.
  4. Generate response text via LLM with explicit safety and “not a therapist” prompt.
  5. Send response + style hints to ElevenLabs TTS.
  6. Log state and summary at end of session.
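
Steps 3 and 4 above can be sketched as a lookup from emotional label to a response style, plus a prompt builder that always prepends the safety framing. Everything here is hypothetical: `StyleHint` and its fields are illustrative and are not ElevenLabs parameters, the "calm" label is assumed to play the role of the neutral state, and the preamble wording is a placeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StyleHint:
    max_sentences: int  # cap on reply length
    pace: str           # hint for the voice layer: "slow" or "normal"
    tone: str           # hint for both the LLM and the voice layer


# Mapping mirrors the three cases listed in the agent logic.
STYLE_BY_LABEL = {
    "stressed": StyleHint(max_sentences=2, pace="slow", tone="validating"),
    "calm": StyleHint(max_sentences=4, pace="normal", tone="reflective"),
    "low energy": StyleHint(max_sentences=3, pace="normal", tone="soft-upbeat"),
}

SAFETY_PREAMBLE = (
    "You are a self-care companion, not a therapist. "
    "Do not diagnose or treat; point to professional help when needed."
)


def decide_style(label: str) -> StyleHint:
    """Unknown labels fall back to the neutral style."""
    return STYLE_BY_LABEL.get(label, STYLE_BY_LABEL["calm"])


def build_prompt(utterance: str, label: str) -> str:
    """Assemble the LLM prompt for one turn, safety framing first."""
    style = decide_style(label)
    return (
        f"{SAFETY_PREAMBLE}\n"
        f"Reply in at most {style.max_sentences} sentences, tone: {style.tone}.\n"
        f"User said: {utterance}"
    )
```

Keeping the style table as data means the response behavior can be adjusted without touching the turn-handling code.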

Known failure modes & mitigations:

  • Mis-transcription or noisy audio

    • Fall back to “I didn’t quite catch that, can you repeat slowly?”
    • De-weight outlier segments in emotional index.
  • Emotion misclassification

    • Use smoothing over time (rolling window) to avoid overreacting to one wrong spike.
    • Show emotional index as approximate + always allow user to mark “that’s not accurate” (future feedback loop).
  • Network / streaming failures

    • Graceful degradation to text-only chat if audio breaks mid-session.
  • Safety concerns (self-harm, abuse, etc.)

    • LLM checks for high-risk content.
    • Instead of counseling, Pizza responds with validated, pre-approved templates and redirects to professional hotlines/resources.
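
Two of the mitigations above, rolling-window smoothing and de-weighting of outlier segments, can be combined in one small structure: a confidence-weighted moving average over the last few segments. The class name `SmoothedIndex`, the window size, and the idea of passing a per-segment confidence (for example, lowered when the transcription looks noisy) are assumptions made for this sketch.

```python
from collections import deque


class SmoothedIndex:
    """Confidence-weighted rolling mean over the most recent segments."""

    def __init__(self, window: int = 5) -> None:
        self._items = deque(maxlen=window)  # (index, weight) pairs

    def update(self, index: float, confidence: float = 1.0) -> float:
        """Add one segment and return the smoothed index.

        A confidence of 0 keeps the segment in the window but gives it
        no influence, which is how a noisy segment gets de-weighted.
        """
        self._items.append((index, max(confidence, 0.0)))
        total = sum(weight for _, weight in self._items)
        if total == 0:
            return float(index)
        return sum(value * weight for value, weight in self._items) / total


smoother = SmoothedIndex(window=3)
first = smoother.update(40)
second = smoother.update(40)
noisy = smoother.update(100, confidence=0.0)  # suspected mis-transcription
```

With a window of three, a single full-confidence spike to 100 after two segments at 40 would move the displayed value to 60 rather than 100, which is the damping effect the mitigation aims for.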

🛠 Technologies Used

  • Languages & Frameworks

    • Backend: Python (FastAPI) or Node.js (Express)
    • Frontend: React + TypeScript
  • APIs & Libraries

    • ElevenLabs Conversational AI / Realtime API
    • ElevenLabs TTS (v2/v3) with style / emotion controls
    • Vertex / Gemini API for sentiment + reflection generation
    • Audio processing: librosa / torchaudio or similar for prosody features
  • Hosting & Orchestration

    • Google Cloud Run / Functions for backend
    • Static hosting for frontend (Cloud Run, Firebase Hosting, or similar)
    • Firestore / SQL DB for user data and session logs
  • Datasets

    • No external private datasets; we rely on live demo data & synthetic test conversations generated with ElevenLabs voices to validate the scoring pipeline.

🚀 Next Steps After the Hackathon

  1. Improve emotion model
    • Replace hand-crafted features with a dedicated paralinguistic / emotion embedding model.
    • Add per-language calibration (English, French, etc.).
  2. Personalization
    • Learn a user’s baseline voice over multiple sessions and adjust scores relative to their own normal, not a generic model.
  3. API / SDK
    • Expose Pizza as a drop-in API so wellness apps can plug in “emotional voice journaling” in a day.
  4. Clinical collaborations (long-term)
    • Explore research pilots with psychologists or universities, staying within clear ethical and regulatory boundaries.
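
The personalization step could, for instance, re-express each raw score relative to the user's own history, so that 50 always means "your normal". The sketch below uses a z-score against past session scores; the function name, the 25-points-per-standard-deviation scaling, and the minimum history length are assumptions, since this feature is planned rather than built.

```python
from statistics import mean, pstdev
from typing import Sequence


def relative_score(raw: float, history: Sequence[float],
                   min_sessions: int = 3,
                   points_per_sigma: float = 25.0,
                   min_sigma: float = 1.0) -> float:
    """Rescale a raw index so that 50 means 'this user's usual level'.

    With too little history, the raw score is returned unchanged.
    """
    if len(history) < min_sessions:
        return float(raw)
    baseline = mean(history)
    # Floor the spread so a very flat history does not blow up the z-score.
    spread = max(pstdev(history), min_sigma)
    z = (raw - baseline) / spread
    return min(max(50.0 + points_per_sigma * z, 0.0), 100.0)
```

For example, a user whose sessions usually score around 50 would see a raw 50 stay at 50, while the same raw value for someone whose baseline is 30 would land well above the midpoint.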

In short, Pizza is a focused, buildable, and scalable voice companion that showcases ElevenLabs’ strength in conversational audio while adding a unique emotional-intelligence layer with a clear real-world use case.

Before this hackathon, we had an earlier prototype of the idea under a different name: a batch audio journal running on Google Cloud / Vertex AI.
That prior work included:

  • A small FastAPI backend with Docker + Cloud Run deployment scripts.
  • An async pipeline where users uploaded an audio file, which was sent to Google STT and a Gemini model for text-based analysis.
  • A basic prosody experiment using Google Cloud’s audio features on whole recordings (not streaming) and a simple “session score”.
  • A minimal web UI skeleton (auth, layout, color theme) and shared UI components, plus some Terraform / GCP config.

During this hackathon, we forked and renamed the project to Pizza and built all of the following from scratch or heavily reworked

Tools: ElevenLabs, Google Cloud, Lovable, n8n