Why I’m Building an AI Speech Tool for People Who Stutter

Stuttering and speech disfluency affect many people, especially in situations where clear communication is important. I’ve been working on a small project that uses AI to clean up recorded speech—removing stutters, filler words, and long pauses—while trying to keep the speaker’s voice as natural as possible. It’s still early, but the goal is to make recordings easier to understand, not to change how someone speaks. In this post, I’ll explain why I started building it, how it works, and where it might be useful.

5/8/2024 · 2 min read

Robotics, AI, Insights

Introduction

Speech disfluencies—such as stutters, repeated syllables, filler words, and long pauses—are common and completely natural. But in certain situations, like preparing a voice message, giving a presentation, or creating video content, they can make recordings harder to follow or less effective at communicating the speaker’s point.

This project started as an experiment: could I build a lightweight AI tool that helps people clean up disfluent speech in recordings, without losing their natural voice or style? Over time, the idea became more structured, and I’ve been working on a prototype that combines transcription, voice cloning, and speech synthesis to make it possible.

How It Works

One thing I learned early on is that simply identifying stutters and cropping them out of the audio doesn’t work well. It often makes the speech sound choppy, robotic, or even harder to listen to than before. Natural human speech has rhythm, pacing, and tone that get disrupted when you cut audio segments directly.

So instead, I came up with a better approach:

  1. Use a speech-to-text model to transcribe the audio.

  2. Clean up the transcription by removing disfluencies like stutters or filler words (a simplified sketch of this step follows the list).

  3. Convert the cleaned transcript back into speech—in the speaker’s original voice.
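The middle step is the easiest one to show concretely. The transcription service already drops most filler words (more on that below), but word repetitions can still survive into the text, so a small cleanup pass helps. The regex and filler list below are my own simplified illustration of that step, not the exact rules the prototype uses:

```python
import re

# Words to strip even if the transcription keeps them (illustrative list only).
FILLERS = {"um", "uh", "erm", "hmm"}

def remove_disfluencies(text: str) -> str:
    # Collapse immediate word repetitions, e.g. "I I I think" -> "I think".
    # Deliberately naive: it would also collapse intentional repeats.
    text = re.sub(r"\b(\w+)(\s+\1\b)+", r"\1", text, flags=re.IGNORECASE)
    # Drop standalone filler words.
    words = [w for w in text.split() if w.lower().strip(",.") not in FILLERS]
    return " ".join(words)

print(remove_disfluencies("I I I think, um, the the results are, um, really good"))
# -> "I think, the results are, really good"
```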

For transcription, I’m using AssemblyAI, which handles informal speech well and leaves filler words out of the transcript. Its speaker diarisation feature labels who is speaking, which helps when working with dialogues or interviews.
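Here is roughly what that step looks like with AssemblyAI’s Python SDK (the API key and file name are placeholders). Turning on speaker_labels enables diarisation; filler words like “um” and “uh” are excluded from the transcript by default unless you explicitly ask for them:

```python
import assemblyai as aai

aai.settings.api_key = "YOUR_API_KEY"  # placeholder

# speaker_labels enables diarisation; filler words stay out of the transcript
# unless disfluencies is explicitly set to True in the config.
config = aai.TranscriptionConfig(speaker_labels=True)

transcript = aai.Transcriber().transcribe("recording.wav", config=config)

# With speaker_labels on, utterances come back tagged per speaker.
for utterance in transcript.utterances:
    print(f"Speaker {utterance.speaker}: {utterance.text}")
```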

Then, to recreate the cleaned speech, I use Spark-TTS, a lightweight voice-cloning model that Unsloth provides fine-tuning support for. It can generate realistic speech that closely matches the original speaker’s voice, even without training on large amounts of their audio. The result is a smoother, more fluent version of the same voice.
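I won’t reproduce the Spark-TTS calls here, since the exact entry point depends on how the model is loaded; treat clone_and_speak below as a stand-in for that inference step rather than the real interface. The only concrete code is the reference-clip extraction (using pydub, which is my choice and not part of Spark-TTS): zero-shot cloning just needs the cleaned text plus a short clip of the original recording as the voice prompt.

```python
from pydub import AudioSegment  # used here for slicing audio; not part of Spark-TTS

def clone_and_speak(text: str, reference_wav: str, out_path: str) -> None:
    """Stand-in for the Spark-TTS inference step: given text and a short
    reference clip of the target voice, write synthesized audio in that
    voice to out_path. Left unimplemented rather than guessing at the
    model's actual API."""
    raise NotImplementedError

# Zero-shot cloning only needs a few seconds of the speaker, so take the
# first ~10 seconds of the original recording as the voice prompt.
AudioSegment.from_wav("recording.wav")[:10_000].export("reference.wav", format="wav")

clone_and_speak(
    text="the cleaned transcript from the previous steps",
    reference_wav="reference.wav",
    out_path="cleaned_recording.wav",
)
```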

I’m also exploring video generation to add realistic facial motion to match the cleaned audio, which could be helpful for video-based communication.

Why This Matters

The point of this project isn’t to “fix” how someone speaks. It’s about giving people the option to improve clarity in recordings where it matters—without needing to re-record or overly edit themselves.

This can be useful in a variety of situations:

  • Content creators who want their speech to sound clearer in videos or podcasts

  • Students or professionals preparing voiceovers or presentations

  • People with speech disfluencies who want to be better understood in recorded formats

There are very few tools that handle this well, especially ones that respect the speaker’s voice and give full control over what gets changed. Most importantly, this approach keeps the identity of the speaker intact—it doesn’t replace how they speak, just offers a cleaner version when needed.

Having this kind of tool could reduce anxiety around communication and save time in editing, all while keeping things authentic. It’s not a solution to everything, but it could make certain parts of everyday life a bit easier for some people.

Conclusion

This tool is still a work in progress, and I’m learning as I go. There are many details to fine-tune, especially when it comes to making the final output feel natural and accurate. But if it ends up helping people communicate more clearly in recorded settings—without needing to re-record or worry about disfluencies—I’ll consider it worthwhile.