This is a small text-to-speech tool that you can use for anything you’d like, entirely free of charge. Paste text in the box, pick a voice, and click Generate. Your browser turns it into audio entirely on your device. It is the same tool I use to generate the recordings of my essays.
It runs the open-source Kokoro model directly in your browser, using WebGPU when it’s available. Despite its lightweight architecture, it delivers comparable quality to larger (and premium) models while being significantly faster and entirely free.
Kokoro is distributed under the Apache-2.0 license. If you want to use it for your project, you can find it on Hugging Face.
How to use it
Paste your text. Pick a primary Voice from the dropdown; there are currently twenty-eight. If you want a hybrid voice, pick a second one in Blend with and adjust the mix. Click Generate.
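For the curious, here is a minimal sketch of what a blend could look like under the hood. It assumes each voice is stored as a fixed-length vector of numbers and that the mix setting is a weight between 0 and 1, in which case a hybrid voice is a weighted average of the two. The function name and the vector layout are illustrative assumptions, not taken from this tool's source.

```typescript
// Hypothetical helper: linear blend of two voice vectors.
// mix = 0 keeps only the primary voice, mix = 1 keeps only the second one,
// and mix = 0.5 is an even 50/50 blend.
function blendVoices(
  primary: Float32Array,
  secondary: Float32Array,
  mix: number
): Float32Array {
  if (primary.length !== secondary.length) {
    throw new Error("voice vectors must have the same length");
  }
  // Clamp the weight so out-of-range slider values cannot extrapolate.
  const w = Math.min(1, Math.max(0, mix));
  const out = new Float32Array(primary.length);
  for (let i = 0; i < primary.length; i++) {
    out[i] = (1 - w) * primary[i] + w * secondary[i];
  }
  return out;
}
```

Clamping the weight keeps the result between the two source voices, which is why a blend tends to sound like a mixture rather than a new, unrelated voice.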
You can add pauses in your audio by writing [0.5s] in the text. Change the duration to what you prefer.
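As an illustration of how markers like [0.5s] can be handled, the sketch below splits the input into spoken text and pauses, then converts a pause into a number of silent samples at the 24 kHz rate the model produces. The names splitPauses and silenceSamples are hypothetical, and the exact marker grammar (digits, optional decimals, then "s") is an assumption based on the example above.

```typescript
type Segment =
  | { kind: "text"; text: string }
  | { kind: "pause"; seconds: number };

// Split text on pause markers such as [0.5s] or [2s].
function splitPauses(input: string): Segment[] {
  const marker = /\[(\d+(?:\.\d+)?)s\]/g;
  const segments: Segment[] = [];
  let last = 0;
  let m: RegExpExecArray | null;
  while ((m = marker.exec(input)) !== null) {
    // Text between the previous marker and this one, if any.
    const before = input.slice(last, m.index).trim();
    if (before) segments.push({ kind: "text", text: before });
    segments.push({ kind: "pause", seconds: parseFloat(m[1]) });
    last = m.index + m[0].length;
  }
  // Whatever follows the final marker.
  const tail = input.slice(last).trim();
  if (tail) segments.push({ kind: "text", text: tail });
  return segments;
}

// Number of zero-valued samples needed for a pause of the given length.
function silenceSamples(seconds: number, sampleRateHz: number = 24000): number {
  return Math.round(seconds * sampleRateHz);
}
```

With this approach, each text segment would be synthesized separately and the silent stretches concatenated in between.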
The first time you click Generate, your browser downloads the Kokoro model. It’s roughly 300 MB, so the first run can take a minute or two depending on your connection. After that the model is cached in your browser, and subsequent runs are fast — a couple of minutes of audio takes about as many seconds on a modern machine.
When it finishes, you can play it back, download the MP3, or grab the WAV.
A few things worth knowing
Browser. Chrome, Edge, and Brave support WebGPU, which is what makes this quick. Safari and Firefox fall back to a slower WASM mode — it still works, just be patient.
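A rough sketch of that fallback decision: browsers with WebGPU expose a gpu property on navigator, so its presence is a reasonable first check. The pickBackend helper is hypothetical and takes a navigator-like object so it can be read on its own; a more careful check would also request an adapter via navigator.gpu.requestAdapter(), which can resolve to null even when the property exists.

```typescript
type Backend = "webgpu" | "wasm";

// Hypothetical helper: choose the fast path when the browser exposes WebGPU,
// otherwise fall back to the slower WASM mode.
function pickBackend(nav: { gpu?: unknown }): Backend {
  return nav.gpu ? "webgpu" : "wasm";
}

// In a browser this would be called as:
//   pickBackend(navigator as { gpu?: unknown })
```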
Privacy. Everything runs locally. No telemetry, no server calls, no third parties, no cookies. Your text and the generated audio stay in your browser and never reach our servers or inbox in any way.
Quality. Some voices are better than others; it depends on how long they were trained, and I have no control over that. Blending two voices within the same language usually produces a more interesting result than picking one. I recommend Heart (Female American), as it is the most extensively trained of all.
Audio Files. The tool outputs WAV and MP3 files. WAV is uncompressed PCM (24 kHz, 16-bit, mono, ~384 kbps, ~2.9 MB per minute of audio). It is lossless relative to what the model produces, but the model itself only produces 24 kHz audio, so this isn’t studio quality; it’s speech quality without lossy compression on top. MP3 is encoded with lamejs (24 kHz, mono, CBR, 128 kbps, ~960 KB per minute).
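The file-size figures above follow directly from the audio parameters, and the small sketch below reproduces the arithmetic. The function names are illustrative, sizes use decimal units (1 MB = 1,000,000 bytes), and the small WAV header and MP3 framing overhead are ignored.

```typescript
// Bitrate of uncompressed PCM in kilobits per second.
function pcmKbps(sampleRateHz: number, bitsPerSample: number, channels: number): number {
  return (sampleRateHz * bitsPerSample * channels) / 1000;
}

// Bytes needed for one minute of audio at a constant bitrate.
function bytesPerMinute(kbps: number): number {
  return ((kbps * 1000) / 8) * 60;
}

const wavKbps = pcmKbps(24000, 16, 1);          // 384 kbps
const wavPerMinute = bytesPerMinute(wavKbps);   // 2,880,000 bytes, about 2.9 MB
const mp3PerMinute = bytesPerMinute(128);       // 960,000 bytes, about 960 KB
```

At these settings the MP3 is one third the size of the WAV, since 128 kbps is one third of 384 kbps.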
Generate Your AI Voice From Text
AI Voice Samples
You can listen below to a short sample of each voice reading the following sentence out loud:
Some mornings, the coffee tastes better than usual — and I never quite know why.
American Samples
Adam (Male)
Alloy (Female)
Aoede (Female)
Bella (Female)
Echo (Male)
Eric (Male)
Fenrir (Male)
Heart (Female)
Jessica (Female)
Kore (Female)
Liam (Male)
Michael (Male)
Nicole (Female)
Nova (Female)
Onyx (Male)
Puck (Male)
River (Female)
Santa (Male)
Blend Adam (M) + Heart (F) – 50/50
British Samples
Sarah (Female)
Sky (Female)
Alice (Female)
Daniel (Male)
Emma (Female)
Fable (Male)
George (Male)
Isabella (Female)
Lewis (Male)
Lily (Female)