Naaadai — Verified vernacular voice data for India's AI

The gap

When a language goes unrecorded, an entire community becomes invisible to the technology built on top of it.

01

AI learns from the data it is given

Most Indian speech datasets capture urban, standardized Hindi. The rural dialects that millions actually speak are thinly represented, so models stumble the moment they leave the city.

02

You can test it for yourself

Open any major voice platform and try Maithili or Theti. It either fails or defaults to generic Hindi. Theti appears in no commercial dataset we have found.

Inside the dataset

The same sentence, two ways the people of Khagaria actually speak it.

In Gogri Jamalpur, everyday Hindi is shaped by Theti: different words, different sounds, different rhythm. We collect both, so models learn how people really talk.

“आज खेत में पानी देना है।” One line, captured in both the local Hindi dialect and Theti. Request both to compare them side by side.

Coverage

Four languages, sourced where they are actually spoken.

ठेठी

Our home ground

Theti (ठेठी), the dialect no dataset has

For most buyers, Theti is the reason they come to us first. It is the everyday language of Khagaria, and it sits in no commercial dataset we have ever found.

Our founder grew up speaking it. That is why we can cover it with a depth and accuracy no crowdsourcing platform can match.

Where

Khagaria district, Bihar (Gogri Jamalpur belt)

Language family

A dialect in the Angika branch of the Bihari languages

Status

Angika, Theti's parent language, is listed by UNESCO as vulnerable

Availability

Found in no commercial speech dataset; collected first-hand by Naaadai

How it works

Built on the ground, reviewed by hand.

Every dataset moves through the same four steps, run by people, not scripts.

1

Collect

Native speakers record both read and spontaneous speech through our verified NGO network in Bihar and Maharashtra.

2

Transcribe

Every file is transcribed verbatim by a native speaker of that language, not by an automated tool.

3

Review

A second trained annotator checks each file by hand, so audio, transcript and labels all match.

4

Deliver

You receive clean audio, time-aligned transcripts, rich metadata and a clear commercial licence.

What you receive

A dataset your team can train on the day it lands.

Audio

Per-utterance files in WAV, FLAC or MP3. Delivered to your target sample rate.

Transcripts

Verbatim, time-aligned text. Plain text, CSV or JSON.

Speaker metadata

Age band, gender, language and dialect, region. Anonymised, with no personal identifiers.

Content mix

Read and spontaneous speech. Blended to your use case on request.

Quality

Two-pass native-speaker review. Per-batch quality report included.

Licensing

Clear commercial licence. Exclusive and per-use terms available.
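
As a rough illustration of how a team might sanity-check one delivered record before training, here is a minimal Python sketch. The manifest layout and every field name in it (`audio`, `sample_rate_hz`, `transcript`, `segments`, `start_s`, `end_s`, `speaker`) are assumptions made for this example, not a published Naaadai schema. The timings and speaker values are placeholders, not real data.

```python
import json

# Hypothetical manifest entry. Field names, timings and speaker values are
# illustrative placeholders, not Naaadai's actual delivery schema or data.
SAMPLE_RECORD = """
{
  "audio": "utt_0001.wav",
  "sample_rate_hz": 16000,
  "transcript": "आज खेत में पानी देना है।",
  "segments": [
    {"start_s": 0.00, "end_s": 0.85, "text": "आज खेत में"},
    {"start_s": 0.85, "end_s": 1.90, "text": "पानी देना है।"}
  ],
  "speaker": {
    "age_band": "25-34",
    "gender": "female",
    "language": "Hindi",
    "dialect": "local",
    "region": "Khagaria, Bihar"
  }
}
"""

# Speaker metadata fields listed in "What you receive".
REQUIRED_SPEAKER_FIELDS = {"age_band", "gender", "language", "dialect", "region"}


def validate_record(record):
    """Return a list of human-readable problems; an empty list means it passed."""
    problems = []

    # 1. All expected speaker metadata fields are present.
    missing = REQUIRED_SPEAKER_FIELDS - set(record.get("speaker", {}))
    if missing:
        problems.append(f"missing speaker fields: {sorted(missing)}")

    # 2. Time-aligned segments are well formed and do not overlap.
    segments = record.get("segments", [])
    previous_end = 0.0
    for i, seg in enumerate(segments):
        if seg["start_s"] >= seg["end_s"]:
            problems.append(f"segment {i}: start is not before end")
        if seg["start_s"] < previous_end:
            problems.append(f"segment {i}: overlaps the previous segment")
        previous_end = seg["end_s"]

    # 3. The segment texts, joined by spaces, reproduce the full transcript.
    joined = " ".join(seg["text"] for seg in segments)
    if joined != record.get("transcript", ""):
        problems.append("segment text does not add up to the transcript")

    return problems


record = json.loads(SAMPLE_RECORD)
print(validate_record(record))  # prints [] for the sample record above
```

A check like this is cheap to run over a whole batch, and it complements rather than replaces the per-batch quality report that ships with the data.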

Sourcing, consent & ethics

Data your legal team can approve without a second look.

✓

Informed consent

Every speaker takes part knowingly and willingly, with consent recorded before a single file is used.

⌗

Clear licensing & provenance

Each dataset ships with a clean commercial licence and documentation showing exactly where it came from.

♥

Ethically sourced

Collected through a women-led NGO network that pays fairly, so the income stays in the rural communities the data comes from.

Founder's note

"All of India says “main” for “I”. Where I am from, we say “hum”, and for years I was made to feel ashamed of it."

Where I grew up, we speak Hindi with our own accent, in our own way. The rest of the country says “main” when they mean I, one person. We say “hum”, which sounds like “we”, but we mean only ourselves. It is small, but it is the kind of thing that marks you the moment you leave home.

So many people from our towns and villages carry a quiet identity crisis when they go to a city or abroad, because no one understands why we talk the way we do. I lived it. I was laughed at for my accent and for saying things “the wrong way”, until I started to believe there was something wrong with where I came from.

For the longest time I didn't even know my language had a name, or that it sits under Angika. It was just how we talked at home in Khagaria. The way we speak was never a mistake to be corrected. It is a language, with its own logic and history.

I started Naaadai so the next generation never has to feel that shame. I want our voices written down, understood, and built into the technology everyone else already takes for granted, so that being from here is something to be proud of, not to hide.

Tejaswini, Founder · Naaadai

Questions

What buyers ask us first.

Can we license the data commercially, or exclusively?

Yes. Every dataset comes with a clear commercial licence and full consent and provenance documentation. Exclusive and custom licensing terms are available; just tell us how you intend to use the data.

How quickly can you deliver?

Samples from existing material can go out within days. Timelines for larger or custom collections depend on the languages, volume and content mix; we will give you a clear schedule with your quote.

What formats and metadata do you provide?

Per-utterance audio in WAV, FLAC or MP3, verbatim time-aligned transcripts in text, CSV or JSON, and anonymised speaker metadata covering age band, gender, language, dialect and region. We deliver to your target sample rate.

Can you collect custom data for a specific dialect or domain?

Yes. Our on-the-ground NGO network lets us collect to a brief: specific dialects, age or gender mixes, scripted prompts for your vocabulary, or spontaneous speech in a domain that matters to you.

How do you ensure quality and consent?

Every file is transcribed by a native speaker and reviewed by a second trained annotator, with a quality report per batch. Consent is recorded from every speaker before any file is used, and no audio is ever sourced from anonymous web scraping.

Work with us

Request a sample, or get a quote.

We work with AI companies, research labs, and annotation platforms that need verified vernacular voice data from communities the field has overlooked.

inquiries@naaadaivoice.com