Microsoft’s VALL-E neural codec language model can clone a voice from just three seconds of audio, marking a significant advancement in text-to-speech synthesis. The system was trained on 60,000 hours of English speech from over 7,000 unique speakers, roughly 100 times more training data than previous TTS systems used. In June 2024, Microsoft introduced VALL-E 2, which its researchers describe as the first system to achieve human parity in zero-shot text-to-speech synthesis. The broader AI voice generator market reached $3.0 billion in 2024 and is projected to grow to $20.4 billion by 2030.
VALL-E Key Statistics
- VALL-E requires only 3 seconds of audio to clone a voice with 85% accuracy as of 2024
- The AI voice generator market reached $3.0 billion in 2024 with projected growth to $20.4 billion by 2030
- Voice phishing attacks using AI-cloned voices increased 442% from H1 to H2 2024
- 70% of people cannot distinguish between real and AI-cloned voices
- ElevenLabs, a VALL-E competitor, raised $180 million at a $3.3 billion valuation in January 2025
VALL-E Training Data and Model Architecture
Microsoft built VALL-E using the Libri-Light dataset assembled by Meta, comprising 60,000 hours of English speech recordings sourced primarily from LibriVox public domain audiobooks. This training corpus included audio from more than 7,000 distinct speakers, enabling the model to generalize across diverse voices, accents, and acoustic environments.
The training data scale represents a fundamental departure from traditional TTS development. Previous systems relied on studio-quality recordings from a limited number of speakers, while VALL-E’s dataset was 100 times larger than the training data of any prior text-to-speech system. Microsoft conducted training on 16 NVIDIA Tesla V100 32GB GPUs.
| VALL-E Training Metric | Value |
|---|---|
| Total Training Audio | 60,000 hours |
| Unique Speakers | 7,000+ |
| Voice Sample Required | 3 seconds minimum |
| Training Scale vs Previous Systems | 100x larger |
| Training Hardware | 16 NVIDIA V100 GPUs |
VALL-E 2 Performance Benchmarks
Microsoft introduced VALL-E 2 in June 2024, achieving what its researchers describe as human parity in zero-shot text-to-speech synthesis. This made VALL-E 2 the first system to reach this milestone in voice cloning technology.
The updated model introduced two critical innovations. Repetition Aware Sampling stabilizes decoding by preventing the model from getting stuck in infinite loops during inference, while Grouped Code Modeling groups codec codes to shorten the sequence and speed up processing. These improvements enabled VALL-E 2 to exceed human speech quality on subjective metrics, including SMOS (similarity mean opinion score) and CMOS (comparative mean opinion score, a measure of naturalness).
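The two ideas can be illustrated with a short sketch. It is not Microsoft's implementation: it follows one reading of the published description, in which a token is first drawn by nucleus sampling and, if that token already dominates the recent history, is redrawn from the full distribution. The function names, the `pad_id` value, and the default `top_p`, `window`, and `threshold` settings are all assumptions chosen for illustration.

```python
import random
from typing import List, Sequence


def nucleus_sample(probs: Sequence[float], top_p: float, rng: random.Random) -> int:
    """Sample from the smallest set of top-ranked tokens whose mass reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept: List[int] = []
    mass = 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


def repetition_aware_sample(
    probs: Sequence[float],
    history: Sequence[int],
    rng: random.Random,
    top_p: float = 0.8,      # hypothetical default
    window: int = 10,        # hypothetical default
    threshold: float = 0.1,  # hypothetical default
) -> int:
    """Nucleus-sample a token; redraw from the full distribution if it repeats too much."""
    token = nucleus_sample(probs, top_p, rng)
    recent = list(history[-window:])
    if recent:
        ratio = recent.count(token) / len(recent)
        if ratio > threshold:
            # Fall back to plain random sampling over all tokens to break the loop.
            token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return token


def group_codes(codes: Sequence[int], group_size: int, pad_id: int = 0) -> List[List[int]]:
    """Pack codec codes into fixed-size groups so the model sees a shorter sequence."""
    padded = list(codes) + [pad_id] * (-len(codes) % group_size)
    return [padded[i:i + group_size] for i in range(0, len(padded), group_size)]


rng = random.Random(0)
probs = [0.9, 0.05, 0.05]
history = [0] * 10  # token 0 has been emitted ten times in a row
next_token = repetition_aware_sample(probs, history, rng)
groups = group_codes([1, 2, 3, 4, 5], group_size=2)  # 5 codes become 3 grouped steps
```

With a group size of 2, the model takes roughly half as many autoregressive steps for the same audio, which is where the speedup in this sketch comes from.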
Microsoft confirmed that due to potential misuse risks, VALL-E 2 remains a research project with no public release planned. The company cited concerns about voice identification spoofing and speaker impersonation as primary reasons for restricting access.
AI Voice Generator Market Growth Statistics
The AI voice generator market has experienced rapid expansion, driven by technologies like VALL-E and competing platforms. The global market reached $3.0 billion in 2024 and is projected to grow to $20.4 billion by 2030, representing a compound annual growth rate of 37.1%.
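As a quick sanity check on growth figures like these, the compound annual growth rate implied by two endpoints can be computed directly. The sketch below uses the market values quoted above; the function name `cagr` is just an illustrative helper, and the choice of 2024 as the base year is an assumption.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate implied by a start value, an end value, and a span in years."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1 / years) - 1


# AI voice generator market: $3.0B (2024) to $20.4B (2030), assuming a 2024 base year
rate = cagr(3.0, 20.4, 2030 - 2024)
print(f"{rate:.1%}")  # prints 37.6%
```

The result lands close to the quoted 37.1%. The small gap may come from rounding in the published endpoints or from a different base period in the underlying report.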
North America accounted for 40.6% of the global market share in 2023. The software segment dominates with 67.2% market share, while media and entertainment represents the largest end-user category.
Venture capital funding for voice AI reached approximately $2.1 billion in 2024, nearly a seven-fold increase from $315 million in 2022. The AI voice cloning market specifically was valued at $2.1 billion in 2023 and is projected to reach $25.6 billion by 2033.
Voice AI Market Segments
| Market Segment | Base Value (Year) | Projected Value (Year) |
|---|---|---|
| AI Voice Generator Market | $3.0 billion (2024) | $20.4 billion (2030) |
| AI Voice Cloning Market | $2.1 billion (2023) | $25.6 billion (2033) |
| Voice AI Agents Market | $2.4 billion (2024) | $47.5 billion (2034) |
| Text-to-Speech Market | $3.87 billion (2025) | $7.28 billion (2030) |
VALL-E Competitor Investment and Funding
ElevenLabs has emerged as VALL-E’s primary commercial competitor, tripling its valuation within one year. The company raised $180 million in a Series C round in January 2025 at a $3.3 billion valuation, following an $80 million Series B in January 2024 at a $1.1 billion valuation.
ElevenLabs has raised $281 million across all funding rounds since 2022 and achieved an estimated annual recurring revenue of $200 million by August 2025. The company provides AI-driven dubbing in 32 languages and serves 41% of Fortune 500 companies, with customers including The Washington Post, HarperCollins, and Paradox Interactive.
OpenAI introduced its Voice Engine in March 2024, which requires only 15 seconds of audio to clone a voice. This represents a longer input requirement than VALL-E’s 3-second minimum but still demonstrates significant advancement in voice synthesis technology.
Voice Cloning Security Risk Statistics
Technologies like VALL-E have raised significant security concerns due to their potential for misuse in deepfake audio scams and fraud. One in four adults has experienced AI voice scams, and 70% of people cannot distinguish between real and cloned voices.
Voice phishing attacks using AI-cloned voices jumped 442% from the first to the second half of 2024. Among victims of voice clone scams, 77% reported losing money. The largest single deepfake fraud loss reached $25 million in the Arup case in February 2024.
Projected AI Fraud Losses
Deloitte projects that generative AI-enabled fraud will grow from $12.3 billion in 2023 to $40 billion by 2027, representing a 32% compound annual growth rate. North American deepfake fraud losses exceeded $200 million in Q1 2025 alone. Over 400 companies face CEO deepfake attacks daily, and 49% of businesses reported deepfake incidents in 2024.
| Security Risk Metric | Value |
|---|---|
| Adults Experiencing AI Voice Scams | 25% (1 in 4) |
| Unable to Distinguish Real vs Cloned | 70% |
| Voice Clone Victims Who Lost Money | 77% |
| Average Deepfake Attack Cost | ~$500,000 |
| Vishing Surge (H1 to H2 2024) | 442% increase |
VALL-E Technology Applications
Microsoft has identified several legitimate use cases for VALL-E technology while acknowledging the need for consent protocols and detection mechanisms. Neural and AI-powered voices now dominate the TTS market with a 67.9% revenue share in 2024.
Customer service and interactive voice response systems represent the largest application segment, accounting for 31.3% of market share. Microsoft has extended VALL-E capabilities through VALL-E X for cross-lingual synthesis supporting English, Chinese, and Japanese, with a maximum audio generation length of 22 seconds.
The text-to-speech market reached $3.87 billion in 2025 with projections of $7.28 billion by 2030 at a 12.89% CAGR. VALL-E preserves the speaker’s emotion and the acoustic environment of the original audio prompt, including reverberation if it is present in the sample.
FAQ
How much audio does VALL-E need to clone a voice?
VALL-E requires only 3 seconds of audio to clone a voice with approximately 85% accuracy. OpenAI’s Voice Engine requires 15 seconds for comparable results.
Is VALL-E available to the public?
No. Microsoft has kept VALL-E and VALL-E 2 as research-only projects with no public release planned due to concerns about voice spoofing and impersonation risks.
How large is the AI voice generator market?
The AI voice generator market reached $3.0 billion in 2024 and is projected to grow to $20.4 billion by 2030, representing a 37.1% compound annual growth rate.
Can people detect AI-cloned voices?
70% of people cannot distinguish between real and AI-cloned voices. Human accuracy in identifying high-quality deepfake videos stands at only 24.5%.
What are the main security risks of voice cloning?
Voice phishing attacks using AI-cloned voices increased 442% from the first to the second half of 2024. Among voice clone scam victims, 77% lost money. Deloitte projects that generative AI-enabled fraud losses will reach $40 billion by 2027.
