Integrating AI Text-to-Speech into Android Apps: A Practical Guide

The AudioBook app started as a weekend experiment: could I convert any PDF to natural-sounding audio? Eighteen months and 50K+ users later, I know more about TTS pipelines, EPUB parsers, and Android audio management than I ever expected to.

This guide covers the real implementation challenges — the ones the official docs don't tell you about.

📖 What's Covered

TTS provider selection, EPUB/PDF text extraction, chunking strategies for long-form content, and building a robust Android audio playback system with chapter navigation.

Choosing a TTS Provider

I evaluated four providers: Google Cloud TTS, Amazon Polly, ElevenLabs, and Android's built-in TextToSpeech API.

Android TTS (built-in): Free, offline, terrible voice quality. Fine for accessibility, not for an audiobook experience.
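Google Cloud TTS: Best price/quality balance. WaveNet and Neural2 voices are excellent. List pricing when I evaluated it was $4 per 1M characters for Standard voices and $16 per 1M for WaveNet and Neural2, so check the current pricing page before budgeting.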
Amazon Polly: Good quality, competitive pricing, but voices sound slightly synthetic compared to Google WaveNet.
ElevenLabs: Best voice quality by far. Premium pricing that doesn't scale for a free-tier app.

I went with Google Cloud TTS. The en-US-Neural2 voices (a separate family from WaveNet) pass the "close your eyes" test: most users can't distinguish them from human narration at normal playback speed.
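To make the provider choice concrete, here is a minimal sketch of the JSON body that Google Cloud's REST `text:synthesize` method expects. The helper names (`jsonEscape`, `buildSynthesizeRequest`) and the default voice `en-US-Neural2-C` are illustrative assumptions, not code from the app; on Android you would more likely build the body with `org.json.JSONObject` or the official client library.

```kotlin
/** Escapes a string so it can sit inside a JSON string literal. */
fun jsonEscape(s: String): String = buildString {
    for (c in s) when (c) {
        '"' -> append("\\\"")
        '\\' -> append("\\\\")
        '\n' -> append("\\n")
        '\r' -> append("\\r")
        '\t' -> append("\\t")
        // Remaining control characters must be written as \u00XX escapes.
        else -> if (c < ' ') append(String.format("\\u%04x", c.code)) else append(c)
    }
}

/**
 * Builds the request body for POST https://texttospeech.googleapis.com/v1/text:synthesize.
 * languageCode and voiceName are assumed to be trusted constants, so only the
 * user-supplied text is escaped.
 */
fun buildSynthesizeRequest(
    text: String,
    languageCode: String = "en-US",
    voiceName: String = "en-US-Neural2-C", // assumed default; any listed voice works
    speakingRate: Double = 1.0
): String =
    """
    {
      "input": {"text": "${jsonEscape(text)}"},
      "voice": {"languageCode": "$languageCode", "name": "$voiceName"},
      "audioConfig": {"audioEncoding": "MP3", "speakingRate": $speakingRate}
    }
    """.trimIndent()
```

The response carries the audio as a base64 string in its `audioContent` field, so the client decodes it before writing the chunk to disk.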

EPUB & PDF Parsing: The Hard Part

Text extraction sounds simple until you hit edge cases: PDFs with text in images, EPUBs with complex HTML structures, right-to-left languages, and academic papers with multi-column layouts.

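import java.io.InputStream
import nl.siegmann.epublib.epub.EpubReader
import org.jsoup.Jsoup

// Assumed shape of the chapter model used below.
data class Chapter(val index: Int, val title: String, val text: String)

class EpubTextExtractor {
    fun extract(inputStream: InputStream): List<Chapter> {
        val book = EpubReader().readEpub(inputStream)
        // Walk the spine so chapters come out in reading order.
        return book.spine.spineReferences.mapIndexed { index, ref ->
            // Decode explicitly (EPUB content is normally UTF-8) instead of using the platform default.
            val rawHtml = String(ref.resource.data, Charsets.UTF_8)
            val text = Jsoup.parse(rawHtml).text().trim()
            Chapter(index = index, title = ref.resource.title ?: "Chapter ${index + 1}", text = text)
        }.filter { it.text.length > 50 } // drop covers, blank pages, and other near-empty spine items
    }
}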

For PDFs, I use PdfBox-Android for text extraction and fall back to OCR with ML Kit's text recognition for image-based PDFs. The hybrid approach handles ~94% of the PDFs users upload.
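The fallback decision itself can be sketched independently of either library. In the sketch below, `PageOcr` is a hypothetical interface standing in for whatever OCR engine you wire up, and the 20-character threshold is an assumed value for illustration, not a figure from the app.

```kotlin
/** Hypothetical abstraction over an OCR engine; not a real library interface. */
fun interface PageOcr {
    fun recognize(pageIndex: Int): String
}

/**
 * Keeps the embedded text layer when a page has enough of it and
 * falls back to OCR for pages that look image-based.
 */
fun extractWithFallback(
    textLayerPages: List<String>,
    ocr: PageOcr,
    minCharsPerPage: Int = 20 // assumed threshold; tune it on real uploads
): List<String> = textLayerPages.mapIndexed { index, text ->
    val cleaned = text.trim()
    if (cleaned.length >= minCharsPerPage) cleaned else ocr.recognize(index).trim()
}
```

Deciding per page rather than per file matters for mixed documents, such as a scanned appendix inside an otherwise digital PDF, where only some pages need the slower OCR path.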

Android Audio Playback Pipeline

The biggest architectural challenge: TTS APIs cap the input size per request (5,000 bytes for Google Cloud, which is fewer than 5,000 characters once the text contains non-ASCII characters). Long-form books need chunking, caching, and gapless playback across chunks.

My approach: chunk text at sentence boundaries, generate audio files per chunk to local cache, use Android's MediaPlayer chained via setNextMediaPlayer() for gapless transition, and preload the next chunk while the current one plays.
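The chunking step can be sketched with the JDK's `BreakIterator`. This is a minimal version under a few assumptions: the limit is measured in UTF-8 bytes, the default of 4,500 leaves headroom under the provider limit, and the function names are illustrative rather than taken from the app.

```kotlin
import java.text.BreakIterator
import java.util.Locale

/** Size of `s` when encoded as UTF-8. */
fun utf8Size(s: String): Int = s.toByteArray(Charsets.UTF_8).size

/**
 * Splits `text` into chunks of at most `maxBytes` UTF-8 bytes, breaking between
 * sentences. A sentence longer than the limit is split at word boundaries as a
 * last resort; a single word longer than the limit is emitted as-is.
 */
fun chunkBySentence(text: String, maxBytes: Int = 4500, locale: Locale = Locale.US): List<String> {
    val chunks = mutableListOf<String>()
    val current = StringBuilder()
    var currentBytes = 0

    fun flush() {
        val chunk = current.toString().trim()
        if (chunk.isNotEmpty()) chunks += chunk
        current.setLength(0)
        currentBytes = 0
    }

    fun addPiece(piece: String) {
        val pieceBytes = utf8Size(piece)
        // Start a new chunk if this piece would push the current one over the limit.
        if (currentBytes > 0 && currentBytes + pieceBytes > maxBytes) flush()
        current.append(piece)
        currentBytes += pieceBytes
    }

    val sentences = BreakIterator.getSentenceInstance(locale)
    sentences.setText(text)
    var start = sentences.first()
    var end = sentences.next()
    while (end != BreakIterator.DONE) {
        val sentence = text.substring(start, end)
        if (utf8Size(sentence) <= maxBytes) {
            addPiece(sentence)
        } else {
            // Oversized sentence: feed it in word by word (each word keeps its trailing whitespace).
            Regex("\\S+\\s*").findAll(sentence).forEach { addPiece(it.value) }
        }
        start = end
        end = sentences.next()
    }
    flush()
    return chunks
}
```

Calling `chunkBySentence(chapter.text)` once per chapter gives a natural unit for the cache and for preloading: the player only needs chunk N+1 ready while chunk N plays.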

⚠️ Watch Out For

MediaPlayer's setNextMediaPlayer() is flaky on some Xiaomi and OPPO devices. Always implement a fallback to ExoPlayer for reliable gapless playback: queue the chunks through its playlist API, which replaces the now-deprecated ConcatenatingMediaSource used in older releases.

Lessons Learned After 50K Users

Cache aggressively — regenerating TTS audio costs money and time. Store generated chunks in a content-addressed cache keyed by hash of the text + voice settings. Users who re-open a book should never wait.
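A content-addressed cache along these lines can be sketched with nothing but the JDK. The `VoiceSettings` fields, the function names, and the `.mp3` extension are assumptions for illustration; the `synthesize` lambda stands in for the real TTS call.

```kotlin
import java.io.File
import java.security.MessageDigest

/** Settings that change the generated audio (field names are illustrative). */
data class VoiceSettings(
    val voiceName: String,
    val speakingRate: Double,
    val audioEncoding: String = "MP3"
)

/** SHA-256 over the voice settings plus the chunk text, hex-encoded. */
fun cacheKey(text: String, voice: VoiceSettings): String {
    val digest = MessageDigest.getInstance("SHA-256")
    // NUL separators keep adjacent fields from running into each other.
    val header = "${voice.voiceName}\u0000${voice.speakingRate}\u0000${voice.audioEncoding}\u0000"
    digest.update(header.toByteArray(Charsets.UTF_8))
    digest.update(text.toByteArray(Charsets.UTF_8))
    return digest.digest().joinToString("") { "%02x".format(it) }
}

/** Returns the cached file for a chunk, calling `synthesize` only on a cache miss. */
fun getOrCreateAudio(
    cacheDir: File,
    text: String,
    voice: VoiceSettings,
    synthesize: (String) -> ByteArray
): File {
    val file = File(cacheDir, cacheKey(text, voice) + ".mp3") // assumes MP3 output
    if (!file.exists()) {
        // Write to a temp file first so a crash never leaves a truncated entry behind.
        val tmp = File(cacheDir, file.name + ".tmp")
        tmp.writeBytes(synthesize(text))
        check(tmp.renameTo(file)) { "could not move ${tmp.name} into place" }
    }
    return file
}
```

Because the key depends only on the text and the settings, identical passages shared between books (front matter, licence text) hit the same cache entry, and changing the voice naturally produces a new one.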

Playback speed is the most-requested feature. Google Cloud TTS's speakingRate parameter (0.25–4.0) works well, but going above 2.0x noticeably degrades quality. I built a client-side pitch-preserving speed adjustment using ExoPlayer's PlaybackParameters for speeds above 2x.
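One way to combine the two mechanisms is to cap the synthesis rate at 2.0 and let the player supply the rest. The sketch below is one possible reading of that split, not necessarily how the app does it; `SpeedPlan`, `planSpeed`, and `maxServerRate` are hypothetical names.

```kotlin
/** How a requested speed is divided between synthesis and playback. */
data class SpeedPlan(val speakingRate: Double, val playbackSpeed: Float)

// Cap taken from the observation that quality degrades above 2.0x.
val maxServerRate = 2.0

/**
 * Up to the cap, the TTS service does all the work and the player runs at 1x.
 * Above it, audio is synthesized at the cap and the player multiplies the rest.
 */
fun planSpeed(requested: Double): SpeedPlan {
    require(requested in 0.25..4.0) { "speed must be between 0.25 and 4.0" }
    return if (requested <= maxServerRate) {
        SpeedPlan(requested, 1.0f)
    } else {
        SpeedPlan(maxServerRate, (requested / maxServerRate).toFloat())
    }
}

// On the player side, the second value would be applied with a pitch of 1.0, e.g.
// player.playbackParameters = PlaybackParameters(plan.playbackSpeed, 1.0f)
```

For a requested 3.0x this yields a synthesis rate of 2.0 and a player speed of 1.5. The trade-off is cache size: every distinct `speakingRate` produces separate audio files, so an alternative is to synthesize everything at 1.0 and do all speed changes in the player.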