AI NoteTaker started as a proof-of-concept and shipped as a production app in six months. Voice transcription, GPT-powered summarization, smart tagging, and full offline support — four hard problems, each with lessons that changed how I approach AI-assisted apps.
Here's every significant architectural decision, and an honest retrospective on what I got wrong.
📖 Stack
Android (Kotlin), Google Speech-to-Text API, OpenAI GPT-3.5, Room DB, Firebase Sync, Clean Architecture.
Voice Input Pipeline
The first decision: on-device speech recognition (Android's SpeechRecognizer) vs cloud (Google Cloud Speech-to-Text). On-device is free and works offline but struggles with technical vocabulary and long-form dictation. Cloud is accurate but costs money per minute and requires connectivity.
We went with a hybrid approach: on-device for real-time display during recording (showing the user their words as they speak), cloud for the final accurate transcript after recording ends. Users see instant feedback, accuracy is high, and cloud API calls are batched — one per recording session, not streaming per word.
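The split can be sketched as two narrow interfaces: one streaming low-latency partials, one making a single batched cloud call per session. This is a minimal sketch, not the app's actual code; `OnDeviceRecognizer` and `CloudTranscriber` are hypothetical wrappers around Android's SpeechRecognizer and the Cloud Speech-to-Text client.

```kotlin
// Hypothetical wrapper around Android's on-device SpeechRecognizer:
// emits live, lower-accuracy partial results while the user speaks.
interface OnDeviceRecognizer {
    fun partialResults(): Sequence<String>
}

// Hypothetical wrapper around the Cloud Speech-to-Text client:
// one batched call per recording session, not per word.
interface CloudTranscriber {
    fun transcribe(audio: ByteArray): String
}

class RecordingSession(
    private val live: OnDeviceRecognizer,
    private val cloud: CloudTranscriber,
    private val onPartial: (String) -> Unit,
) {
    // During recording: surface on-device partials for instant feedback.
    fun streamPartials() = live.partialResults().forEach(onPartial)

    // After recording ends: a single cloud call yields the accurate transcript.
    fun finish(audio: ByteArray): String = cloud.transcribe(audio)
}
```

The point of the seam is cost control: the cloud API is billed per audio minute, so it is only ever hit once, with the full session audio, after the user stops recording.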
AI Summarization with GPT
The GPT integration was architecturally simple but had a UX problem: summarization takes 2-5 seconds, and users expected instant results. The solution: show the raw transcript immediately, trigger summarization in a background coroutine, and update the UI when complete — with a visual indicator that the summary is "generating."
import kotlin.coroutines.cancellation.CancellationException

class SummarizeNoteUseCase(
    private val openAiService: OpenAiService,
    private val noteRepository: NoteRepository
) {
    suspend operator fun invoke(noteId: String) {
        val note = noteRepository.getNote(noteId) ?: return
        // Flag the note so the UI can show its "generating" indicator.
        noteRepository.update(noteId, summarizing = true)
        try {
            val summary = openAiService.summarize(note.transcript)
            val tags = openAiService.extractTags(note.transcript)
            noteRepository.update(noteId, summary = summary, tags = tags, summarizing = false)
        } catch (e: CancellationException) {
            throw e // never swallow coroutine cancellation
        } catch (e: Exception) {
            // Record the failure without blocking the already-visible transcript.
            noteRepository.update(noteId, summarizing = false, summaryError = e.message)
        }
    }
}

Offline-First Design
Notes must be available offline — that's table stakes for a note-taking app. Every note is stored locally in Room, and Firebase Firestore syncs changes in the background when a connection is available. The golden rule: the UI always reads from Room, never from Firebase directly.
This eliminated an entire class of loading states and made the app feel instant even on slow connections. The complexity is in the sync layer — conflict resolution when the same note is edited on two devices offline required a last-write-wins strategy with merge hints.
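One way to express last-write-wins with a merge hint is a pure resolver over versioned notes. This is a simplified sketch under assumed names (`NoteVersion`, `resolve` are illustrative, not the app's real model): the newer edit wins the note body, while tags from both versions are unioned so an offline edit can't silently drop a tag added on the other device.

```kotlin
data class NoteVersion(
    val text: String,
    val tags: Set<String>,
    val updatedAt: Long, // epoch millis recorded by the editing device
)

// Last-write-wins on the body; as a merge hint, union the tags from
// both versions so neither device's tag edits are lost.
fun resolve(local: NoteVersion, remote: NoteVersion): NoteVersion {
    val winner = if (local.updatedAt >= remote.updatedAt) local else remote
    return winner.copy(tags = local.tags + remote.tags)
}
```

Device clocks are not reliable, so in practice the timestamp comparison would want a server-assigned write time or a logical clock rather than raw device time.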
What I'd Do Differently
- Use Whisper instead of Google STT: OpenAI's Whisper model is more accurate and cheaper at scale. I'd choose it today.
- Design the tag schema earlier: We refactored the tagging system twice. AI-generated tags and user-created tags needed different storage models from day one.
- Ship faster, add AI later: Voice + offline storage alone would have been a useful MVP. We over-engineered the AI features before validating the core loop.
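For the tag-schema point above, the fix we converged on amounts to storing provenance with each tag. A minimal sketch of that idea, with illustrative names (`TagSource`, `mergeAiTags` are not the app's actual API): user tags are permanent, while AI tags are disposable and get replaced wholesale whenever the model re-runs.

```kotlin
enum class TagSource { USER, AI }

data class Tag(val label: String, val source: TagSource)

// User tags always survive; AI tags are regenerable, so a fresh
// model pass replaces only the AI-sourced tags.
fun mergeAiTags(current: List<Tag>, freshAi: List<String>): List<Tag> =
    current.filter { it.source == TagSource.USER } +
        freshAi.map { Tag(it, TagSource.AI) }
```

Had the schema carried that `source` field from day one, both refactors would have been column migrations instead of rewrites.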
Despite those retrospective critiques, the app taught me more about AI-assisted mobile architecture than any tutorial or course. Ship early, ship real, and let the users teach you what matters.