Author
Ali Arbab
Project
02 / BolHisaab
Status
Launching at bolhisaab.in
BolHisaab
Voice-first Hindi/Hinglish khaata for the 63M shopkeepers running paper books.
Tap mic, say “Ram ne paanch sau udhaar liya,” and the accounting entry writes itself — confirmed in a natural Indian voice with one-tap Undo. Llama 3.1 8B parses intent in ~200ms; Sarvam Saarika v2 transcribes Indian voices natively; a single Postgres RPC inserts the row and returns the new running balance in one round trip. Append-only ledger with soft-delete audit. Pre-launch at bolhisaab.in.
“Ram ne 500 udhaar liya.” And the ledger writes itself.
Production
bolhisaab.in — coming soon
Domain registered · deploy pending
- github.com/Ali-Arbab/BolHisaab
Source
Next 16 · Supabase · Sarvam · Groq Llama 3.1 / 3.3
- Next.js 16 + React 19
- TypeScript strict
- Tailwind CSS v4
- Supabase + Anon Auth + RLS
- Llama 3.1 8B (Groq) primary
- Llama 3.3 70B fallback
- Sarvam Saarika v2 STT
- Sarvam Bulbul v2 TTS
- Whisper-large-v3-turbo fallback
- Zustand + TanStack Query
Problem
The shopkeeper at the corner store knows their books down to the rupee, in their head. To get that knowledge into accounting software they have to translate it twice — into English, then into the app's mental model of what counts as a credit vs a debit. The translation tax means most don't bother; the books stay in a paper notebook that doesn't survive a flood.
Why me
I grew up listening to shopkeepers settle accounts in Hindi mixed with two or three other languages, naming customers and amounts in a flow no software product manager has ever sat down and observed for an afternoon. The voice path felt obvious once I noticed nobody had built it.
Learned
STT accuracy and intent parsing are different problems with different solutions. Whisper-large-v3-turbo at 95%+ word accuracy on Hindi is the easy win; the hard win is getting Llama to extract 'Ram took 500 on credit yesterday' from 'Ram ne kal paanch sau udhaar liya' without inventing a 'yesterday' timestamp that's actually today. Few-shot prompting beat fine-tuning every single time at this scale.
Four sequential calls collapsed into one. ~500–800ms saved.
The hot path is a single endpoint, POST /api/voice. It collapses what used to be four sequential calls (/transcribe → /parse → handleIntent → commit) into one round trip — saving ~500–800ms end-to-end. The route accepts either a pre-computed transcript (from the browser's Web Speech API) or an audio blob (from MediaRecorder), then runs STT → Llama parse → party resolution → confidence-gated auto-commit in a single Vercel function.
Speech-to-text is a three-tier auto-select decided at mount time: Sarvam Saarika v2 when the API key is set (flagship Indian-language ASR, far stronger than generic Whisper on Hinglish + Indian numerals), browser Web Speech on Chrome (streaming, free, audio never leaves device — saves ~400ms vs an upload round trip), Groq Whisper-large-v3-turbo as the floor for non-Chrome browsers, with a Devanagari prompt that biases the decoder toward shopkeeper vocabulary. After STT, a dedupeRepeats() helper collapses Whisper's known double-utterance hallucination before the LLM sees the text.
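The selection order and the repeat-collapsing step described above can be sketched roughly as follows. This is a minimal illustration, not the project's code: `SttEnvironment`, `pickSttBackend`, and the whole-phrase repeat rule inside `dedupeRepeats` are assumptions, and only the tier ordering and the helper's name come from the text.

```typescript
// Hypothetical types and names; only the tier ordering is taken from the text.
type SttBackend = "sarvam-saarika-v2" | "web-speech" | "groq-whisper-large-v3-turbo";

interface SttEnvironment {
  hasSarvamKey: boolean; // assumed flag: a Sarvam API key is configured
  hasWebSpeech: boolean; // assumed flag: the browser exposes the Web Speech API
}

// Decided once (the text says at mount time): Sarvam first, Web Speech second, Whisper as the floor.
function pickSttBackend(env: SttEnvironment): SttBackend {
  if (env.hasSarvamKey) return "sarvam-saarika-v2";
  if (env.hasWebSpeech) return "web-speech";
  return "groq-whisper-large-v3-turbo";
}

// Assumed rule: if the whole transcript is one phrase of 2+ words repeated
// back to back, keep a single copy. Single doubled words are left alone so
// ordinary Hindi reduplication ("dheere dheere") survives.
function dedupeRepeats(transcript: string): string {
  const words = transcript.trim().split(/\s+/).filter((w) => w.length > 0);
  const n = words.length;
  for (let period = 2; period * 2 <= n; period++) {
    if (n % period !== 0) continue;
    let isRepeat = true;
    for (let i = period; i < n; i++) {
      if (words[i].toLowerCase() !== words[i % period].toLowerCase()) {
        isRepeat = false;
        break;
      }
    }
    if (isRepeat) return words.slice(0, period).join(" ");
  }
  return words.join(" ");
}
```

Running the collapse before the LLM call keeps a doubled utterance from being parsed as two transactions.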
Intent parsing is two-tier. Llama 3.1 8B-instant is the primary at temperature: 0, max_tokens: 400, response_format: { type: "json_object" } — returns in ~200ms. Llama 3.3 70B-versatile is the fallback (~800ms), reached only when 8B JSON cannot be parsed even after a regex extraction pass. gpt-oss-120b was explicitly rejected as primary — it failed Groq's strict json_object validator ~80% of the time, doubling effective latency to ~1800ms once the fallback kicked in. Smaller + tolerant beat bigger + brittle.
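A rough sketch of the two-tier flow with the regex extraction pass, assuming the model calls are injected as plain async functions. `ModelCall`, `extractJsonObject`, `parseIntent`, and the shape of the final UNKNOWN object are hypothetical names for illustration; the real route talks to Groq directly.

```typescript
// Hypothetical signature: takes a transcript, returns the model's raw text output.
type ModelCall = (transcript: string) => Promise<string>;

// Try the raw output first, then the outermost {...} span (the regex
// extraction pass for stray prose or markdown fences around the JSON).
function extractJsonObject(raw: string): Record<string, unknown> | null {
  const candidates: string[] = [raw];
  const match = raw.match(/\{[\s\S]*\}/);
  if (match) candidates.push(match[0]);
  for (const candidate of candidates) {
    try {
      const parsed: unknown = JSON.parse(candidate);
      if (typeof parsed === "object" && parsed !== null && !Array.isArray(parsed)) {
        return parsed as Record<string, unknown>;
      }
    } catch {
      // Not valid JSON; try the next candidate.
    }
  }
  return null;
}

// The 70B tier is reached only when the 8B output cannot be parsed even
// after extraction.
async function parseIntent(
  transcript: string,
  primary8b: ModelCall,
  fallback70b: ModelCall,
): Promise<{ intent: Record<string, unknown>; tier: "8b" | "70b" | "none" }> {
  const first = extractJsonObject(await primary8b(transcript));
  if (first !== null) return { intent: first, tier: "8b" };
  const second = extractJsonObject(await fallback70b(transcript));
  if (second !== null) return { intent: second, tier: "70b" };
  return { intent: { intent: "UNKNOWN" }, tier: "none" }; // assumed shape
}
```

Keeping the extraction step synchronous and separate from the network calls makes it easy to unit-test against captured model outputs.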
The parser sits behind a 200-entry in-memory LRU keyed by transcript + context hash; only complete intents are cached. Defense-in-depth against malformed LLM JSON runs in five layers: JSON mode at the API boundary, a regex extractor for stray prose, a hand-written normalize() that fills missing keys with null and coerces enums, Zod parse, and a 70B fallback retry. If everything still fails, the route returns a hard-coded UNKNOWN so the user always gets some spoken response.
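The normalize step and the bounded cache might look something like the sketch below. The key list, the enum values, and `cacheKey` are assumptions made for illustration; the project's actual schema lives in its Zod definitions, which are not shown in the source.

```typescript
// Assumed enum values, for illustration only.
const DIRECTIONS = ["credit", "debit"];
const MODES = ["cash", "upi"];

interface NormalizedIntent {
  intent: string | null;
  party: string | null;
  amount: number | null;
  direction: string | null;
  mode: string | null;
}

function coerceEnum(value: unknown, allowed: string[]): string | null {
  if (typeof value !== "string") return null;
  const v = value.trim().toLowerCase();
  return allowed.indexOf(v) >= 0 ? v : null;
}

// Fill every key, using null where the model dropped or mangled a field.
function normalize(raw: Record<string, unknown>): NormalizedIntent {
  const rawAmount = raw["amount"];
  const amount = typeof rawAmount === "number" ? rawAmount : Number(rawAmount);
  return {
    intent: typeof raw["intent"] === "string" ? (raw["intent"] as string) : null,
    party: typeof raw["party"] === "string" ? (raw["party"] as string) : null,
    amount: isFinite(amount) && amount > 0 ? amount : null,
    direction: coerceEnum(raw["direction"], DIRECTIONS),
    mode: coerceEnum(raw["mode"], MODES),
  };
}

// LRU built on Map insertion order: the first key is the least recently used.
class LruCache<V> {
  private entries = new Map<string, V>();
  private capacity: number;

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  get(key: string): V | undefined {
    const value = this.entries.get(key);
    if (value === undefined) return undefined;
    this.entries.delete(key); // re-insert to mark as most recently used
    this.entries.set(key, value);
    return value;
  }

  set(key: string, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.capacity) {
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
  }
}

// Hypothetical key: transcript plus a hash of whatever context the prompt saw.
function cacheKey(transcript: string, contextHash: string): string {
  return contextHash + ":" + transcript.trim().toLowerCase();
}

const intentCache = new LruCache<NormalizedIntent>(200);
```

A caller would check `intentCache.get(cacheKey(...))` before hitting the model, and store a result only when the intent is complete.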
An 8-state machine that knows when Chrome is lying.
The client is a flat 8-state machine, stored in Zustand:
idle → recording → transcribing → parsing → confirming → executing → done | error
Status drives every visible piece of chrome — the Hinglish status pill above the mic (“Sun raha hoon… (stop ke liye dabayein)”, “Soch raha hoon…”, “Likh raha hoon…”), the pulsing dot in the header, and the confirmation sheet. The store deliberately does not cache server data — that lives in TanStack Query (15s staleTime, refetch on focus). Voice state is per-turn and ephemeral; ledger state is shared and live.
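As a sketch of how a flat status union can drive the UI, the eight states can be modelled as a string union with an explicit transition table. The table and the status-to-label mapping below are assumptions (the source lists the states and the strings but not which edges or pairings exist), and the Zustand wiring is left out so the example stays dependency-free.

```typescript
type VoiceStatus =
  | "idle" | "recording" | "transcribing" | "parsing"
  | "confirming" | "executing" | "done" | "error";

// Assumed edges: the happy path from the text, plus error exits and a
// direct parsing -> done edge for server-side auto-commit.
const NEXT: Record<VoiceStatus, VoiceStatus[]> = {
  idle: ["recording"],
  recording: ["transcribing", "error"],
  transcribing: ["parsing", "error"],
  parsing: ["confirming", "done", "error"],
  confirming: ["executing", "idle", "error"],
  executing: ["done", "error"],
  done: ["idle"],
  error: ["idle"],
};

// Ignore illegal jumps instead of throwing, so a late browser event
// cannot push the UI into an impossible state.
function transition(from: VoiceStatus, to: VoiceStatus): VoiceStatus {
  return NEXT[from].indexOf(to) >= 0 ? to : from;
}

// Assumed mapping of the Hinglish pill text onto states.
const STATUS_LABEL: Partial<Record<VoiceStatus, string>> = {
  recording: "Sun raha hoon… (stop ke liye dabayein)",
  parsing: "Soch raha hoon…",
  executing: "Likh raha hoon…",
};
```

Because every visible element reads from the one status value, there is no combination of booleans that can disagree with each other.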
Capture is push-to-stop (one tap to start, one tap to end) — not push-to-talk and not auto-stop on silence. Auto-stop kept truncating natural pauses mid-sentence (“Ram ne… [pause] paanch sau udhaar liya” became just “Ram ne”). Web Speech runs continuous: true with a finalRef and endedRef that ride through Chrome's three end-conditions (silence, user-stop, error) without hanging the UI. MediaRecorder uses 16 kHz mono with echo-cancellation/noise-suppression/AGC and a MIN_DURATION_MS = 500 guard because Whisper hallucinates Hindi nonsense on sub-half-second clips.
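The capture settings and the short-clip guard could be expressed roughly like this. The constraint keys are standard `getUserMedia` audio constraints; the object name and `isLongEnough` are hypothetical, and whether a clip of exactly 500ms passes is an assumption.

```typescript
const MIN_DURATION_MS = 500;

// Plain object in the shape accepted by navigator.mediaDevices.getUserMedia().
const AUDIO_CONSTRAINTS = {
  audio: {
    sampleRate: 16000, // 16 kHz
    channelCount: 1, // mono
    echoCancellation: true,
    noiseSuppression: true,
    autoGainControl: true,
  },
};

// Drop clips shorter than the minimum before they ever reach STT.
function isLongEnough(startedAtMs: number, stoppedAtMs: number): boolean {
  return stoppedAtMs - startedAtMs >= MIN_DURATION_MS;
}
```

Checking duration on the client avoids paying for an upload and an STT call that would only return hallucinated text.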
Two cross-platform shims earn their keep. iOS Safari requires a user gesture to unlock audio playback — primeTTS() speaks an empty utterance at volume 0 on the first mic tap to unlock both speechSynthesis and the <Audio> element. MediaRecorder labels webm blobs as audio/webm;codecs=opus, but Sarvam's validator only accepts the bare audio/webm — so the server strips the ;codecs=... suffix and re-wraps the blob as a fresh File before sending. A separate <PrewarmVoice /> component fires a fire-and-forget POST on app mount — empty FormData hits the 400 early-return inside the handler but still forces Turbopack to compile the heavy voice route end-to-end before the user's first real press. The “first request is slow” cliff vanishes during demos.
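The MIME fix-up is small enough to show in full. A minimal sketch, assuming the server has the upload as a `Blob`; `stripCodecs`, `rewrapForUpload`, and the default filename are hypothetical names.

```typescript
// "audio/webm;codecs=opus" -> "audio/webm"
function stripCodecs(mimeType: string): string {
  return mimeType.split(";")[0].trim();
}

// Re-wrap the same bytes in a new File whose type has no parameters.
function rewrapForUpload(blob: Blob, filename: string = "audio.webm"): File {
  return new File([blob], filename, { type: stripCodecs(blob.type) });
}
```

Only the label changes; the audio bytes are passed through untouched.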
Indian numerals as deterministic mappings, not LLM arithmetic.
The system prompt is the single biggest accuracy lever in the app. It was trimmed from ~2000 tokens to ~600 because every input token costs roughly 1ms on Llama 8B-instant. It teaches the model Hindi credit/debit polarity, Indian number words as deterministic mappings, payment-mode synonyms, honorific stripping, and a strict JSON schema with null (never omitted) for unknown fields:
You parse Hindi / Hinglish / English shopkeeper voice transcripts into JSON.
Output JSON ONLY — no prose, no markdown fences.
VOCAB:
- credit (shopkeeper GAVE udhaar): "udhaar liya/diya", "le gaya", "de diya"
- debit (shopkeeper RECEIVED): "diye/diya" (party subject), "chukaya",
"vapas kiya", "paid", "mila/mile"
- Numbers: "sau"=100, "hazaar"=1000, "lakh"=100000,
"dhai sau"=250, "sava sau"=125, "pauna sau"=75,
"saade"+X = X+50 (e.g. saade paanch sau=550)
- Mode: "upi/gpay/phonepe/paytm/online/QR"→upi, "cash/nakad"→cash
- Honorifics: strip "bhai/ji/chacha/didi/uncle/aunty"
RULES:
- ALWAYS include every key from the schema. Use null (not omit) when
unknown. This is mandatory.
- Input may be Devanagari (राम ने पाँच सौ उधार लिया) or romanized.
The Indian numerals table is the differentiator. Generic ASR keeps mis-hearing dhai sau as “to son,” saade paanch sau as “say five hundred,” and pauna sau as scrambled English. Encoding them as deterministic numeric mappings inside the prompt, rather than asking the LLM to do arithmetic, turns idiomatic Hindi numbers into reliable resolutions.
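One way to keep such a table deterministic is to hold it as data and render the prompt line from it. The sketch below is an assumption about structure, not the project's code: `NUMBER_WORDS`, `saadeHundreds`, and `renderNumbersLine` are hypothetical, and the "saade" helper covers only the hundreds case given in the prompt's own example.

```typescript
// Values copied from the prompt excerpt above.
const NUMBER_WORDS: Record<string, number> = {
  "sau": 100,
  "hazaar": 1000,
  "lakh": 100000,
  "dhai sau": 250,
  "sava sau": 125,
  "pauna sau": 75,
};

// "saade" + X sau = X hundred plus fifty, e.g. saade paanch sau = 550.
function saadeHundreds(x: number): number {
  return x * 100 + 50;
}

// Render the table as a single prompt line so code and prompt cannot drift apart.
function renderNumbersLine(): string {
  const pairs = Object.keys(NUMBER_WORDS).map((word) => '"' + word + '"=' + NUMBER_WORDS[word]);
  return "- Numbers: " + pairs.join(", ");
}
```

The same table can also back a unit test that checks the model's output against the expected value for each phrase.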
Cross-script party matching is the most domain-specific code in the repo. Chrome's Web Speech API returns “Ram” sometimes and “राम” other times for the same utterance. Without normalisation, every other transaction would create a duplicate party row. The phonetic key handles this in five passes: Devanagari → Latin transliteration via a hand-built ITRANS-style map; honorific stripping in both scripts (bhai, chacha, didi, जी); spelling normalisation (ph→f, th→t); vowel folding ([aeiou]+ → a) and consonant deduplication; Levenshtein with two thresholds (0.85 auto-resolve, 0.6 disambiguation candidate).
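The five passes can be sketched as below. This is an illustration under stated assumptions: the character map is a tiny hypothetical subset (the project's hand-built map is not shown in the source), all function names are invented, and treating scores under 0.6 as a new party is a guess.

```typescript
// Tiny illustrative subset of a Devanagari -> Latin map.
const CONSONANTS: Record<string, string> = {
  "क": "k", "ग": "g", "ज": "j", "त": "t", "द": "d", "न": "n", "प": "p",
  "ब": "b", "भ": "bh", "म": "m", "य": "y", "र": "r", "ल": "l", "व": "v",
  "श": "sh", "स": "s", "ह": "h",
};
// Independent vowels and vowel signs share one map, since pass 4 folds
// every vowel run to a single "a" anyway.
const VOWELS: Record<string, string> = {
  "अ": "a", "आ": "a", "इ": "i", "ई": "i", "उ": "u", "ऊ": "u", "ए": "e", "ओ": "o",
  "ा": "a", "ि": "i", "ी": "i", "ु": "u", "ू": "u", "े": "e", "ो": "o",
};
const VIRAMA = "्";
const HONORIFICS = ["bhai", "ji", "chacha", "didi", "uncle", "aunty"];

// Pass 1: transliterate. A consonant's inherent "a" is written only when
// another consonant follows, so a word-final consonant stays bare.
function transliterate(input: string): string {
  let out = "";
  let pendingSchwa = false;
  for (let i = 0; i < input.length; i++) {
    const ch = input[i];
    if (CONSONANTS[ch] !== undefined) {
      if (pendingSchwa) out += "a";
      out += CONSONANTS[ch];
      pendingSchwa = true;
    } else {
      pendingSchwa = false;
      if (VOWELS[ch] !== undefined) out += VOWELS[ch];
      else if (ch !== VIRAMA) out += ch;
    }
  }
  return out;
}

// Passes 2 to 4: honorifics, spelling, vowel folding, consonant dedupe.
function phoneticKey(name: string): string {
  const words = transliterate(name)
    .toLowerCase()
    .split(/[^a-z]+/)
    .filter((w) => w.length > 0 && HONORIFICS.indexOf(w) < 0);
  return words
    .join(" ")
    .replace(/ph/g, "f")
    .replace(/th/g, "t")
    .replace(/[aeiou]+/g, "a")
    .replace(/([^a ])\1+/g, "$1");
}

// Pass 5: Levenshtein distance turned into a 0..1 similarity.
function levenshtein(a: string, b: string): number {
  let prev: number[] = [];
  for (let j = 0; j <= b.length; j++) prev.push(j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost));
    }
    prev = curr;
  }
  return prev[b.length];
}

function similarity(a: string, b: string): number {
  const longest = Math.max(a.length, b.length);
  return longest === 0 ? 1 : 1 - levenshtein(a, b) / longest;
}

type MatchKind = "auto-resolve" | "disambiguate" | "new-party";

function classify(score: number): MatchKind {
  if (score >= 0.85) return "auto-resolve";
  if (score >= 0.6) return "disambiguate";
  return "new-party"; // assumed behaviour below the lower threshold
}
```

With this sketch, “Ram”, “Ram bhai”, and “राम जी” all reduce to the same key, so they resolve to one party row instead of three.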
The 0.85 auto-commit gate is the demo wow-moment. When the model returns confidence ≥ 0.85, the matched party already exists, and auto-commit isn't disabled, the route writes the transaction directly and the client just shows an Undo toast — no confirmation modal, no extra tap. Llama 8B-instant consistently reports 0.9+ on clean utterances, so most single-sentence commands are saved in one round trip. The Undo is what makes this safe: voiding is a soft delete that preserves the audit trail.
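The gate itself reduces to a three-way conjunction. A minimal sketch, with `GateInput` and `shouldAutoCommit` as hypothetical names for the three conditions listed above:

```typescript
const AUTO_COMMIT_THRESHOLD = 0.85;

interface GateInput {
  confidence: number; // model-reported confidence, 0..1
  matchedPartyId: string | null; // null when no existing party matched
  autoCommitDisabled: boolean; // assumed user or request-level switch
}

// All three conditions must hold; otherwise fall back to the confirmation sheet.
function shouldAutoCommit(input: GateInput): boolean {
  return (
    input.confidence >= AUTO_COMMIT_THRESHOLD &&
    input.matchedPartyId !== null &&
    !input.autoCommitDisabled
  );
}
```

Requiring an existing party means a brand-new name always goes through confirmation, which limits the damage a mis-heard name can do.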
Append-only by construction. One RPC, one round trip.
Ram
₹500
udhaar liya
Naya baaki: ₹1,200
The ledger is append-only with soft delete. The transactions table is the source of truth; balances come from a party_balances SQL view; reversals set voided_at (and voided_reason) instead of deleting rows. Every read filters voided_at IS NULL. Once a transaction is in, it's there forever — even when voided — which is the right posture for an accounting product where dispute-resolution and audit matter more than storage.
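The same rules can be modelled in a few lines of TypeScript to show why voiding never loses history. This is an in-memory analogue of the table and view, not the actual SQL; the field names mirror the columns named above, and the sign convention (credit raises what the party owes, debit lowers it) is an assumption.

```typescript
interface Transaction {
  id: string;
  partyId: string;
  amount: number;
  direction: "credit" | "debit";
  voidedAt: string | null;
  voidedReason: string | null;
}

// Soft delete: return a copy with the void fields set; the row itself stays.
function voidTransaction(tx: Transaction, reason: string, now: Date): Transaction {
  return { ...tx, voidedAt: now.toISOString(), voidedReason: reason };
}

// Analogue of the balance view: sum only rows where voidedAt is null.
function partyBalance(ledger: Transaction[], partyId: string): number {
  let balance = 0;
  for (const tx of ledger) {
    if (tx.partyId !== partyId || tx.voidedAt !== null) continue;
    balance += tx.direction === "credit" ? tx.amount : -tx.amount;
  }
  return balance;
}
```

Undo is then just a void followed by a re-read of the balance, and the voided row remains available for audit.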
Every voice-originated row also stores its voice_transcript (raw shopkeeper utterance) and parsed_intent (full LLM JSON) as JSONB columns. Two reasons: forensics (“wait, what did Ram actually say?” is a real question users ask weeks later) and a corpus for re-training prompts when the model behaviour drifts.
Writes flow through commit_voice_tx, a security invoker PL/pgSQL function that runs as the calling user so RLS still applies:
create or replace function commit_voice_tx(
  _shop_id uuid,
  _party_id uuid,
  _amount numeric,
  _direction text,
  _mode text,
  _voice_transcript text,
  _parsed_intent jsonb
) returns table(tx_id uuid, new_balance numeric)
language plpgsql
security invoker
as $$
  -- insert + balance read in one round-trip; ~150ms saved per turn
  -- against the Singapore region's RTT from India.
$$;
~150ms saved per turn by collapsing the insert + balance-read into one PL/pgSQL round-trip — measured against Supabase's Singapore region from typical Indian connections.
Authentication is a deliberate UX choice: anonymous Supabase auth with no email, phone, or OTP. The user types a shop name and is in. Identity is auth.uid() ↔ shops.owner_id, and every row is scoped by RLS. Currency rendering uses Indian numbering throughout (₹1,23,456 via Intl.NumberFormat("en-IN")). Color semantics are domain-named: --credit (emerald) and --debit (rose) live as design tokens, not generic UI sugar — the design vocabulary mirrors the ledger vocabulary.
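For the currency rendering, a minimal sketch of the Indian grouping using `Intl.NumberFormat`. The `style: "currency"` options and the `formatRupees` name are assumptions; the source only states that the `en-IN` locale is used.

```typescript
// Assumed options: whole rupees, INR currency symbol.
const rupeeFormatter = new Intl.NumberFormat("en-IN", {
  style: "currency",
  currency: "INR",
  maximumFractionDigits: 0,
});

function formatRupees(amount: number): string {
  return rupeeFormatter.format(amount);
}

// Plain grouping without the symbol: lakh/crore style, e.g. 1,23,456.
const indianGrouping = new Intl.NumberFormat("en-IN");
```

The `en-IN` locale groups the last three digits and then pairs, which is what shopkeepers expect to read on screen.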
Domain-named CSS tokens
--credit and --debit are first-class design tokens, not utility classes. The Badge primitive has dedicated credit/debit variants. Naming the token after the meaning, not the color, lets a future redesign rebalance one without breaking the other.
iOS focus-zoom defeated
Every <input> uses text-base (16px) so Safari doesn't auto-zoom on focus. Mobile-first hygiene most projects skip.
Hand-rolled bottom sheet (no Radix)
~50 lines on top of Framer Motion's motion.div + AnimatePresence, with backdrop click-to-close, body scroll lock, ESC handler, and pb-[env(safe-area-inset-bottom)] for the iOS home-indicator. The comment in the file: 'No Radix; self-contained overlay + slide panel.'
Translator defeat
Chrome's auto-translate sees Hinglish and tries to 'fix' it client-side, breaking React hydration on every nav. Triple-belt fix: <html lang='hi' translate='no'>, notranslate class, metadata.other.google = 'notranslate', plus suppressHydrationWarning. Hard-earned bug, specific to Indic-language products.
Online/offline as useSyncExternalStore
OfflineBanner uses the React 18 textbook pattern with custom subscribe/getSnapshot/getServerSnapshot. SSR-safe, no flash-of-banner-on-hydrate.
PrewarmVoice route compilation
A 12-line client component fires a useless POST to /api/voice on mount so Turbopack compiles the heavy route before the user records anything. Empty FormData hits the 400 early-return but still forces full-route compilation.
A non-trivial slice of the engineering went into 27 documented gotchas — bugs already debugged so a future contributor doesn't rediscover them. Examples: Llama 8B's tendency to drop keys instead of returning null; MediaRecorder MIME rejection by Sarvam's validator; Whisper's noisy-input self-repeat; Chrome speech-recognition's auto-end race with user-stop.
- No application-level rate limiting. /api/parse, /api/transcribe, and /api/tts accept any caller. Implicit limits exist (Vercel maxDuration, Groq TPM/RPM, Sarvam quota) but no app-level token bucket.
- Voice transcripts persisted indefinitely. Every commit writes voice_transcript + parsed_intent to Postgres for forensics and audit. Intentional, but a real production deployment would need a right-to-be-forgotten hook.
- No CSRF tokens. Routes accept multipart form-data with cookie-based session. Real exploitation risk is low, but it's a missing layer worth flagging.
- Anonymous-auth means cookie clear = orphaned ledger. No phone/email recovery path. Trade-off taken explicitly to remove onboarding friction for the target user.
- Push-to-stop, not VAD. True hands-free with voice-activity detection was deferred — the current tap-to-stop is reliable but still requires one tap.
Aaj ka kaarobaar (Today's business)
₹4,52,683
Aapko milna hai (You are owed): ₹1,23,500
Aapko dena hai (You owe): ₹84,200
- ~200ms: Llama 8B intent parse
- ~150ms: saved per turn by RPC
- ~500–800ms: saved end-to-end
- 0.85: auto-commit threshold
- 200: LRU intent-cache entries
- 3 + 2: ASR backends + LLM tiers
- 5: JSON-defense layers
- 88px: mic FAB diameter
- 500ms: min recording duration
- 63M: Indian shopkeepers (TAM)
- 27: documented gotchas fixed
- 8: voice-state machine states