Building MooBudget: LLMs in Production for Personal Finance

Keeping track of your spending is one of those things everyone knows they should do, but few stick with. Most budgeting apps ask you to manually categorize every transaction, fill out forms, and remember to open the app after every purchase. It's tedious, and people give up.

MooBudget is my MTech capstone project at NUS, built with my teammate Donda. The idea is simple: make expense tracking as easy as sending a text. You text or voice message a Telegram bot "grabbed coffee $5" or snap a photo of a receipt, and it handles the rest. No forms. No app switching. Just a conversation.

Try MooBudget: Start tracking your expenses now — message @Moobudget_bot on Telegram or visit app.moobudget.com

The capstone is officially a 5-person project, but Donda and I handled the planning, implementation, and reports as a two-person team. Claude Code on the Max plan helped immensely — and gave us a lot of hands-on experience using LLMs for development, not just as a product feature.

Telegram is our first channel, with WhatsApp and a webchat widget on the roadmap.

The full project spans a polyglot architecture (Python + Node.js + React), cross-platform authentication between Telegram and web, Infrastructure as Code with OpenTofu, load testing with K6, and a CI/CD pipeline with Semgrep SAST — I might write about those in a future post. This one focuses on the part I'm most excited about: integrating LLMs into a production app.

Quick Context: What MooBudget Does

MooBudget supports multiple ways to log expenses:

  • Natural language text — "grabbed coffee $5" or "$12.50 for lunch at hawker"
  • Receipt photos — OpenAI's Vision API extracts the details
  • Voice messages — transcribed via OpenAI and processed like text
  • Bank statement PDFs — parsed by AWS Textract + LLM enhancement

On top of that:

  • Weekly insights — every Saturday, users get a personalized spending breakdown via Telegram
  • Ask Mode — type "How much did I spend on food this month?" and get back dynamic charts and analysis
  • Web dashboard — app.moobudget.com for richer views and data exploration

The Telegram bot is a Python service. A separate Python worker handles all async AI tasks via SQS. A Node.js/Fastify API serves the React web dashboard. Supabase (PostgreSQL) is the data layer, AWS provides S3/SQS/Textract, and Redis handles caching and rate limiting.

With that context, let's get into the LLM integration.

The Chat Service: Tool Use in Action

The Telegram bot doesn't just parse text — it uses OpenAI's function calling (tool use) to understand user intent and execute actions. When you say "grabbed coffee $5," the LLM figures out the intent, extracts the structured data (amount, category, merchant, date), calls the right tool to create the transaction, and responds with a natural language confirmation.

We kept the agentic loop deliberately simple — no multi-step planning, no chain-of-thought orchestration. A small set of tightly scoped tools. Research shows LLM performance degrades past ~15 tools, and for our use case, simplicity won. The pattern is extensible — adding a new capability is just a matter of registering a new tool.
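The registration pattern can be sketched as follows — a minimal illustration, not our production code, with a hypothetical `create_transaction` tool standing in for the real schemas:

```python
import json

# Registry maps tool name -> {schema, handler}. Adding a capability
# is just another register() call; the schema list feeds straight into
# the `tools` parameter of the chat completion request.
TOOLS = {}

def register(name, description, parameters):
    def wrap(fn):
        TOOLS[name] = {
            "schema": {"type": "function",
                       "function": {"name": name,
                                    "description": description,
                                    "parameters": parameters}},
            "handler": fn,
        }
        return fn
    return wrap

@register("create_transaction", "Record an expense",
          {"type": "object",
           "properties": {"amount": {"type": "number"},
                          "category": {"type": "string"},
                          "merchant": {"type": "string"}},
           "required": ["amount"]})
def create_transaction(amount, category=None, merchant=None):
    # In production this writes to the database; here it just echoes.
    return {"ok": True, "amount": amount, "category": category}

def dispatch(tool_call):
    """Execute one LLM tool call: {'name': ..., 'arguments': '<json>'}."""
    tool = TOOLS[tool_call["name"]]
    args = json.loads(tool_call["arguments"])
    return tool["handler"](**args)
```

When the model returns a tool call for "grabbed coffee $5", `dispatch` routes it to the handler and the result is fed back to the model for the natural-language confirmation.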

Receipt Extraction: Vision API in Production

When a user sends a receipt photo, it gets queued for async processing. A worker sends the image to OpenAI's Vision API, which extracts structured data — date, merchant, amount, category — and creates the transaction automatically.

We're seeing ~90% accuracy on well-formatted receipts. The model handles different layouts, handwritten amounts, and multiple currencies surprisingly well. Where it struggles is complex receipts with discounts, multi-item breakdowns, or poor image quality. We've seen it hallucinate amounts rather than admitting uncertainty.

To deal with this, we designed our prompts to explicitly request uncertainty signals rather than guesses. We also built a feedback loop — when users edit an AI-extracted transaction, we log the correction. Over time, this gives us accuracy metrics and a growing dataset for prompt improvements.
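The response-handling side of that uncertainty design can be sketched like this — field names, the JSON shape, and the threshold are illustrative assumptions, not our exact schema:

```python
import json

CONF_THRESHOLD = 0.7  # assumption: tuned against logged user corrections

def parse_receipt_response(raw: str):
    """Parse a Vision API reply that was prompted to return JSON like
    {"merchant": ..., "amount": ..., "date": ...,
     "confidence": {"amount": 0.4, ...}}.
    Fields the model marked uncertain (or omitted) are surfaced for
    user confirmation instead of being saved silently."""
    data = json.loads(raw)
    conf = data.get("confidence", {})
    needs_review = [f for f in ("merchant", "amount", "date")
                    if data.get(f) is None or conf.get(f, 0.0) < CONF_THRESHOLD]
    return data, needs_review
```

Anything in `needs_review` gets a "please confirm" prompt in Telegram rather than an auto-created transaction, which is much cheaper than a silently hallucinated amount.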

Bank Statement Processing: Token-Aware Batching

Bank statement processing was our most complex pipeline. Users upload a PDF, AWS Textract extracts the raw transaction data via OCR and table detection, and then an LLM enhances each transaction with proper categorization.

The interesting engineering challenge was token-aware batching. A bank statement can have hundreds of transactions. Sending them all to the LLM at once would blow the token limit, while sending too few per batch wastes money on per-request overhead. So we sample a few transactions to estimate tokens per entry, then dynamically calculate optimal batch sizes within model limits — with fallback logic if a batch fails.
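The sampling-then-chunking logic can be sketched as below — a simplified version using a rough 4-characters-per-token heuristic rather than a real tokenizer, with illustrative limits:

```python
def plan_batches(transactions, token_limit=8000, sample_size=5, overhead=500):
    """Estimate tokens per transaction from a small sample, then chunk
    the full list so each batch stays under the model's context budget.
    `overhead` reserves room for the prompt and response."""
    sample = transactions[:sample_size]
    avg_chars = sum(len(str(t)) for t in sample) / max(len(sample), 1)
    tokens_per_txn = max(int(avg_chars / 4), 1)  # crude heuristic, not tiktoken
    per_batch = max((token_limit - overhead) // tokens_per_txn, 1)
    return [transactions[i:i + per_batch]
            for i in range(0, len(transactions), per_batch)]
```

The fallback path (not shown) halves the batch size and retries when a batch still overruns or the call fails.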

We also sanitize PII before sending anything to the LLM — account numbers, personal identifiers, anything that doesn't need to be there for categorization gets stripped out.
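A minimal sketch of that sanitization step — the patterns shown are illustrative examples (long digit runs, Singapore NRIC/FIN format, emails), not our full rule set:

```python
import re

# Illustrative patterns only; the production pipeline's rules are broader.
PII_PATTERNS = [
    (re.compile(r"\b\d{10,16}\b"), "[ACCOUNT]"),      # account/card number runs
    (re.compile(r"\b[STFG]\d{7}[A-Z]\b"), "[NRIC]"),  # Singapore NRIC/FIN
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),
]

def sanitize(text: str) -> str:
    """Strip identifiers the LLM doesn't need for categorization,
    replacing each with a placeholder token."""
    for pattern, token in PII_PATTERNS:
        text = pattern.sub(token, text)
    return text
```

The placeholders keep the line readable for the model — "TRANSFER TO [ACCOUNT]" still categorizes as a transfer — without leaking the identifier itself.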

Weekly Insights

Early every Saturday morning, MooBudget runs analysis for each active user — category breakdowns, spending patterns, anomaly detection, and savings opportunities. The results are compiled into a context object and sent to OpenAI, which generates widget configurations (charts, metrics, recommendations). Later in the day, the insights are delivered as formatted Telegram messages, with an accompanying link to fuller insights in the web app.

Ask Mode works similarly — when a user asks "show me my food spending for the last 3 months," the LLM interprets the query and decides which visualizations to render and for what time range. If the OpenAI API is down, we fall back to keyword matching — it won't understand nuanced queries, but it keeps the feature functional.
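The degraded-mode routing can be sketched like this — the keyword sets and widget names are illustrative assumptions, not our production configuration:

```python
# Fallback query routing for when the LLM API is unavailable.
CATEGORY_KEYWORDS = {
    "food": {"food", "lunch", "dinner", "hawker", "coffee"},
    "transport": {"grab", "taxi", "mrt", "bus", "transport"},
}

def fallback_route(query: str) -> dict:
    """Map a free-text query to a chart config by keyword overlap.
    Far cruder than the LLM path, but it keeps Ask Mode functional."""
    words = set(query.lower().split())
    matched = [cat for cat, kws in CATEGORY_KEYWORDS.items() if words & kws]
    months = 3 if {"3", "three"} & words else 1
    return {"widget": "category_chart",
            "categories": matched or ["all"],
            "months": months}
```

It won't handle "what's eating my budget?", but "food spending last 3 months" still renders the right chart.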

Voice Input

This is a feature I think people will genuinely enjoy using — it's completely fuss-free. You just hold the mic button in Telegram, say "twelve fifty for grab to work," and that's it.

Under the hood, voice messages get transcribed via OpenAI's speech-to-text API, then flow through the same tool-use pipeline as regular text messages. One thing we found: providing finance-specific context in the transcription prompt ("this is a financial expense recording") dramatically improves recognition of numbers and currency terms.
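The domain-priming trick is just the transcription API's `prompt` parameter. A sketch of how the request is assembled — the exact prompt wording and model name here are illustrative:

```python
def transcription_request(audio_path: str) -> dict:
    """Build the speech-to-text request. The `prompt` parameter biases
    recognition toward domain vocabulary — amounts, currencies, merchant
    names — which noticeably improved our transcripts."""
    return {
        "model": "whisper-1",
        "file": audio_path,  # in the real worker, an open file handle
        "prompt": ("This is a financial expense recording. Expect amounts, "
                   "currencies (SGD, USD), and merchant names."),
    }

# The dict unpacks straight into the client call:
#   client.audio.transcriptions.create(**transcription_request(path))
```

Without the prompt, "twelve fifty" often came back as words; with it, numeric amounts survive transcription far more reliably.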

Why All the LLM Work Is Async

LLM API calls are slow and rate-limited. Receipt extraction with the Vision API can take 5-10 seconds. Bank statement processing with Textract + LLM enhancement takes even longer. Users can't wait for this in a Telegram chat — they expect a near-instant acknowledgment.

So the user experience is: send a receipt, get an instant "Got it! Processing..." response, and a few seconds later the bot follows up with the extracted transaction. The heavy lifting happens in the background.

The worker system is designed around reliability — idempotency to avoid duplicate processing, dead letter queues for failed tasks, and correlation IDs so we can trace any request from input to LLM call to final result. When an extraction comes back wrong, we can pinpoint exactly what happened.
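The idempotency piece can be sketched as follows — an in-memory set stands in for the real persistent store, and the task fields are illustrative:

```python
import hashlib
import json

processed = set()  # stand-in for a persistent idempotency store

def task_key(task: dict) -> str:
    """Deterministic key so retries and duplicate queue deliveries
    collapse to one logical task."""
    raw = json.dumps({"user": task["user_id"], "file": task["file_id"]},
                     sort_keys=True)
    return hashlib.sha256(raw.encode()).hexdigest()

def handle(task: dict, extract):
    """Process one queue message at most once per logical task.
    `extract` is the expensive LLM call; the correlation_id threads
    through every log line so a bad result can be traced end to end."""
    key = task_key(task)
    if key in processed:
        return {"status": "duplicate", "correlation_id": task["correlation_id"]}
    result = extract(task)
    processed.add(key)  # mark done only after success; failures retry via DLQ
    return {"status": "done", "correlation_id": task["correlation_id"],
            "result": result}
```

Marking the key only after a successful extraction means a crashed worker leaves the message eligible for redelivery instead of silently dropping it.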

Rate Limiting and LLM Security

When you expose an LLM-powered service to the internet via a Telegram bot, you need to think carefully about abuse. Every message a user sends can trigger one or more OpenAI API calls — and those cost real money.

We implemented multiple tiers of rate limiting — different profiles for different operations, with the file upload and messaging limits being the most important for LLM cost control. Each receipt upload triggers a vision call, each voice message triggers transcription plus tool-use. Without limits, a single user or bot could run up a significant OpenAI bill in minutes.

On the security side, every piece of user data that touches an LLM goes through PII sanitization first — the model only sees what it needs. Tool definitions are tightly scoped with schema validation on every call, so even if the LLM hallucinates a tool or parameter, it gets rejected before any business logic runs. We're also looking at integrating OpenAI's Moderation API as an input guardrail.
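The tool-call guard amounts to a strict allow-list check before any handler runs. A sketch, with an illustrative schema table (the real one is generated from our tool definitions):

```python
import json

# Allowed tools and their parameter names (illustrative subset).
TOOL_SCHEMAS = {
    "create_transaction": {
        "required": {"amount"},
        "allowed": {"amount", "category", "merchant", "date"},
    },
}

def validate_tool_call(name: str, raw_args: str) -> dict:
    """Reject anything the model invented before business logic runs:
    unknown tools, unexpected parameters, missing required fields."""
    if name not in TOOL_SCHEMAS:
        raise ValueError(f"unknown tool: {name}")
    args = json.loads(raw_args)
    schema = TOOL_SCHEMAS[name]
    extra = set(args) - schema["allowed"]
    missing = schema["required"] - set(args)
    if extra or missing:
        raise ValueError(f"bad arguments: extra={extra}, missing={missing}")
    return args
```

A hallucinated `delete_all_data` call dies here with a logged error, never reaching the database.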

What I'd Do Differently

Invest in LLM observability from day one. We're planning to add Langfuse for prompt versioning, trace debugging, and accuracy tracking. In hindsight, this should have been in from the first LLM integration. Flying blind on prompt performance in production is uncomfortable. When receipt extraction accuracy drops, you want to know whether it's a prompt regression, a change in the types of receipts users are uploading, or a model behavior change from OpenAI.

Build an eval pipeline early. We have the FER metric for receipts and we log corrections, but we don't have a systematic way to run prompts against a test set and measure accuracy over time. RAGAS or Promptfoo would have given us this. Without it, prompt changes are deployed on vibes.

Conclusion

MooBudget launched to production in November 2025, with an alpha rollout to selected users in December. The app is live at app.moobudget.com and the Telegram bot at @Moobudget_bot.

Building MooBudget taught me that the gap between "LLM demo" and "LLM in production" is mostly about error handling, async processing, and security — the same boring engineering fundamentals that make any system reliable. The AI is the exciting part. The engineering around it is what makes it work.


MooBudget was built as an MTech SE32PT Capstone Project at NUS. The project is a genuine startup effort — we're continuing development beyond the capstone.

Tools used in development: Claude (Claude Code + claude.ai), Perplexity, GitHub Copilot. All architectural decisions and substantive engineering work are our own.