Introducing the Capsule

When we started TextQL, we had a single goal in mind: fully automating the responses and behavior of a human junior data analyst. We figured it'd be valuable because… well, companies hire data analysts all the time for ~$90k a year plus healthcare costs…

The biggest problem we ran into right away, as YC's 56 text-to-SQL startups have probably made everyone realize by now, is that returning SQL the way a human would is a monumental challenge. Returning correct SQL is even harder. "But wait," you wonder, "aren't those the same thing?" Not really.

A junior data analyst joins your company, and no matter how many hundreds of IQ points they have or how many PhDs from MIT they obtained, they won't be able to answer something as simple as "revenue over time" on day one. That requires an understanding of your metadata, which is scattered across your current data stack. Over time, as they learn more about your data (accumulating metadata), they can return more accurate analysis. Until then, they return "I don't actually know how to answer that, let me check" — which, for all intents and purposes, is an accurate response for them to return.

So to bridge that gap, we invented the Capsule.

💊 What is a Capsule?

Suppose someone says, “Hey <analyst_name>, when I say revenue I’m talking about these columns out of a Salesforce report.” How do we store that? To store it in a modern data catalog, we would need to create a metric called “revenue”, tie it to a specific line of SQL, and then tag it with “sales”. This decomposes the sentence into three pieces of information so a data catalog can normalize it. That organizational format is valuable in a UI, but it is far from the optimal way for an LLM to access information.

An LLM is best suited to parsing information in sentence format, seeing only the most relevant facts needed to make an informed decision. Composing the sentence “Revenue for sales means columns XYZ from Salesforce” out of the following data catalog API response is the wrong way to teach an LLM:

{ "metric": "revenue", "source": "Salesforce", "tags": ["sales"] }
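To make the contrast concrete, here's a minimal sketch of the "flattening" step described above: turning a structured catalog record into a single sentence an LLM can read directly. The function name and sentence template are illustrative assumptions, not TextQL's actual implementation.

```python
# Hypothetical sketch: flatten a data-catalog record into one
# natural-language sentence before handing it to an LLM.
def record_to_sentence(record: dict) -> str:
    tags = ", ".join(record.get("tags", []))
    return (
        f"The metric '{record['metric']}' for {tags} "
        f"comes from {record['source']}."
    )

catalog_response = {"metric": "revenue", "source": "Salesforce", "tags": ["sales"]}
print(record_to_sentence(catalog_response))
# → The metric 'revenue' for sales comes from Salesforce.
```

The point isn't the template itself; it's that the LLM receives one coherent fact instead of three normalized fields it has to reassemble.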

So how do we aggregate all of that metadata into a single unit of information, optimized for a large language model to follow instructions from?

Enter The Capsule

As we took several customers live and tuned our model deployments to understand their internal metadata, we realized that we needed a specific kind of unit of data. It needed to be intuitive enough that any user conversation could be stored as one, but also versatile enough that any existing piece of the customer’s documentation could be composed into one without data loss.

And so we invented the Capsule: a universal piece of metadata that any conversation can be compressed into, that any existing data tool can enrich, and from which a relevance model can easily identify the most relevant capsules for a given piece of analysis.
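To give a rough sense of what such a unit might hold, here's a purely illustrative sketch. Every field name here is an assumption on our part; the actual Capsule internals are not public.

```python
from dataclasses import dataclass, field

# Illustrative only: a capsule-like unit of metadata.
# Real Capsule internals are not public; all fields are assumptions.
@dataclass
class Capsule:
    summary: str                              # sentence-form fact the LLM reads
    source: str                               # provenance: chat, doc, dashboard, etc.
    tags: list[str] = field(default_factory=list)

    def to_prompt(self) -> str:
        """Render the capsule as plain text for an LLM context window."""
        return self.summary

cap = Capsule(
    summary="Revenue for sales means columns X, Y, Z from the Salesforce report.",
    source="user conversation",
    tags=["sales", "revenue"],
)
```

The key design property is the `summary` field: the unit is stored in sentence form, so no reassembly step is needed at retrieval time.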

<Redacted image of a Capsule's internal workings because we haven't quite decided how open we want to be about our secret sauce>.

Story Behind How We Created Capsules

The Capsule was a bottom-up concept that Mark invented while taking one of our customers live and realizing we were barely clearing 65% accuracy. Mark realized that their data team was explaining how to calculate metrics like CPA on Facebook by describing a set of columns, relationships, and how to analyze them. Of course, he could’ve built a new set of staging models and then defined the metric in dbt, but that would’ve taken him an hour for each piece of information. Instead, he took all the information the customer provided, put it in a package, and sent it directly to the model — discovering that it brought accuracy up to 95% instantly.

He then figured out how to create a new capsule (he was calling it an asset group at the time) from every dialogue session and save it back into our metadata store. That’s when we realized this architecture enables our model to “get smarter as you use it” — not just theoretically, but in practice.

We described this concept to some AI researcher friends, who said it sounded similar to the way data is stored for active retrieval augmented generation.

🛠️ Active Retrieval Augmented Generation

TL;DR: storing units of context (like Capsules) in sentence format and retrieving them during a conversation with an LLM will significantly improve your accuracy. See the paper here: https://arxiv.org/abs/2305.06983
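The retrieval step can be sketched in a few lines. Production systems typically rank by embedding similarity; to keep this dependency-free, a plain token-overlap (Jaccard) score stands in below — the function names and capsule texts are illustrative assumptions.

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercased word tokens, punctuation stripped."""
    return set(re.findall(r"\w+", text.lower()))

def score(query: str, doc: str) -> float:
    """Jaccard overlap; a stand-in for embedding similarity."""
    q, d = tokens(query), tokens(doc)
    return len(q & d) / len(q | d) if q | d else 0.0

def retrieve(query: str, capsules: list[str], k: int = 1) -> list[str]:
    """Return the k capsules most relevant to the query."""
    return sorted(capsules, key=lambda c: score(query, c), reverse=True)[:k]

capsules = [
    "Revenue for sales means columns X, Y, Z from the Salesforce report.",
    "CPA on Facebook is spend divided by attributed conversions.",
]
best = retrieve("how do we calculate revenue?", capsules)
# best[0] is the revenue capsule
```

Because capsules are already sentences, whatever `retrieve` returns can be dropped straight into the LLM's context with no reformatting.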

A byproduct of its simplicity is our ability to store all of a customer's existing metadata (Google Docs, BI dashboards) in our Capsule library during onboarding, enabling our analyst to hit the ground running.

Example of a Capsule Being Stored Live

<Redacted asset, because again we're indecisive about sharing our secret sauce with the world>.


Hey, still here? Interested in adopting this weird tech so your business people get off your back?

Damn, this CMS doesn't let me embed buttons, but here's my Calendly link if you want to talk about using TextQL for your product: https://calendly.com/ethanding.

