I’ve spent most of my career leading product teams, but I kept running into the same problem: my instincts were solid, but my vocabulary wasn’t. I could talk about value, outcomes, and user stories all day, but when it came to how things actually worked, I was mostly nodding along. To change that, a few months ago I made a conscious decision to reinvent myself and dive deeper into the technical side of things, namely all things Artificial Intelligence. I enrolled at MIT, started reading a lot, and downloaded all kinds of IDEs. I’ve also been taking courses online, anything to immerse myself in this incredible revolution we’re all living through.
Most recently I’ve built ingest models and am now deep in vector database issues. Over and over, I kept finding solutions in three concepts: chunking, sharding, and caching. They sound like words from construction (which they kind of are) or snowboarding, but they’re really about product performance, relevance, and scalability. If you’re building anything with embeddings, vector databases, or retrieval-augmented generation (RAG), these three concepts aren’t just technical optimizations; they’re the difference between a product that’s fast, helpful, and scalable… and one that’s not.
This article is for technical product managers like me who want to stay close to how things work, and who want to work with engineering to reduce costs, increase scalability, and ultimately deliver more customer value.
Designed by me using ChatGPT’s image generator
Chunking: Feeding the Model Bite by Bite
Chunking is the process of breaking text down into small, semantically meaningful pieces before sending it into an embedding model or vector database. It’s how we avoid feeding a 40-page PDF into the model and hoping it knows what to do with it.
I used to think smaller chunks were better. More precision, right? Turns out, that’s wrong. Too-small chunks lose context. Too-large chunks blur relevance. The sweet spot is usually 200–500 words, ideally aligned with paragraphs or sections. You want enough content to capture a single idea, but not so much that it gets noisy.
I also learned that chunking benefits from overlap. Giving each chunk a little tail from the previous one (say 10–20 words) helps maintain flow and improves semantic matching during retrieval. It gets even better if you include titles, section headers, or metadata in each chunk. (This works with GPT prompts as well.) The model sees them as hints, and that can make all the difference in what gets retrieved.
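Here’s a rough sketch of what that looks like in Python. The chunk size, overlap, and header prefix are just illustrative defaults I’ve picked for the example, not something a particular library prescribes:

```python
# Minimal sketch of word-based chunking with overlap and a section header
# carried along as a retrieval hint. Sizes are illustrative, not prescriptive.

def chunk_text(text: str, section_title: str,
               chunk_size: int = 300, overlap: int = 20) -> list[dict]:
    """Split text into ~chunk_size-word chunks, each starting with a small
    tail of the previous chunk and prefixed with its section title."""
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = min(start + chunk_size, len(words))
        body = " ".join(words[start:end])
        chunks.append({
            "text": f"{section_title}\n\n{body}",  # header gives the model extra context
            "metadata": {"section": section_title, "word_start": start},
        })
        if end == len(words):
            break
        start = end - overlap  # step back a little so consecutive chunks overlap
    return chunks
```

Each chunk ends up carrying both its content and a hint about where it came from, which is exactly what helps retrieval later.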
If you’ve ever had your AI model hallucinate answers out of thin air, bad chunking is probably the reason.
Sharding: Scaling the Search
Next up: sharding. Think of this as dividing your vector database into smaller, more manageable regions. Instead of searching across 10 million vectors every time someone asks a question, you narrow the field to the most relevant shard.
Most engineers do this instinctively—by customer, geography, document type, or even by time window. But here’s the part I didn’t fully appreciate until recently: sharding doesn’t just improve speed. It boosts relevance, too. You’re cutting noise before the model even gets involved.
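To make that concrete, here’s a hypothetical sketch in Python of sharding by customer. The in-memory brute-force search is just a stand-in for whatever vector database you actually use:

```python
# Hypothetical sketch of sharding by customer: each customer gets its own
# small index, and a query only searches that customer's shard.

import math
from collections import defaultdict

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

class ShardedVectorStore:
    def __init__(self):
        # shard key (customer_id) -> list of (doc_id, embedding)
        self.shards: dict[str, list[tuple[str, list[float]]]] = defaultdict(list)

    def add(self, customer_id: str, doc_id: str, embedding: list[float]) -> None:
        self.shards[customer_id].append((doc_id, embedding))

    def search(self, customer_id: str, query: list[float], k: int = 5) -> list[str]:
        # only this customer's vectors are scored: less noise, lower latency
        shard = self.shards[customer_id]
        ranked = sorted(shard, key=lambda item: cosine(query, item[1]), reverse=True)
        return [doc_id for doc_id, _ in ranked[:k]]
```

The shard key could just as easily be geography, document type, or a time window; the point is that the query never touches vectors it doesn’t need.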
There are tradeoffs, just as with chunking. Shard too aggressively and you lose context. Don’t shard enough and you end up with bloated indexes and spiky latency. But when done right, sharding is like handing your model a curated folder instead of the entire internet.
For product managers, this is critical. If your search is slow or your results feel scattered, ask your engineers how they’re sharding the data. Seriously, do it. Then ask if the way your customers navigate the product could inform the sharding strategy.
Caching: Don’t Search If You Don’t Have To
Finally, caching. Most people know it as the thing they clear out of their browser. It’s easy to think of caching as only a browser concern: images, CSS, maybe some API responses. But vector search benefits massively from smart caching, especially for repeated or similar queries.
You can cache the top-k results of popular questions, cache the embeddings themselves, or even cache reranked outputs. Some teams go a step further and do semantic caching: group similar queries into a single cluster and map them to the same result set.
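Here’s a rough sketch of what a semantic cache might look like in Python. The similarity threshold and the embed_fn you’d plug in are assumptions for the example, not settings from any particular tool:

```python
# Hypothetical sketch of a semantic cache: if a new query's embedding is close
# enough to one we've already answered, reuse the cached top-k results instead
# of hitting the vector database again.

import math

def _cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

class SemanticCache:
    def __init__(self, embed_fn, threshold: float = 0.95):
        self.embed_fn = embed_fn    # whatever text -> embedding function you already use
        self.threshold = threshold  # how similar counts as "the same question"
        self.entries: list[tuple[list[float], list[str]]] = []

    def get(self, query: str) -> list[str] | None:
        q = self.embed_fn(query)
        for cached_embedding, results in self.entries:
            if _cosine(q, cached_embedding) >= self.threshold:
                return results      # cache hit: skip the vector search entirely
        return None

    def put(self, query: str, results: list[str]) -> None:
        self.entries.append((self.embed_fn(query), results))
```

Check the cache first, fall back to the vector database on a miss, then store the fresh results: that’s the whole loop.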
This is one of those areas where infrastructure and UX collide. A slow search feels like a dumb assistant. A fast, accurate one feels like magic. Caching helps you get closer to the latter without crushing your cloud bill.
Why PMs Need to Know This
Here’s the part that matters: chunking, sharding, and caching aren’t just engineering concerns. They’re product concerns.
They affect hallucination rates, latency, cost, and ultimately—user trust. You can’t A/B test your way out of bad retrieval. You can’t delight users with stale cache hits. And you can’t roadmap your way to relevance if the system can’t find what matters.
Too often, product managers stay in the world of “what” and leave the “how” to someone else. That’s fine when you’re shipping UI improvements or onboarding flows. But if you’re building AI-native products, especially ones powered by vector databases and LLMs, then retrieval is the product.
And if retrieval is the product, then chunking, sharding, and caching are your levers.
How to Get Smarter
You don’t need to become a backend engineer, but you do need to ask better questions. Sit in on architecture reviews. Read the Weaviate, Pinecone, or Qdrant blogs. Try building a small vector search app on your own. Take a class. Ask your engineers how they’re handling retrieval today. Better yet—ask what’s hard about it.
These aren’t just technical details. They’re product-defining decisions.
Final Thought
Product managers talk a lot about customer empathy. It’s time we show a little empathy for our engineers and systems, too. Chunk smarter. Shard intentionally. Cache wisely. And most of all, learn enough to lead the products you want to build—not just the features you want to ship.