This week
- Kicking the tires on an initial, naive agentic search with some thoughts on how it could be improved further...
About a month ago
- Jeff Kaufman shared some data around contra dance attendance as a function of requirements on wearing surgical masks. He compares this data to survey data, which is a useful way to validate in both directions. I found the plot compelling for a different reason – depending on how...
- I recently read You do not need “analytics” for your blog because you are neither a military surveillance unit nor a commodity trading company by Leon Paternoster. It’s a well-argued piece, and I agree with the general thrust… but I also won’t be removing analytics from my site...
- Tagging posts on my blog took way too long, so I started wondering why it was so hard. Join me on a quick tour of (im)perfect datasets in my life.
2 months ago
- An analysis of DiskANN, a newer graph-based ANN index built for cheaper disk while still retaining high recall and throughput....
- Metrics can be incredibly powerful. But you have too many of them. Let’s talk about how and when to use metrics. The golden rule of metrics is this: any metric you maintain should directly drive action if outside expected bounds. The reason this is an important...
- A free introductory search course for anyone who wants better search without all the hard work...
3 months ago
- Say what you will about Jupyter Notebooks, but I think they are an incredible medium for learning and quick experimentation. I use Jupyter Notebooks all the time for my work and personal use. So, naturally, I was curious when I read that you could use Claude Code with Jupyter...
- After publishing my Analysis of Links From The White House’s “Wire” Website, Tina Nguyen, political correspondent at The Verge, reached out with some questions. Her questions made me realize that the numbers in my analysis weren’t quite correct (I wasn’t de-duplicating links...
- We begin with the ever intrusive normal distribution. Its Hill plot resembles the first half of a cycloid or something. Increasing the variance of the distribution does not change anything about the Hill plot. Changing its mean does not change the shape of the plot, but it...
- A little while back I heard about the White House launching their version of a Drudge Report style website called White House Wire. According to Axios, a White House official said the site’s purpose was to serve as “a place for supporters of the president’s agenda to get the...
4 months ago
- Why Applications & Pipelines Should Use DSPy. Below is a talk I delivered at the 2025 Data and AI Summit, focusing on how to use DSPy to define and optimize your LLM tasks. We use a toy geospatial conflation problem – the challenge of determining if two datapoints refer to the...
- Which looks like a better change, this NDCG bump of 0.005 on baseline? In [1]: ndcgs(baseline).mean(), ndcgs(graded_syns).mean() Out[1]: (np.float64(0.5411098691836396), np.float64(0.5461684655797919)) Or the NDCG bump of 0.01 on top of baseline? In [1]: ndcgs(baseline).mean(),...
- I set up Google Analytics on my site in 2010, and have used it since then to track page views. I only care about page views, which I find useful for figuring out which pages get the most traffic. It’s interesting data, and sometimes rather useful. But Google collects much more...
- Modern search engines push waaay too much complexity into the engine. Frustrating search practitioners. Let’s stop doing that. Let’s just get the top N from the search engine, and boost/rerank/etc in our API code. Using tools we know and love. Elasticsearch, Vespa, Weaviate, and...
- Osmosis-Structure-0.6B is a small model trained with reinforcement learning to do one thing well: extract structured data, typically JSON, from unformatted text. That’s it! Convincing LLMs to consistently produce JSON or specifically tagged answers has been a headache since...
5 months ago
- One huge gap I see in the RAG community is an overemphasis on human (or LLM) evals and a lack of engagement-based evals (i.e. clicks, conversions). Maybe RAG apps are too early in the build phase to have tons of live users? Or actually, as I suspect, it’s just a hard problem? Let’s...
- Using CUDA Deep Neural Network (cuDNN) in Python. Let's go through how to implement scaled dot product attention using the cuDNN Python API. This is the most computationally expensive part of inference in a transformer-style model, while also being partially parallelizable so...
- Link: Diving into the Data on Feature Availability and Adoption [BlinkOn 20] (YouTube). This is a great talk from Annie Sullivan at BlinkOn 20 about the availability and adoption of web features. Annie discusses the importance of understanding how features are used in the wild,...
- What happens when you embed geospatial capabilities in generalist data tools? More people engaging with geo data. I just returned from the inaugural Cloud-Native Geospatial conference. It was fantastic, and I highly recommend you jump in if Jed and team organize another. One of the...
6 months ago
- Lately you can’t shut me up about hybrid search. The core problem retrieval engines have in hybrid search boils down to getting a healthy set of candidates that represent the best vector candidates that also match lexically. Essentially hybrid search can become a big chicken +...