Distilling Research at Scale

technology · vision

Published: 2025-04-04

Abstract

New concepts must be explained to be understood, yet we rarely invest sufficient effort in clarifying them. While explanations demand time and care, this investment is essential, because without it, ideas remain opaque, inaccessible, and dormant. Today we see that the pace of explanation production has not kept up with the volume of knowledge production, resulting in a growing “research debt”: “the accumulation of missing interpretive labor”. To address this, we need deliberate, focused measures to resolve our intellectual deficits. This essay proposes the concept of “cognitive compilers,” an overarching effort to translate technical research complexity into its most accessible forms — for the sake of discovery — in a scalable way.

Today’s scientific literature is growing quickly in both volume and complexity. A modern paper can read like foreign text without a Rosetta Stone. As a result, our ability to comprehend hasn’t kept pace with our ability to produce.

When research is difficult to understand, science slows. But when ideas are communicated clearly, readers can critically engage instead of being stymied by obscurity. Unlike computer programs, academic papers don’t have programmed translators to make sense of them. I use the term “cognitive compilers” to refer to tools that decode technical research.

Here, I present one instance of a cognitive compiler: Distill, a machine learning journal composed of expository articles featuring accessible prose and interactive explanations.

I investigate Distill as a cognitive compiler, examining its approach to clarifying complex technical research. Distill made a valuable contribution to the machine learning research community, but ceased operations in 2021 after a five-year run, largely due to burnout. Since no successor to Distill has emerged, I outline potential improvements and strategies for long-term sustainability. Finally, I discuss the institutional changes necessary to incentivize better research communication.

1 Modernizing Research Communication

As technology has advanced, so too has research communication. Early stages featured solo-authored manuscripts and traditional print publications, which gave way to static PDFs in the era of digital publishing. More recently, we’ve seen the rise of interactive explanations embedded in web pages, enhancing the way complex ideas are conveyed. Looking ahead, we are entering an experimental phase driven by new interfaces and technologies aimed at bridging the comprehension gap.

1.1 Introducing Distill

Machine learning was a burgeoning technical field, introducing new systems, methods, and frameworks at a rapid pace and without authoritative sources. Noticing how research debt, the accumulation of missing interpretive labor, could plague it, Olah et al. launched Distill in 2016 to combat the problem.

As its name suggests, Distill aimed to clarify complex ideas. It focused on creating explanatory articles, informal scientific artifacts like research commentary, and a developer suite to help others produce Distill-style articles.

The goal of the Distill project was not to introduce new findings but to clarify existing ones. However, many technical papers focus on incremental improvements over prior methods—without providing the foundational context needed for understanding. They often justify their claims with benchmarks and metrics that lack explanation, even though this missing context is crucial for making sense of their work.

2 Tools for research clarity

How we incorporate context depends on our goals: for learning, it should suit the reader’s skill level; for verifiability, it should be correct and formal; for transparency, it should be explorable; for collaboration, modular and reusable; for accessibility, inclusive and jargon-free; and for engagement, interactive and compelling. Exploratory visualizations, annotated examples, plain-language summaries, and other tools for research clarity can support these goals, making research communication clearer, more inclusive, and more efficient.

2.1 Explorable explanations to clarify complexity

Machine learning is a relatively new field filled with technical and complex ideas. Given its diverse research community, the field’s progress has relied on quickly bringing researchers up to speed with foundational concepts. Consequently, Distill set out to articulate core ML concepts with clarity, making them accessible to those lacking domain-specific expertise.

First, I want to present the value of an explorable explanation for t-SNE (t-Distributed Stochastic Neighbor Embedding), a dimensionality reduction method developed in 2008. In 2016, Distill published an article on t-SNE and how it can be used to reduce and visualize complex high-dimensional data.

A researcher interested in trying a new method to interpret their high-dimensional dataset may come across an orthodox applied paper using t-SNE. In it, they might find that, given some configuration of hyperparameters like perplexity and number of optimization steps, a set of extracted features is meaningful. The paper likely won’t explain the context of the high-dimensional data or the rationale behind the chosen method, so the researcher will ultimately struggle to answer their central question: how does t-SNE compare to other methods?
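
To make those hyperparameters concrete, here is a minimal sketch of the kind of usage such a paper might report, assuming scikit-learn and a synthetic stand-in for the dataset (the specific values are illustrative, not recommendations):

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic stand-in for a high-dimensional dataset: 500 points, 50 features.
rng = np.random.default_rng(seed=0)
X = rng.normal(size=(500, 50))

# Perplexity loosely sets the effective neighborhood size; the number of
# optimization steps controls how long the layout is refined. Applied papers
# often report such settings without explaining why they were chosen.
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
embedding = tsne.fit_transform(X)  # shape (500, 2): one 2-D point per input row
```

Nothing in a snippet like this conveys why a perplexity of 30 is sensible, or what would change if it weren’t, which is precisely the gap an explanation must fill.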

A Distill article, by contrast, explains the method’s derivation, visual behavior, and conceptual foundations, making it not just usable but understandable. It goes beyond listing results: it explains what the metrics mean and how the method compares to the status quo.

An explorable explanation from the Distill article, How to Use t-SNE Effectively, illustrating how interactive visualizations can clarify complex methods like dimensionality reduction. Source: Distill.pub

However, these explanations require resources individual researchers often lack, such as time and engineering support. Rather than expecting every author to produce Distill-quality explanations from scratch, we could develop modular distillations whose explanatory components are shared and reused across publications, making new distillations progressively easier to build.

2.2 Towards embeddable, explorable explanations

For cases outside traditional journal submissions—such as research blogs—researchers might consider providing explorable explanations, though creating them from scratch demands significant time and effort. But what if it were as simple as sourcing an existing explanation and embedding it into a post? I envision a future where visual artifacts or dynamic widgets from explorable explanatory papers can be cited and embedded directly into applied research, making complex methods more accessible and reusable.

Like programming macros, distillation abstracts complexity, letting researchers focus on shaping understanding rather than execution. A key barrier to using macros in research writing is that most work is published in static formats like PDFs. Luckily, there have been efforts to rectify this: Semantic Scholar’s SciA11y project and arXiv have both started to offer papers in accessible HTML formats. With tools like iframes for embedding dynamic models and arXiv’s HTML support, researchers could transform static papers into interactive web experiences. While LaTeX submission requirements may pose some limitations, LaTeX is more machine-readable than PDF and easier to convert to HTML, suggesting we already have the foundation for embeddable research.
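
As a small proof of how low the technical barrier already is, an existing explorable can be dropped into a computational notebook with an iframe. A minimal sketch, assuming an IPython/Jupyter environment and that the host site permits embedding:

```python
from IPython.display import IFrame

# Embed an existing explorable explanation directly in a notebook cell.
# The same <iframe> mechanism works in any HTML page, including
# HTML-rendered papers.
IFrame(src="https://distill.pub/2016/misread-tsne/", width=900, height=600)
```

The same handful of lines could, in principle, cite and embed a single widget rather than a whole article, once explorables are published in embeddable pieces.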

Reusable, embeddable tools need a starting set of explanations that don’t exist yet and take substantial work to create. The burnout behind Distill’s hiatus shows how demanding that work is. Though Distill’s small, skilled team produced great work, it never made its methods easy enough for others to replicate, which kept the model from being widely adopted.

2.3 Code-in-context

Distill is open source, meaning the code for its articles is publicly available. However, this alone doesn’t guarantee the code is useful to readers. Providing code-in-context, embedding source code within explanations, makes clear how the code connects to the concepts being discussed. The idea is inspired by Knuth’s literate programming paradigm, which weaves code into a narrative explanation while still letting it be tangled into machine-executable form. Like Knuth’s vision, code-in-context transforms code from a static artifact into a dynamic learning resource that humans can understand and machines can compile.
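
As a toy example of code-in-context, the definition of t-SNE’s perplexity hyperparameter from earlier could be woven directly into an explanation. A sketch assuming NumPy, with the narrative carried by the docstring and comments:

```python
import numpy as np

def perplexity(p):
    """Perplexity of a discrete probability distribution.

    In t-SNE, each point's Gaussian bandwidth is tuned so that the
    perplexity of its neighbor distribution matches a user-chosen
    value -- intuitively, an effective number of neighbors.
    """
    p = p[p > 0]                       # drop zeros: 0 * log(0) is treated as 0
    entropy = -np.sum(p * np.log2(p))  # Shannon entropy in bits
    return 2.0 ** entropy              # Perp(P) = 2^H(P)

# Sanity check: a uniform distribution over k outcomes has perplexity k.
print(perplexity(np.ones(10) / 10))  # ≈ 10.0
```

Read top to bottom, the prose and the executable definition stay in lockstep: the property Knuth’s “weave” and “tangle” were designed to preserve.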

Tools like Jupyter notebooks, Wolfram computational notebooks, and Google Colab serve as workspaces for computational experiments. These platforms let researchers share and replicate experiments easily, since each document includes source code, rendered visualizations, and rich-text explanations. Notebooks can also be customized: readers can edit the source code, reconfigure cells, and rebuild them locally.

In a notebook, the code and its outputs work together, forming a cohesive and interactive research artifact. If we value reproducibility and transparency in distillations, then web notebooks may be a form worth investigating. They can serve as the foundation for scalable distillations that others can build upon, repurpose, or extend.

3 Sustainable distillations

Distill’s indefinite hiatus highlights the difficulty of sustaining the production of high-quality distillations. Each piece required a specialized team, including a visualization expert, web developer, and subject specialist. While this model produced exceptional work, it was hard to scale.

To make distillation more sustainable, future efforts should prioritize:

  • Modularity. Developing reusable templates and visualization libraries so researchers don’t need to build distillations from scratch.

  • Community Contributions. Encouraging collaborative contributions, similar to Wikipedia or open-source software projects.

  • Institutional Support. Incentivizing distillation through grants, research credits, or alternative publication formats recognized by academia.

3.1 Comments and collaborative editing

An improved Distill would let authors, collaborators, and readers systematically add contextual notes, much as they do on a Google Docs document or a GitHub pull request.

For instance, Nota is a framework for communicating research on programming languages, built in the browser for the browser, using interactive elements to make complex notations, systems, and proofs more accessible. While the browser enables live collaboration, the versioning story should be more sophisticated, akin to Ink & Switch’s Upwelling, which combines real-time collaboration with editorial review and structured version control.

Context should be collaborative. Just as code comments help programmers work together, research needs shared annotations. Current tools like GitHub and Google Docs offer basic features, but technical content requires more. A better system would enable:

  1. Direct Clarification. Comments that provide additional explanations to improve clarity.

  2. Implicit Guidance. Comments that highlight areas of confusion, whether caused by unclear explanations or gaps in the reader’s knowledge.

Still, the ability to contribute doesn’t guarantee participation. alphaXiv, a more communal version of arXiv, allows readers to leave comments, yet most papers remain commentless. OpenReview facilitates open peer review and showcases public academic discourse in action, but anonymity and professional obligations likely shape participation: researchers may hesitate to critique peers’ work for fear of risking their professional reputations.

Presently, on Distill, readers can provide suggestions through GitHub issues, but the process is clunky and hinders precise feedback. A better interface won’t necessarily increase input—eliminating friction is just one part of fostering engagement. A more effective approach would integrate collaborative editing, allowing readers to comment directly on the content and incorporate feedback into the final published version.

3.2 Wikis as a model for continuous knowledge-sharing

A more extreme model of context-sharing and information exchange is the wiki, most notably Wikipedia. A domain-specific example is nLab, a collaborative library of explanations in mathematics, physics, and philosophy. Wikis rely on community contributors to draft, edit, and moderate an ever-evolving knowledge base.

Structurally, a wiki is simple: a content management system stores information in a database, updating both the database and the web interface as changes occur. The editor’s and reader’s views are nearly identical, reinforcing transparency. Hyperlinks, synonymous with the wiki, let readers explore concepts in depth as they encounter them, entirely by choice.

Fundamentally, wikis treat publishing as a continuous process, as reflected in the backronym “What I Know Is.” Changes appear immediately and evolve constantly, though individual contributions may be less clearly attributed than in traditional publishing. This collaborative editorship accelerates idea-sharing while keeping knowledge perpetually revisable, letting the presentation evolve alongside the research it conveys.

4 Institutional incentives

A research enterprise centered on clarity must address not only software challenges but also institutional ones. Even if we find the ideal model, it won’t exist in isolation. Key questions arise: Should we push for distillation to be respected in traditional research institutions if it currently isn’t? How will researchers adapt to the new artifacts we create?

While Distill aimed for accessibility, it primarily served technical researchers, especially in interpretability. Future efforts should expand to non-technical, non-English, and interdisciplinary audiences.

Scaling knowledge distillation requires a critical mass of contributors. However, academics are often too busy to engage in activities that don’t advance their careers. Without addressing academic incentives, platforms like Distill will primarily attract researchers outside academia, that is, outside the institutions where a large fraction of research is produced.

Understanding the impact of explanations can help institutions and funding bodies justify investment in distillation. Individual projects collecting fine-grained user data—such as time spent engaging with explanations or improvements in comprehension—can contribute valuable statistics that support funding efforts. By providing structured evaluation methods, institutions can better allocate resources to the most effective research communication tools.

4.1 Reimagining Research Recognition

Current systems for measuring research impact are incompatible with non-traditional works, including distillations. Metrics such as the h-index were designed for traditional scholarly outputs and don’t accommodate emerging formats like interactive articles, visual essays, or computational notebooks.

To change this, we need better heuristics for impact across different artifact types. Citation counts could be normalized by medium, and artifacts should be tagged to allow meaningful comparisons. For example, comparing a distillation with a methods paper should account for differences in audience, purpose, and structure.
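
As a purely illustrative sketch of what medium normalization might look like (the media tags, records, and z-score scheme here are my own assumptions, not an established metric):

```python
from statistics import mean, stdev

# Hypothetical records: (artifact, medium, citation count).
records = [
    ("paper-A", "paper", 120), ("paper-B", "paper", 40), ("paper-C", "paper", 8),
    ("distill-A", "interactive", 15), ("distill-B", "interactive", 3),
    ("notebook-A", "notebook", 6), ("notebook-B", "notebook", 2),
]

def medium_normalized(records):
    """Z-score each artifact's citation count within its own medium's cohort."""
    by_medium = {}
    for _, medium, count in records:
        by_medium.setdefault(medium, []).append(count)
    scores = {}
    for name, medium, count in records:
        cohort = by_medium[medium]
        spread = stdev(cohort) if len(cohort) > 1 else 1.0
        scores[name] = (count - mean(cohort)) / (spread or 1.0)
    return scores

for name, score in medium_normalized(records).items():
    print(f"{name}: {score:+.2f}")
```

Under a scheme like this, a modestly cited distillation can register as exceptional within its medium rather than vanishing next to conventional papers.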

Alternative metrics—or altmetrics—offer a more holistic view. These include social media mentions, GitHub forks, interactive reuse, and downstream derivatives. While altmetrics are less standardized and harder to track, they better capture the visibility and influence of modern, non-traditional work.

Improving recognition also means improving infrastructure. Research search engines like Google Scholar tend to favor PDFs appearing in scholarly journals, which makes web-native research harder to discover. Expanding indexing systems to include interactive artifacts, each with BibTeX entries and metadata, would make these works more citable and comparable.

Assigning DOIs (Digital Object Identifiers), the persistent identifiers that track works online, to non-academic work is difficult, but the original Distill publication did it, and future projects should follow. Books have ISBNs and journals have ISSNs, but web-native research often goes unrecognized. Registration with Crossref, the organization that issues DOIs to publishers, requires metadata upkeep, stable hosting, and fees, all of which are barriers for individuals and small projects. A shared fiscal-publisher model could help by sponsoring DOI registration for individuals and small distillation journals so their works are searchable, citable, and identifiable.

As research formats diversify, so too must our systems of academic credit. Without such evolution, we risk dismissing the contributions of researchers who focus on distillation.

5 Conclusion

Distillations clarify research, yet scaling them remains a challenge. Distill was one example of a “cognitive compiler”: a platform that produced interactive explanations of machine learning research. With Distill on hiatus and research output accelerating, research debt risks becoming unmanageable. Only when distillation is embedded in researchers’ workflows will synthesis keep pace with the rate of publishing.

Sustaining projects like Distill requires structural change. Researchers operate under constraints that limit their ability to prioritize clarity, and academic incentives rarely reward efforts in distillation.
