Towards codified context, durable documentation, and process preservation
Here, my goal is to introduce my research interests and offer some exposition on how they came to be. Libraries are troves of information, but they also include the systems that help people make use of the information inside them. We should rethink the library: first ensuring it includes the valuable information worth preserving, then considering the new structures it may embody. Context is the basis of understanding. An important type of context is knowledge about process—how things are created, developed, or discovered. Capturing this kind of context is challenging: it takes a considerable amount of effort, can take many forms, and is often difficult to represent accurately. As a result, it remains poorly understood and organized. Documentation can be seen as “codified context”, capturing not only explicit instructions but also the tacit knowledge, background information, and contextual nuances that bolster understanding. Yet documentation often overlooks its role in codifying procedural knowledge—the “how-to”, the processes behind what we build, research, and decide to develop. And context extends further still, from tacit knowledge to the contributors behind a work. How do we capture and integrate this context into what we catalog and preserve? What is the ideal library, and how do we build it?
I’d like to thank David Spivak for his contributions to this post and for his guidance and supervision throughout this project.1
1 Library reform
Libraries collect and organize human knowledge, equipping people with the ability to access, learn, and build upon others’ ideas. In turn, today’s society is equipped to handle problems that are more complex than at any point in history. Still, there is more to document and the scope of our libraries could be meaningfully larger. By expanding what we preserve, we could solve even more complex problems than we do today.
A weakness of today’s knowledge systems is their inability to capture and integrate procedural knowledge. Much of it remains undocumented, locked in our minds, or scattered across sources—and is thereby wasted. Consider debugging, the act of fixing an existing engineered system: a necessary but tedious step of R&D. Debugging is not only about fixing bugs to make a system functional; it is about inching towards a vision that likely exists only fuzzily in our minds, a vision made increasingly clear and concrete through tinkering.
Additionally, capturing procedural knowledge reveals trails of thought (or, better yet, mechanistic explanations behind a conclusion), giving us the capacity to check conclusions and “choose our own adventure” in arriving at conclusions of our own. Citation is a virtue in traditional academic research, but the means of tracing the provenance of ideas could be far more ubiquitous.
As an example of a unique modern form of “library,” take GitHub. It demonstrates how libraries can evolve to preserve not just code, but the entire development process. Developed relatively recently, it serves simultaneously as a publishing platform, a collaborative development environment, and a history tracker for developers.
Researchers used to present their findings solely through papers. Now, they may link to drafts, data, and code, offering a more complete picture of what it took to arrive at a final set of results. GitHub simultaneously supports both the act of conducting research and our ability to share and communicate it with others. As a reformed library itself, it has expanded our ability to document and share current research processes. We can take GitHub as a model for a new kind of library—one we can apply in many more contexts to preserve and understand the (uniquely human) path to discovery.
2 Conserving thought
The challenge of preserving thought extends beyond storing information—it requires capturing the context, reasoning, and evolution of ideas. Moreover, the purpose of documentation isn’t just to share knowledge with future generations or to fortify institutions; it also helps us preserve and build upon our past work, ensuring that human ingenuity can progress and evolve rather than be lost to time. As I’ve edited, copied, pasted, and deleted text from previous versions of this document, even the version history on my text editor (Overleaf) can’t fully capture the context, goals, and nuances that were present when I first engaged with the project. Arguably, neither can the most sophisticated version control systems. In light of this: How can we effectively collaborate on long-term projects if we struggle to understand the intentions driving the artifacts of our own past?
Our memories are the imperfect recall of imperfect snapshots of the past, the “ground truth.” Preserving the integrity and authority of our sources is worth striving for. Doing so will require rethinking the tools that support researchers and developers who value provenance and documentation, and then ensuring that what we preserve is usable and accessible.
To conclude, I will share some questions that emerged from grappling with the points discussed above. I am making them public not only for others with similar curiosities, but also for my own record as I explore them. Together, these questions form an early-stage research agenda.
3 Directions for future work
In his book Libraries of the Future (1965), Licklider presents a vision for a new age of libraries. How do the above ideas align with his vision, and in what ways do they differ?
How can we design information storage systems that detect and flag errors without altering the original content, ensuring both historical integrity and clarity for future understanding? How do we decide what and when to correct, and how must these corrections be made? How do we keep track of the edits that have been made in the order they’ve been made? And what storage mechanisms allow us to update works while maintaining cost-effectiveness (e.g., by reducing redundancy whilst keeping accuracy)?
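One storage mechanism relevant to the last question is content addressing, the scheme underlying systems like Git: each version of a work is stored under the hash of its content, so identical content is never stored twice, while an ordered history records every edit as it happens. The sketch below is only an illustration of that idea, not a proposed design; the `EditLog` class and its methods are invented for this example.

```python
import hashlib
from datetime import datetime, timezone

class EditLog:
    """An append-only, content-addressed store: each version is kept
    under the hash of its content, so identical text is stored only
    once, while the history records every edit in order."""

    def __init__(self):
        self.objects = {}   # content hash -> content (deduplicated storage)
        self.history = []   # ordered (timestamp, hash) entries, oldest first

    def record(self, content: str) -> str:
        digest = hashlib.sha256(content.encode()).hexdigest()
        self.objects.setdefault(digest, content)  # store once, never overwrite
        self.history.append((datetime.now(timezone.utc), digest))
        return digest

    def version(self, n: int) -> str:
        """Reconstruct the document as it stood after the n-th edit."""
        _, digest = self.history[n]
        return self.objects[digest]

log = EditLog()
log.record("Draft one.")
log.record("Draft one, revised.")
log.record("Draft one.")        # reverting adds a history entry, not a new object

assert len(log.history) == 3    # every edit is remembered, in order
assert len(log.objects) == 2    # but identical content is stored only once
```

Because original content is never overwritten, corrections can be layered on top as new entries while the flawed source remains intact and addressable, which is one way to reconcile historical integrity with ongoing correction.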
At what point do edits transform a source into something fundamentally different, and how do we measure and express this difference?
As inspiration for reformed knowledge representations, what analogies can we draw to gain insights from other fields? Biological systems offer powerful models for how knowledge can be preserved and transmitted across generations. As an example, DNA could be considered a generational log of what worked for survival, including the processes and mechanisms for replication for a given cell.
- Given these analogies (and additional ones we could generate), what can we learn from existing attempts to disclose procedural knowledge in various complex systems?
- What fundamental properties allow certain systems—biological, cultural, and social—to preserve and transmit procedural knowledge over long periods, even without explicit design?
- How do different encoding methods (such as physical, procedural, and symbolic) and various forms of media (e.g. written and auditory) work together to enhance our understanding and retention of information?
- How do these systems balance redundancy with efficiency?
How should we catalog know-how? What would a schema or notational system for procedural explanations look like? How could we detail them in a way that includes the necessary context?
Hashing and checksums are algorithms used to verify data integrity—to confirm that a piece of data has been accurately copied or downloaded. Similarly, when humans reproduce knowledge—whether through paraphrasing or replicating a textbook solution—we perform a kind of mental “checksum” to verify that an idea or concept has been accurately understood and transferred. Given this, can verification-via-reproducibility help us maintain the integrity of information that has been passed down over time?
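To make the computational side of this analogy concrete, here is a minimal sketch in Python using the standard library’s `hashlib`; the helper names `checksum` and `verify_copy` are invented for illustration. A checksum certifies only bit-for-bit fidelity—unlike a human paraphrase, even a meaning-preserving reordering of words produces a different digest, which is precisely what makes the human “checksum” a harder problem.

```python
import hashlib

def checksum(data: bytes) -> str:
    """Return the SHA-256 hex digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()

def verify_copy(original: bytes, copy: bytes) -> bool:
    """A copy is faithful iff its checksum matches the original's."""
    return checksum(original) == checksum(copy)

original = b"Libraries collect and organize human knowledge."
faithful = bytes(original)   # an exact, bit-for-bit copy
paraphrase = b"Libraries organize and collect human knowledge."

print(verify_copy(original, faithful))    # True: identical bytes
print(verify_copy(original, paraphrase))  # False: same meaning, different bytes
```

The gap between the two results is the open question: checksums verify form, whereas verifying that *meaning* survived reproduction would require something closer to the mental process described above.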
How might temporal metadata help us track the continuous evolution of intellectual artifacts and thought processes over time? How do we design identifier systems for the artifacts we develop? How might we move beyond commit-based, discrete-time version control systems?
How might the concepts of authorship and contribution evolve when cataloging research artifacts that take unique forms or are in progress? How do we take account of contributions that don’t rise to the level of citation, such as hearing an interesting thought from a friend? Should we broaden our definition of authorship, or should we create more space to cite unplanned, non-traditional, and perhaps informal academic interactions?
How can we support the R&D needed to develop these technologies? Would academia be a feasible place for this work? How have relevant research projects been supported in the past?
How can we (better) capture the “negative space” of knowledge work: the dead ends, failed attempts, and pivot points that led to final works?
What does it look like to have dynamic interfaces that adapt to individuals? What user information is required to achieve this, and how can it be acquired in a privacy-preserving way?
Footnotes
David Spivak played a key role in refining the questions, thinking through the overall structure, and providing thoughtful edits on both wording and content. He originally suggested this undertaking and helped shape and sharpen the final version.↩︎