Higher Ed Musings 9.
AI and the Blank Spaces
How is AI going to change higher ed? I am going to return to this question over and over. Today I’m thinking primarily about what it would (will) mean when the top LLMs have access to a larger portion of valuable human knowledge. Pre-transformer, the obstacle to better training was the neural network architecture. Now the quality of the datasets is the bottleneck. Today’s LLMs have been largely trained on the internet and it shows. Internet data was “good enough” but not great. Better data that captures human cognition (all the books and papers ever published!) will improve the LLMs and (some say) get us closer to AGI.
The corpus of human knowledge, or the global knowledge base of humanity, is vast and dispersed. While large portions of the internet are accessible, many of the most valuable portions of that knowledge are not: proprietary digital libraries, gated scientific databases (journals, papers), undigitized collections in libraries, archives, and museums, as well as oral histories and knowledge from rural and indigenous communities.
So what does this mean for higher ed?
Elite private universities, which hold a small but crucial percent of proprietary knowledge — perhaps only 2-3% of total human knowledge — will benefit. This knowledge set may be a small percent but is incredibly valuable: cutting-edge research across various fields pre-publication; deliberately unpublished confidential ongoing studies and recent findings in anticipation of patenting; knowledge from specialized facilities, unique laboratories, and research centers that generate proprietary data and methodologies; knowledge from industry partnerships; expert faculty knowledge; archives and special collections containing unique historical documents and artifacts.
The real prize may be the corpus of privately held business knowledge. Perhaps 20% of the world’s total knowledge is held by private companies, generated by R&D in the pharmaceutical, engineering, manufacturing, and consumer products industries, just to name a handful, where collectively hundreds of billions of dollars are spent annually with no incentive to publish. There is also applied knowledge and data mined from individuals: financial sector data (transaction histories, credit scores, investment trends), healthcare sector data (individual medical histories, genetic information, treatment outcomes), and insurance sector data. These proprietary datasets contain granular data about individuals and populations. They are detailed and longitudinal, and frequently used to generate predictive models and insights. And they are by design inaccessible to the world.
The universities that will succeed in this information ecosystem will be those that focus on the very best data (beyond the publications presently available online) and adjust their teaching and research model accordingly.
First, universities will need to realize that they are the holders and producers of an increasingly valuable asset and will need to create walls around their proprietary information, to maintain a vault of verified facts and published knowledge. Universities are already behind industry and independent research groups in using AI technology; there are already increased incentives for top academic scientists to leave universities for AI-driven research laboratories.
Second, universities will adjust teaching and research to account for the fact that so much of the best of what is known is available at everyone’s fingertips in new ways. Canons will matter in new ways.
Older faculty trained outside the AI ecosystem may be the most valuable asset in higher ed. They will be able to ask AI to scan and absorb all papers, publications, and works in progress across a given field and offer up a clear and detailed picture of the range of research produced by that field's faculty and graduate students.
Imagine a map of your field, with you, like Marlow, seeing the blank spaces and understanding what is left to explore.
Universities will adjust to teach the mountains and the valleys as mountains and valleys. The questions will be: what do we know and how do we know it? What don’t we know and why? What is left to explore? Why is there a mountain of knowledge there and nothing here? If we can visualize, through the scholarship, why Shakespeare matters, why Einstein matters, can we also see who has mattered that we don’t recognize?
The most foresighted universities will partner with business to access and share their proprietary datasets. And then… stay tuned.



There's an interesting muddle here between data, information, and knowledge. One way out: knowledge is a human activity, the dynamic moment of thought hitting its object and bouncing back. This is the domain that remains after all or most *information* is reified by AI, but it was really the proper *academic* domain all along.
In fact, our realization of this, right at the moment when "knowledge" is being reified and literally dehumanized, could be a case of Hegel's owl of Minerva spreading its wings at dusk: https://www.oxfordreference.com/display/10.1093/oi/authority.20110803100258860