O’Reilly Media, Miso AI and the Battle for Knowledge in the Age of Scraping
O’Reilly Media and Miso AI are reshaping knowledge discovery with AI-powered search, citations, and safeguards against content scraping.
From books to vectors
O’Reilly Media is not a new arrival to digital disruption. Founded in 1978 as a publisher of technical handbooks, it has grown into a digital platform with more than 200 partner publishers and some 2.5m subscribers. It was among the first to spot the commercial promise of the web. In 1993 it launched the Global Network Navigator, an early web portal, and has been building on that foundation ever since.
“O’Reilly today is more a ‘learning ecosystem’ than a publisher. That ecosystem houses books, videos, training events and, increasingly, AI-powered search and discovery. The firm is best known among technologists for its distinctive animal-branded books, but the business model is now dominated by its subscription platform, used by enterprises and individuals worldwide.”
Julie Baron, Chief Product Officer, O’Reilly
The challenge, Baron argues, is no longer publishing more content but helping people navigate it. “When you’ve got millions of pages, how do you surface the one paragraph that solves the problem in front of you?”
That question brought O’Reilly into partnership with Miso AI.
Julie Baron, Chief Product Officer, O’Reilly, and Lucky Gunaesekara, co-founder and Chief Executive of Miso, present at the Press Gazette Future of Media Technology Conference in London on 11 September 2025
Miso’s founder: from O’Reilly conference to partner
Lucky Gunaesekara, co-founder and Chief Executive of Miso, tells a story that underlines O’Reilly’s long influence in the field. As a medical student in his twenties, he wandered into an O’Reilly AI conference in the United States. Inspired by what he saw, he left medicine to work in machine learning.
Years later, he is running Miso, a company providing AI-driven search and personalisation for publishers. Miso’s platform now supports around 20 publishers directly, generating more than three million AI-powered answers a month. It monitors 12,000 websites against some 2,000 known bots and scrapers. Its central aim is to create value for publishers, not to extract it.
The O’Reilly-Miso tie-up rests on aligned values: a belief that creators deserve fair compensation and that AI tools should cite their sources, not absorb them invisibly.
The problem with discovery
For publishers, the old model of discovery was simple: Google. Traffic flowed from search into articles or resources. SEO became the craft that kept publishing afloat. That traffic is now declining. As generative AI tools provide answers directly to users, the open web is losing some of its role as an intermediary.
O’Reilly had already seen the problem internally. With its mix of books, videos and training material, it needed a search model that could handle not just keywords but intent. A user might not want a 300-page book on Python, but a single diagram showing gradient descent. Traditional indexing was never built for that task.
The solution, developed with Miso, was AI Answers, a product that breaks content down into snippets enriched with context. A passage is not just text: it is tagged by subject, placement in the book, related images, and cross-references. Vectorising these enriched snippets makes it possible for an AI model to retrieve the right passage and give a precise answer with attribution.
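To make the idea of a context-enriched snippet concrete, here is a minimal sketch in Python. It is not Miso's or O’Reilly's implementation: the `Snippet` fields, the sample passages and book titles, and the bag-of-words `embed` function (a toy stand-in for a real embedding model) are all assumptions made for illustration. What it shows is the shape of the technique: each passage carries its own metadata, so whatever is retrieved can be returned together with a citation.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Snippet:
    """A passage plus the context it is tagged with (field names are hypothetical)."""
    text: str
    book: str
    chapter: str
    subject: str
    related_images: list[str] = field(default_factory=list)


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def answer(query: str, snippets: list[Snippet]) -> tuple[Snippet, str]:
    """Return the best-matching snippet and a citation built from its tags."""
    q = embed(query)
    # The subject tag is embedded alongside the text so it influences ranking.
    best = max(snippets, key=lambda s: cosine(q, embed(f"{s.subject} {s.text}")))
    return best, f"{best.book}, {best.chapter}"


# Invented sample data, not real O'Reilly content.
SNIPPETS = [
    Snippet(
        "Gradient descent updates weights by stepping against the gradient of the loss.",
        book="Example ML Book", chapter="Chapter 4", subject="optimization",
        related_images=["fig-4-2-gradient-descent.png"],
    ),
    Snippet(
        "A Kubernetes pod groups one or more containers that share networking.",
        book="Example Ops Book", chapter="Chapter 2", subject="kubernetes",
    ),
]

best, citation = answer("how does gradient descent update weights", SNIPPETS)
print(best.text)
print("Source:", citation)
```

Because the citation is assembled from the snippet's own tags rather than generated by a model, the answer can always be traced back to a specific book and chapter, and any related diagram travels with it.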
Unlike ChatGPT or other public models, O’Reilly’s Answers do not hallucinate. They cite directly from licensed material. “That’s the steak, not the hamburger,” says Gunaesekara, invoking Tim O’Reilly’s warning about AI scraping: once content is minced into statistical ‘pink slime’ by indiscriminate models, it cannot be reconstituted.
Snippets and the economics of value
The shift to snippets raised an awkward question: if an AI tool surfaces only a fragment of a book, is the publisher losing revenue from the full read?
Baron argues the opposite. “Snippets can be ten times more valuable than browsing,” she says. In technical contexts, such as a production system failure or a coding error, the ability to extract a precise answer quickly has more value than paging through hundreds of irrelevant results.
O’Reilly has persuaded its partner publishers to opt in. The promise is higher engagement, better discovery across the archive, and compensation tied to actual usage of content fragments. Early results suggest that readers exposed to snippets explore more titles overall, raising the tide for the platform’s entire catalogue.
Guarding against scraping
Behind the product story is a defensive one. Both O’Reilly and Miso are alarmed by the volume of AI scraping. Models are being trained on copyrighted content without consent, and publishers’ robots.txt files are often outdated or ignored.
Miso’s Sentinel tool monitors traffic patterns and identifies likely scraping. Gunaesekara points out that nearly half the sites they track lack even basic bot declarations, and of those that do, most are outdated. The result is a field day for large AI companies looking to train models at scale.
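A basic bot declaration is simply a robots.txt rule naming a crawler. The sketch below is not how Sentinel works (the article does not describe its internals). It only illustrates, using Python's standard `urllib.robotparser`, how a publisher might check which crawlers a robots.txt file still lets in. The `allowed_bots` helper, the sample file, and the short crawler list are assumptions for illustration; real crawler names change over time and should come from a maintained source.

```python
from urllib.robotparser import RobotFileParser

# Illustrative list only, not exhaustive and not taken from the article.
AI_CRAWLERS = ["GPTBot", "CCBot", "ClaudeBot"]


def allowed_bots(robots_txt: str, bots: list[str], path: str = "/") -> list[str]:
    """Return the bots that this robots.txt still permits to fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in bots if parser.can_fetch(bot, path)]


# Hypothetical robots.txt that blocks one AI crawler but leaves the rest open.
SAMPLE = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

print(allowed_bots(SAMPLE, AI_CRAWLERS))
```

A check like this only shows what a site has declared. Because robots.txt is advisory, a crawler that ignores it can only be caught by watching actual traffic, which is why declarations and monitoring are needed together.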
The remedies are imperfect: centralised bot management, technical barriers, and legal letters backed with evidence. But as Baron notes, the pace of scraping now resembles the early days of SEO: a constant, fast-moving contest rather than a one-off compliance exercise.
The “hamburger law”, as O’Reilly calls it, is brutal: once content has been chewed up into anonymous embeddings, it cannot be restored. The only defence is to catch violations at the point of ingestion.
Attribution as a differentiator
If scraping is the threat, attribution is the opportunity. O’Reilly’s Answers provide direct quotes and references, which neither ChatGPT nor most general-purpose AI tools can do. That, they argue, is the competitive high ground for publishers.
The difference is not academic. In areas such as programming or legal compliance, a wrong answer can be costly. Attribution builds trust and allows users to follow the trail back to the full source. For publishers, it establishes grounds for monetisation rather than leakage.
Baron sees this as central to O’Reilly’s positioning: “We want AI to be a funnel for discovery, not a siphon that drains value away.”
Lucky Gunaesekara, co-founder and Chief Executive of Miso, present at the Press Gazette Future of Media Technology Conference in London on 11 September 2025
Commercial footing
The partnership is not simply about defending territory. O’Reilly and Miso see commercial gains. Answers are now deployed across dozens of sites, with plans for more than 100 by year-end. The business model remains subscription-based, a blend of direct-to-consumer, enterprise, government and academic accounts, but enhanced by higher engagement metrics.
Private AI models, rather than shared big tech services, underpin the system. Gunaesekara stresses the importance of running models privately and configuring them for each publisher. That keeps content safe from training leakage and allows fine-tuned benchmarking across different model families.
The firms are cautious about striking deals with big tech. They prefer to keep their models under direct control, using open-source frameworks where possible but avoiding wholesale reliance on external providers.
The long tail
One of the more intriguing aspects of AI Answers is its performance on the “long tail” of queries. A conventional search engine excels at popular, fat-head queries: “What is Kubernetes?” or “Best way to learn Python.” But the real value, says Gunaesekara, comes from the odd, highly specific question, the sort that may only ever be asked once.
For example, preparing a technical lecture in Japan, he used O’Reilly’s platform to generate precise Japanese-language primers that no editor could have anticipated. The long tail, multiplied across millions of users, is where AI tools delight. And delight is what keeps people subscribing.
Baron notes that legacy publishers often underestimate the revenue locked in the archive. Historical material can still generate returns if made discoverable. AI-powered snippet search is one of the few technologies that can make the long tail commercially viable.
Beyond discovery: creation and translation
O’Reilly and Miso are also experimenting with content creation workflows. Their “Studio” tool allows editorial teams to apply large language models to existing content in controlled ways. This includes generating lists, explanations, or translations on the fly, but always with attribution.
One striking example is real-time translation: a user can ask a question in one language and receive an answer in the same language, with citations pointing to either translated or original material. That capability, Baron says, is already popular with enterprise clients operating across markets.
Again, the principle is not to compete with general-purpose AI but to ride the wave, adding value by combining licensed content with private models.
The larger fight
Both executives are frank that the broader landscape is hostile. Generative AI firms have every incentive to ingest publisher content without permission, and technical barriers are weak. Publishers face an arms race they cannot easily win alone.
That is why O’Reilly and Miso stress collaboration. Publishers must consent, opt in, and share in the value. AI providers must be pushed, legally and commercially, to respect copyright. And creators, whether authors or media houses, must be fairly compensated.
The O’Reilly-Miso alliance is an experiment in this direction. It offers a practical model for publishers who want to embrace AI discovery without ceding their content wholesale.
Conclusion: a cautious optimism
O’Reilly Media has travelled far from its origins in technical handbooks. With Miso, it is testing a model of AI-powered discovery that protects attribution, enriches archives, and offers a commercial alternative to scraping.
The approach is not guaranteed to win. Big tech platforms will continue to evolve, and the scraping arms race will intensify. But the experiment shows one path for publishers: to embrace AI not as a threat to content but as a way to surface it, provided the rules are clear, the models are private, and the creators are paid.
As Baron puts it, the task is to “ride the horse, not push the car.” In other words, work with the momentum of AI, but make sure it is harnessed to publishers’ interests, not against them.