The Wild Wild Web

  • Keep up-to-date with the latest news of this category.
    | Search My Site
    You can now take a direct flight from Moscow, Russia to Pyongyang, North Korea and (maybe) back. That’s pretty brutal. Worse, you have to fly on either a Russian airline or a North Korean airline. Double brutal and scary. Probably best to give both places a skip.
    | Mwmbl
    It’s been so long since we’ve had an update on the blog that people are often confused as to whether the project is still active. It definitely is! I’m just bad at updating the blog. Most of the updates have been going to the Matrix channel. So an update is long overdue. Most of the recent work has [...]
    | CORE
    As of July 2025, CORE marks 15 years of supporting the global open access community through indexing and enriching research outputs. What began as a small-scale project has grown into the world’s largest open access indexing platform supporting over 10,000 repositories and journals, and making milli [...]
    | CORE
    Earlier this year, we shared our excitement ahead of the Open Repositories 2025 (OR2025) conference in Chicago. With a packed programme and growing momentum around open science infrastructure, CORE brought a series of contributions focused on the responsible use of AI, metadata innovation, and natio [...]
    | Marginalia Search
    As some of the work planned for Marginalia Search this year has been progressing a bit faster than anticipated, there was time to implement an unplanned change. This post details the implementation of a system for detecting when servers are online, to avoid serving dead links and improve data qualit [...]
    | Kagi
    Three years ago, Kagi officially launched with a splash on popular technology forum Hacker News (to which we are eternally grateful for helping put Kagi on the map).
    | Marginalia Search
    The most recent change to the search engine is a system that profiles websites based on their rendered DOM. The goal is identifying advertisements, trackers, nuisance popovers, and similar elements. The search engine already tries to do this, but isn’t very good at it because it’s only looking at st [...]
    | CORE
    The COAR Annual Conference 2025, held in Tokyo from 12–14 May, brought together repository experts, open science advocates, and infrastructure leaders from across the globe. Representing CORE was Professor Petr Knoth, who contributed to three sessions throughout the event, each addressing urgent and [...]
    | Marginalia Search
    The search engine has recently gained the ability to index the PDF file format. The change will deploy over a few months. Extracting text information from PDFs is a significantly bigger challenge than it might seem. The crux of the problem is that the file format isn’t a text format at all, but a gr [...]
    | clew
    The web is mind-bogglingly huge; let's look at how personal websites can thrive and interact despite that.
    | Marginalia Search
    A problem the search engine’s crawler has struggled with for some time is that it takes a fairly long time to finish up, usually spending several days wrapping up the final few domains. This has been actualized recently, since the migration to slop crawl data has dropped memory requirements of the c [...]
    | Marginalia Search
    I’m happy and grateful to announce that the Marginalia Search project has been accepted for a second nlnet grant. All the details are not yet finalized, but tentatively the grant will go toward addressing most of the items in the project roadmap for 2025. I’ve already been working full time on the p [...]
    | clew
    While on a fourteen-hour international flight, I finally managed to come up with an architecture for Clew's web crawler that I'm happy with. Here's the run-down.
    | Kagi
    To satisfy the growing demand for our services in work environments, we are launching Kagi for Teams (https://kagi.com/teams), bringing our unmatched quality, privacy-focused search and AI tools to businesses worldwide.
    | clew
    I believe I've reached a point in Clew's development where, armed with the knowledge I've acquired from months of crawling sites and using that data to search the index, it's time to wipe the index and start over.
    | Mwmbl
    By many measures, Mwmbl is doing great. We have indexed over half a billion pages, we have over 4,000 registered users, and over 30,000 curations from those users. Our volunteers are crawling around 5 million pages a day. But the score that I care about most right now is NDCG. This measures the qual [...]
    | Mwmbl
    It’s two years since we launched Mwmbl, the open source, non-profit search engine, on Boxing Day 2021. A good time to take stock of where we are and where we’re going. We’ve indexed over 100 million pages. Thanks to our volunteers, who crawl the web using the Firefox extension and command line script [...]
    | Mwmbl
    Mwmbl is the first search engine to allow users to change the search results: You can add results, delete them, and rerank them. The changes you make are saved instantly to the index and will be shown to other users who run the same query. But what is the point of users changing search results? Th [...]
    | Mwmbl
    We recently launched the new version of Mwmbl which includes the long-awaited feature of allowing users to curate search results. This is an experiment, since we don’t know: Will people want to curate search results? How will we determine what is an objectively good search ranking? How will we deal [...]