How would one approach indexing pages for a search engine?
I mean, I know one way is to use a crawler that loads a number of known pages and attempts to follow all of their listed links, or at least the ones that lead to different domains, which is how I believe most engines started off (something like the sketch at the end of this post).
But how would you find your way out of "bubbles"? Let's say that, following all the links from the sites you started with, none of them point to abc.xyz. How could you discover that site otherwise?
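For concreteness, here's a very rough sketch of the kind of crawler I mean (Python with `requests` and `BeautifulSoup`; the seed URL, page limit, and the "only follow links to other hosts" rule are just placeholders, and it ignores robots.txt and politeness entirely):

```python
# Very rough breadth-first crawler sketch. Assumes `requests` and `beautifulsoup4`
# are installed; ignores robots.txt, crawl delays, and content deduplication.
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(seeds, max_pages=100):
    queue = deque(seeds)            # frontier of URLs still to fetch
    seen = set(seeds)               # URLs already queued, to avoid loops
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue                # unreachable page, skip it
        fetched += 1
        soup = BeautifulSoup(resp.text, "html.parser")
        yield url, soup.get_text()  # hand the page text off to the indexer
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            # only follow links that lead to a different host, per the idea above
            if urlparse(link).netloc != urlparse(url).netloc and link not in seen:
                seen.add(link)
                queue.append(link)

for page_url, _text in crawl(["https://example.com"]):
    print(page_url)
```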
Your search engine would have to be told about that site some other way.
I'm not sure if you still can, but at least years ago you could register your site with Google so that it could find it without any other links to your site being present.
Links are the main way; sites that aren't mentioned anywhere on the internet often aren't worth indexing. That's why sitemaps and tools to submit your website to major search engines peaked in the 00s. But if you really want everything, you could always subscribe to lists of newly registered domains and create rules to scrape them repeatedly with exponential backoff.
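A hedged sketch of that last idea, assuming a made-up probe and arbitrary delay constants (the feed of newly registered domains is left out):

```python
# Rough sketch of "watch newly registered domains and recheck with exponential
# backoff". The probe and the delay constants are made up for illustration.
import time

import requests

BASE_DELAY = 3600            # first recheck after an hour (arbitrary)
MAX_DELAY = 30 * 24 * 3600   # cap the wait at roughly a month

def next_delay(attempt):
    """Exponential backoff: 1h, 2h, 4h, ... capped at MAX_DELAY."""
    return min(BASE_DELAY * (2 ** attempt), MAX_DELAY)

def probe(domain):
    """Return True once the domain serves something worth indexing."""
    try:
        resp = requests.get(f"https://{domain}", timeout=10)
        return resp.ok and bool(resp.text.strip())
    except requests.RequestException:
        return False

def watch(domain, max_attempts=10):
    for attempt in range(max_attempts):
        if probe(domain):
            return True          # hand the domain off to the crawler here
        time.sleep(next_delay(attempt))
    return False                 # parked or never went live; give up
```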
If you mean Lemmy's internal search engine, it will probably go through evolution and improvement in the coming months, just like everything else about Lemmy. If you want to index Lemmy with your own search engine, you'll want to connect it up to ActivityPub, but why not help out with the internal one, or at least share your experiences?
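If you do go the ActivityPub route, the basic idea is just requesting objects with the `application/activity+json` media type. A minimal sketch (the post URL below is hypothetical):

```python
# Hedged sketch of pulling a public Lemmy object as ActivityPub JSON.
# A real indexer would also need to respect rate limits and walk community
# outboxes rather than fetching single posts.
import requests

def fetch_activitypub(url):
    resp = requests.get(
        url,
        headers={"Accept": "application/activity+json"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

obj = fetch_activitypub("https://lemmy.ml/post/123456")  # hypothetical URL
print(obj.get("type"), obj.get("name"))  # typically "Page" plus the post title
```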
From what perspective are you asking? Are you a search implementer? I think the field is going through rapid change, so you will want to stay on top of things.