Update: Downloading all archive.org metadata
Following up from my previous post.
I used the API at https://archive.org/developers/changes.html to enumerate all the item names in the archive. Currently there are over 256 million item names. However, I went through a sample of them and noted the following:
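The enumeration itself is just following a cursor until the feed runs dry. Here's a minimal sketch of that loop; the endpoint, parameters, and exact response shape are placeholders (the real ones are in the changes-API docs linked above), so the fetch function is injected rather than hard-coded:

```python
# Sketch of enumerating item identifiers from a cursor-paginated
# changes feed, in the style of the API documented at
# https://archive.org/developers/changes.html. The field names
# ("changes", "identifier", "next_token") are assumptions for
# illustration; a real fetcher would POST your API keys to the
# actual changes endpoint.

from typing import Callable, Iterator, Optional

def enumerate_items(fetch: Callable[[Optional[str]], dict]) -> Iterator[str]:
    """Yield item identifiers, following the cursor until exhausted.

    `fetch(token)` must return a dict with a "changes" list of
    {"identifier": ...} entries and a "next_token" (falsy when done).
    """
    token = None
    while True:
        page = fetch(token)
        for change in page.get("changes", []):
            yield change["identifier"]
        token = page.get("next_token")
        if not token:
            return

# Example with a canned two-page response standing in for the API:
pages = {
    None: {"changes": [{"identifier": "item-a"}], "next_token": "t1"},
    "t1": {"changes": [{"identifier": "item-b"}], "next_token": None},
}
print(list(enumerate_items(pages.__getitem__)))  # ['item-a', 'item-b']
```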
- Many do not have a .torrent available because some of their files are locked due to copyright concerns, like the music collection. Ex: https://archive.org/details/lp_le-sonate-per-pianoforte-vol-1_carl-maria-von-weber-dino-ciani_0
- A lot of items have been removed from public access completely, and possibly deleted even on their storage backend. Ex: https://archive.org/details/0-5-1-0-hernan-hernandez
Far more items have been removed from the archive than I expected. If you have critical data, the Internet Archive should of course never be your only backup.
I don't know the distribution of metadata and .torrent file sizes since I have not tried downloading them yet. Storing everything could require a lot of space: if only 50% of the items remain and the average .torrent plus metadata is 20 KB, that's over 2.5 TB. On the other hand, the archive has a lot of small one-off uploads where the metadata is around 800 bytes and the torrent around 3 KB, so if the combined average is closer to 5 KB it's only about 640 GB.
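For anyone who wants to tweak the assumptions, the back-of-envelope math is just:

```python
# Back-of-envelope storage estimate for mirroring archive.org
# metadata + .torrent files. The item count is from the post; the
# surviving fraction and per-item sizes are rough guesses, not
# measurements.

TOTAL_ITEMS = 256_000_000
SURVIVING_FRACTION = 0.5  # guess: half the items are still public

def storage_bytes(avg_item_bytes: int) -> int:
    """Total bytes needed at a given average per-item size."""
    return int(TOTAL_ITEMS * SURVIVING_FRACTION * avg_item_bytes)

# Pessimistic: 20 KB of metadata + torrent per item -> 2.56 TB
print(storage_bytes(20_000) / 1e12, "TB")
# Optimistic: 5 KB per item -> 640 GB
print(storage_bytes(5_000) / 1e9, "GB")
```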
The link to the above release post has the wrong caption for me. Its title says "Ambulance hits Oregon cyclist, rushes him to hospital, then sticks him with $1,800 bill, lawsuit says - Divisions by zero"
Downloading all archive.org metadata
I'd love to know if anyone's aware of a bulk metadata export feature or repository. I would like to have a copy of the metadata and .torrent files of all items.
I guess one way is to use the CLI but this relies on knowing which item you want and I don't know if there's a way to get a list of all items.
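For what it's worth, once you do know an item name, the metadata JSON and the auto-generated torrent live at predictable URLs. A tiny sketch assuming the usual /metadata/ endpoint and <identifier>_archive.torrent naming conventions (locked or removed items will 404):

```python
# Build the metadata and torrent URLs for an archive.org item.
# The URL patterns below are the commonly used conventions; they
# are not guaranteed for every item (e.g. darked items).

def item_urls(identifier: str) -> dict:
    base = "https://archive.org"
    return {
        "metadata": f"{base}/metadata/{identifier}",
        "torrent": f"{base}/download/{identifier}/{identifier}_archive.torrent",
    }

urls = item_urls("some-item")
print(urls["metadata"])  # https://archive.org/metadata/some-item
print(urls["torrent"])   # https://archive.org/download/some-item/some-item_archive.torrent
```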
I believe downloading via BitTorrent and seeding back is a win-win: it bolsters the Archive's resilience while easing server strain. I'll be seeding the items I download.
Edit: If you want to enumerate all item names in the entire archive.org repository, take a look at https://archive.org/developers/changes.html. This will do that for you!
Whatever happened to DNA-based storage research?
It seems like 6 or 7 years ago there was research into new forms of storage, using crystals or DNA, that promised ultra-high-density storage. I know the read/write speeds were not very fast, but I thought there would be more progress in the area by now. Apparently in 2021 a team stored a 16 GB file in DNA. In the last month a company (Biomemory) started letting you store 1 KB of data in DNA for $1,000, but to read it back you have to send it to them. I don't understand why you would use that today.
I wonder if it will ever be viable for us to have DNA readers/writers... but I also wonder if there are other new types of data storage coming up that might be just as good.
If you know anything about the DNA research or other new storage forms, what do you think is the most promising one?
This was something I suggested for this instance, since there is even a guide for hosting an onion service: https://lemmy.dbzer0.com/post/135234
Maybe /u/db0 will have more time after the spam settles down, but it seems he's got a lot on his plate at the moment between being an admin and doing AI stuff.
I often look for older or niche content, and even for that I still often have plenty of takers on public trackers. The fact that my machine is port forwarded might have something to do with it. I'd say I have a "medium" amount of disk space and only stop seeding when I delete the files, though sometimes I limit the upload rate to keep some bandwidth for other activities.
What was Empress's last Denuvo-breaking release?
Prediction: AT-style decentralized hoarding of the web
The more that content on the web is "locked down" with more stringent API requests and identity verification, e.g. Twitter, the more I wonder if I should be archiving every single HTTP request my browser makes. Or, rather, I wonder if in the future there will be an Archive Team style decentralized network of hoarders who, as they naturally browse the web, establish and maintain an archive collectively, creating a "shadow" database of content. This shadow archive is owned entirely by the collective and thus requests to it are not subject to the limitations set by the source service.
The main point is that the hoarding is not distinguishable from regular browsing from the perspective of the source website, so the hoarding system can't be shut down without also giving up access to regular users.
Verification that the content actually came from the real service could probably be done using the HTTPS packets themselves, and some sort of reputation system could prevent the source websites themselves from trying to poison the collective with spam.
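One building block such a shadow archive would need is content addressing, so that independently captured copies of the same response deduplicate and can be cross-checked between participants. A minimal sketch (a real system would hash a canonicalized request/response record, e.g. a WARC entry, rather than the raw body alone):

```python
# Derive a stable content address from a URL and response body so
# that two hoarders who archived the same page independently end up
# with the same key. Illustrative only; headers, timestamps, and
# canonicalization are deliberately ignored here.

import hashlib

def content_address(url: str, body: bytes) -> str:
    h = hashlib.sha256()
    h.update(url.encode("utf-8"))
    h.update(b"\x00")  # separator so url/body boundaries are unambiguous
    h.update(body)
    return h.hexdigest()

a = content_address("https://example.com/page", b"<html>hi</html>")
b = content_address("https://example.com/page", b"<html>hi</html>")
print(a == b)  # True: identical captures collapse to one key
```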
Clearly, not all of the collected data should be shared, and without differential-privacy techniques and fingerprinting resistance, the participating accounts could be linked to the content they share.
Has anything like this been attempted before? I've never participated in Archive Team, but from what I read it seems similar.
Have OSes evolved enough that encrypted DNS is available? If so, would someone with enough technical knowledge link a guide on how to set it up within a popular OS?
I imagine that even if you plug one of the suggested DNS provider IP addresses into your network settings, the OS will still make plaintext requests that your ISP can snoop on unless you force it to use encryption somehow.
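As one example, on Linux with a recent systemd you can force DNS-over-TLS in strict mode via systemd-resolved. A sketch assuming Quad9 (any DoT-capable resolver works; the `#hostname` part is the name the TLS certificate is checked against):

```ini
# /etc/systemd/resolved.conf -- then: systemctl restart systemd-resolved
[Resolve]
DNS=9.9.9.9#dns.quad9.net
DNSOverTLS=yes
```

With `DNSOverTLS=yes` (as opposed to `opportunistic`), resolution fails rather than falling back to plaintext, which is what you want if the goal is keeping the ISP blind. Windows 11 and recent macOS/iOS have their own encrypted-DNS settings.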
Depending on the content, 10 or 20 comes quick
Note that H.264 and H.265 are video compression standards, while x264 and x265 are FOSS encoder libraries: x264 is developed by VideoLAN, and x265 by MulticoreWare.
I agree, and with FOSS you have the opportunity to contribute back to the software. One time I was using commercial software and reached out to the company about how to decode a special file format for use in a script and the response was that it was "proprietary". If it was FOSS or even if they just had given me the information, I would have contributed to growing the ecosystem.
Software could have trojans. But why not music?
It must be a bug. Yesterday I didn't see the subscribe button at all, just a plaintext "Subscribe" that I couldn't click. Today, after visiting one of the posts, the button finally appeared.
New account created today, yeah that's fishy.
Torrents use cryptographic hashes to verify the torrent content, so if he seeds it to you, then your torrent client will validate data he gives you. If the data doesn't verify or if he wants you to do anything else like clicking a link, avoid and report.
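To illustrate why that's safe: in BitTorrent v1 the .torrent's info dict stores one 20-byte SHA-1 digest per piece, and the client hashes every piece it receives and compares it against the table before accepting it. A toy sketch with synthetic data (real clients also handle piece boundaries across files, which is omitted here):

```python
# Minimal model of BitTorrent v1 piece verification: build the
# concatenated per-piece SHA-1 table, then check received pieces
# against it.

import hashlib

def piece_hashes(data: bytes, piece_len: int) -> bytes:
    """Concatenated SHA-1 digests, like the info dict's 'pieces' field."""
    out = b""
    for i in range(0, len(data), piece_len):
        out += hashlib.sha1(data[i:i + piece_len]).digest()
    return out

def verify_piece(index: int, piece: bytes, pieces: bytes) -> bool:
    expected = pieces[index * 20:(index + 1) * 20]
    return hashlib.sha1(piece).digest() == expected

original = b"A" * 100 + b"B" * 100
table = piece_hashes(original, piece_len=100)
print(verify_piece(0, b"A" * 100, table))  # True: genuine data accepted
print(verify_piece(0, b"X" * 100, table))  # False: tampered data rejected
```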
It's sometimes possible to find the same files on other download sites, but "retrieving dead torrents" in general isn't possible without having the same data.
This was data from pushshift before Reddit nuked it in March. You can find this torrent (called "Reddit comments/submissions 2005-06 to 2022-12") and others, including 2023-01 and 2023-02, on https://academictorrents.com by user Watchful1.
Thanks! For anyone curious, the links to the academictorrents versions of the Reddit archives are available on /r/datahoarder and probably their lemmy.ml instance too.
Note that Mozilla VPN uses Mullvad's network under the hood. Also, depending on your device you should be able to block connections that don't use the VPN. On Android, the "kill switch" can be found in the settings as described here: https://mullvad.net/en/help/using-mullvad-vpn-on-android/#block-without-vpn
Pushshift is down now? Is there a data hoarder who has a backup of all the historical Reddit data that we can seed?
Understandable, that sounds like a major pain. Hopefully the anti-spam does not impact reputable anonymous users, but I can only imagine the trouble caused by the influx of spam.
Good question. It's not quite the same.
The most compelling reason is that browsing an onion service does not leak any information about the destination to an exit relay, because the connection goes directly to the destination service. That makes timing correlation attacks to deanonymize users much harder to carry out, since there is no exit relay to record when connections to lemmy.dbzer0.com are made. Posts and their public timestamps on a social network would otherwise make such attacks even easier, since they provide evidence against which to validate the results.
It also acts as an advertisement about the site's commitment to anonymity and privacy.
Div/0 Onion Service request
Is there any chance that Divisions by zero could run an onion service so that its users could get the extra anonymity/privacy benefits that come with it when browsing over Tor? For comparison, Reddit also runs an onion service.
There is a setup guide for doing this with lemmy, which was published just a few days ago, at https://join-lemmy.org/docs/administration/tor_hidden_service.html
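At its core, that guide comes down to a small torrc fragment; the directory path and port below are illustrative, and the guide covers the lemmy/nginx specifics:

```
# torrc fragment for an onion service fronting the existing web server
HiddenServiceDir /var/lib/tor/lemmy_onion/
HiddenServicePort 80 127.0.0.1:80
```

After restarting tor, the generated onion address appears in the `hostname` file inside HiddenServiceDir.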
Use Tor Browser if you need anonymity, which isn't offered by private browsing mode or most other extensions. In case you don't want to route through the Tor network, Mullvad Browser offers the same fingerprinting resistance techniques as Tor Browser.
Proton is a good service, but their years of reluctance to add more anonymous payment methods such as Monero, and the inability to register an account from an anonymous IP address without a phone number, make me question the relative benefit of using them as a VPN.
These issues do not by themselves compromise anonymity, provided Proton is trustworthy and Swiss law still allows them to keep your identity (revealed via payments) disassociated from your account usage. But regulation and governments tend to become stricter rather than looser over time, and I would demand more from a service you are entrusting with all of your internet traffic.
If you want to learn Python, the tutorial in the documentation is a thoroughly excellent starting point. Reading the documentation (the most up-to-date, deliberate content) will make you far more of a Python wizard than codecademy ever could.