Distributed Wikipedia Mirrors in Freenet

2017-05-16

There was a recent post about uncensorable Wikipedia mirrors on IPFS. The IPFS project put a snapshot of the Turkish version of Wikipedia on IPFS. This is a great idea and something I've wanted to try on Freenet.

Freenet is an anonymous, secure, distributed datastore that I've written a few posts about. It wasn't too difficult to convert the IPFS process to something that worked on Freenet. For the Freenet keys linked in this post I'm using a proxy that retrieves data directly from Freenet. This uses the SCGIPublisher plugin on a local Freenet node. The list of whitelisted keys is at freenet.cd.pn. There is also a gateway available at d6.gnutella2.info. The keys can also be used directly from a Freenet node, which is likely to be faster than going through my underpowered proxy. Keep in mind that the "distributed, can't be taken down" aspect of the sites on Freenet only applies when they are accessed directly through Freenet. It's quite likely my clearnet proxy won't be able to handle large amounts of traffic.

I started with the Pitkern/Norfuk Wikipedia Snapshot as that was relatively small. Once I got the scripts for that working I converted the Māori Wikipedia Snapshot. The latest test I did was the Simple English Wikipedia Snapshot. This was much bigger so I did the version without images first. Later I plan to try the version with images when I've resolved some issues with the current process.

The Freenet keys for these mirrors are:

USK@m79AuzYDr-PLZ9kVaRhrgza45joVCrQmU9Er7ikdeRI,1mtRcpsTNBiIHOtPRLiJKDb1Al4sJn4ulKcZC5qHrFQ,AQACAAE/simple-wikipedia/0/
USK@jYBa5KmwybC9mQ2QJEuuQhCx9VMr9bb3ul7w1TnyVwE,OMqNMLprCO6ostkdK6oIuL1CxaI3PFNpnHxDZClGCGU,AQACAAE/maori-wikipedia/5/
USK@HdWqD7afIfjYuqqE74kJDwhYa2eetoPL7cX4TRHtZwc,CeRayXsCZR6qYq5tDmG6r24LrEgaZT9L2iirqa9tIgc,AQACAAE/pitkern-wikipedia/2/

The keys are 'USK' keys. These keys can be updated and have an edition number at the end of them. This number will increase as newer versions of the mirrors are pushed out. The Freenet node will often find the latest edition it knows about, or the latest edition can be searched for using '-1' as the edition number.
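
As a small illustration of the edition number, the sketch below rewrites the trailing edition of a site USK, for example to '-1' to ask the node to search for the latest edition. The helper name usk_with_edition is hypothetical and not part of Freenet; it only edits the text of the key.

```shell
#!/usr/bin/env bash
# Sketch only: usk_with_edition is a hypothetical helper, not a Freenet tool.
# It replaces the trailing edition number of a site USK with the one given.
usk_with_edition() {
  local key="$1" edition="$2"
  # A site USK ends in /<site-name>/<edition>/ so only the final
  # numeric path component (optionally negative) is rewritten.
  printf '%s\n' "$key" | sed -E "s#/-?[0-9]+/?\$#/${edition}/#"
}

# Ask for the latest edition of the Pitkern mirror key listed above.
usk_with_edition 'USK@HdWqD7afIfjYuqqE74kJDwhYa2eetoPL7cX4TRHtZwc,CeRayXsCZR6qYq5tDmG6r24LrEgaZT9L2iirqa9tIgc,AQACAAE/pitkern-wikipedia/2/' -1
```

The rest of the key is left untouched, so the same helper can also be used to point at a specific later edition.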

The approach I took for the mirroring follows the approach IPFS took. I used the ZIM archives provided by Kiwix and a ZIM extractor written in Rust. The archive was extracted with:

$ extract_zim wikipedia_en_simple_all_nopic.zim

This places the content in a directory named out. All HTML files are stored in a single directory, out/A. In the Simple English case that's over 170,000 files. This is too many files in a directory for Freenet to insert. I wrote a script in bash to split the directory so that files are stored in '000/filename.html' where '000' is the first three hex digits of a SHA256 hash of the base filename, computed with:

$ echo "filename.html"|sha256sum|awk '{ print $1 }'|cut -c "1,2,3"
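
A minimal sketch of how that split could be applied to a whole directory is below. It is not the code from convert.sh; the function names shard_prefix and shard_dir are made up for illustration. It keeps the same hash input as the command above, including the trailing newline that echo adds. Three hex digits give 4096 possible subdirectories, so 170,000 files work out to roughly 40 per directory on average.

```shell
#!/usr/bin/env bash
# Sketch only: hypothetical helpers, not taken from convert.sh.

# Print the first three hex digits of the SHA256 of a filename.
# echo appends a newline, which matches the command shown above.
shard_prefix() {
  echo "$1" | sha256sum | cut -c1-3
}

# shard_dir SRC DEST: move each regular file in SRC to DEST/<prefix>/<name>.
shard_dir() {
  local src="$1" dest="$2" path name prefix
  for path in "$src"/*; do
    [ -f "$path" ] || continue        # skip subdirectories and an empty glob
    name=$(basename "$path")
    prefix=$(shard_prefix "$name")
    mkdir -p "$dest/$prefix"
    mv -- "$path" "$dest/$prefix/$name"
  done
}

# Example (not run here): shard_dir out/A result
```

Because the prefix depends only on the filename, the same function can be used later to work out where a link to a given page should point.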

The script then went through and adjusted the article and image links on each page to point to the new location. The script does some other things to remove HTML tags that the Freenet HTML filter doesn't like and to add a footer about the origin of the mirror.

Another issue I faced was that filenames with non-ASCII characters would get handled differently by Freenet if the file was inserted as a single file vs being inserted as part of a directory. In the latter case the file could not be retrieved later. I worked around this by translating filenames into ASCII. A more robust solution would be needed here if I can't track down where the issue is occurring.
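
One simple way to force a filename into ASCII is sketched below. This is an assumption about how such a workaround could look, not the method convert.sh uses: it is a byte-for-byte substitution rather than a real transliteration, and the helper name ascii_name is hypothetical. Two different names can map to the same result, which is part of why a more robust solution would be needed.

```shell
#!/usr/bin/env bash
# Sketch only: ascii_name is a hypothetical helper, not from convert.sh.
# Every byte outside A-Z, a-z, 0-9, '.', '_' and '-' becomes an underscore.
# LC_ALL=C makes tr work on single bytes, so a two-byte UTF-8 character
# turns into two underscores.
ascii_name() {
  printf '%s' "$1" | LC_ALL=C tr -c 'A-Za-z0-9._-' '_'
}

# Example (not run here): ascii_name "$(basename "$path")"
```

Where it is available, iconv with an ASCII//TRANSLIT target is another option that gives more readable names, though its output varies between systems.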

The script to do the conversion is in my freenet-wikipedia GitHub repository. To convert a ZIM archive the steps are:

$ wget http://download.kiwix.org/zim/wikipedia_pih_all.zim
$ extract_zim wikipedia_pih_all.zim
$ ./convert.sh
$ ./putdir.sh result my-mirror index.html

At completion of the insert this will output a list of keys. The uri key is the one that can be shared for others to retrieve the insert. The uskinsert key can be used to insert an updated version of the site:

$ ./putdir.sh result my-mirror index.html <uskinsert key>

The convert.sh script was a quick 'proof of concept' hack and could be improved in many ways. It is also very slow: the Simple English conversion took about 24 hours. I welcome patches and better ways of doing things.

The repository includes a bash script, putdir.sh, which will insert the site using the Freenet ClientPutDiskDir API message. This is a useful way to get a directory online quickly but is not an optimal way of inserting something the size of the mirror. The initial request for the site downloads a manifest containing a list of all the files in the site. This can be quite large. It's 12MB for the Simple English mirror with no images. For the Māori mirror it's almost 50MB due to the images. The layout of the files doesn't take into account likely retrieval patterns, so images and scripts that are included in a page are not downloaded as part of the initial page request, and fetching them may pull in larger amounts of data depending on how each file was inserted. A good optimisation project would be to analyse the directory to be inserted and create an optimal Freenet insert for faster retrieval. pyFreenet has a utility, freesitemgr, that can do some of this and there are other insertion tools like jSite that may also do a better job.
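
For readers who haven't seen the Freenet Client Protocol (FCP), the sketch below prints the kind of text messages involved in a ClientPutDiskDir insert. It is not the contents of putdir.sh. The function name fcp_putdir_message, the client name and the identifier are made up for illustration, and it assumes you already have an insert URI for the site.

```shell
#!/usr/bin/env bash
# Sketch only: fcp_putdir_message is a hypothetical helper, not putdir.sh.
# Usage: fcp_putdir_message DIR INSERT_URI DEFAULT_NAME
#   DIR          directory to insert, as a path the node itself can read
#   INSERT_URI   insert URI for the site (assumed to exist already)
#   DEFAULT_NAME file served when the bare site key is requested
fcp_putdir_message() {
  local dir="$1" insert_uri="$2" default_name="$3"
  # An FCP message is a name line, Field=Value lines, then EndMessage.
  # ClientHello must be the first message sent on a connection.
  cat <<EOF
ClientHello
Name=putdir-sketch
ExpectedVersion=2.0
EndMessage
ClientPutDiskDir
Identifier=putdir-sketch-$$
URI=${insert_uri}
Filename=${dir}
DefaultName=${default_name}
Verbosity=1
EndMessage
EOF
}

# Example (not run here):
#   fcp_putdir_message "$PWD/result" "<insert uri>" index.html
```

The output would be written to the node's FCP port (9481 by default), and the connection needs to stay open until the node reports the result. Depending on how the node is configured it may also require a directory access check before it will read files from disk, which this sketch leaves out.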

My goal was to do a proof of concept to see if a Wikipedia mirror on Freenet was viable. This seems to be the case and the Simple English mirror is very usable. Discussion on the FMS forum when I announced the site has been positive. I hope to improve the process over time and welcome any suggestions or enhancements to do that.

What are the differences between this and the IPFS mirror? It's mostly down to how IPFS and Freenet work.

In Freenet content is distributed across all nodes in the network. The node that has inserted the data can turn their node off and the content remains in the network. No single node has all the content. There is redundancy built in so if nodes go offline the content can still be fully retrieved. Node space is limited so as data is inserted into Freenet, data that is not requested often is lost to make room. This means that content that is not popular disappears over time. I suspect this means that some of the Wikipedia pages will become inaccessible. This can be fixed by periodically reinserting the content, healing the specific missing content, or using the KeepAlive plugin to keep content around. Freenet is encrypted and anonymous. You can browse Wikipedia pages without an attacker knowing that you are doing so. Your node doesn't share the Wikipedia data, except possibly small encrypted chunks of parts of it in your datastore, and it's difficult for the attacker to identify you as a sharer of that data. The tradeoff for this security is that retrievals are slower.

In IPFS a node inserting the content cannot be turned off until that content is pinned by another node on the network and fully retrieved. Nodes that pin the content keep the entire content on their node. If all pinning nodes go offline then the content is lost. All nodes sharing the content advertise that fact. It's easy to obtain the IP address of all nodes that are sharing Wikipedia files. On the positive side IPFS is potentially quite a bit faster to retrieve data.

Both IPFS and Freenet have interesting use cases and tradeoffs. The intent of this experiment is not to present one or the other as a better choice, but to highlight what Freenet can do and make the content available within the Freenet network.

Bluish Coder
