Using BitTorrent to Distribute the Debian Archive

This project will extend BitTorrent so that it works effectively with large, constantly updated collections of files such as the Debian archive.

BitTorrent is a peer-to-peer file sharing application designed to reduce the costs of hardware, hosting and bandwidth resources for the original distributor by allowing downloading peers to share downloaded data with others. This project proposes to create a backend or proxy to the Debian package distribution tool apt, which will allow for the downloading of packages from other users of Debian in a BitTorrent-like manner, thus reducing the costs incurred by the archive's host.

This project was accepted for the 2007 Google Summer of Code. More information is available on the website, and on the Debian wiki.

Benefits to Debian

The benefits of this project are clear, both to the Debian project and its mirrors and to any other developer who wants to host a popular archive but is concerned about bandwidth costs. Once the service is complete and widely deployed, the bandwidth and hardware costs of providing a very large Debian archive to hundreds of thousands of users will be dramatically reduced.

These costs are currently reduced by the voluntary mirroring system that Debian uses to help distribute packages. This system has some drawbacks, though, especially as the archive grows. Some mirrors are already feeling the burden of its size, which has led Debian to introduce partial mirroring. The system also creates a central point of failure for each group of users, as most users depend on a single mirror, and it does a poor job of distributing the load evenly, as some mirrors are chosen more often than others. Finally, non-primary mirrors may be slow to update to new versions of packages, and changes happen slowly, as users must update their sources lists manually.

However, using a BitTorrent-like system, these voluntary mirrors could simply join the swarm of downloaders for the archive: mirroring only as much data and contributing only as much bandwidth as they can, providing multiple redundancies and being automatically load balanced by the system, and using the bandwidth savings to update their packages more frequently. This will further allow for future growth, both in the size of the archive and in the popularity of Debian.

Project Details

Though a BitTorrent-like solution to package distribution seems like a good idea, the way BitTorrent currently distributes files makes it unsuitable for the Debian archive in three respects. First, the Debian archive is a very large repository of packages, including many different versions and architectures. A normal user will want to download only a very small subset of the archive, whereas in BitTorrent it is normal to download the entire torrent. Second, the archive contains files of widely varying sizes, many of which are smaller than the smallest piece size used by BitTorrent today. Some enhancements will be needed to allow small packages to be downloaded without wasting large amounts of bandwidth. Finally, the archive is updated frequently, though only a very small portion of it changes at a time. BitTorrent is currently not designed to handle updates to files, nor multiple versions of files.
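The piece size problem can be made concrete with a little arithmetic. The following is a minimal sketch, not part of any existing client: it assumes that files are concatenated in order into one torrent, that a client must fetch whole pieces so they can be hash-checked, and that every file is at least one byte long. The function name, the file sizes and the 256 KiB piece size are illustrative assumptions.

```python
def bytes_fetched(file_sizes, index, piece_size):
    """Bytes a client must download to obtain file `index`.

    Assumes files are laid out back to back in the torrent, that only
    whole pieces can be verified, and that every size is positive.
    """
    start = sum(file_sizes[:index])        # offset of the file's first byte
    end = start + file_sizes[index]        # one past the file's last byte
    total = sum(file_sizes)

    first_piece = start // piece_size
    last_piece = (end - 1) // piece_size

    # The final piece of a torrent may be shorter than piece_size.
    fetched_end = min((last_piece + 1) * piece_size, total)
    return fetched_end - first_piece * piece_size


if __name__ == "__main__":
    sizes = [20_000, 50_000, 300_000]      # hypothetical package sizes in bytes
    piece = 256 * 1024                     # illustrative piece size
    wanted = 1                             # the 50,000-byte package
    fetched = bytes_fetched(sizes, wanted, piece)
    print(f"wanted {sizes[wanted]} bytes, fetched {fetched} bytes, "
          f"wasted {fetched - sizes[wanted]} bytes")
```

With these made-up numbers, the wanted package shares a piece with its neighbours, so most of the downloaded bytes belong to other files.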

Overcoming these limitations will require modifications to current BitTorrent systems. The modifications could take many directions, so the initial step in the project will be to discuss and plan a modified BitTorrent protocol that implements the necessary features. Some of this discussion has already occurred in a previous project (see Related Work below), which identified these concerns:

  • the packages are too small and there are too many to create individual torrents for each
  • the archive is too large to track efficiently as a single torrent
  • piece sizes are bigger than many packages, so avoiding wasted bandwidth is a concern
  • if multiple torrents may contain the same files (e.g. architecture:all packages), then some communication needs to occur between users of different torrents
  • the client can be informed of an updated package while trying to download an outdated one
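One way to picture the communication between users of different torrents is an index, kept by the tracker, from file hashes to the torrents that contain them. The sketch below is only an assumption about how such a lookup could work; the class, its method names and the torrent and peer labels are all hypothetical and do not come from BitTornado or any existing tracker.

```python
from collections import defaultdict


class CrossTorrentIndex:
    """Hypothetical tracker-side index for files shared between torrents."""

    def __init__(self):
        self._torrents_by_hash = defaultdict(set)   # file hash -> torrent names
        self._peers_by_torrent = defaultdict(set)   # torrent name -> peers

    def add_file(self, torrent, file_hash):
        """Record that `torrent` contains the file with this hash."""
        self._torrents_by_hash[file_hash].add(torrent)

    def add_peer(self, torrent, peer):
        """Record that `peer` is participating in `torrent`."""
        self._peers_by_torrent[torrent].add(peer)

    def peers_for_file(self, file_hash):
        """Return peers from every torrent that contains the file."""
        peers = set()
        for torrent in self._torrents_by_hash.get(file_hash, ()):
            peers |= self._peers_by_torrent.get(torrent, set())
        return peers


if __name__ == "__main__":
    index = CrossTorrentIndex()
    # An architecture:all package appears in both per-architecture torrents.
    index.add_file("i386", "hash-of-all-package")
    index.add_file("amd64", "hash-of-all-package")
    index.add_file("i386", "hash-of-i386-package")
    index.add_peer("i386", "peer-a")
    index.add_peer("amd64", "peer-b")
    print(sorted(index.peers_for_file("hash-of-all-package")))
```

Under these assumptions, a peer looking for an architecture:all package would be offered sources from both torrents, while an architecture-specific package would be served only by peers of its own torrent.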

As DFSG-free BitTorrent implementations are available, the project should begin by reusing existing code. My personal preference is to use the BitTornado client, as I am already familiar with it and, as far as I am aware, it is the only command-line client that remains DFSG-free.

Brief Biography

You can get a lot more information from my Resume (I will summarize some relevant parts below).

I am currently working on my Master's in Computing Science at Simon Fraser University. My work is focused mostly on peer-to-peer networking, and especially on BitTorrent and BitTorrent-like applications. My current research involves running many copies of a modified BitTornado client on the PlanetLab research testbed. My education has exposed me to many programming languages and given me programming experience in diverse scenarios.

I am also a volunteer developer with the Debian project. I have been a user since 2000, and have been a developer for more than a year now. I am currently the maintainer of two packages, TorrentFlux and libphp-adodb, and I also co-maintain the BitTornado BitTorrent client. I am currently (and have been for 11 months) in the New Maintainer queue to become an official Debian Developer.

I have many of the skills needed to complete this project. From my current work on my Master's and my package development of BitTornado, I have become very familiar with the inner workings of the BitTorrent protocol and with a typical client. In reading many papers on the subject, I have also become familiar with peer-to-peer protocol design and much of the published work both on BitTorrent and peer-to-peer systems in general.

I am familiar with, but not an expert on, some of the systems Debian uses for package distribution. I have a lot of experience with the dpkg and apt programs from a user's perspective, including examining the caches both programs keep and adding many repositories to an apt sources list. I have also set up my own local repositories to help with my Debian development work, using helpful tools such as debarchiver and reprepro. Though I am familiar with the directories and files of these repositories, I have no experience with the source code for apt.

Related Work

This project was attempted in last year's Google Summer of Code, but failed for various reasons. The mentor for last year's project, Anthony Towns, still believes this project is a good idea, and has agreed to mentor it this year as well. The project is admittedly a difficult one, which is one of the reasons why it failed last year.

Peer-to-peer options other than BitTorrent could also be explored to satisfy the requirements of this project. A more centralized solution (similar to the original Napster), or a file-based solution (similar to KaZaa), could be used, as these already address some of the shortcomings found in BitTorrent for this type of application. However, much of the peer-to-peer software is not open-source or DFSG-free, or does not possess the file sharing efficiencies present in BitTorrent. BitTorrent is still the most widely used distribution software for large amounts of data (as evidenced by all the Linux distributions that use it).

Project Schedule and Deliverables

The most difficult part of this project will be designing the solution. There are many possible methods that could be used to implement the desired functionality, and many questions that need answering. As such, defining schedules and deliverables for the rest of the project is difficult without knowing what form the final project will take. I am also wary of trying to do too much in the project, which is something that I think contributed to last year's failure. Therefore, here is a limited list of goals for the project:

  1. Discuss/plan the implementation of the new protocol, including statistical analysis of some of the implications of choosing particular options (will probably take place during the interim period before coding begins)
  2. Set up a testbed with (possibly) modified Packages files and a small repository
  3. Modify the client to receive communication from apt
  4. Modify the client to use the Packages files for torrents
  5. Modify the tracker to work across torrent boundaries
  6. Implement the solution to the piece size problem
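As a starting point for using the Packages files for torrents, the sketch below reads Packages-style stanzas and extracts the fields a torrent would need for each file: its path, size and hash. It relies only on the standard stanza layout of `Field: value` lines separated by blank lines, with continuation lines indented. The function names and the sample stanzas (package names, sizes and SHA1 values) are made up for illustration, and how these entries would map onto pieces is left open.

```python
def parse_packages(text):
    """Parse Packages-style stanzas into a list of field dictionaries."""
    entries = []
    for stanza in text.strip().split("\n\n"):
        fields = {}
        key = None
        for line in stanza.splitlines():
            if line[:1] in (" ", "\t") and key is not None:
                # Indented lines continue the previous field's value.
                fields[key] += "\n" + line.strip()
            elif ":" in line:
                key, _, value = line.partition(":")
                fields[key] = value.strip()
        if fields:
            entries.append(fields)
    return entries


def torrent_files(entries):
    """Return (path, size, sha1) for each entry that has all three fields."""
    needed = ("Filename", "Size", "SHA1")
    return [
        (e["Filename"], int(e["Size"]), e["SHA1"])
        for e in entries
        if all(k in e for k in needed)
    ]


# Made-up sample data; these packages and hashes do not exist.
SAMPLE = """\
Package: example-tool
Version: 1.0-1
Architecture: all
Filename: pool/main/e/example-tool/example-tool_1.0-1_all.deb
Size: 48230
SHA1: 0123456789abcdef0123456789abcdef01234567
Description: made-up package for illustration
 This continuation line belongs to Description.

Package: example-lib
Version: 2.3-4
Architecture: i386
Filename: pool/main/e/example-lib/example-lib_2.3-4_i386.deb
Size: 1203344
SHA1: 89abcdef0123456789abcdef0123456789abcdef
"""

if __name__ == "__main__":
    for path, size, sha1 in torrent_files(parse_packages(SAMPLE)):
        print(f"{size:>9}  {sha1}  {path}")
```

Because the Packages file already lists a hash and a size for every package, a client could in principle verify each downloaded package without a separate torrent file, though whether that is the right design is one of the questions for the planning stage.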

The list of goals above seems short, but it is probably enough work for the entire summer, especially if it is done right. If time allows, or for future projects, here are some additional goals:

  1. Implement notification of new versions of packages
  2. Integrate more usefully with apt to reduce duplication of files and functionality
  3. Add distributed tracker using DHT
  4. Develop scripts to be used by mirrors, both to receive packages and to seed torrents