Apr '07-Aug '07 — Google Summer of Code

  • Student Developer: successfully completed the DebTorrent project working with Debian
  • Developed a new peer-to-peer method for distributing packages to users
  • Worked independently, sometimes collaborating with other developers around the world
  • Mentored by Anthony Towns

The Google Summer of Code is a program that offers student developers stipends to write code for various open source projects. Open source, free software and technology-related groups apply to Google to fund several projects over a three-month period. I chose to work with the Debian Project, which received over 100 applications for only 9 slots.

The Project

The goal of this project is to extend the BitTorrent application to work effectively with large, constantly updated collections of files such as the Debian archive.

BitTorrent is a peer-to-peer file sharing application designed to reduce the costs of hardware, hosting and bandwidth resources for the original distributor by allowing downloading peers to share downloaded data with others. This project proposes to create a backend or proxy to the Debian package distribution tool apt, which will allow for the downloading of packages from other users of Debian in a BitTorrent-like manner, thus reducing the costs incurred by the archive's host.

Benefits to Debian

This project benefits the Debian project and its mirrors, as well as any other developer who wants to host a popular archive but is concerned about bandwidth costs. Once the service is complete and widely deployed, the bandwidth and hardware costs of providing a very large Debian archive to hundreds of thousands of users will be dramatically reduced.

These costs are currently reduced by the voluntary mirroring system Debian uses to help distribute packages. This system has some drawbacks, though, especially as the archive grows. Some mirrors are already struggling with the size of the archive, which has led Debian to introduce partial mirroring. The mirror system also creates multiple central points of failure for users, as most depend on a single mirror, and it does a poor job of distributing the load evenly, as users may choose some mirrors more often than others. Finally, non-primary mirrors may be slow to update to new versions of packages, and changes to the mirror network spread slowly because users must update their sources.list files manually.

However, using a BitTorrent-like system, these voluntary mirrors could simply join the swarm of downloaders for the archive: mirroring only as much data and contributing only as much bandwidth as they can, providing multiple redundancies and being automatically load balanced by the system, and using the bandwidth savings to update their packages more frequently. This will further allow for future growth, both in the size of the archive and in the popularity of Debian.

Project Details

Though a BitTorrent-like solution to package distribution seems like a good idea, the way BitTorrent distributes files makes it unsuitable for the Debian archive as is. First, the Debian archive is a very large repository of packages, including many different versions and architectures. A normal user will only want to download a very small subset of the entire archive, whereas in BitTorrent it is normal to download the entire torrent. Second, the archive contains files of widely varying sizes, many of which are smaller than the smallest piece size used by BitTorrent today. Some enhancements are needed to allow small packages to be downloaded without wasting large amounts of bandwidth. Finally, the archive is updated frequently, though only a very small portion of it changes at a time. BitTorrent is currently not designed to handle updates to files, nor multiple versions of files.

Some further concerns:

  • the packages are too small and there are too many to create individual torrents for each
  • the archive is too large to track efficiently as a single torrent
  • piece sizes are bigger than many packages, so avoiding wasted bandwidth is a concern
  • if multiple torrents may contain the same files (e.g. architecture:all packages), then some communication needs to occur between users of different torrents
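The piece-size concern can be made concrete with a little arithmetic. The following is a minimal sketch, not code from DebTorrent: it assumes a fixed 256 KiB piece size and made-up package sizes, lays the files out back to back as in a multi-file torrent, and counts how many bytes must be fetched to obtain only the wanted packages. The names `PIECE_SIZE`, `pieces_touched` and `bytes_downloaded` are hypothetical.

```python
PIECE_SIZE = 256 * 1024  # assumed fixed piece size (256 KiB), for illustration


def pieces_touched(offset, size, piece_size=PIECE_SIZE):
    """Return the fixed-size piece indices that a file at `offset` overlaps."""
    if size == 0:
        return range(0)
    first = offset // piece_size
    last = (offset + size - 1) // piece_size
    return range(first, last + 1)


def bytes_downloaded(wanted, sizes, piece_size=PIECE_SIZE):
    """Bytes fetched when only the files whose indices are in `wanted` are needed.

    Files are concatenated in order, so one piece may span several files
    and has to be downloaded whole even if only part of it is wanted.
    """
    total = sum(sizes)
    needed = set()
    offset = 0
    for index, size in enumerate(sizes):
        if index in wanted:
            needed.update(pieces_touched(offset, size, piece_size))
        offset += size
    # The last piece of the torrent may be shorter than piece_size.
    return sum(min(piece_size, total - p * piece_size) for p in needed)


if __name__ == "__main__":
    # Hypothetical package sizes in bytes, not real archive data.
    sizes = [100_000, 50_000, 300_000, 20_000]
    fetched = bytes_downloaded({1}, sizes)
    print(f"wanted {sizes[1]} bytes, fetched {fetched} bytes")
```

With these assumed numbers, a 50,000-byte package costs a full 262,144-byte piece, most of which belongs to neighbouring packages the user did not ask for.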

Results

The project started by modifying a current open-source implementation of a BitTorrent client: BitTornado. The modifications were extensive, both to address the concerns above, and to add new functionality to integrate with standard Debian components:

  • Added support for variable-sized pieces to avoid wasting download bandwidth on unneeded pieces.
  • Broke very large packages into multiple pieces based on separate information not included in Packages files.
  • Added a backup HTTP download from a mirror for when no peers can be found for a package.
  • Added automatic starting of torrents when Packages files are downloaded.
  • Added a proxying capability that listens for HTTP requests from APT and automatically starts downloading the desired packages.
  • Packaged the program for Debian so it can be easily installed.
  • Added support for a new debtorrent APT transport method and for HTTP/1.1 connections, including persistent connections and pipelining, to speed up communication with APT.
  • Maintained unique piece numbers so that the daily modification of a small part of the archive does not fracture the peer population into multiple torrents.
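Two of the ideas above, variable-sized pieces and unique piece numbers, can be sketched together. This is an illustrative sketch only and not DebTorrent's actual scheme: it assumes each package becomes one piece, that packages larger than a hypothetical `MAX_PIECE` cap are split at fixed boundaries (the real project used separate piece information for large packages), and that piece numbers are handed out once per package file and never reused. `Piece`, `PieceTable` and the package names are made up for the example.

```python
from dataclasses import dataclass

MAX_PIECE = 512 * 1024  # assumed cap on piece length, for illustration only


@dataclass(frozen=True)
class Piece:
    number: int   # unique piece number, never reused
    package: str  # package file the piece belongs to
    offset: int   # byte offset of the piece inside that package
    length: int   # piece length in bytes (varies per piece)


class PieceTable:
    """Assigns variable-sized pieces and stable piece numbers to packages."""

    def __init__(self):
        self._next_number = 0
        self._pieces = {}  # (file name, digest) -> list of Piece

    def add(self, name, digest, size):
        """Register a package file; return its pieces.

        A package that is already known keeps its existing piece numbers,
        so an archive update only introduces numbers for changed files.
        """
        key = (name, digest)
        if key in self._pieces:
            return self._pieces[key]
        pieces = []
        offset = 0
        while True:
            length = min(MAX_PIECE, size - offset)
            pieces.append(Piece(self._next_number, name, offset, length))
            self._next_number += 1
            offset += length
            if offset >= size:
                break
        self._pieces[key] = pieces
        return pieces


if __name__ == "__main__":
    table = PieceTable()
    table.add("foo_1.0_all.deb", "digest-a", 1_000)
    table.add("big_2.0_amd64.deb", "digest-b", 2 * MAX_PIECE + 10)
    # A new version of foo gets a fresh number; big keeps numbers 1 to 3.
    for piece in table.add("foo_1.1_all.deb", "digest-c", 1_200):
        print(piece)
```

Because a small package maps to exactly one piece of its own length, a peer that wants only that package downloads nothing else, and peers holding unchanged packages can keep sharing them under the same piece numbers after an archive update.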