In this post, we delve headfirst into the BitTorrent protocol, understanding the process of downloading a torrent by building a minimal torrent client from scratch.

Note
You can look at the source code or jump to the last part to see it in action.

BitTorrent was developed in the early 2000s as a peer-to-peer protocol for sharing files over the internet in a distributed manner. Traditional file-sharing systems often relied on a centralized server to facilitate the transfer of data. A client would request data from the server, and the server would respond by providing the requested content. In contrast, participants, also called peers, in the BitTorrent network connect directly with each other over the internet, sharing pieces of data, leading it to be called a peer-to-peer protocol.

Over the past 20 years, though multiple features have been added to the BitTorrent protocol through BEPs (BitTorrent Enhancement Proposals), we will implement just the original spec.

We’ll use a Debian ISO torrent file to test our implementation. It’s a large file at 627 MB, but not prohibitively so and would help us steer clear of any legal and ethical issues.

The .torrent file Link to heading

A .torrent file, also called the metainfo file, contains the details of the downloadable files and information about the tracker for finding peers in a format called Bencode. The .torrent file we’re about to download looks like this:

d8:announce41:http://bttracker.debian.org:6969/announce7:comment35:"Debian CD from cdimage.debian.org"10:created by13:mktorrent 1.113:creation datei1690028920e4:infod6:lengthi657457152e4:name31:debian-12.1.0-amd64-netinst.iso12:piece lengthi262144e6:pieces50160:<binary blob>
Info

Before we delve into building the client, let’s understand the Bencode format. Bencode is a compact binary encoding used in BitTorrent to represent dictionaries, lists, and integers in a human-readable manner. It’s simple and efficient for transferring metadata about torrents.

Here’s a breakdown of Bencode’s data structures:

  • Integers: Encoded between i and e. For example, i3e represents the integer 3.
  • Byte Strings: Prefixed with their length, followed by a colon. For instance, 8:announce represents the byte string "announce".
  • Lists: Enclosed in l and e. For example, l4:spam4:eggse represents the list ["spam", "eggs"].
  • Dictionaries: Enclosed in d and e, where keys are byte strings and appear in sorted order. For example, d3:cow3:moo4:spam4:eggse represents the dictionary {"cow": "moo", "spam": "eggs"}.

The content of a metainfo file is a bencoded dictionary, containing the keys listed below. All character string values are UTF-8 encoded

  • info: A dictionary that describes the file(s) of the torrent
  • announce: The announce URL of the tracker (string)
  • announce-list: (optional) The list of announce URLs (lists of strings).
  • creation date: (optional) The creation time of the torrent, in standard UNIX epoch format (integer)
  • comment: (optional) Free-form textual comments of the author (string)
  • created by: (optional) Name and version of the program used to create the .torrent (string)

Info Dictionary Link to heading

  • piece length: Number of bytes in each piece (integer)
  • pieces: String consisting of the concatenation of all 20-byte SHA-1 hash values, one per piece

Single File Mode

  • name: The filename. This is purely advisory. (string)
  • length: Length of the file in bytes (integer)

Multi File Mode

  • name: The name of the directory in which to store all the files (string)
  • files: A list of dictionaries, one for each file. Each dictionary in this list contains the following keys:
    • length: Length of the file in bytes (integer)
    • path: A list containing one or more string elements that together represent the path

When pretty-printed, the .torrent file is much more legible:

d
  8:announce
    41:http://bttracker.debian.org:6969/announce
  7:comment
    35:"Debian CD from cdimage.debian.org"
  10:created by
    13:mktorrent 1.1
  13:creation date
    i1690028920e
  4:info
    d
      6:length
        i657457152e
      4:name
        31:debian-12.1.0-amd64-netinst.iso
      12:piece length
        i262144e
      6:pieces
        50160:<binary blob containing hashes of each piece>
    e
e

Printed like this, we can already see the name and size of the file, the URL of the tracker, a comment, the creation date, the piece length, and a binary blog that contains the SHA-1 hashes of each piece. The piece length is usually a power of 2 and is typically chosen based on the total amount of file data in the torrent. The most common sizes are 256 kB, 512 kB, and 1 MB. Every piece in a torrent file is of equal length except for the final piece, which is irregular. The number of pieces is thus determined by ceil(size/piece_size). All we have to do is to download these pieces from our peers, check their integrity against the hashes in the metainfo file, concatenate them together and we’ve got ourselves the desired file!

While we could write a bencode parser from scratch, I decided to use the Bento library for our needs.

def parse_file(path) do
  with {:ok, file} <- File.read(path),
       {:ok, torrent} <- Bento.decode(file),
       {:ok, info} <- Map.fetch(torrent, "info"),
       {:ok, name} <- Map.fetch(info, "name"),
       {:ok, pieces} <- Map.fetch(info, "pieces"),
       {:ok, piece_length} <- Map.fetch(info, "piece length"),
       {:ok, trackers} <- get_trackers(torrent) do

    download_dir = Application.get_env(:torrex, :download_dir)
    files = get_files(info, Path.join(download_dir, name))
    size = Enum.reduce(files, 0, fn {_name, length}, total -> length + total end)
    piece_map = get_piece_map(piece_length, files, get_hashes(pieces))
    {:ok, bencoded_info} = Bento.encode(info)
    info_hash = :crypto.hash(:sha, bencoded_info)

    {:ok, %Torrent{
        name: name,
        info_hash: info_hash,
        files: files,
        size: size,
        trackers: trackers,
        piece_length: piece_length,
        pieces: piece_map,
    }}
  end
end

There are a lot of helper functions involved to convert the parsed file into our desired format. I computed the SHA-1 hash of the bencoded info dictionary, which is called the infohash. It uniquely identifies the files in a torrent. All torrents containing identical files will have the same infohash. I also split the pieces and joined them with the files to make a map in the form of

%{piece_num => {hash, [{file_path, file_offset, size}]}}

so that each piece can be written to disk without looking up any other information.

Communicating with the tracker Link to heading

Once we have the information about the file and its tracker, we can announce our presence in the BitTorrent network as a peer and get a list of other peers. All we have to do is make a GET request to the announce URL given in the .torrent file with a few query parameters:

  • info_hash: Urlencoded 20-byte SHA-1 hash of the bencoded info dictionary
  • peer_id: Urlencoded 20-byte string used as a unique ID for the client
  • port: The port number that the client is listening on
  • uploaded: The total amount uploaded
  • downloaded: The total amount downloaded
  • left: The number of bytes this client still has to download
  • compact: Setting this to 1 indicates that the client accepts a compact response. The peers’ list is replaced by a peers string with 6 bytes per peer. The first four bytes are the host, and the last two bytes are the port.
  • event: If specified, must be one of started, completed, stopped
    • started: The first request to the tracker must include the event key with this value
    • completed: Must be sent to the tracker when the download completes
    • stopped: Must be sent to the tracker if the client is shutting down gracefully
  • trackerid: (optional) If a previous announce contained a tracker id, it should be set here
defp build_request(url, event, info_hash, peer_id, port) do
  {:ok, torrent} = TorrentTable.get_torrent(info_hash)

  query = %{
    info_hash: info_hash,
    peer_id: peer_id,
    uploaded: torrent.uploaded,
    downloaded: torrent.downloaded,
    left: torrent.left,
    port: port,
    compact: 1
  }

  query =
    query
    |> add_tracker_id(url)
    |> add_event(event)
    |> URI.encode_query()

  url <> "?" <> query
end

The tracker responds with a “text/plain” document consisting of a bencoded dictionary with the following keys:

  • failure reason: If present, then no other keys may be present. The value is a human-readable error message as to why the request failed (string).
  • warning message: (optional) Similar to failure reason, but the response still gets processed normally. The warning message is shown just like an error.
  • interval: Interval in seconds that the client should wait between sending requests to the tracker
  • min interval: (optional) Minimum announce interval. If present clients must not reannounce more frequently than this.
  • tracker id: A string that the client should send back on its next announcements.
  • complete: Number of peers with the entire file, i.e. seeders (integer)
  • incomplete: Number of non-seeder peers, aka leechers (integer)
  • peers: The peers value will be a string consisting of multiples of 6 bytes. The first 4 bytes are the IP address and the last 2 bytes are the port number. All in network (big-endian) notation.

Our Debian tracker responds with:

d
  8:interval
    i900e
  5:peers
    300:<binary blob of peer IPs and ports>
  6:peers6
    0:
e

We parse the binary blob of peer list into a list of {ip, port} tuple and add to our list of previously known peers if we have any:

defp parse_peers(peers, peer_list \\ [])

defp parse_peers(<<a, b, c, d, port::16, rest::binary>>, peers) do
  parse_peers(rest, [{{a, b, c, d}, port} | peers])
end

defp parse_peers(<<>>, peers), do: peers

Communicating with peers (Peer Wire Protocol) Link to heading

Once we have a list of peers, we can connect to them and start sharing pieces of the torrent. The peer wire protocol consists of an initial handshake. After that, peers communicate via an exchange of length-prefixed messages.

Connecting to a peer:

{:ok, socket} = :gen_tcp.connect(ip, port, [:binary, active: false])

Handshake Link to heading

The handshake is a required message and must be the first message transmitted by the client. The handshake is 49 + length(pstr) bytes long.

handshake: <pstrlen><pstr><reserved><info_hash><peer_id>
  • pstrlen: string length of <pstr>, as a single raw byte
  • pstr: string identifier of the protocol, “BitTorrent protocol” in our case
  • reserved: eight (8) reserved bytes for extensions, that we won’t implement
  • info_hash: 20-byte SHA1 hash of the info dictionary in the metainfo file
  • peer_id: 20-byte string used as a unique ID for the client.

We construct the handshake like this:

<<byte_size(pstr)::size(8), pstr::bytes, 0::size(64), info_hash::bytes, peer_id::bytes>>

After sending the handshake message, the peer should reply with a similar message, with the same infohash so we know that we are talking about the same torrent.

defp complete_handshake(socket, info_hash) do
  with {:ok, <<len::size(8)>>} <- :gen_tcp.recv(socket, 1),
       {:ok, pstr} <- :gen_tcp.recv(socket, len),
       {:ok, _reserved} <- :gen_tcp.recv(socket, 8),
       {:ok, info_hash_recv} <- :gen_tcp.recv(socket, 20),
       {:ok, peer_id} <- :gen_tcp.recv(socket, 20) do

    case {pstr, info_hash_recv} do
      {^pstr, ^info_hash} -> {:ok, peer_id}
      _ -> :error
    end
  end
end

Client State Link to heading

A client must maintain state information for each connection that it has with a remote peer:

  • choked: Whether or not the remote peer has choked this client. When a peer chokes the client, it is a notification that no requests will be answered until the client is unchoked.
  • interested: Whether or not the remote peer is interested in something this client has to offer. This is a notification that the remote peer will begin requesting blocks when the client unchokes them.

The client also needs to keep track if it’s interested or not in the remote peer, and if it has the remote peer choked or unchoked. So, the real list looks something like this:

  • am_choking: This client is choking the peer
  • am_interested: This client is interested in the peer
  • peer_choking: Peer is choking this client
  • peer_interested: Peer is interested in this client

Client connections start as “choked” and “not interested”:

  • am_choking = 1
  • am_interested = 0
  • peer_choking = 1
  • peer_interested = 0

A block is downloaded by the client when the client is interested in a peer, and that peer is not choking the client. A block is uploaded by a client when the client is not choking a peer, and that peer is interested in the client.

Messages Link to heading

All the remaining messages in the protocol take the following form:

message: <length prefix><message id><optional payload>

The length prefix is a four-byte value. message id is a single byte. The payload is message dependent.

<len = 0000>                                 : keep-alive
<len = 0001>  <id = 0>                       : choke
<len = 0001>  <id = 1>                       : unchoke
<len = 0001>  <id = 2>                       : interested
<len = 0001>  <id = 3>                       : not interested
<len = 0005>  <id = 4><piece index>          : have
<len = 0001+X><id = 5><bitfield>             : bitfield
<len = 0013>  <id = 6><index><begin><length> : request
<len = 0009+X><id = 7><index><begin><block>  : piece
<len = 0013>  <id = 8><index><begin><length> : cancel

To receive a message, we read the initial four bytes to know the length of the message. If it’s not a keep-alive message, we read the message id next and handle the rest of the message based on the id.

defp receive_message(%{socket: socket} = state) do
  case receive_msg(:gen_tcp.recv(socket, 4, 1_000), socket) do
    {:ok, id, len} ->
      case handle_message(id, len, state) do
        {:ok, state} ->
          Process.send_after(self(), :next_tick, 0)
          {:noreply, state}

        {:downloading, state} ->
          {:noreply, state, {:continue, :downloading}}

        _message ->
          :gen_tcp.close(state.socket)
          {:stop, :normal, state}
      end

    :keep_alive ->
      Process.send_after(self(), :next_tick, 0)
      {:noreply, state}
  end
end

defp receive_msg({:ok, <<0::size(32)>>}, _) do
  :keep_alive
end

defp receive_msg({:ok, <<len::size(32)>>}, socket) do
  with {:ok, <<id::size(8)>>} <- :gen_tcp.recv(socket, 1) do
    {:ok, id, len - 1}
  end
end

Bitfields Link to heading

A bitfield is a byte string, where each bit acts as a flag. For our use, the bitfield tells us which pieces are available to download, or which have already been downloaded. Each bit represents a piece, with 1 indicating the availability and 0 indicating the absence of the piece. The high bit in the first byte corresponds to the piece index 0. Spare bits at the end are set to 0. For our need, we use a MapSet to store the indices of the pieces available and convert to and from an actual bitfield with some helper functions as and when needed.

defp make_mapset(bitfield, index \\ 0, data \\ MapSet.new())

defp make_mapset(<<i::size(1), bitfield::bitstring>>, index, data) do
  data = if i == 1, do: MapSet.put(data, index), else: data
  make_mapset(bitfield, index + 1, data)
end

defp make_mapset(<<>>, _, data) do
  data
end

defp make_bitfield(bitfield, pieces) do
  pieces =
    case rem(pieces, 8) do
      0 -> pieces
      r -> pieces + 8 - r
    end

  0..(pieces - 1)
  |> Stream.map(&if &1 in bitfield, do: 1, else: 0)
  |> Enum.reduce(<<>>, fn i, acc -> <<acc::bitstring, i::size(1)>> end)
end

Since we use a MapSet for storing our bitfield information, a simple MapSet.difference/2 tells us the pieces which can be requested from the peer.

def handle_call({:next_piece, peer_bitfield}, _from, %{bitfield: bitfield} = state) do
  diff = MapSet.difference(peer_bitfield, bitfield) |> MapSet.difference(state.downloading)

  case diff |> Enum.take_random(1) do
    [] ->
      {:reply, {:error, :no_available_piece}, state}

    [index] ->
      size =
        if index == state.num_pieces - 1, do: state.last_piece_length, else: state.piece_length

      downloading = MapSet.put(state.downloading, index)

      {:reply, {:ok, index, size}, %{state | bitfield: bitfield, downloading: downloading}}
  end
end

Putting it all together Link to heading

Now that we have all the building blocks for our torrent client ready, we can start joining those blocks together. A peer worker can ask our TorrentControl process which next piece should be requested, by providing it with the bitfield of the peer.

def handle_info(:next_tick, %{status: :idle, am_interested: true, peer_choking: false} = state) do
  case TorrentControl.next_piece(state.control_pid, state.peer_bitfield) do
    {:ok, index, size} ->
      request(index, 0, size, state)
      data = [0] |> Stream.cycle() |> Enum.take(trunc(:math.ceil(size / 16_384)))
      receive_message(%{state | status: :downloading, piece: {index, data, size}})

    {:error, :no_available_piece} ->
      {:stop, :normal, state}
  end
end

When we request data from peers, we are actually requesting a block. The block size shouldn’t be greater than 16 KB, or there’s a risk of clients rejecting our request. Since a block is much smaller than a piece, we need multiple requests to download a piece from the peer. We can break up the piece into chunks of 16 KB blocks to request from the peer.

defp make_request_data(index, begin, len, data \\ <<>>)

defp make_request_data(_index, _begin, 0, data) do
  data
end

defp make_request_data(index, begin, len, data) do
  req_len = min(len, 16 * 1024)
  data = data <> <<13::32, 6::8, index::32, begin::32, req_len::32>>
  make_request_data(index, begin + req_len, len - req_len, data)
end

We continue saving the blocks in a list as we receive them from the peer as piece messages. Once the complete piece is downloaded, we send the binary data to the FileIOWorker process for the torrent.

# Handles the piece message
defp handle_message(7, len, %{socket: socket, piece: {index, data, size}} = state) do
  with {:ok, <<^index::32, begin::32, block::binary>>} <- :gen_tcp.recv(socket, len) do
    data = List.replace_at(data, div(begin, 16 * 1024), block)
    size = size - byte_size(block)
    {:downloading, %{state | piece: {index, data, size}}}
  end
end

# When all the blocks of a piece are downloaded
def handle_continue(:downloading, %{piece: {index, data, 0}} = state) do
  FileWorker.save_piece(state.file_worker_pid, index, IO.iodata_to_binary(data))
  state = %{state | status: :idle, piece: {nil, nil, nil}}
  Process.send_after(self(), :next_tick, 0)
  {:noreply, state}
end

The file worker computes the SHA-1 hash of the binary data, compares it with the hash of the requested piece, and if it’s a match, writes it to the file on disk. It also notifies the TorrentControl process that the piece has been saved and should not be requested any further from any peer. If the hashes do not match, we mark the piece as failed, and put it back in the queue to be downloaded again.

defp write_piece(index, piece, %{pieces: pieces, info_hash: info_hash} = state) do
  {hash, piece_info} = Map.get(pieces, index)

  case :crypto.hash(:sha, piece) do
    ^hash ->
      FileUtils.write_piece(piece, piece_info)
      bitfield = MapSet.put(state.bitfield, index)
      TorrentControl.notify_saved(state.control_pid, index)
      {:noreply, %{state | bitfield: bitfield}}

    _ ->
      TorrentControl.failed_piece(state.control_pid, index)
      {:noreply, state}
  end
end

When we add a torrent to our client, we spawn a supervisor instance of TorrentSupervisor just to manage this torrent. The supervisor further spawns a Tracker GenServer for communicating with the tracker, a PeerManager Supervisor to spawn PeerWorker GenServers through the PeerPool, and FileIOWorker GenServer to write or read the file on disk. Our supervision tree looks somewhat like this:

Finishing up Link to heading

With a bit of glue code, we can combine this with the Phoenix Framework to display it as a web page using LiveView. Here it is running in its full glory:

To keep it short, I’ve included only a few important snippets here. You can view the full implementation if you’re interested.

This post has been heavily derived from a similar blog post Building a BitTorrent client from the ground up in Go.