Extension to store annotations

We might extend the revfile format in a future version to also store annotations. This is not implemented yet.

In previous versions, the index file identified texts by their SHA-1 digest. This was unsatisfying for two reasons. Firstly it assumes that SHA-1 will not collide, which is not an assumption we wish to make in long-lived files. Secondly for annotations we need to be able to map from file versions back to a revision.

Texts are identified by the name of the revfile and a UUID corresponding to the first revision in which they were first introduced. This means that given a text we can identify which revision it belongs to, and annotations can use the index within the revfile to identify where a region was first introduced.

We cannot identify texts by the integer revision number, because that would limit us to only referring to a file in a particular branch.

I'd like to just use the revision-id, but those are variable-length strings, and I'd like the revfile index to be fixed-length and relatively short. UUIDs can be encoded in binary as only 16 bytes. Perhaps we should just use UUIDs for revisions and be done?

Annotations

Annotations indicate which revision of a file first inserted a line (or region of bytes).

Given a string, we can write annotations on it like so: a sequence of (index, length) pairs, giving the index of the revision which introduced the next run of length bytes. The sum of the lengths must equal the length of the string. For text files the regions will typically fall on line breaks. This can be transformed in memory to other structures, such as a list of (index, content) pairs.

When a line was inserted from a merge revision then the annotation for that line should still be the source in the merged branch, rather than just being the revision in which the merge took place.

They can cheaply be calculated when inserting a new text, but are expensive to calculate after the fact because that requires searching back through all previous text and all texts which were merged in. It therefore seems sensible to calculate them once and store them.

To do this we need two operators which update an existing annotated file:

  1. Given an annotated file and a working text, update the annotation to mark regions inserted in the working file as new in this revision.
  2. Given two annotated files, merge them to produce an annotated result. When there are conflicts, both texts should be included and annotated.

These may be repeated: after a merge there may be another merge, or there may be manual fixups or conflict resolutions.

So what we require is given a diff or a diff3 between two files, map the regions of bytes changed into corresponding updates to the origin annotations.

Annotations can also be delta-compressed; we only need to add new annotation data when there is a text insertion.

(It is possible in a merge to have a change of annotation when there is no text change, though this seems unlikely. This can still be represented as a "pointless" delta, plus an update to the annotations.)

Index file

In a proposed (not implemented) storage with annotations, the index file is a series of fixed-length records:

byte[16]     UUID of revision
byte[20]     SHA-1 of expanded text (as binary, not hex)
uint32       flags: 1=zlib compressed
uint32       sequence number this is based on, or -1 for full text
uint32       offset in text file of start
uint32       length of compressed delta in text file
uint32[3]    reserved

Total 64 bytes.

The header is also 64 bytes, for tidyness and easy calculation. For this format the header must be bzr revfile v2\n padded with \xff to 64 bytes.

The first record after the header is index 0. A record's base index must be less than its own index.

The SHA-1 is redundant with the inventory but stored just as a check on the compression methods and so that the file can be validated without reference to any other information.

Each byte in the text file should be included by at most one delta.

Deltas

In a proposed (not implemented) storage with annotations, deltas to the text are stored as a series of variable-length records:

uint32        idx
uint32        m
uint32        n
uint32        l
byte[l]       new

This describes a change originally introduced in the revision described by idx in the index.

This indicates that the region [m:n] of the input file should be replaced by the text new. If m==n this is a pure insertion of l bytes. If l==0 this is a pure deletion of (n-m) bytes.

Open issues