The following is an excerpt from a talk I gave in September at the COST 32 meeting in Antwerp, Belgium. It explains at a high level the workings of Juxta’s collation procedure.
Users of UNIX based operating systems may be familiar with the operating system command diff. Certainly most computer programmers have used some version of this program. Using diff, programmers can identify variations between versions of source code files to detect modifications by themselves or other programmers. They can detect additions, deletions and replacements of code within a source file. Diff has been long used by programmers for this purpose. There are many implementations and variations but the essential functionality is always the same: line-by-line comparison of text files. If humanities researchers also wish to conduct a line-by-line comparison of text files, why not employ one of the many implementations of this tool?
Unfortunately, a Difference Algorithm geared for use by programmers is not satisfactory for comparing variations in natural language texts. The reason is that the text of a computer program is structured differently than that of prose text. Computer programs are written in lines and hard carriage returns (inserted into text by hitting the ‘Return’ key on a computer keyboard) usually punctuate the end of a single line of code. Comparing one line to another is sufficient for a programmer to determine what changed. But in natural language texts, the positioning of hard carriage returns is less predictable. Poetry is a best-case scenario, but once you start looking at prose you find hard carriage returns only at the ends of entire paragraphs. Having the algorithm report that an entire paragraph changed when really only one word or perhaps just a comma was omitted is not helpful. There may be additions and deletions within the paragraph, but all we can detect is broadly that something changed.
Juxta’s Difference Algorithm is tuned to the needs of collating natural language. It identifies differences in the text down to the word and punctuation level. It is highly accurate over large spans of prose and has been tested on Mark Twain, medieval Brand, Hamlet and of course the poetry of Dante Gabriel Rossetti.
The Difference Algorithm employed by Juxta was first described in the April 1978 edition of the Journal of the Association for Computing Machinery in an article entitled: “A Technique for Isolating Difference Between Files”. Donald C. Lindsay of Carnegie Mellon University later implemented this algorithm in the C programming language. This version worked on a pair of text files and operated on a line-by-line basis. A few bug fixes and improvements were made to this version and for years it floated through the pre-Web Internet. Then in 1997 Ian F. Darwin took the C version of this code and translated it to Java, keeping the same functionality with an eye toward make the source more “pedagogic”. This was the version received by Nick Laiacona at Applied Research in Patacriticism at the University of Virginia in 2005.
The algorithm as received had little separation of concerns within the code and was written more as a demonstration than as a reusable software component. For example, file format and the visual output code were built into the same methods that performed the actual algorithm. One of the first challenges of this project was to untangle all of this (without breaking anything!) so that the code could be a reusable module within the context of Juxta. The diff algorithm itself is now a standalone package which can be loaded into any Java based software by a software programmer, regardless of whether that software looks or feel like Juxta.
A number of changes were made to the structure and functionality of the code at this point. First, core sub-routines were named, encapsulated and separated into a structure described below. The original version constructed a symbol table and then promptly destroyed the table’s contents through the workings of the code. The symbol table is computationally expensive to build, so this was an inefficient way of operating when comparing many copies of the same text. The code was changed to make it non-destructive and thus reusable. Also, the original version was only capable of producing a plain text report as output. This was changed so that the output is now a data structure, which can be manipulated by the calling program in any way it sees fit. The code was also made so that it can operate on texts loaded in memory, which divorces the core algorithm from concerns over file format. Perhaps most importantly, the algorithm was made to operate on tokens composed of individual words and punctuation marks instead of entire lines of text.
The Difference Algorithm code is composed with the following discrete objects. These objects are chained together and the output of one is the input of the next.
Symbol Table – The symbol table takes the raw texts and turns them into a hash table of symbols with indices stored to either one or both of the texts. Every word’s disposition is established: does it appear once in the base, once in the witness, once in each, or more than once in each. Symbols that appear once are used as landmarks in the following step.
Correlator – Takes the symbol table scans from the points where both match to establish where there are “blocks” of symbols which are the same between both files.
Collector – The collector then takes this series of blocks and “interprets” them as one of three difference types: addition, deletion, or change. For example, a block of symbols that appears in the base but not in the witness would be considered a deletion. Each difference is captured and added to the Difference Set.
Difference Set – This is the final output of the process, which contains a complete list of all the differences encountered, their location and type.
During the development of this tool we used Microsoft Corporation’s MS Word Compare Documents function as a baseline. This algorithm is excellent, but unfortunately is proprietary, closed-source; only works on Word documents and only collates two documents at a time. Nevertheless, it is very accurate and we used it for checking our work and also as inspiration for some of the visualizations found in Juxta.
When we first ran the difference algorithm, we found that we were doing pretty good but not as good as Microsoft. Sometimes the algorithm was able to isolate the change area in a very small span of text but at other times the spans were too wide. While the algorithm was correct that a change did occur within the paragraph or multi-line area that it had marked, humanities researchers at ARP wanted it to be more specific as to what exactly had changed.
The diff algorithm’s strategy involves finding words that are unique within the texts. This has its limitations because it depends on the richness of the vocabulary employed in the target text. Also, if you write enough in any language in which spelling is regularized, you end up using the same words over and over again. The algorithm code is capable of computing this ratio so that it can actually report a confidence level in its own results. While this is important, it is better to do better.
At this point we introduced a multi-pass approach, which gave us a significant improvement in accuracy. This approach is easy to understand. Remember that there are three types of differences that we identify: additions, deletions and changes. The definition of the first two is pretty obvious; the passage is either in one text or another. What are changes? Changes are areas where we are not exactly sure what happened. We know that the text changed and exactly where and for how long. But that’s it. We can take these areas of ambiguity and resolve them by running the whole diff algorithm again on these blocks of change. This treats these changed passages as texts in their own right for the purposes of collation and greatly improves the symbol/length ratio in that moment. This change brought us to word level accuracy, which was our goal.