The Difference Algorithm

The following is an excerpt from a talk I gave in September at the COST 32 meeting in Antwerp, Belgium. It explains at a high level the workings of Juxta’s collation procedure.

Users of UNIX-based operating systems may be familiar with the operating system command diff. Certainly most computer programmers have used some version of this program. Using diff, programmers can identify variations between versions of source code files to detect modifications made by themselves or by other programmers. They can detect additions, deletions and replacements of code within a source file. Diff has long been used by programmers for this purpose. There are many implementations and variations, but the essential functionality is always the same: line-by-line comparison of text files. If humanities researchers also wish to conduct a line-by-line comparison of text files, why not employ one of the many implementations of this tool?

Unfortunately, a Difference Algorithm geared for use by programmers is not satisfactory for comparing variations in natural language texts. The reason is that the text of a computer program is structured differently from prose. Computer programs are written in lines, and hard carriage returns (inserted into text by hitting the ‘Return’ key on a computer keyboard) usually punctuate the end of a single line of code. Comparing one line to another is sufficient for a programmer to determine what changed. But in natural language texts, the positioning of hard carriage returns is less predictable. Poetry is a best-case scenario, but once you start looking at prose you find hard carriage returns only at the ends of entire paragraphs. Having the algorithm report that an entire paragraph changed when really only one word or perhaps just a comma was omitted is not helpful. There may be additions and deletions within the paragraph, but all a line-based comparison can detect is, broadly, that something changed.
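
To make that contrast concrete, here is a minimal, illustrative sketch in Java. It is not Juxta code; it simply shows that when a whole prose paragraph is one "line", a line-level comparison can only say the paragraph changed, while a word-level comparison narrows the change to a single token.

    // Illustrative only, not Juxta code: the granularity problem with line-based diff.
    public class LineVersusWordDiff {
        public static void main(String[] args) {
            // A prose paragraph ends in a single hard return, so to a line-based
            // diff the whole paragraph is one comparison unit.
            String base    = "It was the best of times, it was the worst of times.";
            String witness = "It was the best of times; it was the worst of times.";

            // Line-level view: all we can say is that the paragraph changed.
            System.out.println(base.equals(witness) ? "no change" : "whole paragraph changed");

            // Word-level view: the change is narrowed to a single token.
            String[] a = base.split("\\s+");
            String[] b = witness.split("\\s+");
            for (int i = 0; i < Math.min(a.length, b.length); i++) {
                if (!a[i].equals(b[i])) {
                    System.out.println("token " + i + ": '" + a[i] + "' -> '" + b[i] + "'");
                }
            }
        }
    }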

Juxta’s Difference Algorithm is tuned to the needs of collating natural language. It identifies differences in the text down to the word and punctuation level. It is highly accurate over large spans of prose and has been tested on Mark Twain, medieval Brand, Hamlet and of course the poetry of Dante Gabriel Rossetti.

The Difference Algorithm employed by Juxta was first described in the April 1978 issue of Communications of the ACM in an article entitled “A Technique for Isolating Differences Between Files”. Donald C. Lindsay of Carnegie Mellon University later implemented this algorithm in the C programming language. This version worked on a pair of text files and operated on a line-by-line basis. A few bug fixes and improvements were made to this version, and for years it floated through the pre-Web Internet. Then in 1997 Ian F. Darwin took the C version of this code and translated it to Java, keeping the same functionality with an eye toward making the source more “pedagogic”. This was the version received by Nick Laiacona at Applied Research in Patacriticism at the University of Virginia in 2005.

The algorithm as received had little separation of concerns within the code and was written more as a demonstration than as a reusable software component. For example, file-format handling and the visual output code were built into the same methods that performed the actual algorithm. One of the first challenges of this project was to untangle all of this (without breaking anything!) so that the code could be a reusable module within the context of Juxta. The diff algorithm itself is now a standalone package which a programmer can load into any Java-based software, regardless of whether that software looks or feels like Juxta.

A number of changes were made to the structure and functionality of the code at this point. First, core sub-routines were named, encapsulated and separated into the structure described below. The original version constructed a symbol table and then promptly destroyed the table’s contents in the course of its work. The symbol table is computationally expensive to build, so this was an inefficient way of operating when comparing many copies of the same text. The code was changed to make the table non-destructive and thus reusable. Also, the original version was only capable of producing a plain text report as output. This was changed so that the output is now a data structure, which can be manipulated by the calling program in any way it sees fit. The code was also changed to operate on texts loaded in memory, which divorces the core algorithm from concerns over file format. Perhaps most importantly, the algorithm was made to operate on tokens composed of individual words and punctuation marks instead of entire lines of text.
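
As an illustration of that last change, here is a hedged sketch of word-and-punctuation tokenization in Java. The class name and the regular expression are assumptions made for this example, not Juxta’s actual tokenizer; the point is only that each word and each punctuation mark becomes its own comparison unit.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Illustrative tokenizer: every word and every punctuation mark is a separate token.
    // The pattern below is an assumption for this sketch, not Juxta's actual rule.
    public class WordTokenizer {
        private static final Pattern TOKEN = Pattern.compile("\\w+|\\p{Punct}");

        public static List<String> tokenize(String text) {
            List<String> tokens = new ArrayList<>();
            Matcher m = TOKEN.matcher(text);
            while (m.find()) {
                tokens.add(m.group());
            }
            return tokens;
        }

        public static void main(String[] args) {
            // Prints: [Call, me, Ishmael, .]
            System.out.println(tokenize("Call me Ishmael."));
        }
    }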

The Difference Algorithm code is composed of the following discrete objects. These objects are chained together, and the output of one is the input of the next.

Symbol Table – The symbol table takes the raw texts and turns them into a hash table of symbols with indices stored for either one or both of the texts. Every word’s disposition is established: does it appear once in the base, once in the witness, once in each, or more than once? Symbols that appear exactly once in each text are used as landmarks in the following step (see the sketch after this list).

Correlator – Takes the symbol table and scans outward from the landmark points where both texts match to establish “blocks” of symbols that are the same in both files.

Collector – The collector then takes this series of blocks and “interprets” them as one of three difference types: addition, deletion, or change. For example, a block of symbols that appears in the base but not in the witness would be considered a deletion. Each difference is captured and added to the Difference Set.

Difference Set – This is the final output of the process, which contains a complete list of all the differences encountered, their location and type.
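
To make the symbol table and landmark steps concrete, here is a compact sketch in Java. The class and field names are hypothetical (this is not Juxta’s API), and it only identifies the once-in-each landmarks; the correlator’s block growing and the collector’s interpretation are omitted.

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Sketch of the symbol table and landmark idea; names are illustrative only.
    public class SymbolTableSketch {

        static class Symbol {
            int baseCount, witnessCount;       // occurrences of the token in each text
            int basePos = -1, witnessPos = -1; // last position seen in each text
        }

        public static void main(String[] args) {
            List<String> base    = Arrays.asList("the cat sat on the mat".split(" "));
            List<String> witness = Arrays.asList("the cat slept on the mat".split(" "));

            Map<String, Symbol> table = new HashMap<>();
            for (int i = 0; i < base.size(); i++) {
                Symbol s = table.computeIfAbsent(base.get(i), k -> new Symbol());
                s.baseCount++;
                s.basePos = i;
            }
            for (int j = 0; j < witness.size(); j++) {
                Symbol s = table.computeIfAbsent(witness.get(j), k -> new Symbol());
                s.witnessCount++;
                s.witnessPos = j;
            }

            // Landmarks: symbols occurring exactly once in each text give a reliable
            // correspondence between a base position and a witness position.
            for (Map.Entry<String, Symbol> e : table.entrySet()) {
                Symbol s = e.getValue();
                if (s.baseCount == 1 && s.witnessCount == 1) {
                    System.out.printf("landmark '%s': base %d <-> witness %d%n",
                            e.getKey(), s.basePos, s.witnessPos);
                }
            }
        }
    }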

During the development of this tool we used Microsoft Corporation’s MS Word Compare Documents function as a baseline. This algorithm is excellent, but unfortunately it is proprietary and closed-source, only works on Word documents, and only collates two documents at a time. Nevertheless, it is very accurate, and we used it for checking our work and also as inspiration for some of the visualizations found in Juxta.

When we first ran the difference algorithm, we found that it did fairly well, but not as well as Microsoft’s. Sometimes the algorithm was able to isolate the change within a very small span of text, but at other times the spans were too wide. While the algorithm was correct that a change did occur within the paragraph or multi-line area that it had marked, the humanities researchers at ARP wanted it to be more specific about what exactly had changed.

The diff algorithm’s strategy involves finding words that are unique within the texts. This has its limitations because it depends on the richness of the vocabulary employed in the target text. Also, if you write enough in any language in which spelling is regularized, you end up using the same words over and over again. The algorithm code is capable of computing the ratio of unique symbols to text length, so that it can report a confidence level in its own results. Reporting confidence is useful, but it is better to improve the accuracy itself.
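
The exact formula Juxta uses is not given here, so the following is only a hedged sketch of the idea: the fewer distinct tokens a passage contains relative to its length, the fewer once-only landmark words the correlation step can rely on, and so the lower the confidence.

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Hedged sketch, not Juxta's actual formula: the ratio of distinct tokens to
    // total tokens. A low ratio means few once-only landmark words and therefore
    // lower confidence in the correlation.
    public class VocabularyRichness {

        public static double symbolToLengthRatio(List<String> tokens) {
            Set<String> distinct = new HashSet<>(tokens);
            return tokens.isEmpty() ? 0.0 : (double) distinct.size() / tokens.size();
        }

        public static void main(String[] args) {
            List<String> prose = Arrays.asList(
                    "the cat sat on the mat and the dog sat too".split(" "));
            System.out.printf("symbol/length ratio: %.2f%n", symbolToLengthRatio(prose));
        }
    }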

At this point we introduced a multi-pass approach, which gave us a significant improvement in accuracy. This approach is easy to understand. Remember that there are three types of differences that we identify: additions, deletions and changes. The definition of the first two is fairly obvious: the passage is present in one text but not the other. What are changes? Changes are areas where we are not exactly sure what happened. We know that the text changed, and exactly where and for how long, but that’s it. We can take these areas of ambiguity and resolve them by running the whole diff algorithm again on these blocks of change. This treats the changed passages as texts in their own right for the purposes of collation and greatly improves the symbol/length ratio within that local span. This change brought us to word-level accuracy, which was our goal.
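
Here is a hedged sketch of that multi-pass refinement in Java. The type and method names are hypothetical, not Juxta’s API: a first pass yields addition, deletion and change blocks, and each ambiguous change block is then diffed again on its own.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.BiFunction;

    // Hedged sketch of the multi-pass refinement; names are illustrative only.
    public class MultiPass {

        enum Kind { ADDITION, DELETION, CHANGE }

        record Difference(Kind kind, List<String> baseTokens, List<String> witnessTokens) {}

        static List<Difference> refine(List<Difference> firstPass,
                BiFunction<List<String>, List<String>, List<Difference>> diff) {
            List<Difference> refined = new ArrayList<>();
            for (Difference d : firstPass) {
                if (d.kind() == Kind.CHANGE) {
                    // Second pass: treat the changed passages as texts in their own right.
                    refined.addAll(diff.apply(d.baseTokens(), d.witnessTokens()));
                } else {
                    refined.add(d); // additions and deletions are already unambiguous
                }
            }
            return refined;
        }

        public static void main(String[] args) {
            // Toy second-pass diff standing in for the word-level algorithm above:
            // it reports a CHANGE for each differing token pair.
            BiFunction<List<String>, List<String>, List<Difference>> tokenDiff = (b, w) -> {
                List<Difference> out = new ArrayList<>();
                for (int i = 0; i < Math.min(b.size(), w.size()); i++) {
                    if (!b.get(i).equals(w.get(i))) {
                        out.add(new Difference(Kind.CHANGE, List.of(b.get(i)), List.of(w.get(i))));
                    }
                }
                return out;
            };

            Difference coarse = new Difference(Kind.CHANGE,
                    List.of("the", "quick", "brown", "fox"),
                    List.of("the", "quick", "red", "fox"));
            System.out.println(refine(List.of(coarse), tokenDiff));
        }
    }

In this sketch the second pass is supplied as a function so the example stays self-contained; in practice it is simply the same word-level algorithm applied recursively to each changed span.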

Work Flows and Wish Lists: Reflections on Juxta as an Editorial Tool

I have had the opportunity to use Juxta Commons for several editorial projects, and while taking a breath between a Juxta-intensive term project last semester and my Juxta-intensive MA thesis this semester, I would like to offer a few thoughts on Juxta as an editorial tool.

For my term project for Jerome McGann’s American Historiography class last semester, I conducted a collation of Martin R. Delany’s novel, Blake, or, The Huts of America, one of the earliest African American novels published in the United States. Little did I know that my exploration would conduct me into an adventure as much technological as textual, but when Professor McGann recommended I use Juxta for conducting the collation and displaying the results, that is exactly what happened. I input my texts into Juxta Commons, collated them, and produced HTML texts of the individual chapters, each with an apparatus of textual variants, using Juxta’s Edition Starter. I linked these HTML files together into an easily navigable website to present the results to Professor McGann. I’ll be posting on the intriguing results themselves next week, but in the meantime, they can also be viewed on the website I constructed, hosted by GitHub: Blake Project home.

Juxta helped me enormously in this project. First, it was incredibly useful in helping me clean up my texts. My collation involved an 1859 serialization of the novel, and another serialization in 1861-62. The first, I was able to digitize using OCR; the second, I had to transcribe myself. Anyone who has done OCR work knows that every minute of scanning leads to (in my case) an average of five or ten minutes of cleaning up OCR errors. I also had my own transcription errors to catch and correct. By checking Juxta’s highlighted variants, I was able to—relatively quickly—fix the errors and produce reliable texts. Secondly, once collated, I had the results stored in Juxta Commons; I did not have to write down in a collation chart every variant to avoid losing that information, as I would if I were machine- or sight-collating. Juxta’s heat-map display allows the editor to see variants in-line, as well, which saves an immense amount of time when it comes to analyzing results: you do not have to reference page and line numbers to see the context of the variants. Lastly, Juxta enabled me to organize a large amount of text in individual collation sets—one for each chapter. I was able to jump between chapters and view their variants easily.

As helpful as Juxta was, however, I caution all those new to digital collation that no tool can perfectly collate or create an apparatus from an imperfect text. In this respect, there is still no replacement for human discretion, which is, ultimately, a good thing. For instance, while the Juxta user can turn off punctuation variants in the display, if the user does want punctuation and the punctuation is not spaced exactly the same in both witnesses, the program highlights this anomalous spacing. Thus, when the 1859 text reads

‘ Henry, wat…

and the 1861-62 text reads

‘Henry, wat…

Juxta will show that punctuation spacing as a variant, while the human editor knows it is the result of typesetting idiosyncrasies rather than a meaningful variant. Such variants can carry over into the Juxta Edition Builder as well, resulting in meaningless apparatus entries. For these reasons, you must make your texts perfect to get a perfect Juxta heat map, especially before using the Edition Starter; otherwise, you’ll need to fix the spacing in Juxta and output another apparatus, or edit the text or HTML files to remove undesirable entries.

Spacing issues can also result in disjointed apparatus entries, as occurred in my apparatus for Chapter XI in the case of the contraction needn’t. Notice how because of the spacing in needn t and need nt, Juxta recognized the two parts of the contraction as two separate variants (lines 130 and 131):

[Screenshot: apparatus entries for “needn’t” split by spacing]

This one variant was broken into two apparatus entries because Juxta recognized it as two words. There is really no way of rectifying this problem except by checking and editing the text and HTML apparatuses after the fact.

I mean simply to caution scholars going into this sort of work so that they can better estimate the time required for digital collation. This being my first major digital collation project, I averaged about two hours per chapter (chapters ranging between 1000 and 4000 words each) to transcribe the 61-62 text and then collate both witnesses in Juxta. I then needed an extra one or two hours per chapter to correct OCR and transcription errors.

While it did take me time to clean up the digital texts so that Juxta could do its job most efficiently, in the end, Juxta certainly saved me time—time I would have spent keeping collation records, constructing an apparatus, and creating the HTML files (as I wanted to do a digital presentation). I would be remiss, however, if I did not recommend a few improvements and future directions.

As useful as Juxta is, it nevertheless has limitations. One difficulty I had while cleaning my texts was that I could not correct them while viewing the collation sets; I had, rather, to open the witnesses in separate windows.

[Screenshot: witnesses opened in separate windows]

The ability to edit the witnesses in the collation set directly would make correction of digitization errors much easier. This is not a serious impediment, though, and is easily dealt with in the manner I mentioned. The Juxta download does allow this in a limited capacity: the user can open a witness in the “Source” field below the collation visualization, then click “Edit” to enable editing in that screen. However, while editing is turned on for the “Source,” you cannot scroll in the visualization to navigate to the next error that may need to be corrected.

A more important limitation is the fact that the Edition Starter does not allow for the creation of eclectic texts, texts constructed with readings from multiple witnesses; rather, the user can only select one witness as the “base text,” and all readings in the edition are from that base text.

[Screenshot: the Edition Starter]

Most scholarly editors, however, likely will need to adopt readings from different witnesses at some point in the preparation of their editions. Juxta’s developers need to mastermind a way of selecting which reading to adopt per variant; selected readings would then be adopted in the text in Edition Starter. For the sake of visualizing, I did some screenshot melding in Paint of what this function might look like:

[Mockup: proposed interface for adopting readings per variant]

Currently, an editor wishing to use the Edition Starter to construct an edition would need to select either the copy-text or the text with the most adopted readings as the base text. The editor would then need to adopt readings from other witnesses by editing the output DOCX or HTML files.

I do not know the intricacies of the code which runs Juxta. I looked at it on GitHub, but, alas! my very elementary coding knowledge was completely inadequate to the task. I intend to delve more as my expertise improves, and in the meantime, I encourage all the truly code-savvy scholars out there to look at the code and consider this problem. In my opinion, this is the one hurdle which, once overcome, would make Juxta the optimal choice as an edition-preparation tool, not just a collation tool.

Another feature which would be fantastic to include eventually would be a way of digitally categorizing variants: accidental versus substantive; printer errors, editor corrections, or author revisions; etc. Then, an option to adopt all substantives from text A, for instance, would perhaps leave nothing to be desired by the digitally inclined textual editor. I am excited about Juxta. I am amazed by what it can do and exhilarated by what it may yet be capable of, and taking its limitations with its vast benefits, I will continue to use it for all future editorial projects.


Stephanie Kingsley is a second-year English MA student specializing in 19th-century American literature, textual studies, and digital humanities. She is one of this year’s Praxis Fellows [see Praxis blogs] and Rare Book School Fellows. For more information, visit http://stephanie-kingsley.github.io/, and remember to watch for Ms. Kingsley’s post next week on the results of her collation of Delany’s Blake.

Re-visualizing in Juxta: Newsdiffs.org

The Juxta R&D team recently did a demo of Juxta Commons at a DH Day conference at NC State University, and one of the attendees brought the site NewsDiffs.org to our attention. It’s a great resource, tracking changes to online, “published” articles from some of the largest media outlets out there. But, while NewsDiffs brings together a bunch of different versions of these online articles, their visualizations are only helpful for two versions at a time.

As an experiment, I took several versions from this example and plugged them into Juxta Commons. When combined in the heat map, the results were truly surprising.


In this example, the newest version (captured on November 6, 2012) is the base witness, with the previous revisions made on the day of the article’s release included in the set. Just imagine: readers visiting the New York Times article at 11:15am would have read a very different set of opening paragraphs than those checking in at 11:45am.

 

 

A Preview of Juxta Commons

The NINES R&D team is happy to announce a new phase of testing for Juxta online: Juxta Commons. We’re entering our final phase of intensive testing on this new site for using Juxta on the web, which breaks down the processes of the desktop application so you always have access to your raw source files and your witnesses, in addition to your comparison sets. We’ve even added more ways to work with XML, as well as an option to import and export files encoded in TEI Parallel Segmentation.

We have invited a group of scholars and users to try out Juxta Commons for the next two months, and share their experiences online. They’ll be exploring new ways to add source files, filter XML content and browse our newly-updated visualizations, previewed in the gallery above.

If you would like to be a part of this group of testers, please leave a comment below, and we’ll get in touch with you.

Beta-release of Juxta includes online sharing

Calling all beta testers!

Over the past few months, NINES and the developers of Juxta have been busy adapting the application for use on the web. In order to expand our testing capabilities, we’re releasing a version of the desktop client that offers users the ability to share comparison sets online.

If you have any sets of witnesses to a particular work that you would like to collate and share, we invite you to sign up and download the beta version to try out some of our online features. Please keep in mind that this is a trial version of the web-service, and may be subject to changes and updates over the next few months. Joining us now ensures that your feedback will make the full release of the software better than we could manage in-house.

 Please help us make Juxta better!

Working with non-Roman alphabets in Juxta

Now that Juxta 1.3 has been refined and released, the development team at NINES has been discussing new directions for the software. First and foremost is the adaptation of Juxta’s collating power for texts in languages other than English. Comparisons of texts in French and Italian work pretty well, but we’re still investigating the necessary diacritics to make such operations more exact. However, it seems that scholars working with non-Roman alphabets have been left out of the conversation.

Do any Juxta users out there have any experiences with foreign language collation to share with us?

server hiccups

For all those who had trouble accessing the site this week, we’re happy to announce it’s up and running again! We apologize for the delay and encourage you to access the manual and software download pages once more.

Juxta 1.3 Released

Juxta 1.3 is now available for download here. It has the following new features:

1) Search over all documents.

[Screenshot: Juxta search]
A search box has been added to the toolbar, making it possible to find instances of a word or phrase within all documents in the comparison set. Those results are listed in the Search pane at the bottom of the screen (see image above). Clicking on a line in that pane will display the document and the search results. Note that Juxta will remember the last searches that were performed and show them in the search drop down list.

2) Line numbers appear for the witness and base texts.

[Screenshot: line numbers]

Now, when the “toggle line numbers” menu item is selected, line numbers appear alongside the witness text, in addition to those corresponding to the base text.

3) “Moves”: the ability to correlate similar passages that are differently located in two documents.

[Screenshot: the Moves feature]

The Passages feature from the last version has been reworked into the new “Moves” feature. In the side-by-side collation view, the user may select text in both the base and the witness documents representing a passage identified as having moved (1). The move button (2) will become enabled at that point.

[Screenshot: a completed move]
Click the move button to create the move. You will see an outline of the passages (3) and a line connecting them, with an entry made in the Moves pane (4). Clicking the entry brings the move into view.

Altogether, these features represent a significant improvement to Juxta as a textual collation tool. Download it and give it a try today!

Juxta 1.2.2 Released

Juxta 1.2.2 is now available for download. The major new features in this release are an improved fragment selection mechanism and the ability to easily preview files before collating them. This functionality is accessed via the “Files” tab on the left-hand panel, depicted below.

[Screenshot: juxta-frag1.jpg]
Clicking on the “Files” tab brings up a tree of the files in the currently selected base directory. Clicking on the file icon allows the scholar to select a directory from which to select files for collation.

[Screenshot: juxta-frag2.jpg]
Double clicking on files with a “txt” or “xml” extension opens them in a preview mode. The scholar can then choose to import the entire file into the collation or to highlight a fragment and pull just the highlighted fragment into the collation. Fragments carry with them the metadata and lineation from the source text, if any. This new functionality replaces the old fragment selection mode with a more integrated solution.