Anastasia and Collate Blog

February 6, 2007

The History of Collate

Filed under: History, Anastasia: finding another — @ 9:29 am

Collate 0 — Collate 1 — Collate 2

There have actually been three versions of Collate, up to now. The very first, Collate 0 if you like, I wrote in Spitbol on the DEC Vax in Oxford between 1986 and 1989. I wrote this to collate 44 manuscripts of the Old Norse narrative sequence Svipdagsmal, which I was editing for my doctoral thesis. I prepared full transcripts of each manuscript on a Macintosh computer, and then transferred them to the Vax (itself, I remember, not so straightforward a task in those days of the floppy disc). I collated the transcripts using the Spitbol program, and created various kinds of output. One of these outputs became the apparatus for the critical edition included in my thesis. Another output was translated into a relational database, which I used to explore the relationships between the manuscripts. To optimize this, information about just what manuscript had what variant in the database was held in a matrix, with rows representing each variant, and columns representing each manuscript. Thus:

1 0 1
1 1 0

showed that manuscripts A and C agree at the first variant (both having variant ‘1’, while B has ‘0’); manuscripts A and B agree at the second variant (both having variant ‘1’, while C has ‘0’). This matrix has a historical importance: this was the data given to the participants in the ‘textual criticism challenge’ of 1991 which established, firstly, that phylogenetic methods were far ahead of any other kinds of analysis in applicability to the analysis of textual traditions, and, secondly, that phylogenetic analysis could prove genuinely useful in establishing historical relations within a textual tradition.
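A minimal sketch of how such a matrix can be read, written in Python rather than Spitbol and not taken from the Collate 0 code. The sigla A, B and C and the function name `agreements` are assumptions made for illustration; the two rows are the ones shown above.

```python
from itertools import combinations

# Rows are variants, columns are manuscripts, as described in the text.
# The sigla are illustrative assumptions.
MANUSCRIPTS = ["A", "B", "C"]
MATRIX = [
    [1, 0, 1],  # first variant: A and C agree
    [1, 1, 0],  # second variant: A and B agree
]


def agreements(matrix, manuscripts):
    """Count, for each pair of manuscripts, the variants where both have the same reading."""
    counts = {}
    for (i, a), (j, b) in combinations(enumerate(manuscripts), 2):
        counts[(a, b)] = sum(1 for row in matrix if row[i] == row[j])
    return counts


if __name__ == "__main__":
    for pair, shared in agreements(MATRIX, MANUSCRIPTS).items():
        print(pair, shared)
```

Pairwise counts of this kind are only the simplest thing one can do with the matrix; the phylogenetic methods mentioned above work from the same rows and columns but are far more sophisticated.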

Collate 0 consisted of around 1200 lines of Spitbol code. Spitbol was (and is: versions of the program are still maintained) a rather beautiful program, built around pattern-matching algorithms. It had some very neat string matching and storage facilities (including a nifty table facility with hash and key tools). You could write functions within it, but by modern standards, its data models were crude: everything was a string, and that was that. Oxford was then a stronghold of Spitbol (and Snobol) programming: Susan Hockey taught a course in Snobol (I think) and I remember many animated discussions with her and with Lou Burnard about what I was trying to do.

Collate 0 established several approaches to collation which I retained in the later versions of Collate, and which indeed will (I think) be part of CollateXML:

  1. Collation should be based on full transcripts of the manuscripts. This seems obvious now; it was less so then
  2. One should collate all the versions at once, at the same time, rather than (say) running many pair-wise comparisons and then melding the many comparisons into one
  3. The text needed to be divided into collateable blocks. This required some system of marking the blocks: I adopted the COCOA system, then used by the Oxford Concordance Program, for this
  4. Other textual features (notably, abbreviation) needed markup
  5. Some kind of regularization facility was needed to filter out ‘spelling’ from ‘substantive’ variation
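The last point, regularization, can be sketched roughly as follows. This is an illustration in Python under stated assumptions, not the mechanism Collate 0 used: the table contents, the sample readings and the function names `regularize` and `classify` are all hypothetical.

```python
# Hypothetical regularization table: manuscript spelling -> regularized form.
REGULARIZATIONS = {
    "catte": "cat",
    "sate": "sat",
}


def regularize(tokens, table):
    """Replace each token by its regularized form, if the table has one."""
    return [table.get(token, token) for token in tokens]


def classify(reading_a, reading_b, table):
    """Return 'identical', 'spelling' or 'substantive' for two readings."""
    tokens_a, tokens_b = reading_a.split(), reading_b.split()
    if tokens_a == tokens_b:
        return "identical"
    # Readings that become equal after regularization differ only in spelling.
    if regularize(tokens_a, table) == regularize(tokens_b, table):
        return "spelling"
    return "substantive"


if __name__ == "__main__":
    print(classify("the catte", "the cat", REGULARIZATIONS))
    print(classify("the cat", "the dog", REGULARIZATIONS))
```

In this sketch everything depends on the table: a difference counts as ‘spelling’ only if the scholar has entered a regularization for it, which is why building the tables took so much of the work.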

Collate 0 was successful in two key ways:

  1. I managed to finish my thesis, and got my doctorate, despite spending countless hours, often in the dead of the night, deep in Oxford University Computing Services on 6 Banbury Road (and briefly in an OUCS annex in South Parks Road) peering at the green symbols on the darkened Vax terminal, and endlessly tinkering with and re-running the Spitbol program
  2. I wrote two articles for Literary and Linguistic Computing about this work. On the strength of these, and with Susan Hockey’s guidance and help, I submitted a grant application to the Leverhulme Trust to carry on this work.

This grant proposal was successful, and in September 1989 I started work on what became Collate 1. Only one person had ever used, and probably could ever use, Collate 0: me. Rather a lot of computer programs, I have since discovered, are only ever used by the person who wrote them (including, indeed, some made with much public money). Our proposal to the Leverhulme Trust specified that our collation tool could be used by many other people. This meant a real graphical user interface, not the command-line tool which Collate 0 was. Indeed, one needed a graphic interface because I was by then convinced (and I still believe) that scholarly collation is an interactive activity. I found that in Collate 0 I spent endless hours manipulating the collation output by tinkering with the program itself, and by compiling complex regularization tables to smooth out idiosyncratic spellings from the tables. This was extremely clumsy. I determined that in Collate 1, we would have the computer make a first guess at the collation for any part of the text, a block at a time. The scholar would examine that collation, and then intervene in a point-and-click way to adjust the collation as needed. For medieval texts, some form of spelling regularization was required. In Collate 0 the regularizations were held in separate files, which were loaded at runtime: so you had to run the collation, look at the results, see what needed to be changed, open and edit the files (with a VI line editor, no easy thing), then reload and run the collation again, and so on. In Collate 1, I wanted to point at what word we wanted regularized to what, and to see the result instantaneously. Similarly, I now knew that any automatic system was going to make decisions about precisely what collated with what which a scholar would find unsatisfactory. Take the collation of ‘a cat’ against ‘cat’. Should we regard this as replacement of one word by a phrase, or as identity of one word (‘cat’) in each source and addition of another word (‘a’) in one source? In Collate 0, such intervention was done in the nastiest possible way: by hardwiring various gotchas into the collation code itself. In Collate 1, this should be done again by some kind of user intervention, working in a graphic user interface.
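The ‘a cat’ against ‘cat’ case can be made concrete with a small sketch. It uses Python's standard `difflib` module as a stand-in for an automatic first guess; this is not Collate's own alignment code, and the function name `word_alignment` is an assumption for illustration.

```python
from difflib import SequenceMatcher


def word_alignment(source_a, source_b):
    """Align two readings word by word and report each difference."""
    a, b = source_a.split(), source_b.split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        (tag, " ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    ]


# The automatic guess: 'cat' is identical in both sources, 'a' is in one only.
automatic = word_alignment("a cat", "cat")

# The other analysis, which a scholar might prefer: one phrase replaced by one word.
as_replacement = [("replace", "a cat", "cat")]

if __name__ == "__main__":
    print(automatic)
    print(as_replacement)
```

Both results describe the same two sources. The program can only pick one of them mechanically, which is the argument for letting the scholar override the first guess.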

This was September 1989 and if you wanted to make a program for personal computers with interactive point-and-click facilities there was only one choice for it: the Macintosh. Microsoft had attempted two versions of Windows up to then, but neither appeared sufficiently stable for a neophyte programmer. By comparison, programmer tools for the Mac were well advanced. Also, I knew Macintosh computers very well, as I had used a succession of Macs for writing my thesis. Apple Computer donated a Macintosh SE (I think) to the project, we purchased a C programming compiler (Lightspeed C, which became Think C quite soon) and we were started. In the early days we did not even have a hard disc. The SE had two floppy disc drives, which made it a truly luxurious machine in those days: you could have the program and some data on one floppy disc drive, and the operating system and other data on another floppy disc drive. Much of the time was spent juggling data and programs between discs, ejecting and inserting disc after disc, sometimes hundreds of times a day (so much so, that someone even adapted the pop-up mechanism from a toaster to automate insertion and removal of discs).

The choice of C meant a complete ground-up rewrite of the program, within a windows/icons/menus/pointer (WIMP!) environment. So Collate 1 began, with the first versions released in 1991. This retained the fundamental features of Collate 0 referred to above (collation by blocks based on full transcripts, basic markup) with newer tools: a ‘live’ collation mode combined with point-and-click adjustment of regularization and setting variants; expanded and more flexible markup, including notation of layout features such as pages, columns, lines and text ornamentation; output formatted for TeX processing using the Edmac macros for complex critical edition layout. In a series of talks in 1990 and 1991 (at the New Chaucer Society conference in Canterbury; the ALLC conference in Phoenix, Arizona; in Austin, Texas; at Georgetown University in Washington; at the Society for Textual Scholarship in New York; especially, at the CHUG meeting in Providence) I described the unfolding Collate, and recruited its first enthusiastic and hopeful users. Some of these users are still with Collate, many years on: Don Reiman and Neil Fraistat incorporated it into the work they did on their Johns Hopkins Shelley edition; hardly a week since has passed without a message (admonitory, exhortatory, or plain friendly) from Michael Stone; and after fifteen years Prue Shaw was finally able in 2006 to publish her edition of Dante’s Monarchia, built with Collate.

Collate 1 established the user interface still basic to the current Collate 2, which has retained all the major features outlined above. Collate 2 also is built on the same C code as Collate 1. There is no ‘clean break’ between Collate 1 and 2 as there is between Collate 0 (written in Spitbol) and Collate 1 (written in C), and as there will be between the current Collate 2 and its successor (which I now think of as CollateXML, and which I now contemplate will be written in Java, ‘now’ being January 2007). However, various developments in the early 1990s led to such a drastic reshaping and enlargement of Collate 1 that I came to think of this as ‘Collate 2’. These developments, in no special order, were:

  1. The onset of the Text Encoding Initiative. Oxford, through Susan Hockey and Lou Burnard (in those days, the Tony Blair and Gordon Brown of UK humanities computing), was the European leader of the TEI. I found myself drawn into the TEI orbit, even becoming the absurdly underqualified chair of the Scholarly Apparatus workgroup (which included Robin Cover, Ian Lancashire, Bob Kraft and Peter Shillingsburg, so you can see how junior I should have felt). I also attended meetings of the primary source transcription workgroup, though for some reason this has never been recognized in the TEI documentation, and I ended up writing almost the whole of the chapters on textual apparatus and transcription encoding in the TEI (though again, this has never been clearly acknowledged). Through the TEI I learnt about SGML, and became completely convinced that structural markup (though not hierarchical markup) is key to useful scholarly work in the digital age.
  2. The appearance of the web. Oxford was one of the very first sites to mount a web server (as early as late 1992, if I recall rightly) and I attended the first web conference, held at CERN in April 1994, when the web was still small enough for a meeting of server administrators to be held under a tree on the lawn outside the CERN lecture halls.
  3. The development of the Canterbury Tales project. In our proposal to the Leverhulme Trust we stated that we would use the manuscripts of the Wife of Bath’s Prologue as test material. Susan Hockey and I did not think very deeply about this choice: we were just looking for something that was not Old Norse (our other choice of test material was the Old Norse Solarljod; this year, finally, my and Carolyne Larrington’s edition of this should appear in the massive new edition of Old Norse skaldic poetry), which was in about the right number of manuscripts, seemed to present interesting problems, and would be fun to work with.
  4. The demands of other Collate users. The key group here was the Institute for New Testament Research, Münster. I first met this group in 1996: in 1997 I started working with them intensively on the Nestle-Aland Greek New Testament, and through them met David Parker and the scholars he was working with in Birmingham.
  5. Collaboration with researchers in evolutionary biology. I had already discovered the power of phylogenetic methods through Robert O’Hara: particularly, his entry to the ‘textual criticism challenge’ in 1991, showing how these methods worked with the Old Norse Svipdagsmal tradition. Robert and I developed this into several articles but were unable to carry it much further. However, in 1996 I met, through Linne Mooney, Chris Howe of the Cambridge University Department of Molecular Biology. As a professional evolutionary biologist, he was able to bring many more resources to this enquiry — particularly, he brought in a series of remarkable individual researchers to the work, each contributing new perspectives.

In different ways, these forced me to refine what Collate did, and to develop new capacities for it, to such an extent that Collate became a new program. The key change was that I came to think that the aim of Collate was not to help scholars prepare print editions, but to help them make electronic editions. This had many consequences. Particularly, it meant that Collate had to prepare materials for inclusion in an electronic edition. This meant first of all SGML (and later, XML and HTML). This meant also extended parsing facilities. I did not go so far as adapting Collate to collate files fully encoded with SGML. Collate now had a body of users with many files encoded in the Collate format and content to go on using that format, and I would have had considerable difficulty persuading them to move over to full SGML. But I did tighten the Collate encoding model to make it closer to SGML, and then added comprehensive facilities to translate Collate encoded files to SGML (and also XML, HTML and other systems). I also folded two full SGML parsers into the program: both Pierre Richard’s YASPMAC and James Clark’s SP. These were used particularly for translating SGML encoded apparatus files into other forms, particularly into NEXUS files for analysis by evolutionary biology programs.
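To give a sense of what a NEXUS file for this kind of data looks like, here is a minimal sketch in Python. It is not Collate's export code and it skips the SGML parsing step entirely: it assumes the variants have already been reduced to a 0/1 matrix with one row per variant and one column per manuscript, and the sigla and the function name `to_nexus` are illustrative assumptions.

```python
def to_nexus(matrix, manuscripts):
    """Write a variant matrix (rows = variants, columns = manuscripts) as a NEXUS data block.

    NEXUS lists one row per taxon, so the matrix is transposed:
    each manuscript becomes a taxon and each variant a character.
    """
    lines = [
        "#NEXUS",
        "BEGIN DATA;",
        f"  DIMENSIONS NTAX={len(manuscripts)} NCHAR={len(matrix)};",
        '  FORMAT DATATYPE=STANDARD SYMBOLS="01" MISSING=?;',
        "  MATRIX",
    ]
    for column, siglum in enumerate(manuscripts):
        states = "".join(str(row[column]) for row in matrix)
        lines.append(f"    {siglum} {states}")
    lines += ["  ;", "END;"]
    return "\n".join(lines)


if __name__ == "__main__":
    # Illustrative data: two variants across three manuscripts.
    print(to_nexus([[1, 0, 1], [1, 1, 0]], ["A", "B", "C"]))
```

The transposition is the point to notice: a manuscript becomes a ‘taxon’ and each place of variation becomes a ‘character’, which is what lets programs written for evolutionary biology read a textual tradition.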

While these extended Collate’s grasp, the requirements of its most demanding users forced it in other directions. One of these demanding users was the Canterbury Tales Project. As we moved onto larger sections of text, and particularly sections where no two manuscripts had the same lines in the same order, I discovered we needed a much more powerful system for dealing with witnesses which had the text blocks in many different orders. ‘Block maps collation’ was, and is, Collate 2’s solution to this. But perhaps the biggest shift of all was one that many users may not see at all. This is the adoption of ‘parallel segmentation collation’, directly as a result of the experience of working with Münster scholars and with evolutionary biologists. I explain at some length exactly how these two groups led us to abandon the ‘base text collation’ we used before 1998 in favour of ‘parallel segmentation collation’ in the article ‘Collation Rationale’ included in the Miller’s Tale CD-ROM.

Adopting this model forced changes on many areas of the program: particularly, on the ‘Set Variants’ module, and also on the kinds of analysis and variant display we could now achieve. Perhaps most of all, it puts us in reach of a yet more sophisticated mode of collation: what I describe as ‘multiple progressive alignment’ in the ‘Collation Rationale’ article. Briefly: once we have aligned the variation across the witnesses into parallel segments, one could then go a step further and analyse the witness groupings within the segments. This is standard practice in analysis of variant DNA sequences in evolutionary biology but I have not implemented this in Collate 2: here, indeed, is a task for the next Collate.
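A rough sketch of what looking at witness groupings within parallel segments might involve, in Python. It is a deliberate simplification and not Collate's algorithm: it assumes the witnesses have already been aligned word for word, treats each word position as one segment, and uses invented sigla and readings. It does not attempt the ‘multiple progressive alignment’ described in the article.

```python
from collections import defaultdict

# Hypothetical pre-aligned witnesses: every list has the same length,
# and position n in one witness corresponds to position n in the others.
ALIGNED = {
    "A": ["the", "cat", "sat"],
    "B": ["the", "cat", "sate"],
    "C": ["a", "cat", "sat"],
}


def parallel_segments(aligned):
    """For each aligned position, group the witnesses by the reading they share."""
    sigla = list(aligned)
    length = len(aligned[sigla[0]])
    segments = []
    for position in range(length):
        groups = defaultdict(list)
        for siglum in sigla:
            groups[aligned[siglum][position]].append(siglum)
        segments.append(dict(groups))
    return segments


if __name__ == "__main__":
    for number, segment in enumerate(parallel_segments(ALIGNED), start=1):
        print(number, segment)
```

No witness is treated as the base here: every segment simply records which witnesses stand together, and those groupings are the raw material for the further analysis described above.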

Collate 2 was formally released in 1996, and has been continually refined since then. The development of Collate 1 and 2 now spans over seventeen years, from late 1989 to 2007, and there is C code within Collate dating back to the very beginning of Collate 1. This is an eon in the software world. Further, what was a great benefit in the software world in 1989 (the availability of the Macintosh interface for interface programming) had by 2007 become a cul-de-sac. The introduction of Macintosh OS X from 2000 on rendered the future of Macintosh Classic applications very dubious. I could, in theory, port Collate to OS X and a few times after 2000 I began to experiment with such a port. I discovered, very quickly, that this would be a huge task. The Collate code has grown to around 180 files, amounting to around 120,000 lines of code. Perhaps most discouraging: there are over 80 dialogue windows in Collate, managing the user’s interaction with the program. Some of these (notably, the regularization and set variants windows) have extremely complex execution flows built in them, refined over more than a decade’s experience. One might abandon some of these: but many of these windows would have to be hand-made anew in the OS X environment. Further, OS X changed many aspects of the graphic environment inhabited by Classic, and one would have to go through the code line-by-line at some points changing the old for the new. Many of these changes would involve complex reprogramming. And at the end: one would have a program which still ran on only one operating system.

Other things, too, had changed. The mantra of ‘write once, run everywhere’ had taken root, and a new generation of tools (notably, the Java programming environment) had arisen to support this aim. It is now a real possibility to write a complex graphic user interface program which runs identically, and as if native, on multiple platforms. Further, the XML world has matured, with a speed that would seem unimaginable compared with the very slow pace of development of applications for its predecessor, SGML. And most decisively, perhaps: a model of open-source collaborative programming has developed. All the time that I wrote Collate 1 and 2, the authoring model for software was modelled on that for books: a single person wrote the software, and then it was sold. But since the mid 90s, the open source movement, built on voluntary collaboration, has gathered pace. This is particularly so in the university and research worlds, where the news that you might even be considering writing software to sell is met with disbelief, so that funding bodies routinely now insist that software code be open source. Within the XML world too, another model of programming has also developed: away from the all-inclusive this-application-will-do-it-all to a federated world of individual co-operating programs. This is particularly true in the web world: a simple user request may invoke one program to work out how to respond, which then summons data from a relational database, combines this with other data from an XML database (using XQuery and other X applications), blends the two into XML, which an XML formatter then transforms to HTML, which the server then passes back to the requester.

This leaves us, then, with a set of directions we can follow for CollateXML:

  1. It will have all the functionality of Collate 2; particularly, it may support interactive user-adjustable collation
  2. It will be written in a modular form, so that (for example) applications which want to use collation services but not to offer interactive adjustment of collation can embed the collation services in their own environment apart from the user interface
  3. It will handle native XML, both with and without a schema or DTD. However, it should employ its own data interface, independent of XML, so that future or other markup languages (including, indeed, the existing Collate markup) could be readily supported by the program. I am known for predicting the demise of XML: an event which will occur when computer science departments recognize that the overlapping hierarchy problem is not a ‘residual’ difficulty, but a fundamental feature of text.
  4. It will be written co-operatively, in an open source environment
  5. The best bet for its development appears to be Java. The range of XML tools already offered by Java gives us an excellent platform, as does the remarkable string-processing library Java offers. Combine this with its high modularity, its excellent support for graphic interfaces, and its popularity with XML developers (not least, the eXist world) and we have an extremely compelling case.

So far, the history of Collate. In the next post, I will set out a design for CollateXML, based on what I have learnt of collation systems over the years.

All this means: the next version of Collate must be open source.

