From the account of the collation, we are dealing with something very different from ‘string comparison’. Indeed, the base unit of the collation is the word: we collate words, not strings. Words may be concatenated or divided, but words are the basis of it all. (This was the form used by Collate.)
For each witness, we need the following information:
- Its sigil
- Its location (in Collate0-2 this was simply a file name; in CollateXML it might be a URL, an XQuery or XPath expression, etc.)
- Collation specifications for this witness. See below.
- For each collateable block: two collateable object arrays. See below.
- For each collateable block: an array of correspondences with the base. See below.
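The per-witness information listed above could be held in a structure along the following lines. This is only a sketch: the class and field names (Witness, CollateableBlock, block and so on) are hypothetical, the specifications are reduced to a placeholder map, and a correspondence is assumed to be a simple pair of word numbers.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// All names here are hypothetical; they are not taken from Collate or CollateXML.
final class CollateableBlock {
    final List<String> textOriginal = new ArrayList<>();    // words as they stand in the witness
    final List<String> textCollateable = new ArrayList<>(); // words as they will be collated
    final List<int[]> correspondences = new ArrayList<>();  // assumed pairs {witness word, base word}
}

final class Witness {
    final String sigil;
    final String location; // file name, URL, XPath expression, etc.
    final Map<String, String> specifications = new LinkedHashMap<>();   // placeholder settings
    final Map<String, CollateableBlock> blocks = new LinkedHashMap<>(); // keyed by block id

    Witness(String sigil, String location) {
        this.sigil = sigil;
        this.location = location;
    }

    // Returns the block with this id, creating an empty one on first use.
    CollateableBlock block(String id) {
        return blocks.computeIfAbsent(id, k -> new CollateableBlock());
    }
}
```

A witness would then be created with something like new Witness("Hg", "hg.xml") (an invented sigil and file name), and its words added block by block.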
The collation specifications for variant identification
These will control what is recorded as a variant against the base. Settings include:
a. case. Settings will be collate/ignore.
if collate: Collation will treat differences of case as variants.
if ignore: Collation will not treat differences of case as variants.
b. xml. Whether to collate or ignore XML. Settings will be: all/none/nominated
If none: all XML encoding surrounding, within or between the words will be ignored
If all: all XML encoding will be collated, including empty elements, surrounding, within, and between the words
If nominated: only specified XML elements will be collated. The details of the XML elements to be collated will be held in a further structure (see below).
c. xmlcollate: null unless xml=nominated. This structure is a series of elements to be collated, as follows:
i. gi: the gi (generic identifier, that is, the element name) of the element to be collated (including namespace)
ii. attributes: Values are all/none/nominated. If all: all attributes and their values are to be collated; if none, all attribute values are ignored, and only element names are collated; if nominated, details of attributes to be collated are held in a further structure
iii. collateattributes: null unless attributes=nominated. This structure is a series of attribute names which will be collated for this element (this could be further elaborated, perhaps, to set conditions: report as variant if the attribute is a particular value)
d. punctuation. Settings will be all/none/nominated
if all: collate all punctuation, as identified by the isPunctuation method
if none: collate no punctuation, as identified by the isPunctuation method
if nominated: collate only specific punctuation identified by the isPunctuation method
The specifications object must also have at least one method: isPunctuation. For a particular pair of strings, this should identify whether differences between them are purely punctuation (in which case, they might or might not be variants) or not.
Two other methods might be required:
isCaseDifference: if it is found that Java’s native methods for ignoring case difference when comparing strings are not adequate.
adjustXML: for some contexts, we may need to do more than simply ignore/not ignore XML. Consider:
One might here wish to ignore the &per; element and treat this as ‘experience’.
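The case and punctuation settings described above, together with isPunctuation, might be sketched in Java as follows. Everything except the name isPunctuation is an assumption: the enum and field names are invented, punctuation is taken to mean the Unicode punctuation categories, and the xml setting is recorded but not acted upon here.

```java
import java.util.Locale;
import java.util.Set;

enum CaseSetting { COLLATE, IGNORE }
enum Scope { ALL, NONE, NOMINATED }

// Hypothetical sketch; names other than isPunctuation are assumptions.
final class CollationSpecification {
    CaseSetting caseSetting = CaseSetting.COLLATE;
    Scope xml = Scope.NONE;       // recorded only; XML handling is not sketched here
    Scope punctuation = Scope.ALL;
    Set<String> nominatedPunctuation = Set.of(); // used only when punctuation == NOMINATED

    private static boolean isPunctuationMark(String ch) {
        return ch.matches("\\p{P}"); // Unicode punctuation categories
    }

    // True when the two strings differ, and differ only in punctuation.
    boolean isPunctuation(String a, String b) {
        return !a.equals(b)
            && a.replaceAll("\\p{P}", "").equals(b.replaceAll("\\p{P}", ""));
    }

    // Reduces a word to the form that the current settings say should be compared.
    String normalise(String word) {
        String s = caseSetting == CaseSetting.IGNORE ? word.toLowerCase(Locale.ROOT) : word;
        if (punctuation == Scope.ALL) {
            return s;
        }
        StringBuilder out = new StringBuilder();
        s.codePoints().forEach(cp -> {
            String ch = new String(Character.toChars(cp));
            boolean keep = !isPunctuationMark(ch)
                || (punctuation == Scope.NOMINATED && nominatedPunctuation.contains(ch));
            if (keep) {
                out.append(ch);
            }
        });
        return out.toString();
    }

    // A reading is a variant when it still differs from the base after normalisation.
    boolean isVariant(String base, String witness) {
        return !normalise(base).equals(normalise(witness));
    }
}
```

With punctuation set to nominated and only ‘/’ nominated, ‘that/’ against ‘that’ would be reported as a variant, while ‘that,’ against ‘that’ would not.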
The collation specifications for text alignment
The model here proposed, of separating text alignment from variant identification, presumes that optimal text alignment would be achieved by ignoring differences of case, punctuation and xml. Thus, at the alignment stage, we would use the minimal set of collation specifications for comparison of witnesses with the base.
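As an illustration of the minimal comparison used for alignment, each word could be reduced to a key with XML, punctuation and case removed, and the keys compared instead of the words. This is a rough sketch under stated assumptions: AlignmentKey is an invented name, tags are removed with a crude regular expression rather than a parser, and punctuation again means the Unicode punctuation categories.

```java
import java.util.Locale;

// Hypothetical helper: reduces a word to the form compared during alignment.
final class AlignmentKey {
    static String of(String word) {
        return word
            .replaceAll("<[^>]*>", "")  // crude tag removal; a real system would parse the XML
            .replaceAll("\\p{P}", "")   // drop punctuation
            .toLowerCase(Locale.ROOT);  // ignore case
    }
}
```

Under these assumptions ‘<hi>Th</hi>at,’ and ‘that’ yield the same key, so they would align even if they are later reported as variants.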
Hierarchical setting of collation specifications
One would expect that for most collations, one would have identical specifications for all witnesses. In programming terms: one would set the specifications for the class of witnesses, and each witness would then inherit this uniform set of specifications. This design permits the uniform specification to be overridden for specific witnesses.
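One way to picture this inheritance is a lookup that consults a witness's own settings first and falls back to the shared defaults. The sketch below assumes settings are simple name/value strings; SpecRegistry and its methods are invented names, and the sigils in the usage note are invented too.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: shared defaults with optional per-witness overrides.
final class SpecRegistry {
    private final Map<String, String> defaults = new HashMap<>();
    private final Map<String, Map<String, String>> overrides = new HashMap<>();

    // Sets a value inherited by every witness.
    void setDefault(String key, String value) {
        defaults.put(key, value);
    }

    // Sets a value for one witness only, identified by its sigil.
    void override(String sigil, String key, String value) {
        overrides.computeIfAbsent(sigil, s -> new HashMap<>()).put(key, value);
    }

    // Returns the witness's own value if it has one, else the shared default.
    String get(String sigil, String key) {
        Map<String, String> own = overrides.get(sigil);
        if (own != null && own.containsKey(key)) {
            return own.get(key);
        }
        return defaults.get(key);
    }
}
```

For example, after setDefault("case", "ignore") and override("Hg", "case", "collate"), a lookup for witness Hg returns "collate" while every other witness still returns "ignore".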
The collateable object arrays
The key to Collate0-2 was that it did not collate text strings: it collated word objects. For each witness, it held the words of the text in an array of word objects, numbered from 0 to xxx, and all collation took place against these word objects, with information about variants found stored in tables of numbers referring to these arrays. I propose that CollateXML retain, refine and extend this model.
Collate0-2 accepted ‘plain text’ and converted this to word object arrays as it collated. As it did so, it might remove (depending on various settings) punctuation or other characters from the text to be collated. Thus ‘April / that’ would become:
word 1: April
word 2: that
Notice that the ‘/’ is here removed. At a later point, Collate0-2 converted the text to
<w n="1">April</w> / <w n="2">that</w>
This is rather unsatisfactory. The relationship between the numbering of the words in the word object array and that in the converted XML depends on rather fragile assumptions about what is and is not a word. I propose instead that CollateXML recommend that for word-by-word collation, input must be in full XML form, with all discrete elements marked as follows:
<w n="1">April</w> <w n="2">/</w> <w n="3">that</w>
This has several implications. It means that, because of the problem of overlapping hierarchies, elements spanning across words have to be treated as follows:
<w n="1"><hi>April</hi></w> <w n="2"><hi>/</hi></w> <w n="3"><hi>th</hi>at</w>
and not as follows, where the <hi> element overlaps the <w> elements and the result is not well-formed:
<hi><w n="1">April</w> <w n="2">/</w> <w n="3">th</hi>at</w>
The advantage of the explicit labelling of every collateable object in the original text as a <w> element with an ‘n’ attribute is that it makes linking of the collation with the original text absolutely explicit. The ‘n’ attribute on each <w> element can be used to denote each word in the collateable object array, and then used to link to the corresponding <w> element in the original. (One might, just possibly, use XPath to achieve the same result: that is a matter for discussion.)
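To make the linking concrete, the following sketch reads <w> elements with the standard Java DOM parser and indexes their text by the ‘n’ attribute. It is an illustration only: WordIndex is an invented name, the input is assumed to be a well-formed fragment with a single root element, and namespaces are ignored.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: maps each n number to the text of its <w> element.
final class WordIndex {
    static Map<Integer, String> fromXml(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList words = doc.getElementsByTagName("w");
            Map<Integer, String> byN = new LinkedHashMap<>();
            for (int i = 0; i < words.getLength(); i++) {
                Element w = (Element) words.item(i);
                // getTextContent drops nested markup such as <hi> and keeps the characters
                byN.put(Integer.parseInt(w.getAttribute("n")), w.getTextContent());
            }
            return byN;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not read <w> elements", e);
        }
    }
}
```

Given the three-word example wrapped in a root element (here assumed to be <l>), looking up n=3 returns ‘that’, whatever markup sits inside the <w> element.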
I said we need TWO collateable object arrays for each witness. The first array, as specified above, is to hold the original text: call this textOriginal. But in fact, this is not the text which will actually be collated. The second array is the text which will actually be collated: call this textCollateable. The array textCollateable will have identical structure and initially identical content to textOriginal.
The reason for the two arrays is to make regularization possible. Regularization was one of the great strengths of Collate0-2, and the approach here suggested is based closely on how Collate0-2 worked. As the scholar collates, he or she will see cases where it is necessary to filter out spelling or other non-significant variation. This may involve alteration of word division. Thus, we might be collating:
base: the man Cat
wit1: theman cat
It appears that in wit1 one will want to change the word division for ‘theman’ and regularize ‘cat’ to ‘Cat’. Thus, textOriginal would hold for wit1:
word 1: theman
word 2: cat
while textCollateable must be altered to:
word 1: the
word 2: man
word 3: Cat
Notice that this will mean keeping an offset pointer at each word, indicating for each array what is the corresponding word in the other array.
Putting this together, we require the following information for each word object in each collateable object array:
- the word itself (including XML encoding)
- the n number for the word, to relate to the n number on the corresponding <w> element in the original
- the offset to the corresponding word in the other array. Thus: for word 1 in textCollateable the offset would be 0; for word 2 and word 3 it would be -1. For word 1 in textOriginal the offset would be 0; for word 2 it would be +1.
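The word object and its offsets can be sketched for the ‘theman cat’ example as follows. The names are hypothetical, and two points are assumptions drawn from the offsets just given: that textOriginal holds ‘theman’, ‘cat’ and textCollateable holds ‘the’, ‘man’, ‘Cat’; and that the offset is added to a word's 1-based position to reach its counterpart, while n always names the <w> element in the original.

```java
import java.util.List;

// Hypothetical sketch of a word object; the names are assumptions, not Collate's own.
final class WordObject {
    final String text; // the word itself, including any XML encoding
    final int n;       // n of the corresponding <w> element in the original
    final int offset;  // add to this word's position to reach its counterpart

    WordObject(String text, int n, int offset) {
        this.text = text;
        this.n = n;
        this.offset = offset;
    }
}

final class RegularizationExample {
    static final List<WordObject> TEXT_ORIGINAL = List.of(
        new WordObject("theman", 1, 0),
        new WordObject("cat", 2, 1));

    static final List<WordObject> TEXT_COLLATEABLE = List.of(
        new WordObject("the", 1, 0),
        new WordObject("man", 1, -1),
        new WordObject("Cat", 2, -1));

    // Positions are 1-based, as in the text; returns the counterpart in the other array.
    static WordObject counterpart(List<WordObject> from, int position, List<WordObject> other) {
        int target = position + from.get(position - 1).offset;
        return other.get(target - 1);
    }
}
```

Following the offsets from textCollateable, ‘man’ (position 2, offset -1) leads back to ‘theman’ and ‘Cat’ (position 3, offset -1) leads back to ‘cat’; from textOriginal, ‘cat’ (position 2, offset +1) leads to ‘Cat’.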