Allow declaring a record canonical, to ensure that there are no explicit derivations.
Derive all derivable fields solely from provided canonical fields.
Infer any empty canonical fields, if possible, from provided derived fields.
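As a minimal sketch of the first direction (deriving fields from canonical ones), assume a record with just given names, surnames, and an optional suffix. The names `Canonical`, `Derived`, and `derive` are hypothetical, chosen for illustration, and not taken from the library.

```scala
object DeriveSketch {
  // Canonical atomic fields (hypothetical field names).
  final case class Canonical(
      givenNames: Seq[String],
      surnames: Seq[String],
      suffix: Option[String] = None)

  // Fields that are fully determined by the canonical ones.
  final case class Derived(
      firstName: Option[String],
      givenInitials: String,
      fullName: String)

  // Derive every derivable field solely from the canonical fields.
  def derive(c: Canonical): Derived = {
    val initials = c.givenNames.filter(_.nonEmpty).map(g => s"${g.head}.").mkString(" ")
    val surname = c.surnames.mkString("-")
    val full = (c.givenNames ++ Seq(surname) ++ c.suffix.toSeq).filter(_.nonEmpty).mkString(" ")
    Derived(c.givenNames.headOption, initials, full)
  }
}
```

Because `Derived` is computed from `Canonical` alone, a record built this way cannot contain a derived field that disagrees with its canonical fields.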
An attempt to represent names as a set of canonical atomic fields. Being fully comprehensive and accurate is not possible, because there are too many cultural variations and ambiguities. Still, this should cover most of the cases we care about regarding authorship of journal articles.
A name is not a fixed thing; it is a probabilistic cloud of strings, all denoting the same person. Here we don't cover the case that a person changes names completely; in that case there are two disjoint clouds of strings, so that should be modeled by allowing a Person to have multiple PersonNames.
Here we try to model different representations of "the same name". Variations may include: omitting some components; using initials for some components; reordering; etc. The most "different" case to model is that of married names vs. maiden names. Since one or both of these may appear, but the other name components are not affected, we consider this a case of multiple surnames within one name.
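The variations listed above can be illustrated by enumerating a small part of the "cloud of strings" for one name. This is only a sketch under simplifying assumptions: `Name` and `variants` are hypothetical, at least one given name and one surname are assumed, and only three kinds of variation are covered (dropping or abbreviating given names, choosing among surnames, and inverting the order).

```scala
object VariantSketch {
  // Multiple surnames (e.g. maiden and married) live inside one name.
  final case class Name(givenNames: Seq[String], surnames: Seq[String])

  private def initial(s: String): String = s.take(1) + "."

  def variants(n: Name): Set[String] = {
    // Given-name renderings: all spelled out, first name only, initials only.
    val givenForms: Set[String] = Set(
      n.givenNames.mkString(" "),
      n.givenNames.take(1).mkString(" "),
      n.givenNames.map(initial).mkString(" "))
    // Surname renderings: each surname alone, plus the hyphenated combination.
    val surnameForms: Set[String] = (n.surnames :+ n.surnames.mkString("-")).toSet
    for {
      g <- givenForms
      s <- surnameForms
      v <- Set(s"$g $s", s"$s, $g") // natural order and inverted order
    } yield v
  }
}
```

For `Name(Seq("Amanda"), Seq("Jones", "Archer"))` the result includes "Amanda Jones", "A. Jones-Archer", and "Archer, Amanda", all denoting the same name.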
Subclasses propagate name fragments around the various representations, in an attempt to provide some reasonable value for each field.
Here we want to take multiple name variants as input and coordinate them into a single record. For instance, if we assert that Amanda Jones and A. Jones-Archer are the same person, then we should later recognize Amanda Archer as a valid variant.
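A rough sketch of that coordination, using the Amanda Jones example. It assumes the variants have already been parsed into a first name (or initial) and a set of surnames; `Variant`, `Record`, `accepts`, and `compatible` are hypothetical names, and the compatibility rule is deliberately simplistic.

```scala
object MergeSketch {
  // One observed variant, already parsed into parts (parsing is out of scope here).
  final case class Variant(first: String, surnames: Set[String])

  // "A." is compatible with "Amanda"; two spelled-out names must match exactly.
  def compatible(a: String, b: String): Boolean = {
    val (x, y) = (a.stripSuffix("."), b.stripSuffix("."))
    if (a.endsWith(".") || b.endsWith(".")) x.startsWith(y) || y.startsWith(x)
    else x == y
  }

  // Coordinated record: the most complete first name seen so far, plus all surnames.
  final case class Record(first: String, surnames: Set[String]) {
    // Called when we assert that v denotes the same person as this record.
    def merge(v: Variant): Record = {
      val best = if (first.endsWith(".") && !v.first.endsWith(".")) v.first else first
      Record(best, surnames ++ v.surnames)
    }

    // A variant is valid if its first name is compatible and it uses only known surnames.
    def accepts(v: Variant): Boolean =
      compatible(first, v.first) && v.surnames.nonEmpty && v.surnames.subsetOf(surnames)
  }
}
```

After merging "Amanda Jones" with "A. Jones-Archer", the record holds both surnames, so "Amanda Archer" is accepted while "Amanda Smith" is not.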
A name format specification, for use both in formatting outputs and for forming expectations when parsing inputs.
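To illustrate how one specification might serve both purposes, here is a small sketch with two hypothetical options. `NameFormat`, `render`, and `parse` are illustrative names, not the library's actual API, and the single-surname `Name` is a simplification.

```scala
object FormatSketch {
  final case class Name(givenNames: Seq[String], surname: String)

  // Hypothetical format options.
  final case class NameFormat(
      initialsOnly: Boolean = false, // "A." instead of "Amanda"
      inverted: Boolean = false)     // "Jones, Amanda" instead of "Amanda Jones"

  // Formatting: the specification decides how the fields are laid out.
  def render(n: Name, f: NameFormat): String = {
    val g =
      if (f.initialsOnly) n.givenNames.map(_.take(1) + ".").mkString(" ")
      else n.givenNames.mkString(" ")
    if (f.inverted) s"${n.surname}, $g" else s"$g ${n.surname}"
  }

  // Parsing: the same specification tells us where to expect the surname.
  def parse(s: String, f: NameFormat): Option[Name] =
    if (f.inverted) s.split(",", 2).map(_.trim) match {
      case Array(sur, g) if sur.nonEmpty =>
        Some(Name(g.split("\\s+").filter(_.nonEmpty).toSeq, sur))
      case _ => None
    } else s.trim.split("\\s+").toSeq match {
      case init :+ last if last.nonEmpty => Some(Name(init, last))
      case _ => None
    }
}
```

With `initialsOnly = false`, parsing a rendered name with the same format recovers the original fields; with initials the round trip is lossy, which is one reason the expected format matters when interpreting input.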
Shamelessly yoinked from edu.umass.cs.iesl.scalacommons
This could be a CRF...
Infer any empty canonical fields, if possible, from provided derived fields.
The approach is to generate full names from the derived fields, and then parse those full names back to canonical fields. On the one hand, that risks losing information. On the other hand, this is generally used upstream of a merge where the "correct" derived fields will take priority anyway. Also, this may help clean up mistagged data.
In cases of single-valued fields, nonempty explicit data overrides implicit data resulting from full-name parsing. Thus, e.g., an explicit "Mr." overrides an implicit "Dr.". Open question: should there be precedence rules?
Set-valued fields are just merged.
Note the relationship with PersonName.merge(). What we really want is to (a) derive canonical fields only from derived fields, and then (b) merge those with the existing canonical fields.
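The two-step idea can be sketched as follows: parse each full name into canonical fields, then fold the results into the explicit record, letting explicit single-valued fields win and taking the union of set-valued ones. Everything here is an assumption for illustration: `Canonical`, `parseFullName`, and `inferCanonical` are hypothetical, and the toy parser only understands "[Prefix] First ... Surname[-Surname]".

```scala
object InferSketch {
  final case class Canonical(
      prefix: Option[String] = None,     // single-valued
      firstName: Option[String] = None,  // single-valued
      surnames: Set[String] = Set.empty) // set-valued

  private val knownPrefixes = Set("Mr.", "Ms.", "Mrs.", "Dr.", "Prof.")

  // Toy full-name parser, standing in for the real one.
  def parseFullName(s: String): Canonical = {
    val tokens = s.trim.split("\\s+").toList.filter(_.nonEmpty)
    val (prefix, rest) = tokens match {
      case p :: tail if knownPrefixes.contains(p) => (Option(p), tail)
      case other                                  => (Option.empty[String], other)
    }
    rest match {
      case Nil           => Canonical(prefix)
      case one :: Nil    => Canonical(prefix, None, one.split("-").toSet)
      case first :: more => Canonical(prefix, Some(first), more.last.split("-").toSet)
    }
  }

  // (a) derive canonical fields from the full names; (b) merge with explicit ones.
  def inferCanonical(explicit: Canonical, fullNames: Seq[String]): Canonical =
    fullNames.map(parseFullName).foldLeft(explicit) { (acc, implied) =>
      Canonical(
        prefix = acc.prefix.orElse(implied.prefix),          // explicit wins
        firstName = acc.firstName.orElse(implied.firstName), // explicit wins
        surnames = acc.surnames ++ implied.surnames)         // sets are merged
    }
}
```

Given an explicit prefix "Mr." and the full name "Dr. Amanda Jones-Archer", this keeps "Mr.", fills in the first name "Amanda", and collects both surnames. No precedence rule beyond "explicit first, then input order" is applied.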