[Personal identifiers in medical research networks: evaluation of the personal identifier generator in the Competence Network Paediatric Oncology and Haematology]
Jutta Glock 1
Ralf Herold 2
Klaus Pommerening 1
1 Institute of Medical Biostatistics, Epidemiology and Informatics (IMBEI), Johannes Gutenberg University of Mainz, Mainz, Germany
2 Coordination and Management Group, Competence Network Paediatric Oncology and Haematology, Charité - Universitätsmedizin Berlin, Campus Virchow-Klinikum, Berlin, Germany
Abstract
The Society for Paediatric Oncology and Haematology (GPOH) and the Competence Network Paediatric Oncology and Haematology conduct numerous clinical trials. Evaluating these trials jointly requires reliable identification of the recruited patients. For this purpose, a personal identifier generator (PID generator) is used, which assigns unique, pseudonymous and irreversible PIDs to the participants of these trials.
The matching procedure implemented in the PID generator was tested using the GPOH configuration. Fictitious data were used to check that PID requests are processed correctly (functionality tests), whereas test data records were used to assess the matching results. In addition, more than 44,000 data records from the German Childhood Cancer Registry (GCCR) were assigned PIDs, and the corresponding patient list was evaluated; for each data record, this list contains the PID, partly encrypted data fields and information about the PID generation.
All functionality tests yielded the expected results. Neither the 14,915 test data records nor the GCCR data produced homonym errors. In the test data records, six synonyms occurred because of erroneous dates of birth. In processing the GCCR data, 22 synonyms occurred in the comparison with the patient list of 2579 data records. Of the resulting 45,693 entries in the patient list, about 7% had been submitted twice and fewer than 1% more often.
The synonym error rate depends largely on the quality of the input data and on the frequency of multiple submissions. Depending on the requirements for minimizing the homonym and synonym error rates, additional measures to assure data quality may therefore be necessary. The results show that the PID generator is a suitable means of reliably identifying trial participants within medical research networks.
Keywords
patient identifier, medical record linkage, evaluation
Introduction
Pseudonymous but unambiguous identification of patients is important in medical research networks, especially for participants in clinical trials carried out by different network members. The Society for Paediatric Oncology and Haematology (GPOH) and the Competence Network Paediatric Oncology and Haematology therefore initiated, and now use, a personal identifier (PID) generator, which can be used to maintain a comprehensive patient list, match personal data and create PIDs. A PID is composed of 8 alphanumeric characters; the four letters 'B', 'I', 'O', and 'S' are omitted to avoid confusion with the digits '8', '1', '0', and '5'. The patient list consists of several database tables that contain information about each PID request, such as the partly encrypted input data items, the matching outcome and the PID itself. Figure 1 [Fig. 1] shows the PID request procedure. The PID generator implements a deterministic record linkage method ("matching") that can be adapted through a number of configuration options. Detailed information is available on the Internet [1]. The development of the PID generator was commissioned by the German Telematikplattform für Medizinische Forschungsnetze (TMF) [2], and it has become part of the generic privacy concept of the TMF, which describes how to handle patient data within medical research networks. Since 2003, it has been used for new patients in Paediatric Oncology and Haematology. Three further Competence Networks in Medicine are currently using or testing the PID generator, namely the Kompetenznetz Rheuma (Systemic Inflammatory Rheumatic Diseases), Angeborene Herzfehler (Congenital Heart Defects) and Herzinsuffizienz (Heart Failure). In the study reported here, we examined the functionality of the PID generator with the configuration used by the GPOH (GPOH configuration).
We also evaluated the quality of the overall matching procedure and described how the PID generator performs when supplied with data records of the German Childhood Cancer Registry (GCCR), in which more than 44,000 records have been assigned PIDs. Finally, we reviewed the current patient list in Paediatric Oncology and Haematology and assessed the frequency of multiple PID requests for the same person.
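The PID format described above (8 alphanumeric characters without 'B', 'I', 'O' and 'S') can be illustrated with a short Python sketch. It checks only the surface format. The use of upper-case letters and the names `PID_ALPHABET` and `has_pid_format` are assumptions made for illustration; how the PID generator actually derives its identifiers is not described here.

```python
import string

# Characters allowed in a PID according to the text: digits and letters,
# without 'B', 'I', 'O' and 'S' (easily confused with '8', '1', '0', '5').
# Assumption: letters are upper case.
EXCLUDED_LETTERS = set("BIOS")
PID_ALPHABET = "".join(
    c for c in string.digits + string.ascii_uppercase if c not in EXCLUDED_LETTERS
)
PID_LENGTH = 8


def has_pid_format(candidate: str) -> bool:
    """Return True if the string has the length and character set of a PID.

    Only the surface format is checked; this says nothing about whether
    the value was ever issued by a PID generator.
    """
    return len(candidate) == PID_LENGTH and all(c in PID_ALPHABET for c in candidate)


print(len(PID_ALPHABET))           # size of the reduced alphabet
print(has_pid_format("A1C2D3E4"))  # hypothetical example string
print(has_pid_format("B1C2D3E4"))  # contains the omitted letter 'B'
```

Removing four of the 36 alphanumeric characters leaves an alphabet of 32 characters.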
Figure 1: PID request procedure

Material and methods
Configuration of the PID generator
Data fields
An important factor in matching personal data is the choice of data fields for comparison. On the one hand, it is essential to choose fields that allow unambiguous identification of an object; on the other hand, the information must be available for (nearly) all persons and should be very reliable [3]. Fields like last name, first name, date of birth, and place of birth or residence are often used for this purpose, as this kind of information is usually readily available and combinations of these fields are usually unique [3], [4], [5], [6]. In the configuration of the PID generator, each field can be declared mandatory or optional, depending on its role in the matching procedure. A predefined data type (e.g. TEXT, SEX, NAME) is assigned to each field to allow adequate data processing such as normalization, decomposition of names, generation of phonetic codes, etc. and plausibility checks [7], [8], [9]. In addition, certain fields, such as last name and alternative name, can be defined as exchangeable, if appropriate.
The fields of the GPOH configuration are shown in Table 1 [Tab. 1]. Two configuration files, with and without encryption, were used to compare the outcomes of the matching procedure. One of the files includes specifications for encryption with the AES (Advanced Encryption Standard) [10] algorithm and for generation of hash codes with MD5 (Message Digest Algorithm 5) [11] for the fields lname (last name), aname (alternative name), fname (first name), and bd (day of birth).
Table 1: Fields defined in the configuration of the Society for Paediatric Oncology and Haematology 

Matching strategy
Apart from the data fields, the matching strategy itself can be configured by defining database search filters and results. For each PID request, the database is searched for a matching data record. The user marks the input record as 'sure' or 'unsure'. 'Sure' indicates that mistakes, including common spelling mistakes in names, have been ruled out, whereas data are marked as 'unsure' if they were collected in an error-prone way, such as from handwritten forms. The search for a match usually begins with a more stringent filter. We suggest one that lets through only data marked 'sure' that agree exactly with the input data in all data fields, including the optional ones. Once a filter has been processed, the search is either stopped, yielding a certain result, or continued with a less stringent filter until all the filters have been used sequentially. Once a result is attained, the PID of the match is sent, a new PID is generated or a warning is sent to the user.
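The sequential use of filters can be sketched in Python as follows. This is a simplified illustration, not the PID generator's implementation: the record fields, the second filter and the result labels `MATCH`, `AMBIGUOUS` and `NEW_PID` are hypothetical placeholders, and the real configuration decides per filter whether the search stops or continues.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Record:
    lname: str
    fname: str
    birth_date: str            # ISO date string, e.g. "2001-05-17"
    sure: bool = True          # 'sure' / 'unsure' flag set by the user
    pid: Optional[str] = None  # PID already stored for this record, if any


# A filter decides whether a stored record is a candidate for the input record.
Filter = Callable[[Record, Record], bool]


def exact_and_sure(inp: Record, stored: Record) -> bool:
    """Stringent filter: stored record is 'sure' and all fields agree exactly."""
    return stored.sure and (inp.lname, inp.fname, inp.birth_date) == (
        stored.lname, stored.fname, stored.birth_date)


def normalized_names(inp: Record, stored: Record) -> bool:
    """Less stringent (hypothetical) filter: ignore case and outer whitespace."""
    def norm(s: str) -> str:
        return s.strip().casefold()
    return (norm(inp.lname), norm(inp.fname), inp.birth_date) == (
        norm(stored.lname), norm(stored.fname), stored.birth_date)


# Filters ordered from most to least stringent.
FILTERS: List[Tuple[str, Filter]] = [
    ("exact_and_sure", exact_and_sure),
    ("normalized_names", normalized_names),
]


def match(inp: Record, patient_list: List[Record], filters=FILTERS):
    """Apply the filters in order and stop at the first one that finds hits."""
    for name, flt in filters:
        hits = [r for r in patient_list if flt(inp, r)]
        if len(hits) == 1:
            return ("MATCH", name, hits[0].pid)   # return the existing PID
        if len(hits) > 1:
            return ("AMBIGUOUS", name, None)      # warn the user
    return ("NEW_PID", None, None)                # no match: generate a new PID
```

Logging the name of the filter that produced the outcome, as the second tuple element does here, makes it possible to trace afterwards how each result was reached.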
The matching strategy of the GPOH is described in Figure 2 [Fig. 2]. Table 2 [Tab. 2] and Table 3 [Tab. 3] list the possible results.
Table 2: Possible match results in case of an unambiguous match

Table 3: Possible match results in case of multiple matches

Figure 2: GPOH definition of database filters

Functionality tests
Test data
For the functionality tests, 67 fictitious data records were created such that each result defined in the configuration (Table 2 [Tab. 2] and Table 3 [Tab. 3]) could be attained at least once if the PID generator worked correctly and in accordance with the configuration file. The expected results were therefore defined before the tests were run. The data records were designed to cover different scenarios (crosswise agreement in last name and alternative name, crosswise agreement in the components of names, phonetic similarity, and incorrect or missing optional data).
Transaction
The PID requests were sent to the PID generator in batch mode, and the results were logged and compared with those expected.
Evaluation of the overall matching procedure
Test data
The test data records were derived from the reference database of Mainz University Hospital. A total of 14,915 data records were made available, all marked as 'unsure'. Apart from the required and optional data fields, information was also provided on place of birth, profession and health insurance. This information is useful for differentiating between homonyms and correct matches.
Error rates
To assess the quality of the matching procedure, the homonym and synonym error rates were estimated. The homonym error rate indicates how often the data records of different persons are wrongly matched, i.e. are assigned the same PID. This rate can be estimated from test data if the data is representative of a real application. 14,915 test cases were sent to the PID generator, and the number of matches was established. If the input data was free of duplicates, the number represented the number of homonym errors. If not, each match had to be checked manually to distinguish between true matches and homonym errors.
The quality of the input data has little effect on the result, as erroneous data is unlikely to lead to a homonym. The choice of data fields for matching is, however, relevant. Furthermore, the probability of true homonyms, i.e. of different persons actually having the same values for all chosen fields, increases with the amount of data [12]. This is because coincidental agreement in personal data is more likely to happen in a large group of people than in a smaller one. Ideally, the homonym error rate is equal to the rate of true homonyms. This can in turn be minimized by an appropriate combination of fields, i.e. one that is well suited to differentiate individuals.
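Why the chance of a true homonym grows with the amount of data can be made concrete by counting record pairs. The sketch below assumes, purely for illustration, that any two different persons agree in all chosen fields with the same small probability `p`; this value is hypothetical and not taken from the study.

```python
def record_pairs(n: int) -> int:
    """Number of distinct pairs that can be formed from n records."""
    return n * (n - 1) // 2


def expected_coincidences(n: int, p: float) -> float:
    """Expected number of coinciding pairs if each pair agrees with probability p."""
    return p * record_pairs(n)


# The number of pairs grows roughly with the square of the number of records.
for n in (1_000, 14_915, 44_248):
    print(n, record_pairs(n))
```

Because the number of pairs grows quadratically, a field combination that separates persons well in a small patient list can still produce occasional true homonyms in a much larger one.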
The synonym error rate shows how often data records for the same person do not match, i.e. a person has been assigned two or more PIDs. It is more difficult to draw general conclusions about synonyms because their frequency depends mainly on the quality (correctness and completeness) of the input data. In the GPOH configuration, for instance, a single error in a birth date can prevent a match and result in a synonym. Precise calculation of the synonym error rate would thus require knowledge of the error rate for each data field and for each possible data source. As all this information is usually not available, a semiautomatic approach was used to assess the number of duplicates within the test data set. This allows calculation of the synonym error rate, but only for this specific data source.
Organizational aspects can also influence the synonym error rate, as synonyms can arise only when data for the same person are entered more than once [12], [13]. As changes in data, such as in last name or place of residence, can also result in synonyms, the combination of fields used for the matching procedure is important in this respect as well.
Transaction
We assumed that the test set of 14,915 data records was largely free of duplicates. To assess the duplicates in the test data, the records were sent to the PID generator in batch mode and processed sequentially. For each match produced, we determined manually whether a correct match had been found because of a duplicate or whether a homonym had been created. In addition, we logged the result of each request to trace which filter had produced the outcome. The tests were conducted with partly encrypted and with unencrypted data fields, the encrypted fields being first name, last name and alternative name, their corresponding components and phonetic codes, and day of birth.
Semiautomatic estimation of duplicate rate
To estimate the synonym error rate, it is necessary to identify the duplicates in the input data. Queries were therefore submitted to the patient list containing the test data to find agreements and similarities in the name fields; the other fields were not taken into account. One query, for example, yielded all pairs with the same first components of last names and first names. Subsequently, the search was expanded to take into account crosswise agreement in the first two components, exchangeability of first name and alternative name, and phonetic agreement. Duplicates whose names differ even in their phonetic codes cannot, however, be identified by this method.
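The first query described above (pairs with the same first components of last and first names) can be sketched in Python. The splitting of names at whitespace and hyphens and the case-insensitive comparison are assumptions for illustration; the decomposition rules of the PID generator itself may differ, and the sample names are invented.

```python
import re
from collections import defaultdict
from itertools import combinations


def first_component(name: str) -> str:
    """Return the first component of a name, split at whitespace or hyphens."""
    parts = [p for p in re.split(r"[\s\-]+", name.strip()) if p]
    return parts[0].casefold() if parts else ""


def candidate_pairs(records):
    """Yield-ready list of ID pairs that share first name components.

    records: iterable of (record_id, last_name, first_name) tuples.
    """
    blocks = defaultdict(list)
    for rec_id, lname, fname in records:
        key = (first_component(lname), first_component(fname))
        blocks[key].append(rec_id)
    pairs = []
    for ids in blocks.values():
        pairs.extend(combinations(sorted(ids), 2))
    return sorted(pairs)


sample = [
    (1, "Meyer-Schmidt", "Anna Maria"),
    (2, "Meyer", "Anna"),
    (3, "Schulz", "Paul"),
    (4, "meyer", "Anna-Lena"),
]
print(candidate_pairs(sample))
```

Such a query only produces candidates; as in the study, each pair still has to be checked manually to decide whether it is a duplicate.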
Generation of personal identifiers for records from the German Childhood Cancer Registry
In order to assign PIDs to the GCCR data, all data records with valid values in the required data fields were extracted from the Registry, resulting in a dataset of 44,248 records, 572 of which already had a PID. This data was submitted to the GPOH PID generator in batch mode. At this time (3 June 2005), the patient list contained 2579 entries. The data fields lname, aname and fname, their components and phonetic codes and bd were encrypted according to the GPOH configuration.
Results
Functionality
The results of all the functionality tests corresponded to those expected. Processing of PID requests is thus assumed to work correctly and to correspond to the GPOH configuration.
Overall matching procedure
For 14,915 real data records, 14,913 PIDs and two matches were produced. The matches showed exact agreement in the compulsory data fields, with and without considering the optional fields. Manual evaluation showed that both matches were true, i.e. that no homonyms had been created.
The search for records with the same first components of last and first names yielded 208 pairs, which were checked manually for duplicates. This evaluation resulted in eight putative duplicates, two corresponding to the matches. The same result was obtained when the search was expanded to pairs showing phonetic or crosswise agreement in the name fields, including the alternative name. We thus conclude that a total of six synonyms were created. Further investigation showed that all were due to erroneous reporting of date of birth, implying that the results are in fact based on data for 14,907 persons.
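The arithmetic behind these figures can be retraced in a few lines of Python. The counts are those reported above; expressing the synonyms once per submitted record and once per duplicate is a choice made here for illustration, since the appropriate denominator depends on the question asked.

```python
def rate(errors: int, total: int) -> float:
    """Simple proportion; raises if the denominator is not positive."""
    if total <= 0:
        raise ValueError("total must be positive")
    return errors / total


records = 14_915           # test data records submitted
matches = 2                # matches produced, both confirmed as true matches
putative_duplicates = 8    # duplicates found by the semiautomatic search
homonym_errors = 0         # no wrong matches were found

synonyms = putative_duplicates - matches   # duplicates that received a new PID
persons = records - putative_duplicates    # distinct persons behind the records

print(synonyms, persons)
print(rate(homonym_errors, records))       # homonym errors per record
print(rate(synonyms, records))             # synonyms per submitted record
print(rate(synonyms, putative_duplicates)) # synonyms per duplicate
```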
Encryption of the fields lname, aname and fname, their components and phonetic codes, and bd had no influence on the matching outcome.
Generation of personal identifiers for records from the German Childhood Cancer Registry
In the comparison of the 44,248 data records from the GCCR with the 2579 entries previously in the patient list, one duplicate was identified, accounting for one of the 1571 matches created. Again, no homonyms were created within the input data, but 22 synonyms occurred. In addition, weak matches were found in 53 cases (result NEG_TNT), i.e. the matches were based on phonetic similarity, but the optional data fields either disagreed or were not present. In 10 of these cases, the weak match was confirmed by PIDs that had already been assigned to the GCCR data.
Status of the Society for Paediatric Oncology and Haematology patient list
On 20 July 2005, the GPOH patient list contained 45,693 PIDs. The results logged since 3 June 2005 (starting with PID generation for the GCCR data) are listed in Table 4 [Tab. 4].
Table 4: Results obtained between 3 June 2005 and 20 July 2005

As mentioned above, the number of times a case is submitted influences the synonym error rate; we therefore also assessed the number of multiple submissions for the same person, as shown in Table 5 [Tab. 5]. In about 7% of all cases, a record had been submitted more than once; however, many of the multiple entries had been submitted sequentially, indicating that the requests had been initiated from the same place.
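A count of multiple submissions of this kind can be derived from a request log, as in the following Python sketch. The log format (one PID per processed request) and the sample values are hypothetical; the actual tables of the patient list are not reproduced here.

```python
from collections import Counter


def submission_histogram(request_pids):
    """Map 'number of requests per PID' to 'number of PIDs with that count'."""
    per_pid = Counter(request_pids)
    return Counter(per_pid.values())


def share_multiple(request_pids) -> float:
    """Proportion of PIDs that were requested more than once."""
    per_pid = Counter(request_pids)
    return sum(1 for n in per_pid.values() if n > 1) / len(per_pid)


# Hypothetical log: the PID returned for each processed request.
log = ["P1", "P2", "P2", "P3", "P3", "P3", "P4"]
print(dict(submission_histogram(log)))
print(share_multiple(log))
```

If the log also carried time stamps, requests for the same PID made in quick succession could be separated from genuinely independent submissions.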
Table 5: Multiple PID requests (20 July 2005)

Discussion
The GPOH PID generator showed a low homonym error rate, as no homonyms occurred in 14,915 test data records. As mentioned above, the probability of a homonym increases with the number of data records stored in the patient list. The number of persons participating in a medical trial is, however, limited, and we conclude that the PID generator is well suited for distinguishing among persons in medical competence networks if configured appropriately.
As explained above, no general conclusion can be drawn about the synonym error rate. The GPOH PID generator gave good results in the test runs (six synonyms in 14,915 records), but this might be due to the small number of duplicates in the input data. Relative to the number of duplicates alone, the synonym rate was quite high (six of eight duplicates went unrecognized). The test data had probably been purged of duplicates beforehand, so that only duplicates that are hard to detect remained. All the synonyms were caused by errors in date of birth. For this field, there is no error tolerance of the kind provided for the name fields by, for example, phonetic codes. Nevertheless, abandoning the date of birth field is not an acceptable solution to this problem: although synonyms due to erroneous dates could be prevented, many homonyms would be created because of frequent similarities in names. The semiautomatic search for putative duplicates showed 208 record pairs with exact agreement in the first components of first and last names, only eight of which were actually duplicates. This means that at least 200 homonyms would be created from the test data if the date of birth field were abandoned, and even more would be created if phonetic similarity and crosswise agreement of name components were considered.
Reduction of homonym errors and reduction of synonym errors are therefore competing goals, and their achievement depends mainly on the data fields chosen for the matching procedure. The synonym error rate also depends strongly on the quality of the input data and on the frequency of multiple submissions [12], [13].
The PID generator also showed good results for PID generation in the GCCR data. No homonyms were created among the 44,248 records. We cannot rule out homonyms with the 2579 prior entries, as the most informative data fields were encrypted for security reasons, making manual checks on the matches impossible. The 527 data records that already had a PID accounted for 22 synonyms, resulting in a synonym error rate of about 4%. As synonyms occur only for cases that have been entered multiple times, this error rate is acceptable if multiple submissions are rare.
The GPOH statistics showed that some records had been entered two or three times, and a few cases had been submitted up to six times, as shown in Table 5 [Tab. 5]. About 7% of all cases had been submitted more than once. When the time stamps of these cases were checked, however, it became obvious that most of the multiple requests were deliberately submitted one after another by the same person, perhaps to become familiar with the tool or merely out of curiosity. In addition, 527 of the double entries were for GCCR data that had already been assigned PIDs.
Up to now, there has been no case of ambiguous results, showing that the fields chosen are well suited for distinguishing between different persons. A few cases could not be assigned a PID or were assigned a PID after a manual check because only a weak match was found (phonetic similarity in names but different optional data fields). Thus, in the vast majority of cases, PID requests are processed successfully and the user receives a satisfactory result, namely a PID.
Different applications might have different requirements with regard to homonym and synonym error rates, and this should be considered in defining a matching strategy and data fields. Before configuring the PID generator, one should therefore become familiar with the processing of a PID request. One should also assess, for example, how many users will access it. If the quality of the input data is appropriate and if a well-designed configuration is supplied, the PID generator is a suitable tool for matching data records and for generating patient identifiers. The German Competence Network on Paediatric Oncology and Haematology uses the PID generator successfully for identifying participants of clinical trials.
Acknowledgements
We would like to thank the Telematikplattform für Medizinische Forschungsnetze for their support in developing the PID generator. This project was sponsored by the Federal Ministry for Education and Research (BMBF) within the Competence Network on Paediatric Oncology and Haematology, Grant Number 01 GI 99 67/ 3.
We also thank Elisabeth Heseltine for reviewing this report and especially for her very helpful assistance with the English.
References
[1] GPOH PID Service [homepage on the Internet]. [updated 2005 Sep 27; cited 2005 Oct 28]. Available from: https://mi.imsd.uni-mainz.de/psx/.
[2] Pommerening K, Wagner M. Ein Pseudonymisierungsdienst für medizinische Forschungsnetze. Inf Biom Epidemiol Med Biol. 2001;32:251.
[3] Newcombe HB. Handbook of record linkage methods for health and statistical studies, administration, and business. New York: Oxford University Press; 1988.
[4] Newcombe HB, Fair ME, Lalonde P. The use of names for linking personal records. J Am Stat Assoc. 1992;87:1193-204.
[5] Quantin C, Binquet C, Bourquard K, et al. A peculiar aspect of patients' safety: the discriminating power of identifiers for record linkage. Stud Health Technol Inform. 2004;103:400-6.
[6] Roos LL, Wajda A. Record linkage strategies. Part I: Estimating information and evaluating approaches. Meth Inform Med. 1991;30:117-23.
[7] Schmidtmann I, Appelrath HJ, Michaelis J, Thoben W. Empfehlungen an die Bundesländer zur technischen Umsetzung der Verfahrensweisen gemäß Gesetz über Krebsregister (KRG). Inf Biom Epidemiol Med Biol. 1996;27:101-10.
[8] Postel HJ. Die Kölner Phonetik - Ein Verfahren zur Identifizierung von Personennamen auf der Grundlage der Gestaltanalyse. IBM-Nachrichten. 1969;19:925-31.
[9] Michael J. Doppelgänger gesucht - Ein Programm für kontextsensitive phonetische Textumwandlung. c't. 1999;25:252-61.
[10] Daemen J, Rijmen V. The design of Rijndael. Springer; 2002.
[11] Rivest R. The MD5 message-digest algorithm. RFC 1321, MIT LCS & RSA Data Security Inc.; 1992.
[12] Brenner H, Schmidtmann I. Determinants of homonym and synonym rates of record linkage in disease registration. Meth Inform Med. 1996;35:19-24.
[13] Brenner H, Schmidtmann I. Effects of record linkage errors on disease registration. Meth Inform Med. 1998;37:69-74.
 
                                                        


