Anonymising and sharing patient data

Patient data is extremely valuable for biomedical and healthcare research. Collecting and sharing patient data globally can bring several benefits: a better understanding of diseases, identifying patterns in public health and disease, developing and monitoring drugs and treatments, allowing researchers to build efficiently on the work of others, and finding suitable candidates to take part in clinical trials. However, concerns about privacy have been a barrier to making patient data available. Data custodians can legally share patient data for research via (a) consent and (b) anonymisation. Relying on consent as the primary method for sharing data is difficult and time-consuming, as it is not practical to obtain consent from large numbers of patients and there is evidence of consent bias (Kho et al., 2009). Ethics boards allow patient data to be shared without consent for research purposes if it is anonymised (Willison et al., 2008). However, there is an expectation that this anonymisation is adequate and that the data is used for legitimate purposes. Studies show that, when concerned about how their health data is used, patients adopt defensive privacy behaviours, such as giving inaccurate information and not seeking care (Malin et al., 2013). It is therefore crucial to increase public confidence by creating clear governance frameworks for accessing patient data and by developing methods to safeguard patients against identification.

The concept of anonymous or non-identifiable data can be ambiguous. In an effort to clarify inconsistencies, El Emam et al. (2015) describe the key principles for anonymising health data while ensuring it remains suitable for relevant analysis. The group explains that ensuring anonymity technically means ensuring that there is a very small probability of assigning a correct identity to a record in a dataset. Existing guidelines divide the variables in a dataset into two groups: direct identifiers and quasi-identifiers. Direct identifiers allow direct recognition of, or communication with, the corresponding individuals (e.g. names, addresses and social insurance numbers), while quasi-identifiers can indirectly identify individuals (e.g. date of birth, postal code and ethnicity). Both groups must be addressed during anonymisation. The acceptable probability of re-identification of a record varies according to how the data is being shared. For a public data release the probability needs to be low because there are no other controls in place. For non-public data, a higher probability is acceptable because other security and contractual controls are already in place. If the probability of re-identification is too high, perturbation techniques can be applied to reduce it. One of the simplest and most commonly used ways to perturb data is to reduce its precision through generalisation, for example, generalising a date of birth into a month and year of birth. More complex computational methods of perturbation have been reviewed by Gkoulalas-Divanis et al. (2014); these methods can reduce the amount of distortion and produce higher data quality. However, knowing when to stop perturbing the data is important, both to balance privacy protection against data utility and to ensure that inadequate anonymisation techniques do not slow down research, as other disproportionate data protection measures already do.
Anonymisation is usually time limited to account for advances in technology and for the availability of other data that can be used for re-identification.
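
To make the idea of generalisation concrete, here is a minimal sketch in Python. It is an illustration only and is not taken from any of the cited articles: the records are invented, the function names are hypothetical, and the risk measure (one divided by the size of the smallest group of records sharing the same quasi-identifier values) is an assumption chosen as a simple stand-in for the probability of assigning a correct identity to a record.

```python
from collections import Counter
from datetime import date

# Invented toy records: (date of birth, postal code) are the quasi-identifiers.
records = [
    (date(1980, 3, 14), "K1A 0B1"),
    (date(1980, 3, 2), "K1A 0B1"),
    (date(1980, 3, 27), "K1A 0B1"),
    (date(1975, 11, 5), "K2P 1L4"),
    (date(1975, 11, 19), "K2P 1L4"),
]


def generalise(record):
    """Reduce precision: keep only the year and month of birth."""
    dob, postal_code = record
    return (dob.year, dob.month, postal_code)


def max_reidentification_risk(rows):
    """Worst-case risk, assumed to be 1 / size of the smallest group of
    records that share identical quasi-identifier values."""
    group_sizes = Counter(rows)
    return 1 / min(group_sizes.values())


# Every exact date of birth is unique, so each record stands alone.
risk_before = max_reidentification_risk(records)

# After generalisation the records fall into groups of 3 and 2.
risk_after = max_reidentification_risk([generalise(r) for r in records])

print(f"risk before generalisation: {risk_before}")
print(f"risk after generalisation:  {risk_after}")
```

In this toy dataset, generalising the date of birth is enough to stop any record being unique. Whether the resulting risk is acceptable would depend on how the data is shared, as discussed above.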

Rare diseases

When it comes to rare diseases, anonymising and sharing patient data becomes even more important. The rarity of these diseases makes it difficult to gather information, to develop treatments and to conduct large clinical trials. Data sharing has such value in these cases that it is almost a necessity for the progress of rare disease research. The presence of a rare disease does not necessarily make data impossible to anonymise. If the dataset is a sample from a population of patients with that disease, the probability of re-identification may still be very small (Eguale et al., 2005).

The VHL Alliance, in collaboration with NORD and partially funded by the Myrovlytis Trust, has developed the CGIP Databank to collect detailed medical information on patients all over the world with BHD, VHL, HLRCC, SDH and other related tumours. The aim is to help scientists and clinicians discover possible factors contributing to disease progression, to help evaluate the efficacy of novel therapies and to enable more complete advice to patients. Therapies for these diseases are emerging and need to go through clinical trials to evaluate their effectiveness. The databank may also be used to contact patients about trials for which they may be eligible, including those investigating new treatments. Participation in a clinical trial will be based upon the voluntary consent of the patient.

The CGIP Databank is maintained on a secure server, and only authorised researchers and clinicians will be able to access an anonymised dataset.

If you have BHD, VHL, HLRCC or SDHB and would like to find out more information, or if you want to join the databank and help advance research, please click here. If you have any questions or thoughts, please contact the VHL Alliance on [email protected].

Eguale T, Bartlett G & Tamblyn R (2005). Rare visible disorders/diseases as individually identifiable health information. AMIA Annu Symp Proc PMID: 16779234
El Emam K, Rodgers S & Malin BA (2015). Anonymising and sharing individual patient data. BMJ PMID: 25794882
Gkoulalas-Divanis A, Loukides G & Sun J (2014). Publishing data from electronic health records while preserving privacy: a survey of algorithms. J Biomed Inform PMID: 24936746
Kho ME, Duffett M, Willison DJ, Cook DJ & Brouwers MC (2009). Written informed consent and selection bias in observational studies using medical records: systematic review. BMJ PMID: 19282440
Malin BA, El Emam K & O’Keefe CM (2013). Biomedical data privacy: problems, perspectives, and recent advances. J Am Med Inform Assoc PMID: 23221359
Willison DJ, Emerson C, Szala-Meneok KV, Gibson E, Schwartz L, Weisbaum KM, Fournier F, Brazil K & Coughlin MD (2008). Access to medical records for research purposes: varying perceptions across research ethics boards. J Med Ethics PMID: 18375687