How to Securely Collect Data for Personalized Medicine

In my last blog, I discussed the rationale for applying privacy and security best practices to enable the benefits of personalized medicine while minimizing the risk of breaches and other security incidents. One of these best practices involves walking through each stage of the information lifecycle: collection, use, retention, disclosure, and disposal. In this blog I take a look at the collection stage.

Collecting information for personalized medicine requires informed patient consent. Patients must be informed about the benefits, the risks, who will have access to their data, how their data will be processed, and the choices they have regarding their personal healthcare information. This applies both to the physical samples collected, such as saliva and blood, and to the raw genome sequence data.

Research is needed to further the science of analyzing and deriving meaning from genetic information, and this research needs genetic data. Patients are typically presented with a choice of whether to participate in this type of research, and whether they want to authorize sharing of their genetic data, most often in de-identified form, with such researchers. Choices presented to the patient are typically either opt-in or opt-out.

With opt-in, the patient's data is not shared with researchers by default; it is shared only if the patient explicitly opts in. Alternatively, the patient may be presented with an opt-out choice, where the default is for their data to be shared with researchers unless they explicitly opt out. These basic “all or nothing” opt-in/opt-out choices are often overly simplistic and don’t give the patient much control over their data. More sophisticated consent and choice mechanisms will be required in the future to give the patient greater control over who has access to their data, for what purposes, how they can get access and participate, and so forth.
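
To make the difference between the two defaults concrete, here is a minimal Python sketch. The names ConsentModel and may_share_with_researchers are hypothetical, made up for illustration, and a real consent system would also need to record who consented, when, and to what.

```python
from enum import Enum
from typing import Optional


class ConsentModel(Enum):
    """Hypothetical labels for the two consent models described above."""
    OPT_IN = "opt-in"    # default: data is NOT shared
    OPT_OUT = "opt-out"  # default: data IS shared


def may_share_with_researchers(model: ConsentModel,
                               explicit_choice: Optional[bool]) -> bool:
    """Decide whether a patient's data may be shared with researchers.

    explicit_choice is None when the patient has not stated a preference,
    True when they explicitly agreed to share, and False when they
    explicitly declined.
    """
    if explicit_choice is not None:
        # An explicit patient decision always overrides the default.
        return explicit_choice
    # No decision recorded, so fall back to the default of the model.
    return model is ConsentModel.OPT_OUT


# A patient who never responds is treated differently under each model.
silent_under_opt_in = may_share_with_researchers(ConsentModel.OPT_IN, None)
silent_under_opt_out = may_share_with_researchers(ConsentModel.OPT_OUT, None)
```

The only thing that changes between the two models is what happens when the patient says nothing, which is why a single yes/no switch gives so little control over who gets the data and for what purpose.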

Some types of genetic research require more than fully de-identified data. An example is phenotype research, which requires information about the patient's environment, for example their ZIP code. This is location information about the patient and therefore Personally Identifiable Information (PII). When associated with healthcare information such as genetic information, it can cause the combination to be classified as Protected Health Information (PHI) under HIPAA and become subject to legal and regulatory requirements, for example breach notification in the event of loss or theft.

For this reason, tokenization, which replaces identifying details with a surrogate value, is often used in the collection of genetic information from patients. Right from the time physical saliva or blood samples are taken, they are often bar-coded to associate them with the patient, in contrast to labeling the samples with elements of the patient's PII such as name, date of birth, and so forth. Tokenization may also be used later to enable authorized access to limited PII, in addition to de-identified genetic data, in order to support more sophisticated research such as phenotype research.
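
The tokenization idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: TokenVault, tokenize and detokenize are hypothetical names, the mapping lives in memory, and the boolean "authorized" flag stands in for whatever real access control an organization would use. A production vault would be a hardened, audited service.

```python
import secrets
from typing import Dict


class TokenVault:
    """Hypothetical in-memory vault mapping random tokens to patient IDs."""

    def __init__(self) -> None:
        self._token_to_patient: Dict[str, str] = {}
        self._patient_to_token: Dict[str, str] = {}

    def tokenize(self, patient_id: str) -> str:
        """Return a random token for the patient, reusing any existing one."""
        existing = self._patient_to_token.get(patient_id)
        if existing is not None:
            return existing
        # The token is random, so it reveals nothing about the patient.
        token = secrets.token_hex(8)
        while token in self._token_to_patient:  # extremely unlikely collision
            token = secrets.token_hex(8)
        self._token_to_patient[token] = patient_id
        self._patient_to_token[patient_id] = token
        return token

    def detokenize(self, token: str, authorized: bool) -> str:
        """Map a token back to the patient, for authorized callers only."""
        if not authorized:
            raise PermissionError("caller is not authorized to re-identify")
        return self._token_to_patient[token]


vault = TokenVault()
# "patient-0001" is a made-up identifier; the token is what would be
# printed as the bar code on the sample tube instead of name or birth date.
sample_label = vault.tokenize("patient-0001")
```

Because the link between token and patient exists only inside the vault, a lost or stolen sample label exposes no PII on its own, while an authorized researcher can still be given a controlled path back to limited PII when the research calls for it.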

Encryption may be used to protect the confidentiality of collected sensitive data at rest and in transit, including elements of PII stored in secure databases. Genetic data can take the form of very large data sets. For example, a single raw genome sequence can be several hundred GB or larger in size. Encrypting this volume of data while maintaining performance requires hardware acceleration, such as Intel® AES-NI (Advanced Encryption Standard – New Instructions).

What types of privacy and security challenges and solutions do you see with the collection of data for personalized medicine?