The last decade has seen remarkable advances in cancer research and medical care, in large part due to the rapid application of genome sequencing technologies, which have themselves become significantly more efficient. Ever larger sets of cancer genome sequencing data are being generated and used for medical research, clinical trials, and, increasingly, clinical care.

However, almost all of these data are siloed in individual research institutions, with few options for sharing or pooling datasets even when researchers and clinicians are willing to work together, especially across international borders. As a result, despite the advances achieved in cancer genome sequencing, information languishes in unconnected silos, and this has begun to stall progress in precision cancer care. Ultimately, a robust rate of improvement in care can only be maintained through sustained data sharing.

The Clinical Cancer Genome Task Team wrote a detailed perspective paper on the global need for somatic cancer mutation data sharing. Somatic mutations occur only in the tumor and differ from an individual’s germline genome, which was inherited from his or her parents; sharing somatic mutations therefore does not compromise the patient’s identity. Research datasets generated by projects such as the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA), and national-level data repositories such as the NCI’s Genomic Data Commons and Genomics England, form a solid core of research-generated data. The rapid uptake of gene panel sequencing in routine clinical treatment provides an expanding new source of data. Together these make a compelling case that now is the moment for a global somatic cancer data sharing project.

Developed countries have begun cancer sequencing projects and pool at least some of their data as part of the ICGC. These projects recruit and consent patients, sequence their tumors, and follow their treatment and medical interventions to varying extents with dedicated staff and funding. Sequencing is done in a research setting, and the resulting data are stored in special access-controlled data centers to comply with data privacy regulations.


At the same time, outside of the research enterprise, cancer genes are increasingly being sequenced for clinical reasons, e.g. to determine treatment strategies. This sequencing is most often done by clinical testing laboratories; however, even when large regions of the genome are sequenced as a part of clinical testing, only actionable mutations, i.e., mutations that can be used for immediate disease diagnosis and/or treatment, are reported back to the ordering clinician. For many patient tests, most sequencing results are not analyzed or used, and even the data that are reported back to the hospital rarely make their way back to research. Instead, research must depend on explicit (and very expensive) secondary data gathering efforts such as the projects mentioned above.

If, instead of these fractured efforts, all tumor mutations identified by clinical sequencing flowed back to research, a virtuous cycle would emerge: data from clinical activity would fuel research discoveries, which in turn would lead to more effective clinical care. The Cancer Gene Trust provides an international solution for rapid genome-wide somatic cancer data sharing, based on a nimble infrastructure that allows local sites to maintain best practices for ethics processes and patient consent. It democratizes data analysis, allowing more experts to participate and compare results, and accelerates the translation of genomic findings to a clinically useful timescale.


The Cancer Gene Trust (CGT) is a global network for rapidly storing and sharing somatic cancer data and associated clinical information. It is designed to enable discovery of an unprecedented volume of somatic cancer data by providing open access to a public subset of the data. It enables application builders to focus on using and interpreting the data instead of resolving disparate access methods from multiple sources, or failing entirely because data are simply not available in any format. Such uniform public discovery and access are unprecedented for clinical cancer data sets. Indeed, the CGT facilitates research studies and clinical care on a timescale not previously possible, while allowing data holders to maintain the privacy and security of individual data sources and the non-public subset of the data, and respecting individual patient consents and cultural data sharing preferences and expectations.

Our approach leverages open source technology to create a lightweight, global off-blockchain decentralized network controlled by “stewards” that make limited somatic mutation and related clinical data about a patient publicly available. A steward can be a hospital, a collection of hospitals, a national database, or any organization that manages health data of patients. The public data include DNA mutations, but are restricted to those mutations that occur only in the tumor (i.e., somatic mutations). We do not make germline DNA information public; it is always stored in a trusted private repository consistent with patient consent. The public data may also include other molecular tests, such as gene expression levels and protein activation, as well as imaging and general clinical data such as age of disease onset, cancer type, year of diagnosis, and as much treatment/drug information as the steward is able to share. While the genetic data may not reach the quality and reliability of the centralized research cohorts and data curation infrastructure built for government research projects with dedicated funds, the CGT enables aggregation of data from an order of magnitude more cases, and ultimately may have better clinical data. Data curation and harmonization systems built on top of the CGT provide gradations of data quality and reliability, incorporating user feedback.

In our model, no identifiable information is shared publicly. A steward is responsible for removing names, addresses, and other personal health information. If a steward wants to enable access requests or re-contact, a link to a patient’s identity can be maintained by including a random number in the submission whose association with the patient is known only to the local steward. The protected patient data, including the raw DNA and RNA sequencing reads, are held only by the steward.
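The re-contact link described above can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the class `StewardLinkTable`, the function `deidentify`, and the field names (`name`, `address`, `medical_record_number`, `link_token`) are hypothetical and are not part of the CGT software.

```python
import secrets

# Hypothetical field names; a real steward would follow its own
# de-identification policy and local regulations.
IDENTIFYING_FIELDS = {"name", "address", "medical_record_number"}


class StewardLinkTable:
    """Private, steward-side table mapping random tokens to patients."""

    def __init__(self):
        self._token_to_patient = {}

    def new_token(self, patient_id: str) -> str:
        # A cryptographically random token reveals nothing about the patient.
        token = secrets.token_hex(16)
        self._token_to_patient[token] = patient_id
        return token

    def resolve(self, token: str):
        # Only the steward, who holds this table, can map a token back.
        return self._token_to_patient.get(token)


def deidentify(record: dict, token: str) -> dict:
    """Drop identifying fields and attach the random link token."""
    public = {k: v for k, v in record.items() if k not in IDENTIFYING_FIELDS}
    public["link_token"] = token
    return public


table = StewardLinkTable()
record = {"name": "Jane Doe", "address": "1 Main St",
          "medical_record_number": "MRN-001", "cancer_type": "melanoma"}
token = table.new_token(record["medical_record_number"])
public_record = deidentify(record, token)
```

The public record carries only the token, so a researcher who wants to reach the patient has to go through the steward, who alone can call `resolve`.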


Through this effort, every cancer patient’s tumor genetic tests can potentially be added to the CGT, allowing for massive gains in research. On-ramps for data donation can be established at a number of levels, from local community and advocate-driven efforts to national programs. For example, patient data can be entered from local hospitals into the public CGT via an extension of current programs conducted by cancer registries. In the United States, these are programs accredited by the National Cancer Registrars Association, and include the NCI Surveillance, Epidemiology, and End Results (SEER) program. The International Association of Cancer Registries (IACR) lists similar institutions in almost every country.

Worldwide there are 14 million new cancer cases each year, and many patients will be tested multiple times during the course of the disease. With uptake for even a small percentage of these cases, the CGT can expand rapidly and have a high impact. Given its ability to scale to millions of cases, the CGT could within a relatively short time become one of the largest sources of somatic cancer data.


An individual submission to the CGT may consist of somatic mutation, gene expression, imaging, and related clinical data about an individual patient. The steward is responsible for ensuring that these data do not include identifiable information. Sharing data is not free: being a steward requires work, and collecting this information in a hospital setting takes time and effort. To keep the cost and the manual work to a minimum, we have developed prototype software for staff from two separate departments:

1) The clinic or clinical lab. This is the department from which the tissue sample is sent to the testing company, and that receives the results. We provide software so that the clinical lab’s technicians can extract from the test results only the mutations that occur specifically in the tumor, as well as gene expression levels, protein activation data and other clinical test results when available. The software extracts and submits the limited data to the CGT automatically with just a few clicks.

2) The cancer registrar. In most larger hospitals, a registrar collects general clinical data about patients, summarizes them and reports them to state or federal authorities to aid with epidemiological and related questions, e.g. to identify geographical regions with higher cancer incidence. Our second software module will automatically summarize and share the limited clinical cancer data that can be made public with the CGT. Again, to facilitate implementation by global stewards with variable resources, this should not require more than a few mouse clicks.
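The two extraction steps above can be sketched in Python as follows. This is an illustrative sketch only, not the prototype software itself: the report layout (a list of variant records with an `origin` field) and the names `extract_somatic`, `build_submission`, and `PUBLIC_CLINICAL_FIELDS` are assumptions.

```python
from typing import Dict, List

# Hypothetical whitelist of clinical fields a steward agrees to publish.
PUBLIC_CLINICAL_FIELDS = ("cancer_type", "age_at_onset", "year_of_diagnosis")


def extract_somatic(variants: List[Dict]) -> List[Dict]:
    """Keep only variants explicitly labelled as somatic.

    Variants with a missing or different `origin` are dropped, so
    germline calls never reach the public submission by accident.
    """
    return [v for v in variants if v.get("origin") == "somatic"]


def build_submission(variants: List[Dict], clinical: Dict) -> Dict:
    """Combine tumor-only mutations with the whitelisted clinical summary."""
    return {
        "mutations": extract_somatic(variants),
        "clinical": {k: clinical[k] for k in PUBLIC_CLINICAL_FIELDS
                     if k in clinical},
    }


report = [
    {"gene": "BRAF", "change": "V600E", "origin": "somatic"},
    {"gene": "BRCA1", "change": "185delAG", "origin": "germline"},
    {"gene": "TP53", "change": "R175H"},  # origin unknown: excluded
]
registry_entry = {"name": "Jane Doe", "cancer_type": "melanoma",
                  "year_of_diagnosis": 2016}
submission = build_submission(report, registry_entry)
```

Using a whitelist for the clinical fields, rather than a list of fields to remove, means a new field in the registry entry stays private until someone deliberately adds it to the whitelist.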

The public submission is identified by a code known as a “hash” to enable users and submitters to address and verify the submission uniquely and consistently anywhere in the world where it is stored. As an added benefit, a hash can easily be used as a reference to a submission that can be added into an existing electronic health record or cancer registry submission as a simple text field, or on a blockchain.
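To illustrate how a hash can address and verify a submission, here is a minimal Python sketch. It assumes a submission is a JSON-serializable dictionary and uses a plain SHA-256 hex digest; the function names `submission_hash` and `verify_submission` are hypothetical, and the digest will not match identifiers produced by IPFS, which uses its own encoding.

```python
import hashlib
import json


def submission_hash(submission: dict) -> str:
    """Hash a canonical JSON encoding so equal content gives an equal hash."""
    canonical = json.dumps(submission, sort_keys=True,
                           separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def verify_submission(submission: dict, expected_hash: str) -> bool:
    """Check that a retrieved copy matches the hash it was requested by."""
    return submission_hash(submission) == expected_hash


original = {"cancer_type": "melanoma", "mutations": ["BRAF V600E"]}
reordered = {"mutations": ["BRAF V600E"], "cancer_type": "melanoma"}
tampered = {"cancer_type": "melanoma", "mutations": ["BRAF V600K"]}

reference = submission_hash(original)
```

Because the hash depends only on the content, the same string can be stored as a text field in a health record and later used to confirm that a copy fetched from any node is unaltered.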

Link To Care

Because a trusted steward at a medical center will collect and maintain personal information about the patient as well as clinical information, the steward provides an essential secure link between researchers and patients. For example, this allows researchers to contact a patient with a specific rare variant, identified in the public CGT, through a trusted steward to invite them to enroll in a clinical trial. This method also facilitates obtaining further information about the patient’s genetic makeup or clinical history beyond what is available in the public CGT, consistent with the patient’s consent.

The public CGT is globally networked, increasing the number of test records by an order of magnitude beyond what could be obtained by any single country. Such a resource enables a profound cultural shift in the way we use data to improve understanding and treatment of cancer. When it becomes easy to be a steward who shares data, those who do not will no longer have a technical excuse. Hypothetical examples of the translational use cases, adapted from the GA4GH’s cancer data sharing perspective paper, include:


To understand the informatics structure of the CGT, recall the early Internet. Instead of a top-down, deterministic network structure with centrally curated content like AOL, the TCP/IP protocol of the early Internet defined a decentralized way to transport and deliver packets of information. Networks peered with each other organically, and on this substrate a thousand applications grew.

Below is a diagram of a demonstration network of nodes with test data, showing peering relationships for data replication. Note that some of the names are surrogates used for illustration only, and that the nodes run on test servers.

[Figure: demonstration network of nodes with test data, showing peering relationships for data replication]

Each computational/data node in the CGT network stores submissions and communicates with the other nodes via IPFS. If you ask any node for a specific submission by specifying its hash, the node first checks whether it holds that submission locally and, if not, asks the nodes it peers with directly. Each node also adds its address and latest top-level hash to the blockchain by sending a transaction to a contract using a distributed application. By querying this contract, a node can find all other nodes as well as their complete sets of submissions. This also guarantees the provenance of the data.
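The lookup and registration behaviour can be modelled with a small in-memory Python sketch. This is a toy model under stated assumptions, not how IPFS or the blockchain contract is implemented: the `Node` and `Registry` classes are hypothetical, a dictionary stands in for the contract, and the peer search is a simple recursive walk.

```python
import hashlib


class Node:
    """Toy CGT node: a local content-addressed store plus direct peers."""

    def __init__(self, address: str):
        self.address = address
        self.store = {}   # hash -> content bytes
        self.peers = []   # nodes this node talks to directly

    def add(self, content: bytes) -> str:
        content_hash = hashlib.sha256(content).hexdigest()
        self.store[content_hash] = content
        return content_hash

    def get(self, content_hash: str, _visited=None):
        # Track visited nodes so peering cycles cannot loop forever.
        visited = _visited if _visited is not None else set()
        if self.address in visited:
            return None
        visited.add(self.address)
        if content_hash in self.store:
            return self.store[content_hash]
        for peer in self.peers:
            content = peer.get(content_hash, visited)
            if content is not None:
                return content
        return None


class Registry:
    """Stand-in for the contract: node address -> latest top-level hash."""

    def __init__(self):
        self._latest = {}

    def publish(self, address: str, top_level_hash: str) -> None:
        self._latest[address] = top_level_hash

    def nodes(self) -> dict:
        return dict(self._latest)


a, b, c = Node("node-a"), Node("node-b"), Node("node-c")
a.peers, b.peers = [b], [a, c]
submission_hash = c.add(b"example submission")

registry = Registry()
registry.publish(c.address, submission_hash)
```

In this sketch `a.get(submission_hash)` succeeds even though node `a` never stored the content, because the request is passed along through `b` to `c`.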

The CGT network model is designed to be robust: multiple nodes can disappear, but if their data (including older versions) have been mirrored through direct or indirect peers as planned, the data will still be accessible. This is extremely important for the reproducibility of science done with the data, and for making the system self-sustaining.
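A minimal Python sketch of this robustness property follows. It assumes each node's store is a plain dictionary keyed by a SHA-256 digest; the functions `put`, `mirror`, and `fetch` and the node names are hypothetical and only illustrate why mirrored, content-addressed data outlive the node that created them.

```python
import hashlib


def put(store: dict, content: bytes) -> str:
    """Store content under its own hash and return that hash."""
    content_hash = hashlib.sha256(content).hexdigest()
    store[content_hash] = content
    return content_hash


def mirror(source: dict, target: dict) -> int:
    """Copy every entry the target lacks; return how many were copied."""
    copied = 0
    for content_hash, content in source.items():
        if content_hash not in target:
            target[content_hash] = content
            copied += 1
    return copied


def fetch(network: dict, content_hash: str):
    """Return the content from any node in the network that still has it."""
    for store in network.values():
        if content_hash in store:
            return store[content_hash]
    return None


network = {"hospital_a": {}, "mirror_b": {}}
v1 = put(network["hospital_a"], b"submission, version 1")
v2 = put(network["hospital_a"], b"submission, version 2")
copied = mirror(network["hospital_a"], network["mirror_b"])

# The original node disappears; both versions remain reachable.
del network["hospital_a"]
```

Each version has its own hash, so an updated submission does not overwrite the older one, and an analysis that cited `v1` can still retrieve exactly the data it used.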


How do we defeat a disease that appears in millions of different guises? The same way we tackled the problem of sequencing millions of DNA bases: we work together. Together we can uncover cancer in all its different molecular forms, no longer letting the precious genetic and clinical information from countless personal battles slip through our fingers, never to inform the battles of others to come. Cancer patients desperately want their struggles to mean something, to make a difference. Through the Cancer Gene Trust, patients, clinicians, researchers, and advocates can unite worldwide to uncover the molecular characterization of cancer at an enormous scale, and make the complete data readily available for research and clinical care.