Skip to main content

Abstract

Using the time-dependent dynamics of gene expression from immune cells in blood, we aimed to explore single gene expression trajectories as biomarkers for death after a diagnosis of breast cancer introducing a new statistical method denoted Difference in Time Development Statistics (DTDS). This shows as proof of principle that the gene expression profiles from immune cells in blood differed in the postdiagnostic period are dependent on later vital status.
The gene expression analyses of 394 breast cancer cases and age-matched controls were obtained from the Norwegian Women and Cancer (NOWAC) postgenome biobank (N = 50 000) performed in blood taken 0–8 years after a breast cancer diagnosis. The tube contained a protective buffer that preserved the mRNA in the blood. Cancer diagnosis and cause of death were based on linkage with the Norwegian Cancer Registry. The new statistical method was designed to test the difference in the time development between two strata using a non-parametric representation of the time development of the gene expression and used the area between the curves, i.e. the integral between the cures, as test statistics.
The time-dependent curves or trajectories exerted clearly non-linear changes with rapid transient mostly increasing fold changes, in cases who later died. Survivors had no changes. For cases who died this transient increase was followed by a regression towards the gene expression profiles of survivors. For 86 genes, the integrated area from 18 months to 8 years post diagnosis was highly significant (p<0.00001) among women who died. There were indications of stronger relationship in metastatic cases alone.

Introduction

In 2017, the number of cancer deaths in Norway exceeded that of cardiovascular deaths for the first time (Norwegian Institute of Public Health, Norway, 2018). While the number of cancer deaths has remained fairly stable over recent years, the number of cardiovascular deaths has decreased rapidly. This points to the urgent need for further improvements in cancer treatment for an ageing population. For women in Norway, breast cancer is the most common invasive cancer, constituting 23% of all cancers diagnosed among women in 2017 (Cancer Registry of Norway, 2018). Although significantly improved, the majority of breast cancer deaths are due to metastasis, not the tumor. One hundred years ago the survival for women with metastatic cancer was only 5% after five years, while today the ten-year survival rate of metastatic breast cancer is 85% (Reddy et al., 2018). In order to further improve cancer diagnosis, personalized treatment is moving forward (Jeibouei et al., 2019). Individualized treatment should be based on predictors for individual outcome. The potential of immune response has become evident through the recent use of immune therapy (Stroncek et al., 2017). Biomarkers in blood or liquid biopsies could be functional genomics i.e. transcriptomics or methylation, or metabolites or proteins.
We proposed the compilation of time trajectories of gene expression in blood from many independent case-control pairs as a potential liquid biopsy in order to study the impact of the immune system on carcinogenesis (Lund et al., 2016). A gene’s trajectory corresponds to a curve that represents the changes in gene expression as a function of time, consisting of differences of gene expression between many case-control pairs. Healthy controls establish the level of expression for genes not involved in carcinogenesis, and is assumed to be constant over time. Genes related to the immune system and/or carcinogenesis (expressed in cases) may change over time. Lack of a priori knowledge of the shape of the trajectories demands an agnostic approach (Spitz & Bondy, 2010) including adjustment for multiple testing (Reiner, Yekutieli, & Benjamini, 2003). Gene expression is analyzed as a potential biomarker of carcinogenesis/metastasis, and the statistical quantity of interest is the distribution of the gene expression as a function of time after diagnosis.
In a recent study of gene expression profiles in the years after diagnosis stratified on clinical stages significant differences in the overall gene expression profiles were found (Lund submitted PLOS).
The aim of this study is to explore single gene expression trajectories from immune cells in blood over the first years after diagnosis as predictors of later vital status, dead or alive. In order to use the cumulated evidence over time for clinical follow-up a new statistical method, denoted Difference in Time Development Statistics, was developed; see below.

Methods

This new statistical method, denoted Difference in Time Development Statistics (DTDS), is designed to test differences in time development in a non-parametric manner of two variables or the same variable for two different strata. In this paper, the method is used in order to identify genes where the gene expressions in blood samples have a different time development after diagnosis of breast cancer. The dataset consists of case-control pairs in which the case is diagnosed with the disease and the control is healthy. The data is the difference in log2 gene expression in blood samples between the case and the control. The gene expression profiles that are measured represent an aggregate of the transcriptional activity of all the blood cells at the time of blood collection. The DTDS method will be used on the postdiagnostic or clinical follow-up in the NOWAC postgenome cohort, where each blood sample, regardless of disease status, was collected at a random follow-up time. We will first describe the epidemiological design necessary for studies of the postdiagnostic trajectories, and then describe the statistical concepts.

Material

The overall NOWAC postgenome biobank

Recruitment for the prospective Norwegian Women and Cancer (NOWAC) study started in 1991 (Lund et al., 2008). Women were randomly sampled from the Norwegian population register in Statistics Norway. The women were mailed a letter of invitation and a questionnaire. Follow-up was based on linkage to the Cancer Registry of Norway and the register of deaths were based on the unique national birth number given to all Norwegian inhabitants. Repeat questionnaires were mailed with intervals of some years. In the years 2002–2006, women were invited to participate in a subcohort, the NOWAC Postgenome cohort study; for further details see Dumeaux et al., 2008. The main purpose was to establish a biobank suitable for analyses of functional genomics, in particular transcriptomics. Random samples of NOWAC women were drawn in weekly batches of 500 women until 50 000 women had responded positively. Women were invited to fill in another questionnaire and donate a blood sample at a health-care institution such as a GP’s office. The blood samples were sent overnight to the institute by special post for biological samples. The tube contained a protective buffer that preserved the mRNA in the blood (PAX gene blood RNA system), allowing frozen storage over time and optimizing sensitivity of the analysis.
The present analysis used a subsample of the NOWAC postgenome biobank participants. Women who had both filled in a questionnaire in 1996–1998 and had given a Postgenome blood sample were eligible, a total of 31 101 women. Since collection of blood was at random without knowledge of disease status, the procedure gave a uniform distribution of gene expression measurement over time.
In 2010, breast cancer cases diagnosed between 1996 and 2006 were identified through a linkage to the national cancer registry. An age-matched control was drawn at random from the same batch of 500 women. A total of 394 incident breast cancer cases were identified. Those rendered non-eligible were six technical outliers, seven cases with unknown metastases, seven cases with another incidence of breast cancer before blood collection, ten controls diagnosed with cancer before blood donation, and one who emigrated, leaving 363 case-control pairs for the present analyses.
A linkage to the register of vital status in Norway gave a complete follow-up after blood donation until the end of the study on 31.12.2014, or death or emigration. Causes of death according to different strata of metastatic/invasive cancer at time of diagnoses are given below.
In order to update changes in clinical stage or a second breast cancer and to remove controls with an incidence of cancer, another linkage was performed in 2018 with the Cancer Registry of Norway. For six women with metastases and ten cases with another incidence of breast cancer, the updated information was used to change the start of follow-up.
Of the 363 case-control pairs, 85 were omitted since the follow-up time for the cases that are observed before 18 months from diagnosis are heavily influenced by the treatment. We therefore first analyze a dataset of 39 cases who died from cancers and compare them with 239 cases who did not die of cancer, i.e. a total dataset of 278 case-controls. Later, we reduce this to a dataset consisting of 23 cases with metastatic breast cancer who died of breast cancer and 79 cases with metastatic breast cancer who did not die of cancer; see Table 9.1.
Table 9.1. Further classification of the data and specification of the two analyzed datasets with 278 and 102 case-control pairs
StrataDied of breast cancerDied from non-breast cancerSum died of cancerSurvivedDied, not cancerSum, not died of cancerSum
Metastatic32436973100136
Invasive115162056211227
Sum439523029311363
Dataset one where data before 18 months are excluded
Metastatic  27  82109
Invasive  12  157169
Sum  39  239278
Dataset two where data before 18 months are excluded
Metastatic23  79  102

Statistical methods

The dataset consists of two strata of women with breast cancer in which the cases died or did not die of cancer and the observation time is the time after the last diagnosis. For each gene and stratum, we estimate the differences between cases and controls in gene expression as a smooth function using a moving window in time. We then estimate the differences in the time development between the two strata by calculating the area between the two estimated curves for the smoothed gene expression for the two strata. If there is a systematic difference in the level or the time development of the gene expression between the two strata, this area is large. We will test three hypotheses. The first hypothesis, H0A, concerns individual gene trajectories, while H0B looks at all genes together. We also predict the vital stage, dead or alive, of each case using cross-validation. H0C states that this prediction is independent of vital stage.

H0A: Identify genes with different time development

We first focus on identifying genes with a different time development. Let X c , g be the difference in log2 gene expression for case-control pair c = 1,2, ..., M for gene g = 1,2, ... N g . Further, let t c be the time of observation relative to diagnosis for the case in the case-control pair c. We assume X c , g ~N(f s ( c ) ,g (t c ), σ g ) where σ g is the standard deviation and s(c) is the stratum of case c. We estimate the function f s , g (t) by taking an average of the observations X c , g from stratum s(c) in an interval that includes the n nearest observations in time, i.e. the n/2 observations with largest t c but t c  < t and the n/2 observations with smallest t c but t c  > t. The number n is a tradeoff between precision and resolution. It should be large enough that the estimate in an interval should not depend on a single data point and at least smaller then M/4 in order to get resolution in time. If there is a large difference in the time development between the two strata, the test statistic or area V g  = |f a , g  – f b , g | = ∫|f a , g (t) – f b , g (t)|dt describing the area between the curves, will be large where the two strata are denoted a and b, respectively. This estimate is the sum of the absolute value of the differences in average gene expression between the two strata in equally spaced time points assuming the controls have similar values. The integral is evaluated in a time interval where there are observations from both strata.
We make N g , independent hypotheses, i.e. one hypothesis for each gene:
H0A: For gene g, the time development of X c , g is independent of the stratum s(c), i.e. f a , g  = f b , g .
For each gene g, we compare the observed V g , o with the variable V from a simulated distribution where we use a standardization of the same variables X c , g for all the genes simultaneously, but where we randomize the strata s(c) of the cases. We maintain the observations for each gene and the number of observations from each stratum. From the N u simulations, we estimate the probability distribution g(v) = P(V > v) that is independent of the genes. Based on this, we find a p-value p g  = p(V > V g , o ) = (k + 1)/(N u N g  + 1) for each gene g if k of the N u N g simulations have V > V g , o . We correct for multiple testing using the (Benjamini & Hochberg, 1995).
We estimated the functions f s , g with a moving average, where the window size is one-quarter of the respective datasets, i.e. 9 and 59 points, respectively. These functions were evaluated in regularly spaced points, making it easy to evaluate the functions when the observations for each stratum changes position in time. The integral was evaluated in the largest interval such that there were data points from the two strata before and after the interval making the interval equal to (547, 2255) days after diagnosis. The method was applied on a dataset with N g  = 8400 genes. The analysis is performed for standardized gene expressions for each gene
where the standard deviation σ g is taken over the case-controls pairs for each gene. This normalization is necessary in order to compare area between the curves since we want to focus on the differences in time development and not in the mean value and the variance. The results were based on simulations with N u  = 1000 realizations.

H0B: Identify difference in gene development for all genes simultaneously

We will also make a weaker hypothesis where we analyze all the genes simultaneously:
H0B: For all genes, the time development of X c , g is independent of the stratum s(c), i.e. for all genes f a , g  = f b , g .
Note that we only make one hypothesis here. We perform the same N u simulations as in the hypothesis test for H0A, but we use the test statistics V (1), o  > V (2), o  > ··· which is the V g , o variables for all the genes that are sorted in decreasing order. From the simulation, we find the probability for the ordered variables g m (v) = P(V ( m ) > v) for m = 1,2, ..., and the p-value for the hypothesis test p ( m ) = P(V ( m ) > V ( m ), o  = (k + 1)/(N u  + 1) if k of the N u simulations have V ( m ) > V ( m ), o . Here, we have many highly correlated test statistics V ( m ), o for m = 1,2, ..., for testing the same hypothesis H0B. The integer m is chosen by the user. m = 1 means that we are only interested in the most extreme gene and m = 10 means that we are interested in the 10 most extreme genes. This method is most interesting for 3 < m < 50, i.e. where no single gene is significant, but several/many genes have deviating values and where we avoid the multiple testing problem. Ideally, m should be decided before the data is analyzed, but this is not as critical as when alternative test statistics are independent of each other.

H0C: Prediction of strata

It is also possible to use the same technique in order to predict the stratum of a case. The idea is to find out if the observations X c , g for g = 1,2, ..., N g is closest to f a , g (t c ) or f b , g (t c ) for the genes with lowest p-values in the hypothesis test H0A above. Our ambition is only to find the quality of the prediction, not to make a diagnosis for each case. Hence, we make the following hypothesis:
H0C: The prediction P a , c , that the case c belongs to stratum a, is independent of the stratum s(c).
We test the hypothesis using cross-validation. The case-control pairs are divided into the D 1, D 2, ... D Nd groups, which are described in more detail further down. For each of the pairs c ϵ D k we find
where f a , g , k (t c ) is the estimated gene expression for gene g and stratum a at time t c , i.e. the time of observation X c , g based on all the data except the data in D k . This is based on the assumption that X c , g ~N(f s ( c ) ,g ,k(t c ), σ g). The weight w g may be set equal to , possibly modified based on the correlation between the gene expressions for different genes and how significant this gene is for the prediction. How important gene g is for the prediction is estimated from p g , k , the p-value for hypothesis test HOA estimated from all the data except D k . The prediction that the observation X c , g is from stratum a is then
This model gives probabilities that are approximately uniform in (0,1), see Figure 9.1. If we had assumed X c , g ~N(f s ( c ) ,g,k (t c ), σ g ) independently for each gene g, then most P c , a would be close to 0 or 1, which does not correspond to our ignorance in the classification. We use the test statistics
which is the L1 distance between the prediction for stratum a and the indicator for stratum a. We may randomize P c , a between the observations and find a distribution for S. The p-value for the hypothesis test H0C is found from the distribution for S, i.e. p = P(S < S o ).
We used cross-validation and therefore needed to divide the dataset into separate groups. The 39 case-control pairs where the case died of cancer and 239 case-control pairs where the case did not die of cancer were divided into 13 groups, D 1, D 2, ... D 13. The data in each stratum was divided into three time periods for each of the two strata with an (almost) equal number of observations. Each of the 13 groups had (almost) the same number of observations from each stratum in each of the three time periods. For each group k we find the values p g , k from the hypothesis test H0A based on all the data except the data in D k based on 1000 randomizations of the strata s(c).
Figure 9.1. Log2 gene expression from the 12 case-control curves with the smallest p-value of the 8400 genes. The 12 p-values less than 0.00001. The figure uses normalized data as is used in the test statistics. The black continuous and the red dashed curves are the log2 gene expression from the case-controls who survived and died, respectively.

Results

The data used in all the analyses are the differences in log2 gene expression between cases and controls in the period after diagnosis that are shown in Table 9.1, i.e. 278 case-control pairs.

Testing H0A

Results from testing the H0A hypothesis are shown in Table 9.2. The function of the top 10 is shown in Chapter 10.
Table 9.2. The 50 genes with smallest p-value from the 39+239 dataset with metastatic and invasive cancer. The columns show the name, p-value, q-value and area between the smooth curves between the cases who survived and died.
Genep-valueq-value∫|fa,g(t) – fb,g(t)|dt
CCM22.38e-070.00161997
C14orf457.14e-070.00161883
LOC6508981.07e-060.00161869
ARL4A1.07e-060.00161867
CBX31.36e-060.00161842
LOC6467831.90e-060.00161823
FSTL41.90e-060.00161820
C5orf301.90e-060.00161816
LOC3892931.90e-060.00161812
BPGM2.07e-060.00161798
RBM42.17e-060.00161789
AHSP2.28e-060.00161788
CA13.21e-060.00201775
RP11-529I10.43.33e-060.00201766
ISCA1L3.69e-060.00211757
NCBP14.42e-060.00231739
DARC8.09e-060.00401697
HPS19.43e-060.00431686
TSTA39.65e-060.00431686
PDSS11.16e-050.00461668
STOM1.19e-050.00461667
DHX291.21e-050.00461666
RBBP41.41e-050.00511658
RNF111.51e-050.00511653
FZD11.51e-050.00511652
RIPK41.75e-050.00531643
RBM281.81e-050.00531639
XK1.88e-050.00531636
KIAA01741.92e-050.00531633
LOC6465081.92e-050.00531633
GYPB1.94e-050.00531632
MGC130572.06e-050.00531627
LOC6496042.06e-050.00531627
BNIP3L2.28e-050.00551618
TRIM102.29e-050.00551616
SLC14A12.36e-050.00551615
C14orf1242.41e-050.00551615
EWSR12.88e-050.00621603
TRAK22.89e-050.00621603
SELK3.34e-050.00701592
HMBS3.39e-050.00701590
NUDT13.67e-050.00711585
SRRD3.79e-050.00711583
WDR893.81e-050.00711583
NR1D13.85e-050.00711581
SLITRK13.91e-050.00711579
HEMGN3.96e-050.00711577
DNAJB14.24e-050.00741570
LOC6490444.31e-050.00741569
PPIA4.66e-050.00751563
The q-values are the FDR-corrected p-values. From 8400 genes, 733 genes had q-values below the given threshold for hypothesis test H0A (97 with q<0.01 and 636 with q<0.05). The result shows that many genes have a different time development between the two strata. The reduced dataset with only metastatic breast cancer is too small to get significant results on this test. Figure 9.2 shows the functions f died,g (t) and f survived,g (t) separately for each of the 12 most significant genes of the 8400 genes (p<0.000001). The test statistics is the area between the pair of curves.
Figure 9.2. Log2 gene expression from the case-controls curves for the 12 genes with smallest p-value of the 8400 genes. The black continuous and the red dashed curves are the log2 gene expression from the case-controls who survived and died, respectively.
As shown for most genes, the gene expression increases. Noticeably, f survived,g (t) is almost constant and close to 0 in the entire period while f died,g (t) is closer to 1 or –1 in the period (1000,1500) days and then for many genes closer to 0 after 1500 days. The normalization (1) implies that the data for each gene have average value 0 and standard deviation 1 in order to compare data between genes. Since the stratum that survived is much larger, it is natural that the average of these curves is smoother and close to 0. The statistical test shows that deviation in averages value between the strata is significant for many genes. The p-value depends on whether there is a systematic difference in level or time variation of the gene expression, not the size of the difference in average value between the strata since this is removed in the standardization.

H0B: identify difference in gene development for all genes simultaneously

We also want to test all the genes simultaneously. Since we only make one test, it is easier to reject the hypothesis for a smaller dataset. First, we test hypothesis H0B on the dataset with 39 and 239 case-control pairs. There is only one hypothesis, but we have many different test statistics, one for each of m ordered V(m) test statistics for the area between the two curves. The different test statistics indicate whether there is a strong difference in the time development in one or a few genes compared to a smaller difference in many genes. The test for each of the ordered variables is highly correlated. Table 9.3 shows the p-values from the H0B.
Table 9.3. The p-values from hypothesis test H0B for each of the ordered variables. The upper row is from the 39+239 dataset with metastatic and invasive cancer and the lower line is from the 23+79 dataset with metastatic cancer. “<0.001” means that we have not observed any simulated values above the observed value from the data. The lower row shows the p-values from hypothesis test H0B for the reduced dataset on metastatic breast cancer.
Ordered variables151025501005001000
39+2390.002<0.001<0.001<0.001<0.001<0.001<0.0010.002
23+790.0030.0610.0430.0360.0370.0340.0450.052
Notice that we get very significant results and that many genes have a different time development between the two strata.
This test is also performed on the reduced dataset with only metastatic breast cancers. There are only 23 and 79 case-control pairs in the two strata (Table 9.1), those with metastases who died of breast cancer and those who did not die of cancer, respectively. We still get significant p-values, but much larger values than in the larger data set with both metastatic and invasive cancer; see Table 9.4. The differently ordered variables are highly correlated and give typically p-values between 3–6%.

H0C: Prediction of strata

We also want to test whether it is possible to predict the stratum of each case by testing hypothesis H0C. The 13 different datasets leaving out one of the groups D k give a slightly different rankings of the importance of the different genes. The mean correlation between the rankings of the genes for these 13 datasets is 0.85. Table 9.4 shows that there is a large overlap in the most important genes in the 14 different datasets when we include the ranking using all the data. On average, four of the five genes with the lowest p-value when using the entire dataset were among the 10 smallest p-values in the reduced datasets. We have marked the five genes with the smallest p-values when using all the genes with colors. Notice that many of the same genes have small p-values for the different subsets.
Table 9.4. Ranking of the 10 most important genes when we leave out D k from the dataset. The lowest line is the most important genes when we use all the data.
We have tested different predictions methods, i.e. different choices of the weights w g , k . The different choices give highly correlated probabilities. We have found out that w g , k  = 1/p g , k for the 50 genes g with smallest p g , k value for each group is a quite robust choice. Figure 9.3 shows the predicted probabilities for each of the 278 case-control pairs after time of follow-up. Ideally, we wanted all the 39 red and yellow circles to be equal to 1 and the remaining circles equal to 0.
Figure 9.3. The prediction of dying from cancer for the cases who died of breast cancer, died of other types of cancers and died, but not from cancer, and cases who survived for each of the 278 case-control pairs. The figure to the left shows predictions for cases with invasive cancer, while the figure to the right shows prediction for cases with metastatic cancer.
Based on these variables, we find P c , a , S o and the p-value p = P(S < S o ) based on the 10 000 randomization of P c , a We find the p-value less than 0.004 indicating that the prediction is far from random. Table 9.5 gives another presentation of the prediction based on whether P c , a  > 0.3 or not.
Table 9.5. Prediction for each of the 278 cases based on a threshold equal 0.3
 SumP a,c  > 0.3P a,c  < 0.3
Cases who died of cancer392217
Cases who did not die of cancer23975164
Increasing the thresholds from 0.3 will decrease both the number of true positive and the number of false positive. The threshold 0.3 is chosen as a balance between false positive and false negative. This gives a sensitivity equal 0.56 and specificity equal 0.69. The mean prediction value for the 39 cases who died is 0.32 and the mean prediction value for the 239 cases who survived is 0.23. The predictions are also shown in the boxplot in Figure 9.4.
Figure 9.4. Boxplot of prediction of death from cross-validation of cases after 18 months from diagnosis. Horizontal lines describe 0.25, 0.5 and 0.75 quantiles. The number of cases and mean in the four categories are metastatic cancer where case died, no: 27, mean 0.33, invasive cancer where case died, no: 12, mean 0.29, metastatic cancer where case survives, no: 82, mean: 0.29, invasive cancer where case survives no: 157, mean: 0.22.
Notice that the invasive cases who died and the metastatic cases who survived have a relatively similar prediction which is between the prediction of the metastatic cases who died and the invasive cases who survived.

Discussion

We have shown that the trajectories of gene expression after diagnosis of breast cancer were mostly significantly upregulated for hundreds of genes in the years after a diagnosis of metastatic breast cancer compared to invasive cancer, as shown in Figure 9.4. These signals may be considered as signals of an upcoming death due to cancer. Fewer genes were downregulated. After some years, most upregulated genes levelled off while downregulated genes slowly returned to the normal expression level. Among women with invasive breast cancer, no significant trajectories were found. These results were based on a new statistical approach using the differences in the area between the trajectories of gene expression between diseased and healthy women.
For practical and economic reasons, only one single measurement at time of inclusion was available for each individual in the NOWAC postgenome cohort. Hence, the processual approach relies on the assumption that the gene expression in distinct individuals at different times before or after diagnosis is a consequence of the same carcinogenic process (Lund & Plancade, 2012). Semi-parametric models with time-varying covariates, e.g. the Cox model (Cox, 1972), cannot be estimated from a prospective design including only one unique measurement at time of inclusion, unless covariates are assumed to be constant over time. Consequently, this assumption would not allow us to address changes in gene expression over time.
The DTDS is a further development of the LITS method (Holden, Holden, Olsen, & Lund, 2017), where we used a moving window and summary statistics for all genes for each of the stratum and time period. The genes that were significant in each time interval varied between the intervals, making the LITS method not suitable for identifying genes with different time development. In contrast, the DTDS method is able to identify genes with different time development. Both methods use the same method for simulation and randomization of gene expressions between the case-control pairs with cases from the different strata.
The distribution of measurements of gene expression must follow a constant function, i.e. with measurements spaced over the time interval. Most cohort studies have repeated measurements, but usually they are collected for all participants with several years of spacing and can be used as repeated measurements only.
We cannot predict the outcome for single individuals, only on a group level. The results can be looked upon as a proof of concept for the idea that gene expression measured repeatedly over time after diagnosis can be used as a predictive test for the vital outcome.
Little is known about the changes in gene expression in the blood in the period after a breast cancer diagnosis, i.e. the time period after the primary treatment (Lund & Plancade, 2012). In the stratified analysis, both invasive and metastatic cases were compared to healthy women without known cancers. The consistent and highly significant differences between the two strata adds information that can be used toward a new hypothesis of metastatic breast cancer and its high lethality. For hundreds of genes, the integrated area between the two curves for each stratum accumulates during follow-up, indicating ongoing dysregulation of important genes. These strong changes in gene expression from the immune cells can be viewed as signals of upcoming death. The intention here was to explore the unknown trajectories of gene expression after diagnosis of breast cancer. The interpretation of each gene was outside the scope of our exploration. Still, some hypotheses can be put forward.

Human model of carcinogenesis—interpretation of highly expressed genes

No unifying theory exists for human carcinogenesis; the number of proposals is many (Vineis, Schatzkin, & Potter, 2010). To date, most mechanistic or pathways analyses have been experimental in-vitro or animal studies. With the increasing knowledge about human carcinogenesis in tumor tissues or in blood at time of diagnosis, some disturbing facts about the validity of the animal models for human carcinogenesis have been brought up. First, the biology of mice and men is comparatively different (Mak, Evaniew, & Ghert, 2014; Anisimov, Ukraintseva, & Yashin, 2005), and a controversial Nature editorial (“Of men, not mice”, 2013) advocated the need for human functional studies. Similarly, the translational value of mouse models in oncology drug development was recently questioned (Gould, Junittila, & de Sauvage, 2015). While cancer can be manufactured in mice quite easily, these models do not necessarily apply to humans (Mak et al., 2014). Consequently, an increasing number of studies use functional genomics as biomarkers, looking both at the exposure relationship and the outcome. While interesting, this approach lacks the distinct focus on the time-dependent process of carcinogenesis. Few, if any, prospective studies have been designed for longitudinal analyses of functional genomics related to the processes of carcinogenesis and metastasis.
Table 9.6. Annotated functions of the most significant genes from Table 9.2
CCM2Regulate angiogenesis and formation of new blood vessels
C14orf45Gene responsible for cilia orientation. One paper shows as low-expressed gene associated with poor survival in BC (higher number of cylia is necessary for improved migration of breast cancer cells)
ARL4AIncrease cell migration
CBX3Shown to be overexpressed in BC and associated with low survival, might block differentiation and promote self-renewal of cancer stem cells
FSTL4Shown to be involved in BC cell migration in mice. Was discussed in relation to late distant metastases in BC here without any conclusions (Mittempergher et al., 2013)
C5orf30Known to be expressed in BC and especially in lymph-node metastases. Promote inflammation and hypothesized to reduce immune response against cancer cells
RBM4Known tumor suppressor in BC
The interpretation of these genes points towards important changes in genes known to be affected at breast cancer, and in addition some more general ones.
During the different laboratory steps, several decisions had to be taken on level of noise and the use of specific distribution of noise. Further, since a gene maybe not expressed in all individuals, the percentage of cases or controls with sufficient signals had to be decided. The stronger the criteria moving towards hundred percent, the harder the exclusion.
The strength of the study is the unique biobank created with the purpose of gene expression analysis in peripheral blood. This gave a unique opportunity to study the immune response since the mRNA in blood came from immune cells. This opened for the view that the carcinogenic process not only included exposures to carcinogens, but also has an important counterforce in the immune system. This has been known for more than a hundred years, and today documented through the new immune therapies.
The design has been population-based with a complete follow-up on cancer incidence, emigration and death based on linkage to national registers using the unique national birth number given to all residents in Norway from 1960. In addition, we had access to updated information on metastases and second breast cancers in the time between inclusion and blood donation. This somewhat reduced the noise from carcinogenic processes hidden at the time of diagnosis.

Conclusion

In this systems epidemiology approach, we have given a proof of concept for the use of gene expression as an individualized biomarker of prognosis related to death or not. The design of NOWAC is population-based and the results should be validated in a more specific clinical setting. With improved technology and individual repeated measurements gene expression followed over time could offer a unique opportunity for personalized treatment of metastatic breast cancer.

Disclaimer

Some of the data in this article are from the Cancer Registry of Norway. The Cancer Registry of Norway is not responsible for the analysis or interpretation of the data presented.
Microarray service was provided by the Genomics Core Facility, Norwegian University of Science and technology, and NMC—a national technology platform supported by the functional genomics program (FUGE) of the Research Council of Norway.

Acknowledgements

We are impressed by and thankful to the women who donated blood for this cancer research project. Bente Augdal, Merete Albertsen, and Knut Hansen were responsible for all infrastructure and administrative issues. This study was supported by a grant from the European Research Council (ERC-AdG 232997 TICE) and a donation from Halfdan Jacobsen og frues legat (The Norwegian Cancer Society)
The funders had no role in the design of the study; in the collection, analyses and interpretation of the data; in the writing of the manuscript; or in the decision to submit for publication.

Authors’ contributions

EL is PI of the NOWAC Study and initiated the methodological collaboration; LH and MH developed the statistical methods. KSO addressed the gene function. J-CT and L-TB added clinical information. All authors have participated in the discussions and have read and approved the final manuscript.

References

Anisimov, V.N., Ukraintseva, S.V., & Yashin, A.I. (2005). Cancer in rodents: does it tell us about cancer in humans? Nature Reviews. Cancer, 5(10), 807–819. https://doi.org/10.1038/nrc1715
Benjamini, Y. & Hochberg, Y. (1995). Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society. Series B (Methodological), 57, 289–300.
Cancer Registry of Norway. (2019). Cancer in Norway 2018 – Cancer incidence, mortality, survival and prevalence in Norway. Oslo, Norway: Cancer Registry of Norway. Retrieved from: https://www.kreftregisteret.no/globalassets/cancer-in-norway/2018/cin-2018.pdf
Cox, D.R. (1972). Regression models and life-tables. Journal of the Royal Statistical Society. Series B (Methodological), 34(2), 187–220.
Dumeaux, V., Børresen-Dale, A.L., Frantzen, J.O., Kumle, M., Kristensen, V.N., Lund, E. (2008). Gene expression analyses in breast cancer epidemiology: the Norwegian Women and Cancer postgenome cohort study. Breast Cancer Research, 10(1), R13. https://doi.org/10.1186/bcr1859
Gould, S.E., Junttila, M.R., & de Sauvage, F.J. (2015). Translational value of mouse models in oncology drug development. Nature Medicine, 21(5), 431–439. https://doi.org/10.1038/nm.3853
Holden, M., Holden, L., Olsen, K.S., & Lund, E. (2017). Local in Time Statistics for detecting weak gene expression signals in blood – illustrated for prediction of metastases in breast cancer in the NOWAC Post-genome Cohort. Advances in Genomics and Genetics, 7, 11–28. https://doi.org/10.2147/AGG.S130004
Jeibouei, S., Akbari, M.E., Kalbasi, A., Aref, A.R., Ajoudanian, M., Rezvani, A., Zali, H. (2019). Personalized medicine in breast cancer: pharmacogenomics approaches. Pharmacogenomics and Personalized Medicine,12, 59–73. https://doi.org/10.2147/PGPM.S167886
Lund, E., Dumeaux, V., Braaten, T., Hjartåker, A., Engeset, D., Skeie, G., Kumle, M. (2008). Cohort profile: The Norwegian Women and Cancer Study--NOWAC--Kvinner og kreft. International Journal of Epidemiology, 37(1), 36–41. https://doi.org/10.1093/ije/dym137
Lund, E., & Plancade, S. (2012). Transcriptional output in a prospective design conditionally on follow-up and exposure: the multistage model of cancer. International Journal of Molecular Epidemiology and Genetics, 3(2), 107–114.
Lund, E., Holden, L., Bøvelstad, H., Plancade, S., Mode, N., Günther, C.C., ... Holden, M. (2016). A new statistical method for curve group analysis of longitudinal gene expression data illustrated for breast cancer in the NOWAC postgenome cohort as a proof of principle. BMC Medical Research Methodology, 16, 28. https://doi.org/10.1186/s12874-016-0129-z
Mak, I.W., Evaniew, N., & Ghert, M. (2014). Lost in translation: animal models and clinical trials in cancer treatment. American Journal of Translational Research, 6(2), 114–118.
Mittempergher, L., Saghatchian, M., Wolf, D.M., Michiels, S., Canisius, S., Dessen, P., …van’t Veer, L.J. (2013). A gene signature for late distant metastasis in breast cancer identifies a potential mechanism of late recurrences. Molecular Oncology, 7(5), 987–999. https://doi.org/10.1016/j.molonc.2013.07.006
Norwegian Institute of Public Health (n.d.). Causes of death & Life expectancy. [Internet]. Accessed: 10.10.2019. Retrieved from: http://www.fhi.no/en/hn/cause-of-death-and-life-expectancy/
Of men, not mice [Editorial]. (2013). Nature Medicine, 19(4), 379. Retrieved from: https://www.nature.com/articles/nm.3163
Reddy, S.M., Barcenas, C.H., Sinha, A.K., Hsu, L., Moulder, S.L., Tripathy, D., … Valero, V. (2018). Long-term survival outcomes of triple-receptor negative breast cancer survivors who are disease free at 5 years and relationship with low hormone receptor positivity. British Journal of Cancer, 118(1), 17–23. https://doi.org/10.1038/bjc.2017.379
Reiner, A., Yekutieli, D., & Benjamini, Y. (2003). Identifying differentially expressed genes using false discovery rate controlling procedures. Bioinformatics (Oxford, England), 19(3), 368–375. https://doi.org/10.1093/bioinformatics/btf877
Spitz, M.R., & Bondy, M.L. (2010). The evolving discipline of molecular epidemiology of cancer. Carcinogenesis, 31(1), 127–134. https://doi.org/10.1093/carcin/bgp246
Stroncek, D.F., Butterfield, L.H., Cannarile, M.A., Dhodapkar, M.V., Greten, T.F., Grivel, J.C., … Seliger, B. (2017). Systematic evaluation of immune regulation and modulation. Journal for Immunotherapy of Cancer, 5, 21. https://doi.org/10.1186/s40425-017-0223-8
Vineis, P., Schatzkin, A., Potter, J.D. (2010). Models of carcinogenesis: an overview. Carcinogenesis, 31(10), 1703–1709. https://doi.org/10.1093/carcin/bgq087

Information & Authors

Information

Published In

Cover Image
Pages: 141162
Editor: Eiliv Lund [email protected], Department of Community Medicine, UiT The Arctic University of Norway, Tromsø, Norway
ISBN (Print): 978-82-15-04120-9
ISBN (Online): 978-82-15-04119-3

History

Published online: 8 December 2020
Publication date: 16 December 2020

Authors

Affiliations

Eiliv Lund
Department of Community Medicine, UiT The Arctic University of Norway, Tromsø, Norway; [email protected]
Marit Holden
Department of Statistical Analysis, Machine Learning and Image Analysis, Norwegian Computing Center, Oslo, Norway
Jean-Christophe Thalabard
MAP5, Universite Rene Descartes, Paris, France
Lill-Tove Rasmussen Busund
Institute for Medical Biology, UiT The Arctic University of Norway, Tromsø, Norway; [email protected]
Lars Holden
Norwegian Computing Center, Oslo, Norway

Metrics & Citations

Metrics

Citations

Export citation

Select the format you want to export the citations of this publication.

View Options

View options

PDF

Download PDF

Restore guest purchases

Enter your email address to restore your content access:

Note: This functionality works only for purchases done as a guest. If you already have an account, log in to access the content to which you are entitled.

Figures

Tables

Share

Share

Share the article link

Share on social media

Share on Messenger