Gender-based analysis of Breast Cancer - Part II

Posted on Thu 09 July 2020 in Data Science

In the previous blog post, I looked at gender-based differences in breast cancer. In this post, I try to identify the key predictors of breast cancer survival in the dataset obtained after propensity score matching in that post, as this helps in better understanding the results obtained there.

To understand the effect of multiple variables on survival, I used the Cox proportional hazards model. In other words, it estimates how each variable affects the rate at which an event (death, in this analysis) occurs over time. I used scikit-survival [1][2][3] for this part of the analysis. A caveat of a small dataset such as this one (size = 3534) is the risk of running into multicollinearity. To avoid that, I checked the variance inflation factors (VIF) and removed the variables with a high VIF before proceeding with any analysis.
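As a rough illustration of the VIF check (not the code used for this post), the sketch below computes the VIF of each column by regressing it on all the other columns and taking 1 / (1 - R²). The function name, the random toy matrix and the use of plain NumPy are assumptions made for the example. The post does not state which cutoff counted as a "high" VIF; values above 5 or 10 are a common rule of thumb.

```python
import numpy as np

def variance_inflation_factors(X):
    """Return the VIF of every column of X (rows = patients, columns = variables)."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for j in range(n_cols):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns plus an intercept.
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        residual = y - design @ coef
        ss_res = residual @ residual
        ss_tot = np.sum((y - y.mean()) ** 2)
        r_squared = 1.0 - ss_res / ss_tot
        vifs.append(np.inf if r_squared >= 1.0 else 1.0 / (1.0 - r_squared))
    return np.array(vifs)

# Toy data (hypothetical): column 3 is almost a copy of column 0.
rng = np.random.default_rng(0)
a, b, c = rng.normal(size=(3, 500))
d = a + 0.01 * rng.normal(size=500)
toy_vifs = variance_inflation_factors(np.column_stack([a, b, c, d]))
print(toy_vifs.round(2))
```

Columns 0 and 3 get a very large VIF because each is almost fully explained by the other, while the two independent columns stay close to 1. Dropping one of the collinear pair and recomputing is the usual next step.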

First, I looked at the predictors of disease-specific survival (DSS), that is, breast cancer specific survival. After building a Cox model using the given variables (the same ones used to obtain the propensity score matched dataset), I checked its performance using Harrell's concordance index (c-index), which generalizes the area under the receiver operating characteristic (ROC) curve to censored survival data. The c-index for our model was over 0.86, well above the 0.5 expected from a random model, which is quite satisfactory.
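To make the c-index more concrete, here is a minimal pure-Python sketch of Harrell's definition. It is a simplified version for illustration: pairs with tied survival times are skipped, which library implementations handle more carefully. The function name and the four-patient toy data are made up for the example.

```python
def harrell_c_index(risk, time, event):
    """Fraction of comparable pairs in which the patient who died first
    was also given the higher predicted risk (risk ties count as 0.5)."""
    concordant = 0.0
    comparable = 0
    n = len(time)
    for i in range(n):
        if not event[i]:
            continue  # a censored patient cannot be the "died first" member
        for j in range(n):
            if time[i] < time[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable

# Toy data (hypothetical): the third patient is censored at time 3.
toy_time = [1, 2, 3, 4]
toy_event = [1, 1, 0, 1]
toy_risk = [0.9, 0.4, 0.5, 0.1]
toy_c = harrell_c_index(toy_risk, toy_time, toy_event)
print(toy_c)
```

Of the five comparable pairs in the toy data, four are ordered correctly, so the function returns 0.8. A model that ranks patients at random would land near 0.5, and a perfect ranking gives 1.0.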

Fitting a Cox model to each variable individually gives a score for that variable's ability to predict risk. A score above 0.5 indicates predictive power: the higher the score, the higher the predictive power of the variable. As can be observed, gender, with a score of 0.49, has no predictive power, which is consistent with the absence of any difference in DSS (after propensity matching) in the previous blog post.

No. Variable Score
1. SEER stage 4 0.7094
2. AJCC (TNM) Stage 70 (IV) 0.6879
3. SEER stage 1 0.6678
4. Surgery-yes 0.6432
5. AJCC (TNM) Stage 10 (I) 0.6279
6. Grade 3 0.6217
7. AJCC (TNM) Stage 32 (IIA) 0.6008
8. Grade 2 0.5900
9. Marital status-Married 0.5876
10. PR status - positive 0.5720
11. ER status - positive 0.5572
12. Age at diagnosis 0.5523
13. Primary site - 509 0.5496
14. Race - Black 0.5456
15. Insurance - Insured 0.5410
16. Grade 1 0.5366
17. Her2 - Positive 0.5363
18. Race - White 0.5349
19. AJCC (TNM) Stage 54 (IIIC) 0.5277
20. AJCC (TNM) Stage 53 (IIIB) 0.5265
21. Primary site - 504 0.5194
22. Primary site - 501 0.5154
23. Primary site - 505 0.5083
24. AJCC (TNM) Stage 52 (IIIA) 0.5072
25. Primary site - 500 0.5024
26. Primary site - 506 0.5020
27. Primary site - 503 0.5002
28. Primary site - 502 0.4992
29. Gender 0.4941

Key: Primary site - 500 = Nipple (areolar); 501 = Central portion (sub-areolar); 502 = Upper-inner quadrant; 503 = Lower-inner quadrant; 504 = Upper-outer quadrant; 505 = Lower-outer quadrant; 506 = Axillary tail of the breast; 509 = Entire breast, multiple tumors in different subsites, diffuse
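The per-variable scores above can be sketched as follows, assuming the score is the c-index of a Cox model fitted to that single variable (the post does not spell this out, but it matches the 0.5 baseline). This is a from-scratch NumPy illustration, not scikit-survival's implementation: the Newton-Raphson fit, the function names and the six-patient toy data are all assumptions made for the example.

```python
import numpy as np

def fit_univariate_cox(x, time, event, n_iter=25, tol=1e-8):
    """Fit a one-variable Cox model by Newton-Raphson on the partial likelihood."""
    x = np.asarray(x, dtype=float)
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    # at_risk[i, j] is 1 when patient j is still under observation at patient i's time.
    at_risk = (time[None, :] >= time[:, None]).astype(float)
    beta = 0.0
    for _ in range(n_iter):
        w = np.exp(beta * x)
        s0 = at_risk @ w
        mean_x = (at_risk @ (w * x)) / s0
        mean_x2 = (at_risk @ (w * x * x)) / s0
        grad = np.sum((x - mean_x)[event])
        info = np.sum((mean_x2 - mean_x ** 2)[event])
        if info <= 0:
            break  # the variable is constant within every risk set
        step = grad / info
        beta += step
        if abs(step) < tol:
            break
    return float(beta)

def concordance_index(risk, time, event):
    """Harrell's c-index; pairs with tied times are skipped for simplicity."""
    risk = np.asarray(risk, dtype=float)
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    comparable = event[:, None] & (time[:, None] < time[None, :])
    higher = risk[:, None] > risk[None, :]
    tied = risk[:, None] == risk[None, :]
    hits = np.sum(comparable & higher) + 0.5 * np.sum(comparable & tied)
    return float(hits / np.sum(comparable))

def score_single_variable(x, time, event):
    """Return (coefficient, c-index) of a Cox model that uses only x."""
    beta = fit_univariate_cox(x, time, event)
    return beta, concordance_index(beta * np.asarray(x, dtype=float), time, event)

# Toy data (hypothetical): a binary variable that is mostly 1 among early deaths.
toy_time = [1, 2, 3, 4, 5, 6]
toy_event = [1, 1, 1, 1, 1, 1]
toy_x = [1, 1, 0, 1, 0, 0]
toy_beta, toy_score = score_single_variable(toy_x, toy_time, toy_event)
print(round(toy_beta, 2), round(toy_score, 3))
```

Running this on every column in turn and sorting by the score would give a ranking like the table above. Because each model sees only one variable, the scores say nothing about how variables behave once the others are adjusted for.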

Next, I used the Lifelines [4] CoxPHFitter, as it gives the hazard ratio (HR) for every variable along with its confidence interval (CI) and p-value. The hazard ratio is the ratio of the hazard in one group to the hazard in another. The higher the hazard, the higher the risk of the event (in this case, breast cancer related death). As can be observed from the results, being married gives a protective advantage over being unmarried: the hazard ratio is 0.61, which is the hazard of married patients at time t divided by the hazard of unmarried patients at time t. Note: the marital status column is binary (0 or 1), and for the current analysis I have grouped single, divorced and widowed patients together as unmarried. Additionally, surgery, lower AJCC stages, and ER and PR expression confer a better prognosis. Lower grades also have hazard ratios below 1, although their confidence intervals include 1.

No. Variable HR 95% CI P-value
1. Marital status-Married 0.61 0.48 - 0.78 <0.005
2. Insurance - Yes 1.06 0.77 - 1.46 0.72
3. Race - White 1.72 0.84 - 3.52 0.14
4. Race - Black 2.27 1.08 - 4.80 0.03
5. Gender - Male 1.02 0.82 - 1.28 0.85
6. Primary site - 500 0.77 0.42 - 1.40 0.39
7. Primary site - 501 0.93 0.66 - 1.31 0.66
8. Primary site - 502 1.52 0.79 - 2.91 0.21
9. Primary site - 503 1.39 0.49 - 3.92 0.54
10. Primary site - 504 0.82 0.51 - 1.32 0.41
11. Primary site - 505 1.10 0.53 - 2.26 0.80
12. Primary site - 506 17.35 2.31 - 130.45 0.01
13. Primary site - 509 1.01 0.69 - 1.49 0.95
14. Grade 1 0.58 0.15 - 2.27 0.44
15. Grade 2 0.63 0.18 - 2.19 0.47
16. Grade 3 1.23 0.36 - 4.25 0.74
17. SEER stage 1 1.21 0.69 - 2.11 0.51
18. SEER stage 4 1.38 0.80 - 2.39 0.25
19. Surgery - Yes 0.45 0.32 - 0.63 <0.005
20. AJCC (TNM) Stage 10 (I) 0.24 0.12 - 0.51 <0.005
21. AJCC (TNM) Stage 32 (IIA) 0.47 0.27 - 0.81 0.01
22. AJCC (TNM) Stage 52 (IIIA) 1.98 1.24 - 3.17 <0.005
23. AJCC (TNM) Stage 53 (IIIB) 2.20 1.35 - 3.56 <0.005
24. AJCC (TNM) Stage 54 (IIIC) 2.49 1.49 - 4.17 <0.005
25. AJCC (TNM) Stage 70 (IV) 6.05 3.09 - 11.85 <0.005
26. ER status - Positive 0.42 0.25 - 0.70 <0.005
27. PR status - Positive 0.64 0.44 - 0.95 0.03
28. Her2 - Positive 0.93 0.69 - 1.27 0.67
29. Age at diagnosis 1.02 1.01 - 1.03 <0.005
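For readers wondering how the HR and CI columns relate to the fitted model: a Cox model estimates a coefficient and a standard error per variable, the HR is the exponential of the coefficient, and the 95% CI is the exponential of the coefficient plus or minus 1.96 standard errors. The sketch below shows that arithmetic for the marital status row. The standard error of 0.124 is not reported in the post; it is back-calculated from the reported CI and should be read as an assumption, as should the function name.

```python
import math

def hazard_ratio_with_ci(coef, se, z=1.96):
    """Convert a Cox coefficient and its standard error into (HR, lower, upper)."""
    return math.exp(coef), math.exp(coef - z * se), math.exp(coef + z * se)

# Marital status-Married (DSS): coefficient recovered from the reported HR of 0.61.
coef_married = math.log(0.61)
se_married = 0.124  # assumption: back-calculated from the reported 0.48 - 0.78 interval
hr, lower, upper = hazard_ratio_with_ci(coef_married, se_married)
print(f"HR = {hr:.2f}, 95% CI {lower:.2f} - {upper:.2f}")
```

Because the interval is symmetric on the log scale rather than on the HR scale, rows with a large standard error, such as Primary site - 506, end up with a very wide and lopsided CI.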

Next, along similar lines, I checked the predictors of overall survival (OS). The Cox model built for OS has a lower c-index (0.79) than the model built for DSS. Here, age at diagnosis has the highest predictive power among the variables tested, and gender seems to have slight predictive power (score 0.52), unlike in DSS.

No. Variable Score
1. Age at diagnosis 0.6372
2. SEER stage 4 0.6144
3. AJCC (TNM) Stage 70 (IV) 0.6032
4. SEER stage 1 0.6009
5. Surgery-yes 0.5966
6. AJCC (TNM) Stage 10 (I) 0.5896
7. Marital status-Married 0.5723
8. Grade 3 0.5707
9. Grade 2 0.5480
10. AJCC (TNM) Stage 32 (IIA) 0.5424
11. PR status - positive 0.5408
12. Insurance - Insured 0.5348
13. Primary site - 509 0.5335
14. AJCC (TNM) Stage 53 (IIIB) 0.5334
15. ER status - positive 0.5308
16. Race - Black 0.5298
17. Grade 1 0.5250
18. Race - White 0.5205
19. Her2 - Positive 0.5197
20. Gender 0.5184
21. Primary site - 504 0.5152
22. Primary site - 505 0.5104
23. AJCC (TNM) Stage 54 (IIIC) 0.5086
24. Primary site - 500 0.5045
25. AJCC (TNM) Stage 52 (IIIA) 0.5027
26. Primary site - 503 0.5025
27. Primary site - 501 0.5022
28. Primary site - 506 0.5009
29. Primary site - 502 0.5003

The HR and 95% confidence interval obtained for each variable for OS are listed below. Here, male gender is associated with a worse prognosis (HR 1.30), while being married, being insured, ER positivity and surgery are associated with a better prognosis for OS. PR positivity points the same way (HR 0.76), although its confidence interval just includes 1.

No. Variable HR 95% CI P-value
1. Marital status-Married 0.67 0.57 - 0.80 <0.005
2. Insurance - Yes 0.77 0.61 - 0.97 0.02
3. Race - White 1.35 0.85 - 2.15 0.20
4. Race - Black 1.59 0.97 - 2.60 0.06
5. Gender - Male 1.30 1.11 - 1.52 <0.005
6. Primary site - 500 0.99 0.66 - 1.48 0.96
7. Primary site - 501 1.04 0.81 - 1.34 0.75
8. Primary site - 502 1.22 0.77 - 1.95 0.40
9. Primary site - 503 1.02 0.47 - 2.22 0.96
10. Primary site - 504 0.95 0.68 - 1.33 0.77
11. Primary site - 505 0.96 0.57 - 1.63 0.89
12. Primary site - 506 7.49 1.03 - 54.55 0.05
13. Primary site - 509 1.19 0.89 - 1.59 0.24
14. Grade 1 1.07 0.31 - 3.69 0.91
15. Grade 2 1.06 0.32 - 3.51 0.93
16. Grade 3 1.65 0.50 - 5.44 0.41
17. SEER stage 1 1.05 0.79 - 1.40 0.72
18. SEER stage 4 1.18 0.75 - 1.87 0.47
19. Surgery - Yes 0.42 0.32 - 0.54 <0.005
20. AJCC (TNM) Stage 10 (I) 0.58 0.39 - 0.87 0.01
21. AJCC (TNM) Stage 32 (IIA) 0.87 0.64 - 1.19 0.39
22. AJCC (TNM) Stage 52 (IIIA) 1.57 1.10 - 2.22 0.01
23. AJCC (TNM) Stage 53 (IIIB) 1.89 1.35 - 2.65 <0.005
24. AJCC (TNM) Stage 54 (IIIC) 1.63 1.08 - 2.45 0.02
25. AJCC (TNM) Stage 70 (IV) 3.61 2.09 - 6.25 <0.005
26. ER status - Positive 0.52 0.34 - 0.79 <0.005
27. PR status - Positive 0.76 0.57 - 1.01 0.06
28. Her2 - Positive 1.00 0.79 - 1.26 1.00
29. Age at diagnosis 1.05 1.04 - 1.06 <0.005
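One simple way to read a table like this is to check whether each 95% CI lies entirely below 1, entirely above 1, or straddles 1. The short sketch below does that for a handful of rows copied from the OS table above. The labels and the function name are my own choices for the example, not part of the Lifelines output.

```python
# (variable, HR, CI lower, CI upper) copied from the OS table above.
os_rows = [
    ("Marital status-Married", 0.67, 0.57, 0.80),
    ("Insurance - Yes", 0.77, 0.61, 0.97),
    ("Race - Black", 1.59, 0.97, 2.60),
    ("Gender - Male", 1.30, 1.11, 1.52),
    ("PR status - Positive", 0.76, 0.57, 1.01),
    ("Her2 - Positive", 1.00, 0.79, 1.26),
]

def classify_interval(lower, upper):
    """Label a hazard ratio by where its confidence interval sits relative to 1."""
    if upper < 1:
        return "lower hazard"
    if lower > 1:
        return "higher hazard"
    return "inconclusive"

os_labels = {name: classify_interval(lower, upper) for name, _, lower, upper in os_rows}
for name, label in os_labels.items():
    print(f"{name}: {label}")
```

On these rows, being married and being insured come out as lower hazard, male gender as higher hazard, and Race - Black, PR status and Her2 as inconclusive, which matches the p-values in the table.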

Some limitations of this entire analysis (Part I and Part II) are: (i) variables that were not available in the SEER dataset, such as chemotherapy, radiation therapy, hormone therapy and income, were not considered, although they can have an impact on prognosis; (ii) it is a retrospective study, so selection bias cannot be avoided. Despite this, SEER is a valuable resource for studying prognostic factors for cancers. Moreover, propensity score matching reduced the influence of the measured confounding factors, which makes the results more convincing.


  1. Pölsterl, S., Navab, N., and Katouzian, A., Fast Training of Support Vector Machines for Survival Analysis. Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2015, Porto, Portugal, Lecture Notes in Computer Science, vol. 9285, pp. 243-259 (2015) 

  2. Pölsterl, S., Navab, N., and Katouzian, A., An Efficient Training Algorithm for Kernel Survival Support Vector Machines. 4th Workshop on Machine Learning in Life Sciences, 23 September 2016, Riva del Garda, Italy 

  3. Pölsterl, S., Gupta, P., Wang, L., Conjeti, S., Katouzian, A., and Navab, N., Heterogeneous ensembles for predicting survival of metastatic, castrate-resistant prostate cancer patients. F1000Research, vol. 5, no. 2676 (2016). 

  4. Cameron Davidson-Pilon, Jonas Kalderstam, Noah Jacobson, sean-reed, Ben Kuhn, Paul Zivich, … Abraham Flaxman. (2020, July 9). CamDavidsonPilon/lifelines: v0.24.16 (Version v0.24.16). Zenodo. http://doi.org/10.5281/zenodo.3937749