Gender-based analysis of Breast Cancer - Part II
Posted on Thu 09 July 2020 in Data Science
In the previous blog post, I ventured into gender-based differences in breast cancer. In this post, I attempt to identify the key predictors of breast cancer survival in the dataset obtained after propensity score matching, as this will help us better understand the results from that post.
To understand the effect of multiple variables on survival, I used the Cox proportional hazards model. In other words, the model shows how each variable affects the rate at which an event (death, in this analysis) happens over a given time. I used scikit-survival [1][2][3] to perform this part of the analysis. A caveat of a small dataset such as this one (n = 3534) is the risk of running into multicollinearity. To avoid that, I checked the variance inflation factors (VIF) and removed the variables with a high VIF before proceeding with any analysis.
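As a rough illustration of the VIF check, here is a minimal sketch in plain Python. It relies on the identity that the VIF of each variable is the corresponding diagonal entry of the inverse of the correlation matrix. The column names, the toy values and the cut-off of 10 are assumptions made for the example, not the values used in this analysis.

```python
from math import sqrt


def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equal-length numeric columns.

    Assumes no column is constant (a constant column has zero variance).
    """
    n = len(columns[0])
    centered = []
    for col in columns:
        mean = sum(col) / n
        centered.append([v - mean for v in col])
    norms = [sqrt(sum(v * v for v in c)) for c in centered]
    k = len(columns)
    return [
        [
            sum(a * b for a, b in zip(centered[i], centered[j])) / (norms[i] * norms[j])
            for j in range(k)
        ]
        for i in range(k)
    ]


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    k = len(matrix)
    aug = [
        list(row) + [1.0 if i == j else 0.0 for j in range(k)]
        for i, row in enumerate(matrix)
    ]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("perfectly collinear columns: VIF is infinite")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(k):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]


def variance_inflation_factors(columns):
    """VIF_j = 1 / (1 - R_j^2), read off the diagonal of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[j][j] for j in range(len(columns))]


# Toy data with hypothetical column names; the first two columns are nearly collinear.
toy = {
    "stage_code": [1, 2, 3, 4, 5, 6],
    "tumor_size": [1, 2, 3, 4, 5, 7],
    "age": [60, 45, 70, 52, 66, 58],
}
THRESHOLD = 10  # assumed rule-of-thumb cut-off, not taken from this analysis
for name, vif in zip(toy, variance_inflation_factors(list(toy.values()))):
    flag = "drop candidate" if vif > THRESHOLD else "ok"
    print(f"{name}: VIF = {vif:.2f} ({flag})")
```

In practice a library routine such as `variance_inflation_factor` from statsmodels would normally be used; the sketch only shows what the number measures.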
First, I looked at the predictors of disease-specific survival (DSS), i.e. breast cancer-specific survival. After building a Cox model with the given variables (the same ones used to obtain the propensity score matched dataset), I checked its performance using Harrell's concordance index (c-index), which generalizes the area under the receiver operating characteristic (ROC) curve to censored survival data. The c-index for our model was over 0.86, well above the 0.5 expected from a random model, which is quite satisfactory.
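To make the c-index concrete, here is a minimal plain-Python sketch of Harrell's definition. A pair of patients is comparable when the one with the shorter observed time had the event, and the pair is concordant when that patient also has the higher predicted risk. For simplicity the sketch skips pairs with identical times, which is an assumption of this example; library implementations such as scikit-survival's `concordance_index_censored` treat ties more carefully. The toy values are made up.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's c-index: fraction of comparable pairs ordered correctly by risk.

    times       -- observed follow-up times
    events      -- 1/True if the event (death) was observed, 0/False if censored
    risk_scores -- model output, higher means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(i + 1, n):
            if times[i] == times[j]:
                continue  # simplification: ignore pairs with tied times
            # 'first' is the patient with the shorter observed time
            first, second = (i, j) if times[i] < times[j] else (j, i)
            if not events[first]:
                continue  # shorter time is censored, so the true order is unknown
            comparable += 1
            if risk_scores[first] > risk_scores[second]:
                concordant += 1.0
            elif risk_scores[first] == risk_scores[second]:
                concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy example: shorter survival always goes with a higher risk score.
times = [5, 3, 8, 2]
events = [1, 1, 0, 1]
risk = [0.3, 0.6, 0.1, 0.9]
print(concordance_index(times, events, risk))               # 1.0, perfect ranking
print(concordance_index(times, events, [0.5] * 4))          # 0.5, no information
print(concordance_index(times, events, [-r for r in risk])) # 0.0, reversed ranking
```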
Fitting a Cox model to each variable individually gives a score for how well that variable alone predicts risk. A score above 0.5 indicates predictive power; the higher the score, the higher the predictive power of the variable. As can be observed, gender, with a score of 0.49, has no predictive power, which is consistent with the absence of any difference in DSS (after propensity matching) in the previous blog post.
No. | Variable | Score |
---|---|---|
1. | SEER stage 4 | 0.7094 |
2. | AJCC (TNM) Stage 70 (IV) | 0.6879 |
3. | SEER stage 1 | 0.6678 |
4. | Surgery-yes | 0.6432 |
5. | AJCC (TNM) Stage 10 (I) | 0.6279 |
6. | Grade 3 | 0.6217 |
7. | AJCC (TNM) Stage 32 (IIA) | 0.6008 |
8. | Grade 2 | 0.5900 |
9. | Marital status-Married | 0.5876 |
10. | PR status - positive | 0.5720 |
11. | ER status - positive | 0.5572 |
12. | Age at diagnosis | 0.5523 |
13. | Primary site - 509 | 0.5496 |
14. | Race - Black | 0.5456 |
15. | Insurance - Insured | 0.5410 |
16. | Grade 1 | 0.5366 |
17. | Her2 - Positive | 0.5363 |
18. | Race - White | 0.5349 |
19. | AJCC (TNM) Stage 54 (IIIC) | 0.5277 |
20. | AJCC (TNM) Stage 53 (IIIB) | 0.5265 |
21. | Primary site - 504 | 0.5194 |
22. | Primary site - 501 | 0.5154 |
23. | Primary site - 505 | 0.5083 |
24. | AJCC (TNM) Stage 52 (IIIA) | 0.5072 |
25. | Primary site - 500 | 0.5024 |
26. | Primary site - 506 | 0.5020 |
27. | Primary site - 503 | 0.5002 |
28. | Primary site - 502 | 0.4992 |
29. | Gender | 0.4941 |
Key: Primary site - 500 = Nipple (areolar); 501 = Central portion (sub-areolar); 502 = Upper-inner quadrant; 503 = Lower-inner quadrant; 504 = Upper-outer quadrant; 505 = Lower-outer quadrant; 506 = Axillary tail of the breast; 509 = Entire breast, multiple tumors in different subsites, diffuse
Next, I used the Lifelines [4] CoxPHFitter, as it gives the hazard ratio (HR) for every variable along with its confidence interval (CI) and p-value. The hazard ratio is the ratio of the hazard in one group to the hazard in another; the higher the hazard, the higher the risk of the event (in this case, breast cancer-related death). As can be observed from the results, being married confers a protective advantage over being unmarried: the hazard ratio is 0.61, i.e. the hazard of married patients at time t divided by the hazard of unmarried patients at time t. Note: the marital status column is binary (0 or 1), and for the current analysis I have grouped single, divorced and widowed patients together as unmarried. Additionally, surgery, lower AJCC stages, and ER and PR expression confer a better prognosis; lower grades also have an HR below 1, although their confidence intervals include 1.
No. | Variable | HR | 95% CI | P-value |
---|---|---|---|---|
1. | Marital status-Married | 0.61 | 0.48 - 0.78 | <0.005 |
2. | Insurance - Yes | 1.06 | 0.77 - 1.46 | 0.72 |
3. | Race - White | 1.72 | 0.84 - 3.52 | 0.14 |
4. | Race - Black | 2.27 | 1.08 - 4.80 | 0.03 |
5. | Gender - Male | 1.02 | 0.82 - 1.28 | 0.85 |
6. | Primary site - 500 | 0.77 | 0.42 - 1.40 | 0.39 |
7. | Primary site - 501 | 0.93 | 0.66 - 1.31 | 0.66 |
8. | Primary site - 502 | 1.52 | 0.79 - 2.91 | 0.21 |
9. | Primary site - 503 | 1.39 | 0.49 - 3.92 | 0.54 |
10. | Primary site - 504 | 0.82 | 0.51 - 1.32 | 0.41 |
11. | Primary site - 505 | 1.10 | 0.53 - 2.26 | 0.80 |
12. | Primary site - 506 | 17.35 | 2.31 - 130.45 | 0.01 |
13. | Primary site - 509 | 1.01 | 0.69 - 1.49 | 0.95 |
14. | Grade 1 | 0.58 | 0.15 - 2.27 | 0.44 |
15. | Grade 2 | 0.63 | 0.18 - 2.19 | 0.47 |
16. | Grade 3 | 1.23 | 0.36 - 4.25 | 0.74 |
17. | SEER stage 1 | 1.21 | 0.69 - 2.11 | 0.51 |
18. | SEER stage 4 | 1.38 | 0.80 - 2.39 | 0.25 |
19. | Surgery - Yes | 0.45 | 0.32 - 0.63 | <0.005 |
20. | AJCC (TNM) Stage 10 (I) | 0.24 | 0.12 - 0.51 | <0.005 |
21. | AJCC (TNM) Stage 32 (IIA) | 0.47 | 0.27 - 0.81 | 0.01 |
22. | AJCC (TNM) Stage 52 (IIIA) | 1.98 | 1.24 - 3.17 | <0.005 |
23. | AJCC (TNM) Stage 53 (IIIB) | 2.20 | 1.35 - 3.56 | <0.005 |
24. | AJCC (TNM) Stage 54 (IIIC) | 2.49 | 1.49 - 4.17 | <0.005 |
25. | AJCC (TNM) Stage 70 (IV) | 6.05 | 3.09 - 11.85 | <0.005 |
26. | ER status - Positive | 0.42 | 0.25 - 0.70 | <0.005 |
27. | PR status - Positive | 0.64 | 0.44 - 0.95 | 0.03 |
28. | Her2 - Positive | 0.93 | 0.69 - 1.27 | 0.67 |
29. | Age at diagnosis | 1.02 | 1.01 - 1.03 | <0.005 |
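To show where an HR and its confidence interval come from, here is a minimal plain-Python sketch that fits a Cox model with a single binary covariate by Newton-Raphson on the partial likelihood (Breslow handling of tied times) and then exponentiates the coefficient. The function names and the toy data are hypothetical; the actual analysis used the Lifelines CoxPHFitter with all variables at once, so this is only an illustration of the calculation.

```python
from math import exp, sqrt


def fit_univariate_cox(times, events, x, max_iter=50, tol=1e-9):
    """Return (beta, standard_error) for a one-covariate Cox model.

    Maximizes the partial log-likelihood
        sum over events i of [beta * x_i - log(sum over j at risk of exp(beta * x_j))]
    where patient j is at risk at time t_i if t_j >= t_i.
    """
    beta = 0.0
    info = 0.0
    for _ in range(max_iter):
        grad = 0.0
        info = 0.0
        for t_i, e_i, x_i in zip(times, events, x):
            if not e_i:
                continue  # censored patients contribute only through risk sets
            s0 = s1 = s2 = 0.0
            for t_j, x_j in zip(times, x):
                if t_j >= t_i:
                    w = exp(beta * x_j)
                    s0 += w
                    s1 += w * x_j
                    s2 += w * x_j * x_j
            mean = s1 / s0
            grad += x_i - mean                 # first derivative
            info += s2 / s0 - mean * mean      # negative second derivative
        step = grad / info
        beta += step
        if abs(step) < tol:
            break
    # info was evaluated just before the final (negligible) step
    return beta, 1.0 / sqrt(info)


def hazard_ratio_with_ci(beta, se, z=1.96):
    """HR = exp(beta); the 95% CI is exp(beta -/+ 1.96 * se)."""
    return exp(beta), exp(beta - z * se), exp(beta + z * se)


# Toy data: 1 = married, 0 = unmarried (made-up values, not SEER records).
times = [9, 12, 7, 15, 4, 2, 3, 5, 8, 6]
events = [1, 0, 1, 1, 1, 1, 1, 1, 1, 1]
married = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

beta, se = fit_univariate_cox(times, events, married)
hr, lower, upper = hazard_ratio_with_ci(beta, se)
print(f"HR = {hr:.2f}, 95% CI = {lower:.2f} - {upper:.2f}")
```

An HR below 1 for the married column means a lower hazard for married patients relative to unmarried ones, which is how the rows in the table above should be read.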
Next, along similar lines, I checked the predictors of overall survival (OS). The Cox model built for OS has a lower c-index (0.79) than the model built for DSS. Here, age at diagnosis has the highest predictive power among the variables tested, and gender seems to have slight predictive power, unlike in the case of DSS.
No. | Variable | Score |
---|---|---|
1. | Age at diagnosis | 0.6372 |
2. | SEER stage 4 | 0.6144 |
3. | AJCC (TNM) Stage 70 (IV) | 0.6032 |
4. | SEER stage 1 | 0.6009 |
5. | Surgery-yes | 0.5966 |
6. | AJCC (TNM) Stage 10 (I) | 0.5896 |
7. | Marital status-Married | 0.5723 |
8. | Grade 3 | 0.5707 |
9. | Grade 2 | 0.5480 |
10. | AJCC (TNM) Stage 32 (IIA) | 0.5424 |
11. | PR status - positive | 0.5408 |
12. | Insurance - Insured | 0.5348 |
13. | Primary site - 509 | 0.5335 |
14. | AJCC (TNM) Stage 53 (IIIB) | 0.5334 |
15. | ER status - positive | 0.5308 |
16. | Race - Black | 0.5298 |
17. | Grade 1 | 0.5250 |
18. | Race - White | 0.5205 |
19. | Her2 - Positive | 0.5197 |
20. | Gender | 0.5184 |
21. | Primary site - 504 | 0.5152 |
22. | Primary site - 505 | 0.5104 |
23. | AJCC (TNM) Stage 54 (IIIC) | 0.5086 |
24. | Primary site - 500 | 0.5045 |
25. | AJCC (TNM) Stage 52 (IIIA) | 0.5027 |
26. | Primary site - 503 | 0.5025 |
27. | Primary site - 501 | 0.5022 |
28. | Primary site - 506 | 0.5009 |
29. | Primary site - 502 | 0.5003 |
The HRs and 95% confidence intervals obtained for OS are listed below. Here one can also observe that male gender is associated with a worse prognosis (HR 1.30), while being married, being insured, ER positivity and surgery are associated with a better prognosis for OS; PR positivity points the same way but narrowly misses significance (p = 0.06).
No. | Variable | HR | 95% CI | P-value |
---|---|---|---|---|
1. | Marital status-Married | 0.67 | 0.57 - 0.80 | <0.005 |
2. | Insurance - Yes | 0.77 | 0.61 - 0.97 | 0.02 |
3. | Race - White | 1.35 | 0.85 - 2.15 | 0.20 |
4. | Race - Black | 1.59 | 0.97 - 2.60 | 0.06 |
5. | Gender - Male | 1.30 | 1.11 - 1.52 | <0.005 |
6. | Primary site - 500 | 0.99 | 0.66 - 1.48 | 0.96 |
7. | Primary site - 501 | 1.04 | 0.81 - 1.34 | 0.75 |
8. | Primary site - 502 | 1.22 | 0.77 - 1.95 | 0.40 |
9. | Primary site - 503 | 1.02 | 0.47 - 2.22 | 0.96 |
10. | Primary site - 504 | 0.95 | 0.68 - 1.33 | 0.77 |
11. | Primary site - 505 | 0.96 | 0.57 - 1.63 | 0.89 |
12. | Primary site - 506 | 7.49 | 1.03 - 54.55 | 0.05 |
13. | Primary site - 509 | 1.19 | 0.89 - 1.59 | 0.24 |
14. | Grade 1 | 1.07 | 0.31 - 3.69 | 0.91 |
15. | Grade 2 | 1.06 | 0.32 - 3.51 | 0.93 |
16. | Grade 3 | 1.65 | 0.50 - 5.44 | 0.41 |
17. | SEER stage 1 | 1.05 | 0.79 - 1.40 | 0.72 |
18. | SEER stage 4 | 1.18 | 0.75 - 1.87 | 0.47 |
19. | Surgery - Yes | 0.42 | 0.32 - 0.54 | <0.005 |
20. | AJCC (TNM) Stage 10 (I) | 0.58 | 0.39 - 0.87 | 0.01 |
21. | AJCC (TNM) Stage 32 (IIA) | 0.87 | 0.64 - 1.19 | 0.39 |
22. | AJCC (TNM) Stage 52 (IIIA) | 1.57 | 1.10 - 2.22 | 0.01 |
23. | AJCC (TNM) Stage 53 (IIIB) | 1.89 | 1.35 - 2.65 | <0.005 |
24. | AJCC (TNM) Stage 54 (IIIC) | 1.63 | 1.08 - 2.45 | 0.02 |
25. | AJCC (TNM) Stage 70 (IV) | 3.61 | 2.09 - 6.25 | <0.005 |
26. | ER status - Positive | 0.52 | 0.34 - 0.79 | <0.005 |
27. | PR status - Positive | 0.76 | 0.57 - 1.01 | 0.06 |
28. | Her2 - Positive | 1.00 | 0.79 - 1.26 | 1.00 |
29. | Age at diagnosis | 1.05 | 1.04 - 1.06 | <0.005 |
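The HR for age looks small next to the binary variables because it applies per unit of age. Assuming age was entered in years as a continuous covariate, the Cox model implies that the HR for a difference of d years is the per-year HR raised to the power d. A small sketch of that rescaling (the helper name is hypothetical):

```python
def rescale_hazard_ratio(hr_per_unit, units):
    """HR for a difference of `units`, given the HR for a one-unit difference.

    Follows from the Cox model: exp(beta * units) = exp(beta) ** units.
    """
    return hr_per_unit ** units


# Age at diagnosis, overall survival: HR 1.05 per year (95% CI 1.04 - 1.06)
print(round(rescale_hazard_ratio(1.05, 10), 2))  # 1.63 per decade
print(round(rescale_hazard_ratio(1.04, 10), 2))  # 1.48, lower CI bound per decade
print(round(rescale_hazard_ratio(1.06, 10), 2))  # 1.79, upper CI bound per decade
```

Read this way, a ten-year difference in age at diagnosis corresponds to a larger change in hazard for OS than male gender (HR 1.30) in the same table.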
Some limitations of this entire analysis (Part I and Part II) are: (i) variables that were not available in the SEER dataset, such as chemotherapy, radiation therapy, hormone therapy and income, were not considered but can have an impact on prognosis; (ii) as this is a retrospective study, selection bias cannot be avoided. Despite this, SEER is a valuable resource for studying prognostic factors for cancers. Moreover, by using propensity score matching I have reduced the influence of the measured confounding factors and arrived at convincing results.
1. Pölsterl, S., Navab, N., and Katouzian, A., Fast Training of Support Vector Machines for Survival Analysis. Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2015, Porto, Portugal, Lecture Notes in Computer Science, vol. 9285, pp. 243-259 (2015)
2. Pölsterl, S., Navab, N., and Katouzian, A., An Efficient Training Algorithm for Kernel Survival Support Vector Machines. 4th Workshop on Machine Learning in Life Sciences, 23 September 2016, Riva del Garda, Italy
3. Pölsterl, S., Gupta, P., Wang, L., Conjeti, S., Katouzian, A., and Navab, N., Heterogeneous ensembles for predicting survival of metastatic, castrate-resistant prostate cancer patients. F1000Research, vol. 5, no. 2676 (2016).
4. Cameron Davidson-Pilon, Jonas Kalderstam, Noah Jacobson, sean-reed, Ben Kuhn, Paul Zivich, … Abraham Flaxman. (2020, July 9). CamDavidsonPilon/lifelines: v0.24.16 (Version v0.24.16). Zenodo. http://doi.org/10.5281/zenodo.3937749