Up-to-date information on which data is used can be found on the changelog page.

Data pre-processing

30 day time-window

Events for a given individual and a given phenocode are merged if they are less than or equal to 30 days apart. For example, if an individual has K11_APPENDACUT events on the following dates: 2000-01-01, 2000-01-20, 2000-02-10, 2000-02-28, then all of these events are merged into a single event dated 2000-01-01, because each event is at most 30 days after the previous one.

This is done as an attempt to remove events that are follow-ups rather than initial diagnoses.
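The merging rule can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the function name `merge_events` is hypothetical, and it assumes that gaps are measured between consecutive events, which is what the example above implies (2000-02-28 is more than 30 days after 2000-01-01, yet it is still merged).

```python
from datetime import date, timedelta

WINDOW = timedelta(days=30)

def merge_events(dates, window=WINDOW):
    """Collapse chains of events spaced <= window apart.

    Each chain is represented by its earliest date. Gaps are measured
    between consecutive events (an assumption drawn from the example).
    """
    merged = []
    previous = None
    for current in sorted(dates):
        # Start a new merged event only when the gap exceeds the window.
        if previous is None or current - previous > window:
            merged.append(current)
        previous = current
    return merged

events = [date(2000, 1, 1), date(2000, 1, 20), date(2000, 2, 10), date(2000, 2, 28)]
print(merge_events(events))  # [datetime.date(2000, 1, 1)]
```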


Unadjusted prevalence

Number of individuals having at least one event for a given phenocode, divided by the total number of individuals in the FinnGen study. No adjustment is made for the difference between the age distribution of the FinnGen cohort and that of the Finnish population.

Recurrence within 6 months

Number of individuals having two events for the given phenocode less than 6 months apart, divided by the number of individuals having at least one event for the given phenocode.
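As a rough sketch of this ratio, assuming events are available as a mapping from individual to event dates: the function name, the input layout, and the choice of 183 days for "6 months" are all assumptions made for illustration, since the text does not define them.

```python
from datetime import date, timedelta

# Assumption: "6 months" is approximated as 183 days.
SIX_MONTHS = timedelta(days=183)

def recurrence_within_6_months(events_by_individual, window=SIX_MONTHS):
    """Share of individuals with >= 1 event who have two events < window apart."""
    with_event = [dates for dates in events_by_individual.values() if dates]
    if not with_event:
        return None  # ratio undefined when nobody has the phenocode
    recurrent = 0
    for dates in with_event:
        ordered = sorted(dates)
        # If any two events are closer than the window, so is some consecutive pair.
        if any(later - earlier < window for earlier, later in zip(ordered, ordered[1:])):
            recurrent += 1
    return recurrent / len(with_event)

example = {
    "A": [date(2001, 1, 1), date(2001, 3, 1)],   # recurrence within the window
    "B": [date(2001, 1, 1), date(2002, 1, 1)],   # events too far apart
    "C": [date(2005, 5, 5)],                     # single event
    "D": [],                                     # no event: not in the denominator
}
ratio = recurrence_within_6_months(example)
```

In this toy input only individual A counts as recurrent, out of three individuals with at least one event.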

Case fatality at 5 years

Number of individuals who died less than 5 years after their first event for the given phenocode, divided by the number of individuals having at least one event for the given phenocode.
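A minimal sketch of this ratio is shown below. The record layout (first event date, death date or `None`), the function name, and the approximation of 5 years as 5 × 365.25 days are assumptions made for illustration, not details given by the source.

```python
from datetime import date, timedelta

# Assumption: 5 years is approximated as 5 * 365.25 days.
FIVE_YEARS = timedelta(days=5 * 365.25)

def case_fatality_at_5_years(records, horizon=FIVE_YEARS):
    """records: iterable of (first_event_date, death_date_or_None) pairs,
    one pair per individual having at least one event."""
    records = list(records)
    if not records:
        return None  # ratio undefined without any individual
    deaths = sum(
        1
        for first_event, death in records
        if death is not None and death - first_event < horizon
    )
    return deaths / len(records)

records = [
    (date(2000, 1, 1), date(2003, 6, 1)),   # died within 5 years
    (date(2000, 1, 1), date(2010, 1, 1)),   # died later than 5 years
    (date(2005, 1, 1), None),               # no recorded death
    (date(2010, 3, 1), date(2012, 3, 1)),   # died within 5 years
]
fatality = case_fatality_at_5_years(records)
```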

Survival analyses between phenocodes

Most of the study design follows that of the NB-COMO study.

Data pre-processing

  • Start of study: 1998-01-01
  • End of study: 2018-12-31
  • Prevalent cases removed from the study.
  • Ignore time before start of study for individuals having the prior-phenocode before the study starts.
  • Split time into unexposed and exposed periods.
  • Only consider endpoint pairs:
    • with at least 10 individuals for each cell of the contingency table of this endpoint pair.
    • with at least 25 individuals having the outcome endpoint.
    • where the ICD codes of both endpoints, as well as their parents, don't overlap.
    • where endpoints are not descendants of one another in the endpoint tree hierarchy.
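Two of the steps above, splitting follow-up time and applying the count thresholds, can be sketched as follows. This is an illustrative reading of the list, not the pipeline's code: the function names, the interval representation, and the layout of the contingency table are hypothetical.

```python
from datetime import date

STUDY_START = date(1998, 1, 1)
STUDY_END = date(2018, 12, 31)

def split_follow_up(prior_date, exit_date=STUDY_END, entry_date=STUDY_START):
    """Return (start, stop, exposed) intervals for one individual.

    prior_date is the date of the prior endpoint, or None if it never occurs.
    exit_date stands for whatever ends follow-up (an assumption of this sketch).
    """
    start = max(entry_date, STUDY_START)
    stop = min(exit_date, STUDY_END)
    if stop <= start:
        return []
    if prior_date is None or prior_date >= stop:
        return [(start, stop, False)]          # never exposed during follow-up
    if prior_date <= start:
        return [(start, stop, True)]           # time before study start is ignored
    return [(start, prior_date, False), (prior_date, stop, True)]

def passes_count_filters(table, min_cell=10, min_outcome=25):
    """table maps (has_prior, has_outcome) -> number of individuals."""
    cells = [table[(p, o)] for p in (False, True) for o in (False, True)]
    n_outcome = table[(False, True)] + table[(True, True)]
    return min(cells) >= min_cell and n_outcome >= min_outcome

intervals = split_follow_up(date(2005, 6, 15), exit_date=date(2010, 1, 1))
```

Here `intervals` holds one unexposed interval from the start of the study to 2005-06-15, followed by one exposed interval ending at 2010-01-01.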

Cox regression

The model used is: y ~ prior + birth_year + sex

If the endpoint is sex-specific, then the sex covariate is removed from the model.
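Assuming the standard Cox proportional hazards formulation, the model above can be written out as follows. The baseline hazard h_0(t) and the β symbols are notation introduced here for illustration; the hazard ratio associated with the prior endpoint is then exp(β_prior). For a sex-specific endpoint, the sex term is dropped.

```latex
h(t \mid x) = h_0(t)\,\exp\bigl(
    \beta_{\text{prior}}\,\text{prior}
  + \beta_{\text{birth}}\,\text{birth\_year}
  + \beta_{\text{sex}}\,\text{sex}
\bigr),
\qquad
\mathrm{HR}_{\text{prior}} = \exp(\beta_{\text{prior}})
```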

Lagged hazard ratios are computed by considering only up to 1, 5, and 15 years of exposed time.

The regressions are run using the lifelines library.


Due to the sensitive nature of the data, the age when entering and leaving the study has an accuracy of 1 year.

Drug Statistics

The drug score is computed in a 2-step process:

  1. Fit the data to the logistic model:
    y ~ sex + year-of-birth + year-of-birth^2 + year-at-endpoint + year-at-endpoint^2
  2. Use the fitted model to predict the probability for the following data:
    • sex = 0.5, assume an even number of females and males.
    • year-of-birth = 1960, the mean year of birth of the FinnGen cohort.
    • year-at-endpoint = 2018, predict the probability at the end of the study.

The resulting probability value is the drug score. The higher the drug score, the more likely the drug is to be taken after the given endpoint.
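The prediction step can be sketched as below, assuming the logistic model has already been fitted and includes an intercept. The coefficient names (`intercept`, `sex`, `yob`, `yob2`, `yae`, `yae2`) and the function names are hypothetical, and the coefficient values in the usage line are made up purely to show the call.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def drug_score(coef, sex=0.5, year_of_birth=1960, year_at_endpoint=2018):
    """Predicted probability at the reference values given in the text.

    coef: dict of fitted coefficients (hypothetical key names).
    """
    z = (
        coef["intercept"]
        + coef["sex"] * sex
        + coef["yob"] * year_of_birth
        + coef["yob2"] * year_of_birth ** 2
        + coef["yae"] * year_at_endpoint
        + coef["yae2"] * year_at_endpoint ** 2
    )
    return sigmoid(z)

# Made-up coefficients, for illustration only.
zero = {"intercept": 0.0, "sex": 0.0, "yob": 0.0, "yob2": 0.0, "yae": 0.0, "yae2": 0.0}
score = drug_score(zero)
```

With all coefficients at zero the linear predictor is 0, so the score is 0.5; real fitted coefficients would move it towards 0 or 1.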

Source code

Available on GitHub for both the data processing pipeline and the website.