← Home

Investigating Italian Population

AIM: make the best of data available from ISTAT.

In particular, I want to have:

  • the number of residents each year, by age (2D table: AGE x YEAR_OF_OBSERVATION)
  • compute the delta of residents each year, by age (2D table: AGE x YEAR_OF_OBSERVATION)
  • compute the Cohort Change Ratio (CCR) each year, by age (2D table: AGE x YEAR_OF_OBSERVATION)
In [1]:
import numpy as np
import pandas as pd
import requests
import matplotlib
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
from istatapi import discovery, retrieval
import warnings 

pio.renderers.default = 'vscode+notebook'
warnings.filterwarnings('ignore')
requests.urllib3.disable_warnings() # avoid "InsecureRequestWarning: Unverified HTTPS request is being made to host 'sdmx.istat.it'. Adding certificate verification is strongly advised"

def get_colors(n, cmap_name="rainbow"):
    """Get colors for px colors_discrete argument, given the number of colors needed, n."""
    cmap = matplotlib.colormaps[cmap_name]
    colors = [cmap(i) for i in np.linspace(0, 1, n)]  # Generate colors
    colors_str = [f"rgba({int(color[0]*250)}, {int(color[1]*250)}, {int(color[2]*250)}, 1.0)" for color in colors]
    return colors_str

I need to put together data from two datasets, about the italian population each year, by age.

The datasets are:

  • 2001-2019 (DCIS_RICPOPRES2011)
  • 2020-latest (DCIS_POPRES1) - Updated by ISTAT around early April each year

For further information about these datasets check Notebook#0

In [2]:
ds = discovery.DataSet(dataflow_identifier="DCIS_RICPOPRES2011") 
ds.set_filters(tipo_dato="JAN", itter107 ="IT", stacivx="99", sesso="9", cittadinanza="TOTAL")
df1 = retrieval.get_data(ds) # Takes about 30s
df1.loc[:, lambda dfx: (~dfx.isna()).any(axis=0)] # Show the table, excluding columns with all NaNs
Out[2]:
DATAFLOW FREQ SESSO TIPO_DATO CLASSE_ETA CITTADINANZA ITTER107 TIME_PERIOD OBS_VALUE
0 IT1:164_164(1.1) A 9 JAN TOTAL TOTAL IT 2001-01-01 56995744
1225 IT1:164_164(1.1) A 9 JAN Y69 TOTAL IT 2002-01-01 590165
199 IT1:164_164(1.1) A 9 JAN Y17 TOTAL IT 2002-01-01 586101
1207 IT1:164_164(1.1) A 9 JAN Y68 TOTAL IT 2002-01-01 607099
1189 IT1:164_164(1.1) A 9 JAN Y67 TOTAL IT 2002-01-01 620010
... ... ... ... ... ... ... ... ... ...
1224 IT1:164_164(1.1) A 9 JAN Y68 TOTAL IT 2019-01-01 701821
1242 IT1:164_164(1.1) A 9 JAN Y69 TOTAL IT 2019-01-01 702312
1260 IT1:164_164(1.1) A 9 JAN Y7 TOTAL IT 2019-01-01 537809
1728 IT1:164_164(1.1) A 9 JAN Y93 TOTAL IT 2019-01-01 92323
1836 IT1:164_164(1.1) A 9 JAN Y99 TOTAL IT 2019-01-01 8985

1837 rows × 9 columns

In [3]:
ds = discovery.DataSet(dataflow_identifier="DCIS_POPRES1") 
ds.set_filters(tipo_inddem="JAN", itter107 ="IT", stacivx="99", sesso="9")
df2 = retrieval.get_data(ds)
df2.loc[:, lambda dfx: (~dfx.isna()).any(axis=0)] # Show the table, excluding columns with all NaNs
Out[3]:
DATAFLOW FREQ ETA ITTER107 SESSO STACIVX TIPO_INDDEM TIME_PERIOD OBS_VALUE OBS_STATUS
0 IT1:22_289(1.5) A TOTAL IT 9 99 JAN 2019-01-01 59816673 NaN
238 IT1:22_289(1.5) A Y38 IT 9 99 JAN 2019-01-01 743989 NaN
245 IT1:22_289(1.5) A Y39 IT 9 99 JAN 2019-01-01 767068 NaN
252 IT1:22_289(1.5) A Y4 IT 9 99 JAN 2019-01-01 498240 NaN
259 IT1:22_289(1.5) A Y40 IT 9 99 JAN 2019-01-01 808091 NaN
... ... ... ... ... ... ... ... ... ... ...
440 IT1:22_289(1.5) A Y63 IT 9 99 JAN 2025-01-01 854967 e
447 IT1:22_289(1.5) A Y64 IT 9 99 JAN 2025-01-01 824815 e
454 IT1:22_289(1.5) A Y65 IT 9 99 JAN 2025-01-01 799565 e
307 IT1:22_289(1.5) A Y46 IT 9 99 JAN 2025-01-01 812504 e
713 IT1:22_289(1.5) A Y99 IT 9 99 JAN 2025-01-01 14528 e

714 rows × 10 columns

Population is split in different groups by age, where "0" means from 0 (newborns) to 9 years old, "10" means from 10 to 19 years old, and so on. The last group is "100" which means, 100 or more.

In [4]:
dfp = (
    pd.concat([
        df1.query("CLASSE_ETA!='TOTAL'")[["TIME_PERIOD", "CLASSE_ETA", "OBS_VALUE"]], 
        df2.rename(columns={'ETA':'CLASSE_ETA'}).query("CLASSE_ETA!='TOTAL'")[["TIME_PERIOD", "CLASSE_ETA", "OBS_VALUE"]]
    ])
    .replace("Y_GE100", "Y100") # Remember that 100 is 100+, converting for simplicity
    .query("CLASSE_ETA!='TOTAL'")
    .assign(age= lambda x: x["CLASSE_ETA"].str.split("Y").str[-1].astype(int))
    .assign(year= lambda x: x["TIME_PERIOD"].dt.year)
    [["year", "age", "OBS_VALUE"]]
    .drop_duplicates() # remove 2019 duplicates, in both datasets
    .sort_values(["age", "year"])
    .reset_index(drop=True)
    .pivot(index='age', columns='year', values='OBS_VALUE') # make an age x year table
)
display(dfp) # NOTE: "year" means "at the beginning (January 1st) of the year"
year 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 ... 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025
age
0 523007 535656 542323 558500 551127 556884 560700 570791 562322 553218 ... 482977 469956 455066 431007 414974 404956 400626 393920 380630 372541
1 529233 526577 538020 545596 559067 552870 560028 563021 569963 561694 ... 499707 482975 470412 455746 436785 423269 407634 404252 397193 382860
2 528131 533126 531043 540860 548487 560347 556146 562640 564317 569064 ... 510117 499100 482810 470813 457314 443571 425183 411247 407572 399781
3 518790 530847 537507 535545 543768 550808 563000 559193 564536 565219 ... 531804 508569 498644 482946 471696 464239 444638 428451 414581 410122
4 515957 522484 534064 542329 540131 545782 554754 565127 560899 565563 ... 539174 530684 507352 498240 483769 480475 464753 447374 431500 417424
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
96 16400 17254 17494 20055 21122 24346 25054 27290 28086 29780 ... 24270 37624 38707 39836 41447 40293 40802 40632 42037 44652
97 10981 11797 11980 12758 14071 15356 17559 18179 20203 21305 ... 13246 17960 27307 28842 29302 29016 29171 28420 29371 31846
98 6838 7643 7901 8505 8725 9888 10871 12349 12977 14740 ... 10142 9607 12695 19650 20560 20113 20528 19836 20214 21784
99 4866 4702 4971 5411 5682 5945 6643 7469 8547 9207 ... 8258 7145 6689 8985 13421 13635 13742 13403 13839 14528
100 6121 7093 7200 8095 8538 9122 9575 10158 11297 13106 ... 17698 16558 14984 14132 14804 17177 19714 20445 21211 23548

101 rows × 24 columns

In [5]:
# Pivot the tabel by year (of observation) vs age, and visualize it as a heatmap
(
    dfp
    .reset_index()
    .rename(columns={"index": "age"})
    .to_csv("../data/pop_by_age_year.csv", index=False)
)

fig = px.imshow(
    dfp, labels=dict(x="Year", y="Age", color="Population"), aspect="auto"
).update_layout(width=1000).show()
In [6]:
fig = px.line(
    dfp.reset_index().melt(id_vars=["age"], var_name="year", value_name="population"), # convert to long format
    x="age", 
    y="population", 
    title="Total population in Italy by age (TOTAL)",
    animation_frame="year",
    markers=False,
)
fig.update_layout(
    xaxis_title="Age",
    yaxis_title="Population by age group",
    yaxis_range=[0, 1e6],
    title=None,
    legend_title="Scenarios",
    margin=dict(l=10, r=10, t=10, b=10),
    width=780,
    height=420,
)
fig["layout"].pop("updatemenus") # optional, drop animation buttons
fig.write_html("../images_output/pop_by_age.html", auto_play=False)
print("Total population in Italy by age")
fig.show()
Total population in Italy by age

By sliding from 2002 to 2022 we can see that the population age is quite rigidly shifting: there is some sort of equilibrium between the different age groups, with the exception of the newborns, which are decreasing in number.

We can observe some characteristics of the population age distribution. Let's refer to 2002:

  • World War I was fought by italian soldiers between 1915 and 1918, the well we see between the ages of 82 and 85 (born in 1917-1920) could well reflect the precarious conditions of the italian population during and right after the war, leading to less births (or child deaths)
  • Same story with WWII: the well is between 56 and 60, reflecting the drop of births in 1942-1946
  • We could also consider adult soldiers who died in the wars: data is too scarce for capturing the effect of WWI but for WWII we can consider the estimated ~0.5M deaths among soldiers and civilians to be spread for the ages 75 to 87, i.e., considering people that at the time of the conflict were between 18 and 30 years old. We can not see a clear effect of this in the data, not as evident as the two wells highlighted in the previous points.
  • Baby boomers, have a clear peak in 1965-1970, leading to an aboundance of people that were between 32 and 37 years old in 2002.

Now I'll split the population in age groups of 5-years-wide. This is necessary for me to (1) use DCIS_DECESSI data (2) have larger groups / less noise.

Let's now see what it is the change of population agening each year:

  • e.g., compare the amount of people that were 80 in a certain year and that are 81 in the next year
  • this is callled the "Cohort Change Ratio" $ccr = func(age, year)$
  • we can expect that the more we age, the more negative is the percentage of those who live another year

NOTE: ccr(year) refers to the difference between January 1st of year, and January 1st of year+1.

In [7]:
dfpc = pd.DataFrame(columns=dfp.columns.tolist()[:-1], index=pd.Index(range(1, 100), name="age"))
for year in dfpc.columns:
    prev_year = dfp[year].shift(1).to_numpy()
    dfpc[year] = (dfp[year+1] - prev_year).dropna().astype(int)
    
dfpcr = pd.DataFrame(columns=dfp.columns.tolist()[:-1], index=pd.Index(range(1, 100), name="age"))
for year in dfpcr.columns:
    prev_year = dfp[year].shift(1).to_numpy()
    dfpcr[year] = (dfp[year+1] - prev_year) / prev_year

dfpc.to_csv("../data/cc_by_age_year.csv", index=True)
dfpcr.to_csv("../data/ccr_by_age_year.csv", index=True)

print("Cohort Change")
display(dfpc)
print("Cohort Change Ratio")
display(dfpcr)
Cohort Change
2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 ... 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024
age
1 3570 2364 3273 567 1743 3144 2321 -828 -628 -1498 ... -233 -2 456 680 5778 8295 2678 3626 3273 2230
2 3893 4466 2840 2891 1280 3276 2612 1296 -899 -751 ... -1331 -607 -165 401 1568 6786 1914 3613 3320 2588
3 2716 4381 4502 2908 2321 2653 3047 1896 902 -929 ... -1091 -1548 -456 136 883 6925 1067 3268 3334 2550
4 3694 3217 4822 4586 2014 3946 2127 1706 1027 405 ... -726 -1120 -1217 -404 823 8779 514 2736 3049 2843
5 3917 4275 3623 4957 3768 4144 3209 1284 1165 875 ... -633 -859 -968 -234 904 9549 -1791 2387 3066 2654
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
95 -7930 -9651 -8823 -10702 -10095 -11092 -11137 -10462 -10239 -8614 ... -15800 -14840 -15861 -15761 -15847 -18591 -17158 -19053 -17151 -16047
96 -6109 -6817 -6899 -7868 -8161 -8495 -9414 -8854 -8311 -8664 ... -8419 -11717 -12847 -12390 -13436 -15356 -14058 -15423 -14415 -12791
97 -4603 -5274 -4736 -5984 -5766 -6787 -6875 -7087 -6781 -6984 ... -5267 -6310 -10317 -9865 -10534 -12431 -11122 -12382 -11261 -10191
98 -3338 -3896 -3475 -4033 -4183 -4485 -5210 -5202 -5463 -5539 ... -4341 -3639 -5265 -7657 -8282 -9189 -8488 -9335 -8206 -7587
99 -2136 -2672 -2490 -2823 -2780 -3245 -3402 -3802 -3770 -3949 ... -4128 -2997 -2918 -3710 -6229 -6925 -6371 -7125 -5997 -5686

99 rows × 23 columns

Cohort Change Ratio
2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 ... 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024
age
1 0.006826 0.004413 0.006035 0.001015 0.003163 0.005646 0.004139 -0.001451 -0.001117 -0.002708 ... -0.000466 -0.000004 0.000970 0.001494 0.013406 0.019989 0.006613 0.009051 0.008309 0.005859
2 0.007356 0.008481 0.005279 0.005299 0.002290 0.005925 0.004664 0.002302 -0.001577 -0.001337 ... -0.002602 -0.001215 -0.000342 0.000852 0.003441 0.015536 0.004522 0.008863 0.008213 0.006516
3 0.005143 0.008218 0.008478 0.005377 0.004232 0.004735 0.005479 0.003370 0.001598 -0.001633 ... -0.002047 -0.003035 -0.000914 0.000282 0.001875 0.015143 0.002405 0.007686 0.008107 0.006257
4 0.007120 0.006060 0.008971 0.008563 0.003704 0.007164 0.003778 0.003051 0.001819 0.000717 ... -0.001345 -0.002106 -0.002393 -0.000810 0.001704 0.018612 0.001107 0.006153 0.007116 0.006858
5 0.007592 0.008182 0.006784 0.009140 0.006976 0.007593 0.005785 0.002272 0.002077 0.001547 ... -0.001139 -0.001593 -0.001824 -0.000461 0.001814 0.019739 -0.003728 0.005136 0.006853 0.006151
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
95 -0.245960 -0.263653 -0.233332 -0.247680 -0.231303 -0.232070 -0.231649 -0.215476 -0.209066 -0.213524 ... -0.242551 -0.223514 -0.232952 -0.223105 -0.221649 -0.253108 -0.234357 -0.252341 -0.229925 -0.207497
96 -0.261482 -0.280408 -0.255955 -0.271404 -0.251054 -0.253212 -0.256484 -0.239686 -0.218188 -0.223668 ... -0.257548 -0.237470 -0.249195 -0.237238 -0.244812 -0.275944 -0.256252 -0.275140 -0.255350 -0.222673
97 -0.280671 -0.305668 -0.270721 -0.298379 -0.272986 -0.278773 -0.274407 -0.259692 -0.241437 -0.234520 ... -0.284503 -0.259992 -0.274213 -0.254863 -0.264434 -0.299925 -0.276028 -0.303466 -0.277146 -0.242429
98 -0.303980 -0.330253 -0.290067 -0.316115 -0.297278 -0.292068 -0.296714 -0.286154 -0.270405 -0.259986 ... -0.299731 -0.274724 -0.293151 -0.280404 -0.287151 -0.313596 -0.292528 -0.320010 -0.288740 -0.258316
99 -0.312372 -0.349601 -0.315150 -0.331922 -0.318625 -0.328176 -0.312943 -0.307879 -0.290514 -0.267910 ... -0.333280 -0.295504 -0.303737 -0.292241 -0.316997 -0.336819 -0.316760 -0.347087 -0.302329 -0.281290

99 rows × 23 columns

In [8]:
fig = px.line(
    data_frame=dfpc,
    x=dfpc.index,
    y=dfpc.columns,
)
fig.update_layout(
    xaxis_title="Age",
    yaxis_title="Cohort Change",
    legend_title="Year",
    title=None,
    margin=dict(l=10, r=10, t=20, b=10),
    width=780,
    height=320,
)
print("Cohort Change, on each year of observation")
fig.write_html("../images_output/cohort_change.html")
fig.show()
Cohort Change, on each year of observation
In [9]:
fig = px.line(
    data_frame=dfpcr,
    x=dfpc.index,
    y=dfpc.columns,
)
fig.update_layout(
    xaxis_title="Age",
    yaxis_title="Cohort Change Ratio",
    yaxis_tickformat = ',.0%',
    legend_title="Year",
    title=None,
    margin=dict(l=10, r=10, t=20, b=10),
    width=780,
    height=320,
)
print("Cohort Change Ratio, on each year of observation")
fig.write_html("../images_output/cohort_change_ratio.html")
fig.show()
Cohort Change Ratio, on each year of observation

As expected we see that the decrease of people is very modest (<2%) till the age of 70.

Then visually, an 80 years old has a 3-5% change of not surviving the year, which increases to 13-17% when he is 90 years old.

Each year is shown separately, but we can not see a clear trend in the data with respect to the year of measurment, except for 2021, weighting the death toll of COVID-19:
here we can see that CCR dropped visually for 20-70 years old, but the drop is not that significant for 70+ elder people, where the noise due to the year is more significant than the drop due to the COVID-19.

The drop from Jan 2021 to Jan 2022 is also localized to 20-30 years old, becoming visibly indistinguishable for older people.

In [10]:
# get heatmap of the table in squared image
fig = px.imshow(
    dfpcr, labels=dict(x="Year of Observation", y="Age", color="%Pop.Growth"), aspect="auto"
    ).update_layout(width=1000)
fig.show()

Conclusions

  • Identified the trends of (de-)growth of the Italian population from 2002 to 2022, including the role of immigrants
  • Identified the distribution of population by age, and the agening of the baby boomers
  • Identified the Cohort Change Ratio (ccr)

Follow-up

  • Check if the reason of the ccr drop for elder people in 2003, 2005, 2012, 2015 is related to heatwaves the year before
  • Make a model of the ccr of population by age, to extrapolate the trends of future years
  • Check the evolution of the ratio between people in working and retirement age in the next years, testing different scenarios
← Home