Investigating Italian Population
AIM: make the most of the data available from ISTAT.
In particular, I want to have:
- the number of residents each year, by age (2D table: AGE x YEAR_OF_OBSERVATION)
- the change (delta) in residents each year, by age (2D table: AGE x YEAR_OF_OBSERVATION)
- the Cohort Change Ratio (CCR) each year, by age (2D table: AGE x YEAR_OF_OBSERVATION)
import numpy as np
import pandas as pd
import requests
import matplotlib
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
from istatapi import discovery, retrieval
import warnings
pio.renderers.default = 'vscode+notebook'
warnings.filterwarnings('ignore')
requests.urllib3.disable_warnings() # avoid "InsecureRequestWarning: Unverified HTTPS request is being made to host 'sdmx.istat.it'. Adding certificate verification is strongly advised"
def get_colors(n, cmap_name="rainbow"):
    """Get colors for a px color_discrete_sequence argument, given the number of colors needed, n."""
    cmap = matplotlib.colormaps[cmap_name]
    colors = [cmap(i) for i in np.linspace(0, 1, n)] # Generate n evenly spaced RGBA tuples, channels in [0, 1]
    # Scale each channel to the 0-255 range used by rgba() strings
    colors_str = [f"rgba({int(color[0]*255)}, {int(color[1]*255)}, {int(color[2]*255)}, 1.0)" for color in colors]
    return colors_str
I need to put together data from two datasets about the Italian population each year, by age.
The datasets are:
- 2001-2019 (DCIS_RICPOPRES2011)
- 2020-latest (DCIS_POPRES1), updated by ISTAT around early April each year
For further information about these datasets check Notebook#0
ds = discovery.DataSet(dataflow_identifier="DCIS_RICPOPRES2011")
ds.set_filters(tipo_dato="JAN", itter107 ="IT", stacivx="99", sesso="9", cittadinanza="TOTAL")
df1 = retrieval.get_data(ds) # Takes about 30s
df1.loc[:, lambda dfx: (~dfx.isna()).any(axis=0)] # Show the table, excluding columns with all NaNs
ds = discovery.DataSet(dataflow_identifier="DCIS_POPRES1")
ds.set_filters(tipo_inddem="JAN", itter107 ="IT", stacivx="99", sesso="9")
df2 = retrieval.get_data(ds)
df2.loc[:, lambda dfx: (~dfx.isna()).any(axis=0)] # Show the table, excluding columns with all NaNs
Population is split by single year of age: "0" means newborns (less than one year old), "10" means 10 years old, and so on. The last group is "100", which means 100 or more.
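As a minimal sketch of how the age codes can be turned into integers: the codes below are toy inputs assumed to follow the ISTAT format ("Y0", "Y42", "Y_GE100"), and parse_age_code is a hypothetical helper, not part of istatapi.

```python
import pandas as pd

def parse_age_code(codes: pd.Series) -> pd.Series:
    """Convert age codes such as 'Y0', 'Y42', 'Y_GE100' to integers."""
    # Map the open-ended top group to a plain code first, then drop the leading 'Y'
    return codes.replace("Y_GE100", "Y100").str[1:].astype(int)

# Toy input, not real ISTAT output
sample = pd.Series(["Y0", "Y9", "Y42", "Y_GE100"])
ages = parse_age_code(sample)
print(ages.tolist())
```

Keeping "100" as an integer makes sorting and pivoting simpler, at the cost of hiding that it is an open-ended group.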
dfp = (
    pd.concat([
        df1.query("CLASSE_ETA!='TOTAL'")[["TIME_PERIOD", "CLASSE_ETA", "OBS_VALUE"]],
        df2.rename(columns={'ETA':'CLASSE_ETA'}).query("CLASSE_ETA!='TOTAL'")[["TIME_PERIOD", "CLASSE_ETA", "OBS_VALUE"]]
    ])
    .replace("Y_GE100", "Y100") # Remember that 100 is 100+, converting for simplicity
    .assign(age= lambda x: x["CLASSE_ETA"].str.split("Y").str[-1].astype(int))
    .assign(year= lambda x: x["TIME_PERIOD"].dt.year)
    [["year", "age", "OBS_VALUE"]]
    .drop_duplicates() # remove 2019 duplicates, in both datasets
    .sort_values(["age", "year"])
    .reset_index(drop=True)
    .pivot(index='age', columns='year', values='OBS_VALUE') # make an age x year table
)
display(dfp) # NOTE: "year" means "at the beginning (January 1st) of the year"
# Save the age x year table to CSV, then visualize it as a heatmap
(
    dfp
    .reset_index() # "age" becomes a regular column
    .to_csv("../data/pop_by_age_year.csv", index=False)
)
fig = px.imshow(
    dfp, labels=dict(x="Year", y="Age", color="Population"), aspect="auto"
).update_layout(width=1000)
fig.show() # show() returns None, so it is kept out of the assignment
fig = px.line(
    dfp.reset_index().melt(id_vars=["age"], var_name="year", value_name="population"), # convert to long format
    x="age",
    y="population",
    title="Total population in Italy by age (TOTAL)",
    animation_frame="year",
    markers=False,
)
fig.update_layout(
    xaxis_title="Age",
    yaxis_title="Population by age group",
    yaxis_range=[0, 1e6],
    title=None,
    legend_title="Scenarios",
    margin=dict(l=10, r=10, t=10, b=10),
    width=780,
    height=420,
)
fig["layout"].pop("updatemenus") # optional, drop animation buttons
fig.write_html("../images_output/pop_by_age.html", auto_play=False)
print("Total population in Italy by age")
fig.show()
Sliding from 2002 to 2022, we can see that the age distribution shifts almost rigidly: there is a sort of equilibrium between the different age groups, with the exception of newborns, who are decreasing in number.
We can observe some characteristics of the population age distribution. Let's refer to 2002:
- World War I was fought by Italian soldiers between 1915 and 1918. The dip we see between the ages of 82 and 85 (born in 1917-1920) could well reflect the precarious conditions of the Italian population during and right after the war, leading to fewer births (or more child deaths).
- Same story with WWII: the dip is between the ages of 56 and 60, reflecting the drop in births in 1942-1946.
- We could also consider adult soldiers who died in the wars: data is too scarce to capture the effect of WWI, but for WWII we can consider the estimated ~0.5M deaths among soldiers and civilians to be spread over the ages 75 to 87, i.e., people who were between 18 and 30 years old at the time of the conflict. We cannot see a clear effect of this in the data, at least not as evident as the two dips highlighted in the previous points.
- Baby boomers have a clear peak in 1965-1970, leading to an abundance of people who were between 32 and 37 years old in 2002.
Now I'll split the population into 5-year-wide age groups. This is necessary for me to (1) use DCIS_DECESSI data and (2) have larger groups, hence less noise.
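One possible way to build such groups is sketched below on a toy table rather than on dfp. The helper group_ages is hypothetical, the numbers are made up, and labelling each bin by its lower bound (0, 5, 10, ...) is an assumption, not necessarily the convention used by DCIS_DECESSI.

```python
import pandas as pd

# Toy age x year table (hypothetical numbers, not ISTAT data)
toy = pd.DataFrame(
    {2001: [10, 11, 12, 13, 14, 15, 16], 2002: [9, 10, 11, 12, 13, 14, 15]},
    index=pd.Index(range(7), name="age"),
)

def group_ages(table: pd.DataFrame, width: int = 5) -> pd.DataFrame:
    """Sum single-year ages into bins labelled by their lower bound."""
    # Integer division maps ages 0-4 to 0, 5-9 to 5, and so on
    bins = (table.index.to_numpy() // width) * width
    return table.groupby(bins).sum().rename_axis("age_group")

grouped = group_ages(toy)
print(grouped)
```

Note that with this binning the open-ended "100" group ends up in its own bin, which may need to be merged with the one below it.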
Let's now see how each cohort changes as it ages by one year:
- e.g., compare the number of people who were 80 in a certain year with the number who are 81 in the next year
- this is called the "Cohort Change Ratio": $ccr(age, year) = \frac{P(age, year+1) - P(age-1, year)}{P(age-1, year)}$, where $P$ is the number of residents
- we can expect the ratio to become more negative with age, since a smaller share of each cohort lives another year
NOTE: ccr(year) refers to the difference between January 1st of year, and January 1st of year+1.
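The same quantity can be sketched without an explicit loop, here on a toy table with made-up numbers. The helper cohort_change_ratio is hypothetical, and the sketch assumes that rows are consecutive ages and columns are consecutive years.

```python
import pandas as pd

# Toy age x year table (hypothetical numbers): rows are ages, columns are years
toy = pd.DataFrame(
    {2001: [100, 98, 95], 2002: [90, 99, 96], 2003: [80, 89, 97]},
    index=pd.Index([0, 1, 2], name="age"),
)

def cohort_change_ratio(table: pd.DataFrame) -> pd.DataFrame:
    """Relative change of each cohort between Jan 1st of year and Jan 1st of year+1."""
    start = table.shift(1, axis=0)   # residents aged age-1 in year, placed on row `age`
    end = table.shift(-1, axis=1)    # residents aged age in year+1, placed on column `year`
    ratio = (end - start) / start
    # Drop the first age (no younger cohort) and the last year (no following year)
    return ratio.iloc[1:, :-1]

ccr = cohort_change_ratio(toy)
print(ccr)
```

In the toy table, the 100 people aged 0 in 2001 become 99 people aged 1 in 2002, so the ratio for age 1 in 2001 is -1%.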
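# Cohort change: residents aged `age` on Jan 1st of year+1, minus residents aged `age-1` on Jan 1st of year
dfpc = pd.DataFrame(columns=dfp.columns.tolist()[:-1], index=pd.Index(range(1, 100), name="age"))
for year in dfpc.columns:
    prev_year = dfp[year].shift(1).to_numpy() # the same cohort, one year younger
    dfpc[year] = (dfp[year+1] - prev_year).dropna().astype(int)
# Cohort change ratio: the same difference, relative to the starting size of the cohort
dfpcr = pd.DataFrame(columns=dfp.columns.tolist()[:-1], index=pd.Index(range(1, 100), name="age"))
for year in dfpcr.columns:
    prev_year = dfp[year].shift(1).to_numpy()
    dfpcr[year] = (dfp[year+1] - prev_year) / prev_year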
dfpc.to_csv("../data/cc_by_age_year.csv", index=True)
dfpcr.to_csv("../data/ccr_by_age_year.csv", index=True)
print("Cohort Change")
display(dfpc)
print("Cohort Change Ratio")
display(dfpcr)
fig = px.line(
    data_frame=dfpc,
    x=dfpc.index,
    y=dfpc.columns,
)
fig.update_layout(
    xaxis_title="Age",
    yaxis_title="Cohort Change",
    legend_title="Year",
    title=None,
    margin=dict(l=10, r=10, t=20, b=10),
    width=780,
    height=320,
)
print("Cohort Change, on each year of observation")
fig.write_html("../images_output/cohort_change.html")
fig.show()
fig = px.line(
    data_frame=dfpcr,
    x=dfpcr.index, # same ages and years as dfpc, but taken from the frame being plotted
    y=dfpcr.columns,
)
fig.update_layout(
    xaxis_title="Age",
    yaxis_title="Cohort Change Ratio",
    yaxis_tickformat = ',.0%',
    legend_title="Year",
    title=None,
    margin=dict(l=10, r=10, t=20, b=10),
    width=780,
    height=320,
)
print("Cohort Change Ratio, on each year of observation")
fig.write_html("../images_output/cohort_change_ratio.html")
fig.show()
As expected, the decrease is very modest (<2% per year) up to the age of 70.
Then, visually, an 80-year-old has a 3-5% chance of not surviving the year, which increases to 13-17% at age 90.
Each year is shown separately, but we cannot see a clear trend in the data with respect to the year of measurement, except for 2021, which reflects the death toll of COVID-19:
here we can see that the CCR visibly dropped for 20-70 year olds, but the drop is less significant for people over 70, where the year-to-year noise is larger than the drop due to COVID-19.
The drop from Jan 2021 to Jan 2022 is also localized to 20-30 year olds, becoming visually indistinguishable for older people.
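To go beyond a visual reading, one option is to compare a single year against the mean and spread of a few baseline years. The sketch below uses a toy CCR table with made-up values. The helper deviation_from_baseline and the choice of baseline years are assumptions, not something taken from the analysis above.

```python
import pandas as pd

# Toy CCR table (hypothetical values): rows are ages, columns are years
toy_ccr = pd.DataFrame(
    {
        2017: [-0.040, -0.150],
        2018: [-0.042, -0.148],
        2019: [-0.041, -0.152],
        2020: [-0.050, -0.170],
    },
    index=pd.Index([80, 90], name="age"),
)

def deviation_from_baseline(ccr: pd.DataFrame, year: int, baseline_years: list) -> pd.DataFrame:
    """Difference between one year and the baseline mean, next to the baseline noise."""
    baseline = ccr[baseline_years]
    return pd.DataFrame({
        "deviation": ccr[year] - baseline.mean(axis=1),
        "baseline_std": baseline.std(axis=1),  # sample standard deviation across years
    })

dev = deviation_from_baseline(toy_ccr, 2020, [2017, 2018, 2019])
print(dev)
```

A deviation that is several times larger than baseline_std at a given age would suggest a drop that stands out from the year-to-year noise.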
# get heatmap of the table in squared image
fig = px.imshow(
    dfpcr, labels=dict(x="Year of Observation", y="Age", color="%Pop.Growth"), aspect="auto"
).update_layout(width=1000)
fig.show()
Conclusions
- Identified the trends of (de-)growth of the Italian population from 2002 to 2022, including the role of immigrants
- Identified the distribution of population by age, and the aging of the baby boomers
- Identified the Cohort Change Ratio (ccr)
Follow-up
- Check whether the ccr drop for elderly people in 2003, 2005, 2012 and 2015 is related to heatwaves the year before
- Make a model of the ccr of the population by age, to extrapolate the trends of future years
- Check the evolution of the ratio between people of working age and of retirement age in the coming years, testing different scenarios
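As a starting point for the last item, the ratio between retirement-age and working-age residents can be sketched from an age x year table. The table below is a toy with made-up numbers. The helper old_age_ratio is hypothetical, and the thresholds (working age from 15, retirement from 65) are assumptions to be replaced by the scenario under test.

```python
import pandas as pd

# Toy age x year table (hypothetical numbers): one row per age, one column per year
toy = pd.DataFrame(
    {2002: [50, 300, 100], 2022: [40, 280, 140]},
    index=pd.Index([10, 40, 70], name="age"),
)

def old_age_ratio(table: pd.DataFrame, working_from: int = 15, retire_from: int = 65) -> pd.Series:
    """Residents of retirement age divided by residents of working age, per year."""
    ages = table.index
    working = table.loc[(ages >= working_from) & (ages < retire_from)].sum()
    retired = table.loc[ages >= retire_from].sum()
    return retired / working

ratio = old_age_ratio(toy)
print(ratio)
```

Different scenarios can then be compared by changing retire_from, or by feeding the function a projected table instead of the observed one.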