Investigating Italian Population
AIM: make the best of data available from ISTAT.
In particular, I want to obtain:
- the number of residents each year, by age (2D table: AGE x YEAR_OF_OBSERVATION)
- the delta of residents each year, by age (2D table: AGE x YEAR_OF_OBSERVATION)
- the Cohort Change Ratio (CCR) each year, by age (2D table: AGE x YEAR_OF_OBSERVATION)
import time
import numpy as np
import pandas as pd
import requests
import matplotlib
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
import sdmx
import warnings
client = sdmx.Client("ISTAT")
pio.renderers.default = 'vscode+notebook'
warnings.filterwarnings('ignore')
requests.urllib3.disable_warnings() # avoid "InsecureRequestWarning: Unverified HTTPS request is being made to host 'sdmx.istat.it'. Adding certificate verification is strongly advised"
def get_colors(n, cmap_name="rainbow"):
    """Get colors for px color_discrete_sequence argument, given the number of colors needed, n."""
    cmap = matplotlib.colormaps[cmap_name]
    colors = [cmap(i) for i in np.linspace(0, 1, n)]  # Generate colors as RGBA tuples in [0, 1]
    # Scale each channel to the 0-255 range expected by CSS-style rgba strings
    colors_str = [f"rgba({int(color[0]*255)}, {int(color[1]*255)}, {int(color[2]*255)}, 1.0)" for color in colors]
    return colors_str
I need to put together data from the RICPOPRES dataset (reconstructed resident population) and the updated POPRES dataset (resident population since 2020).
For further information about these datasets, check Notebook#00.
# Concatenate all the dataframes about population on the first of January, from 3 datasets
all_popres_ids = [
"164_346_DF_DCIS_RICPOPRES1971_1", # 1952-1971
"164_347_DF_DCIS_RICPOPRES1981_1", # 1972-1981
"164_279_DF_DCIS_RICPOPRES1991_1", # 1982-1991
"164_305_DF_DCIS_RICPOPRES2001_1", # 1992-2001
"164_164_DF_DCIS_RICPOPRES2011_1", # 2001-2019
"22_289_DF_DCIS_POPRES1_1", # 2019-latest
]
dfs = []
for ds_id in all_popres_ids:
    keys = {
        "FREQ": "A",
        "REF_AREA": "IT",
        "DATA_TYPE": "JAN",
        "AGE": [],  # I want them all
        "SEX": [],  # 9 is total
    }
    if ds_id == "164_164_DF_DCIS_RICPOPRES2011_1":  # the only one with the "CITIZENSHIP" dimension
        keys["CITIZENSHIP"] = "TOTAL"
    if ds_id == "22_289_DF_DCIS_POPRES1_1":  # the only one with the "MARITAL_STATUS" dimension
        keys["MARITAL_STATUS"] = "99"
    dfs.append(
        sdmx.to_pandas(client.data(resource_id=ds_id, key=keys)).reset_index()  # takes about 30 sec.
    )
    print(f"Dataset {ds_id} has {dfs[-1].shape[0]} rows and {dfs[-1].shape[1]} columns.")
    if ds_id != all_popres_ids[-1]:  # avoid sleeping after the last dataset
        time.sleep(120)
Population is split by single year of age, where "0" means newborns who have not yet turned 1, "10" means 10 years old, and so on. The last group is "100", which means 100 or more.
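A minimal sketch of how the age codes become integers, run on a few made-up codes written in the same style as the AGE dimension (the `codes` series is illustrative and is not downloaded from ISTAT):

```python
import pandas as pd

# Toy age codes in the style of the AGE dimension (illustrative values only)
codes = pd.Series(["Y0", "Y1", "Y99", "Y_GE100"])

ages = (
    codes
    .replace("Y_GE100", "Y100")  # the open-ended "100 or more" class becomes 100
    .str.split("Y").str[-1]      # keep what follows the "Y" prefix
    .astype(int)
)
print(ages.tolist())  # [0, 1, 99, 100]
```

Keeping the open-ended class as the integer 100 makes sorting and pivoting easier, at the price of having to remember that it is not a single year of age.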
dfp_long = (
    pd.concat(dfs, ignore_index=True)
    [["TIME_PERIOD", "AGE", "SEX", "value"]]
    .query("AGE!='TOTAL'")
    .replace("Y_GE100", "Y100")  # Remember that 100 is 100+, converting for simplicity
    .assign(AGE=lambda x: x["AGE"].str.split("Y").str[-1].astype(int))
    .assign(YEAR=lambda x: x["TIME_PERIOD"])  # Remember, this YEAR means "at the beginning (January 1st) of the year"
    [["YEAR", "AGE", "SEX", "value"]]
    .astype(int)
    .assign(SEX=lambda x: x["SEX"].map({9: "T", 1: "M", 2: "F"}))
    .drop_duplicates()  # remove 2019 duplicates, in both datasets
    .sort_values(["AGE", "YEAR", "SEX"])
    .reset_index(drop=True)
)
display(dfp_long)
fig = px.line(
    dfp_long,  #.reset_index().melt(id_vars=["age"], var_name="year", value_name="population"), # convert to long format
    x="AGE",
    y="value",
    color="SEX",
    color_discrete_map={"T": "black", "M": "blue", "F": "red"},
    title="Total population in Italy by age (TOTAL)",
    animation_frame="YEAR",
    markers=False,
)
fig.update_layout(
    xaxis_title="Age",
    yaxis_title="Population by age group",
    yaxis_range=[0, 1e6],
    title=None,
    legend_title="Sex",
    margin=dict(l=10, r=10, t=10, b=10),
    width=780,
    height=420,
)
# Initialize the animation at the last frame (current year)
last = fig.frames[-1]
for extra in last.data[len(fig.data):]:
    fig.add_trace(extra)
for i, tr in enumerate(last.data):
    payload = tr.to_plotly_json() if hasattr(tr, "to_plotly_json") else dict(tr)
    fig.data[i].update(**payload)
if "sliders" in fig.layout and fig.layout.sliders:
    fig.layout.sliders[0].active = len(fig.frames) - 1
fig["layout"].pop("updatemenus")  # optional, drop animation buttons
fig.write_html("../images_output/pop_by_age.html", auto_play=False)
print("Total population in Italy by age")
fig.show()
# Pivot the table by year (of observation) vs age, and visualize it as a heatmap
dfp = (
    dfp_long
    .query("SEX=='T'")  # only total population
    .pivot(index='AGE', columns='YEAR', values='value')  # make an age x year table
    .rename(columns={"index": "AGE"})
)
dfp.to_csv("../data/pop_by_age_year.csv", index=True)
# .show() returns None, so the result is not assigned to a variable
px.imshow(
    dfp, labels=dict(x="YEAR", y="AGE", color="Population"), aspect="auto",
).update_layout(width=1000).show()
Sliding from 2002 to 2022, we can see that the age distribution shifts almost rigidly: there is some sort of equilibrium between the different age groups, with the exception of newborns, whose number is decreasing.
We can observe some characteristics of the population age distribution. Let's refer to 2002:
- World War I was fought by Italian soldiers between 1915 and 1918; the dip we see between the ages of 82 and 85 (born in 1917-1920) could well reflect the precarious conditions of the Italian population during and right after the war, leading to fewer births (or more child deaths)
- Same story with WWII: the dip is between 56 and 60, reflecting the drop in births in 1942-1946
- We could also consider adult soldiers who died in the wars: data is too scarce to capture the effect of WWI, but for WWII we can consider the estimated ~0.5M deaths among soldiers and civilians to be spread over the ages 75 to 87, i.e., people who were between 18 and 30 years old at the time of the conflict. We cannot see a clear effect of this in the data, at least not one as evident as the two dips highlighted in the previous points.
- Baby boomers have a clear peak in 1965-1970, leading to an abundance of people who were between 32 and 37 years old in 2002.
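The age-to-birth-year arithmetic used in the points above can be sketched with a hypothetical helper, `birth_year_range` (not part of the notebook's code). It uses the simple `year - age` convention; since the counts refer to January 1st, most people of a given age were actually born one calendar year earlier, so the ranges should be read as approximate.

```python
def birth_year_range(observation_year, youngest_age, oldest_age):
    """Approximate birth years of people aged youngest_age..oldest_age in observation_year."""
    # The oldest people were born first, so they give the lower bound
    return observation_year - oldest_age, observation_year - youngest_age


print(birth_year_range(2002, 82, 85))  # (1917, 1920), the dip linked to WWI
print(birth_year_range(2002, 56, 60))  # (1942, 1946), the dip linked to WWII
print(birth_year_range(2002, 32, 37))  # (1965, 1970), the baby boomers' peak
```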
Now I'll split the population into 5-year-wide age groups. This is necessary for me to (1) use DCIS_DECESSI data and (2) have larger groups, hence less noise.
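A minimal sketch of one way to build 5-year-wide groups from single-year counts, on made-up numbers. The `pop` series and the choice of labelling each group by its lower bound are assumptions for illustration, not necessarily what the rest of the analysis uses.

```python
import pandas as pd

# Hypothetical single-year counts for ages 0..11 (made-up numbers)
pop = pd.Series(range(100, 112), index=pd.Index(range(12), name="AGE"))

# Lower bound of the 5-year class each age falls in: 0, 5, 10, ...
group_start = (pop.index.to_numpy() // 5) * 5

# Sum the single-year counts within each class
pop_5y = pop.groupby(group_start).sum()
print(pop_5y)
```

With this labelling the open-ended "100" class would end up in a group of its own, which seems reasonable given that it already collects everyone aged 100 or more.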
Let's now see how each cohort changes as it ages by one year:
- e.g., compare the number of people who were 80 in a given year with the number who are 81 in the following year
- this is called the "Cohort Change Ratio", $ccr = func(age, year)$
- we can expect the ratio to become more negative with age, since a smaller share of each cohort lives another year
NOTE: ccr(year) refers to the difference between January 1st of year, and January 1st of year+1.
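Reading the definition as $ccr(a, y) = \frac{P(a, y+1) - P(a-1, y)}{P(a-1, y)}$, where $P(a, y)$ is the population aged $a$ on January 1st of year $y$, here is a minimal sketch on a made-up AGE x YEAR table (the `toy` numbers are invented for illustration only):

```python
import pandas as pd

# Made-up AGE x YEAR table: rows are ages, columns are years of observation
toy = pd.DataFrame(
    {2000: [100, 90, 80], 2001: [95, 99, 81]},
    index=pd.Index([0, 1, 2], name="AGE"),
)

# People aged a-1 on January 1st of 2000, aligned on age a
prev = toy[2000].shift(1)

change = toy[2001] - prev  # cohort change: -1 at age 1, -9 at age 2
ccr = change / prev        # cohort change ratio: -1% at age 1, -10% at age 2
print(ccr)
```

Age 0 has no younger cohort to compare with, so its value is NaN; this is why the tables computed from the real data start at age 1.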
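# Cohort change: people aged a on Jan 1st of year+1, minus people aged a-1 on Jan 1st of year
dfpc = pd.DataFrame(columns=dfp.columns.tolist()[:-1], index=pd.Index(range(1, 100), name="age"))
for year in dfpc.columns:
    prev_year = dfp[year].shift(1).to_numpy()  # population aged a-1 in `year`, aligned on age a
    dfpc[year] = (dfp[year+1] - prev_year).dropna().astype(int)
# Cohort change ratio: the same difference, relative to the starting size of the cohort
dfpcr = pd.DataFrame(columns=dfp.columns.tolist()[:-1], index=pd.Index(range(1, 100), name="age"))
for year in dfpcr.columns:
    prev_year = dfp[year].shift(1).to_numpy()
    dfpcr[year] = (dfp[year+1] - prev_year) / prev_year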
dfpc.to_csv("../data/cc_by_age_year.csv", index=True)
dfpcr.to_csv("../data/ccr_by_age_year.csv", index=True)
print("Cohort Change")
display(dfpc)
print("Cohort Change Ratio")
display(dfpcr)
fig = px.line(
    data_frame=dfpc,
    x=dfpc.index,
    y=dfpc.columns,
)
fig.update_layout(
    xaxis_title="Age",
    yaxis_title="Cohort Change",
    legend_title="Year",
    title=None,
    margin=dict(l=10, r=10, t=20, b=10),
    width=780,
    height=320,
)
print("Cohort Change, on each year of observation")
fig.write_html("../images_output/cohort_change.html")
fig.show()
dfpcr.loc[:,2000:]
dfpcr_plot = dfpcr.loc[:,1992:] # Before this year, data is noisy due to the datasets' intersection
fig = px.line(
    data_frame=dfpcr_plot,
    x=dfpcr_plot.index,
    y=dfpcr_plot.columns,
)
fig.update_layout(
    xaxis_title="Age",
    yaxis_title="Cohort Change Ratio",
    yaxis_tickformat=',.0%',
    legend_title="Year",
    title=None,
    margin=dict(l=10, r=10, t=20, b=10),
    width=780,
    height=320,
)
print("Cohort Change Ratio, on each year of observation")
fig.write_html("../images_output/cohort_change_ratio.html")
fig.show()
As expected, the decrease in the number of people is very modest (<2%) up to the age of 70.
Then, visually, an 80-year-old has a 3-5% chance of not surviving the year, which increases to 13-17% at 90 years old.
Each year is shown separately, but there is no clear trend in the data with respect to the year of measurement, except for 2021, which carries the death toll of COVID-19:
here the CCR dropped visibly for people aged 20 to 70, but the drop is not that significant for people over 70, where the year-to-year noise is larger than the drop due to COVID-19.
The drop from Jan 2021 to Jan 2022 is also localized to people aged 20 to 30, becoming visually indistinguishable for older people.
# get heatmap of the table in squared image
fig = px.imshow(
    dfpcr_plot, labels=dict(x="Year of Observation", y="Age", color="%Pop.Growth"), aspect="auto"
).update_layout(width=1000)
fig.show()
Conclusions
- Identified the trends of (de-)growth of the Italian population from 2002 to 2022, including the role of immigrants
- Identified the distribution of population by age, and the ageing of the baby boomers
- Identified the Cohort Change Ratio (ccr)
Follow-up
- Check if the ccr drop for elderly people in 2003, 2005, 2012, 2015 is related to heatwaves the year before
- Make a model of the ccr of population by age, to extrapolate the trends of future years
- Check the evolution of the ratio between people of working age and people of retirement age in the next years, testing different scenarios