Group Members: Priya Mapara, Maya Narayanasamy, Aravind Ganeshan
Data retrieved from: https://www.kaggle.com/unsdsn/world-happiness. Each year, countries are ranked based on their Happiness Index. It is a comprehensive survey instrument that assesses happiness, well-being, and aspects of sustainability and resilience. The happiness score is a numeric value on a scale of 1-10 with ten being the highest and one being the lowest. There are 6 categories that are taken into account when calculating the happiness score for each country.
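The six categories are: Economy (GDP per Capita), Social support, Health (Life Expectancy), Freedom, Trust (Government Corruption), and Generosity.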
The data retrieved is happiness data from the years 2015-2019 from Kaggle. This tutorial will contain information in three parts:
Part One: Boxplot of Countries We will first start by counting how many countries are in each region. Then, we will group the countries based on their regions. We will then plot the data in a box plot to analyze which regions have a better score. This will visually give us a representation of which sections of the world tend to rank higher in happiness, regionally grouping them for comparison and analysis. We will also calculate the median happiness score among each region.
Part Two: Machine Learning In this part, we will retrieve data from https://datahelpdesk.worldbank.org/knowledgebase/articles/906519. This gives us the ranking of the countries' economies, which is categorized into: High Income, Upper Middle, Lower Middle, and Low Income. We will use the k-nearest neighbors algorithm to predict each country's income class for the following year.
Part Three: Linear Regressions In this portion, we will obtain data on the Human Development Index (HDI) from http://hdr.undp.org/en/data. The HDI is a summary measure of average achievement in the key dimensions of human development: a long and healthy life, being knowledgeable, and having a decent standard of living. Then, we will use linear regressions to see if there is a correlation between the Happiness Score from our data set and the HDI score for the countries from the years 2015-2018. We will analyze how these data compare to one another, and what the linear regressions can tell us about what else may factor into the happiness score.
For this tutorial we will be using Jupyter Notebooks as our choice of platform for code development since it is an open source application. It allows one to create and share documents that contain live code, equations, visualizations and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.
The following libraries will also be used within our tutorial:
import os
import re
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import random
import scipy
import statsmodels
import seaborn as sb
import sklearn
from sklearn import datasets, metrics
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier
# import statsmodels.formula.api as sm
import statsmodels.api as sm
from IPython.display import display, HTML
import pycountry as pyco
%matplotlib inline
# Data Retrieved From: https://www.kaggle.com/unsdsn/world-happiness
The first step in making our tutorial was finding data. We found our dataset on Kaggle, a website that hosts a large number of datasets on a variety of topics. From there we were able to download five separate files containing Happiness Index data from the years 2015-2019 inclusive. The downloaded data was presented to us in an Excel sheet format, which we were then easily able to convert to comma separated value files (CSVs). We downloaded the data files to our Jupyter Notebook and placed all our data in a separate folder titled 'data'. We used the pandas read_csv() function to read in our data so that it is presented in a table format.
# all csv files are stored under the data directory
path = "data/"
data = {file[:4]: pd.read_csv("data/" + file) for file in os.listdir(path) if file[0].isdigit()}
To clean and tidy our data it was important for us to make sure that all the datasets amongst the years 2015-2019 were consistent. When looking at the CSV files we noticed that the data columns were named slightly differently or had extra columns. This inconsistency made it difficult to properly categorize the data, especially in regards to region. To combat this issue we renamed all the datasets to follow the 2015 data set model since it was what had all the information we needed. The 2017, 2018, and 2019 data files were not as similar in their title names as the 2015 and 2016 dataset so we manually renamed the columns to match.
# column names
columns = ["Country", "Region", "Happiness Rank", "Happiness Score", "Economy (GDP per Capita)", "Social support",
"Health (Life Expectancy)", "Freedom", "Trust (Government Corruption)", "Generosity"]
# rename and drop columns for 2015, 2016, 2017 data files
data["2015"] = data["2015"].rename(columns={"Family": "Social support"})
data["2015"] = data["2015"][data["2015"].columns.intersection(columns)]
data["2016"] = data["2016"].rename(columns={"Family": "Social support"})
data["2016"] = data["2016"][data["2016"].columns.intersection(columns)]
data["2017"] = data["2017"].drop(["Whisker.high", "Whisker.low", "Dystopia.Residual"], axis=1)
data["2017"] = data["2017"].rename(columns={"Family": "Social Support"})
data["2017"].columns = [c for c in columns if c != "Region"]
# rename columns for 2018 and 2019 data files
new_names = {"Overall rank": "Happiness Rank", "Country or region": "Country", "Score": "Happiness Score",
"GDP per capita": "Economy (GDP per Capita)", "Healthy life expectancy": "Health (Life Expectancy)",
"Freedom to make life choices": "Freedom", "Perceptions of corruption": "Trust (Government Corruption)"}
data["2018"] = data["2018"].rename(columns=new_names)
data["2018"] = data["2018"][[c for c in columns if c in data["2018"].columns]]
data["2019"] = data["2019"].rename(columns=new_names)
data["2019"] = data["2019"][[c for c in columns if c in data["2019"].columns]]
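The renaming step above can also be written so that every column is matched by name rather than by position, which guards against a file whose columns appear in a different order. The following is a minimal sketch on a made-up two-row frame; the dotted names such as "Happiness.Rank" are an assumption about how the raw 2017 file labels its columns, and the values are placeholders rather than real scores.

```python
import pandas as pd

# Hypothetical miniature of a raw yearly file. The dotted column names are
# assumed for illustration; the numbers are placeholders, not real data.
raw = pd.DataFrame({
    "Country": ["A", "B"],
    "Happiness.Rank": [1, 2],
    "Happiness.Score": [7.5, 7.4],
    "Generosity": [0.3, 0.4],
    "Trust..Government.Corruption.": [0.1, 0.2],
})

# Map each raw name to the name used by the 2015 model.
name_map = {
    "Happiness.Rank": "Happiness Rank",
    "Happiness.Score": "Happiness Score",
    "Trust..Government.Corruption.": "Trust (Government Corruption)",
}

# Target order in which Trust comes before Generosity, as in the 2015 model.
target_order = ["Country", "Happiness Rank", "Happiness Score",
                "Trust (Government Corruption)", "Generosity"]

# rename() matches on labels, so the values stay attached to the right column
# even though the raw file lists Generosity before Trust.
tidy = raw.rename(columns=name_map)[target_order]
print(tidy)
```

Because the selection with target_order raises a KeyError when a column is missing, a misspelled or unexpected column name is reported immediately instead of silently mislabeling the data.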
When obtaining our second dataset we decided to import pycountry, which allows us to standardize country names. Some countries were spelled differently across the datasets, so standardizing the names gives every data frame the same naming convention and makes it easier to locate and match countries.
# inconsistent country naming
errors = ["Congo (Kinshasa)", "Congo (Brazzaville)", "Palestinian Territories", "Ivory Coast", "South Korea"]
alpha_2_names = [pyco.countries.get(alpha_2=a2).name for a2 in ["CG", "CD","PS", "CI","KR"]]
# use pycountry to standardize
for year, df in data.items():
    for c in df["Country"]:
        try:
            d = {c: country.name for country in pyco.countries.search_fuzzy(c) if country.name not in df["Country"]}
            df["Country"] = df["Country"].replace(d)
        except LookupError:
            df["Country"] = df["Country"].replace(dict(zip(errors, alpha_2_names)))
The next step in cleaning our data was to keep only the countries that appear in all five datasets. The datasets were inconsistent about which countries they included, so a country that is present in only four of the five datasets is excluded from our cleaned data and further analysis. This ensures that we are using complete data and the same set of countries when grouping our data by year.
# remove all rows for countries that are not common to all the data files
# this will be used to filter out the countries that do not appear in each data file
countries_list = [df["Country"] for df in data.values()]
common_countries = set(countries_list[0]).intersection(*countries_list[1:])
for year, df in data.items():
    data[year] = df[df["Country"].isin(common_countries)]
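To make the filtering step concrete, here is a small sketch of the same idea on made-up frames. The country labels "A" to "D" and the scores are hypothetical; only the technique (set intersection followed by isin()) mirrors the code above.

```python
import pandas as pd

# Three hypothetical yearly frames with partly overlapping countries.
frames = {
    "2015": pd.DataFrame({"Country": ["A", "B", "C"], "Happiness Score": [7.0, 6.0, 5.0]}),
    "2016": pd.DataFrame({"Country": ["A", "C", "D"], "Happiness Score": [7.1, 5.2, 4.0]}),
    "2017": pd.DataFrame({"Country": ["C", "A"], "Happiness Score": [5.1, 7.2]}),
}

# Countries present in every frame.
country_sets = [set(df["Country"]) for df in frames.values()]
common = set.intersection(*country_sets)

# Keep only the rows for those countries in each year.
filtered = {
    year: df[df["Country"].isin(common)].reset_index(drop=True)
    for year, df in frames.items()
}

print(sorted(common))
print({year: len(df) for year, df in filtered.items()})
```

After filtering, every year holds the same set of countries, so comparisons across years are made on matching rows.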
The 2017, 2018, and 2019 datasets also did not have a Region column, so we used the country-region pairs from the 2015 dataset to add the region to those datasets. Our first idea was to go through the files by hand, find where they differed from the 2015 dataset, and add the Region column manually. To reduce the risk of human error from mistyping, we instead wrote the Python code below, which fills in the regions for us rather than editing the files in their Excel form.
# 2017, 2018, 2019 data files do not have the regions for the different countries filled in
# so we can use the regions from the 2015 to fill them in since the regions won't change
country_to_region = dict(zip(data["2015"]["Country"], data["2015"]["Region"]))
country_to_region = {k: v for k,v in country_to_region.items() if k in common_countries}
# create region column for the years it does not exist
pd.set_option('mode.chained_assignment', None)
data["2017"]["Region"] = data["2017"]["Country"].apply(lambda country : country_to_region[country])
data["2018"]["Region"] = data["2018"]["Country"].apply(lambda country : country_to_region[country])
data["2019"]["Region"] = data["2019"]["Country"].apply(lambda country : country_to_region[country])
# set proper order of the columns
for year in data.keys():
    data[year] = data[year].loc[:, columns]
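The region fill-in above relies on a dictionary lookup per country. A possible alternative, sketched here on hypothetical data, is Series.map(), which leaves a missing value instead of raising a KeyError when a country has no entry in the lookup, so unmatched names can be listed and inspected. The lookup contents and country labels below are made up for illustration.

```python
import pandas as pd

# Hypothetical country-to-region lookup, standing in for the 2015 pairs.
lookup = {"A": "Western Europe", "B": "Eastern Asia"}

# A later year's frame that has no Region column; "Z" is not in the lookup.
later_year = pd.DataFrame({"Country": ["A", "B", "Z"]})

# map() looks each country up in the dictionary and yields NaN when absent.
later_year["Region"] = later_year["Country"].map(lookup)

# Countries that could not be matched to a region.
missing = later_year.loc[later_year["Region"].isna(), "Country"].tolist()
print(later_year)
print("unmatched:", missing)
```

In this tutorial the countries were already restricted to those common to all years, so no unmatched names are expected; the check is mainly useful when that filtering step is skipped or changed.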
# pd.set_option("display.max_rows", None, "display.max_columns", None)
Below we have calculated to see how many countries are displayed in each region. This is useful because if certain regions have more countries it means that the data would be more spread apart than regions that only have a few countries listed under it. This is necessary to know because it changes how the data visually appears and how much data there is to pull from given a region.
# A table below will count the number of countries that are in each region.
western_europe_count = 0
north_america_count = 0
australia_and_new_zealand_count = 0
middle_east_and_northern_africa = 0
latin_america_and_carribbean = 0
southeastern_asia = 0
central_and_eastern_europe = 0
eastern_asia = 0
sub_saharan_africa = 0
southern_asia = 0
# iterates through the entire region column of the 2015 data set and counts how many countries
# are in each region.
for each in data["2015"]["Region"]:
    if (each == "Western Europe"):
        western_europe_count = western_europe_count + 1
    elif (each == "North America"):
        north_america_count = north_america_count + 1
    elif (each == "Australia and New Zealand"):
        australia_and_new_zealand_count = australia_and_new_zealand_count + 1
    elif (each == "Middle East and Northern Africa"):
        middle_east_and_northern_africa = middle_east_and_northern_africa + 1
    elif (each == "Latin America and Caribbean"):
        latin_america_and_carribbean = latin_america_and_carribbean + 1
    elif (each == "Southeastern Asia"):
        southeastern_asia = southeastern_asia + 1
    elif (each == "Central and Eastern Europe"):
        central_and_eastern_europe = central_and_eastern_europe + 1
    elif (each == "Eastern Asia"):
        eastern_asia = eastern_asia + 1
    elif (each == "Sub-Saharan Africa"):
        sub_saharan_africa = sub_saharan_africa + 1
    elif (each == "Southern Asia"):
        southern_asia = southern_asia + 1
# data frame that stores the counts
region_count = {'Region' : ['Western Europe', 'North America', 'Australia and New Zealand',
'Middle East and Northern Africa', 'Latin America and Caribbean','Southeastern Asia',
'Central and Eastern Europe', 'Eastern Asia', 'Sub-Saharan Africa','Southern Asia'],
'Number of Countries in Region' : [western_europe_count,north_america_count,australia_and_new_zealand_count,
middle_east_and_northern_africa,latin_america_and_carribbean,
southeastern_asia,central_and_eastern_europe,eastern_asia,
sub_saharan_africa, southern_asia]}
count_df = pd.DataFrame(region_count, columns = ['Region', 'Number of Countries in Region'])
count_df
| | Region | Number of Countries in Region |
|---|---|---|
| 0 | Western Europe | 20 |
| 1 | North America | 2 |
| 2 | Australia and New Zealand | 2 |
| 3 | Middle East and Northern Africa | 19 |
| 4 | Latin America and Caribbean | 21 |
| 5 | Southeastern Asia | 8 |
| 6 | Central and Eastern Europe | 29 |
| 7 | Eastern Asia | 6 |
| 8 | Sub-Saharan Africa | 31 |
| 9 | Southern Asia | 7 |
As shown above, the regions contain different numbers of countries, ranging from 2 (North America, and Australia and New Zealand) to 31 (Sub-Saharan Africa). Each region therefore draws on a different amount of data, and the boxplots for regions with few countries will reflect this by appearing small and compact.
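The same kind of count can be produced in a single pandas expression. The sketch below uses a made-up five-row frame rather than the real 2015 data, and is offered as an alternative to the explicit counters, not as the method used to produce the table above.

```python
import pandas as pd

# Hypothetical frame with one row per country.
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D", "E"],
    "Region": ["Western Europe", "Western Europe", "North America",
               "Eastern Asia", "Western Europe"],
})

count_col = "Number of Countries in Region"

# value_counts() tallies each distinct region; reset_index() turns the
# result into a two-column DataFrame like the table in this tutorial.
count_df = (
    toy["Region"]
    .value_counts()
    .rename_axis("Region")
    .reset_index(name=count_col)
)
print(count_df)
```

Unlike the hand-written counters, this version picks up any region name that appears in the column, so a new or misspelled region shows up in the result instead of being silently skipped.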
For the first part of our analysis we will compare the happiness score across the different regions. The data will be displayed in 5 separate graphs, with each graph representing one year. There will be a total of 10 boxplots on each graph since there are 10 predefined regions. Afterwards we will calculate the median happiness score for each region for each year.
medians = {}
for year, df in data.items():
    sb.set_style("whitegrid")
    dims = (15,10)
    _, ax = plt.subplots(figsize=dims)
    medians[year] = df.groupby(["Region"])["Happiness Score"].median()
    sb.boxplot(x="Region", y="Happiness Score", data=data[year], ax=ax)
    # set the title and axis labels after plotting so they are not overridden by seaborn's defaults
    plt.title("Boxplot of the Region and their Happiness Score in " + year, fontsize = 20)
    plt.xlabel("Region", fontsize = 15)
    plt.ylabel("Happiness Score", fontsize = 15)
    plt.show()
From the boxplots we can see that Western Europe, North America, and Australia and New Zealand have the highest happiness scores. However, the data from North America and Australia/New Zealand are less spread apart, whereas the Western Europe data shows a larger spread. The smaller spread in the Australia/New Zealand and North American regions can be explained by the small number of countries in those regions. This can be seen in every year of the dataset. The largest spread is seen in the Middle East/North African region: in all five years the data is widely spread, meaning that there is a big range between the countries in that region.
# shows the median Happiness score for each region from 2015 to 2019
median_data = pd.DataFrame.from_dict(medians)
median_data
| Region | 2015 | 2016 | 2017 | 2018 | 2019 |
|---|---|---|---|---|---|
| Australia and New Zealand | 7.2850 | 7.3235 | 7.2990 | 7.2980 | 7.2675 |
| Central and Eastern Europe | 5.2860 | 5.4880 | 5.3950 | 5.6200 | 5.5290 |
| Eastern Asia | 5.7290 | 5.6465 | 5.5555 | 5.6525 | 5.6580 |
| Latin America and Caribbean | 6.1300 | 6.1260 | 6.0080 | 6.0705 | 6.0955 |
| Middle East and Northern Africa | 5.1920 | 5.3030 | 5.5000 | 5.3580 | 5.2110 |
| North America | 7.2730 | 7.2540 | 7.1545 | 7.1070 | 7.0850 |
| Southeastern Asia | 5.3795 | 5.2965 | 5.3460 | 5.3135 | 5.2655 |
| Southern Asia | 4.5650 | 4.6430 | 4.7850 | 4.6900 | 4.6845 |
| Sub-Saharan Africa | 4.2520 | 4.1300 | 4.1390 | 4.3500 | 4.4900 |
| Western Europe | 6.9385 | 6.9180 | 6.9510 | 6.9770 | 7.0540 |
Looking at the medians, the happiness score for the North American region decreases slightly every year, from 7.2730 in 2015 to 7.0850 in 2019. The Sub-Saharan African region, by contrast, shows a slight overall increase, from 4.2520 in 2015 to 4.4900 in 2019, despite a dip in 2016.
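The two trends described here can be checked directly from the median table. The sketch below re-enters the North America and Sub-Saharan Africa rows by hand from the table above; the variable names are ours, and the snippet is only a quick verification aid.

```python
import pandas as pd

# Median happiness scores for two regions, copied from the table above.
median_rows = pd.DataFrame(
    {
        "2015": [7.2730, 4.2520],
        "2016": [7.2540, 4.1300],
        "2017": [7.1545, 4.1390],
        "2018": [7.1070, 4.3500],
        "2019": [7.0850, 4.4900],
    },
    index=["North America", "Sub-Saharan Africa"],
)

# Net change in the median between the first and last year.
change = (median_rows["2019"] - median_rows["2015"]).round(3)

# Whether each region's median moves in one direction every single year.
na_always_falls = median_rows.loc["North America"].is_monotonic_decreasing
ssa_always_rises = median_rows.loc["Sub-Saharan Africa"].is_monotonic_increasing

print(change)
print(na_always_falls, ssa_always_rises)
```

The check separates a year-over-year decline (North America) from an increase that holds only between the endpoints (Sub-Saharan Africa).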
In this section we will use machine learning, specifically the k-nearest neighbors algorithm, to predict the income status of countries. We retrieved data from https://datahelpdesk.worldbank.org/knowledgebase/articles/906519, which classifies each country's income level as High, Upper Middle, Lower Middle, or Low. Using the subcategories that are used to calculate the Happiness Index, we will use the k-nearest neighbors algorithm to predict the income level of each country.
Once again, we were able to use pycountry to standardize all the names of the countries so that it was easier to locate their data. We first formatted the dataset so that it only covers the years 2015-2019. We then filtered the country names and kept only the ones that are common to the Happiness Index dataset. Finally, we created a new dictionary holding the Happiness Index data with an added column for each country's income level.
# load income classification data
econ = pd.read_excel(path + "OGHIST.xls", sheet_name = "Country Analytical History")
pd.set_option("display.max_rows", None, "display.max_columns", None)
# format and extract the country names and data from years (2015-2019)
econ = econ.iloc[[4] + list(range(10, 228)), [1] + list(range(30, 35))]
econ.reset_index(drop=True, inplace=True)
econ.columns = [str(c) for c in econ.iloc[0]]
econ.rename(columns={"4": "index", "Data for calendar year :": "Country"}, inplace=True)
econ = econ.drop([0], axis=0)
econ.reset_index(drop=True, inplace=True)
# deal with inconsistent country names
a2 = ["BS", "CG", "CD", "EG", "GM", "HK", "KP", "KR", "LA", "FM", "KN", "LC", "MF", "VC", "VE", "TW", "VI", "PS", "YE"]
a2 = [pyco.countries.get(alpha_2=c).name for c in a2]
discard = ["Channel Islands", "Faeroe Islands", "Macao SAR, China"]
errors = []
# use pycountry to get standard names
for c in econ["Country"]:
    try:
        d = {c: pyco.countries.search_fuzzy(c)[0].name}
        if c == "Niger":
            d = {c: "Niger"}
        econ["Country"] = econ["Country"].replace(d)
    except LookupError:
        if c not in discard:
            errors.append(c)
econ["Country"] = econ["Country"].replace(dict(zip(errors, a2)))
# create new dictionary to be used for the income class learning portion
new_data = {year: df.copy(deep=True) for year, df in data.items()}
for year, df in new_data.items():
    df["Income Class"] = pd.Series(dtype=str)
The code below fills in the Income Class column that was added to the Happiness Index dataset.
# add the income class to the new_data
for year, df in new_data.items():
    for i, row in econ.iterrows():
        country, income_class = row["Country"], row[year]
        if country in common_countries:
            df.loc[df["Country"] == country, "Income Class"] = income_class
# drop countries that contain an NaN for their income score
for _, df in new_data.items():
    df.dropna(inplace=True)
We decided to use each year's data to predict the income classes of the following year. We chose the 6 subcategories that are used to create the Happiness Score as our variables when deciding what the Income Class could be. The train year is the year whose data is used to fit the model, and the test year is the succeeding year whose income classes we predict. The results printed below report how accurate the predictions were using this algorithm.
# create the KNN classifier and set the train and test data pairs
knn = KNeighborsClassifier()
pairs = [("2015", "2016"), ("2016", "2017"), ("2017", "2018"), ("2018", "2019")]
# perform classification
for train_year, test_year in pairs:
    X, y = new_data[train_year].iloc[:, 4:10], new_data[train_year]["Income Class"]
    X, y = np.array(X), np.array(y)
    knn.fit(X, y)
    X_test, y_test = new_data[test_year].iloc[:, 4:10], new_data[test_year]["Income Class"]
    # convert the test-year frames (not the training arrays X, y) so the score is measured on the following year
    X_test, y_test = np.array(X_test), np.array(y_test)
    print("Mean accuracy (train year: %s test year: %s): %f" % (train_year, test_year, knn.score(X_test, y_test)))
Mean accuracy (train year: 2015 test year: 2016): 0.840278
Mean accuracy (train year: 2016 test year: 2017): 0.863014
Mean accuracy (train year: 2017 test year: 2018): 0.819444
Mean accuracy (train year: 2018 test year: 2019): 0.805556
The printed accuracies range from roughly 80% to 86% across the four pairs of years: 2015 (train) and 2016 (test), 2016 (train) and 2017 (test), 2017 (train) and 2018 (test), and 2018 (train) and 2019 (test). This suggests that there is a relationship between the six subcategories of the Happiness Index data set and a country's income level, and that the income class can be predicted reasonably well from the six parameters that are used to calculate the Happiness Score. An accuracy figure only measures predictive performance when knn.score() is given the test year's features and labels; a score computed on the arrays used for fitting measures how well the model reproduces its training data and will tend to be optimistic.
We chose k-nearest neighbors because it is a supervised learning algorithm: it uses labeled input data to learn a function that assigns a class to new inputs, classifying each new point by the labels of its nearest neighbors in feature space. In our case the input data is the previous year's data, which contains the six subcategories used to calculate the Happiness Index along with each country's income class. We use this information to predict the succeeding year's income classes from that year's six subcategory values.
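The train-on-one-year, test-on-the-next idea can be sketched on synthetic data. Everything below is hypothetical: make_year() is a helper invented for this example, the two well-separated clusters stand in for income classes, and the numbers have nothing to do with the real Happiness Index values. The point is only to show the score being computed on a separate set of rows from the ones used for fitting.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

def make_year(n_per_class=30):
    """Hypothetical stand-in for one year's six feature columns and labels."""
    low = rng.normal(loc=0.3, scale=0.1, size=(n_per_class, 6))
    high = rng.normal(loc=1.0, scale=0.1, size=(n_per_class, 6))
    X = np.vstack([low, high])
    y = np.array(["Low"] * n_per_class + ["High"] * n_per_class)
    return X, y

# One draw plays the role of the train year, a second draw the test year.
X_train, y_train = make_year()
X_test, y_test = make_year()

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Accuracy on the rows used for fitting versus accuracy on unseen rows.
train_acc = knn.score(X_train, y_train)
test_acc = knn.score(X_test, y_test)
print("train accuracy: %f, test accuracy: %f" % (train_acc, test_acc))
```

On real data the two numbers usually differ, and the test accuracy is the one that describes how well the previous year's data predicts the following year.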
We have collected data from the United Nations Human Development Reports. This data shows each country's Human Development Index for the years 2015-2018. We will use linear regressions to see if there is an increasing trend that correlates with the Happiness Score. The Human Development Index is an index that measures human development by combining factors such as life expectancy, standard of living, and education. As a group, we wanted to see if there was any linear correlation between the Human Development Index score and the Happiness Score among countries.
The data given to us was poorly formatted, since there was only one column containing all the data for each year for each country. We had to separate the columns by year to make it more readable while also removing all the 'nan' columns. To separate the columns we had to create a new dataframe and extract the data from the older one. Another problem we ran into was the inconsistency of country naming. To resolve this issue we used the pycountry import, which standardized the naming of the countries.
Our null hypothesis is that there is not a positive correlation between the Happiness Index and Human Development Index scores.
# load the hdi data
hdi = pd.read_csv(path + "Human_Development_Index.csv")
hdi = hdi.drop(columns=["Human Development Index (HDI)"])
# clean and format the hdi data
remove_nan = lambda r: [e for e in r if str(e) != "nan"]
rows = [remove_nan(row.name) for i, row in hdi.iterrows()]
rows = [r[:2] + r[-4:] for r in rows]
rows = rows[:len(rows) - 18]
columns = rows.pop(0)
# create hdi dataframe
hdi = {c: [r[i] for r in rows] for i, c in enumerate(columns)}
hdi = pd.DataFrame(data=hdi)
# standardize country names in hdi dataframe
for c in hdi["Country"]:
    try:
        pyco.countries.search_fuzzy(c)
    except LookupError:
        name = re.search(r"^([\w\s]*).*$", c).group(1)
        d = {c: country.name for country in pyco.countries.search_fuzzy(name) if country.name not in hdi["Country"]}
        hdi["Country"] = hdi["Country"].replace(d)
hdi = hdi[hdi["Country"].isin(common_countries)]
hdi.reset_index(drop=True, inplace=True)
Below we graph our data in a scatterplot and add a line of best fit to see if there is any sort of linear trend.
# pd.set_option("display.max_rows", 500)
# create data arrays
happiness_score, hdi_score = [], []
# add the data to the arrays
for year, df in data.items():
    if year != "2019":
        for country in df["Country"]:
            if country in list(hdi["Country"]):
                happiness_score.append(df[df["Country"] == country]["Happiness Score"].iloc[0])
                hdi_score.append(float(hdi[hdi["Country"] == country][year].iloc[0]))
# create regression model
happiness_score, hdi_score = np.array(happiness_score), np.array(hdi_score)
reshape_x, reshape_y = happiness_score.reshape(-1, 1), hdi_score.reshape(-1, 1)
reg = LinearRegression().fit(reshape_x, reshape_y)
print("R^2 coefficient: %f" % reg.score(reshape_x, reshape_y))
# plot data on scatter plot
plt.figure(figsize = (15, 10))
plt.title("Happiness Score Relation with HDI")
plt.xlabel("Happiness Score")
plt.ylabel("HDI Score")
plt.scatter(happiness_score, hdi_score)
# plot best fit line
m, b = np.polyfit(happiness_score, hdi_score, 1)
plt.plot(happiness_score, m * happiness_score + b, color = "orange")
X = sm.add_constant(happiness_score)
y = hdi_score
model = sm.OLS(y, X).fit()
print(model.summary())
R^2 coefficient: 0.626395
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.626
Model:                            OLS   Adj. R-squared:                  0.626
Method:                 Least Squares   F-statistic:                     942.3
Date:                Mon, 21 Dec 2020   Prob (F-statistic):          2.99e-122
Time:                        00:55:20   Log-Likelihood:                 534.16
No. Observations:                 564   AIC:                            -1064.
Df Residuals:                     562   BIC:                            -1056.
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.1424      0.020      7.303      0.000       0.104       0.181
x1             0.1073      0.003     30.696      0.000       0.100       0.114
==============================================================================
Omnibus:                       22.188   Durbin-Watson:                   1.887
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               25.654
Skew:                          -0.421   Prob(JB):                     2.69e-06
Kurtosis:                       3.618   Cond. No.                         28.3
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
From the plot above we can see a positive linear trend in the data. Most countries follow the pattern of having a high happiness score while also having a high Human Development Index score. The R^2 value of 0.63 also shows that there is a moderately strong positive correlation between the Happiness Score and HDI scores. The R^2 value is not high enough to say that there is a very strong relationship between the two variables, but the graph shows a clear positive trend between the two data sets.
From the OLS model summary we can see that the p-value for the slope is reported as 0.000, meaning it is smaller than 0.0005. Because the p-value is so close to zero and the estimated slope is positive (0.1073), we reject the null hypothesis stated above that there is no positive correlation between the Happiness Score and the Human Development Index score.
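For a regression with a single predictor, R^2 is the square of the Pearson correlation coefficient, so the reported R^2 of 0.626 corresponds to a correlation of about 0.79. The sketch below illustrates that relationship on synthetic data; the slope, intercept, and noise level are arbitrary choices made for the example and are not the tutorial's data.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic x and y with a positive linear relationship plus noise.
x = rng.uniform(3, 8, size=200)
y = 0.1 * x + 0.15 + rng.normal(0, 0.05, size=200)

# Least-squares line, as in the best fit line plotted above.
m, b = np.polyfit(x, y, 1)
pred = m * x + b

# R^2 = 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum((y - pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

# Pearson correlation between x and y.
r = np.corrcoef(x, y)[0, 1]

print("R^2: %f, r: %f, r squared: %f" % (r_squared, r, r ** 2))
```

Reading R^2 this way makes it easier to compare the regression output with the more familiar scale of a correlation coefficient.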
In the first part of the project, we set out to categorize the countries by region, and see how they compared to one another in regards to their happiness index. Through the use of python libraries such as pandas, we were able to extract and read multiple CSV files, which we were then able to use to create the boxplots of countries. Before we were able to create our visualizations, we needed to clean the data obtained by changing inconsistent column names, country names, removing countries which were not in all the data sets, and adding required columns. After this was finished, we could then move on to plotting and analysis.
By plotting the data in boxplots to analyze which regions have better scores, we were able to get a visual representation of which sections of the world tend to rank higher in happiness. The data already grouped the countries by region, so we wanted to see how the regions compared with one another. This was useful because countries in a region tend to share cultural, political, and economic ties, so grouping them puts into perspective both how regions compare to one another and how much countries can differ within one region. Some boxplots were small and tightly packed while others were much more spread out.
For the first section, it should also be noted that not every region has the same number of countries. One region had 2 while another had as many as 31. This is important to consider because it means that each region is not drawing on the same amount of data. Regions with fewer countries might not have the chance to show spread-out data simply because there is not enough data to create that spread. On a similar note, not every country was represented in the data we obtained, so there are missing pieces of data for many of the regions. Because most countries are represented, we expect the boxplots and analysis to remain a reasonable picture of each region, although the omitted countries could shift individual results.
For our second part of the project we imported data from the World Bank dataset that classified countries based on their Income Status Level. Countries were classified into four categorical variables: High, Upper Middle, Lower Middle, and Low. The countries were assigned their income level based off the World Bank Atlas Method. For our machine learning analysis, we wanted to see if the six subcategories used to predict Happiness Score could also be used to predict the income status level for the countries.
For our machine learning analysis we decided to use the k-nearest neighbors algorithm to predict the income class of countries based on their Happiness Score subcategories: Economy (GDP per Capita), Social support, Health (Life Expectancy), Freedom, Trust (Government Corruption), and Generosity. We used the previous year's data as our training data to predict the succeeding year's income classes. We used this model across four different pairs of years and obtained reported accuracies of up to 86%. From this we concluded that there may be a relationship between the subcategories of the Happiness Score and a country's income class. For a hypothetical country for which the six parameters were given to us, the reported results suggest that we could predict its income class with an accuracy of up to about 86%.
In the next part of the project, we obtained data on the Human Development Index (HDI), which measures the average achievement in the key dimensions of human development, highlighting which countries maintain a good standard of living and quality of life. The data showed the HDI from 2015-2018, similar years to the ones we used for the previous part. Using linear regressions, we tested whether there is a correlation between the Happiness Score from our dataset and the HDI score for the countries from the years 2015-2018 inclusive. After creating the linear regressions, we then analyzed how the data compared to one another, and what the linear regressions could tell us about what else may factor into the happiness score.
Just like the previous part, we had to tidy up and clean the Human Development Index data. The data was very poorly formatted, all the data for each country and each year only showing up in one column and having many NaN columns randomly appearing because of the poor formatting provided. We had to separate the countries by year, and to separate the columns, we had to create a new dataframe and extract the data from the older one. Another problem we ran into was in the inconsistency of country naming, so we added the pycountry import which standardized naming for the countries.
Based on the linear regressions, we saw that there was a positive correlation between the Happiness Score and the HDI score, with the two numbers following a pattern with one another. Because the HDI combines factors such as life expectancy, standard of living, and education to measure human development, it also ties into the Happiness Score: even though the HDI is not a part of it, both are measured through similar things. It should be noted that the Happiness Score is not an exact science. It uses a multitude of factors to try to determine the happiness of countries, but that does not mean it perfectly reflects how people in a given country really feel. There may be more factors that need to be considered, and one of them could be the Human Development Index.