Happiness in the World

Group Members: Priya Mapara, Maya Narayanasamy, Aravind Ganeshan

Overview

Data retrieved from: https://www.kaggle.com/unsdsn/world-happiness. Each year, countries are ranked based on their Happiness Index, a comprehensive survey instrument that assesses happiness, well-being, and aspects of sustainability and resilience. The happiness score is a numeric value on a scale of 0-10, with ten being the highest and zero being the lowest. Six categories are taken into account when calculating the happiness score for each country.

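The six categories are: Economy (GDP per capita), Social support, Health (Life Expectancy), Freedom, Trust (Government Corruption), and Generosity.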

The data retrieved is happiness data from the years 2015-2019 from Kaggle. This tutorial will contain information in three parts:

Part One: Boxplot of Countries. We will first count how many countries are in each region and group the countries by region. We will then plot the data in box plots to see which regions have better scores. This gives us a visual representation of which parts of the world tend to rank higher in happiness, grouped regionally for comparison and analysis. We will also calculate the median happiness score for each region.

Part Two: Machine Learning. In this part, we will retrieve data from https://datahelpdesk.worldbank.org/knowledgebase/articles/906519. This gives us the World Bank classification of each country's economy into four groups: High Income, Upper Middle, Lower Middle, and Low Income. We will use the k-nearest neighbors algorithm, trained on one year's data, to predict each country's income group in the following year.

Part Three: Linear Regressions. In this portion, we will obtain data on the Human Development Index (HDI) from http://hdr.undp.org/en/data. The HDI is a summary measure of average achievement in the key dimensions of human development: a long and healthy life, being knowledgeable, and having a decent standard of living. We will then use linear regressions to see if there is a correlation between the Happiness Score from our data set and the HDI score for the countries from the years 2015-2018. We will analyze how the two measures compare, and what the linear regressions can tell us about what else may factor into the happiness score.

Required Tools

For this tutorial we will be using Jupyter Notebooks as our platform for code development since it is an open source application. It allows one to create and share documents that contain live code, equations, visualizations, and narrative text. Uses include data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.

The libraries used within our tutorial include pandas, for reading and tidying the data, and pycountry, for standardizing country names.

Data Collection

The first step in making our tutorial was finding data. We found our dataset on Kaggle, a website that hosts a large number of datasets on a variety of topics. From there we were able to download five separate files containing Happiness Index data from the years 2015-2019 inclusive. The downloaded data came in an Excel sheet format, which we were easily able to convert to comma separated value (CSV) files. We downloaded the data files to our Jupyter Notebook and placed all our data in a separate folder titled ‘data’. We used the pandas read_csv() function to read in our data and present it in a table format.
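The loading step can be sketched as follows. This is a minimal sketch, assuming the five files are named 2015.csv through 2019.csv inside the data folder; the file names and the helper load_happiness_data are illustrative.

```python
from pathlib import Path

import pandas as pd


def load_happiness_data(folder="data", years=range(2015, 2020)):
    """Read one CSV per year and return a dict mapping year -> DataFrame."""
    frames = {}
    for year in years:
        # Assumed naming scheme: data/2015.csv, data/2016.csv, ...
        frames[year] = pd.read_csv(Path(folder) / f"{year}.csv")
    return frames
```

Calling load_happiness_data() and then frames[2015].head() would display the first rows of the 2015 table.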

Data Tidying and Cleaning

Inconsistent column names

To clean and tidy our data it was important to make sure that the datasets for all the years 2015-2019 were consistent. When looking at the CSV files we noticed that the data columns were named slightly differently from year to year, and that some files had extra columns. This inconsistency made it difficult to properly categorize the data, especially with regard to region. To fix this we renamed the columns in all the datasets to follow the 2015 data set model, since it had all the information we needed. The 2017, 2018, and 2019 data files differed more from the 2015 naming than the 2016 file did, so we renamed their columns explicitly to match.
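A minimal sketch of the renaming step is shown here. The column names on both sides of the mapping are assumptions for illustration and should be checked against df.columns for each year; the helper match_2015_columns is hypothetical.

```python
import pandas as pd

# Illustrative mapping from later-year column names to the 2015 names.
# Both the keys and the values are assumptions; verify them per file.
RENAME_TO_2015 = {
    "Country or region": "Country",
    "Score": "Happiness Score",
    "Happiness.Score": "Happiness Score",
    "GDP per capita": "Economy (GDP per Capita)",
    "Economy..GDP.per.Capita.": "Economy (GDP per Capita)",
}


def match_2015_columns(df, mapping=RENAME_TO_2015):
    """Return a copy of df whose columns follow the 2015 naming model."""
    # rename() ignores mapping keys that are not present in df.
    return df.rename(columns=mapping)
```

Keeping the mapping in one dictionary means a single function can be applied to every year's table, and columns that already match are left untouched.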

Inconsistent country names

When obtaining our second dataset we decided to import pycountry, which allows us to standardize the country names. Some countries were spelled differently from one dataset to another, so standardizing gave us data frames that all follow the same country naming convention. This makes it much simpler to locate a given country across all the datasets.
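One way to standardize a name with pycountry is sketched here. It relies on pycountry.countries.lookup, which accepts country names and ISO codes and raises LookupError when nothing matches. The helper standardize_country and its optional lookup parameter are hypothetical; the parameter exists only so the helper can be tried with a stand-in lookup when pycountry is not installed.

```python
def standardize_country(name, lookup=None):
    """Return the standardized country name, or None if it is not recognized."""
    if lookup is None:
        # Imported lazily so a stand-in lookup can be supplied instead.
        import pycountry
        lookup = pycountry.countries.lookup
    try:
        return lookup(name.strip()).name
    except LookupError:
        # Unrecognized names are flagged with None rather than guessed.
        return None
```

Names that come back as None can then be reviewed by hand instead of being silently dropped.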

Some countries are not in all the datasets

The next thing we did to clean our data was to keep only the countries that appear in all five datasets. The datasets were inconsistent in which countries they included, so if a country was present in only four of the five datasets, we left it out of our cleaned data and further analysis. This ensures that we use only complete data, and the same set of countries, when grouping our data by year.
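This filter amounts to a set intersection over the country columns. A minimal sketch, assuming the yearly tables are held in a dict keyed by year and that the country column is named Country (both assumptions):

```python
import pandas as pd


def keep_common_countries(frames, column="Country"):
    """Keep only rows whose country appears in every yearly DataFrame."""
    country_sets = [set(df[column]) for df in frames.values()]
    # Countries present in all years.
    common = set.intersection(*country_sets)
    return {
        year: df[df[column].isin(common)].reset_index(drop=True)
        for year, df in frames.items()
    }
```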

Add region column for certain datasets

The 2017, 2018, and 2019 datasets also did not have a region column, so we used the country-region pairs from the 2015 dataset to add the region to those datasets. Our first idea was to go through the data sets by hand, find where they differed from the 2015 one, and add the region column manually. To reduce the human error that mistyping could cause, we instead wrote a short Python script to do this for us. Likewise, rather than modifying the inconsistent column names in the Excel files themselves, we renamed them in code.
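The region lookup can be sketched as a mapping built from the 2015 table. This is a minimal sketch; the column names Country and Region and the helper add_region are assumptions.

```python
import pandas as pd


def add_region(df, reference, country_col="Country", region_col="Region"):
    """Add a region column to df using the country-region pairs in reference."""
    # Series indexed by country, with the region as its value.
    region_by_country = reference.set_index(country_col)[region_col]
    out = df.copy()
    # Countries missing from the reference get NaN rather than a guessed region.
    out[region_col] = out[country_col].map(region_by_country)
    return out
```

For example, add_region(df_2017, df_2015) would return a copy of the 2017 table with a Region column filled in from 2015.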

Exploratory Analysis and Data Visualization

Number of Countries in Each Region

Below we count how many countries fall in each region. This is useful because a region with more countries will tend to show a wider spread of scores than a region with only a few countries. Knowing the counts matters because they affect how the data appears visually and how much data there is to draw from for a given region.
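The count itself is a one-line pandas operation. A minimal sketch, assuming the region column is named Region:

```python
import pandas as pd


def countries_per_region(df, region_col="Region"):
    """Return the number of rows (countries) in each region, largest first."""
    return df[region_col].value_counts()
```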

Analysis of Number of Countries in Region

As shown above, each region contains a different number of countries, ranging from 2 countries (North America, and Australia and New Zealand) to 31 countries (Sub-Saharan Africa). This means that the regions are not all drawing on the same amount of data. As a result, the boxplots for regions with fewer countries will visibly reflect this lack of data, appearing very small and compact.

Part One: Boxplots of Data

For the first part of our analysis we will compare the happiness score across the different regions. The data will be displayed in 5 separate graphs, with each graph representing one year. There will be a total of 10 boxplots on each graph since there are 10 predefined regions. Afterward, we will calculate the median happiness score for each region for each year.
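The median calculation can be sketched with a pandas groupby. This is a minimal sketch, assuming the yearly tables are held in a dict keyed by year and that the columns are named Region and Happiness Score (assumptions).

```python
import pandas as pd


def median_score_by_region(frames, region_col="Region",
                           score_col="Happiness Score"):
    """Return median scores with one row per region and one column per year."""
    medians = {
        year: df.groupby(region_col)[score_col].median()
        for year, df in frames.items()
    }
    # Building a DataFrame from a dict of Series aligns them on region name.
    return pd.DataFrame(medians)
```

For the plots themselves, one option is df.boxplot(column="Happiness Score", by="Region") on each year's table, which requires matplotlib to be installed.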

Analysis of Boxplots

From the boxplots we can see that Western Europe, North America, and Australia and New Zealand have the highest happiness scores. However, the data for North America and Australia/New Zealand are tightly clustered, whereas the Western Europe data shows a wider spread. The narrow spread in the Australia/New Zealand and North America regions can be attributed to the small number of countries in those regions. This can be seen in every year of the dataset. The widest spread appears in the Middle East/North Africa region: in all five years the scores are far apart, meaning there is a large range between the countries in that region.

Analysis of Medians

From the medians, it can be seen that the happiness score for the North American region decreases slightly each year. The happiness score for the Sub-Saharan African region, on the other hand, is slightly increasing.

Part Two: Income Classification Based on Happiness Metrics

In this section we will use machine learning, specifically the k-nearest neighbors algorithm, to predict the income status of countries. We retrieved data from https://datahelpdesk.worldbank.org/knowledgebase/articles/906519, which classifies each country's income level as High, Upper Middle, Lower Middle, or Low. Using the subcategories that are used to calculate the Happiness Index, we will use the k-nearest neighbors algorithm to predict each country's income level.

Data Cleaning and Tidying for Income Status

Once again, we were able to use pycountry to standardize all the country names so that it was easier to locate their data. We first restricted the dataset to the years 2015-2019. We then filtered the country names and kept only the ones that also appear in the Happiness Index dataset. Finally, we created a new dictionary holding the Happiness Index data with a new column containing each country's income status level.

Adding Income Class Column

This step adds a column titled Income Class to the Happiness Index dataset.
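One way to add the column while keeping only the countries common to both tables is an inner merge. This is a minimal sketch; the helper attach_income_class and the assumption that the income table has Country and Income Class columns are ours.

```python
import pandas as pd


def attach_income_class(happiness, income, country_col="Country",
                        income_col="Income Class"):
    """Add the income class to each happiness row, keeping shared countries only."""
    # how="inner" drops countries that appear in only one of the two tables.
    return happiness.merge(income[[country_col, income_col]],
                           on=country_col, how="inner")
```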

Using the K-Nearest Neighbors Classifier

For our data, we decided to use the previous year's data to predict the data of the upcoming year. We chose the 6 subcategories that are used to create the Happiness Score as our variables when deciding what the Income Class could be. The train year is the year whose data we use to predict the succeeding year's income class. The results printed below give the accuracy of our predictions using this algorithm.
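To make the train-year and test-year idea concrete, here is a from-scratch sketch of k-nearest neighbors classification using only the standard library. It is an illustration of the technique, not a record of the exact code or settings used: in practice a library class such as scikit-learn's KNeighborsClassifier is the usual choice, and the value k=5 is an assumption because the tutorial does not state the k that was used.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=5):
    """Predict the class of query by majority vote among its k nearest rows."""
    # Euclidean distance from the query to every training row.
    distances = sorted(
        (math.dist(row, query), label) for row, label in zip(train_X, train_y)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


def accuracy(train_X, train_y, test_X, test_y, k=5):
    """Fraction of test rows whose predicted class matches the true class."""
    hits = sum(
        knn_predict(train_X, train_y, x, k) == y for x, y in zip(test_X, test_y)
    )
    return hits / len(test_y)
```

Here train_X would hold the six subcategory values for each country in the train year, train_y the matching income classes, and test_X and test_y the same quantities for the following year.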

Analysis

From the results we can see that the k-nearest neighbors algorithm predicts the correct income class 80-86% of the time across our four train/test year pairs: 2015(train)-2016(test), 2016(train)-2017(test), 2017(train)-2018(test), 2018(train)-2019(test). This suggests that there is a relationship between the six subcategories of the Happiness Index data set and a country's income level. In other words, we can predict a country's income class reasonably well from the six parameters that are used to calculate the Happiness Score.

We chose k-nearest neighbors as our algorithm because it is a supervised method: it uses labeled examples to classify new observations, assigning each one the majority class among its k closest labeled neighbors. For our dataset, the labeled examples are the previous year's data, which contain the six subcategories used to calculate the Happiness Index along with each country's income class level. We use this information to predict the succeeding year's income class levels from the six subcategories of the Happiness Index.

Part Three: Linear Regressions

We have collected data from the United Nations Human Development Reports. This data shows each country's Human Development Index from the years 2015-2018. We will use linear regressions to see if there is an increasing trend that correlates with the Happiness Score. The Human Development Index measures human development by combining factors such as life expectancy, standard of living, and education. As a group, we wanted to see if there was any linear correlation between the Human Development Index score and the Happiness Score among countries.

Tidying and Cleaning HDI Dataset

The data given to us was poorly formatted, since a single column contained all the data for each year for each country. We had to separate the columns by year to make the data more readable while also removing all the 'nan' columns. To separate the columns we had to create a new dataframe and extract the data from the older one. Another problem we ran into was inconsistent country naming. To resolve this issue we used pycountry, which standardized the naming of the countries.

Hypothesis

Our null hypothesis is that there is no positive correlation between the Happiness Index and Human Development Index scores.

Regression Model and Graph

Below we graph our data in a scatterplot and add a line of best fit to see if there is any sort of linear trend.
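The line of best fit and its R^2 value can be computed from the standard least-squares formulas. The sketch below uses plain Python so each step is visible; treating the HDI score as x and the Happiness Score as y is an assumption, and the helper fit_line is hypothetical. A statistics library would normally be used to obtain the full OLS summary, including the p-value.

```python
def fit_line(x, y):
    """Least-squares fit of y = slope * x + intercept; also returns R^2."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of squares of x and the cross-product sum of x and y.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - (residual sum of squares) / (total sum of squares).
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r_squared = 1 - ss_res / ss_tot
    return slope, intercept, r_squared
```

For a regression with a single predictor, R^2 is the square of the correlation coefficient between the two variables.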

Analysis of Regression Graph

From the graph above we can see a positive linear trend in the data. Most countries with a high Happiness Score also have a high Human Development Index score. The R^2 value of 0.61 (a correlation coefficient of about 0.78) shows a moderately strong positive correlation between Happiness Score and HDI score. The R^2 value is not high enough to claim a very strong relationship between the two variables, but the graph shows a clear positive trend between the two data sets.

From the OLS model summary we can see that the p-value is very close to 0. Because the p-value is so small, we reject the null hypothesis stated above and conclude that the Happiness Score and the Human Development Index score are positively related.

Conclusion

In the first part of the project, we set out to categorize the countries by region and see how they compared to one another in their happiness index. Using Python libraries such as pandas, we were able to read multiple CSV files, which we then used to create the boxplots of countries. Before we could create our visualizations, we needed to clean the data by fixing inconsistent column names and country names, removing countries that were not in all the data sets, and adding required columns. After this was finished, we could move on to plotting and analysis.

By plotting the data in boxplots to analyze which regions have better scores, we were able to get a visual representation of which parts of the world tend to rank higher in happiness. The data already grouped countries by region, so we wanted to see how the regions compared with one another. This is useful because regions tend to share cultural, political, and economic ties, so grouping them puts into perspective both how regions compare to one another and how much countries can differ within one region. Some boxplots were small and tightly packed while others were much more spread out.

For the first section, it should also be noted that not every region has the same number of countries. One region had 2 while another had as many as 31 countries. This is important to consider because it means that the regions are not all drawing on the same amount of data. Regions with fewer countries might not have the chance to show spread-out data simply because there isn’t enough data to produce it. On a similar note, not every country was represented in the data we obtained, so there are missing pieces of data for many of the regions. Most countries are represented, however, and assuming the missing ones follow similar patterns, we expect the boxplots and analysis to remain broadly accurate.

For the second part of the project we imported data from the World Bank dataset that classifies countries by their Income Status Level. Countries were classified into four categories: High, Upper Middle, Lower Middle, and Low. The countries were assigned their income level based on the World Bank Atlas Method. For our machine learning analysis, we wanted to see if the six subcategories used to calculate the Happiness Score could also be used to predict the income status level of the countries.

For our machine learning analysis we decided to use the k-nearest neighbors algorithm to predict the Income Class Level of countries based on their Happiness Score subcategories: Economy (GDP per capita), Social support, Health (Life Expectancy), Freedom, Trust (Government Corruption), and Generosity. We used the previous year's data as our training data to predict the succeeding year's income class level. We used this model across four different test year pairs and reached up to an 86% level of accuracy. From this we were able to conclude that there may be a relationship between the subcategories of the Happiness Score and a country's income class level. For a hypothetical country for which the six parameters were given to us, we could predict its income class level with up to 86% accuracy.

In the next part of the project, we obtained data on the Human Development Index (HDI), which measures the average achievement in the key dimensions of human development, highlighting which countries maintain a good standard of living and quality of life. The data showed the HDI from 2015-2018, similar years to the ones we used for the previous part. Using linear regressions, we tested whether there is a correlation between the Happiness Score from our dataset and the HDI score for the countries from the years 2015-2018 inclusive. After creating the linear regressions, we analyzed how the two measures compared, and what the linear regressions could tell us about what else may factor into the happiness score.

Just like in the previous part, we had to tidy and clean the Human Development Index data. The data was very poorly formatted: all the data for each country and each year appeared in a single column, and many NaN columns appeared as a result. We had to separate the data by year, and to separate the columns we had to create a new dataframe and extract the data from the older one. Another problem we ran into was inconsistent country naming, so we used pycountry, which standardized the naming of the countries.

Based on the linear regressions, we saw that there was a correlation between the Happiness Score and the HDI score, with the two numbers following a pattern with one another. Because the HDI combines factors such as life expectancy, standard of living, and education to measure human development, it also ties into the Happiness Score: even though the HDI isn’t a part of it, both are measured through similar things. It should be noted that the Happiness Score isn’t an exact science; it uses a multitude of factors to try to determine the happiness of countries, but that doesn’t mean it perfectly reflects how people in a given country really feel. There may be more factors that need to be considered, and one of them could be the Human Development Index.

Sources

https://www.kaggle.com/unsdsn/world-happiness

http://hdr.undp.org/en/data

https://jupyter.org/

https://scholarworks.waldenu.edu/cgi/viewcontent.cgi?article=1131&context=jsc

http://hdr.undp.org/en/content/human-development-index-hdi

https://datahelpdesk.worldbank.org/knowledgebase/articles/906519