I. Introduction:

Is there a perfect measurement of a country’s happiness score? How does the happiness of one continent compare to another, and what distinguishes a corrupt country from its more stable and happy counterpart? Our data consist of national average responses to survey questions about each variable. We use the “Happiness & Corruption 2015-2020” dataset, which combines two other datasets from Kaggle (the “World Happiness Report” by the Gallup World Poll and the “Corruption Perceptions Index” by Transparency International: https://www.kaggle.com/datasets/eliasturk/world-happiness-based-on-cpi-20152020), to analyze the happiness score and its determining variables, such as family, freedom, government trust, generosity, and GDP per capita, for each country and continent. One variable in particular, the Corruption Perceptions Index (CPI), measures how corrupt the public sector of each country is perceived to be. Our goal is to analyze the statistical and practical significance of the determinants of happiness by region and to understand how they change over time in order to create predictions for 2021 and 2022.

II. Ethical consideration:

Since this is public data, and the creators did not collect it out of ill will, there are few ethical considerations. However, if this were a more official government evaluation of happiness, then depending on the regime, governments might try to distort the “government trust” variable to hide the reality of their administration. Additionally, the results for the “family” variable might affect how the cultures of countries with very low “family” scores are perceived. Overall, we understand that this data is subjective and does not capture every idea of happiness for each country, region, or continent.

III. Data explanation and exploration:

Our data set contains several variables that we incorporated into our analysis. These include the happiness score and its corresponding determinants of happiness (variables that were used to calculate the happiness score), all of which are numerical variables with decimal values. The determinants of happiness are GDP per capita, family, health, freedom, generosity, government trust, and social support. The “cpi_score” variable gives the Corruption Perceptions Index (CPI) score for each observation, and it is a numerical integer variable. The higher the CPI score, the less corrupt the country’s public sector is perceived to be. In our regression on the entire data set, the estimated CPI coefficient is negative, unlike the coefficients of the other independent variables, so we treat the CPI score as having a negative relationship with the happiness score in that model. The “Year” variable gives the year, between 2015 and 2020 (inclusive), from which the World Happiness Report and Corruption Perceptions Index data come. There are also two geographical variables, “Country” and “continent”, that assign a country and a continent, respectively, to each observation. Each observation shows the happiness score, the scores of the determinants of happiness, and the CPI score of a particular country in a particular year.

Additionally, the variable “dystopia_residual” compares the happiness score of each observation to a hypothetical “dystopia” with the lowest possible happiness scores. The “dystopia_residual” variable was the only variable we did not consider a key variable, as it is a comparison variable that was not relevant to our analysis process. One particular challenge we faced with our data was that the numerical value “0” was used to mark missing (NA) values. Since these zeros would distort our calculations, we had to convert them to “NA” during the initial data wrangling process. Moreover, we had to rescale the integer values in the “cpi_score” variable to the same scale as the variables that determine happiness, because we wanted to compare those variables in our linear regression models. The data visualizations below demonstrate some of our findings during the data exploration process.
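The two wrangling steps described above can be sketched in base R as follows. This is only an illustration on a made-up three-row data frame: the object name `whc`, the list of columns treated as affected, and the divisor `cpi_divisor` are assumptions, since the report does not state the exact columns or the scaling factor that were used.

```r
# Toy data frame standing in for the real data set (all values are made up).
whc <- data.frame(
  happiness_score  = c(7.5, 5.2, 0),
  government_trust = c(0.41, 0, 0.12),
  cpi_score        = c(88, 35, 0)
)

# Columns in which a 0 is assumed to encode a missing value (hypothetical list).
measure_cols <- c("happiness_score", "government_trust", "cpi_score")

# Replace every 0 in those columns with NA, one column at a time.
whc[measure_cols] <- lapply(whc[measure_cols], function(x) replace(x, x == 0, NA))

# Rescale the CPI score. The divisor is a hypothetical choice for illustration,
# not the value used in the report.
cpi_divisor <- 10
whc$cpi_score <- whc$cpi_score / cpi_divisor

whc
```

Once the zeros are NA, `lm()` drops the affected rows by default, which is why the model summaries report observations "deleted due to missingness".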

Above are the box plots demonstrating the distributions of happiness scores across the continents as well as the overall worldwide happiness score distribution. Based on the box plots, the mean happiness scores for Australia, Europe, North America, and South America are above the worldwide mean, while the means for the remaining continents appear to fall below it. Ranked in descending order of mean happiness score, the continents are: Australia, North America, South America, Europe, Asia, and Africa.

Above are the histograms showing the distributions of the happiness scores for each year from 2015 to 2020. Overall, the histograms appear approximately normally distributed with a mean of approximately 5, and the scores in every year range from about 0 to 8. This suggests that global happiness did not change much over this time period. Regarding abnormalities, in 2015 the modal bin has a noticeably high count (above 15) at a happiness score of about 5.25 on the 0-10 scale, which means that many more countries shared a happiness level near 5.25 in that year. Another noticeable feature is that the 2020 histogram seems to contain more countries with higher happiness scores than the previous years, which suggests that a fair number of countries became happier in 2020.

IV. Statistical Analysis And Interpretation:

Our statistical analysis process was structured to find the variables that most influenced happiness in different continents and in different years. First, we divided our main data set into smaller data sets that only contained data for a certain continent or a certain year. Then, for each of those data subsets, we made an initial linear regression model with all of the determinants of happiness and the CPI score as the independent variables, and the happiness score as the dependent variable. After making this initial model, we used two model selection techniques, stepwise model selection and Monte-Carlo cross-validation, to find out which combination of independent variables influenced happiness the most for each data subset. Using the independent variables chosen by model selection, we then made a new linear regression model for each data subset and computed the model summary. Finally, we used the magnitude of the coefficients in the model summary to determine which variables in the new model impacted the happiness score the most: the larger the absolute value of a coefficient, the higher its impact on the happiness score.
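As a rough sketch of this workflow, the code below splits a data frame by year, fits a full model on each subset, and runs backward stepwise selection with `step()`. The data are simulated stand-ins, only a few of the report's predictors are included to keep the example short, and the object names (`by_year`, `models`) are assumptions rather than the names used in the original analysis.

```r
set.seed(1)

# Simulated stand-in for the real data; column names follow the model output
# shown in this report, but every value is randomly generated.
n <- 120
WHC <- data.frame(
  Year           = rep(2015:2016, each = n / 2),
  gdp_per_capita = runif(n),
  health         = runif(n),
  freedom        = runif(n),
  cpi_score      = runif(n, 1, 9)
)
WHC$happiness_score <- 2 + 1.2 * WHC$gdp_per_capita + 1.5 * WHC$health +
  1.1 * WHC$freedom + rnorm(n, sd = 0.5)

# One data subset per year.
by_year <- split(WHC, WHC$Year)

# For each subset: fit the full model, then let step() drop terms by AIC.
models <- list()
for (yr in names(by_year)) {
  d <- na.omit(by_year[[yr]])
  full <- lm(happiness_score ~ gdp_per_capita + health + freedom + cpi_score,
             data = d)
  models[[yr]] <- step(full, direction = "backward", trace = 0)
}

# Rank the retained predictors in each model by absolute coefficient size
# (the intercept is excluded).
lapply(models, function(m) sort(abs(coef(m)[-1]), decreasing = TRUE))
```

Ranking raw coefficients in this way is only meaningful when the predictors are on comparable scales, which is the reason the report rescales the CPI score beforehand.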

## 
## Call:
## lm(formula = happiness_score ~ gdp_per_capita + health + freedom + 
##     family + generosity + government_trust + cpi_score, data = WHC)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.54755 -0.32006  0.00231  0.31936  1.34798 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       2.10923    0.11010  19.157  < 2e-16 ***
## gdp_per_capita    1.16082    0.13370   8.683  < 2e-16 ***
## health            1.59297    0.21094   7.552 3.25e-13 ***
## freedom           1.16397    0.26720   4.356 1.71e-05 ***
## family            0.54909    0.11720   4.685 3.91e-06 ***
## generosity        0.70851    0.22701   3.121  0.00194 ** 
## government_trust  1.17466    0.32947   3.565  0.00041 ***
## cpi_score        -0.01868    0.02386  -0.783  0.43403    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5358 on 378 degrees of freedom
##   (406 observations deleted due to missingness)
## Multiple R-squared:  0.7869, Adjusted R-squared:  0.7829 
## F-statistic: 199.4 on 7 and 378 DF,  p-value: < 2.2e-16

The first model we made used data from the entire data set. This model, “model1”, has the happiness score as the dependent variable and the determinants of happiness along with the CPI score as the independent variables. To determine whether a different combination of variables predicts the happiness score better than “model1”, we used model selection. The results of the Monte-Carlo cross-validation method of model selection are shown in the data visualization above. The RMSE value on the y-axis measures how far the values predicted by each candidate model are from the observed values, so the lower the RMSE, the closer the predictions are to the observations. In this case, the model with the lowest RMSE was the same as “model1”, so we did not need to construct another linear regression model with different variables. The model summary of “model1” tells us that the y-intercept is 2.10923, which is the predicted happiness score when all of the independent variables are 0. The coefficient of each independent variable gives the expected change in the happiness score for a one-unit increase in that variable, holding the other variables constant. The higher the absolute value of the coefficient, the larger its impact on the happiness score, so we ordered the independent variables by the absolute value of their coefficients to determine which variables impacted happiness the most. We considered absolute values because, as mentioned above, the CPI score has a negative coefficient in this model (-0.01868), unlike the other independent variables, which have positive coefficients. It is the magnitude of the relationship, not its direction, that is relevant when finding a variable’s impact on happiness. Note, however, that the CPI coefficient is not statistically significant in this model (p = 0.434).
Based on the coefficients in the model summary shown above, the variables in this model that impacted happiness the most were health, government trust, freedom, and GDP per capita. The model summary also shows the adjusted R^2 value and the hypothesis test of the model. The adjusted R^2 value of this model is 0.7829, which means approximately 78.3% of the variation in the happiness score can be explained by the independent variables. This adjusted R^2 value is high, so the model appears to fit well and this combination of independent variables accounts for most of the variation in the happiness score. To check whether these results are statistically significant, we considered the overall F-test. The p-value of the model is < 2.2e-16, with alpha set at 0.05 and an F-statistic of 199.4 on 7 and 378 degrees of freedom. As the p-value is less than alpha, we reject the null hypothesis that all of the slope coefficients are 0. Therefore, we can conclude that there is evidence of a statistically significant relationship between the variables in this model. Although the relationship is statistically significant, we cannot infer that it is causal. Moreover, the linear regression model summary only describes the linear relationship between the variables. If the variables have a nonlinear relationship, this may not be the best method of finding the magnitude of the relationships between them. However, this problem may not be relevant to our data, as the scatterplots we made of the happiness score against each independent variable showed linear relationships.
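The Monte-Carlo cross-validation used for model selection can be sketched as follows: each candidate model is repeatedly fitted on a random training split and scored by RMSE on the held-out rows, and the candidate with the lowest average RMSE is preferred. The data are simulated, and the function name `mc_cv_rmse`, the number of repetitions, and the 80/20 split are assumptions for illustration; the report does not state the settings that were actually used.

```r
set.seed(42)

# Simulated stand-in data (not the report's data).
n <- 200
d <- data.frame(
  gdp_per_capita = runif(n),
  health         = runif(n),
  cpi_score      = runif(n, 1, 9)
)
d$happiness_score <- 2 + 1.2 * d$gdp_per_capita + 1.5 * d$health +
  rnorm(n, sd = 0.5)

# Monte-Carlo cross-validation: repeated random train/test splits,
# returning the test RMSE averaged over all repetitions.
mc_cv_rmse <- function(formula, data, reps = 100, train_frac = 0.8) {
  response <- all.vars(formula)[1]
  rmse <- numeric(reps)
  for (i in seq_len(reps)) {
    train_idx <- sample(nrow(data), size = floor(train_frac * nrow(data)))
    fit  <- lm(formula, data = data[train_idx, ])
    test <- data[-train_idx, ]
    rmse[i] <- sqrt(mean((test[[response]] - predict(fit, newdata = test))^2))
  }
  mean(rmse)
}

# Candidate models to compare (a small hypothetical set).
candidates <- list(
  gdp_only   = happiness_score ~ gdp_per_capita,
  gdp_health = happiness_score ~ gdp_per_capita + health,
  all_three  = happiness_score ~ gdp_per_capita + health + cpi_score
)

cv_results <- sapply(candidates, mc_cv_rmse, data = d)
cv_results
```

In this simulation `health` truly affects the response, so the model that omits it should show a clearly higher RMSE, while adding the irrelevant `cpi_score` should change the RMSE very little.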

IV.1. Analysis based on year:

## Start:  AIC=-173.04
## happiness_score ~ gdp_per_capita + health + freedom + family + 
##     generosity + government_trust + cpi_score
## 
##                    Df Sum of Sq    RSS     AIC
## - cpi_score         1    0.0155 30.384 -174.97
## <none>                          30.368 -173.04
## - government_trust  1    0.6033 30.971 -172.48
## - generosity        1    0.6474 31.016 -172.30
## - freedom           1    1.7499 32.118 -167.75
## - health            1    2.7079 33.076 -163.93
## - gdp_per_capita    1    3.7070 34.075 -160.06
## - family            1    5.0780 35.446 -154.94
## 
## Step:  AIC=-174.97
## happiness_score ~ gdp_per_capita + health + freedom + family + 
##     generosity + government_trust
## 
##                    Df Sum of Sq    RSS     AIC
## <none>                          30.384 -174.97
## - generosity        1    0.6675 31.051 -174.15
## - government_trust  1    0.8533 31.237 -173.37
## - freedom           1    1.8558 32.240 -169.26
## - health            1    2.8752 33.259 -165.22
## - gdp_per_capita    1    4.2837 34.667 -159.82
## - family            1    5.0625 35.446 -156.94

## 
## Call:
## lm(formula = happiness_score ~ gdp_per_capita + health + freedom + 
##     family + generosity + government_trust, data = WHC_2015)
## 
## Coefficients:
##      (Intercept)    gdp_per_capita            health           freedom  
##           1.6747            0.9766            1.3085            1.1185  
##           family        generosity  government_trust  
##           1.0730            0.5806            0.9756

## 
## Call:
## lm(formula = happiness_score ~ gdp_per_capita + health + freedom + 
##     family + generosity + government_trust + cpi_score, data = WHC_2015)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.50181 -0.23610 -0.01687  0.27254  1.07360 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      1.674763   0.217648   7.695 4.10e-12 ***
## gdp_per_capita   0.957079   0.248008   3.859 0.000183 ***
## health           1.290966   0.391403   3.298 0.001275 ** 
## freedom          1.101291   0.415356   2.651 0.009079 ** 
## family           1.076300   0.238295   4.517 1.46e-05 ***
## generosity       0.573638   0.355698   1.613 0.109391    
## government_trust 0.911761   0.585681   1.557 0.122120    
## cpi_score        0.009511   0.038079   0.250 0.803185    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4989 on 122 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.8163, Adjusted R-squared:  0.8057 
## F-statistic: 77.43 on 7 and 122 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = happiness_score ~ gdp_per_capita + health + freedom + 
##     family + generosity + government_trust, data = WHC_2015)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.48132 -0.24096 -0.01507  0.26349  1.06903 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.6747     0.2168   7.724 3.39e-12 ***
## gdp_per_capita     0.9766     0.2345   4.164 5.83e-05 ***
## health             1.3085     0.3836   3.412 0.000874 ***
## freedom            1.1185     0.4081   2.741 0.007039 ** 
## family             1.0730     0.2370   4.527 1.39e-05 ***
## generosity         0.5806     0.3532   1.644 0.102771    
## government_trust   0.9756     0.5249   1.859 0.065470 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.497 on 123 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.8162, Adjusted R-squared:  0.8072 
## F-statistic: 91.02 on 6 and 123 DF,  p-value: < 2.2e-16

The initial linear regression model here came from a subset of our data set that only contains data for the year 2015. As with the previous initial model, the happiness score is the dependent variable, while the determinants of happiness and the CPI score are the independent variables. The model selection results show that the best combination of independent variables for the happiness score in 2015 includes all of the determinants of happiness and excludes the CPI score. Based on these results, we constructed a linear regression model and used its model summary for analysis. The y-intercept of this model is 1.6747, and the independent variables with the highest coefficients were health (1.3085), freedom (1.1185), and family (1.0730). Therefore, the independent variables that most strongly impacted happiness in 2015 were health, freedom, and family. The adjusted R^2 value for this model is 0.8072, and the p-value of the model is < 2.2e-16, with alpha set at 0.05 and 123 residual degrees of freedom. Since the p-value is less than alpha, we reject the null hypothesis that all of the slope coefficients are 0; there is evidence of a statistically significant relationship between the variables in this model.
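The AIC values in the 2015 stepwise output can be reproduced by hand, which makes it clearer why the CPI score was dropped. For a linear model, `step()` reports AIC as n * log(RSS / n) + 2 * k, where k is the number of estimated coefficients including the intercept. The sketch below plugs in the RSS values printed above; the helper name `step_aic` is hypothetical, and n = 130 is inferred from the 122 residual degrees of freedom plus 8 coefficients in the full model.

```r
# AIC on the scale reported by step() for a linear model:
# n * log(RSS / n) + 2 * k, with k = number of coefficients (intercept included).
step_aic <- function(rss, n, k) n * log(rss / n) + 2 * k

n_obs <- 130  # inferred: 122 residual df + 8 coefficients in the full 2015 model

aic_full    <- step_aic(rss = 30.368, n = n_obs, k = 8)  # all seven predictors
aic_reduced <- step_aic(rss = 30.384, n = n_obs, k = 7)  # cpi_score removed

round(c(full = aic_full, reduced = aic_reduced), 2)
```

Removing `cpi_score` barely increases the RSS but saves one parameter, so the AIC falls from about -173.04 to about -174.97 and the reduced model is preferred.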

## 
## Call:
## lm(formula = happiness_score ~ gdp_per_capita + health + freedom + 
##     family + generosity + government_trust + cpi_score, data = WHC_2016)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.44935 -0.29933 -0.02047  0.30462  1.37562 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       1.76709    0.21759   8.121 4.64e-13 ***
## gdp_per_capita    0.91898    0.27502   3.341 0.001112 ** 
## health            1.53270    0.40174   3.815 0.000217 ***
## freedom           1.00249    0.47848   2.095 0.038261 *  
## family            1.17157    0.26189   4.474 1.76e-05 ***
## generosity        0.82414    0.43553   1.892 0.060866 .  
## government_trust  1.18201    0.54136   2.183 0.030951 *  
## cpi_score        -0.02139    0.04143  -0.516 0.606542    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.546 on 120 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.7949, Adjusted R-squared:  0.783 
## F-statistic: 66.46 on 7 and 120 DF,  p-value: < 2.2e-16

## Start:  AIC=-147.16
## happiness_score ~ gdp_per_capita + health + freedom + family + 
##     generosity + government_trust + cpi_score
## 
##                    Df Sum of Sq    RSS     AIC
## - cpi_score         1    0.0795 35.857 -148.88
## <none>                          35.778 -147.16
## - generosity        1    1.0676 36.845 -145.40
## - freedom           1    1.3088 37.087 -144.56
## - government_trust  1    1.4214 37.199 -144.18
## - gdp_per_capita    1    3.3289 39.107 -137.77
## - health            1    4.3397 40.118 -134.51
## - family            1    5.9666 41.744 -129.42
## 
## Step:  AIC=-148.88
## happiness_score ~ gdp_per_capita + health + freedom + family + 
##     generosity + government_trust
## 
##                    Df Sum of Sq    RSS     AIC
## <none>                          35.857 -148.88
## - generosity        1    0.9954 36.853 -147.37
## - freedom           1    1.2550 37.112 -146.47
## - government_trust  1    1.3774 37.235 -146.05
## - gdp_per_capita    1    3.3827 39.240 -139.34
## - health            1    4.2637 40.121 -136.50
## - family            1    5.9470 41.804 -131.24

## 
## Call:
## lm(formula = happiness_score ~ gdp_per_capita + health + freedom + 
##     family + generosity + government_trust, data = WHC_2016)
## 
## Coefficients:
##      (Intercept)    gdp_per_capita            health           freedom  
##           1.7673            0.8703            1.5105            0.9761  
##           family        generosity  government_trust  
##           1.1695            0.7810            1.0751

## 
## Call:
## lm(formula = happiness_score ~ gdp_per_capita + health + freedom + 
##     family + generosity + government_trust, data = WHC_2016)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.42643 -0.30534 -0.01359  0.31677  1.42252 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.7673     0.2169   8.147 3.88e-13 ***
## gdp_per_capita     0.8703     0.2576   3.379 0.000981 ***
## health             1.5105     0.3982   3.793 0.000234 ***
## freedom            0.9761     0.4743   2.058 0.041745 *  
## family             1.1695     0.2611   4.480 1.71e-05 ***
## generosity         0.7810     0.4262   1.833 0.069303 .  
## government_trust   1.0751     0.4986   2.156 0.033067 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5444 on 121 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.7945, Adjusted R-squared:  0.7843 
## F-statistic: 77.97 on 6 and 121 DF,  p-value: < 2.2e-16

The initial linear regression model here came from a subset of our data set that only contains data for the year 2016. As with the previous initial models, the happiness score is the dependent variable, while the determinants of happiness and the CPI score are the independent variables. The model selection results show that the best combination of independent variables for the happiness score in 2016 includes all of the determinants of happiness and excludes the CPI score. Based on these results, we constructed a linear regression model and used its model summary for analysis. The y-intercept of this model is 1.7673, and the independent variables with the highest coefficients were health (1.5105), family (1.1695), and government trust (1.0751). Therefore, the independent variables that most strongly impacted happiness in 2016 were health, family, and government trust. The adjusted R^2 value for this model is 0.7843, and the p-value of the model is < 2.2e-16, with alpha set at 0.05 and 121 residual degrees of freedom. Since the p-value is less than alpha, we reject the null hypothesis that all of the slope coefficients are 0; there is evidence of a statistically significant relationship between the variables in this model.
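The ranking step for 2016 can be checked with a few lines of R. The coefficient values below are copied from the 2016 model summary above; the vector name `coef_2016` is just a label for this sketch.

```r
# Coefficients of the selected 2016 model, copied from the summary output.
coef_2016 <- c(
  gdp_per_capita   = 0.8703,
  health           = 1.5105,
  freedom          = 0.9761,
  family           = 1.1695,
  generosity       = 0.7810,
  government_trust = 1.0751
)

# Order the predictors by the absolute value of their coefficients.
ranked <- sort(abs(coef_2016), decreasing = TRUE)

# The three predictors with the largest magnitudes.
top3 <- names(ranked)[1:3]
top3
```

With these values the top three are health, family, and government trust, and freedom comes fourth.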

## 
## Call:
## lm(formula = happiness_score ~ gdp_per_capita + health + freedom + 
##     generosity + government_trust + cpi_score, data = WHC_2017)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.93893 -0.34155  0.05238  0.31919  1.03359 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       2.27209    0.23217   9.786  < 2e-16 ***
## gdp_per_capita    1.11972    0.27330   4.097 7.57e-05 ***
## health            1.40076    0.46821   2.992  0.00336 ** 
## freedom           1.52726    0.47356   3.225  0.00162 ** 
## generosity        0.75442    0.58609   1.287  0.20046    
## government_trust  0.02917    0.73394   0.040  0.96836    
## cpi_score         0.08507    0.04808   1.769  0.07936 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5776 on 122 degrees of freedom
## Multiple R-squared:  0.7342, Adjusted R-squared:  0.7212 
## F-statistic: 56.17 on 6 and 122 DF,  p-value: < 2.2e-16

## Start:  AIC=-134.81
## happiness_score ~ gdp_per_capita + health + freedom + generosity + 
##     government_trust + cpi_score
## 
##                    Df Sum of Sq    RSS     AIC
## - government_trust  1    0.0005 40.702 -136.81
## - generosity        1    0.5528 41.255 -135.07
## <none>                          40.702 -134.81
## - cpi_score         1    1.0443 41.746 -133.54
## - health            1    2.9861 43.688 -127.67
## - freedom           1    3.4701 44.172 -126.25
## - gdp_per_capita    1    5.5999 46.302 -120.18
## 
## Step:  AIC=-136.81
## happiness_score ~ gdp_per_capita + health + freedom + generosity + 
##     cpi_score
## 
##                  Df Sum of Sq    RSS     AIC
## - generosity      1    0.5814 41.284 -136.97
## <none>                        40.702 -136.81
## - cpi_score       1    1.5274 42.230 -134.05
## - health          1    2.9858 43.688 -129.67
## - freedom         1    3.5793 44.282 -127.93
## - gdp_per_capita  1    5.6660 46.368 -121.99
## 
## Step:  AIC=-136.98
## happiness_score ~ gdp_per_capita + health + freedom + cpi_score
## 
##                  Df Sum of Sq    RSS     AIC
## <none>                        41.284 -136.97
## - cpi_score       1    1.9164 43.200 -133.12
## - health          1    2.7989 44.083 -130.51
## - freedom         1    5.1418 46.425 -123.83
## - gdp_per_capita  1    5.2875 46.571 -123.43

## 
## Call:
## lm(formula = happiness_score ~ gdp_per_capita + health + freedom + 
##     cpi_score, data = WHC_2017)
## 
## Coefficients:
##    (Intercept)  gdp_per_capita          health         freedom       cpi_score  
##        2.36908         1.07087         1.35156         1.73267         0.09508

## 
## Call:
## lm(formula = happiness_score ~ gdp_per_capita + health + freedom + 
##     cpi_score, data = WHC_2017)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.09132 -0.32208  0.09886  0.33477  1.00877 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     2.36908    0.21400  11.070  < 2e-16 ***
## gdp_per_capita  1.07087    0.26871   3.985 0.000114 ***
## health          1.35156    0.46614   2.899 0.004423 ** 
## freedom         1.73267    0.44090   3.930 0.000140 ***
## cpi_score       0.09508    0.03963   2.399 0.017920 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.577 on 124 degrees of freedom
## Multiple R-squared:  0.7304, Adjusted R-squared:  0.7217 
## F-statistic:    84 on 4 and 124 DF,  p-value: < 2.2e-16

The initial linear regression model here came from a subset of our data set that only contains data for the year 2017. As with the previous initial models, the happiness score is the dependent variable, while the determinants of happiness available for 2017 (all of the earlier ones except family, which does not appear in the 2017 model output) and the CPI score are the independent variables. The model selection results show that the best combination of independent variables for the happiness score in 2017 includes GDP per capita, health, freedom, and the CPI score. Based on these results, we constructed a linear regression model and used its model summary for analysis. The y-intercept of the new model for 2017 is 2.36908, and the independent variables with the highest coefficients were freedom (1.73267), health (1.35156), and GDP per capita (1.07087). Therefore, the independent variables that most strongly impacted happiness in 2017 were freedom, health, and GDP per capita. The adjusted R^2 value for this model is 0.7217, and the p-value of the model is < 2.2e-16, with alpha set at 0.05 and 124 residual degrees of freedom. Since the p-value is less than alpha, we reject the null hypothesis that all of the slope coefficients are 0; there is evidence of a statistically significant relationship between the variables in this model.
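The adjusted R^2 and F-statistic quoted for 2017 follow directly from the multiple R^2, the number of observations, and the number of predictors, so they can be recomputed as a sanity check. In the sketch below, n = 129 is inferred from the 124 residual degrees of freedom plus 5 coefficients; small differences from the printed values come from using the rounded R^2 of 0.7304.

```r
r2 <- 0.7304  # multiple R-squared from the 2017 summary
n  <- 129     # inferred: 124 residual df + 5 coefficients
p  <- 4       # predictors: gdp_per_capita, health, freedom, cpi_score

# Adjusted R-squared penalises R-squared for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Overall F-statistic for H0: all slope coefficients are zero.
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))

round(c(adj_r2 = adj_r2, f_stat = f_stat), 4)
```

Both values agree with the summary output (adjusted R^2 of 0.7217 and F of about 84 on 4 and 124 degrees of freedom) up to rounding.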

## 
## Call:
## lm(formula = happiness_score ~ gdp_per_capita + health + freedom + 
##     generosity + government_trust + cpi_score, data = WHC_2018)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.84668 -0.33267  0.06428  0.35831  1.01233 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       2.32274    0.19293  12.039  < 2e-16 ***
## gdp_per_capita    1.39569    0.25096   5.561 1.63e-07 ***
## health            1.28883    0.42374   3.042 0.002886 ** 
## freedom           1.46652    0.40174   3.650 0.000388 ***
## generosity        0.83046    0.54222   1.532 0.128232    
## government_trust  0.12230    0.67572   0.181 0.856681    
## cpi_score         0.04821    0.04452   1.083 0.281019    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5555 on 121 degrees of freedom
## Multiple R-squared:  0.7688, Adjusted R-squared:  0.7574 
## F-statistic: 67.07 on 6 and 121 DF,  p-value: < 2.2e-16

## Start:  AIC=-143.72
## happiness_score ~ gdp_per_capita + health + freedom + generosity + 
##     government_trust + cpi_score
## 
##                    Df Sum of Sq    RSS     AIC
## - government_trust  1    0.0101 37.342 -145.68
## - cpi_score         1    0.3618 37.694 -144.49
## <none>                          37.332 -143.72
## - generosity        1    0.7237 38.056 -143.26
## - health            1    2.8542 40.186 -136.29
## - freedom           1    4.1113 41.443 -132.35
## - gdp_per_capita    1    9.5428 46.875 -116.58
## 
## Step:  AIC=-145.68
## happiness_score ~ gdp_per_capita + health + freedom + generosity + 
##     cpi_score
## 
##                  Df Sum of Sq    RSS     AIC
## - cpi_score       1    0.5562 37.898 -145.79
## <none>                        37.342 -145.68
## - generosity      1    0.8110 38.153 -144.93
## - health          1    2.8523 40.194 -138.26
## - freedom         1    4.2748 41.617 -133.81
## - gdp_per_capita  1    9.5774 46.919 -118.46
## 
## Step:  AIC=-145.79
## happiness_score ~ gdp_per_capita + health + freedom + generosity
## 
##                  Df Sum of Sq    RSS     AIC
## <none>                        37.898 -145.79
## - generosity      1    1.0505 38.949 -144.29
## - health          1    3.6032 41.502 -136.17
## - freedom         1    5.2650 43.163 -131.14
## - gdp_per_capita  1   12.5957 50.494 -111.06

## 
## Call:
## lm(formula = happiness_score ~ gdp_per_capita + health + freedom + 
##     generosity, data = WHC_2018)
## 
## Coefficients:
##    (Intercept)  gdp_per_capita          health         freedom      generosity  
##         2.2892          1.5100          1.4040          1.5978          0.9606

The initial linear regression model here came from a subset of our data set that only contains data for the year 2018. As with the previous initial models, the happiness score is the dependent variable, while the determinants of happiness available for 2018 (all of the earlier ones except family, which does not appear in the 2018 model output) and the CPI score are the independent variables. The model selection results show that the best combination of independent variables for the happiness score in 2018 includes GDP per capita, health, freedom, and generosity. Based on these results, we constructed a linear regression model and used its model summary for analysis. The y-intercept of the new model for 2018 is 2.2892, and the independent variables with the highest coefficients were freedom (1.5978), GDP per capita (1.5100), and health (1.4040). Therefore, the independent variables that most strongly impacted happiness in 2018 were freedom, GDP per capita, and health. The adjusted R^2 value for this model is 0.7577, and the p-value of the model is < 2.2e-16, with alpha set at 0.05 and 123 residual degrees of freedom. Since the p-value is less than alpha, we reject the null hypothesis that all of the slope coefficients are 0; there is evidence of a statistically significant relationship between the variables in this model.

## 
## Call:
## lm(formula = happiness_score ~ gdp_per_capita + health + freedom + 
##     family + generosity + government_trust + cpi_score, data = WHC_2019)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.37049 -0.26677 -0.01843  0.32268  1.05837 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       2.09855    0.17575  11.941  < 2e-16 ***
## gdp_per_capita    0.94511    0.25598   3.692 0.000336 ***
## health            1.89926    0.42335   4.486 1.67e-05 ***
## freedom           1.05511    0.47389   2.226 0.027848 *  
## family            1.01990    0.25708   3.967 0.000124 ***
## generosity        0.56639    0.38360   1.477 0.142420    
## government_trust  1.40819    0.55506   2.537 0.012464 *  
## cpi_score        -0.04989    0.04180  -1.194 0.234963    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5174 on 120 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.8103, Adjusted R-squared:  0.7993 
## F-statistic: 73.24 on 7 and 120 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = happiness_score ~ gdp_per_capita + health + freedom + 
##     family + generosity + government_trust, data = WHC_2016)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.42643 -0.30534 -0.01359  0.31677  1.42252 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.7673     0.2169   8.147 3.88e-13 ***
## gdp_per_capita     0.8703     0.2576   3.379 0.000981 ***
## health             1.5105     0.3982   3.793 0.000234 ***
## freedom            0.9761     0.4743   2.058 0.041745 *  
## family             1.1695     0.2611   4.480 1.71e-05 ***
## generosity         0.7810     0.4262   1.833 0.069303 .  
## government_trust   1.0751     0.4986   2.156 0.033067 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5444 on 121 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.7945, Adjusted R-squared:  0.7843 
## F-statistic: 77.97 on 6 and 121 DF,  p-value: < 2.2e-16

The initial linear regression model was fit to the subset of our data set that contains only data for the year 2019. As with the previous initial models, the happiness score is the dependent variable, while the determinants of happiness and the CPI score are the independent variables. Model selection indicated that the best combination of independent variables for explaining the happiness score in 2019 includes all of the determinants of happiness and excludes the CPI score. Based on these results, we constructed a linear regression model and used its model summary for analysis. The y-intercept of this model is 2.1025, and the independent variables with the largest coefficients were health, government trust, and family, with health having the largest. Therefore, the independent variables with the largest estimated effects on happiness in 2019 were health, government trust, and family. The adjusted R^2 value for this model is 0.7986, and the p-value of the overall F-test is < 2.2e-16, with alpha set at 0.05 and 121 residual degrees of freedom. Because this p-value is below alpha, we can reject the null hypothesis that all of the slope coefficients are 0; there is evidence of a statistically significant relationship between the happiness score and the independent variables in this model. Note that the second summary printed above was fit to the WHC_2016 subset, so its estimates differ from the 2019 values reported here.

## 
## Call:
## lm(formula = happiness_score ~ gdp_per_capita + health + freedom + 
##     generosity + government_trust + cpi_score, data = WHC_2020)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8817 -0.3259  0.1319  0.3376  0.9592 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      2.313190   0.245516   9.422 3.84e-16 ***
## gdp_per_capita   0.989733   0.304615   3.249  0.00150 ** 
## health           1.564638   0.456816   3.425  0.00084 ***
## freedom          1.500371   0.496016   3.025  0.00304 ** 
## generosity       0.927006   0.573586   1.616  0.10866    
## government_trust 0.007449   0.652481   0.011  0.99091    
## cpi_score        0.081908   0.051910   1.578  0.11720    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6016 on 121 degrees of freedom
## Multiple R-squared:  0.7055, Adjusted R-squared:  0.6909 
## F-statistic: 48.32 on 6 and 121 DF,  p-value: < 2.2e-16

## Start:  AIC=-123.27
## happiness_score ~ gdp_per_capita + health + freedom + generosity + 
##     government_trust + cpi_score
## 
##                    Df Sum of Sq    RSS     AIC
## - government_trust  1    0.0000 43.797 -125.27
## <none>                          43.797 -123.27
## - cpi_score         1    0.9012 44.699 -122.67
## - generosity        1    0.9454 44.743 -122.54
## - freedom           1    3.3118 47.109 -115.94
## - gdp_per_capita    1    3.8212 47.619 -114.57
## - health            1    4.2463 48.044 -113.43
## 
## Step:  AIC=-125.27
## happiness_score ~ gdp_per_capita + health + freedom + generosity + 
##     cpi_score
## 
##                  Df Sum of Sq    RSS     AIC
## <none>                        43.797 -125.27
## - generosity      1    0.9914 44.789 -124.41
## - cpi_score       1    1.3532 45.151 -123.38
## - freedom         1    3.4076 47.205 -117.68
## - gdp_per_capita  1    3.8949 47.692 -116.37
## - health          1    4.2463 48.044 -115.43

## 
## Call:
## lm(formula = happiness_score ~ gdp_per_capita + health + freedom + 
##     generosity + cpi_score, data = WHC_2020)
## 
## Coefficients:
##    (Intercept)  gdp_per_capita          health         freedom      generosity  
##        2.31244         0.98924         1.56460         1.50130         0.92837  
##      cpi_score  
##        0.08225

## 
## Call:
## lm(formula = happiness_score ~ gdp_per_capita + health + freedom + 
##     generosity + cpi_score, data = WHC_2020)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8822 -0.3262  0.1321  0.3372  0.9595 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     2.31244    0.23559   9.816  < 2e-16 ***
## gdp_per_capita  0.98924    0.30033   3.294 0.001294 ** 
## health          1.56460    0.45493   3.439 0.000799 ***
## freedom         1.50130    0.48729   3.081 0.002551 ** 
## generosity      0.92837    0.55865   1.662 0.099119 .  
## cpi_score       0.08225    0.04236   1.941 0.054504 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5992 on 122 degrees of freedom
## Multiple R-squared:  0.7055, Adjusted R-squared:  0.6935 
## F-statistic: 58.46 on 5 and 122 DF,  p-value: < 2.2e-16

The initial linear regression model was fit to the subset of our data set that contains only data for the year 2020. As with the previous initial models, the happiness score is the dependent variable, while the determinants of happiness and the CPI score are the independent variables. Model selection indicated that the best combination of independent variables for explaining the happiness score in 2020 includes GDP per capita, health, freedom, generosity, and the CPI score; government trust was dropped because removing it lowered the AIC from -123.27 to -125.27. Based on these results, we constructed a linear regression model and used its model summary for analysis. The y-intercept of this model is 2.31244, and the independent variables with the largest coefficients were health, freedom, and GDP per capita, with health having the largest. Therefore, the independent variables with the largest estimated effects on happiness in 2020 were health, freedom, and GDP per capita. The adjusted R^2 value for this model is 0.6935, and the p-value of the overall F-test is < 2.2e-16, with alpha set at 0.05 and 122 residual degrees of freedom. Because this p-value is below alpha, we can reject the null hypothesis that all of the slope coefficients are 0; there is evidence of a statistically significant relationship between the happiness score and the independent variables in this model.
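The AIC table above is the kind of trace produced by backward elimination with R's `step()` function. The following is a minimal sketch of that procedure on simulated data; the column names mirror ours, but the values, the data frame `sim`, and the coefficients used to generate it are assumptions made purely for illustration.

```r
# Simulated stand-in for one year's subset (values are NOT from our data set).
set.seed(1)
n <- 128
sim <- data.frame(
  gdp_per_capita   = runif(n, 0, 2),
  health           = runif(n, 0, 1),
  freedom          = runif(n, 0, 0.7),
  government_trust = runif(n, 0, 0.5)
)
# government_trust is deliberately left out of the data-generating process.
sim$happiness_score <- 2.3 + 1.0 * sim$gdp_per_capita + 1.5 * sim$health +
  1.5 * sim$freedom + rnorm(n, sd = 0.6)

# Start from the full model with every candidate predictor
full <- lm(happiness_score ~ gdp_per_capita + health + freedom + government_trust,
           data = sim)

# Backward elimination: at each step, drop the term whose removal lowers AIC
# the most, and stop when no single removal lowers it further.
selected <- step(full, direction = "backward", trace = 0)

formula(selected)                 # the retained predictors
summary(selected)$adj.r.squared   # adjusted R^2 of the selected model
```

Setting `trace = 1` prints the step-by-step AIC tables in the same layout as the output shown above.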

IV.2. Analysis by continent:

## 
## Call:
## lm(formula = happiness_score ~ gdp_per_capita + health + freedom + 
##     family + generosity + government_trust + cpi_score, data = WHC_Africa)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.23701 -0.27779  0.02617  0.28009  1.37341 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       3.14119    0.27249  11.528  < 2e-16 ***
## gdp_per_capita    1.17246    0.23495   4.990 3.16e-06 ***
## health            1.28523    0.35221   3.649 0.000453 ***
## freedom           1.02147    0.54010   1.891 0.061996 .  
## family            0.29955    0.20772   1.442 0.152962    
## generosity        0.63987    0.73715   0.868 0.387821    
## government_trust -0.51843    0.70394  -0.736 0.463470    
## cpi_score        -0.17530    0.06249  -2.805 0.006228 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4766 on 85 degrees of freedom
##   (99 observations deleted due to missingness)
## Multiple R-squared:  0.586,  Adjusted R-squared:  0.5519 
## F-statistic: 17.18 on 7 and 85 DF,  p-value: 5.69e-14

## 
## Call:
## lm(formula = happiness_score ~ gdp_per_capita + health + freedom + 
##     family + generosity + government_trust + cpi_score, data = WHC_Africa)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.23701 -0.27779  0.02617  0.28009  1.37341 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       3.14119    0.27249  11.528  < 2e-16 ***
## gdp_per_capita    1.17246    0.23495   4.990 3.16e-06 ***
## health            1.28523    0.35221   3.649 0.000453 ***
## freedom           1.02147    0.54010   1.891 0.061996 .  
## family            0.29955    0.20772   1.442 0.152962    
## generosity        0.63987    0.73715   0.868 0.387821    
## government_trust -0.51843    0.70394  -0.736 0.463470    
## cpi_score        -0.17530    0.06249  -2.805 0.006228 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4766 on 85 degrees of freedom
##   (99 observations deleted due to missingness)
## Multiple R-squared:  0.586,  Adjusted R-squared:  0.5519 
## F-statistic: 17.18 on 7 and 85 DF,  p-value: 5.69e-14

Our process of analyzing models for the data subsets mirrored the method used to analyze the model for the entire data set described above. We started by investigating the continent data subsets. The initial linear regression model was fit to the subset of our data set that contains only data for the continent of Africa. As with the previous initial model, the happiness score is the dependent variable, while the determinants of happiness and the CPI score are the independent variables. Model selection indicated that the best combination of independent variables for explaining the happiness score in Africa includes the same independent variables used in the initial model, which is why the two summaries above are identical. Based on these results, we constructed a multiple linear regression model and used its model summary for analysis. The y-intercept of this model is 3.14119, and the independent variables with the largest coefficients were health, GDP per capita, and freedom, with health having the largest. Therefore, the independent variables with the largest estimated effects on the happiness score in Africa were health, GDP per capita, and freedom. The adjusted R^2 value for this model is 0.5519, and the p-value of the overall F-test is 5.69e-14, with alpha set at 0.05 and 85 residual degrees of freedom. Because this p-value is below alpha, we can reject the null hypothesis that all of the slope coefficients are 0; there is evidence of a statistically significant relationship between the happiness score and the independent variables in this model.

## 
## Call:
## lm(formula = happiness_score ~ gdp_per_capita + health + freedom + 
##     family + generosity + government_trust + cpi_score, data = WHC_Asia)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.23998 -0.27709 -0.00759  0.27647  1.30798 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       2.39218    0.23900  10.009  < 2e-16 ***
## gdp_per_capita    1.09000    0.21127   5.159 1.24e-06 ***
## health            0.91249    0.49194   1.855 0.066529 .  
## freedom           0.23903    0.47617   0.502 0.616764    
## family            0.73540    0.19027   3.865 0.000197 ***
## generosity        0.73288    0.31178   2.351 0.020684 *  
## government_trust  1.23967    0.53159   2.332 0.021685 *  
## cpi_score         0.02834    0.05274   0.537 0.592249    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4733 on 101 degrees of freedom
## Multiple R-squared:  0.7215, Adjusted R-squared:  0.7022 
## F-statistic: 37.37 on 7 and 101 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = happiness_score ~ gdp_per_capita + health + family + 
##     generosity + government_trust + cpi_score, data = WHC_Asia)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.22228 -0.30421 -0.02687  0.28574  1.29542 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       2.41883    0.23217  10.418  < 2e-16 ***
## gdp_per_capita    1.05967    0.20170   5.254 8.20e-07 ***
## health            0.96542    0.47874   2.017  0.04637 *  
## family            0.78488    0.16216   4.840 4.63e-06 ***
## generosity        0.77073    0.30141   2.557  0.01203 *  
## government_trust  1.34125    0.48976   2.739  0.00728 ** 
## cpi_score         0.02805    0.05254   0.534  0.59458    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4716 on 102 degrees of freedom
## Multiple R-squared:  0.7208, Adjusted R-squared:  0.7043 
## F-statistic: 43.88 on 6 and 102 DF,  p-value: < 2.2e-16

The initial linear regression model was fit to the subset of our data set that contains only data for the continent of Asia. As with the previous initial model, the happiness score is the dependent variable, while the determinants of happiness and the CPI score are the independent variables. Model selection indicated that the best combination of independent variables for explaining the happiness score in Asia includes GDP per capita, health, family, generosity, and government trust. Based on these results, we constructed a multiple linear regression model and used its model summary for analysis. The y-intercept of this model is 2.3817, and the independent variables with the largest coefficients were government trust, health, and GDP per capita, with government trust having the largest. Therefore, the independent variables with the largest estimated effects on the happiness score in Asia were government trust, health, and GDP per capita. The adjusted R^2 value for this model is 0.7064, and the p-value of the overall F-test is < 2.2e-16, with alpha set at 0.05 and 103 residual degrees of freedom. Because this p-value is below alpha, we can reject the null hypothesis that all of the slope coefficients are 0; there is evidence of a statistically significant relationship between the happiness score and the independent variables in this model. Note that the second summary printed above still includes the CPI score, so its estimates differ slightly from the values reported here for the model without it.

## 
## Call:
## lm(formula = happiness_score ~ gdp_per_capita + health + freedom + 
##     family + generosity + government_trust, data = WHC_NorthAmerica)
## 
## Residuals:
##        1        2        3        4        5        6       13       14 
## -0.03439  0.05553 -0.01627  0.02924  0.04284 -0.06515 -0.00800  0.02005 
##       15 
## -0.02385 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)        4.3676     4.8875   0.894    0.466
## gdp_per_capita     1.1078     2.0681   0.536    0.646
## health            -0.8045     6.0655  -0.133    0.907
## freedom            8.0198     6.3063   1.272    0.331
## family            -1.1697     0.5530  -2.115    0.169
## generosity        -1.9847     2.7889  -0.712    0.550
## government_trust  -1.0028     2.8468  -0.352    0.758
## 
## Residual standard error: 0.07906 on 2 degrees of freedom
##   (9 observations deleted due to missingness)
## Multiple R-squared:  0.9805, Adjusted R-squared:  0.9221 
## F-statistic: 16.77 on 6 and 2 DF,  p-value: 0.05732

## 
## Call:
## lm(formula = happiness_score ~ gdp_per_capita + health + freedom + 
##     family + generosity + government_trust + cpi_score, data = WHC_Australia)
## 
## Residuals:
## ALL 6 residuals are 0: no residual degrees of freedom!
## 
## Coefficients: (2 not defined because of singularities)
##                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)        8.7485        NaN     NaN      NaN
## gdp_per_capita    -0.1486        NaN     NaN      NaN
## health            -1.2239        NaN     NaN      NaN
## freedom            0.5676        NaN     NaN      NaN
## family            -0.1185        NaN     NaN      NaN
## generosity        -0.7821        NaN     NaN      NaN
## government_trust       NA         NA      NA       NA
## cpi_score              NA         NA      NA       NA
## 
## Residual standard error: NaN on 0 degrees of freedom
##   (6 observations deleted due to missingness)
## Multiple R-squared:      1,  Adjusted R-squared:    NaN 
## F-statistic:   NaN on 5 and 0 DF,  p-value: NA

## 
## Call:
## lm(formula = happiness_score ~ gdp_per_capita + health + freedom + 
##     family + generosity + government_trust + cpi_score, data = WHC_Europe)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.00627 -0.25400  0.03021  0.24328  1.00754 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       3.37086    0.42976   7.844 4.68e-12 ***
## gdp_per_capita    0.68111    0.34312   1.985 0.049853 *  
## health            0.01745    0.60330   0.029 0.976979    
## freedom           1.00323    0.52220   1.921 0.057533 .  
## family            0.26678    0.21025   1.269 0.207397    
## generosity        1.03055    0.40217   2.562 0.011869 *  
## government_trust  2.06510    0.58695   3.518 0.000652 ***
## cpi_score         0.09947    0.06136   1.621 0.108124    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4263 on 101 degrees of freedom
## Multiple R-squared:  0.8083, Adjusted R-squared:  0.795 
## F-statistic: 60.82 on 7 and 101 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = happiness_score ~ gdp_per_capita + health + freedom + 
##     family + generosity + government_trust + cpi_score, data = WHC_Europe)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.00627 -0.25400  0.03021  0.24328  1.00754 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       3.37086    0.42976   7.844 4.68e-12 ***
## gdp_per_capita    0.68111    0.34312   1.985 0.049853 *  
## health            0.01745    0.60330   0.029 0.976979    
## freedom           1.00323    0.52220   1.921 0.057533 .  
## family            0.26678    0.21025   1.269 0.207397    
## generosity        1.03055    0.40217   2.562 0.011869 *  
## government_trust  2.06510    0.58695   3.518 0.000652 ***
## cpi_score         0.09947    0.06136   1.621 0.108124    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4263 on 101 degrees of freedom
## Multiple R-squared:  0.8083, Adjusted R-squared:  0.795 
## F-statistic: 60.82 on 7 and 101 DF,  p-value: < 2.2e-16

The initial linear regression model was fit to the subset of our data set that contains only data for the continent of Europe. As with the previous initial model, the happiness score is the dependent variable, while the determinants of happiness and the CPI score are the independent variables. Model selection indicated that the best combination of independent variables for explaining the happiness score in Europe includes the same independent variables used in the initial model, which is why the two summaries above are identical. Based on these results, we constructed a multiple linear regression model and used its model summary for analysis. The y-intercept of this model is 3.37086, and the independent variables with the largest coefficients were government trust, generosity, and freedom, with government trust having the largest. Therefore, the independent variables with the largest estimated effects on the happiness score in Europe were government trust, generosity, and freedom. The adjusted R^2 value for this model is 0.795, and the p-value of the overall F-test is < 2.2e-16, with alpha set at 0.05 and 101 residual degrees of freedom. Because this p-value is below alpha, we can reject the null hypothesis that all of the slope coefficients are 0; there is evidence of a statistically significant relationship between the happiness score and the independent variables in this model.

## 
## Call:
## lm(formula = happiness_score ~ gdp_per_capita + health + freedom + 
##     family + generosity + government_trust + cpi_score, data = WHC_SouthAmerica)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.33673 -0.28009 -0.02407  0.31509  0.80127 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       2.45263    0.43999   5.574 8.97e-07 ***
## gdp_per_capita    1.79779    0.48031   3.743 0.000456 ***
## health            2.35555    0.68609   3.433 0.001178 ** 
## freedom           1.80433    0.65240   2.766 0.007844 ** 
## family           -0.36356    0.35509  -1.024 0.310637    
## generosity        0.19867    0.80806   0.246 0.806755    
## government_trust  0.26371    1.41285   0.187 0.852662    
## cpi_score        -0.03541    0.06179  -0.573 0.569014    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4543 on 52 degrees of freedom
##   (60 observations deleted due to missingness)
## Multiple R-squared:   0.78,  Adjusted R-squared:  0.7504 
## F-statistic: 26.34 on 7 and 52 DF,  p-value: 5.363e-15

## 
## Call:
## lm(formula = happiness_score ~ gdp_per_capita + health + freedom + 
##     family, data = WHC_SouthAmerica)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.29441 -0.26802  0.02395  0.30147  0.82834 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      2.5440     0.2810   9.052 1.74e-12 ***
## gdp_per_capita   1.6609     0.3824   4.343 6.09e-05 ***
## health           2.3027     0.6400   3.598 0.000687 ***
## freedom          1.6919     0.5986   2.826 0.006555 ** 
## family          -0.3125     0.3256  -0.960 0.341420    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4432 on 55 degrees of freedom
##   (60 observations deleted due to missingness)
## Multiple R-squared:  0.7786, Adjusted R-squared:  0.7625 
## F-statistic: 48.36 on 4 and 55 DF,  p-value: < 2.2e-16

The initial linear regression model was fit to the subset of our data set that contains only data for the continent of South America. As with the previous initial model, the happiness score is the dependent variable, while the determinants of happiness and the CPI score are the independent variables. Model selection indicated that the best combination of independent variables for explaining the happiness score in South America includes GDP per capita, health, and freedom. Based on these results, we constructed a multiple linear regression model and used its model summary for analysis. The y-intercept of this model is 2.6226, and the independent variables with the largest coefficients were freedom, GDP per capita, and health, with freedom having the largest. Therefore, the independent variables with the largest estimated effects on the happiness score in South America were freedom, GDP per capita, and health. The adjusted R^2 value for this model is 0.7488, and the p-value of the overall F-test is < 2.2e-16, with alpha set at 0.05 and 116 residual degrees of freedom. Because this p-value is below alpha, we can reject the null hypothesis that all of the slope coefficients are 0; there is evidence of a statistically significant relationship between the happiness score and the independent variables in this model. Note that the second summary printed above still includes family and uses only the 60 rows without missing values, so its estimates differ from the values reported here for the three-variable model.

## 
## Call:
## lm(formula = happiness_score ~ gdp_per_capita + health + freedom, 
##     data = WHC_NorthAmerica)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.37642 -0.11454 -0.02561  0.06321  0.47834 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      4.7516     0.5928   8.015 1.34e-06 ***
## gdp_per_capita   0.7290     0.3252   2.242   0.0417 *  
## health           0.4923     0.8382   0.587   0.5663    
## freedom          1.6857     0.9673   1.743   0.1033    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2177 on 14 degrees of freedom
## Multiple R-squared:  0.6126, Adjusted R-squared:  0.5296 
## F-statistic: 7.381 on 3 and 14 DF,  p-value: 0.003341

The initial linear regression model was fit to the subset of our data set that contains only data for the continent of North America. As with the previous initial model, the happiness score is the dependent variable, while the determinants of happiness are the independent variables. Model selection indicated that the best combination of independent variables for explaining the happiness score in North America includes GDP per capita, health, and freedom. Based on these results, we constructed a multiple linear regression model and used its model summary for analysis. The y-intercept of this model is 4.7516, and the independent variables with the largest coefficients were freedom, GDP per capita, and health, with freedom having the largest. Therefore, the independent variables with the largest estimated effects on the happiness score in North America were freedom, GDP per capita, and health, although only GDP per capita is individually significant at the 0.05 level. The adjusted R^2 value for this model is 0.5296, and the p-value of the overall F-test is 0.003341, with alpha set at 0.05 and 14 residual degrees of freedom. Because this p-value is below alpha, we can reject the null hypothesis that all of the slope coefficients are 0; there is evidence of a statistically significant relationship between the happiness score and the independent variables in this model. The relatively low adjusted R^2 value and relatively high p-value in these results may have resulted from the limited amount of data we had for North America and/or the high number of NA values in the data subset for North America, which leaves only 14 residual degrees of freedom.

## 
## Call:
## lm(formula = happiness_score ~ gdp_per_capita + health + freedom, 
##     data = WHC_Australia)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.031915 -0.014035  0.002100  0.009825  0.038763 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      8.3950     0.4609  18.215 8.48e-08 ***
## gdp_per_capita  -0.3463     0.1676  -2.067  0.07261 .  
## health          -0.4421     0.1239  -3.569  0.00731 ** 
## freedom         -0.3792     0.2987  -1.270  0.23991    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02513 on 8 degrees of freedom
## Multiple R-squared:  0.6212, Adjusted R-squared:  0.4792 
## F-statistic: 4.374 on 3 and 8 DF,  p-value: 0.04225

The initial linear regression model was fit to the subset of our data set that contains only data for the continent of Australia. As with the previous initial model, the happiness score is the dependent variable, while the determinants of happiness and the CPI score are the independent variables. Model selection indicated that the best combination of independent variables for explaining the happiness score in Australia includes GDP per capita, health, and freedom. Based on these results, we constructed a multiple linear regression model and used its model summary for analysis. The y-intercept of this model is 8.3950. All three coefficients are negative; ranked by absolute value, they are health, freedom, and GDP per capita, with health having the largest magnitude. Therefore, the independent variables with the largest estimated effects on the happiness score in Australia were health, freedom, and GDP per capita. The adjusted R^2 value for this model is 0.4792, and the p-value of the overall F-test is 0.04225, with alpha set at 0.05 and 8 residual degrees of freedom. Because this p-value is below alpha, we can reject the null hypothesis that all of the slope coefficients are 0, although the evidence is much weaker than for the other continents.

Contrary to the results of our other models, this model summary for Australia shows a low adjusted R^2 value (suggesting that the independent variables explain relatively little of the variance in the happiness score), few residual degrees of freedom, and a relatively high y-intercept and p-value. The model summary also shows a negative relationship between the happiness score and all of the independent variables in the final model, which contradicts the positive relationship we observed between those variables during data exploration. We believe these results were caused by the small number of observations in this data subset (Australia and New Zealand were the only countries in this continent) and the high number of NA values in its variables.
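The earlier full-model summary for Australia reported zero residual degrees of freedom and two undefined coefficients. That is what happens whenever a model has more coefficients than complete observations. The sketch below reproduces the situation with simulated data (six rows and seven hypothetical predictors `x1` to `x7`); nothing in it comes from our data set.

```r
# Six observations but eight coefficients (intercept + seven predictors).
set.seed(2)
tiny <- data.frame(
  y  = rnorm(6),
  x1 = rnorm(6), x2 = rnorm(6), x3 = rnorm(6), x4 = rnorm(6),
  x5 = rnorm(6), x6 = rnorm(6), x7 = rnorm(6)
)

saturated <- lm(y ~ ., data = tiny)

# With 6 rows, at most 6 coefficients can be estimated.
n_undefined <- sum(is.na(coef(saturated)))   # 2 coefficients come back as NA
resid_df    <- df.residual(saturated)        # 0 residual degrees of freedom

# The fit passes through every point, so residuals are numerically zero
max_abs_resid <- max(abs(residuals(saturated)))
```

A perfect fit of this kind carries no information about how well the predictors explain happiness, which is why standard errors, t values, and the F-statistic are all reported as NaN in that summary.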

Predictions:

The histogram above conveys the distribution of happiness scores around the world over the 2015-2020 time period.

The histogram above conveys the distribution of predictions for happiness scores around the world based on the linear regression model, model1.

In general, the distribution of our predictions does not differ greatly from that of the actual happiness scores recorded over the years; however, some differences are visible. While the distribution of recorded happiness scores is approximately normal, that of our predictions is slightly more left-skewed, with predicted scores tending to sit higher on the 0-10 scale than the recorded scores. Another thing to keep in mind is that the range of the y-axis is different for the two distributions: the count for recorded happiness scores goes up to 60, while that for our predictions has a maximum of only 30. Because bar heights also depend on the number of observations and the bin width, this difference on its own does not prove anything, but together with the flatter shape of the prediction histogram it may suggest more variation in future happiness scores than in those recorded in the 2015-2020 time period. Based on the prediction visual, the predicted happiness scores also seem to be more concentrated towards the two extremes, which would mean that countries are more likely to be either much happier or much less happy.

Above is a box plot comparing the distribution of predicted happiness scores for each continent and for the world as a whole. For each continent, we generated predictions from the linear regression model that was selected for that continent (shown previously). The predictions for worldwide happiness come from the model1 linear regression model. Based on the visual above, we can see that the mean predicted happiness scores for Australia, Europe, North America, and South America are above the worldwide mean, while the means for the remaining continents fall below it. Overall, the distribution of these predictions does not differ from the distribution of the actual recorded happiness scores (box plot shown in the Data Explanation and Exploration section). Based on our predictions, the ranking of mean predicted happiness scores across continents, in descending order, would be: Australia, North America, South America, Europe, Asia, and Africa.
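The per-continent prediction setup described above can be sketched as follows. This is an illustrative outline on simulated data, not our actual pipeline: the continent labels "A", "B", and "C", the two-predictor formula, and the data frame `sim` are all assumptions.

```r
# Simulated stand-in data with three hypothetical continents.
set.seed(3)
sim <- data.frame(
  continent      = rep(c("A", "B", "C"), each = 40),
  gdp_per_capita = runif(120, 0, 2),
  health         = runif(120, 0, 1)
)
sim$happiness_score <- 3 + 1.2 * sim$gdp_per_capita + 1.0 * sim$health +
  rnorm(120, sd = 0.5)

# Fit one model per continent and predict on that continent's own rows
by_continent <- split(sim, sim$continent)
pred_list <- lapply(by_continent, function(d) {
  fit <- lm(happiness_score ~ gdp_per_capita + health, data = d)
  predict(fit, newdata = d)
})

# A single worldwide model for comparison
world_fit  <- lm(happiness_score ~ gdp_per_capita + health, data = sim)
world_pred <- predict(world_fit, newdata = sim)

# Mean predicted score per continent, sorted in descending order
continent_means <- sort(sapply(pred_list, mean), decreasing = TRUE)
world_mean      <- mean(world_pred)
```

One property worth keeping in mind when reading such plots: for a least-squares model with an intercept, the mean of the in-sample predictions equals the mean of the observed scores, so differences in means between predicted and recorded scores can only arise when predictions are made for rows that were not used in fitting.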

V. Conclusion:

Overall, Australia, Europe, and the Americas had higher happiness scores than the other two continents in the data, Asia and Africa. We had anticipated that the pandemic would significantly change the happiness distribution and specific determinants of happiness, like health, but our data reveals the opposite. For many countries, health stayed the same, but “by far, the largest changes were in three types of benevolent actions, especially in 2021…In 2020, there was a substantial increase in help given to strangers but no substantial change in donations and volunteering. In 2021, all three types of activity were much higher than in 2017-2019, having an increase averaging about 25% of baseline activity” (Happiness, Benevolence, and Trust During COVID-19 and Beyond | the World Happiness Report, 2022).

We were correct in our analysis about government trust, which was the most significant contributor to happiness in Asia, North America, and Europe in our data. The World Happiness Report for 2022 explained that even though “there were no significant changes in the sense of freedom, perceived corruption and institutional trust during 2020 and 2021, confidence in government rose in 2020 and then returned to baseline in 2021.” This increase was related to COVID-19 and governments’ efforts to limit deaths. Regarding the final analyses for Africa, according to a Quartz article, despite having the lowest number of COVID-19 deaths, Africa still ranked as the saddest continent because of the “constrained business activity and loss of livelihoods” from the “virus containment measures enforced by governments across the continent” (Ngila, 2022). This is consistent with our analysis, in which Africa ranked lowest in happiness among the continents.

While we predicted many factors using the current dataset, missing data still posed a problem, and we experienced repeated errors while modeling for Australia. In our map visualization, Africa in particular was also missing data for several countries. Additionally, some model predictions do not account for the full range of factors that could influence an individual’s happiness. As this research is continually being conducted and updated, we suggest that future data collectors and data analysts provide more information on the missing data. This may be difficult, but it is important because it can help determine how to improve conditions in a country with high dystopian values.

Happiness, Benevolence, and Trust During COVID-19 and Beyond | The World Happiness Report. (2022, March 18). https://worldhappiness.report/ed/2022/happiness-benevolence-and-trust-during-covid-19-and-beyond/#ranking-of-happiness-2019-2021

Ngila, F. (2022, July 20). Covid made Africa the saddest continent. Quartz. https://qz.com/africa/2171425/world-happiness-report-africa-has-the-worlds-saddest-population

Final Project

Sofia Vuong

2023-04-10