GGDOT Hacknight: using Box-Cox Transformations and Regression to Analyse the CO2 Emissions of Food
Sep 21, 2018
Mike Page
14 minute read

GGDOT HACKNIGHT


The GGDOT project combines expertise in greenhouse gas emission calculations, nutrition, and data science to deliver a toolkit. This toolkit contains information on the greenhouse gas emissions and nutritional content of the food we eat. As part of the project, GGDOT host Hacknights to help develop their tools and analyse the data within them. I was fortunate enough to be able to attend the recent Manchester Hacknight, and this blog post will go through the analyses I performed from that session (and the day after). If you get the chance, I highly recommend attending one of their hacknights, it was great fun! More info can be found here.

DATA WRANGLING


Although the GGDOT group provide a Jupyter notebook for Python to get started with the data, I chose to run all analyses in R (I really enjoy working within the tidyverse ecosystem).

First, the tidyverse libraries and GGDOT toolkit data sets were loaded into the work space:

# Load libraries

library(tidyverse)

# Read in datasets

GHG <- read_csv("/Users/mikepage/Documents/Data Science/random/intake24_20180920/hacknight/GHG_file_intake24_20180920.csv", col_names = TRUE)

aisle_food_info <- read_csv("/Users/mikepage/Documents/Data Science/random/intake24_20180920/hacknight/eaten_table_aisle_intake24_20180920.csv", col_names = TRUE)

individual_food_info <- read_csv("/Users/mikepage/Documents/Data Science/random/intake24_20180920/hacknight/eaten_table_intake24_20180920.csv", col_names = TRUE)

nutrition_info <- read_csv("/Users/mikepage/Documents/Data Science/random/intake24_20180920/hacknight/foods_table_intake24_20180920.csv", col_names = TRUE)

The GGDOT toolkit contained alot of information! So, we won’t be exploring the data sets in depth here. The key data set we will be exploring consists of a food diary over a period of days over which participants recorded individual meals (individual_food_info). Alongside the meals is a breakdown of their nutrient content and associated greenhouse gas emissions, among many other variables, as can be seen by using the ‘glimpse()’ function below:

glimpse(individual_food_info)
## Observations: 50,213
## Variables: 129
## $ SurveyYear              <int> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,...
## $ Age                     <int> 45, 45, 45, 45, 45, 45, 45, 45, 45, 45...
## $ Sex                     <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,...
## $ Country                 <chr> "Wales", "Wales", "Wales", "Wales", "W...
## $ DayofWeek               <int> 7, 7, 3, 1, 2, 2, 7, 3, 3, 3, 3, 3, 3,...
## $ DayNo                   <int> 1, 1, 4, 2, 3, 3, 1, 4, 4, 4, 4, 4, 4,...
## $ DiaryDaysCompleted      <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,...
## $ MealTimeDescription     <chr> "2pm to 4:59pm", "8pm to 9:59pm", "8pm...
## $ MealTime                <dbl> 13.99972, 19.99972, 21.00000, 12.00000...
## $ MainFoodGroupCode       <int> 7, 7, 7, 59, 59, 59, 8, 21, 16, 50, 1,...
## $ MainFoodGroupDesc       <chr> "BISCUITS", "BISCUITS", "BISCUITS", "B...
## $ SubFoodGroupCode        <chr> "7A", "7A", "7A", "59R", "59R", "59R",...
## $ SubFoodGroupDesc        <chr> "BISCUITS MANUFACTURED / RETAIL", "BIS...
## $ RecipeMainFoodGroupCode <int> 7, 7, 7, 59, 59, 59, 8, 8, 8, 8, 8, 8,...
## $ RecipeMainFoodGroupDesc <chr> "BISCUITS", "BISCUITS", "BISCUITS", "B...
## $ RecipeSubFoodGroupCode  <chr> "7A", "7A", "7A", "59R", "59R", "59R",...
## $ RecipeSubFoodGroupDesc  <chr> "BISCUITS MANUFACTURED / RETAIL", "BIS...
## $ WhoWith                 <chr> "O - Not specified", "O - Not specifie...
## $ WhoWithOther            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ Where                   <chr> "Unspecified", "Unspecified", "Home - ...
## $ WhereOther              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ WatchingTV              <chr> "Not Specified", "Not Specified", "Yes...
## $ Table                   <chr> "Not Specified", "Not Specified", "Not...
## $ diarymth                <int> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,...
## $ seriali                 <int> 60101021, 60101021, 60101021, 60101021...
## $ FoodNumber              <int> 259, 259, 259, 102, 107, 107, 6967, 38...
## $ FoodGroupCode           <chr> "7A", "7A", "7A", "59R", "59R", "59R",...
## $ FoodName                <chr> "DIGESTIVE PLAIN", "DIGESTIVE PLAIN", ...
## $ FoodCategory            <chr> "F", "F", "F", "F", "R", "R", "F", "F"...
## $ Description             <chr> "Digestives", "Digestives", "Digestive...
## $ Dilution                <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 52, 1, 1, 1...
## $ Comment                 <chr> "06/09/04 Updated with MW6 BH. Updated...
## $ EdiblePortion           <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ CO2eRef                 <chr> "From INTAKE24 GHGs by The Rowett Inst...
## $ FoodAisle               <chr> "Cakes, biscuits, puddings", "Cakes, b...
## $ Base                    <dbl> 18.8333754, 18.8333754, 28.6153490, 13...
## $ Units                   <chr> "Grams", "Grams", "Grams", "Grams", "G...
## $ ACAR                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ALCO                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ BCAR                    <dbl> 0.0000000, 0.0000000, 0.0000000, 0.000...
## $ BCRYPT                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ BIOT                    <dbl> 0.75333501, 0.75333501, 1.14461396, 4....
## $ CA                      <dbl> 17.8917066, 17.8917066, 27.1845816, 24...
## $ CHO                     <dbl> 12.35469423, 12.35469423, 18.77166897,...
## $ CHOL                    <dbl> 0.3766675, 0.3766675, 0.5723070, 0.000...
## $ CL                      <dbl> 6.780015e+01, 6.780015e+01, 1.030153e+...
## $ CMON                    <dbl> 1.917238e+00, 1.917238e+00, 2.913043e+...
## $ CN3                     <dbl> 0.0150667003, 0.0150667003, 0.02289227...
## $ CN6                     <dbl> 0.412450920, 0.412450920, 0.626676144,...
## $ CU                      <dbl> 0.0395500882, 0.0395500882, 0.06009223...
## $ ENGFIB                  <dbl> 0.508501135, 0.508501135, 0.772614424,...
## $ FAT                     <dbl> 4.01150895, 4.01150895, 6.09506935, 3....
## $ FE                      <dbl> 0.339000756, 0.339000756, 0.515076283,...
## $ FOLT                    <dbl> 2.07167129, 2.07167129, 3.14768839, 60...
## $ FRUCT                   <dbl> 0.000000000, 0.000000000, 0.000000000,...
## $ GLUC                    <dbl> 0.0000000000, 0.0000000000, 0.00000000...
## $ HFE                     <dbl> 0.00000000, 0.00000000, 0.00000000, 0....
## $ I                       <dbl> 0.00000000, 0.00000000, 0.00000000, 8....
## $ K                       <dbl> 40.4917570, 40.4917570, 61.5230004, 28...
## $ KCALS                   <dbl> 87.1985279, 87.1985279, 132.4890661, 3...
## $ KJ                      <dbl> 365.9324831, 365.9324831, 555.9962319,...
## $ LACT                    <dbl> 0.00000000, 0.00000000, 0.00000000, 0....
## $ MALT                    <dbl> 0.00000000, 0.00000000, 0.00000000, 5....
## $ MG                      <dbl> 5.8383464, 5.8383464, 8.8707582, 60.33...
## $ MILK                    <dbl> 0.26366725, 0.26366725, 0.40061489, 6....
## $ MN                      <dbl> 0.171383716, 0.171383716, 0.260399676,...
## $ X67                     <dbl> 1.056552e+02, 1.056552e+02, 1.605321e+...
## $ NCF                     <dbl> 1.07350240, 1.07350240, 1.63107490, 7....
## $ NHFE                    <dbl> 0.339000756, 0.339000756, 0.515076283,...
## $ NIACEQU                 <dbl> 0.583834636, 0.583834636, 0.887075820,...
## $ NMILK                   <dbl> 3.03217343, 3.03217343, 4.60707120, 0....
## $ OSUG                    <dbl> 0.00000000, 0.00000000, 0.00000000, 0....
## $ P                       <dbl> 22.4117167, 22.4117167, 34.0522654, 21...
## $ PANTO                   <dbl> 0.094166877, 0.094166877, 0.143076745,...
## $ PROT                    <dbl> 1.167669272, 1.167669272, 1.774151641,...
## $ RET                     <dbl> 0.00000, 0.00000, 0.00000, 0.00000, 0....
## $ RIBO                    <dbl> 0.0037666751, 0.0037666751, 0.00572306...
## $ SATFA                   <dbl> 1.452053240, 1.452053240, 2.206243411,...
## $ SE                      <dbl> 0.75333501, 0.75333501, 1.14461396, 5....
## $ STAR                    <dbl> 9.05885354, 9.05885354, 13.76398289, 5...
## $ SUCR                    <dbl> 3.295840687, 3.295840687, 5.007686083,...
## $ THIA                    <dbl> 0.0226000504, 0.0226000504, 0.03433841...
## $ TOTCAR                  <dbl> 0.0000000, 0.0000000, 0.0000000, 0.000...
## $ TOTNIT                  <dbl> 0.207167129, 0.207167129, 0.314768839,...
## $ TOTSUG                  <dbl> 3.29584069, 3.29584069, 5.00768608, 6....
## $ TRANS                   <dbl> 0.001883338, 0.001883338, 0.002861535,...
## $ VITA                    <dbl> 0.00000000, 0.00000000, 0.00000000, 0....
## $ VITB12                  <dbl> 0.000000000, 0.000000000, 0.000000000,...
## $ VITB6                   <dbl> 0.018833375, 0.018833375, 0.028615349,...
## $ VITC                    <dbl> 0.0000000, 0.0000000, 0.0000000, 0.000...
## $ VITD                    <dbl> 0.0000000, 0.0000000, 0.0000000, 0.000...
## $ VITE                    <dbl> 0.998168894, 0.998168894, 1.516613499,...
## $ WATER                   <dbl> 0.52733451, 0.52733451, 0.80122977, 55...
## $ ZN                      <dbl> 0.169500378, 0.169500378, 0.257538141,...
## $ Fruit                   <dbl> 0.00000, 0.00000, 0.00000, 0.00000, 0....
## $ DriedFruit              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ FruitJuice              <dbl> 0.000000, 0.000000, 0.000000, 0.000000...
## $ SmoothieFruit           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Tomatoes                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ TomatoPuree             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Brassicaceae            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ YellowRedGreen          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Beans                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Nuts                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ OtherVeg                <dbl> 0.000000, 0.000000, 0.000000, 0.000000...
## $ Beef                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Lamb                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Pork                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ProcessRedMeat          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ OtherRedMeat            <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Burgers                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Sausages                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Offal                   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Poultry                 <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 0.0000...
## $ ProcessedPoultry        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ GameBirds               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ WhiteFish               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ OilyFish                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ CannedTuna              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Shellfish               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ CottageCheese           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ CheddarCheese           <dbl> 0.00000, 0.00000, 0.00000, 0.00000, 0....
## $ OtherCheese             <dbl> 0.00000, 0.00000, 0.00000, 0.00000, 0....
## $ AllCheese               <dbl> 0.00000, 0.00000, 0.00000, 0.00000, 0....
## $ AllFish                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ AllMeat                 <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 0.0000...
## $ AllAnimal               <dbl> 0.00000, 0.00000, 0.00000, 0.00000, 0....
## $ CO2e                    <dbl> 5.970180e+01, 5.970180e+01, 9.071066e+...
## $ KCALS_personday         <dbl> 1991.146, 1991.146, 1310.486, 1342.519...

I was curious to analyse how the age of participants and the total calories they consume affects the greenhouse gas emissions associated with their dietary choices. Consequently a number of variables from the individual_food_info data set were selected including (i) the CO2 emissions of individually recorded meals (CO2e), (ii) participant age (Age), (iii) the day on which meals were recorded (DayNo), (iv) a participant identifier (seriali), (v) the total calories of all recorded foods on a given day (KCALs_personday).

# select variables of interest from individual_food_info dataset

trim_data <- select(individual_food_info, Age, DayNo, seriali, CO2e, KCALS_personday)

As individuals recorded a different number of calories and meals each day, it was decided that summing CO2 emissions alone for each individual per day would not provide a valid measure of the actual impact of their dietary choices on greenhouse gas emissions. This is because it tells little in the way of how their diets compare relative to one another. Rather, a more valid measure may be the amount of CO2 emissions per calorie consumed. To calculate this, the total CO2 emissions over each day were summed and then divided by the total calories consumed that day. Lastly, as participants recorded meals over multiple days, their average CO2 emission per calorie per day was calculated:

# Sum the CO2 emission over each day and create new 'index' variable that captures CO2 emissions per calorie eaten for each individual on each day

index_data <- trim_data %>%
  group_by(seriali, DayNo, KCALS_personday, Age)%>% 
  summarise(CO2e = sum(CO2e)) %>%
  ungroup() %>% 
  mutate(CE_index = CO2e/KCALS_personday)

# Find the mean calories and mean CO2 emissions per calorie for each individual per day

mean_index_data <- index_data %>% 
  group_by(seriali, Age) %>% 
  summarise(mean_CE_index = mean(CE_index), mean_KCAL_day = mean(KCALS_personday))

DATA ANALYSIS


AGE vs. CO2: to investigate whether age predicts mean CO2 emission per calorie (i.e., mean_CE_index) a simple linear regression model was built. As can be seen in the output and graphs below, there is clearly no significant relationships to be found in the data.

# Does age predict the mean CO2 emissions per calorie for each individual across days (i.e., mean CE_index)?

lm(mean_CE_index~Age, data = mean_index_data) %>%
  summary()
## 
## Call:
## lm(formula = mean_CE_index ~ Age, data = mean_index_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9571 -0.7835 -0.1531  0.6562  4.5094 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.018454   0.171293  17.622   <2e-16 ***
## Age         0.001267   0.003941   0.322    0.748    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.068 on 480 degrees of freedom
## Multiple R-squared:  0.0002154,  Adjusted R-squared:  -0.001867 
## F-statistic: 0.1034 on 1 and 480 DF,  p-value: 0.7479
ggplot(mean_index_data, aes(x = Age, y = mean_CE_index)) +
  geom_jitter(alpha = 0.3) +
  geom_smooth(colour = "steelblue3")

The implications of this finding suggest that when it comes to our dietary choices and their impact on CO2 emissions, age has no bearing. In other words, young or old, our contributions to greenhouse gas emissions are uniform. This means we have a collective duty to reduce our impact, with no one age group to point the blame at. This finding was counter to my a priori beliefs that there may be a group of younger individuals in their twenties (depicted by a clustering of points on the above graph) whom are more environmentally conscious and thus demonstrate lower emission levels. However, to further examine this idea, a clustering analysis or similar machine learning method is warranted before firm conclusions can be made.

KCAL vs. CO2: to investigate whether the average number calories consumed on a given day(mean_KCAL_day) predicted the mean CO2 emission per calorie (i.e., mean_CE_index), another simple linear regression was built. Upon initially exploring the data and building a model, it was found that the data appeared to closer fit a polynomial function, than a strict linear model, as can be seen in the graph below:

Consequently, the regression assumptions were checked via model plotting diagnostics. As can be seen in the histograms and Q-Q plots below, the data demonstrated a large positive skew. Furthermore the residuals vs fitted plot demonstrated a lack of linearity, and the trend line on the scale-location plot demonstrated a lack of heteroscedasticity.

# Does the mean number of calories eaten per individual across days predict the mean CO2 emissions per calorie for each individual across days (i.e., mean CE_index)

# Check data for normaility

# Plot histogram of mean_KCAL_day

ggplot(data = mean_index_data, aes(x = mean_KCAL_day)) +
  geom_histogram(fill = "steelblue2", colour = "grey10", bins = 50, alpha = 0.5)

# Plot histogram of mean_CE_index

ggplot(data = mean_index_data, aes(x = mean_CE_index)) +
  geom_histogram(fill = "steelblue2", colour = "grey10", bins = 50, alpha = 0.5)

# Assumption plots

mean_lm <- lm(mean_CE_index~mean_KCAL_day, data = mean_index_data)

mean_lm %>% 
  plot()

To correct for the violated assumptions, a Box-Cox transformation was run. A Box-Cox transformation is a method to transform non-normal variables into a normal distribution (this often takes care of other model assumptions too such as heteroscedasticity). It achieves this by iterating through a list of power transformations using a profile likelihood function. From this the optimum transformation can be found (denoted by the global optima on the graph below)

# Run boxcox transformations to establish model fit

bc_mean <- MASS::boxcox(mean_lm)

# Extract best lambda

best_lam <- bc_mean$x[which(bc_mean$y == max(bc_mean$y))]

best_lam
## [1] -0.3434343

Once the best (lambda) value for the transformation was found (-0.34) the dependent variable was transformed and the linear model was updated and assumptions rechecked:

# Update model and re-check normality

update_mean_lm <- lm((mean_CE_index)^best_lam~mean_KCAL_day, data = mean_index_data) 

plot(update_mean_lm)

As can be seen in the plots above the data now demonstrated adequate linearity, heteroscedasticity, and a normal distribution. The model summary was then re-run and the relationship between the average number of calories consumed on a given day(mean_KCAL_day) and the mean CO2 emission per calorie (i.e., mean_CE_index) was re-plotted. Note that as mean_CE_index was transformed by raising it to the power of -0.34 (best_lam), its reciprocal was plotted in order to demonstrate the underlying direction of the trend:

# Model summary

update_mean_lm %>% summary()
## 
## Call:
## lm(formula = (mean_CE_index)^best_lam ~ mean_KCAL_day, data = mean_index_data)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.150989 -0.031488 -0.000803  0.034165  0.161985 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.457e-01  8.818e-03   50.54   <2e-16 ***
## mean_KCAL_day 1.348e-04  4.527e-06   29.79   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05031 on 480 degrees of freedom
## Multiple R-squared:  0.6489, Adjusted R-squared:  0.6482 
## F-statistic: 887.3 on 1 and 480 DF,  p-value: < 2.2e-16
# Plot

ggplot(mean_index_data, aes(x = mean_KCAL_day, y = 1/(mean_CE_index^best_lam))) +
  geom_jitter(colour = "steelblue2", alpha = 0.6) +
  stat_smooth(method = "lm") +
  labs(x = "mean calories consumed per day", y = "mean CO2 emission per calorie")

As can be seen in the model summary above, the mean number of calories consumed on a given day (mean_KCAL_day) was a significant negative predictor of the mean CO2 emission per calorie (i.e., mean_CE_index) (p < 0.001). Moreover, the adjusted R2 demonstrates a strong fit (R2 = 0.65), with the mean number calories consumed on a given day predicting 65% of the variance in the mean CO2 emission per calorie. The negative trend indicates that the higher the number of calories consumed on a given day, the lower the carbon emissions of CO2 per calorie of that food. This implies that those that eat higher calorie diets eat foods with a lower carbon footprint per calorie. An interesting finding to say the least!