Drawing Insight from Simple Data Exploration and Visualisation, Including Dodged Bars and Choropleth Maps
Mar 6, 2019
Mike Page
13 minute read

INTRODUCTION


Recently, I got approached by a client starting a company that specialises in company formations and related services. They asked if I could source and analyse data regarding the number of companies formed in the UK each year, in addition to any information on competitor companies in that sector. This information was to be merged into a business plan. I took on the challenge as I reasoned it would be a good opportunity to practice data wrangling and simple data exploration and visualisation. Below is a brief account of some of the steps I took to complete the report, as well as a snapshot at some of the data used.

THE DATA


The journey began by exploring official government reports and data repositories for the required information. Six repositories were found that contained the required information. These included: (i) BIS - Business Population Estimates; (ii) ONS - UK Business; (iii) ONS - Business Demography; (iv) devolved country administrations; (v) Insolvency Service; and (vi) Companies House. A breakdown on the differences between the six repositories can be read here.

The accompanying documentation and data sets for the six resources outlined above was large. In brief, Business Population Estimates (BIS) data was used. This is because these were the only data to provide: (a) consistent data across a wide date range, (b) information on the whole business landscape (i.e., public and private sectors), and (c) a breakdown of businesses by area, industry, and formation type (e.g., partnership, non-profit, etc.). Additionally, data from business birth and death rates taken from the UK Business (ONS) repository was used in section one (details below).

The steps taken to wrangle the data are too vast to be included in this blog post (the data was split across hundreds of spreadsheets, many of which were in different formats), so at each stage below (where applicable) data is imported in a wrangled, tidy format.

THE REPORT


The report was broken down into five key sections:

  1. Change in the number of UK businesses
  2. Change in the number of UK businesses broken down by company type
  3. Change in the number of UK businesses broken down by industry
  4. Change in the number of UK businesses broken down by area (national and regional)
  5. Change in the number of UK businesses specialising in company formations (competitor analysis)

In order of conciseness and to minimise repetition, below, I have cherry-picked data and visualisations from each section of the report. Indeed, the finished report contained more information than listed below. But the aim of this blog post is not to learn about the data used, rather, to learn about how I uses the data to draw insight.

DATA EXPLORATION AND VISUALISATION


First, lets load the libraries that will be used throughout:

library(tidyverse)
library(wesanderson)
library(maptools)
library(rgeos)
library(ggmap)
library(rgdal)

Next, lets go through each section examining different visualisations and techniques:

Section 1: In order to understand the difference in business birth and death rates (i.e., the number of companies formed and dissolved each year), a simple line graph most effectively demonstrates the difference:

# Birth:death rates
# Import data set
birth_death_rates <- read_csv("/home/mike/Documents/r_projects/ggr/data_sets/wrangled/birth_death_rates.csv")

# Tidy birth_death_rates
birth_death_rates <- gather(birth_death_rates, key = "year", value = "count", `2012`, `2013`, `2014`, `2015`, `2016`, `2017`)

# Plot annual birth and death rates for UK
birth_death_rates %>%
    filter(area == "UNITED KINGDOM") %>%
    ggplot(aes(x = year, y = count/1000, group = rate, colour = rate)) +
    geom_point(size = 2) +
    geom_line(size = 1) +
    scale_colour_manual(values = c("steelblue", "steelblue1")) +
    ylab("No. of businesses (thousands)")

The accompanying data:

# Table of UK birth and death rates
tmp4 <- birth_death_rates %>%
    filter(area == "UNITED KINGDOM") %>%
    spread(rate, count) %>%
    select(-area) %>%
    rename(birth_rate = birth, death_rate = death)

library(knitr)
kable(tmp4, caption = "Annual company birth and death rates")
Table 1: Annual company birth and death rates
year birth_rate death_rate
2012 269250 252085
2013 346275 237025
2014 350305 246190
2015 382755 281995
2016 413900 287805
2017 381885 356815

To better understand the accompanying data, the annual growth in birth and death rates can be expressed as a percentage of previous year. To do this, the ‘lag()’ function can be used:

tmp5 <- birth_death_rates %>%
    filter(area == "UNITED KINGDOM") %>%
    group_by(rate) %>%
    mutate(count = (count / lag(count) - 1) * 100) %>%
    ungroup() %>%
    spread(rate, count) %>%
    filter(year != "2012") %>%
    select(-area) %>%
    rename(birth_rate = birth, death_rate = death)

library(knitr)
kable(tmp5, caption = "Annual company birth and death rates as a percentage of the previous year")
Table 2: Annual company birth and death rates as a percentage of the previous year
year birth_rate death_rate
2013 28.607242 -5.974175
2014 1.163815 3.866681
2015 9.263356 14.543645
2016 8.137059 2.060320
2017 -7.734960 23.978041

Insights: Despite the decrease in the number of businesses in 2018, there was still a large number of businesses being formed in 2018 with only a small decrease from 2017 (-7.7%). This is because the number of businesses being formed (births) still exceeded the number of those being dissolved (deaths).

Section 2: In order to understand the change in the number of UK businesses broken down by company type, a dodged bar plot was chosen:

# Import data set
comp_types <- read_csv("/home/mike/Documents/r_projects/ggr/data_sets/wrangled/company_types_2012_2018.csv")

# Calculate the yearly difference in company types as both absolute and relative values (as a percentage) of the previous year

comp_types <- comp_types %>%
    group_by(comp_type) %>%
    mutate(diff = diff(c(0, count))) %>%
    mutate(perc_diff = (count / lag(count) - 1) * 100)

# Set the difference score(s) to NA for 2012
comp_types$diff[1:5] <- NA

# Plot
ggplot(comp_types, aes(x = year, y = count / 1000000, fill = comp_type)) +
    scale_x_continuous(breaks = 2012:2018) +
    scale_y_continuous(breaks = seq(0, 3.5, by = .5)) +
    geom_col(position = "dodge", colour = "black") +
    scale_fill_brewer(labels = c("Governmental", "Company", "Non Profit", "Partnership", "Sole Trader")) +
    ylab("Number (millions)") +
    labs(fill = "Company Type")

Accompanying data:

tmp2 <- comp_types %>%
    filter(year != 2012) %>%
    select(year, comp_type, diff) %>%
    spread(comp_type, diff) %>%
    rename(Governmental = central_and_local_gov,
           Company = companies,
           Non_Profit = non_profit_org,
           Partnership = partnernships,
           Sole_Trader = sole_proprietorships)

library(knitr)
kable(tmp2, caption = "Annual change in company type")
Table 3: Annual change in company type
year Governmental Company Non_Profit Partnership Sole_Trader
2013 1670 54390 -2465 -13630 60790
2014 1230 115265 80 22305 209910
2015 420 117480 -310 -20700 49535
2016 295 124665 7265 -15105 -1340
2017 270 127540 3375 -7325 76625
2018 205 30745 60 -8515 -49225

Accompanying data expressed as a percentage of the previous year:

tmp3 <- comp_types %>%
    filter(year != 2012) %>%
    select(year, comp_type, perc_diff) %>%
    spread(comp_type, perc_diff) %>%
    rename(Governmental = central_and_local_gov,
           Company = companies,
           Non_Profit = non_profit_org,
           Partnership = partnernships,
           Sole_Trader = sole_proprietorships)

library(knitr)
kable(tmp3, caption = "Relative annual change in company type as a percentage of previous year")
Table 4: Relative annual change in company type as a percentage of previous year
year Governmental Company Non_Profit Partnership Sole_Trader
2013 19.716647 4.055581 -3.0524426 -3.042275 2.0229819
2014 12.130178 8.259734 0.1021842 5.134787 6.8469156
2015 3.693931 7.776167 -0.3955595 -4.532565 1.5122097
2016 2.502120 7.656379 9.3069434 -3.464489 -0.0402983
2017 2.234175 7.275881 3.9554644 -1.740360 2.3052982
2018 1.659247 1.634977 0.0676437 -2.058927 -1.4475855

Insights: Sole traders remain the largest sector of the businesses landscape consisting of 3,351,265 companies in 2018. In the private sector, companies, as defined as public corporations and nationalised bodies, are the fastest growing segment, with an average annual growth of approximately 6%.

Section 3: In order to understand the change in the number of UK businesses broken down by industry, a plot of the the annual change in industries as a percentage of all industries can be used. This demonstrates a large amount of information, from the size of industry sectors to their relative growth and performance in the whole market:

# Import data set
industry <- read_csv("/home/mike/Documents/r_projects/ggr/data_sets/wrangled/industry_time_series.csv")

# Tidy industry data set
industry <- industry %>% gather(key = "year", value = "count", `2012`, `2013`, `2014`, `2015`, `2016`, `2017`, `2018`) %>%
    select(- industry)

# Plot

industry %>%
    group_by(year) %>%
    mutate(percent_change = count/sum(count)*100) %>%
    ungroup() %>%
    ggplot(aes(x = year, y = percent_change, colour = industry_key, group = industry_key)) +
    geom_line() +
    geom_point(position = "dodge") +
    ylab("Percentage")

Accompanying data set:

# Print absolute industry numbers for each year
tmp6 <- industry %>%
    spread(year, count)

kable(tmp6, caption = "Annual change in the number of businesses per industry sector")
Table 5: Annual change in the number of businesses per industry sector
industry_key 2012 2013 2014 2015 2016 2017 2018
A 152085 153620 151695 153360 153640 155795 157580
B_D_E 25655 28160 33695 29390 28285 38125 34275
C 230970 274295 261395 275565 265735 265775 290100
F 907480 890850 950220 956105 974625 1007500 991620
G 514805 496370 538530 522690 544490 542150 555810
H 269945 275290 284975 274840 313860 345285 325565
I 166555 169325 178700 183180 185510 202060 185955
J 289075 305610 320895 338905 337850 351485 362825
K 76380 89960 83785 84140 89215 86410 86615
L 91810 88890 99605 105045 105700 111870 116065
M 665625 686850 779725 792885 823560 855625 816685
N 378735 379685 419120 443400 459860 478810 491055
P 243220 255890 258945 267550 312220 296305 283470
Q 303540 333310 336530 371375 347700 362115 360140
R 209430 204730 234230 268365 252495 276300 276185
S 268805 262815 311090 322655 302925 318905 333555

Accompanying legend:

# Import data set
industry <- read_csv("/home/mike/Documents/r_projects/ggr/data_sets/wrangled/industry_time_series.csv")

industry_legend <- industry %>%
    select(industry_key, industry)

industry_legend$industry[2] <- "Quarrying; Electricity and Gas; Water Supply; Waste Management"
    
kable(industry_legend, caption = "Industry legend")
Table 6: Industry legend
industry_key industry
A Agriculture, Forestry and Fishing
B_D_E Quarrying; Electricity and Gas; Water Supply; Waste Management
C Manufacturing
F Construction
G Wholesale and Retail Trade; Repair of Motor Vehicles and Motorcycles
H Transportation and Storage
I Accommodation and Food Service Activities
J Information and Communication
K Financial and Insurance Activities
L Real Estate Activities
M Professional, Scientific and Technical Activities
N Administrative and Support Service Activities
P Education
Q Human Health and Social Work Activities
R Arts, Entertainment and Recreation
S Other Service Activities

Insight: Industries C, L, N, & S have demonstrated relative stability and/or growth over time, even in face of the recent decline in the number of businesses in 2018. Industry N is the only industry to show positive annual growth each year. Each year, industry F remains the largest industry sector.

Section 4: In order to understand the change in the number of UK businesses broken down by area (national and regional), a choropleth map was plotted using a UK shapefile. This plot took some hacking to get to work, so perhaps there will be a blog post covering the code below in the future:

# Import data
regional <- read_csv("/home/mike/Documents/r_projects/ggr/data_sets/wrangled/regional_time_series.csv")

national <- read_csv("/home/mike/Documents/r_projects/ggr/data_sets/wrangled/national_time_series.csv")

# Tidy data
regional_tidy <- regional  %>%
    gather(year, count, `2012`, `2013`, `2014`, `2015`, `2016`, `2017`, `2018`)

national_tidy <- national  %>%
    gather(year, count, `2012`, `2013`, `2014`, `2015`, `2016`, `2017`, `2018`)

# Import shapefile
shp <- readOGR(dsn = "/home/mike/Documents/r_projects/ggr/shapefile", layer = "NUTS_Level_1_January_2018_Full_Extent_Boundaries_in_the_United_Kingdom", stringsAsFactors = FALSE)
## OGR data source with driver: ESRI Shapefile 
## Source: "/home/mike/Documents/r_projects/ggr/shapefile", layer: "NUTS_Level_1_January_2018_Full_Extent_Boundaries_in_the_United_Kingdom"
## with 12 features
## It has 9 fields
## Integer64 fields read as strings:  objectid bng_e bng_n
shp_df <- fortify(shp, region = "nuts118nm")

# Bind national and regional tables
nat_reg <- bind_rows(regional_tidy, rename(national_tidy, region = nation))  %>% filter(region != "England")

# Recode regional
nat_reg <- nat_reg %>%
    mutate(id = recode(nat_reg$region, "North East" = "North East (England)", "North West" = "North West (England)", "Yorkshire and the Humber" = "Yorkshire and The Humber", "East Midlands" = "East Midlands (England)", "West Midlands" = "West Midlands (England)", "East of England" = "East of England", "London" = "London", "South East" = "South East (England)", "South West" = "South West (England)", "Wales" = "Wales", "Scotland" = "Scotland", "Northern Ireland" = "Northern Ireland"))

# Merge nat_reg and shp_df
shp_merge <- merge(shp_df, nat_reg, by = "id")

# Plot map showing number of businesses per area in 2018
shp_merge %>%
    filter(year == "2018") %>%
    ggplot(aes(x = long, y = lat, group = group)) +
    geom_polygon(aes(fill = count/1000000), color = 'black', size = 0.1) +
    scale_fill_gradient(high = "steelblue4", low = "lightskyblue1", guide = "colorbar") +
    guides(fill = guide_colorbar(title = "Businesses\n(millions)")) +
    theme_void()

Accompanying data:

# Print change in number of businesses per year
combined <-  bind_rows(regional, rename(national, region = nation))  %>% rename(area = region)

kable(combined, caption = "Annual number of businesses broken down by area")
Table 7: Annual number of businesses broken down by area
area 2012 2013 2014 2015 2016 2017 2018
North East 132300 134500 150600 135000 146400 142500 162800
North West 437500 485400 506800 531200 521700 529800 544800
Yorkshire and the Humber 343000 349900 353200 380900 391300 419200 400700
East Midlands 324900 315700 330900 356400 370300 370800 367700
West Midlands 373500 386200 412600 399700 412600 449800 448200
East of England 507400 499000 549000 529100 554900 572400 564800
London 820900 857900 939300 987400 1016600 1062300 1096100
South East 771600 792400 837300 879200 899000 928900 873600
South West 478200 467600 513000 531900 514900 532000 546300
England 4189200 4288700 4592600 4730700 4827600 5007700 5004800
Wales 195000 189500 213200 212800 221500 209000 198600
Scotland 319100 326100 322700 341000 324800 346200 331400
Northern Ireland 114500 110200 118500 116700 124000 131700 132700

Insight: Each year, London remains the area with the largest number of businesses. With the exception of London, the North West is the fastest growing region. The South East experienced the largest decrease in the number of businesses in 2018. England has shown the fastest rate of growth as a nation.

Section 5: In order to understand the change in the number of UK businesses specialising in company formations, a simple dodged bar plot was used. The Standard Industrial Classification (SIC) codes used below are a means to classify businesses by an industry category. More information can be read here:

# Import data set
sic <- read_csv("/home/mike/Documents/r_projects/ggr/data_sets/wrangled/sic_codes.csv")

# Change SIC codes to factors
sic  <- sic %>% mutate_at("SIC", factor)

# Plot change in absolute competitor growth per annum
ggplot(sic, aes(x = year, y = businesses/1000, group = SIC, fill = SIC)) +
    geom_col(position = "dodge", colour = "black") +
    scale_fill_brewer(name = "SIC Code") +
    ylab("No. Businesses (thousands)") +
    scale_y_continuous(breaks = seq(0, 60, by = 10))

The accompanying data:

kable(sic, caption = "Annual change in number of businesses, employment, and turnover for comeptitor formation companies broken down by three digit SIC code")
Table 8: Annual change in number of businesses, employment, and turnover for comeptitor formation companies broken down by three digit SIC code
SIC year businesses employ_thousands turnover_millions
649 2012 3935 59 NA
702 2012 43510 277 37446
829 2012 35880 239 29856
649 2013 3830 66 NA
702 2013 41345 276 40866
829 2013 34185 237 31296
649 2014 2410 60 NA
702 2014 47680 299 43677
829 2014 40340 267 37468
649 2015 2580 59 NA
702 2015 51110 341 48645
829 2015 43420 292 45322
649 2016 2735 62 NA
702 2016 52840 347 53655
829 2016 45405 298 54840
649 2017 2985 65 NA
702 2017 57145 354 63639
829 2017 54225 327 54684
649 2018 3275 71 NA
702 2018 58240 411 76172
829 2018 53670 356 65195

Insights: Combined, there was a total of 115,185 companies operating in the landscape of formation companies in 2018 (see details below). Overall, since 2012 there was a positive annual growth in the number of formation companies with a decrease in growth in 2018. This is inline with the overall industry landscape detailed in section one.

CONCLUSION


Using a variety of data exploration and visualisation methods (e.g., using the ‘lag()’ function, choropleth maps, dodged bar charts, etc.) one can gain basic insight into complex data. The choice of which method to use depends upon the nature of the data being studied, and the questions you would like to answer.