CDC Data analysis

I downloaded data from the BEAM dashboard (found here: https://data.cdc.gov/Foodborne-Waterborne-and-Related-Diseases/BEAM-Dashboard-Serotypes-of-concern-Illnesses-and-/fvm6-ic5r/about_data). I filtered for rows where year_first_ill > 2020, so there should be about 5 years of data.

Now to load my packages:

library(here) #to set paths

here() starts at C:/Users/esthe/Documents/GitHub/EstherPalmer-portfolio

library(dplyr) #for data processing/cleaning


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

library(tidyr) #for data processing/cleaning
library(skimr) #for nice visualization of data 
library(eeptools) #I need to remove commas and this is the easiest way I found

Loading required package: ggplot2

library(ggplot2) #plots!

Now to load in my data:

data_location <- here::here("cdcdata-exercise", "BEAM_Dashboard_Data.csv")
BEAM <- read.csv(data_location)

Now to get a glimpse of my data:

summary(BEAM)

   table_id         Food_category      Year_first_ill       Serotype        
 Length:339         Length:339         Length:339         Length:339        
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
 No_of_illnesses   No_of_outbreaks    Pathogen             Year          
 Min.   :  0.000   Min.   :0.0000   Length:339         Length:339        
 1st Qu.:  0.000   1st Qu.:0.0000   Class :character   Class :character  
 Median :  0.000   Median :0.0000   Mode  :character   Mode  :character  
 Mean   :  5.752   Mean   :0.1888                                        
 3rd Qu.:  0.000   3rd Qu.:0.0000                                        
 Max.   :181.000   Max.   :4.0000                                        
  Year_range        Running_total_by_year_range
 Length:339         Min.   :  0.00             
 Class :character   1st Qu.:  0.00             
 Mode  :character   Median :  0.00             
                    Mean   : 38.68             
                    3rd Qu.: 29.00             
                    Max.   :679.00

glimpse(BEAM)

Rows: 339
Columns: 10
$ table_id                    <chr> "Pork_Adelaide_2017-2021", "Pork_Agona_201…
$ Food_category               <chr> "Pork", "Pork", "Chicken", "Turkey", "Pork…
$ Year_first_ill              <chr> "2,021", "2,021", "2,021", "2,021", "2,021…
$ Serotype                    <chr> "Adelaide", "Agona", "Anatum", "Anatum", "…
$ No_of_illnesses             <int> 0, 0, 0, 0, 0, 0, 0, 10, 0, 0, 0, 0, 0, 0,…
$ No_of_outbreaks             <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, …
$ Pathogen                    <chr> "Salmonella", "Salmonella", "Salmonella", …
$ Year                        <chr> "2,021", "2,021", "2,021", "2,021", "2,021…
$ Year_range                  <chr> "2017-2021", "2017-2021", "2017-2021", "20…
$ Running_total_by_year_range <int> 48, 0, 4, 8, 30, 146, 11, 80, 0, 0, 0, 0, …

skim(BEAM)

Data summary
Name	BEAM
Number of rows	339
Number of columns	10
_______________________
Column type frequency:
character	7
numeric	3
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
table_id	1	20	32	178
Food_category	1	4	7	4
Year_first_ill	1	5	5	3
Serotype	1	5	14	36
Pathogen	1	10	10	1
Year	1	5	5	3
Year_range	1	9	9	3

Variable type: numeric

skim_variable	complete_rate	mean	sd	p75	p100	hist
No_of_illnesses	1	5.75	23.54	0	181	▇▁▁▁▁
No_of_outbreaks	1	0.19	0.58	0	4	▇▁▁▁▁
Running_total_by_year_range	1	38.68	100.92	29	679	▇▁▁▁▁

There are 339 observations of 10 variables. Some of these variables, like Year, probably shouldn’t be characters. Some of the columns also feel unnecessary, like table_id, which looks like it just combines info from several other columns. Pathogen is Salmonella for every row, so it is unhelpful (we can see it has only one unique value). It does look like there’s no missing data, though. There are 36 unique serovars, which is cool!

Let’s remove the useless variables:

d1 <- data.frame(BEAM$Food_category, BEAM$Year_first_ill, BEAM$Serotype, BEAM$No_of_illnesses, BEAM$No_of_outbreaks, BEAM$Year, BEAM$Year_range, BEAM$Running_total_by_year_range )
skim(d1)

Data summary
Name	d1
Number of rows	339
Number of columns	8
_______________________
Column type frequency:
character	5
numeric	3
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
BEAM.Food_category	1	4	7	4
BEAM.Year_first_ill	1	5	5	3
BEAM.Serotype	1	5	14	36
BEAM.Year	1	5	5	3
BEAM.Year_range	1	9	9	3

Variable type: numeric

skim_variable	complete_rate	mean	sd	p75	p100	hist
BEAM.No_of_illnesses	1	5.75	23.54	0	181	▇▁▁▁▁
BEAM.No_of_outbreaks	1	0.19	0.58	0	4	▇▁▁▁▁
BEAM.Running_total_by_year_range	1	38.68	100.92	29	679	▇▁▁▁▁

#Note to future self: this renames the variables to BEAM.variable
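The `BEAM.` prefix appears because `data.frame()` names each column after the expression passed to it. As an alternative, here is a minimal base R sketch that drops columns by name and keeps the original names. `beam_toy` and `d1_alt` are made-up names, and the toy values only stand in for the real `BEAM` data.

```r
# Toy stand-in for BEAM (hypothetical rows, only a few of the real columns)
beam_toy <- data.frame(
  table_id      = c("Pork_Adelaide_2017-2021", "Pork_Agona_2017-2021"),
  Food_category = c("Pork", "Pork"),
  Serotype      = c("Adelaide", "Agona"),
  Pathogen      = c("Salmonella", "Salmonella"),
  Year          = c("2,021", "2,021")
)

# Drop the two unhelpful columns by name; the remaining columns keep their names
d1_alt <- subset(beam_toy, select = -c(table_id, Pathogen))

names(d1_alt)
```

With dplyr loaded, `select(BEAM, -table_id, -Pathogen)` should give the same result. The rest of this page uses the `BEAM.`-prefixed names, so switching approaches would mean updating the later code too.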

Let’s turn the year columns into numeric variables:

d1$BEAM.Year <- decomma(d1$BEAM.Year)
glimpse(d1$BEAM.Year)

 num [1:339] 2021 2021 2021 2021 2021 ...

d1$BEAM.Year_first_ill <- decomma(d1$BEAM.Year_first_ill)
glimpse(d1$BEAM.Year_first_ill)

 num [1:339] 2021 2021 2021 2021 2021 ...
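If adding a package just for this step is undesirable, the same conversion can be sketched in base R by stripping the comma and converting the result. The vector below is example input formatted like the BEAM export, not the real column.

```r
# Example year strings with a thousands separator, as in the BEAM export
year_chr <- c("2,021", "2,022", "2,023")

# Remove the literal comma, then convert the cleaned strings to numeric
year_num <- as.numeric(gsub(",", "", year_chr, fixed = TRUE))

year_num
```

`fixed = TRUE` treats the comma as a plain character rather than a regular expression, which is all that is needed here.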

I want to make sure that the categorical variables are actually stored as factors:

d1$BEAM.Serotype <- as.factor(d1$BEAM.Serotype)
d1$BEAM.Food_category <- as.factor(d1$BEAM.Food_category)
skim(d1)

Data summary
Name	d1
Number of rows	339
Number of columns	8
_______________________
Column type frequency:
character	1
factor	2
numeric	5
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
BEAM.Year_range	0	1	9	9	0	3	0

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
BEAM.Food_category	0	1	FALSE	4	Chi: 108, Por: 102, Bee: 79, Tur: 50
BEAM.Serotype	0	1	FALSE	36	I 4: 23, Bra: 21, Mue: 21, Ent: 20

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
BEAM.Year_first_ill	1	2021.61	0.71	2021	2021	2021	2022	2023	▇▁▅▁▂
BEAM.No_of_illnesses	1	5.75	23.54	0	0	0	0	181	▇▁▁▁▁
BEAM.No_of_outbreaks	1	0.19	0.58	0	0	0	0	4	▇▁▁▁▁
BEAM.Year	1	2022.27	0.75	2021	2022	2022	2023	2023	▃▁▆▁▇
BEAM.Running_total_by_year_range	1	38.68	100.92	0	0	0	29	679	▇▁▁▁▁

summary(d1)

 BEAM.Food_category BEAM.Year_first_ill        BEAM.Serotype
 Beef   : 79        Min.   :2021        I 4,[5],12:i:-: 23  
 Chicken:108        1st Qu.:2021        Braenderup    : 21  
 Pork   :102        Median :2021        Muenchen      : 21  
 Turkey : 50        Mean   :2022        Enteritidis   : 20  
                    3rd Qu.:2022        Typhimurium   : 19  
                    Max.   :2023        Newport       : 18  
                                        (Other)       :217  
 BEAM.No_of_illnesses BEAM.No_of_outbreaks   BEAM.Year    BEAM.Year_range   
 Min.   :  0.000      Min.   :0.0000       Min.   :2021   Length:339        
 1st Qu.:  0.000      1st Qu.:0.0000       1st Qu.:2022   Class :character  
 Median :  0.000      Median :0.0000       Median :2022   Mode  :character  
 Mean   :  5.752      Mean   :0.1888       Mean   :2022                     
 3rd Qu.:  0.000      3rd Qu.:0.0000       3rd Qu.:2023                     
 Max.   :181.000      Max.   :4.0000       Max.   :2023                     
                                                                            
 BEAM.Running_total_by_year_range
 Min.   :  0.00                  
 1st Qu.:  0.00                  
 Median :  0.00                  
 Mean   : 38.68                  
 3rd Qu.: 29.00                  
 Max.   :679.00

I can now see that the most common serovars are monophasic Typhimurium (I 4,[5],12:i:-), Braenderup, Muenchen, Enteritidis, Typhimurium, and Newport.

This just leaves year range as something that should maybe be fixed. This one is tricky though, because each outbreak is going to have a separate year range. Also given how I filtered the data by year first ill this column may not be helpful.
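One possible way to make the year range usable is to split it into numeric start and end years. This is a minimal sketch on example strings; the names `range_start` and `range_end` are made up for illustration, and whether the end of the range always equals Year in the real data would still need checking.

```r
# Example year-range strings in the same "start-end" format as Year_range
year_range <- c("2017-2021", "2018-2022", "2019-2023")

# Keep everything before the hyphen as the start year
range_start <- as.integer(sub("-.*$", "", year_range))

# Keep everything after the hyphen as the end year
range_end <- as.integer(sub("^.*-", "", year_range))

data.frame(year_range, range_start, range_end)
```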

I want to know which commodities are associated with my top serovars. Note that the subset below keeps five of the top six; Enteritidis is not in the filter.

d2 <- subset(d1, BEAM.Serotype %in% c("Typhimurium", "Braenderup", "Muenchen", "Newport", "I 4,[5],12:i:-"))
summary(d2)

 BEAM.Food_category BEAM.Year_first_ill        BEAM.Serotype
 Beef   :30         Min.   :2021        I 4,[5],12:i:-:23   
 Chicken:25         1st Qu.:2021        Braenderup    :21   
 Pork   :30         Median :2021        Muenchen      :21   
 Turkey :17         Mean   :2022        Typhimurium   :19   
                    3rd Qu.:2022        Newport       :18   
                    Max.   :2023        Adelaide      : 0   
                                        (Other)       : 0   
 BEAM.No_of_illnesses BEAM.No_of_outbreaks   BEAM.Year    BEAM.Year_range   
 Min.   : 0.000       Min.   :0.0000       Min.   :2021   Length:102        
 1st Qu.: 0.000       1st Qu.:0.0000       1st Qu.:2022   Class :character  
 Median : 0.000       Median :0.0000       Median :2022   Mode  :character  
 Mean   : 7.088       Mean   :0.2549       Mean   :2022                     
 3rd Qu.: 0.000       3rd Qu.:0.0000       3rd Qu.:2023                     
 Max.   :81.000       Max.   :2.0000       Max.   :2023                     
                                                                            
 BEAM.Running_total_by_year_range
 Min.   :  0.00                  
 1st Qu.:  0.00                  
 Median :  0.00                  
 Mean   : 48.62                  
 3rd Qu.: 80.00                  
 Max.   :551.00
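The summary still lists Adelaide and (Other) with a count of 0 because subsetting a factor keeps all of its original levels. A small sketch of how `droplevels()` removes the unused ones, using a toy factor rather than the real column:

```r
# Toy factor with three serovars
serotype <- factor(c("Adelaide", "Newport", "Typhimurium", "Newport"))

# Filtering keeps the values we want but retains the unused "Adelaide" level
kept <- serotype[serotype %in% c("Newport", "Typhimurium")]
nlevels(kept)

# droplevels() discards levels that no longer occur in the data
kept_clean <- droplevels(kept)
levels(kept_clean)
```

Applied to a whole data frame, `droplevels(d2)` would do the same for every factor column, which also keeps empty categories out of plot axes and legends.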

p1 <- d2 %>% ggplot(aes(x=BEAM.Food_category)) + geom_bar()
plot(p1)

p2 <- d2 %>% ggplot(aes(x=BEAM.Serotype)) + geom_bar()
plot(p2)

p3 <- d2 %>% ggplot(aes(fill=BEAM.Serotype, x=BEAM.Food_category)) + geom_bar()
plot(p3)

These plots show counts for the top commodities and the top serovars, then how each commodity breaks down by serovar!
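The stacked bars show counts. To read off the share of each serovar within a commodity as numbers, a two-way table can be converted to row proportions. This is a sketch on a toy data frame with made-up rows, not the BEAM data.

```r
# Toy rows: one row per commodity/serovar record (hypothetical values)
toy <- data.frame(
  Food_category = c("Beef", "Beef", "Pork", "Pork", "Pork", "Chicken"),
  Serotype      = c("Newport", "Typhimurium", "Newport", "Newport",
                    "Muenchen", "Braenderup")
)

# Cross-tabulate commodity (rows) by serovar (columns)
counts <- table(toy$Food_category, toy$Serotype)

# margin = 1 divides each row by its total, so every commodity sums to 1
props <- prop.table(counts, margin = 1)

round(props, 2)
```

In ggplot2, `geom_bar(position = "fill")` draws the same within-commodity proportions as bars scaled to a height of 1.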

This section is contributed by Rebecca Basta

AI prompt used: “Write R code to generate synthetic outbreak data similar in structure to CDC BEAM Salmonella dataset with 339 rows, 10 columns, and categorical + count variables.”

set.seed(123)  # reproducibility

# Number of observations similar to original
n <- 339

# Create synthetic dataset
synthetic_data <- data.frame(
  Food_category = sample(
    c("Chicken", "Beef", "Pork", "Eggs", "Vegetables", 
      "Fruit", "Seafood", "Dairy", "Turkey"),
    n, replace = TRUE
  ),
  
  Serotype = sample(
    paste("Serotype", 1:36),
    n, replace = TRUE
  ),
  
  Year_first_ill = sample(2021:2024, n, replace = TRUE),
  
  Year = sample(2021:2024, n, replace = TRUE),
  
  No_of_illnesses = rpois(n, lambda = 25),
  
  No_of_outbreaks = rpois(n, lambda = 3)
)

# Running total of illnesses across all rows, ordered by Year (not grouped by year range)
synthetic_data <- synthetic_data %>%
  arrange(Year) %>%
  mutate(Running_total_by_year_range = cumsum(No_of_illnesses))

glimpse(synthetic_data)

Rows: 339
Columns: 7
$ Food_category               <chr> "Beef", "Pork", "Seafood", "Pork", "Vegeta…
$ Serotype                    <chr> "Serotype 18", "Serotype 8", "Serotype 32"…
$ Year_first_ill              <int> 2021, 2021, 2021, 2023, 2021, 2024, 2022, …
$ Year                        <int> 2021, 2021, 2021, 2021, 2021, 2021, 2021, …
$ No_of_illnesses             <int> 19, 28, 26, 16, 28, 25, 25, 28, 19, 18, 28…
$ No_of_outbreaks             <int> 2, 7, 4, 1, 5, 3, 4, 1, 3, 4, 2, 4, 2, 0, …
$ Running_total_by_year_range <int> 19, 47, 73, 89, 117, 142, 167, 195, 214, 2…

summary(synthetic_data)

 Food_category        Serotype         Year_first_ill      Year     
 Length:339         Length:339         Min.   :2021   Min.   :2021  
 Class :character   Class :character   1st Qu.:2021   1st Qu.:2022  
 Mode  :character   Mode  :character   Median :2023   Median :2023  
                                       Mean   :2022   Mean   :2023  
                                       3rd Qu.:2023   3rd Qu.:2023  
                                       Max.   :2024   Max.   :2024  
 No_of_illnesses No_of_outbreaks Running_total_by_year_range
 Min.   :12.00   Min.   :0.000   Min.   :  19               
 1st Qu.:21.00   1st Qu.:2.000   1st Qu.:2079               
 Median :24.00   Median :3.000   Median :4153               
 Mean   :24.59   Mean   :3.006   Mean   :4157               
 3rd Qu.:28.00   3rd Qu.:4.000   3rd Qu.:6268               
 Max.   :41.00   Max.   :8.000   Max.   :8336

# Rename five of the synthetic serotypes so they match the top serovars plotted above.
# Before I did this, the plots were blank. I asked AI why, and it was because the serotype values didn't match the ones in the subset filter, which is why I recoded them.
synthetic_data <- synthetic_data %>%
  mutate(
    Serotype = recode(
      Serotype,
      "Serotype 1" = "Typhimurium",
      "Serotype 2" = "Braenderup",
      "Serotype 3" = "Muenchen",
      "Serotype 4" = "Newport",
      "Serotype 5" = "I 4,[5],12:i:-"
    )
  )

# Subset to the same food categories and serotypes as the original plots so the two sets of plots are comparable
plot_subset <- subset(
  synthetic_data,
  Food_category %in% c("Beef", "Chicken", "Pork", "Turkey") &
  Serotype %in% c(
    "Typhimurium",
    "Braenderup",
    "Muenchen",
    "Newport",
    "I 4,[5],12:i:-"
  )
)

Plots

p1 <- ggplot(plot_subset, aes(x = Food_category)) +
  geom_bar() +
  labs(x = "Food_category (synthetic data)", y = "count")
p1

p2 <- ggplot(plot_subset, aes(x = Serotype)) +
  geom_bar() +
  labs(x = "Serotype (synthetic data)", y = "count")
p2

p3 <- ggplot(plot_subset, aes(x = Food_category, fill = Serotype)) +
  geom_bar() +
  labs(x = "Food_category", y = "count", fill = "Serotype")
p3

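The synthetic data is a little different than the original data because the food category and serotype are assigned at random in the synthetic data, and it includes food categories (such as Eggs and Seafood) that are not in the original. The counts also differ: the synthetic illnesses and outbreaks are Poisson draws, so they are mostly nonzero, while most rows in the original are zero. Both datasets have 339 rows, but the synthetic data has 7 columns rather than 10 (it has no table_id, Pathogen, or Year_range).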
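One way the synthetic data could be brought closer to the original is to sample food categories in their observed proportions and to make most counts zero. The sketch below is illustrative only: the category counts come from summary(d1) above, while the 10% nonzero rate, the Poisson mean of 25, and the name `synthetic_alt` are assumptions rather than values estimated from the data.

```r
set.seed(123)  # reproducibility
n <- 339

# Observed food category counts, copied from summary(d1)
food_counts <- c(Beef = 79, Chicken = 108, Pork = 102, Turkey = 50)

synthetic_alt <- data.frame(
  # Sample categories with probability proportional to the observed counts
  Food_category = sample(names(food_counts), n, replace = TRUE,
                         prob = food_counts / sum(food_counts)),
  # Zero-inflated counts: an assumed 10% of rows get a Poisson draw
  # (assumed mean 25); every other row is 0
  No_of_illnesses = rbinom(n, size = 1, prob = 0.10) * rpois(n, lambda = 25)
)

# Share of rows with zero illnesses
mean(synthetic_alt$No_of_illnesses == 0)
```

Multiplying a 0/1 Bernoulli draw by a Poisson draw is a simple way to get a zero-heavy count variable like No_of_illnesses in the original, whose median and third quartile are both 0.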