Example Report Template for a Data Analysis Project

The structure below is one possible setup for a report stemming from a data analysis project. It loosely follows the structure of a standard scientific manuscript. Adjust as needed. You don’t need exactly these sections, but the content they cover should be addressed.

This uses HTML as output format. See the Quarto documentation for instructions on how to use other formats.

Authors

Author affiliations

  1. Department of Microbiology, University of Georgia, Athens, GA, USA.
  2. Department of Population Health, University of Georgia, Athens, GA, USA.
  3. University of Georgia.

\(*\) These authors contributed equally to this work.

\(\land\) Corresponding author: some@email.com

\(\dagger\) Disclaimer: The opinions expressed in this article are the authors’ own and do not reflect the views of their employer.

1 Summary

This project was part of an exercise for the MADA course to learn about data cleaning and the READY workflow.

2 Methods

Describe your methods. That should describe the data, the cleaning processes, and the analysis approaches. You might want to provide a shorter description here and all the details in the supplement.

The data included height, weight, gender, pets owned, and number of books read in the last year. The data were cleaned by removing individuals with missing or implausible values. Various plots and tables were made to explore the data. A linear model was then fit to the cleaned data for the final analysis.

2.1 Data acquisition

Data was made up by Esther Palmer in order to have something for this exercise.

2.2 Data import and cleaning

Packages used:

library(readxl) #for loading Excel files
library(dplyr) #for data processing/cleaning

library(tidyr) #for data processing/cleaning
library(skimr) #for nice visualization of data 
library(here) #to set paths

Load data

data_location <- here::here("data","raw-data","exampledata2.xlsx")
rawdata <- readxl::read_excel(data_location)
codebook <- readxl::read_excel(data_location, sheet ="Codebook")

Clean Data

d1 <- rawdata %>% dplyr::filter( Height != "sixty" ) %>% 
                  dplyr::mutate(Height = as.numeric(Height))

skimr::skim(d1)
Data summary
Name d1
Number of rows 13
Number of columns 5
_______________________
Column type frequency:
character 2
numeric 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Gender 0 1 1 2 0 5 0
Pets_owned 0 1 3 6 0 3 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Height 0 1.00 151.62 46.46 6 154.00 165 175 192 ▁▁▁▂▇
Weight 1 0.92 647.92 2000.48 45 54.75 73 90 7000 ▇▁▁▁▁
Number_books_read 0 1.00 8.15 10.84 0 1.00 3 11 32 ▇▁▁▂▁
hist(d1$Height)

d2 <- d1 %>% dplyr::mutate( Height = replace(Height, Height == 6, round(6*30.48,0)) ) #height of 6 is assumed to be in feet, convert to cm
skimr::skim(d2)
Data summary
Name d2
Number of rows 13
Number of columns 5
_______________________
Column type frequency:
character 2
numeric 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Gender 0 1 1 2 0 5 0
Pets_owned 0 1 3 6 0 3 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Height 0 1.00 165.23 16.52 133 155.00 166 178 192 ▂▇▆▆▃
Weight 1 0.92 647.92 2000.48 45 54.75 73 90 7000 ▇▁▁▁▁
Number_books_read 0 1.00 8.15 10.84 0 1.00 3 11 32 ▇▁▁▂▁
d3 <- d2 %>%  dplyr::filter(Weight != 7000) %>% tidyr::drop_na()
skimr::skim(d3)
Data summary
Name d3
Number of rows 11
Number of columns 5
_______________________
Column type frequency:
character 2
numeric 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Gender 0 1 1 2 0 5 0
Pets_owned 0 1 3 6 0 3 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Height 0 1 167.09 16.81 133 155.5 166 179.0 192 ▂▇▅▇▅
Weight 0 1 70.45 20.65 45 54.5 70 85.0 110 ▇▂▃▃▂
Number_books_read 0 1 7.36 10.26 0 1.5 3 8.5 32 ▇▁▁▁▁
d3$Gender <- as.factor(d3$Gender)  
skimr::skim(d3)
Data summary
Name d3
Number of rows 11
Number of columns 5
_______________________
Column type frequency:
character 1
factor 1
numeric 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Pets_owned 0 1 3 6 0 3 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Gender 0 1 FALSE 5 M: 4, F: 3, O: 2, N: 1

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Height 0 1 167.09 16.81 133 155.5 166 179.0 192 ▂▇▅▇▅
Weight 0 1 70.45 20.65 45 54.5 70 85.0 110 ▇▂▃▃▂
Number_books_read 0 1 7.36 10.26 0 1.5 3 8.5 32 ▇▁▁▁▁
d3$Pets_owned <- as.factor(d3$Pets_owned)  
skimr::skim(d3)
Data summary
Name d3
Number of rows 11
Number of columns 5
_______________________
Column type frequency:
factor 2
numeric 3
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Gender 0 1 FALSE 5 M: 4, F: 3, O: 2, N: 1
Pets_owned 0 1 FALSE 3 Cat: 5, Dog: 5, Liz: 1

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Height 0 1 167.09 16.81 133 155.5 166 179.0 192 ▂▇▅▇▅
Weight 0 1 70.45 20.65 45 54.5 70 85.0 110 ▇▂▃▃▂
Number_books_read 0 1 7.36 10.26 0 1.5 3 8.5 32 ▇▁▁▁▁
d4 <- d3 %>% dplyr::filter( !(Gender %in% c("NA","N")) ) %>% droplevels()
skimr::skim(d4)
Data summary
Name d4
Number of rows 9
Number of columns 5
_______________________
Column type frequency:
factor 2
numeric 3
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Gender 0 1 FALSE 3 M: 4, F: 3, O: 2
Pets_owned 0 1 FALSE 3 Cat: 4, Dog: 4, Liz: 1

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Height 0 1 165.67 15.98 133 156 166 178 183 ▂▁▃▃▇
Weight 0 1 70.11 21.25 45 55 70 80 110 ▇▂▃▂▂
Number_books_read 0 1 7.78 11.09 0 2 3 6 32 ▇▁▁▁▁
processeddata2 <- d4

save_data_location <- here::here("data","processed-data","processeddata2.rds")
saveRDS(processeddata2, file = save_data_location)

3 Results

3.1 Exploratory/Descriptive analysis

Use a combination of text/tables/figures to explore and describe your data. Show the most important descriptive results here. Additional ones should go in the supplement. Even more can be in the R and Quarto files that are part of your project.

Table 1 shows a summary of the data.

Note that the data are loaded using a relative path with the ../../ notation (two dots mean one folder up). Never specify an absolute path like C:\yourname\yourproject\results\, because it won’t work for anyone who doesn’t have that exact path on their machine. You can also use the here R package to create paths. See examples of that below. I generally recommend the here package.
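As a minimal sketch of the two approaches, assuming the folder layout used earlier in this report (data/processed-data/processeddata2.rds) and a report file that sits two folders below the project root (an assumption, not something the report states):

```r
# Relative path: only resolves correctly if the working directory is
# two levels below the project root (an assumption about the file location).
relative_path <- file.path("..", "..", "data", "processed-data", "processeddata2.rds")
print(relative_path)

# here::here() builds the path starting from the project root, so it does not
# depend on the current working directory.
# Guarded so the sketch still runs if the 'here' package is not installed.
if (requireNamespace("here", quietly = TRUE)) {
  rooted_path <- here::here("data", "processed-data", "processeddata2.rds")
  print(rooted_path)
}
```

Either path could then be passed to readRDS(). The here version keeps working if the report file is moved to a different folder inside the project, which is the main reason to prefer it.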

Table 1: Data summary table. All caption text goes here.
skim_type skim_variable n_missing complete_rate character.min character.max character.empty character.n_unique character.whitespace factor.ordered factor.n_unique factor.top_counts numeric.mean numeric.sd numeric.p0 numeric.p25 numeric.p50 numeric.p75 numeric.p100 numeric.hist
character Pets_owned 0 1 3 6 0 3 0 NA NA NA NA NA NA NA NA NA NA NA
factor Gender 0 1 NA NA NA NA NA FALSE 3 M: 4, F: 3, O: 2 NA NA NA NA NA NA NA NA
numeric Height 0 1 NA NA NA NA NA NA NA NA 165.666667 15.97655 133 156 166 178 183 ▂▁▃▃▇
numeric Weight 0 1 NA NA NA NA NA NA NA NA 70.111111 21.24526 45 55 70 80 110 ▇▂▃▂▂
numeric Number_books_read 0 1 NA NA NA NA NA NA NA NA 7.777778 11.08803 0 2 3 6 32 ▇▁▁▁▁

3.2 Basic statistical analysis

To get some further insight into your data, if reasonable you could compute simple statistics (e.g. simple models with 1 predictor) to look for associations between your outcome(s) and each individual predictor variable. Though note that unless you pre-specified the outcome and main exposure, any “p<0.05 means statistical significance” interpretation is not valid.
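A minimal sketch of such one-predictor models is shown here. The data frame toy is made up for illustration and is not the data from this report; the variable names simply mirror the ones used above.

```r
# Made-up illustration data (NOT the data from this report).
toy <- data.frame(
  Height = c(150, 160, 165, 170, 175, 180, 185, 190),
  Weight = c(50, 55, 62, 68, 70, 80, 85, 95),
  Number_books_read = c(3, 0, 12, 5, 1, 8, 2, 4)
)

predictors <- c("Weight", "Number_books_read")

# Fit one simple linear model per predictor and keep its slope and p-value.
univariate <- do.call(rbind, lapply(predictors, function(p) {
  fit <- lm(reformulate(p, response = "Height"), data = toy)
  coefs <- summary(fit)$coefficients
  data.frame(
    predictor = p,
    slope = coefs[p, "Estimate"],
    p_value = coefs[p, "Pr(>|t|)"]
  )
}))

print(univariate)
```

The resulting table is best read as a descriptive screen of associations, in line with the caveat above about interpreting p-values.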

Figure 1 shows a scatterplot figure produced by one of the R scripts.

Figure 1: Height and weight stratified by gender.

3.3 Full analysis

Use one or several suitable statistical/machine learning methods to analyze your data and to produce meaningful figures, tables, etc. This might again be code that is best placed in one or several separate R scripts that need to be well documented. You want the code to produce figures and data ready for display as tables, and save those. Then you load them here.
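The save-then-load pattern described above might look roughly like this. The data, the temporary folder, and the file name resulttable_example.rds are all hypothetical and used only for illustration; in a real project the first step would live in a separate analysis script and write into a results folder.

```r
# Step 1 (normally in a separate analysis script):
# fit a model and save a display-ready table.
# Made-up data, NOT the data from this report.
toy <- data.frame(
  Height = c(150, 160, 165, 170, 175, 180),
  Weight = c(50, 55, 62, 68, 70, 80)
)
fit <- lm(Height ~ Weight, data = toy)
results_table <- as.data.frame(summary(fit)$coefficients)

# Hypothetical file name, written to a temporary folder for this sketch.
results_file <- file.path(tempdir(), "resulttable_example.rds")
saveRDS(results_table, file = results_file)

# Step 2 (in the report file): load the saved table and display it.
loaded_table <- readRDS(results_file)
print(loaded_table)
```

Keeping the model fitting out of the report file means the report only needs to read and display finished results.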

Example Table 2 shows a summary of a linear model fit.

Table 2: Linear model fit table.
term estimate std.error statistic p.value
(Intercept) 149.2726967 23.3823360 6.3839942 0.0013962
Weight 0.2623972 0.3512436 0.7470519 0.4886517
GenderM -2.1244913 15.5488953 -0.1366329 0.8966520
GenderO -4.7644739 19.0114155 -0.2506112 0.8120871
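The report does not show the code behind Table 2. The column names (term, estimate, std.error, statistic, p.value) match the layout that broom::tidy() returns for a linear model, and the rows are consistent with a model of height on weight and gender, but both are assumptions. A base R sketch that builds a table of the same shape from made-up data:

```r
# Made-up illustration data (NOT the data from this report).
toy <- data.frame(
  Height = c(150, 158, 165, 171, 176, 180, 183, 168, 162),
  Weight = c(50, 54, 60, 72, 78, 85, 110, 70, 45),
  Gender = factor(c("F", "F", "F", "M", "M", "M", "M", "O", "O"))
)

# Assumed model: height explained by weight and gender.
fit <- lm(Height ~ Weight + Gender, data = toy)

# Rearrange the coefficient matrix into one row per model term.
coefs <- summary(fit)$coefficients
fit_table <- data.frame(
  term = rownames(coefs),
  estimate = coefs[, "Estimate"],
  std.error = coefs[, "Std. Error"],
  statistic = coefs[, "t value"],
  p.value = coefs[, "Pr(>|t|)"],
  row.names = NULL
)

print(fit_table)
```

With R's default treatment contrasts, the first factor level (F here) is the reference category, which is why only GenderM and GenderO rows appear; each of those estimates is a difference from the reference level at a given weight.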

4 Discussion

4.1 Conclusions

It seems that reading books is negatively, although not significantly, correlated with height.