Example Report Template for a Data Analysis Project

The structure below is one possible setup for a report stemming from a data analysis project. It loosely follows the structure of a standard scientific manuscript. Adjust as needed. You don’t need to have exactly these sections, but the content those sections cover should be addressed.

This template uses HTML as its output format. See the Quarto documentation for instructions on how to use other formats.

Authors

Esther Palmer\(^{1,2,*}\)
Katherine Lorusso\(^{3, *}\)

Author affiliations

Department of Microbiology, University of Georgia, Athens, GA, USA.
Department of Population Health, University of Georgia, GA, USA.
University of Georgia.

\(*\) These authors contributed equally to this work.

\(\land\) Corresponding author: some@email.com

\(\dagger\) Disclaimer: The opinions expressed in this article are the authors’ own and don’t reflect the views of their employer.

1 Summary

This project was part of an exercise for the MADA course to learn about data cleaning and the READY workflow.

2 Methods

Describe your methods. This section should describe the data, the cleaning processes, and the analysis approaches. You might want to provide a shorter description here and all the details in the supplement.

Data included height, weight, gender, pets owned, and number of books read in the last year. Data was cleaned by correcting or removing invalid entries and dropping individuals with missing values. Various plots and tables were made to explore the data, and a linear model was then fit for the final analysis.

2.1 Data acquisition

Data was made up by Esther Palmer in order to have something for this exercise.

2.2 Data import and cleaning

Packages used:

library(readxl) #for loading Excel files
library(dplyr) #for data processing/cleaning


library(tidyr) #for data processing/cleaning
library(skimr) #for nice visualization of data 
library(here) #to set paths

Load data

data_location <- here::here("data","raw-data","exampledata2.xlsx")
rawdata <- readxl::read_excel(data_location)
codebook <- readxl::read_excel(data_location, sheet ="Codebook")

Clean Data

# Drop the row where Height was entered as text, then convert Height to numeric
d1 <- rawdata %>% dplyr::filter( Height != "sixty" ) %>% 
                  dplyr::mutate(Height = as.numeric(Height))

skimr::skim(d1)

Data summary
Name	d1
Number of rows	13
Number of columns	5
_______________________
Column type frequency:
character	2
numeric	3
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
Gender	0	1	1	2	0	5	0
Pets_owned	0	1	3	6	0	3	0

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Height	0	1.00	151.62	46.46	6	154.00	165	175	192	▁▁▁▂▇
Weight	1	0.92	647.92	2000.48	45	54.75	73	90	7000	▇▁▁▁▁
Number_books_read	0	1.00	8.15	10.84	0	1.00	3	11	32	▇▁▁▂▁

hist(d1$Height)

# A height of 6 is assumed to be in feet; convert it to cm (1 ft = 30.48 cm)
d2 <- d1 %>% dplyr::mutate( Height = replace(Height, Height == 6, round(6*30.48, 0)) )
skimr::skim(d2)

Data summary
Name	d2
Number of rows	13
Number of columns	5
_______________________
Column type frequency:
character	2
numeric	3
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
Gender	0	1	1	2	0	5	0
Pets_owned	0	1	3	6	0	3	0

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Height	0	1.00	165.23	16.52	133	155.00	166	178	192	▂▇▆▆▃
Weight	1	0.92	647.92	2000.48	45	54.75	73	90	7000	▇▁▁▁▁
Number_books_read	0	1.00	8.15	10.84	0	1.00	3	11	32	▇▁▁▂▁

# Remove the implausible weight of 7000 and any rows with missing values
d3 <- d2 %>% dplyr::filter(Weight != 7000) %>% tidyr::drop_na()
skimr::skim(d3)

Data summary
Name	d3
Number of rows	11
Number of columns	5
_______________________
Column type frequency:
character	2
numeric	3
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
Gender	0	1	1	2	0	5	0
Pets_owned	0	1	3	6	0	3	0

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Height	1	167.09	16.81	133	155.5	166	179.0	192	▂▇▅▇▅
Weight	1	70.45	20.65	45	54.5	70	85.0	110	▇▂▃▃▂
Number_books_read	1	7.36	10.26	0	1.5	3	8.5	32	▇▁▁▁▁

d3$Gender <- as.factor(d3$Gender)  
skimr::skim(d3)

Data summary
Name	d3
Number of rows	11
Number of columns	5
_______________________
Column type frequency:
character	1
factor	1
numeric	3
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
Pets_owned	0	1	3	6	0	3	0

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
Gender	0	1	FALSE	5	M: 4, F: 3, O: 2, N: 1

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Height	1	167.09	16.81	133	155.5	166	179.0	192	▂▇▅▇▅
Weight	1	70.45	20.65	45	54.5	70	85.0	110	▇▂▃▃▂
Number_books_read	1	7.36	10.26	0	1.5	3	8.5	32	▇▁▁▁▁

d3$Pets_owned <- as.factor(d3$Pets_owned)  
skimr::skim(d3)

Data summary
Name	d3
Number of rows	11
Number of columns	5
_______________________
Column type frequency:
factor	2
numeric	3
________________________
Group variables	None

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
Gender	0	1	FALSE	5	M: 4, F: 3, O: 2, N: 1
Pets_owned	0	1	FALSE	3	Cat: 5, Dog: 5, Liz: 1

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Height	1	167.09	16.81	133	155.5	166	179.0	192	▂▇▅▇▅
Weight	1	70.45	20.65	45	54.5	70	85.0	110	▇▂▃▃▂
Number_books_read	1	7.36	10.26	0	1.5	3	8.5	32	▇▁▁▁▁

# Drop rows with unclear gender entries ("NA" or "N") and remove the now-unused factor levels
d4 <- d3 %>% dplyr::filter( !(Gender %in% c("NA","N")) ) %>% droplevels()
skimr::skim(d4)

Data summary
Name	d4
Number of rows	9
Number of columns	5
_______________________
Column type frequency:
factor	2
numeric	3
________________________
Group variables	None

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
Gender	0	1	FALSE	3	M: 4, F: 3, O: 2
Pets_owned	0	1	FALSE	3	Cat: 4, Dog: 4, Liz: 1

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Height	1	165.67	15.98	133	156	166	178	183	▂▁▃▃▇
Weight	1	70.11	21.25	45	55	70	80	110	▇▂▃▂▂
Number_books_read	1	7.78	11.09	0	2	3	6	32	▇▁▁▁▁

processeddata2 <- d4

save_data_location <- here::here("data","processed-data","processeddata2.rds")
saveRDS(processeddata2, file = save_data_location)

3 Results

3.1 Exploratory/Descriptive analysis

Use a combination of text/tables/figures to explore and describe your data. Show the most important descriptive results here. Additional ones should go in the supplement. Even more can be in the R and Quarto files that are part of your project.

Table 1 shows a summary of the data.

Note that the data can be loaded by providing a relative path using the ../../ notation (two dots mean one folder up). You never want to specify an absolute path like C:\yourname\yourproject\results\ because if you share the project with someone, it won’t work for them since they don’t have that path. You can also use the here R package to create paths, as in the data import and saving code in Section 2.2. I generally recommend the here package.
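As a rough illustration of the two approaches, the sketch below builds the same location once with the here package and once as a relative path. The folder and file names are those used in the saving step of Section 2.2. That the report sits two folders below the project root is an assumption made for this example.

```r
# Sketch: project-relative paths instead of absolute ones.
# Folder and file names follow the saving step in Section 2.2.

# here::here() builds the path starting from the project root,
# so the same call works on any machine that has the project folder.
data_path <- here::here("data", "processed-data", "processeddata2.rds")

# Equivalent relative path; this only resolves correctly if the working
# directory is two folders below the project root (an assumption here).
relative_path <- file.path("..", "..", "data", "processed-data", "processeddata2.rds")

# Load the data only if the file is actually present.
if (file.exists(data_path)) {
  mydata <- readRDS(data_path)
}
```

The here version keeps working if the report file is moved to a different subfolder, while the relative path would need to be updated.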

Table 1: Data summary table. All caption text goes here.

skim_type	skim_variable	complete_rate	character.min	character.max	character.empty	character.n_unique	character.whitespace	factor.ordered	factor.n_unique	factor.top_counts	numeric.mean	numeric.sd	numeric.p0	numeric.p25	numeric.p50	numeric.p75	numeric.p100	numeric.hist
character	Pets_owned	1	3	6	0	3	0	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
factor	Gender	1	NA	NA	NA	NA	NA	FALSE	3	M: 4, F: 3, O: 2	NA	NA	NA	NA	NA	NA	NA	NA
numeric	Height	1	NA	NA	NA	NA	NA	NA	NA	NA	165.666667	15.97655	133	156	166	178	183	▂▁▃▃▇
numeric	Weight	1	NA	NA	NA	NA	NA	NA	NA	NA	70.111111	21.24526	45	55	70	80	110	▇▂▃▂▂
numeric	Number_books_read	1	NA	NA	NA	NA	NA	NA	NA	NA	7.777778	11.08803	0	2	3	6	32	▇▁▁▁▁

3.2 Basic statistical analysis

To get some further insight into your data, if reasonable you could compute simple statistics (e.g. simple models with 1 predictor) to look for associations between your outcome(s) and each individual predictor variable. Though note that unless you pre-specified the outcome and main exposure, any “p<0.05 means statistical significance” interpretation is not valid.
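One way to compute such single-predictor models is sketched below. The data frame `toy` is made up purely for illustration and is not the project data. The column names mirror those in the processed data, and the choice of Height as the outcome is an assumption.

```r
# Toy data, made up for illustration only (not the project data).
toy <- data.frame(
  Height = c(150, 160, 165, 170, 175, 180, 185, 190),
  Weight = c(50, 58, 62, 70, 72, 80, 85, 95),
  Number_books_read = c(3, 10, 0, 5, 2, 8, 1, 4)
)

outcome <- "Height"                       # assumed outcome for this sketch
predictors <- c("Weight", "Number_books_read")

# Fit one simple linear model per predictor and keep that predictor's row
# of the coefficient table.
univariate <- lapply(predictors, function(p) {
  fit <- lm(reformulate(p, response = outcome), data = toy)
  est <- coef(summary(fit))[p, ]
  data.frame(
    predictor = p,
    estimate  = est[["Estimate"]],
    std.error = est[["Std. Error"]],
    p.value   = est[["Pr(>|t|)"]]
  )
})
univariate <- do.call(rbind, univariate)
print(univariate)
```

The resulting table has one row per predictor, which makes it easy to scan for associations. The p-values are descriptive only, for the reason given above.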

Figure 1 shows a scatterplot produced by one of the R scripts.

Figure 1: Height and weight stratified by gender.

3.3 Full analysis

Use one or several suitable statistical/machine learning methods to analyze your data and to produce meaningful figures, tables, etc. This might again be code that is best placed in one or several separate R scripts that need to be well documented. You want the code to produce figures and data ready for display as tables, and save those. Then you load them here.
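The save-then-load pattern described above could look roughly like the following sketch. The data frame `toy` is made up for illustration, and `tempfile()` stands in for a real location in the project's results folder. The model formula is an assumption chosen to resemble the terms in Table 2.

```r
# Toy data, made up for illustration only (not the project data).
toy <- data.frame(
  Height = c(150, 160, 165, 170, 175, 180, 185, 190),
  Weight = c(50, 58, 62, 70, 72, 80, 85, 95),
  Gender = factor(c("F", "F", "M", "F", "M", "O", "M", "O"))
)

# --- In the analysis script: fit the model and save a display-ready table ---
fit <- lm(Height ~ Weight + Gender, data = toy)
fit_table <- as.data.frame(coef(summary(fit)))
fit_table <- cbind(term = rownames(fit_table), fit_table)
rownames(fit_table) <- NULL

# tempfile() is a placeholder for a project path built with here::here().
table_file <- tempfile(fileext = ".rds")
saveRDS(fit_table, file = table_file)

# --- In the report: load the saved table and display it ---
loaded_table <- readRDS(table_file)
print(loaded_table)
```

Keeping the model fitting in a script and only loading the saved table in the report means the report renders quickly and always shows the same results the script produced.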

Table 2 shows an example summary of a linear model fit.

Table 2: Linear model fit table.

term	estimate	std.error	statistic	p.value
(Intercept)	149.2726967	23.3823360	6.3839942	0.0013962
Weight	0.2623972	0.3512436	0.7470519	0.4886517
GenderM	-2.1244913	15.5488953	-0.1366329	0.8966520
GenderO	-4.7644739	19.0114155	-0.2506112	0.8120871

4 Discussion

4.1 Conclusions

It seems that the number of books read is negatively, although not significantly, correlated with height.