# EMIT_QUARANTINE_MAIN

This README describes the data cleaning and analysis project using the EMIT quarantine transmission trial data. The goal of this project is to take all of the raw data files from the study, clean them and merge them into an authoritative dataset, and then produce all of the analyses required for manuscript preparation. The scripts that do this are described here.

Date: 09/23/2018

Updated: 02/08/2019

There are 6 .R scripts in this project:

1) EMIT_Quarantine_Main_Cleaning.R
2) EMIT_Quarantine_Main_work_with_clean_files.R
3) EMIT_Quarantine_Main_Analysis.R
4) EMIT_Quarantine_Text_Analysis.R
5) EMIT_Quarantine_POC_Inf_Criteria.R
6) EMIT_Quarantine_Source_Scripts.R

There is one important .Rmd file: EMIT Main Paper Tables Figures.Rmd

The first two scripts are data cleaning scripts, and the next three are analysis scripts. The final script sources all 5 of the preceding scripts.

The EMIT_Quarantine_Main_Cleaning.R script uses raw data from the EMIT quarantine study to create curated datasets that can be used for summary tables and analysis. Next, the EMIT_Quarantine_Main_work_with_clean_files.R script uses these cleaned datasets to create an authoritative dataset for analysis. The EMIT_Quarantine_Main_Analysis.R, EMIT_Quarantine_Text_Analysis.R, and EMIT_Quarantine_POC_Inf_Criteria.R scripts produce all of the analysis output. To get a streamlined, reportable markdown of the results, run the "EMIT Main Paper Tables Figures.Rmd" script after the 2 cleaning and 3 analysis scripts have been run.

A 'source' file (EMIT_Quarantine_Source_Scripts.R) runs all of these scripts and renders markdown reports for each of them in the appropriate sequential order. However, there is a bug when running the "EMIT Main Paper Tables Figures.Rmd" file from within this source script. The bug is related to the fact that the RMarkdown file sits in a different working directory from the working directories used by the .R scripts being run.
It is odd that this problem does not occur when sourcing .R files but does occur when sourcing .Rmd files (a likely explanation is that knitr, by default, evaluates code chunks relative to the directory containing the .Rmd file rather than the working directory of the calling session). Thus, while the source script can be used to run all of the .R files and prepare all of the data in the output directories for EMIT Main Paper Tables Figures.Rmd to work with, the Rmd file itself must be run separately by opening it and clicking the knit option for an html file.

The working directory, containing raw data (input) and the curated datasets (output), is: /Users/jbueno/Box Sync/EMIT/EMIT_Data_Analysis_Jake/UK vs. UMD data and analysis. Users wishing to collaborate on the analysis or replicate findings should set an appropriate working directory of their own.

The raw data files are further described in a Word document: "/Users/jbueno/Box Sync/EMIT/EMIT_Data_Analysis_Jake/UK vs. UMD data and analysis/Analysis Notes/List and description of data files in Study data folder 22-Jul-2014_JakeNotesSept2018.docx".

The procedure for reproducing the analysis for this repo is as follows.

As mentioned above, the "EMIT_Quarantine_Source_Scripts.R" file can be run; however, after running this script, the EMIT Main Paper Tables Figures.Rmd file must be opened and the knit option for html selected in order to get the final data report.

Alternatively, the scripts can be run individually in the following order (see also the notes about each script in the list below):

1) Run the EMIT_Quarantine_Main_Cleaning.R script. This script cleans the raw data files and writes cleaned versions in .csv format to:

   /Users/jbueno/Box Sync/EMIT/EMIT_Data_Analysis_Jake/UK vs. UMD data and analysis/UK Quarantine Study Data and Notes/Curated Data/Cleaned Data

2) Run the EMIT_Quarantine_Main_work_with_clean_files.R script. This script uses the dfs from the ...Cleaned Data directory (produced above) and writes an authoritative df in .csv format called "QuarantineMergedData.csv" to:

   /Users/jbueno/Box Sync/EMIT/EMIT_Data_Analysis_Jake/UK vs. UMD data and analysis/UK Quarantine Study Data and Notes/Curated Data/Analytical Datasets

3) Run the EMIT_Quarantine_Main_Analysis.R script. This script uses the "QuarantineMergedData.csv" file in the ...Analytical Datasets directory (produced above) and writes Tables 1, 2, and 3 and Figure 3 (two plots), as well as some other supplementary tables and figures (most of which are required for generating the main text table footnotes) and manipulated data frames ready for plotting in RMarkdown using kable, to:

   /Users/jbueno/Box Sync/EMIT/EMIT_Data_Analysis_Jake/UK vs. UMD data and analysis/UK Quarantine Study Data and Notes/Curated Data/Analysis Results

4) Run the EMIT_Quarantine_Text_Analysis.R script. This script produces alternative versions of Tables 1 and 3. These alternative versions apply more stringent criteria to the symptom and ILI classification, in which symptoms that occur before study day 1 are excluded from contributing to classification. Note: there were no donors or recipients with fever ≥37.9 before study day 1. This script uses the authoritative df ("QuarantineMergedData.csv") as well as pieces of the EMIT_Quarantine_Main_Analysis.R script to produce tabular data for the text, to which we can later apply the kable function in RMarkdown to generate nice tables. The main reason for having this script, as opposed to simply lumping this analysis into the EMIT_Quarantine_Main_Analysis.R script above, is to make the scripts easier to follow, because this piece of code is very long for the couple of tables it generates. The results are written to:

   /Users/jbueno/Box Sync/EMIT/EMIT_Data_Analysis_Jake/UK vs. UMD data and analysis/UK Quarantine Study Data and Notes/Curated Data/Analysis Results

5) Run the EMIT_Quarantine_POC_Inf_Criteria.R script. This script produces summary tables for the manuscript supplement. These summary tables are reproductions of Tables 1 and 3 from the manuscript, with the difference that the less stringent Proof-of-Concept infection criteria (a single day of PCR positivity, culture, or seroconversion; from Killingley et al, JID 2012) are applied to the Main Q study. The original criteria used in the Main Q study analysis were 2 days of PCR positivity, or seroconversion, and under them we saw only one infection event (which was in a control recipient).

6) Run the "EMIT Main Paper Tables Figures.Rmd" script to produce an html markdown file. This RMarkdown uses the tables and ready-to-plot dfs in the ...Analysis Results directory (produced above) to produce publication-quality tables in html format. (Note: the .pdf and .docx output formats do not work here, I believe because the kable function disables them somehow, so this knitted RMarkdown file must be produced as an .html file.) The html tables can then be copied into Word for fine tuning and preparation for submission to a journal. Most journals require tables in Word format, but some accept LaTeX formats as well. There were some issues with printing the LaTeX-formatted tables that need further investigation. For now we will use the workflow described here.

Note: Future iterations of this work should look into debugging the procedural script "EMIT_Quarantine_Source_Scripts.R" to enable running EMIT Main Paper Tables Figures.Rmd from it, so that the entire analysis can be produced with a single command and by opening only a single script.
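
One possible approach to the single-command workflow mentioned in the note above is sketched here. It is an illustration only, not the contents of EMIT_Quarantine_Source_Scripts.R, and it has not been run against this project. The function name `run_pipeline` and its arguments `script_dir`, `data_root`, and `rmd_path` are hypothetical; this README does not say where the scripts themselves are stored. The sketch relies on the `knit_root_dir` argument of `rmarkdown::render()`, which sets the directory in which chunks are evaluated, so that the Rmd would see the same working directory as the .R scripts.

```r
# Sketch only. `script_dir`, `data_root`, and `rmd_path` are hypothetical
# arguments: the README does not state where the scripts are stored.
pipeline_scripts <- c(
  "EMIT_Quarantine_Main_Cleaning.R",
  "EMIT_Quarantine_Main_work_with_clean_files.R",
  "EMIT_Quarantine_Main_Analysis.R",
  "EMIT_Quarantine_Text_Analysis.R",
  "EMIT_Quarantine_POC_Inf_Criteria.R"
)

run_pipeline <- function(script_dir, data_root, rmd_path = NULL,
                         scripts = pipeline_scripts) {
  # Resolve to absolute paths before the working directory changes.
  script_dir <- normalizePath(script_dir, mustWork = TRUE)
  data_root <- normalizePath(data_root, mustWork = TRUE)
  if (!is.null(rmd_path)) {
    rmd_path <- normalizePath(rmd_path, mustWork = TRUE)
  }

  # Fail early if any script in the sequence is missing.
  paths <- file.path(script_dir, scripts)
  missing_scripts <- paths[!file.exists(paths)]
  if (length(missing_scripts) > 0) {
    stop("Missing script(s): ", paste(missing_scripts, collapse = ", "))
  }

  # Run from the data directory and restore the caller's directory on exit.
  old_wd <- setwd(data_root)
  on.exit(setwd(old_wd), add = TRUE)

  # Cleaning scripts first, then analysis scripts, in the listed order.
  for (p in paths) {
    message("Running ", basename(p))
    source(p)
  }

  # Knit the report with chunks evaluated in data_root, not the Rmd's folder.
  if (!is.null(rmd_path)) {
    rmarkdown::render(rmd_path,
                      output_format = "html_document",
                      knit_root_dir = data_root)
  }
  invisible(paths)
}

# Example call (all three paths are placeholders):
# run_pipeline("path/to/scripts", "path/to/data",
#              "path/to/EMIT Main Paper Tables Figures.Rmd")
```

If the individual scripts set their own working directory with hard-coded paths, those calls would still take effect inside each script, so this sketch would only help once such paths are made relative to a common root.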