The three Nature papers on DNA structure published in 1953: the Watson-Crick paper is one of the best-written manuscripts I have ever read. Please read the delicate three lines on p. 737 by Watson & Crick starting with "It has not escaped our notice..."; rarely has one sentence hidden so much depth and comprehension.
Guides:
How to become a scientist by Pr. Yewdell: First, Second
All laboratory staff must complete Inserm's Néo : accueil et prévention training (allow 4x20 min).
Always wear lab coat and gloves when experimenting.
Make sure to treat waste correctly. Ask your supervisor or colleagues if in doubt. A. Liquid waste should be marked with your initials, U1135, the date and a description of the contents, and should be placed in the corridor on Wednesday afternoon. B. Solid waste should be marked with your initials, U1135 and the date.
Antibodies
Antibodies play a major role in our research, both directly as our target of interest and indirectly, because they represent essential tools for the detection of various biomolecules (flow cytometry, ELISA etc.). Antibody reagents used for research can be unconjugated or conjugated with biotin, HRP or various fluorochromes. They are generally very expensive (3-600 Euros per bottle/tube). How should you preserve these reagents?
1. Always work on ice (when you take a tube out of the fridge or freezer, keep it on ice at all times).
2. Avoid freeze-thaw cycles. For antibodies that require frozen storage, make aliquots. Once an aliquot is thawed (and only thaw one after verifying that a thawed aliquot is not already available in the fridge), do NOT re-freeze it; keep it in the fridge.
3. Antibodies are generally very stable and can be stored for quite a long time in the fridge (follow the manufacturer's recommendations).
4. For long-term storage antibodies can be frozen at -80°C (AVOID this for PE-conjugated antibodies). Make aliquots of a reasonable size, which can be left in the fridge after thawing for a few months (avoid freeze-thaw cycles). NEVER freeze a complete batch of antibody.
5. Rarely (and only after discussing with your supervisor) antibodies can be frozen at -20°C; in this case adding up to 50% glycerol is an advantage, as the antibody benefits from the low temperature while avoiding crystallization (even at -20°C the antibody-glycerol solution remains liquid).
6. Samples containing antibodies (serum, fecal water, breastmilk etc.) are generally stored at -80°C (long-term storage), but if they have been diluted or plated out for experimental use with no need for long-term storage, they can be stored at -20°C (please consult your supervisor).
Beyond the impact the above guidelines could have on the experimental quality of your work, antibodies are also a major part of our lab budget. Please take care of them.
Exhaustive boolean gating can be done in FlowJo using the "Combination gates" function. The exported data can then be analyzed with the FunkyCells software.
Of note, since my video presentation I have added a few slides (you can download them above). In particular, I realised that a ChatGPT detection tool exists (GPT-2 Output Detector). I tried the tool and it works well for texts completely generated by ChatGPT, but if you ask ChatGPT to improve a text it did not write, the tool does not seem to detect the improvements. Generally, it seems to work best with long texts. In conclusion, the tool works but suffers from a significant number of false positives and false negatives. I'm not convinced that we have the time to verify all texts, and handwritten text (copied from ChatGPT) would be difficult to test.
DADA2 Tutorial - Remy VILLETTE
Lab Guru Tutorial - Manon CHAUVIN
Microscopy - FIJI macros - Alice PASCAULT (script)
A ChatGPT detection tool exists (GPT-2 Output Detector), and probably many more will arrive over time. It works well for texts completely generated by ChatGPT, but if you ask ChatGPT to improve a text it did not write, the tool does not seem to detect the improvements, and it works best with long texts; overall it suffers from a significant number of false positives and false negatives. I'm not convinced that a teacher has the time to verify all texts, and handwritten text (copied from ChatGPT) would be difficult to test. Conversely, one could imagine that ChatGPT will one day be able to correct our students' work; that would take a heavy evaluation burden away from teachers and allow them to spend more time on their primary mission, to teach.
Copyright: interestingly, what derives from ChatGPT is owned by the person who asked the question that produced it. You therefore own the copyright to the material you produce with ChatGPT (link to OpenAI's website).
Can I use output from ChatGPT for commercial uses?
Subject to the Content Policy and Terms, you own the output you create with ChatGPT, including the right to reprint, sell, and merchandise – regardless of whether output was generated through a free or paid plan.
Steven Gee has a website with various examples of how to use ChatGPT for bioinformatics projects.
dbBact is a tool that can associate microbiota profiles or individual ASVs with a database of other studies and identify metadata features that may be relevant for your own study. It offers several types of output, including "word clouds".
MixOmics Workshop by Sebastien DEJEAN (08/09/2023)
Sebastien DEJEAN is a research engineer in the mathematics department at the University of Toulouse, France. He specialises in statistical models and was involved in the creation of the MixOmics R package. He created the presentation and associated R script presented below, and has agreed that the video and documents be made publicly available. We wish to thank Sebastien for supporting our team with his advice and expertise - THANK YOU.
NCBI made a very useful toolkit to quickly and accurately download large sequencing datasets stored on the SRA, ENA or DDBJ servers. You will need to install it, configure it and get the accession list for your samples. The toolkit downloads SRA-formatted files and then converts them to FASTQ.
To get started quickly, you can just run this in your terminal and choose the directory you want the SRA toolkit to download into. If you don't perform this step, the toolkit will download into the directory where you put it, usually in your root directory, so change this to a directory with a large storage capacity.
First, get the file names in .txt format from the Run Selector or the SRA Entrez search. Go to SRA Entrez on NCBI and use the accession number to find your samples: paste the accession number into the search box and press Enter. Then, at the top right of the panel, you'll find a "Send to" button; select "Run selector". SRA Entrez usually covers SRA, ENA, DDBJ and more.
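As a minimal sketch of this step in R (the helper name `get_accessions` is ours; "SraRunTable.txt" is the Run Selector's default metadata export), the run accessions can be pulled out and saved as a plain list for prefetch:

```r
# Hypothetical helper: read a Run Selector metadata export and
# return the run accessions (SRR/ERR/DRR identifiers) from its "Run" column
get_accessions <- function(run_table_path) {
  run_table <- read.csv(run_table_path)
  run_table$Run
}
# accessions <- get_accessions("SraRunTable.txt")
# writeLines(accessions, "accession_list.txt") # plain list usable with prefetch --option-file
```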
path <- "/home/remy/sratoolkit.current-centos_linux64/sratoolkit.2.10.7-centos_linux64/bin" # path to the toolkit's bin directory
func_fetch <- paste0(path, "/prefetch --option-file") # prepare the prefetch call
Here you can choose the directory where your SRA-formatted files will be stored. If you already set it in vdb-config -i you don't need to redo it. You can also use this code if you want to temporarily change the output directory.
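As a non-interactive alternative sketch (assumption: the `--set` flag and the `/repository/user/main/public/root` configuration key behave as in recent SRA Toolkit releases), the same setting can be changed from R:

```r
# Paths as defined earlier in this tutorial (adjust to your installation)
path <- "/home/remy/sratoolkit.current-centos_linux64/sratoolkit.2.10.7-centos_linux64/bin"
outdir <- "/path/with/large/storage" # placeholder: a directory with plenty of space
# Build the vdb-config call that points the toolkit's download root at outdir
cmd_cfg <- paste0(path, "/vdb-config --set /repository/user/main/public/root=", outdir)
# system(cmd_cfg) # uncomment to apply
```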
# This section of code is to make sure that files do not contain WGS data
library(tidyverse)
# list the per-project metadata files that were downloaded
f <- list.files("/home/remy/Documents/SOP_early_life_R/Datas", pattern = "SraRunTable.txt")
# flag columns whose name suggests shotgun/WGS data (search pattern reconstructed)
for (tmp in names(metadata)) {
  nb <- which(str_detect(colnames(metadata[[tmp]]), "WGS|[Ss]hotgun"))
  if (length(nb) > 0)
    cat(tmp, "\n", "column", paste0(colnames(metadata[[tmp]])[nb], collapse = " | "), "\n", "contains a pattern related to SHOTGUN sequencing, there is shotgun innit \n")
}
The “PRJEB26419” project contains hidden WGS and transcriptomic data. For this specific paper we need to subset the data manually.
You can now proceed with the function. It will download SRA-formatted files (not compressed, however) into the directory you chose. Give it the accession list you downloaded from NCBI. For this tutorial we will use the accession number PRJEB2079.
cmd1 <- paste(func_fetch, accession, "--output-directory", outdir, "--progress") # assemble the prefetch command
system(cmd1) # launch it; it will take some time depending on the number of samples you are downloading
# Troubleshooting: in case you need to re-download some files or your code stopped. If the utils function finds an existing fastq it will stop the loop, so we remove these fastqs from the list. For the files that stopped downloading:
library(tidyverse)
dir2 <- outdir # the directory where the files are stored
# re-run prefetch only for runs not already present (reconstructed loop;
# `accession_list` stands for the character vector of run accessions)
for (x in setdiff(accession_list, list.files(dir2))) {
  cmd1 <- paste(paste0(path, "/prefetch "), x, "--output-directory", outdir, "--progress") # rebuild the command per run
  system(cmd1)
}
Transform SRA-formatted data into FASTQ.
There is a subtlety here: be aware that some files will be uploaded as single-end sequence files, meaning there are no separate Forward and Reverse files. In that case you just need to adjust the gzip calls in the functions. At present fasterq-dump doesn't allow compression, so we need to use an external compression function.
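A sketch of the conversion for a single run (paths and the example accession are placeholders; `--split-files` writes `_1`/`_2` files for paired-end runs and a single file for single-end runs):

```r
# Paths as defined earlier in this tutorial (adjust to your installation)
path <- "/home/remy/sratoolkit.current-centos_linux64/sratoolkit.2.10.7-centos_linux64/bin"
outdir <- "/path/with/large/storage"
x <- "ERR000001" # placeholder run accession
# fasterq-dump writes uncompressed FASTQ, so compression is a separate step
cmd_dump <- paste0(path, "/fasterq-dump ", x, " --outdir ", outdir, " --split-files")
# system(cmd_dump)
# then compress every FASTQ with an external gzip call:
# for (fq in list.files(outdir, pattern = "\\.fastq$", full.names = TRUE)) system(paste("gzip", fq))
```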
# You may encounter a problem where the code stops: if a file has already been processed, the utils function will block the loop
# so we build a character vector to detect the files already done
files <- list.files(dir2) # make a list of the samples already converted
Finally, we want to remove the SRA-formatted files, as they are big and no longer needed.
outdir2 <- list.files(outdir)
outdir2 <- paste0(outdir, "/", outdir2)
for (dl in outdir2) {
  file.remove(dl)
}
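Note that `file.remove()` cannot delete directories, and recent prefetch versions write each run into its own subdirectory; a small sketch (the helper name is ours) that handles both cases:

```r
# Hypothetical helper: delete everything prefetch left in outdir,
# including per-run subdirectories (unlink, unlike file.remove, can be recursive)
remove_sra_files <- function(outdir) {
  unlink(list.files(outdir, full.names = TRUE), recursive = TRUE)
}
# remove_sra_files(outdir)
```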
Working with metadata from multiple SRA projects.
The challenge is now to homogenize the metadata in order to merge them later. Metadata are provided by the authors and can vary from 6 columns to 50, with different column names for the same information (e.g. "Host_age" and "Age"). For now we just want to import the metadata files into R to analyze the column names and their frequencies before considering any modifications.
# Get frequencies of column names - used for metadata harmonization
colname <- NULL
data_name <- NULL
dim_metadata <- NULL
for (i in names(new_meta)) {
  colname <- c(colname, colnames(new_meta[[i]]))
  data_name[[i]] <- colnames(new_meta[[i]])
  dim_metadata <- rbind(dim_metadata, data_frame(project = i, factors = dim(new_meta[[i]])[2], samples = dim(new_meta[[i]])[1]))
}
tmp <- as.data.frame(table(colname))
tmp <- tmp[order(-tmp$Freq), ]
sum(tmp$Freq == 49) # column names present in 49 projects
filter(tmp, Freq == 49) %>% View()
filter(tmp, Freq == 49) %>% write.csv("shared_col.csv")
filter(tmp, Freq == 1) %>% View() # column names unique to one project
sum(tmp$Freq == 1)
sum(dim_metadata$samples) # total number of samples across projects
Change the column names (work in progress).
metadat <- metadata
for (i in names(new_meta)) {
  nb <- str_detect(colnames(new_meta[[i]]), "Age|_age_|AGE")
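A minimal self-contained sketch of where this renaming is heading (base-R `grepl` stands in for `str_detect`; the helper name and the target name "Host_age" are assumptions):

```r
# Hypothetical helper: rename the first age-like column to one shared name
harmonize_age <- function(df) {
  hit <- grepl("Age|_age_|AGE", colnames(df)) # same pattern as the loop above
  if (any(hit)) colnames(df)[which(hit)[1]] <- "Host_age" # assumed target name
  df
}
# applied over all projects before merging:
# new_meta <- lapply(new_meta, harmonize_age)
```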