Old Links Used in Class - Spring 2024

Lecture 12 (3/8)

A.) Merge gtf files from individual samples to prepare for comparison.

  • Observe the output of cufflinks.
  • Combine the GTFs using the assembly file. Download the file and save it in your /hts (or /HTS) directory.
  • Execute the command "cuffmerge -p 2 -g ./index/dm6.49.gtf -s ./index/Dm6.49.fa -o Merged WG_assemblies.txt". It may take 5-10 minutes for the task to complete.
  • Alternatively, you can submit the CuffMerge sbatch job file. Use "sbatch --account=gms6014 --qos=gms6014 cuffmerge_gms6014.sbatch" if you have an account from your lab.
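
The merge step above can also be scripted. The following is a minimal sketch, assuming each sample's Cufflinks output sits in its own folder containing a transcripts.gtf file. The folder names are hypothetical placeholders. It writes the assembly list (one GTF path per line) and builds the same cuffmerge command quoted above.

```python
import tempfile
from pathlib import Path


def write_assembly_list(sample_dirs, list_path):
    """Write one transcripts.gtf path per line, the list format cuffmerge reads."""
    lines = [str(Path(d) / "transcripts.gtf") for d in sample_dirs]
    Path(list_path).write_text("\n".join(lines) + "\n")
    return lines


def cuffmerge_command(list_path, threads=2):
    """Build the cuffmerge call from class as an argument list."""
    return [
        "cuffmerge",
        "-p", str(threads),
        "-g", "./index/dm6.49.gtf",
        "-s", "./index/Dm6.49.fa",
        "-o", "Merged",
        str(list_path),
    ]


if __name__ == "__main__":
    # Hypothetical sample folders; replace with your own Cufflinks output folders.
    samples = ["WG_sample1", "WG_sample2"]
    with tempfile.TemporaryDirectory() as tmp:
        list_file = Path(tmp) / "WG_assemblies.txt"
        write_assembly_list(samples, list_file)
        print(list_file.read_text(), end="")
        print(" ".join(cuffmerge_command(list_file)))
    # On HiPerGator, with cuffmerge available, the command could be run with
    # subprocess.run(cuffmerge_command("WG_assemblies.txt"), check=True).
```

Building the command as a list rather than a single string avoids quoting problems if a path ever contains unusual characters.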
B.) Transfer files to and from HiPerGator

C.) Displaying genomic datasets with the UCSC browser

  • The UCSC Genome Browser is a popular server for viewing genomic data -- take care to select the correct genome release. Signing up for the service allows you to save analysis sessions and share the results with others. Example
  • An example of a saved session, which persists so that you can send the link to collaborators
  • An alternative to the UCSC Browser is the IGB genome browser, which you can download and install on your own computer

D.) Identify differentially expressed genes (DEGs)

  • Observe the output of cufflinks.
  • Compile the assembly files for Whole Gut
  • Identify DEGs with the CuffDiff job file.
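
Once the CuffDiff job finishes, the DEGs can be pulled out of its gene-level results table. The sketch below assumes the standard tab-delimited gene_exp.diff output with columns named gene, log2(fold_change), q_value and significant. The fold-change cutoff and the demo rows are made up for illustration.

```python
import csv
import io


def significant_genes(handle, min_abs_log2fc=1.0):
    """Return (gene, log2 fold change, q-value) for rows flagged significant."""
    hits = []
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["significant"] != "yes":
            continue
        log2fc = float(row["log2(fold_change)"])
        if abs(log2fc) >= min_abs_log2fc:
            hits.append((row["gene"], log2fc, float(row["q_value"])))
    # Most strongly changed genes first.
    return sorted(hits, key=lambda h: abs(h[1]), reverse=True)


if __name__ == "__main__":
    # Made-up rows for illustration only; real data comes from gene_exp.diff.
    demo = io.StringIO(
        "test_id\tgene_id\tgene\tlocus\tsample_1\tsample_2\tstatus\tvalue_1\tvalue_2\t"
        "log2(fold_change)\ttest_stat\tp_value\tq_value\tsignificant\n"
        "XLOC_1\tXLOC_1\tgeneA\tchr2L:1-100\tS1\tS2\tOK\t10\t40\t2.0\t3.1\t0.0001\t0.002\tyes\n"
        "XLOC_2\tXLOC_2\tgeneB\tchr2L:200-300\tS1\tS2\tOK\t10\t11\t0.14\t0.2\t0.6\t0.8\tno\n"
    )
    for gene, log2fc, q in significant_genes(demo):
        print(gene, log2fc, q)
```

For a real run, open the file with open("gene_exp.diff") and pass the handle to significant_genes.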

  • A pipeline that streamlines the whole process. Example


Lecture 11 (3/6)

A.) Mapping to reference genome

  • Perform transcript counting with the cufflinks job file

B.) How to obtain the genomic sequence surrounding your gene?

    * From a genome browser - example
    * For genomes not available in a public genome browser - use Fastacmd or your own Python script
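
As an illustration of the "own Python script" option, here is a minimal sketch that reads a FASTA file and returns a gene together with its flanking sequence. It assumes 1-based inclusive coordinates (as in GTF files). The function names and the toy sequence are hypothetical.

```python
import io


def read_fasta(handle):
    """Parse FASTA text into {sequence_id: sequence}; the id is the first word of the header."""
    seqs, name, chunks = {}, None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                seqs[name] = "".join(chunks)
            words = line[1:].split()
            name, chunks = (words[0] if words else ""), []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        seqs[name] = "".join(chunks)
    return seqs


def flanking_region(seq, start, end, flank=1000):
    """Return seq[start..end] plus `flank` bases on each side.

    start and end are 1-based and inclusive; the window is clipped
    at the ends of the sequence.
    """
    left = max(start - 1 - flank, 0)
    right = min(end + flank, len(seq))
    return seq[left:right]


if __name__ == "__main__":
    # Toy chromosome for illustration; use open("your_genome.fa") for real data.
    genome = read_fasta(io.StringIO(">chrToy demo\nAAACCC\nGGGTTT\n"))
    print(flanking_region(genome["chrToy"], start=4, end=6, flank=2))
```

This loads whole chromosomes into memory, which is fine for small genomes; for large ones an indexed reader would be a better choice.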

C.) Identifying TF and epigenetic regulator binding sites in the genome

  • Searching for TF binding sites in the DNA sequence using TFsiteScan or PROMO
  • Identifying genome-wide transcription factor binding sites with ChIP-Seq data analysis. An example of ChIP-Seq data and TF binding site prediction
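
To make the idea of sequence-based site searching concrete, the sketch below scans a DNA sequence for an IUPAC consensus on both strands. This is a deliberate simplification and not how TFsiteScan or PROMO work internally (those rely on curated matrices and scoring). The example uses the E-box consensus CANNTG on a made-up sequence.

```python
import re

# IUPAC nucleotide codes expressed as regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Reverse complement of a DNA string, IUPAC codes included."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_sites(seq, consensus):
    """Return sorted (0-based start, strand, forward-strand text) for each match."""
    seq = seq.upper()
    hits = set()
    for strand, motif in (("+", consensus.upper()), ("-", reverse_complement(consensus))):
        # The lookahead lets overlapping sites be reported.
        pattern = re.compile("(?=(" + "".join(IUPAC[b] for b in motif) + "))")
        for m in pattern.finditer(seq):
            hits.add((m.start(), strand, m.group(1)))
    return sorted(hits)


if __name__ == "__main__":
    # Made-up sequence containing one E-box (CANNTG).
    for hit in find_sites("AACAGCTGTT", "CANNTG"):
        print(hit)
```

A palindromic consensus such as CANNTG is reported once per strand at the same position, which is expected.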


Lecture 10 RNA-Seq (3/4)

A.) Obtaining HTS datasets from GEO

  • Example - GSE62580; link for the SRA dataset - SRP049144 - use the run selector to get the accession list
  • Using fastq-dump, download the 4 whole gut samples into a /wg folder within your working directory.
  • Alternatively, for batch download, use a .sbatch job file - download the example; change the email address in it to your own, upload it to your folder, and submit the job with "sbatch --account=gms6014 --qos=gms6014 [Filename]".
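
The download commands can be generated from the accession list rather than typed by hand. This is a minimal sketch, assuming the run selector list was saved as a text file with one run accession per line. The accession IDs shown are placeholders, and --split-files is included on the assumption that the runs may be paired-end.

```python
def fastq_dump_commands(accessions, outdir="wg"):
    """Build one fastq-dump call (as an argument list) per SRA run accession."""
    commands = []
    for acc in accessions:
        acc = acc.strip()
        if acc:  # skip blank lines from the accession list
            commands.append(
                ["fastq-dump", "--split-files", "--gzip", "--outdir", outdir, acc]
            )
    return commands


if __name__ == "__main__":
    # Placeholder accessions; read your own list with open("SRR_Acc_List.txt").
    runs = ["SRR0000001\n", "SRR0000002\n", "\n"]
    for cmd in fastq_dump_commands(runs):
        print(" ".join(cmd))
```

The printed lines can be pasted into the body of the .sbatch job file.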
B.) RNA-Seq data analysis overview

  • Review papers 1 and 2.
  • Examples of protocols: 1.) TopHat --> Cufflinks --> Cuffdiff; 2.) HISAT2 --> StringTie. You could also build your own from available components, based on the needs of your project.

C.) Mapping to reference genome

    1. Check that the file sizes are right; run FASTQ QC if necessary.
    2. Make the genome index file from the genome sequence and gtf files, using the script file
    3. Map to the reference genome using the script/job file StarMap. Edit the script/job file with your text editor and upload it to your RNA-Seq directory.
    4. Submit the job file
  • Intro to the STAR aligner
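
Before mapping, a quick sanity check of each downloaded file can catch truncated downloads. The sketch below is not a replacement for a full FASTQ QC tool: it only counts reads and bases and verifies the four-line record layout. The function name is hypothetical.

```python
import io


def fastq_stats(handle):
    """Count reads and bases in a FASTQ stream; raise ValueError on a malformed record."""
    reads = bases = 0
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        seq = handle.readline().rstrip("\r\n")
        plus = handle.readline()
        qual = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not plus.startswith("+") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record near read {reads + 1}")
        reads += 1
        bases += len(seq)
    return reads, bases


if __name__ == "__main__":
    # Two made-up reads; for a real file use gzip.open(path, "rt") or open(path).
    demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n")
    print(fastq_stats(demo))
```

For paired-end data, the read counts of the _1 and _2 files should match.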


Lecture 9 Phylogenetic analysis and HTS data (3/1)

A.) Phylogenetic analysis

B.) Obtaining HTS datasets from GEO

  • Example - GSE62580; link for the SRA dataset - SRP049144 - use the run selector to get the accession list
  • Using fastq-dump, download the 4 whole gut samples into a /wg folder within your working directory.
  • Alternatively, for batch download, use a .sbatch job file - download the example; change the email address in it to your own, upload it to your folder, and submit the job with "sbatch --account=gms6014 --qos=gms6014 [Filename]".

Lecture 8 Protein structure (2/28)

A.) Predicting protein structure

  • AlphaFold2 paper by DeepMind.
  • Introduction to the AlphaFold project on the DeepMind website.
  • PDB - the Protein Data Bank (wwPDB) has played a fundamental role in improving protein structure analysis.

B.) Running AF2 on HiPerGator

    1. Obtain a FASTA-format file of the protein whose structure you want to predict. Save it and upload it to your folder in /gms6014/share/YourName/AF2. (Protein file we used before.)
    2. Edit the Script file on your local computer:
      1. Change the email address on the #SBATCH --mail-user line.
      2. Change the output folder name.
      3. Change the name of the FASTA-format file.
      Upload it to the same AF2 folder as above.
    3. Open a terminal and connect to HiPerGator. Navigate to your own AF2 folder and run the command "sbatch --account=gms6014 --qos=gms6014 AlphaFold2.sh".

Lecture 7 - Protein Domains and Motifs (2/26)

A - Search for binary patterns

  • Search for binary patterns with Bagua in dfile.
  • Extra: Python and Biopython

B - Identifying motifs shared by a group of proteins

  • Load 8IL6.txt into MEME. Leave your email address to access the results.
  • --> Shared motifs identified by MEME.
  • An example of a scoring matrix generated by MEME
  • The motifs identified by MEME can then be used to search sequence databases
  • BLAST output vs. motif search using MAST or using BG_0.5
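
How a motif scoring matrix is used for searching can be sketched in a few lines. The code below builds a log-odds position weight matrix from aligned motif instances and scores every window of a target sequence. It is a simplification under stated assumptions (uniform background, fixed pseudocount, no statistics) and does not reproduce the models used by MEME or MAST. The motif instances are made up.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pwm(sites, alphabet=AMINO_ACIDS, pseudocount=1.0):
    """Log2-odds position weight matrix from equal-length aligned motif instances."""
    background = 1.0 / len(alphabet)  # uniform background, a simplification
    total = len(sites) + pseudocount * len(alphabet)
    pwm = []
    for column in zip(*sites):
        pwm.append({
            res: math.log2((column.count(res) + pseudocount) / total / background)
            for res in alphabet
        })
    return pwm


def scan(seq, pwm):
    """Score every window of seq; residues outside the alphabet score 0."""
    width = len(pwm)
    return [
        (i, sum(pwm[j].get(seq[i + j], 0.0) for j in range(width)))
        for i in range(len(seq) - width + 1)
    ]


if __name__ == "__main__":
    # Made-up motif instances and target sequence for illustration.
    pwm = build_pwm(["WSEF", "WSEF", "WSDF"])
    start, score = max(scan("AAAWSEFAAA", pwm), key=lambda hit: hit[1])
    print(start, round(score, 2))
```

Unlike a BLAST search, which compares whole sequences, this kind of scan rewards only the conserved positions of the motif.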

C - Protein profile (protein family) databases:

  • Pfam at InterPro and the entry for IL6
  • Prosite entry for IL6
  • Search the BLAST hit and the motif search hit against Pfam or Prosite



Lecture 6 - Protein Domains and Motifs (2/23)

A little more on scoring matrices

  • The original Henikoff and Henikoff paper that served as the foundation for using BLOSUM62 as the default for BLAST searches
  • Comparison of PAM vs BLOSUM matrices

B - Global vs. local alignment

    Using the two test sequences, perform

    (A) Local alignment using EMBOSS_Water; or

    (B) Global alignment with Stretcher

    (use BLOSUM62, -10, -1)
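
The difference between the two alignment modes shows up directly in the dynamic programming recurrence. The sketch below is a teaching simplification, not what EMBOSS Water or Stretcher run: it assumes a match/mismatch score and a linear gap penalty instead of BLOSUM62 with separate open and extend penalties, and it returns only the optimal score. The sequences are made up.

```python
def alignment_score(a, b, local, match=1, mismatch=-1, gap=-2):
    """Optimal alignment score by dynamic programming.

    local=False: Needleman-Wunsch (global; end gaps are penalized).
    local=True:  Smith-Waterman (local; cell scores never drop below 0).
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # A global alignment must pay for leading gaps.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)
            score[i][j] = cell
            best = max(best, cell)
    # Local: best cell anywhere. Global: the bottom-right corner.
    return best if local else score[-1][-1]


if __name__ == "__main__":
    long_seq, short_seq = "TTTTACGTACGTCCCC", "ACGTACGT"
    print("local :", alignment_score(long_seq, short_seq, local=True))
    print("global:", alignment_score(long_seq, short_seq, local=False))
```

The local score ignores the unmatched flanks of the longer sequence, while the global score is pulled down by the gaps needed to span it end to end.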

Lecture 5 (2/21) Alignment and sequence similarity

1.) How did BLAST identify the hit(s) for us?

  • BLAST results from standalone BLAST - IL6_Dm6.44Genes and IL6_Dm6.32.cDNAs.
  • What is BLAST? - Basic Local Alignment Search Tool (NCBI site; Nature Education page)
  • The basis of quantifying sequence similarity

  • - Blocks Substitution Matrices (BLOSUM)
  •     BLOSUM62 matrix, BLOSUM matrices.


  • What is a block?
  • "Many known proteins can be grouped into families according to functional and sequence similarities. The similarity of the proteins across the sequences in each family is far from uniform. While some regions are clearly conserved, others display little sequence similarity. Often the conserved regions are crucial to the protein's function, for example enzymatic catalytic sites. Such conserved regions can be used to probe an uncharacterized sequence to indicate its function. " -- Pietrokovski, Henikoff, Henikoff 1996 Link
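
The step from blocks to matrix entries can be illustrated with the log-odds calculation at the heart of BLOSUM. The sketch below counts residue pairs within the columns of a toy block and computes 2*log2(observed/expected), the half-bit scale used for BLOSUM62. It omits the sequence clustering step (the "62" in BLOSUM62) and the final rounding to integers, and the toy block is made up.

```python
import math
from collections import Counter
from itertools import combinations


def pair_frequencies(block):
    """Observed frequencies q of unordered residue pairs, pooled over all columns."""
    counts = Counter()
    for column in zip(*block):
        for x, y in combinations(column, 2):
            counts[tuple(sorted((x, y)))] += 1
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


def log_odds(block):
    """Unrounded half-bit scores 2*log2(q/e) for every observed residue pair."""
    q = pair_frequencies(block)
    p = Counter()
    for (x, y), freq in q.items():
        # Each pair contributes half of its frequency to each of its two residues.
        p[x] += freq / 2
        p[y] += freq / 2
    scores = {}
    for (x, y), freq in q.items():
        expected = p[x] * p[y] if x == y else 2 * p[x] * p[y]
        scores[(x, y)] = 2 * math.log2(freq / expected)
    return scores


if __name__ == "__main__":
    # Toy block: a single column with nine A residues and one S.
    block = ["A"] * 9 + ["S"]
    for pair, score in sorted(log_odds(block).items()):
        print(pair, round(score, 3))
```

A positive score means the pair is seen in conserved blocks more often than expected by chance; a negative score means it is seen less often.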

2.) How the scoring matrix and gap penalties affect the outcome of local alignment

    Using the two test sequences, perform local alignment with the following parameters.

    Local Alignment Web Service EMBOSS_Water (or the backup, LALIGN).

    1.) alpha (gap opening penalty)=15, beta (extension penalty)=3; and
    2.) alpha=5, beta=1.

    Observe the results.
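
The effect of the two parameter sets can be reproduced on a toy case. The sketch below implements Smith-Waterman scoring with affine gaps (Gotoh's recurrence) under several assumptions: a gap of length k costs alpha + (k - 1) * beta, matches score +5 and mismatches -4 instead of a BLOSUM matrix, and the sequences are made up rather than the class test sequences. Check the documentation of the alignment tool for its exact gap-cost convention.

```python
def local_affine_score(a, b, gap_open, gap_extend, match=5, mismatch=-4):
    """Best local alignment score with affine gap penalties.

    A gap of length k costs gap_open + (k - 1) * gap_extend.
    """
    neg = float("-inf")
    cols = len(b) + 1
    h_prev = [0] * cols      # best scores for the previous row
    f_prev = [neg] * cols    # previous row, alignments ending with a gap in b
    best = 0
    for i in range(1, len(a) + 1):
        h_row = [0] * cols
        f_row = [neg] * cols
        e = neg              # current row, alignments ending with a gap in a
        for j in range(1, cols):
            # Either open a new gap or extend the existing one.
            e = max(h_row[j - 1] - gap_open, e - gap_extend)
            f_row[j] = max(h_prev[j] - gap_open, f_prev[j] - gap_extend)
            diag = h_prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            h_row[j] = max(0, diag, e, f_row[j])
            best = max(best, h_row[j])
        h_prev, f_prev = h_row, f_row
    return best


if __name__ == "__main__":
    # Made-up sequences: the second one carries a two-base insertion.
    a, b = "AAAAACCCCC", "AAAAAGGCCCCC"
    print("alpha=15, beta=3:", local_affine_score(a, b, 15, 3))
    print("alpha=5,  beta=1:", local_affine_score(a, b, 5, 1))
```

With the cheaper penalties the optimal alignment opens a gap to join both conserved halves, so its score rises well above what the expensive penalties allow.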


Lecture 4 (2/19) Standalone Applications

1.) Download and install the standalone NCBI-BLAST: manual at NCBI

  • Before installation, read the instructions for Windows, Mac.
  • Download the .win64.exe file for Windows or the .dmg for Mac from the NCBI ftp server. Change the default installation path to YourHomeFolder/GMS6014/
  • Open a Command Prompt (Windows) or Terminal (Mac), navigate to the blast folder, list subfolders, then make new subfolders "dbs", "query", "out".
  • Download a dataset for the BLAST search. Genomic datasets can be downloaded at Ensembl. A previously downloaded dataset - All Genes in D. mel genome in FASTA format - save the dataset in blast/dbs.
  • Download 3 IL6 protein sequences from UniProt and save as "3IL6.fasta" in the blast/query/ folder.

2.) Running BLAST

    1. Open a Command Prompt (Windows) or Terminal (Mac), navigate to the blast folder, list the directory, then make new subfolders "dbs", "query", "out".
    2. Download a dataset for the BLAST search. Genomic datasets can be downloaded at Ensembl. A previously downloaded dataset - All Genes in D. mel genome in FASTA format. Name this file Dm6.44.AllGenes and save it in blast/dbs.
    3. Search for and download 3 IL6 protein sequences in FASTA format from UniProt and save as "3IL6.txt" in the blast/query/ folder. example
    4. Format the dataset for searching by running "makeblastdb -in dbs/Dm -dbtype nucl". Verify by checking for the new files generated in dbs/
    5. Run tblastn to search for orthologs of IL6 in Dm6.44.AllGenes
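
The final search step and a first look at its results can be scripted as follows. This is a sketch under assumptions: it requests tabular output (-outfmt 6), which may differ from the output format used in class, the output file name and E-value cutoff are hypothetical, and the demo rows are made up.

```python
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
BLAST6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def tblastn_command(query, db, out, evalue=1e-5):
    """Build a tblastn call (protein query vs. nucleotide database)."""
    return ["tblastn", "-query", query, "-db", db, "-out", out,
            "-evalue", str(evalue), "-outfmt", "6"]


def parse_blast6(handle):
    """Parse tabular BLAST output into a list of dicts with numeric scores."""
    hits = []
    for line in handle:
        if not line.strip() or line.startswith("#"):
            continue
        row = dict(zip(BLAST6_COLUMNS, line.rstrip("\r\n").split("\t")))
        for key in ("pident", "evalue", "bitscore"):
            row[key] = float(row[key])
        hits.append(row)
    return hits


def best_hit_per_query(hits):
    """Keep the highest-bitscore hit for each query sequence."""
    best = {}
    for hit in hits:
        query = hit["qseqid"]
        if query not in best or hit["bitscore"] > best[query]["bitscore"]:
            best[query] = hit
    return best


if __name__ == "__main__":
    print(" ".join(tblastn_command("query/3IL6.txt", "dbs/Dm6.44.AllGenes",
                                   "out/IL6_hits.tsv")))
    # Made-up result rows for illustration only.
    demo = io.StringIO(
        "q1\tgeneX\t35.0\t100\t60\t2\t1\t100\t10\t309\t1e-10\t55.1\n"
        "q1\tgeneY\t48.0\t120\t55\t1\t1\t120\t40\t399\t1e-30\t120.0\n"
    )
    for query, hit in best_hit_per_query(parse_blast6(demo)).items():
        print(query, hit["sseqid"], hit["evalue"], hit["bitscore"])
```

Tabular output is easier to filter and sort than the default pairwise report, at the cost of not showing the alignments themselves.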

Lecture 3 (2/15) HiPerGator

1.) Presentation slides for class.

2.) Tutorials:

  • HiPerGator tutorial and introduction;
  • Linux command line tutorial.

Lecture 2 (2/14)

A.) Retrieve and save sequence file:

    1. Try different views (formats) of the same entry by selecting different "display settings".
    2. Download the FASTA files of the nucleotide and protein sequences to your local computer*.
    3. *: Make a folder such as "GMS6014" in your home folder. Save all course-related files in this folder. Avoid spaces in folder and file names.

    4. Open the saved file using a text editor such as Notepad in Windows or TextEdit in macOS.
    5. * Consider using a dedicated text editor for bioinformatics projects, such as Notepad++ (Windows only) or Emacs (all major OSes).


B.) Local storage of sequence files:


C.) List of public resources.


D.) Navigate the web of information on your favorite gene (or IL6)

  • Search for the gene in the Gene vs. Protein databases
  • Observe the difference between all-text search and advanced search
  • Pay attention to the multitude of links associated with the Gene entry.
  • Compare the human "IL6" entry in the NCBI Gene database vs. the EBI UniProt database.
  • Pathway and interaction information on human IL-6 at Reactome



Lecture 1 (2/12)

    Familiarize yourself with the HiPerGator system if you have not used it.