---------------------------------------------------------------------------- UCR Phred/Phrap/Consed Manual ---------------------------------------------------------------------------- URL: http://faculty.ucr.edu/~tgirke/Documents/Consed/Consed Authors: Yoon Gi Choi, Swati Chawla, Amy Evans, Peter Morrell, Karen Lundy, Audrey mv Ah Fong & Thomas Girke INDEX (1) OVERVIEW (2) PHREDPHRAP SCRIPT (does it all) (3) PHRED (base caller) (4) CROSS MATCH (vector trimmer) (5) PHRAP (assembler) (6) CONSED Interface (viewer, primer designer, etc.) (7) AUTOFINISH (finishes projects) (8) POLYPHRED (SNP detection) ---------------------------------------------------------------------------- 1. OVERVIEW (Thomas Girke) ---------------------------------------------------------------------------- Phred/Phrap/Consed is a trace file analysis and sequence assembly pipeline for large sequencing projects. Its usage requires some experience with basic UNIX commands. It consists of several software pieces that are bundled together by the Perl script 'phredPhrap'. The obtained base calls and assemblies can be viewed and edited with the graphical interface and editor 'Consed'. SNPs can be identified with the add-on tool PolyPhred. Additional online tutorials can be found at: http://stein.cshl.org/genome_informatics/BasicSequenceManipulation/pages/main.html http://www.dbbm.fiocruz.br/helpdesk/bioinformatics/phredphrap.html ---------------------------------------------------------------------------- 2. PHREDPHRAP SCRIPT DOES IT ALL (Thomas Girke) ---------------------------------------------------------------------------- The phredPhrap Perl script pipes the programs phred, phd2fasta, crossmatch and phrap together using optimized parameters and to allow displaying the final results ('ace files') with the graphical interface Consed. Help $ phredPhrap -V Location of script $ /usr/local/opt/genome/bin/phredPhrap Requirements Create project directory with the following sub-directories in your home directory: - chromat_dir # place there your chromats - phd_dir # phd files will be written here - edit_dir # ace files and others will be written here - poly_dir # output from PolyPhred will be written here Running the phredPhrap script Type from the directory edit_dir: $ phredPhrap What it does A) It runs phred on all *new* reads (reads for which there is no phd file): $ phred -id chromat_dir -pd phd_dir # reads chromat files in "chromat_dir" and writes the "phd" files to "phd_dir". B) It can run determineReadTypes.perl so consed, autofinish, and phrap will understand your read naming convention. This step is uncommented by default. C) Then it runs phd2fasta to create 2 fasta files: one containing read bases and one containing read quality. These are of the highest versions of each read (in case any editing has been done): $ phd2fasta -id phd_dir -os seqs_fasta -oq seqs_fasta.screen.qual D) Then it runs crossmatch to screen them for vector: $ cross_match seqs_fasta vector.seq -minmatch 12 -penalty -2 -minscore 20 -screen > screen.out it uses as query vector library in "/usr/local/opt/genome/lib/screenLibs/vector.seq" or ../vector.seq. This generates the screened sequence file "seqs_fasta.screen" E) It runs phrap to perform the assembly: $ phrap seqs_fasta.screen -new_ace > phrap.out # writes the assembled contigs to the file "seqs_fasta.screen.contigs", and creates a '.ace' file that can be used for importing the assembly to Consed for editing. F) It can run tagRepeats.perl to tag any common repeats that you want to have automatically tagged. It uses the repeat library in /usr/local/opt/genome/bin/tagRepeats.perl. This step is uncommented by default. G) It runs transferConsensusTags to transfer any consensus tags from the newest old ace file to the one phrap created in step 4 H) It runs polyPhred: $ polyhPred -ace name.ace -tag genotype ---------------------------------------------------------------------------- 3. PHRED (Thomas Girke & Yoon Gi Choi) ---------------------------------------------------------------------------- PHRED reads SCF, ABI and ESD chromatogram files, calls bases and assigns quality scores using Fourier methods. Help $ phred -V # displays phred version $ less /usr/local/doc/phred/PHRED.DOC $ phred -doc # display manual $ phred -help # display command options Requirements Phred parameter file needs to list all sequencing parameters used to generate the chromatogram files (chemistry, primer and sequencing machine). Admin has to add this information to this file: /usr/local/opt/genome/lib/phredpar.dat Sequencing chemistries from UCR sequencing facility (Yoon Gi Choi): "DT3100POP4{BDv3}v1.mob" terminator big-dye ABI_3100 "DT3100POP6{BDv3}v1.mob" terminator big-dye ABI_3100 "DT3100POP4{dRhod}v2.mob" terminator d-rhodmine ABI_3100 "DT3100POP6{dRhod}v2.mob" terminator d-rhodmine ABI_3100 "DT3730POP7{BDv3}.mob" terminator big-dye ABI_3730 "3730_POP7_BDTv3.mob" terminator big-dye ABI_3730 Running phred A. As independent application. One example: $ phred -if my_list -sd ./ -qd ./ -pd ../phd_dir -if # processes files listed in this pointer file. Generate pointer file with $ ls *.abi > my_list -sd # write sequence output files with appended .seq in specified directory -qd # write quality output files with appended .qual in specified directory -pd # writes phd output files, that are required for Consed, into directory ../phd_dir. The phd format is explained in paragraph 10 of manual /usr/local/doc/phred/PHRED.DOC Note: Phred allows you to trim off low quality regions of reads. In general, it is not recommended to trim sequences that will be used for a Phrap assembly! You can run Phred with trimming option like this: $ phred -if my_list -trim_alt 0 -trim_fasta -sd ./ -qd ./ -pd ../phd_dir -trim_alt 0 # enables alternate auto trim -trim_fasta # trims .seq and .qual files Please read more about trimming in paragraph 7 of manual /usr/local/doc/phred/PHRED.DOC Some phred distributions contain a simple trace file viewer that can be started with the command "phred -view". The Phred/Phrap/Consed distribution from WU doesn't have a license to this viewer. We will install a similar viewer which is called "vtrace". B. As part of phredPhrap Perl script If the Phred base calls and quality scores are used as input for Phrap and Consed, then one should run Phred with the phredPhrap Perl script. ---------------------------------------------------------------------------- 4. CROSS MATCH (Swati Chawla) ---------------------------------------------------------------------------- Help less /usr/local/doc/phrap/phrap.doc Running cross_match A. As independent application. One example: $ cross_match vector -minmatch 20 -minscore 20 -screen > screen.out In this example the vector sequence in file "vector" will be used for masking the sequences in my_fasta_file by 'Xs'. The masked output will be written to file my_fasta_file.screen B. As part of phredPhrap Perl script See above. Repeat Masker is often used in genome assembly projects to mask repetitive elements and low complexity sequences. This software is distributed by www.geospiza.com. The required repeat libraries can be dowmnloaded from www.girinst.org. ---------------------------------------------------------------------------- 5. PHRAP (Amy Evans) ---------------------------------------------------------------------------- Phrap is one of the leading DNA assembler programs designed to assemble large sets of sequence data. One of its strength is that it utilizes phred quality scores for the assembly. Help less /usr/local/doc/phrap/phrap.doc Note: Phrap is installed on bioinfo.ucr.edu as manyread/longread version to allow analysis of >64,000 reads and sequences longer than 64,000bp. Running phrap A. As independent application. One example: $ phrap -minmatch 20 -new_ace > phrap.out In this example all identfied contigs in provided fasta file are written into my_fasta_file.screen.contigs. To include quality data in the assembly, one has to name the .qual file accordingly: my_fasta_file.screen.qual -minmatch 20 # minimum overlap for join -new_ace # creates ace file Please read more about the phrap options in assembly paragraph (5) of the manual /usr/local/doc/phrap/phrap.doc. B. As part of phredPhrap Perl script See above. Phrap output files: - .contigs - fasta file containing the contig sequences - .contigs.qual - qualities for the contig bases - .singlets - fasta file containing the singlet reads - .log - summary of aspects of the assembly - .problems - problems with assembly - .ace - used for viewing by consed - .view - required for viewing in phrapview Options: -forcelevel 0 -max_subclone_size 5000 -bypasslevel 1 -trim_penalty .2 -maxgap 30 -trim_score 20 -repeat_stringency .95 -trim_quality 13 -revise_greedy * -confirm_length 8 -shatter_greedy * -confirm_trim 1 -preassemble * -confirm_penalty .5 -force_high -confirm_score 30 -retain_duplicates * -indexwordsize 10 ---------------------------------------------------------------------------- 6. CONSED (Peter Morrell & Karen Lundy ---------------------------------------------------------------------------- Help click in Consed window on help menu Running Consed Type from the directory edit_dir: $ consed # or consed_linux Consed is an automated sequence assembler and viewer (NOT a sequence editor). As the authors of the the program describe it, Consed is designed to assemble sequence data, assess its quality, and design new primers and sequencing experiments to finish high quality sequence assemblies. Consed is used to graphically display sequence data. Many researchers are familiar with software that displays individual chromotagrams or assemblies of chromotograms (e.g., Sequencher). Consed is designed around the idea of reproducible, machine-scored data. Phred quality values guide the assembly and even determine the relative degree to which individual sequences contribute to your final consensus sequence. In a well designed experiment, no user editing or intervention is required. Consed is the only graphical program in the Phred/Phrap/Consed package (Polyphred also has a graphical interface, but Consed can be used to graphically display information from Polyphred as well). However, use of the program starts by typing "consed" or "consed_linux" in directory, ~/Sequence071603/edit_dir, where Sequence071603 is the name of your assembly directory. Consed NEVER modifies your original chromatograms. Don't cut or change or modify the original results of the experiment. You would NEVER take a pair of scissor to a photo of an agarose gel and cut out the bands you didn't want in photo, why would change the base calls you don't like? The chromotagrams are photographs of the electrophoresis of your termination products. Since Consed doesn't alter the results, you are free to experiment with different ways to assemble your data. There is nothing sacred about the assembly files, and you can delete them and rerun phredPhrap whenever you think it is necessary. 1. The first window that appears is Ace Files, used to select the file that contains all the information on the assembly. Select the most recent ace file, ie., the one with the highest number. (The various ace files contain a history of your assemblies, and you can revert to an older assembly if necessary). 2. The next window that appears is the Consed Main Window. It shows the Contig List, and the Read List with all useable reads and the contig they have joined. You select the contig you want to view. 3. The Aligned Reads window displays reads and read quality. The lighter the color, the better the quality of the sequence. At this point we can scroll through the assembly using either the slider or the arrow buttons or we can use the Navigate menus to jump to particular features, like a high quality discrepancy. If you find a place where you would like to see the chromatograms, use the 3rd mouse button to show a single read or the right mouse button to show all reads at a given location. 4. If there are no problems with the assembly, check the orientation, there is a reverse complement button on this window "Comp Cont", and then we are ready to export. There are two options for export, either export the entire contig, or just a portion of it. There are MANY user-controlled options in Consed. Options and the Consed Main Window: The most important items are all in the Option menu. General Preferences are listed separately from Primer Picking Preferences. There are only a couple of options that you might need, most notably "Show ABI Bases in the Trace Window" displays ABI base calls so they can be compared to those from Phred. The options in General Preferences and Primer Picking Preferences are for a single session. To write permanent preferences, use Edit Consed/Autofinish Parameters. (e.g., if you never want to design a primer shorter than 18 bp). These preferences are written to a file (consed.rsc) that is run whenever you execute Consed. "Fake" traces: At times you would like to be able to use raw sequence data, where no chromatogram is available. There is a small utility with Consed called "mktrace" or Make Trace, that will turn fasta files into fake chromomatogram and phred quality value files. If you need the fake trace data to guide an assembly, use the .phd version of the file, the fake data won't contribute to the consensus sequence. Tags: You can turn apply custom tags or turn off tags already applied to your sequence. Contig Sorting Running phredPhrap with no specified forcelevel produces many highly stringent contigs. The slightest haplotype divergence patterns, or polymorphisms, produce separate contigs. If you want more samples to be included in one contig for group analysis or primer design, there are several methods for joining multiple contigs into one. 1) forcelevels When you run phredPhrap, add the command "-forcelevel <#>" where # is the level of stringency, between 0 and 10. 1 is relatively high stringency, and consed will produce more contigs based on fewer segregating sites. as the forcelevel goes up, contig rules become less stringent and more reads are allowed to fit into one contig, with polyphred tagging all the mismatches and polymorphisms. a forcelevel of 9 produces few contigs, with lots of colorful polyphred markings all in one contig. 2) Compare-Align-Join contigs If you dont want to experiment with rerunning phredphrap at different forcelevels, you can also pick and choose the specific contigs you want to force together. This method can give you more control of exactly what samples go into the primer design, because maybe the forcelevel command will put more samples into a contig then you really wanted to have for input on the primer design. a) Open an aligned-reads window, scroll to a section of consensus sequence that most likely overlaps with other contigs you want to join. b) Hit Search for String button. Type in a portion of the consensus sequence, about 15 bases or so, and hit OK. c) In the Searching Contigs window, double click on a contig you want to join with another. in the aligned reads window for that contig, hit the compare cont button. d) minimize those windows, go back to the Searching Contigs window, double click on another contig. hit the compare cont button, go to compare contigs window. e) hit the align button in the middle of the Compare Contigs window. f) hit join contigs button. the program will get rid of the 2 old contigs and create one new one with all the joined sequences. Primer Design After creating contigs full of the samples you want and none of the samples you dont, you may find you want to design a primer for further sequencing experiments. Consed can do this for you. right click on the consensus sequence where you want new high-quality sequence to start, and choose the option for top-strand (forward) or bottom strand (reverse) primer design. The program will pick a plethora of possible primers, usually starting about 50 base pairs back from where you pick the sequence to begin, as it is compensating for the fact that high quality sequence tends to start down the strand from the primer annealing site. The list of primers the porgram creates includes all the information for length, melting temp, and other factors. If no primers are possible at the site you pick, you can change the site, or modify the preferences consed uses for designing primers. Consed follows guidelines for good primer design, such as only creating primers with a certain GC content and not allowing primers with too high or low a melting temp. You can change these preferences however if no primers within these parameters are possible: In the Consed Main Window, go to the Options button at the top. Select Primer Picking Preferences from the menu. ---------------------------------------------------------------------------- 7. AUTOFINISH (Audrey mv Ah Fo) ---------------------------------------------------------------------------- Phrap looks for similar regions of sequence and aligns them. A group of aligned, overlapping reads is called a contig. While a project is in the process of being sequenced, its reads will fall into many contigs because there will be regions that remain unsequenced (gaps). After finishing is complete, there should be only one remaining large contig containing the entire project sequence. The finishing goals: 1. Join contigs into a single, large contig that represents the sequence of the original clone (BAC, cosmid, etc.) 2. Address regions of low sequence quality that may contain sequence errors. (The NIH has set a sequence quality goal of 1 error/10kB. If the value in the window is 1 or less, the entire sequence of the contig is expected to be of sufficiently high quality). 3. Add coverage to single-clone regions (areas spanned by only one plasmid subclone). This is important because sometimes mutations can occur in a subclone; therefore we want to have at least 2 different subclones covering every base pair of a finished project. 4. Examine high quality discrepancies. These are regions where two high-quality sequencing reads have differences that can not explained by sequencing errors. There are many reasons why this can happen: one subclone could have mutated during the shotgun sequencing process; a read could be misassembled (put in the wrong place), especially if it falls inside a repeat region. 5. Confirm that the sequence is assembled correctly by comparing the sequence- derived restriction map to a fingerprint that is obtained by digesting the original clone with restriction enzymes and running these fragments on a gel. Autofinishing involves only steps 1 through 3. The Autofinish computer program does this by automatically choosing finishing reads. Autofinish will suggest universal primer reads (forward or reverse), custom primer reads with subclone templates, custom primer reads with whole clone templates, minilibraries from subclone templates). The user can configure Autofinish to perform tasks 1 through 3 or any combination of these tasks. However, Autofinish cannot solve the most complex problems. It is suggested that Autofinish be allowed to suggest reads for the first three rounds of finishing, and if the project still is not finished completely, a human finisher complete the work. Autofinish can be used to finish either genomic or cDNA. HOW TO USE AUTOFINISH To use Autofinish, begin by running the phredPhrap script that, among other things, runs Phred and then Phrap. Autofinish requires precisely the same input files that Consed needs: the ace file (output from Phrap) and the PHD files (output from Phred). To start Autofinish, from the command line type: $ consed -ace -autofinish Autofinish will create the following files: Autofinish.fof (which contains the names of files) e.g.(project name).001014.55627.customPrimers (project name).001014.55627.nav (project name).001014.55627.out (project name).001014.55627.sorted (project name).001014.55627.univForwards (project name).001014.55627.univReverses The "001014.155627" is the date and time in format YYMMDD.HHMISS The .out file is the Autofinish output file. This is the most important file to examine when evaluating Autofinish. The above files can be exported into Excel since the fields are separated by commas. Since you will be creating many files as you work through Autofinish exercises, the UNIX command $ ls -tlr is useful which is the same as ls, but it puts one file on a line and prints the lines so that the most recent files are on the bottom. This gives an easy way to see the files you have just created without having to always look at autofinish.fof to look for the names of the files you just created. CUSTOMIZING AUTOFINISH The user can customize Autofinish by creating a file named .consedrc containing some parameters and their customized values. A complete list of the customizable parameters is in the beginning of the Autofinish “.out” file. The list is quite long and the user must understand a parameter in order to correctly modify it, as this is the price of Autofinish's flexibility. The .consedrc file should be in the edit_dir of a particular project. When you are working on that project, whatever is in that .consedrc will override whatever is in your ~/.consedrc file or the CONSED_PARAMETERS file. e.g. if you want to use Autofinish just to close gaps, add the following to the .consedrc file: consed.autoFinishCoverLowConsensusQualityRegions: false consed.autoFinishCoverSingleSubcloneRegions: false and run Autofinish again Defaults, however are an excellent starting point. NOT REPEATING FAILED EXPERIMENTS For Autofinish to have a record in the ace file of reads for suggested experiments you should run Autofinish with the -doExperiments paramater i.e. type: $ consed -ace(ace filename) -autofinish -doExperiments If a custom primer read fails, Autofinish will not pick that same experiment again, and it won't pick a custom primer read that is even close to the failed one. Close is defined by the resource: consed.autoFinishNewCustomPrimerReadThisFarFromOldCustomPrimerRead: 50 The default 50 can be changed if you like -doExperiments will cause oligos to be tagged. You can turn this off by setting: consed.autoFinishTagOligosWhenDoExperiments: false DO NOT FINISH PARTICULAR REGIONS If a region has already been finished by an overlapping clone or you do not care to finish, then go to the consensus sequence, highlight the region you do not want finish, click on the middle mouse button on the doNotFinish tag and save the assembly as e.g. autofinish.fasta.screen.ace.2 Then run autofinish again: $ consed -ace autofinish.fasta.screen.ace.2 -autofinish NOT USING PARTICULAR SUBCLONE TEMPLATES If you no longer want to use a template, you can put it in a file badTemplates.txt in edit_dir. This is file with one name per line. TOO MANY UNIVERSAL PRIMER READS The consed.autoFinishRedundancy is 2.0 meaning that Autofinish will try to fix every problem area twice. If you want to change that, then put in your .consedrc file: Consed.autoFinishRedundancy: 1.0 AUTOFINISH CLOSING GAPS WITH MINILIBRARIES Use the following parameters: consed.autoFinishAllowMinilibraries true consed.autoFinishPrintMinilibrariesSummaryFile: true The following parameter can be set to true or false, depending on your preference: consed.autoFinishAlwaysCloseGapsUsingMinilibraries: false If the parameter above is set to false, then Autofinish will only choose minilibraries if the gap is the size below or larger: consed.autoFinishSuggestMinilibraryIfGapThisManyBasesOrLarger: 800 Autofinish can suggest more than one minilibrary per gap: Consed.autoFinishSuggestThisManyLibrariesPerGap: 2 CHECKS IF EITHER END OF THE CONTIG IS THE CLONE END Autofinish checks each contig end to see if it contains the clone end. The user can specify that AutoFinish do this in either of two ways: one applying when the clone vector is included in the shotgun, and the other when there are to reads that are known to lie at the ends of the clone. The latter case has been used for finishing cDNA clones. (see below) LIBRARIES WITH DIFFERNT INSERT SIZES If some of the subclone templates have different sizes, Autofinish must know which of the subclone templates have large inserts and which ones have small inserts. Modify determineReadTypes.perl so that it puts the library name into the phd file like this: WR { Template phredPhrap 990224:045110 Name: ab08a29 Lib: ab08 } where ab08a29 is the name of the subclone template and ab08 is the name of the library it came from. Then you must construct a file in the same directory as the ace file called "librariesInfo.txt" that lists the insert sizes of the different libraries like this: LIB { name : ab08 avgInsertSize: 1500 maxInsertSize: 3000 stranded: double Cost: 600.0 } LIB { name: ab10 avgInsertSize: 3000 maxInsertSize: 5000 stranded: single cost 1200.0 } 'name' is the name of the library. This is the name that goes into the PHD files after the "lib:" keyword. "avgInsertSize" is the average insert size of the library the figure to be used by Autofinish if there are not enough forward/reverse pairs. "maxInsertSize" is the maximum insert size. If forward/reverse pairs are further apart than this, Autofinish will assume these reads are misassembled. "stranded" is whether this template is single or double stranded. "cost" is the cost of making a minilibrary out of a template from this library. To find out information about your libraries, go to Consed's Main Window, point to "Info", hold down the left mouse button, and release on "Show Library Info". You should see the names of your libraries and the correct number of reads in each library. ---------------------------------------------------------------------------- 8. POLYPHRED ---------------------------------------------------------------------------- Help polyphred -help ---------------------------------------------------------------------------- Webpage update: scp Consed tgirke@cache.ucr.edu:~/public_html/Documents/Consed/Consed