*** GENOME EXPLORER HELP FILE ***

To provide help on using the GUI, and information about how the programs run.

Contents
1) Outline of Function
2) Parameters loaded from the .inf file (settings menu)
3) The User Interface
4) Underlying Method

--- DIANA ---

--- 1) Outline of Function ---

Defined Interval Amino-acid Numerating Algorithm.

Allows the user to search for a minimum number of occurrences of the search element within a particular sequence window (a run of consecutive amino acids / nucleotides within a sequence).

Produces one output file for each input file, containing all sequences which meet the entered criteria. The output file is in fasta format.

The search element can be:

a) a number of different nucleotides / amino acids, searched for individually but counted cumulatively
    e.g. search for Q and N in 10 residue window yNabggQNyN:
    cumulative score: 4

b) a number of different nucleotides / amino acids, searched for individually but counted independently
    e.g. search for Q and N in 10 residue window yNabggQNyN:
    Q score: 1, N score: 3, highest overall score: 3

c) a number of strings, searched for individually and counted cumulatively
    e.g. search for GGY and YQQ in 10 residue window aaaGGYQQgg:
    GGY score: 1, YQQ score: 1, cumulative score: 2
    however, search for GG and QQ in 10 residue window aaGGGygQQg:
    GG score: 1, QQ score: 1, cumulative score: 2
    because when GG is found, the search continues from where that 'hit' ends, so the third G is not paired with the second G to give another GG hit.

d) a number of strings, searched for individually and counted independently
    As for c), but only the highest individual score counts.

--- 2) Parameters loaded from the .inf file (settings menu) ---

fastaLineLength
    number of characters per line of fasta output (80)
dianaInDir
    directory in which to open the infile file chooser
dianaOutDir
    default directory to which all output files will be written
dianaOutfileExt
    file extension for all outfiles (.di_out.fasta)

--- 3) The User Interface ---

fasta files to search
    Use the add / remove buttons to add files to, or remove files from, the list box.
    All files displayed in the list box will be searched (independently) when the program is run.

select output directory
    select the directory to which all output files will be written

use infile name as outfile basename
    tick this check box to use the name of the file being searched as the basename for the outfile of sequences that meet criteria. The file extension dianaOutfileExt (as set in the settings panel) will be added.

select a name for output fasta file
    enter the basename for output files - this will be incremented for each infile and the file extension dianaOutfileExt will be added.
    e.g. outfile_1.diOut, outfile_2.diOut etc

match min quantity
    match a minimum number of residues in a window of a specific size.

match min percentage
    match a minimum percentage of residues in a range of window sizes.

min number of matches in window
    number of matches per window - if cumulative search, the total number of all matches must equal or exceed this number if the sequence is to be a 'hit'. If NOT cumulative, one individual search must equal or exceed this score for the sequence to be a hit.
    e.g. search for Q and N in 10 residue window, min number of matches = 5
        yNQbggQNyN
        Q score: 2, N score: 3
        cumulative score: 5, therefore a hit
        OR individual max score: 3, therefore not a hit

min window size (residues)
    the minimum number of residues in the search window (only applicable when a percentage search is being performed)

max window size (residues)
    the maximum number of residues in the search window (for a percentage search) or the fixed window size (for a minimum quantity search).

report progress every x sequences
    reads the fasta files to be searched in blocks of this many sequences to prevent memory problems.
    Reports progress to the screen every time a batch finishes.

count all residue string hits cumulatively
    tick this box to count hits cumulatively, as described in the 'outline of function' section of this document

strings to search for (; separated)
    enter the strings to search for if you do not want to search for individual nucleotides or amino acids. Upper/lower case is irrelevant as all data is converted to lower case before comparisons are made

dna sequence / protein sequence / enter text
    radio buttons enable the different components for entering searches

nucleotide / amino acid check boxes
    tick the boxes of ALL residues you want to search for

--- 4) Underlying Method ---

Code for DIANA (Defined Interval Amino acid Numerating Algorithm).
This class has additional functions that allow it to search for single elements, or strings of letters.

*** FUNCTION
    read in a fasta file of protein sequences
    set a 'window size' in which to scan for a set of amino acids
    set the amino acids to scan for
    set a cut off number of key amino acids that must be found in the window to be interesting
    scan each possible window of each sequence for the key amino acids
    fasta file is read in 'batches' of sequences - to avoid hitting the memory ceiling

*** ASSUMPTIONS
    1) that the input file is in FASTA format where '>' character indicates the start of a protein name
    2) that ALL characters in a sequence (except blank spaces) represent residues and should therefore be counted in length-of-protein counts, even if they do not represent a single specific residue

*** ALGORITHM
    overall form of the diana part of this program
    1) get array of names and corresponding array of protein sequences
    2) determine average key aa per window (string of windowLength consecutive amino acids in a protein)
    3) look at all possible windows and find those rich in key aas
    4) output list of all sequences with key aa rich regions, and max key-aa-per-window value

***
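
The window scoring described in this help file can be illustrated with a short Python sketch. This is not the program's own code: the function names (count_hits, window_score, best_window_score, is_hit) are hypothetical, and treating a sequence shorter than the window as a single window is an assumption. It covers only the minimum quantity search, not the percentage search.

```python
def count_hits(window, term):
    """Count non-overlapping, case-insensitive occurrences of term.

    str.count resumes scanning after each match ends, which mirrors
    "the search continues from where that 'hit' ends".
    """
    return window.lower().count(term.lower())


def window_score(window, terms, cumulative):
    """Score one window: sum of all counts, or the highest single count."""
    counts = [count_hits(window, term) for term in terms]
    return sum(counts) if cumulative else max(counts, default=0)


def best_window_score(sequence, terms, window_size, cumulative=True):
    """Slide a fixed-size window along the sequence and keep the best score.

    Assumption: a sequence shorter than window_size is scored as one window.
    """
    last_start = max(len(sequence) - window_size, 0)
    return max(
        window_score(sequence[i:i + window_size], terms, cumulative)
        for i in range(last_start + 1)
    )


def is_hit(sequence, terms, window_size, min_matches, cumulative=True):
    """A sequence is a hit if any window reaches min_matches."""
    return best_window_score(sequence, terms, window_size, cumulative) >= min_matches


# Examples taken from this help file:
print(window_score("yNabggQNyN", ["Q", "N"], cumulative=True))    # 4
print(window_score("yNabggQNyN", ["Q", "N"], cumulative=False))   # 3
print(window_score("aaGGGygQQg", ["GG", "QQ"], cumulative=True))  # 2
print(is_hit("yNQbggQNyN", ["Q", "N"], 10, 5, cumulative=True))   # True
print(is_hit("yNQbggQNyN", ["Q", "N"], 10, 5, cumulative=False))  # False
```

Each search term is counted separately, so different strings may share residues (GGY and YQQ both score in aaaGGYQQgg), while repeated hits of the same string never overlap.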