Makolo AU and Lamidi UA
Motifs are short, recurring sequence patterns, usually between 6 and 20 bases long. Within deoxyribonucleic acid (DNA) sequences, these motifs constitute the conserved regions of the most common signatures for recognizing protein domains that are relevant to their evolution, function and interaction. Gibbs sampling is a Markov Chain Monte Carlo (MCMC) algorithm that has previously been applied to discover motifs in DNA sequences. A problem with this technique is the large number of iterative operations in the sampling process, because it progressively chooses new candidate motif positions by continuous random sampling of the DNA sequences. We applied an Improved Gibbs (iGibbs) sampling algorithm to breast cancer (brca) human disease DNA sequences obtained from https://www.ncbi.nlm.nih.gov/nuccore to overcome this unwieldy iteration, altering the process to reduce runtime while still producing a satisfactorily accurate motif result. The iGibbs algorithm takes a FASTA or GenBank (gbk) DNA file as input and creates a list of all nucleotides from which to predict a random sampling start position. It takes a motif length and a smaller iteration count, and then computes the probability and position ranking scores using a Position Weight Matrix (PWM). The algorithm was implemented using Python, Python(x,y) and Biopython. The iGibbs algorithm was evaluated using motif lengths of 12, 18 and 24 on sequences of 5,000, 10,000 and 15,000 bases at different iteration levels. The results showed that iGibbs returned better average runtimes of 7, 10 and 23 seconds respectively, compared to 12, 32 and 60 seconds respectively for the existing Gibbs sampling algorithm found at http://ccmbweb.ccv.brown.edu/gibbs/gibbs.html. The accuracy of the motif result was checked using the Hamming distance to find the contiguous string and the minimum edit distance to the consensus sequences.
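For orientation, the baseline technique summarized above can be sketched in a few lines of Python. This is a minimal textbook Gibbs motif sampler, not the iGibbs algorithm itself, whose modifications are not detailed here. All function names (`build_pwm`, `kmer_probability`, `gibbs_sampler`, and so on), the pseudocount of 1, and the default iteration count are illustrative assumptions. The sketch also assumes at least two input sequences made up only of the uppercase letters A, C, G and T.

```python
import random

BASES = "ACGT"


def hamming_distance(a, b):
    """Count mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def build_pwm(motifs, pseudocount=1.0):
    """Column-wise base probabilities: a simple position weight matrix."""
    k = len(motifs[0])
    total = len(motifs) + 4 * pseudocount
    pwm = []
    for i in range(k):
        column = [m[i] for m in motifs]
        pwm.append({b: (column.count(b) + pseudocount) / total for b in BASES})
    return pwm


def kmer_probability(kmer, pwm):
    """Probability of a k-mer under the PWM (product over positions)."""
    p = 1.0
    for i, base in enumerate(kmer):
        p *= pwm[i][base]
    return p


def consensus(motifs):
    """Most frequent base in each column (ties broken in A, C, G, T order)."""
    pwm = build_pwm(motifs, pseudocount=0.0)
    return "".join(max(BASES, key=col.get) for col in pwm)


def score(motifs):
    """Total Hamming distance of the motifs to their consensus (lower is better)."""
    cons = consensus(motifs)
    return sum(hamming_distance(m, cons) for m in motifs)


def gibbs_sampler(seqs, k, iterations=500, seed=0):
    """Return (best_motifs, best_score) found by plain Gibbs sampling."""
    rng = random.Random(seed)
    # Start from one random window of length k in every sequence.
    motifs = []
    for s in seqs:
        start = rng.randrange(len(s) - k + 1)
        motifs.append(s[start:start + k])
    best, best_score = list(motifs), score(motifs)
    for _ in range(iterations):
        # Hold one sequence out and build the PWM from the remaining motifs.
        i = rng.randrange(len(seqs))
        pwm = build_pwm(motifs[:i] + motifs[i + 1:])
        # Resample that sequence's motif in proportion to its PWM probability.
        windows = [seqs[i][j:j + k] for j in range(len(seqs[i]) - k + 1)]
        weights = [kmer_probability(w, pwm) for w in windows]
        motifs[i] = rng.choices(windows, weights=weights, k=1)[0]
        current = score(motifs)
        if current < best_score:
            best, best_score = list(motifs), current
    return best, best_score
```

A call such as `gibbs_sampler(sequences, k=12, iterations=200)` returns one candidate motif per input sequence together with its total Hamming distance to the consensus. Because every iteration rescans one full sequence, the iteration count directly drives the runtime, which is the cost the abstract describes as the motivation for a reduced-iteration variant.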