Study of Long-Range DNA Correlations for Genes Affecting Milk Yield of Dairy Cows

Document Type : Research Paper

Authors

1 Animal Science Department, Faculty of Agricultural Sciences and Food Industry, Science and Research Branch, Islamic Azad University, Tehran, Iran

2 Animal Science Department, Faculty of Agriculture, Yasuj University, Yasuj, Iran

Abstract

Background and Objective:
For mathematically oriented investigators, DNA is a string of symbols. Its correlation structure can be characterized almost completely by the base-base correlation functions at all ranges, short or long, or by the corresponding power spectra. Long-range correlations between bases are a statistical feature found in the genomes of many eukaryotes, and their existence indicates DNA rearrangement or duplication processes. Such phenomena are not directly applicable to breeding and are mostly used in evolutionary studies. Our basic assumption in this study was that extracting the long-range correlations between all the different nucleotides within a gene makes it possible, first, to quantify the degree of correlation between them and, possibly, to improve SNP-based research. For a variety of reasons, not all investigations aimed at a complete characterization of the long-range correlation structure of DNA sequences were motivated by biology. Many were motivated instead by mathematical modeling, cryptography, language and code detection, dynamical systems, stochastic processes, and noise detection. Perhaps for this reason, long-range correlation structure has not yet become part of the “mainstream” toolbox of DNA sequence analysis in human genetics and breeding. DNA correlations can be estimated from a sequence of finite length with a frequency-count estimator, an indirect Bayesian estimator, or a direct Bayesian estimator. Here we followed the approach implemented in CorGen.
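As an illustration of the frequency-count estimator named above, the following minimal Python sketch estimates a base-base correlation at distance d as C_ab(d) = P_ab(d) - P_a * P_b, where P_ab(d) is the observed frequency of base a followed by base b at distance d. The function name `base_correlation` and the exact normalization are assumptions made for this example; they are not taken from CorGen.

```python
def base_correlation(seq, a, b, d):
    """Frequency-count estimate of C_ab(d) = P_ab(d) - P_a * P_b.

    seq : DNA sequence as a string
    a, b: single-letter bases (e.g. "A", "T")
    d   : distance between the two positions, 0 < d < len(seq)
    Illustrative helper only; not the CorGen implementation.
    """
    seq = seq.upper()
    n = len(seq)
    if not 0 < d < n:
        raise ValueError("distance d must satisfy 0 < d < len(seq)")
    pairs = n - d  # number of position pairs (i, i + d)
    # Joint frequency: base a at position i and base b at position i + d.
    joint = sum(1 for i in range(pairs) if seq[i] == a and seq[i + d] == b) / pairs
    # Single-base frequencies over the whole sequence.
    p_a = seq.count(a) / n
    p_b = seq.count(b) / n
    return joint - p_a * p_b


if __name__ == "__main__":
    toy = "ATATATAT"  # perfectly periodic toy sequence
    for dist in (1, 2, 3):
        print(dist, base_correlation(toy, "A", "T", dist))
```

On the periodic toy sequence the estimate alternates in sign with distance. For an uncorrelated random sequence it would fluctuate around zero at every distance.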
Materials and methods: Twenty-four genes were selected from among the genes affecting milk yield in dairy cows. The number of exons, the length of the gene and of each exon, and the position of each gene on the chromosome were obtained from the NCBI gene database, and the sequences were saved in FASTA format. The accession numbers of the studied genes were entered into software previously written in C#, which produced the required output. CorGen software was then used to calculate the long-range DNA correlations of the genes involved in milk production.
Results: A significant level of long-range correlation was found in the DNA sequences of several genes, including EZR, FGG, KRT6A, RAB1A, EIF3L, TBC1D20, ZNF419, S100A16, MRPL3, TPPP3, and PHF10. The decay exponent of the fitted power-law function, based on the long-range correlations obtained from genes of different lengths, ranged from 0.146 to 0.643. It can therefore be concluded that the decline of the long-range correlations with increasing distance between sequence positions does not follow a random process, and that the fractal geometry of nature is also seen in these genes.
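The power-law decay described above can be written as C(d) ≈ A · d^(−α), with the reported exponents α between 0.146 and 0.643. One common way to estimate α is a least-squares line fit on log-log axes. The sketch below assumes that approach; the abstract does not state which fitting procedure was actually used, and the function name `fit_power_law` is hypothetical.

```python
import math


def fit_power_law(distances, correlations):
    """Fit C(d) = A * d**(-alpha) by least squares on log-log axes.

    Returns (alpha, A). Points with non-positive distance or correlation
    are skipped because their logarithm is undefined.
    Illustrative sketch; the fitting method is an assumption.
    """
    pts = [(math.log(d), math.log(c))
           for d, c in zip(distances, correlations)
           if d > 0 and c > 0]
    if len(pts) < 2:
        raise ValueError("need at least two points with d > 0 and C(d) > 0")
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    if sxx == 0:
        raise ValueError("all distances are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # log C = log A - alpha * log d, so alpha is minus the slope.
    return -slope, math.exp(intercept)


if __name__ == "__main__":
    # Synthetic data with a known exponent, used only to check the fit.
    ds = list(range(1, 101))
    cs = [0.05 * d ** -0.4 for d in ds]
    alpha, amplitude = fit_power_law(ds, cs)
    print(round(alpha, 6), round(amplitude, 6))
```

With real correlation estimates, noise makes C(d) negative at some distances. This sketch skips those points, which can bias the fit, so the distance range over which the fit is performed should be reported with the exponent.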
This research was the first attempt to address long-range DNA correlations in dairy cattle genes. It had at least two goals. First, results on the correlation structure of DNA sequences have been discordant. Because of this, some researchers still believe that DNA sequences show no long-range correlation that cannot be explained by basic known stochastic processes such as a random sequence or a Markov chain; the former has no correlation by construction, and the latter accounts only for short-range correlations. Resolving this disagreement can be straightforward once everybody agrees to use the same measure of correlation and the same estimator, and applies that estimator to the same sequence. The second goal is to encourage more biologically motivated study of the long-range correlation structure of DNA sequences, especially in animal breeding. Although this research does not accomplish that task, the intention was at least to raise the issue. Most current studies of correlation in DNA sequences, especially long-range correlation, rely on base-base statistical correlations. This base-base correlation may not be a powerful tool for revealing correlation on a global scale or between larger blocks of DNA sequence.

Conclusion: The genes studied showed high complexity and scale invariance in their DNA. This type of analysis can be extended to breeding applications. A more complete characterization of correlations between bases at both short and long distances became possible only as long DNA sequences became commonly available. Thanks to the rapid growth of DNA sequencing technologies, almost the entire genome of an organism can now be sequenced quickly and at low cost. Raw data will therefore be available to many researchers who wish to test new DNA correlation hypotheses on readily available sequences. The claim of base-base statistical correlation at long distances in DNA sequences is thought to be still a few steps away from revealing a basic organizing principle of the genome.

Keywords

