CN113764041B - Searching method and device for species gene identification tag and electronic equipment - Google Patents

Searching method and device for species gene identification tag and electronic equipment

Info

Publication number
CN113764041B
Authority
CN
China
Prior art keywords
species
sequence
nucleic acid
tag
name
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110901123.2A
Other languages
Chinese (zh)
Other versions
CN113764041A (en
Inventor
李杨坤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yuanfang Gene Technology Beijing Co ltd
Original Assignee
Yuanfang Gene Technology Beijing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yuanfang Gene Technology Beijing Co ltd filed Critical Yuanfang Gene Technology Beijing Co ltd
Priority to CN202110901123.2A priority Critical patent/CN113764041B/en
Publication of CN113764041A publication Critical patent/CN113764041A/en
Application granted granted Critical
Publication of CN113764041B publication Critical patent/CN113764041B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Engineering & Computer Science (AREA)
  • Biotechnology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Analytical Chemistry (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Chemical & Material Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a device for searching a species gene identification tag, and electronic equipment, wherein the method comprises the following steps: S1: searching and comparing a target species in a sequence name total nucleic acid library to obtain species information of the target species; S2: performing a comparison according to the nucleic acid sequence, the species classification status and the name to obtain a first comparison result; S3: classifying the first comparison result to form fragment pools corresponding to each range of sequence similarity, and performing a Blastn comparison of those fragment pools against the sequence name total nucleic acid library to obtain a second comparison result; S4: performing automatic quality control on the second comparison result to obtain a species characteristic tag, and comparing the species characteristic tag with a name fragment tag library to obtain a first identification tag and a first identification threshold of the target species. By combining the identification tag with the identification threshold, the invention effectively increases the range and accuracy of intra-species and inter-species identification; and because analysis and calculation are carried out automatically, labor is saved.

Description

Searching method and device for species gene identification tag and electronic equipment
Technical Field
The invention relates to the field of biotechnology, in particular to a method and a device for searching a species gene identification tag and electronic equipment.
Background
Identification and classification of species based on DNA or RNA technology mostly depends on conserved sequences, which are commonly found in 16S (bacteria), 18S (higher eukaryotes), ITS (fungi) and key genes (viruses). A conserved sequence is a nucleotide fragment in a DNA molecule that remains substantially unchanged during evolution.
The most common methods for species identification based on conserved sequences in the related art are: (1) using alignment software such as ClustalW, BioEdit, T-Coffee, MAFFT or Blastn to compare more than two known gene sequences; (2) manually selecting a target sequence fragment according to sequence similarity, and then designing primers against the whole genome sequence or a local region of the species; (3) obtaining sequence fragments by PCR amplification or sequencing; (4) splicing or assembling the obtained sequences; (5) aligning them with known sequences of species in a database, thereby completing the identification of the species.
However, for many species this approach can identify only the family or genus level, not the intra-species and inter-species levels; moreover, each step requires different professionals and depends on the skill and experience of the operators.
Disclosure of Invention
The main object of the invention is to provide a method and a device for searching a species gene identification tag, so as to solve the problems that existing methods cannot identify biological species at the intra-species and inter-species levels and depend on manual operation.
In order to achieve the above object, a first aspect of the present invention provides a method for searching a species gene identification tag, comprising:
S1: receiving an input target species, and searching and comparing the target species in a sequence name total nucleic acid library to obtain species information of the target species, wherein the species information comprises a nucleic acid sequence, a species classification status and a name;
S2: performing a comparison according to the nucleic acid sequence, the species classification status and the name to obtain a first comparison result;
S3: classifying the first comparison result according to the range of the sequence similarity to form fragment pools corresponding to the sequence similarity of each range, and comparing the fragment pools corresponding to the sequence similarity of each range with the sequence name total nucleic acid library to obtain a second comparison result;
S4: performing automatic quality control on the second comparison result based on the configured reject sequence control index to obtain a species characteristic tag, and comparing the species characteristic tag with a name fragment tag library to obtain a first identification tag and a first identification threshold of the target species.
Optionally, before the receiving the input target species, the method further comprises:
obtaining species whole gene DNA and RNA sequences, the species whole gene DNA and RNA sequences comprising an NT database;
converting the RNA sequences into DNA sequences by reverse transcription, and constructing the sequence name total nucleic acid library.
Optionally, the comparing according to the nucleic acid sequence, the species classification status and the name to obtain a first comparison result includes:
extracting the sequences and names of the genus of the target species from the sequence name total nucleic acid library according to the species classification status and the name, to form a sequence pool of the genus of the target species;
carrying out K-mer segmentation on the nucleic acid sequence to form fragment pools corresponding to each K value, and extracting a name fragment tag library from the fragment pools corresponding to each K value;
based on a configured first comprehensive threshold, comparing the fragment pool corresponding to each K value with the sequence pool of the genus of the target species to obtain a first comparison result;
wherein the first comprehensive threshold comprises: an alignment length of 100%, a sequence similarity not higher than 95%, and an alignment score not lower than 1e-5.
Further, the K-mer segmentation of the nucleic acid sequence comprises:
if the nucleic acid sequence is a whole genome sequence, carrying out K-mer segmentation on the whole genome sequence, wherein K has an initial value of 30, a step size of 1 and an end value of 300;
if the nucleic acid sequence is a local sequence and the length of the local sequence is greater than or equal to 300, carrying out K-mer segmentation on the whole genome sequence, wherein K has an initial value of 30, a step size of 1 and an end value of 300;
if the nucleic acid sequence is a local sequence and the length of the local sequence is less than 300, carrying out K-mer segmentation on the whole genome sequence, wherein K has an initial value of 30, a step size of 1 and an end value equal to the length of the local sequence.
Optionally, the comparing of the fragment pools corresponding to the sequence similarity of each range with the sequence name total nucleic acid library to obtain a second comparison result includes:
based on a configured second comprehensive threshold, comparing the fragment pools corresponding to the sequence similarity of each range with the sequence name total nucleic acid library, and removing sequences whose sequence similarity to other species is higher than a preset value, to obtain a second comparison result;
wherein the second comprehensive threshold comprises: an alignment length of 100%, a sequence similarity higher than 95%, and an alignment score not lower than 1e-5.
Optionally, the performing of automatic quality control on the second comparison result based on the configured reject sequence control index to obtain a species characteristic tag, and the comparing of the species characteristic tag with a name fragment tag library to obtain a first identification tag and a first identification threshold of the target species, includes:
performing automatic quality control on the second comparison result based on the configured reject sequence control index, wherein the reject sequence control index for automatic quality control comprises: a GC content higher than 70%, and multiple bases arranged consecutively;
acquiring a species characteristic tag, a distinguishing threshold and a K-mer value;
based on the configured rejection index, comparing the species characteristic tag with the name fragment tag library, and rejecting repeated fragments and positions where the sequence similarity is higher than a preset value, to obtain a first identification tag and a first identification threshold of the target species;
wherein the rejection index uses the following as a third comprehensive threshold: an alignment length of 100%, a sequence similarity not higher than 95%, and an alignment score not lower than 1e-5.
Optionally, the method further comprises:
taking the first identification tag obtained in step S4 as the next nucleic acid sequence, and repeating steps S2 to S4 to obtain a second identification tag and a second identification threshold of the target species.
In a second aspect, the present invention provides a device for searching for a species gene identification tag, comprising:
The searching and comparing unit is used for receiving an input target species, searching and comparing the target species in the sequence name total nucleic acid library, and acquiring species information of the target species, wherein the species information comprises a nucleic acid sequence, a species classification status and a name;
The comparison unit is used for performing a comparison according to the nucleic acid sequence, the species classification status and the name to obtain a first comparison result;
The Blastn comparison unit is used for classifying the first comparison result according to the range of the sequence similarity to form fragment pools corresponding to the sequence similarity of each range, and carrying out Blastn comparison on the fragment pools corresponding to the sequence similarity of each range and the sequence name total nucleic acid library to obtain a second comparison result;
The searching unit is used for automatically controlling the quality of the second comparison result based on the configured eliminating sequence control index, obtaining a species characteristic tag, and comparing the species characteristic tag with a name fragment tag library to obtain a first identification tag and a first identification threshold of the target species.
A third aspect of the present invention provides a computer-readable storage medium storing computer instructions for causing a computer to perform the method of searching for a species gene identification tag provided in any one of the first aspects.
A fourth aspect of the present invention provides an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores a computer program executable by the at least one processor, the computer program being executable by the at least one processor to cause the at least one processor to perform the method of searching for a species gene identification tag provided in any one of the first aspects.
In the searching method of the species gene identification tag provided by the invention, the fragment pools corresponding to the sequence similarity of each range are compared with the sequence name total nucleic acid library by Blastn, and the species characteristic tag is compared with the name fragment tag library by Blastn, so that the first identification tag and the first identification threshold of the target species are finally obtained. Through three comparisons, the invention combines the identification tag with the identification threshold, breaks through the traditional method of identifying species by conserved sequences, and effectively increases the range and accuracy of intra-species and inter-species identification. In addition, once the target species to be searched has been input, the invention performs the analysis and calculation automatically, which removes many manual steps and solves the technical problems that existing methods cannot identify biological species at the intra-species and inter-species levels and depend on manual operation.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings which are required in the description of the embodiments or the prior art will be briefly described, it being obvious that the drawings in the description below are only some embodiments of the invention and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a method for searching a species gene identification tag according to an embodiment of the present invention;
FIG. 2 is a flowchart of a method for searching a species gene identification tag according to an embodiment of the present invention;
FIG. 3 is a block diagram of a device for searching a species gene identification tag according to an embodiment of the present invention;
FIG. 4 is a block diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, a technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate in order to describe the embodiments of the invention herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be noted that, without conflict, the embodiments of the present invention and features of the embodiments may be combined with each other. The invention will be described in detail below with reference to the drawings in connection with embodiments.
Identification and classification of species based on DNA or RNA technology mostly depends on conserved sequences, which are commonly found in 16S (bacteria), 18S (higher eukaryotes), ITS (fungi) and key genes (viruses); a conserved sequence is a nucleotide fragment in a DNA molecule that remains essentially unchanged during evolution. In the related art, species identification methods based on conserved sequences can identify many species only at the family or genus level, and cannot distinguish them at the intra-species and inter-species levels; moreover, each step requires different professionals and depends on the skill and experience of the operators.
To solve these problems, the invention uses K-mers to find a gene fragment sequence or sequence combination unique to a species and uses it as the identification tag of that species, so that identification tags can be searched, detected, classified and counted more efficiently.
Here, a K-mer is a substring of K bases obtained by splitting a sequence; a sequence of length N can be split into N-K+1 K-mer subsequences. Five bases are commonly found in organisms: adenine (A), guanine (G), cytosine (C), thymine (T) and uracil (U).
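The K-mer definition above can be illustrated with a minimal Python sketch. The function name `kmers` and the example sequence are illustrative assumptions, not part of the patent; the sketch only shows that a sequence of length N yields N-K+1 overlapping K-mers.

```python
def kmers(sequence: str, k: int) -> list[str]:
    """Return every overlapping substring of length k, in order of position."""
    if k <= 0:
        raise ValueError("k must be positive")
    # A sequence shorter than k yields an empty list.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


fragments = kmers("ACGTACGT", 3)
print(fragments)       # ['ACG', 'CGT', 'GTA', 'TAC', 'ACG', 'CGT']
print(len(fragments))  # N - K + 1 = 8 - 3 + 1 = 6
```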
The embodiment of the invention provides a searching method for a species gene identification tag, as shown in fig. 1, comprising the following steps of S1 to S4:
Step S1: receiving an input target species, and searching and comparing the target species in a sequence name total nucleic acid library to obtain species information of the target species, wherein the species information comprises a nucleic acid sequence, a species classification status and a name;
The input target species to be searched may be any one of the Chinese name, the Latin name, the whole genome sequence or a local sequence of the target species, so the search can be based on either a local sequence or the whole genome sequence of the species; if the input whole genome sequence or local sequence is an RNA sequence, it is automatically converted into a DNA sequence. After the input is received, the search and comparison against the sequence name total nucleic acid library runs as a fully automated process, and the species information obtained depends on the input: the nucleic acid sequence in the species information may be the whole genome sequence alone, or may comprise both the whole genome sequence and a local sequence.
Specifically, if the input is the Chinese name, Latin name or whole genome sequence of the target species, the nucleic acid sequence is the whole genome sequence, and the species classification status, name and whole genome sequence of the target species are obtained; if the input is a local sequence of the target species, the nucleic acid sequence comprises both the whole genome sequence and the local sequence, and the species classification status, name, whole genome sequence and local sequence are obtained. Because the input may be either a whole genome sequence or a local sequence, the search can cover a local sequence or the whole genome of the species, and identification tags can be searched synchronously for a single species or for multiple species; once the target species or sequence has been input, analysis and calculation are fully automated, which removes many manual steps.
Specifically, before the receiving the input target species in step S1, the method further includes:
Obtaining species whole gene DNA and RNA sequences, the species whole gene DNA and RNA sequences comprising an NT database; that is, the whole gene DNA and RNA sequences of the species reported to date, including the NT database (a nucleic acid sequence database), are obtained by downloading;
Converting the RNA sequences into DNA sequences by reverse transcription, and constructing the sequence name total nucleic acid library. That is, each obtained RNA sequence is reverse-transcribed into a DNA sequence; once the conversion is complete, a database is constructed and named the sequence name total nucleic acid library.
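As a rough illustration of the RNA-to-DNA conversion step, the sketch below assumes that the in-silico conversion amounts to replacing uracil (U) with thymine (T) while keeping the strand orientation. The patent does not spell out the exact operation, so this reading, the function name `rna_to_dna` and the toy records are all assumptions.

```python
def rna_to_dna(sequence: str) -> str:
    """Return the DNA-alphabet form of a sequence (assumption: U is replaced by T)."""
    return sequence.upper().replace("U", "T")


# Hypothetical records: (sequence name, sequence). DNA entries pass through unchanged.
records = [("virus_x_segment_1", "AUGGCUUAA"), ("bacterium_y_16S", "ATGGCTTAA")]
library = {name: rna_to_dna(seq) for name, seq in records}
print(library["virus_x_segment_1"])  # ATGGCTTAA
```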
Step S2: performing a comparison according to the nucleic acid sequence, the species classification status and the name to obtain a first comparison result;
Specifically, the comparison in step S2 according to the nucleic acid sequence, the species classification status and the name to obtain a first comparison result includes:
extracting the sequence and the name of the genus of the target species from the sequence name total nucleic acid library according to the species classification status and the name to form a sequence pool of the genus of the target species;
Carrying out K-mer segmentation on the nucleic acid sequence to form fragment pools corresponding to each K value, and extracting a name fragment tag library from the fragment pools corresponding to each K value; here, a K-mer is a substring of K bases obtained by splitting a sequence, a sequence of length N can be split into N-K+1 K-mer subsequences, and the sequence may be a DNA sequence or an RNA sequence;
Based on a configured first comprehensive threshold, comparing the fragment pool corresponding to each K value with the sequence pool of the genus of the target species to obtain a first comparison result; the first comprehensive threshold is the set of parameters and indices configured for the Blastn comparison, and Blastn is a sequence alignment method that compares the homology of nucleic acid sequences directly by aligning a nucleic acid sequence against a nucleic acid library;
Wherein the first comprehensive threshold comprises: an alignment length of 100%, a sequence similarity (IDF) not higher than 95%, and an alignment score not lower than 1e-5. Sequence similarity refers to the ratio of the number of identical bases at the same positions to the total number of bases in the two fragments.
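The sequence similarity definition above can be sketched as follows for two fragments aligned over their full length. This is a simplified illustration that assumes equal-length, gap-free fragments (consistent with an alignment length of 100%); Blastn computes identity on gapped alignments, and the function name `sequence_similarity` is hypothetical.

```python
def sequence_similarity(a: str, b: str) -> float:
    """Percentage of positions at which two equal-length fragments carry the same base."""
    if len(a) != len(b) or not a:
        raise ValueError("fragments must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a.upper(), b.upper()))
    return 100.0 * matches / len(a)


SIMILARITY_LIMIT = 95.0  # first comprehensive threshold: similarity not higher than 95%

similarity = sequence_similarity("ACGTACGTAC", "ACGTACGTAA")
print(similarity)                      # 90.0
print(similarity <= SIMILARITY_LIMIT)  # True: the fragment stays in the first result
```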
Specifically, the K-mer segmentation of the nucleic acid sequence comprises:
If the nucleic acid sequence is a whole genome sequence, carrying out K-mer segmentation on the whole genome sequence, wherein K has an initial value of 30, a step size of 1 and an end value of 300;
If the nucleic acid sequence is a local sequence and the length of the local sequence is greater than or equal to 300, carrying out K-mer segmentation on the whole genome sequence, wherein K has an initial value of 30, a step size of 1 and an end value of 300;
If the nucleic acid sequence is a local sequence and the length of the local sequence is less than 300, carrying out K-mer segmentation on the whole genome sequence, wherein K has an initial value of 30, a step size of 1 and an end value equal to the length of the local sequence. That is, if the local sequence is long enough, the end value is 300; if the local sequence is shorter than 300, its actual length is taken as the end value.
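The three cases above reduce to one rule for choosing the range of K values, sketched below. The function name `k_values` and its parameters are illustrative assumptions; only the constants 30, 1 and 300 come from the text.

```python
K_START = 30     # initial K value
K_END_MAX = 300  # end value for whole genomes and for local sequences of length >= 300


def k_values(sequence_length: int, is_whole_genome: bool) -> range:
    """Return the K values to use (step size 1) for a given input sequence."""
    if is_whole_genome or sequence_length >= K_END_MAX:
        end = K_END_MAX
    else:
        end = sequence_length  # short local sequence: its own length is the end value
    # The range is empty when a local sequence is shorter than the initial K value.
    return range(K_START, end + 1)


print(list(k_values(5_000_000, True))[:3])  # [30, 31, 32]
print(k_values(120, False)[-1])             # 120
```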
By comparing the sequence similarity of the target species and each species in the genus through the steps S1 and S2, the range and accuracy of the intraspecies identification of the species identification are effectively increased.
Step S3: classifying the first comparison result according to the range of the sequence similarity to form fragment pools corresponding to the sequence similarity of each range, and comparing the fragment pools corresponding to the sequence similarity of each range with the sequence name total nucleic acid library to obtain a second comparison result. The first comparison result is classified by sequence similarity range to form a fragment pool for each range; with sequence similarity as the reference index, the starting range is 0%-5%, the step size is 5%, and the ending range is 90%-95%.
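The classification into similarity ranges in step S3 can be pictured with the following sketch. The text does not say how boundary values are assigned, so the sketch assumes that a value on a boundary (for example exactly 5%) goes to the upper range, except 95%, which goes to the last range; the function names and the input format of (fragment, similarity) pairs are also assumptions.

```python
from collections import defaultdict
from typing import Optional

BIN_WIDTH = 5        # step size of 5 percentage points
MAX_SIMILARITY = 95  # upper end of the last range (90%-95%)


def similarity_bin(similarity: float) -> Optional[tuple[int, int]]:
    """Return the (lower, upper) range containing a similarity value, or None if outside 0-95."""
    if similarity < 0 or similarity > MAX_SIMILARITY:
        return None
    lower = min(int(similarity // BIN_WIDTH) * BIN_WIDTH, MAX_SIMILARITY - BIN_WIDTH)
    return (lower, lower + BIN_WIDTH)


def build_fragment_pools(results: list[tuple[str, float]]) -> dict[tuple[int, int], list[str]]:
    """Group fragments into one pool per similarity range."""
    pools: dict[tuple[int, int], list[str]] = defaultdict(list)
    for fragment, similarity in results:
        bin_range = similarity_bin(similarity)
        if bin_range is not None:
            pools[bin_range].append(fragment)
    return dict(pools)


pools = build_fragment_pools([("frag_a", 3.2), ("frag_b", 92.5), ("frag_c", 99.0)])
print(pools)  # {(0, 5): ['frag_a'], (90, 95): ['frag_b']}
```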
Specifically, in step S3, the comparing of the fragment pools corresponding to the sequence similarity of each range with the sequence name total nucleic acid library to obtain a second comparison result includes:
Based on a configured second comprehensive threshold, comparing the fragment pools corresponding to the sequence similarity of each range with the sequence name total nucleic acid library, and removing sequences whose sequence similarity to other species is higher than a preset value, to obtain a second comparison result; the configured second comprehensive threshold is the set of parameters and indices configured for the Blastn comparison;
wherein the second comprehensive threshold comprises: an alignment length of 100%, a sequence similarity higher than 95%, and an alignment score not lower than 1e-5.
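The removal of fragments that also match other species can be sketched as a filter over alignment hits. This is only an illustration under stated assumptions: the `Hit` record, its field names and the species names are hypothetical, a "match" is read as full-length coverage with similarity above 95%, and the alignment-score condition (1e-5) is left out of the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    fragment_id: str
    species: str     # species of the library sequence that was hit
    identity: float  # sequence similarity, in percent
    coverage: float  # share of the fragment that aligned, in percent


def species_specific_fragments(fragment_ids, hits, target_species, identity_cutoff=95.0):
    """Keep fragments that have no full-length, high-similarity hit on another species."""
    rejected = {
        h.fragment_id
        for h in hits
        if h.species != target_species
        and h.coverage == 100.0
        and h.identity > identity_cutoff
    }
    return [f for f in fragment_ids if f not in rejected]


hits = [
    Hit("f1", "Escherichia coli", 100.0, 100.0),
    Hit("f2", "Shigella flexneri", 98.0, 100.0),  # too similar to another species
    Hit("f3", "Shigella flexneri", 90.0, 100.0),
]
print(species_specific_fragments(["f1", "f2", "f3"], hits, "Escherichia coli"))  # ['f1', 'f3']
```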
Through step S3, sequences on the target species that are highly similar to sequences of other species are removed, that is, non-characteristic tags that cannot serve as identification tags are removed, which effectively increases the range and accuracy of inter-species identification.
Step S4: performing automatic quality control on the second comparison result based on the configured reject sequence control index to obtain a species characteristic tag, and comparing the species characteristic tag with a name fragment tag library to obtain a first identification tag and a first identification threshold of the target species. In the present invention, the identification tag of a species refers to a gene fragment sequence or sequence combination that is unique to the species and can be used to identify it. By combining the first identification tag with the first identification threshold, the invention breaks through the traditional method of identifying species based on conserved sequences, as well as the 97% method of species identification, so that the range and accuracy of intra-species and inter-species identification are effectively increased and the search for a species-specific identification tag can be completed rapidly, accurately and efficiently.
Specifically, in step S4, the automatic quality control of the second comparison result based on the configured culling sequence control index is performed to obtain a species characteristic tag, and the species characteristic tag is compared with a name fragment tag library to obtain a first identification tag and a first identification threshold of the target species, which includes:
Performing automatic quality control on the second comparison result based on the configured reject sequence control index, wherein the reject sequence control index for automatic quality control comprises: a GC content higher than 70%, and multiple bases arranged consecutively. The GC content, also called the G+C ratio or GC ratio, is the proportion of guanine (G) and cytosine (C) in the sequence, calculated as: GC content = (total number of G + C) / (total number of A + T + C + G). The multiple consecutive base arrangement may be an arrangement of 8 or more consecutive bases;
Acquiring a species characteristic label, a distinguishing threshold value and a K-mer value;
Based on the configured rejection index, comparing the species characteristic tag with the name fragment tag library, and rejecting repeated fragments and positions where the sequence similarity is higher than a preset value, to obtain a first identification tag and a first identification threshold of the target species; the preset value may be 95%;
The rejection index uses the following as a third comprehensive threshold: an alignment length of 100%, a sequence similarity not higher than 95%, and an alignment score not lower than 1e-5. The configured rejection index and the third comprehensive threshold are the parameters and indices configured for the Blastn comparison.
In step S4, repeated fragments and high-similarity sequence positions are rejected, which eliminates cases in which approximate sequences of a characteristic tag occur at multiple positions within the same species, and yields the first identification tag and the first identification threshold of the target species.
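The automatic quality-control index of step S4 (GC content above 70%, or a run of 8 or more consecutive bases) can be sketched as below. The sketch assumes that "consecutive bases" means a run of the same base repeated 8 or more times, which the text does not state explicitly; the function names are hypothetical.

```python
import re

GC_LIMIT = 0.70  # fragments with a GC content above this are rejected
RUN_LENGTH = 8   # assumption: a run means the same base repeated this many times or more
_RUN_PATTERN = re.compile(r"(.)\1{%d,}" % (RUN_LENGTH - 1))


def gc_content(sequence: str) -> float:
    """(total number of G + C) / (total number of A + T + C + G)."""
    seq = sequence.upper()
    total = sum(seq.count(base) for base in "ATCG")
    if total == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / total


def has_long_run(sequence: str) -> bool:
    """True if any single base is repeated RUN_LENGTH or more times in a row."""
    return _RUN_PATTERN.search(sequence.upper()) is not None


def passes_quality_control(sequence: str) -> bool:
    """A fragment is kept only if it triggers neither rejection criterion."""
    return gc_content(sequence) <= GC_LIMIT and not has_long_run(sequence)


print(gc_content("ATGCATGCAT"))                   # 0.4
print(passes_quality_control("ATGCATGCAT"))       # True
print(passes_quality_control("ACGT" + "A" * 8))   # False: run of 8 identical bases
```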
Specifically, after step S4, the method further includes:
The first identification tag from step S4 is taken as the next nucleic acid sequence, and steps S2 to S4 are repeated to obtain a second identification tag and a second identification threshold of the target species. The first identification tag and the first identification threshold obtained in step S4 are a coarse, preliminary tag and threshold, obtained after the species information of the target species has passed through steps S2 to S4 once. To obtain a more accurate identification tag and identification threshold, the first identification tag is used as the next nucleic acid sequence and steps S2 to S4 are repeated, yielding the accurate identification tag and threshold, namely the second identification tag and the second identification threshold.
The second identification tag obtained in this way can be detected more accurately, and species classification and statistics based on the detection results can be completed more efficiently.
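The repetition of steps S2 to S4 amounts to feeding the tag from one pass back in as the input sequence of the next pass. A minimal sketch, assuming the whole S2 to S4 pipeline is available as a single callable (the name `run_s2_to_s4` and the `(tag, threshold)` return shape are hypothetical):

```python
from typing import Callable, List, Tuple

# (identification tag sequence, identification threshold); an assumed shape
TagResult = Tuple[str, float]


def refine_tag(sequence: str,
               run_s2_to_s4: Callable[[str], TagResult],
               rounds: int = 2) -> List[TagResult]:
    """Run the pipeline repeatedly, using each tag as the next input sequence."""
    results = []
    current = sequence
    for _ in range(rounds):
        tag, threshold = run_s2_to_s4(current)
        results.append((tag, threshold))
        current = tag  # the tag becomes the "next nucleic acid sequence"
    return results
```

With `rounds=2`, the first element of the result corresponds to the first identification tag and threshold, and the second element to the second identification tag and threshold.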
An embodiment of the invention provides a method for searching for a species gene identification tag, a flowchart of which is shown in Fig. 2. The method comprises the following steps:
First step: constructing the sequence name total nucleic acid library;
Second step: searching and comparing;
Third step: acquiring two or three items of information about the species;
Fourth step: forming a fragment tag pool (a species sequence pool) of the genus to which the species belongs;
Fifth step: forming a fragment pool for each K value (K = 30, 31, …, 300);
Sixth step: comparing the fragment pool for each K value (the result of the fifth step) with the fragment tag pool of the genus of the target species (the result of the fourth step);
Seventh step: classifying the comparison results to form fragment pools, such as an IDF 0-5% K-mer fragment pool;
Eighth step: comparing the fragment pools with the sequence name total nucleic acid library using Blastn;
Ninth step: performing automatic quality control to obtain the species characteristic tags, the distinguishing thresholds, and the K-mer values;
Tenth step: comparing the obtained characteristic tags with the name fragment tag library to obtain the species identification tag and the identification threshold.
In addition, an eleventh step: repeating the fifth to tenth steps to obtain a more accurate species identification tag and identification threshold.
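The fifth step, forming one fragment pool per K value, can be illustrated as follows. This is a sketch under assumptions rather than the method's actual implementation: the function name `kmer_pools` is hypothetical, each pool is modeled as a set of distinct fragments, and K is capped at the sequence length, in line with the short local sequence case of claim 3.

```python
def kmer_pools(sequence, k_start=30, k_end=300, step=1):
    """Return {K: set of all distinct K-length fragments of the sequence}."""
    sequence = sequence.upper()
    # A fragment cannot be longer than the sequence itself.
    k_max = min(k_end, len(sequence))
    pools = {}
    for k in range(k_start, k_max + 1, step):
        pools[k] = {sequence[i:i + k] for i in range(len(sequence) - k + 1)}
    return pools
```

Holding every pool in memory is only practical for short sequences; for a whole genome, a real pipeline would more plausibly stream fragments to disk or to the comparison tool one K value at a time.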
The invention uses the K-mer method to find a gene fragment sequence, or combination of sequences, unique to a species and uses it as the identification tag of that species, so that searching for, detecting, classifying, and counting species identification tags can be completed more efficiently. Identification tags can be searched for a single species or for multiple species simultaneously, and once the target species or sequence to be searched is input, analysis and calculation run fully automatically, eliminating many steps that would otherwise require manual work.
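The idea of a fragment "unique to a species" can be shown with a deliberately simplified example. The method itself compares fragments with Blastn under similarity thresholds; the sketch below swaps in exact string matching and a single K value, purely for illustration, and the function names are hypothetical.

```python
def kmers(seq, k):
    """All distinct K-length fragments of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def candidate_tags(target, relatives, k):
    """Fragments of the target that occur in none of the related sequences.

    Exact matching stands in for the similarity-based Blastn comparison.
    """
    seen_elsewhere = set()
    for rel in relatives:
        seen_elsewhere |= kmers(rel, k)
    return kmers(target, k) - seen_elsewhere
```

For a target `"AAACCCGGG"`, one related sequence `"AAACCCTTT"`, and `k=4`, the shared prefix contributes no candidates and only the fragments overlapping the differing tail remain.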
From the above description, it can be seen that the following technical effects are achieved:
First, the K-mer method is used to find a gene fragment sequence or combination of sequences unique to a species as the identification tag of that species, so that searching for, detecting, classifying, and counting species identification tags is achieved more efficiently;
Second, the search for a species-specific identification tag can be completed rapidly, accurately, and efficiently;
Third, identification tags can be searched for a single species or for multiple species simultaneously;
Fourth, the search can be based on a local sequence of the species or on the whole genome sequence;
Fifth, combining the identification tag with the identification threshold goes beyond the traditional method of identifying species based on conserved sequences, and beyond the 97% species identification method, effectively increasing the range and accuracy of intra-species and inter-species identification;
Sixth, once the target species or sequence to be searched is input, the invention analyzes and calculates automatically, eliminating many steps that would otherwise require manual work;
Seventh, the obtained gene identification tag can be detected accurately, and species classification and statistics can be analyzed more efficiently based on the detection results.
It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system, for example as a set of computer-executable instructions, and that although a logical order is illustrated in the flowcharts, in some cases the steps illustrated or described may be performed in a different order.
An embodiment of the invention also provides a device for searching for a species gene identification tag, which is used to implement the above searching method. As shown in Fig. 3, the device comprises:
a search and alignment unit 31, configured to receive an input target species, and search and compare the target species in the sequence name total nucleic acid library to obtain species information of the target species, where the species information includes a nucleic acid sequence, a species classification status, and a name;
an alignment unit 32, configured to perform comparison according to the nucleic acid sequence and the species classification status and name, to obtain a first comparison result;
a Blastn comparison unit 33, configured to classify the first comparison result according to the range in which the sequence similarity falls, form a fragment pool for each range of sequence similarity, and perform Blastn comparison between the fragment pool for each range of sequence similarity and the sequence name total nucleic acid library to obtain a second comparison result;
a searching unit 34, configured to perform automatic quality control on the second comparison result based on the configured rejection sequence control index to obtain a species characteristic tag, and to compare the species characteristic tag with the name fragment tag library to obtain a first identification tag and a first identification threshold of the target species.
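The automatic quality control performed by the searching unit can be sketched from the criteria named in claim 5: a GC content higher than 70%, and consecutively arranged bases. The code below is an illustration under assumptions only. It interprets "consecutively arranged bases" as a run of one repeated base, and the run-length limit `max_run` is a hypothetical parameter, since the text gives no cutoff.

```python
import re


def gc_fraction(seq):
    """Fraction of G and C bases in the sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def longest_run(seq):
    """Length of the longest run of one repeated base."""
    runs = re.finditer(r"(.)\1*", seq.upper())
    return max((len(m.group(0)) for m in runs), default=0)


def passes_qc(seq, max_gc=0.70, max_run=6):
    """Reject candidates with GC content above 70% or an overlong single-base run.

    max_run is an assumed value; the source does not specify one.
    """
    return gc_fraction(seq) <= max_gc and longest_run(seq) <= max_run
```

Candidates for which `passes_qc` returns `False` would be dropped before the remaining species characteristic tags are compared with the name fragment tag library.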
An embodiment of the invention also provides an electronic device. As shown in Fig. 4, the electronic device includes one or more processors 41 and a memory 42; one processor 41 is taken as an example in Fig. 4.
The electronic device may further include an input device 43 and an output device 44.
The processor 41, the memory 42, the input device 43, and the output device 44 may be connected by a bus or in another manner; connection by a bus is taken as an example in Fig. 4.
The processor 41 may be a central processing unit (CPU). The processor 41 may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or a combination of the foregoing. The general-purpose processor may be a microprocessor or any conventional processor.
The memory 42, as a non-transitory computer-readable storage medium, may be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program instructions/modules corresponding to the searching method in embodiments of the present invention. By running the non-transitory software programs, instructions, and modules stored in the memory 42, the processor 41 executes the various functional applications and data processing of the server, that is, implements the method for searching for a species gene identification tag of the above method embodiment.
Memory 42 may include a storage program area that may store an operating system, at least one application program required for functionality, and a storage data area; the storage data area may store data created according to the use of a processing device operated by the server, or the like. In addition, memory 42 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, memory 42 may optionally include memory located remotely from processor 41, which may be connected to a network connection device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 43 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the processing device of the server. The output device 44 may include a display device such as a display screen.
One or more modules are stored in memory 42 that, when executed by one or more processors 41, perform the method illustrated in fig. 1.
Those skilled in the art will appreciate that all or part of the processes of the above method embodiments may be implemented by a computer program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, may include the processes of the above method embodiments. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), a flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); the storage medium may also comprise a combination of the above kinds of memory.
Although embodiments of the present invention have been described in connection with the accompanying drawings, various modifications and variations may be made by those skilled in the art without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope of the invention as defined by the appended claims.

Claims (9)

1. A method for searching for a species gene identification tag, comprising:
S1: receiving an input target species, and searching and comparing the target species in a sequence name total nucleic acid library to obtain species information of the target species, wherein the species information comprises a nucleic acid sequence, a species classification status and a name;
S2: performing comparison according to the nucleic acid sequence and the species classification status and name to obtain a first comparison result;
extracting, according to the species classification status and name, the sequences and names of the genus to which the target species belongs from the sequence name total nucleic acid library to form a sequence pool of the genus of the target species;
performing K-mer segmentation on the nucleic acid sequence to form a fragment pool for each K value, and extracting a name fragment tag library from the fragment pools for the K values;
comparing, based on a configured first comprehensive threshold, the fragment pool for each K value with the sequence pool of the genus of the target species to obtain the first comparison result;
wherein the first comprehensive threshold comprises: an alignment length of 100%, a sequence similarity not higher than 95%, and an alignment score not lower than 1e-5;
S3: classifying the first comparison result according to the range in which the sequence similarity falls to form a fragment pool for each range of sequence similarity, and comparing the fragment pool for each range of sequence similarity with the sequence name total nucleic acid library to obtain a second comparison result;
S4: performing automatic quality control on the second comparison result based on a configured rejection sequence control index to obtain a species characteristic tag, and comparing the species characteristic tag with the name fragment tag library to obtain a first identification tag and a first identification threshold of the target species.
2. The method of claim 1, wherein prior to the receiving the input target species, the method further comprises:
obtaining species whole gene DNA and RNA sequences, the species whole gene DNA and RNA sequences comprising an NT database;
converting the RNA sequences into DNA sequences by reverse transcription, and constructing the sequence name total nucleic acid library.
3. The method of claim 2, wherein the K-mer segmentation of the nucleic acid sequence comprises:
if the nucleic acid sequence is a whole genome sequence, performing K-mer segmentation on the whole genome sequence, wherein K has an initial value of 30, a step size of 1, and an end value of 300;
if the nucleic acid sequence is a local sequence and the length of the local sequence is greater than or equal to 300, performing K-mer segmentation on the local sequence, wherein K has an initial value of 30, a step size of 1, and an end value of 300;
if the nucleic acid sequence is a local sequence and the length of the local sequence is less than 300, performing K-mer segmentation on the local sequence, wherein K has an initial value of 30, a step size of 1, and an end value equal to the length of the local sequence.
4. The method of claim 1, wherein comparing the fragment pool for each range of sequence similarity with the sequence name total nucleic acid library to obtain a second comparison result comprises:
comparing, based on a configured second comprehensive threshold, the fragment pool for each range of sequence similarity with the sequence name total nucleic acid library, and removing sequences whose sequence similarity on other species is higher than a preset value, to obtain the second comparison result;
wherein the second comprehensive threshold comprises: an alignment length of 100%, a sequence similarity higher than 95%, and an alignment score not lower than 1e-5.
5. The method of claim 1, wherein performing automatic quality control on the second comparison result based on the configured rejection sequence control index to obtain a species characteristic tag, and performing Blastn comparison between the species characteristic tag and the name fragment tag library to obtain a first identification tag and a first identification threshold of the target species, comprises:
performing automatic quality control on the second comparison result based on the configured rejection sequence control index, wherein the rejection sequence control index for automatic quality control comprises: a GC content higher than 70%, and consecutively arranged bases;
acquiring the species characteristic tag, a distinguishing threshold, and a K-mer value;
comparing, based on a configured rejection index, the species characteristic tag with the name fragment tag library, and rejecting repeated fragments and positions where the sequence similarity is higher than a preset value, to obtain the first identification tag and the first identification threshold of the target species;
wherein the rejection index uses, as a third comprehensive threshold, an alignment length of 100%, a sequence similarity not higher than 95%, and an alignment score not lower than 1e-5.
6. The method according to claim 1, wherein the method further comprises:
taking the first identification tag from step S4 as the next nucleic acid sequence, and repeating steps S2 to S4 to obtain a second identification tag and a second identification threshold of the target species.
7. A device for searching for a species gene identification tag, comprising:
a searching and comparing unit, configured to receive an input target species, and to search and compare the target species in a sequence name total nucleic acid library to obtain species information of the target species, wherein the species information comprises a nucleic acid sequence, a species classification status and a name;
a comparison unit, configured to perform comparison according to the nucleic acid sequence and the species classification status and name to obtain a first comparison result;
extracting, according to the species classification status and name, the sequences and names of the genus to which the target species belongs from the sequence name total nucleic acid library to form a sequence pool of the genus of the target species;
performing K-mer segmentation on the nucleic acid sequence to form a fragment pool for each K value, and extracting a name fragment tag library from the fragment pools for the K values;
comparing, based on a configured first comprehensive threshold, the fragment pool for each K value with the sequence pool of the genus of the target species to obtain the first comparison result;
wherein the first comprehensive threshold comprises: an alignment length of 100%, a sequence similarity not higher than 95%, and an alignment score not lower than 1e-5;
a Blastn comparison unit, configured to classify the first comparison result according to the range in which the sequence similarity falls to form a fragment pool for each range of sequence similarity, and to perform Blastn comparison between the fragment pool for each range of sequence similarity and the sequence name total nucleic acid library to obtain a second comparison result;
a searching unit, configured to perform automatic quality control on the second comparison result based on a configured rejection sequence control index to obtain a species characteristic tag, and to compare the species characteristic tag with the name fragment tag library to obtain a first identification tag and a first identification threshold of the target species.
8. A computer-readable storage medium storing computer instructions for causing a computer to perform the method of searching for a species gene identification tag according to any one of claims 1 to 6.
9. An electronic device, the electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores a computer program executable by the at least one processor, the computer program being executable by the at least one processor to cause the at least one processor to perform the method of searching for a species gene identification tag according to any one of claims 1 to 6.
CN202110901123.2A 2021-08-06 2021-08-06 Searching method and device for species gene identification tag and electronic equipment Active CN113764041B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110901123.2A CN113764041B (en) 2021-08-06 2021-08-06 Searching method and device for species gene identification tag and electronic equipment

Publications (2)

Publication Number Publication Date
CN113764041A CN113764041A (en) 2021-12-07
CN113764041B true CN113764041B (en) 2024-04-23

Family

ID=78788600

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110901123.2A Active CN113764041B (en) 2021-08-06 2021-08-06 Searching method and device for species gene identification tag and electronic equipment

Country Status (1)

Country Link
CN (1) CN113764041B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5968784A (en) * 1997-01-15 1999-10-19 Chugai Pharmaceutical Co., Ltd. Method for analyzing quantitative expression of genes
CN108154010A (en) * 2017-12-26 2018-06-12 东莞博奥木华基因科技有限公司 A kind of ctDNA low frequencies mutation sequencing data analysis method and device
CN110894542A (en) * 2019-12-31 2020-03-20 扬州大学 Primer for identifying types of GS5 gene and GLW7 gene of rice and application of primer
CN111261223A (en) * 2020-01-12 2020-06-09 湖南大学 CRISPR off-target effect prediction method based on deep learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
12S rRNA、Cyt b基因标记在几类野生动物检材鉴定中的差异;黄娅琳;;四川农业大学学报;20170630(02);第138-144页 *

Also Published As

Publication number Publication date
CN113764041A (en) 2021-12-07

Similar Documents

Publication Publication Date Title
CN105886616B (en) Efficient specific sgRNA recognition site guide sequence for pig gene editing and screening method thereof
CN108197434B (en) Method for removing human gene sequence in metagenome sequencing data
US9218450B2 (en) Accurate and fast mapping of reads to genome
US20180018422A1 (en) Systems and methods for nucleic acid-based identification
CN110692101A (en) Method for aligning targeted nucleic acid sequencing data
US10658069B2 (en) Biological sequence variant characterization
US11062790B2 (en) Method for thoroughly designing valid and ranked primers for genome-scale DNA sequence database
KR20140006846A (en) Data analysis of dna sequences
CA3005791A1 (en) Methods for detecting copy-number variations in next-generation sequencing
CN115719616B (en) Screening method and system for pathogen species specific sequences
CN107506614B (en) Bacterial ncRNA prediction method
US20210141833A1 (en) Optimizing k-mer databases by k-mer subtraction
CN107563148B (en) Ion index-based integral protein identification method and system
CN113096737B (en) Method and system for automatically analyzing pathogen type
CN113764041B (en) Searching method and device for species gene identification tag and electronic equipment
CN112259167A (en) Pathogen analysis method and device based on high-throughput sequencing and computer equipment
US20160103955A1 (en) Biological sequence tandem repeat characterization
CA2481905A1 (en) Mutation detection and identification
JP2008161056A (en) Dna sequence analyzer and method and program for analyzing dna sequence
CN110570908B (en) Sequencing sequence polymorphic identification method and device, storage medium and electronic equipment
CN107679365A (en) The method of surname is efficiently inferred based on Y chromosome molecular labeling
CN116168761B (en) Method and device for determining characteristic region of nucleic acid sequence, electronic equipment and storage medium
CN113284552B (en) Screening method and device for micro haplotypes
KR20190069929A (en) miRNA DATA ANALYSIS METHOD FOR SERVER
CN117059170A (en) Genomic protozoan pollutant detection method based on DNA bar code technology

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant