CN112259167B - Pathogen analysis method and device based on high-throughput sequencing and computer equipment - Google Patents

Pathogen analysis method and device based on high-throughput sequencing and computer equipment

Info

Publication number
CN112259167B
Authority
CN
China
Prior art keywords
mers
sequence
genome
mer
pathogen
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011137959.1A
Other languages
Chinese (zh)
Other versions
CN112259167A (en
Inventor
于闯
张优劲
贺增泉
王今安
晋向前
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BGI Technology Solutions Co Ltd
Original Assignee
BGI Technology Solutions Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BGI Technology Solutions Co Ltd
Priority to CN202011137959.1A
Publication of CN112259167A
Application granted
Publication of CN112259167B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/20 Sequence assembly
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The invention provides a pathogen analysis method, a pathogen analysis device and computer equipment based on high-throughput sequencing, wherein the method comprises the steps of obtaining sequencing data of a sample to be analyzed, and segmenting each sequence of the sequencing data according to K-mers to obtain a plurality of K-mers; carrying out Hash calculation on the plurality of K-mers, and searching genome positions and weights corresponding to all the K-mers of each sequence in a pre-established genome Hash table according to the K-mers subjected to Hash calculation; calculating the total weight of each sequence according to the weights corresponding to all the K-mers of each sequence; and performing classification analysis according to the genome positions corresponding to all the K-mers of each sequence and the total weight of each sequence to determine the species attribute of the pathogen. The method can determine the species attribute only by determining the position and the weight of the K-mer of each sequence of the sample to be analyzed, and has high accuracy.

Description

Pathogen analysis method and device based on high-throughput sequencing and computer equipment
Technical Field
The invention relates to the technical field of data processing, in particular to a pathogen analysis method and device based on high-throughput sequencing, computer equipment and a readable storage medium.
Background
With the development of second-generation sequencing, a growing number of techniques based on it are used in medicine and scientific research, and pathogen detection is a typical application. A significant portion of human diseases are caused by invading external pathogens, so identifying the pathogen is critical to treating the condition and saving lives. Before second-generation sequencing matured, the most common approach was to determine from a routine blood examination whether the pathogen was a virus or a bacterium and then treat with antibiotics or interferons; this approach cannot effectively target the specific pathogen and easily leads to drug overuse and, in turn, drug resistance. A later approach was to culture tissue in the laboratory and then determine the pathogen, but it is time-consuming and depends heavily on manual work. Second-generation sequencing solves these problems: all genetic information in the tissue can be obtained by gene sequencing, the genome information of the pathogen can be accurately derived from it, and the pathogen type is finally determined by comparison against a known database.
However, the current high-throughput gene detection workflow for pathogens is as follows: collect a sample, extract and store its nucleic acid, obtain a fastq file of the gene sequences by high-throughput sequencing, align the sequences to their true positions in the human genome with bwa (sequence alignment software), then compare the sequences that have gone through the human-genome alignment against the sequence information of other species, and finally determine the pathogen species from the alignment results and generate an analysis report. This method requires repeated alignment against the human genome and the genomes of different species, which makes the analysis slow and the alignment inefficient.
Disclosure of Invention
In view of the above, the invention provides a pathogen analysis method, a pathogen analysis device, a pathogen analysis computer apparatus and a readable storage medium based on high-throughput sequencing to solve the technical problems of long analysis time and low comparison efficiency caused by the repeated comparison of human genomes and genomes of different species required by the current pathogen high-throughput gene detection method.
A pathogen analysis method based on high-throughput sequencing comprises the following steps:
obtaining sequencing data of a sample to be analyzed, and segmenting each sequence of the sequencing data according to K-mers to obtain a plurality of K-mers;
performing hash calculation on the plurality of K-mers, and searching in a pre-established genome hash table according to the K-mers subjected to the hash calculation to obtain genome positions and weights corresponding to all the K-mers of each sequence; the pre-established genome hash table is obtained by performing K-mer segmentation on a known sequence of a human genome and a known sequence of a pathogen genome, distributing weights to the K-mers, and mapping the weighted K-mers to the hash table by using a hash function; wherein the length of the K-mer is a positive integer;
calculating the total weight of each sequence according to the weights corresponding to all the K-mers of each sequence;
and performing classification analysis according to the genome positions corresponding to all the K-mers of each sequence and the total weight of each sequence to determine the species attribute of the pathogen.
Alternatively,
the establishing step of the genome hash table comprises the following steps:
obtaining the sequence of a known human genome and the sequence of a pathogen genome;
selecting K-mers, and segmenting the sequence of the human genome and the sequence of the pathogen according to the K-mers to obtain a plurality of K-mers;
performing statistical analysis on a plurality of the K-mers and assigning a weight to each K-mer;
mapping each K-mer after the weight is distributed to a hash table by using a hash function, and establishing the genome hash table, wherein the key value of the genome hash table is a K-mer sequence, and the value of the genome hash table is the genome position, the species attribute and the weight corresponding to the K-mer.
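The establishing steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: a Python `dict` stands in for the hash table, the names `Entry` and `build_genome_table` are hypothetical, and the weighting function passed in the example is a placeholder rather than the weighting rule of the invention.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Entry:
    """Value stored per K-mer: where it occurs, in which species, and its weight."""
    hits: list = field(default_factory=list)  # (species, genome_position) pairs
    weight: float = 0.0


def build_genome_table(genomes, k, weight_fn):
    """Cut every reference sequence into K-mers and index them in a hash table.

    genomes   -- dict mapping a species label to its reference sequence
    k         -- K-mer length (positive integer)
    weight_fn -- hypothetical callable mapping a K-mer's hit list to a weight
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    table = defaultdict(Entry)
    for species, sequence in genomes.items():
        for position in range(len(sequence) - k + 1):
            table[sequence[position:position + k]].hits.append((species, position))
    # Weights are assigned only after all occurrences have been counted.
    for entry in table.values():
        entry.weight = weight_fn(entry.hits)
    return dict(table)


# Toy references (made-up sequences); placeholder weighting: rarer K-mers weigh more.
toy_table = build_genome_table(
    {"human": "ACGTAC", "pathogen": "GTACGG"},
    k=4,
    weight_fn=lambda hits: 1.0 / len(hits),
)
```

In this toy table the key "GTAC" carries two hits, one per species, which is the situation the weight assignment has to handle.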
Alternatively,
the step of performing statistical analysis on a plurality of the K-mers and assigning a weight to each K-mer comprises:
when a K-mer has no repetition among all K-mers, assigning a weight $w_i$ to the non-repeated K-mer;
when a K-mer is repeated multiple times among all K-mers and the repeated K-mers belong to the same species attribute, assigning the repeated K-mers of the same species a weight $w_i = w_i - n$, where $n$ is the number of repetitions;
when a K-mer is repeated multiple times among all K-mers and the repeated K-mers belong to different species attributes, assigning the K-mers repeated across different species a weight $w_i = (w_i - n_i)/k$, where $k$ denotes the number of species attributes, $n$ is the number of repetitions, and $n_i$ denotes the number of repetitions in the $i$-th species attribute, $i = 1, 2, \ldots, k$.
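One possible reading of the three weighting cases above is sketched below. Several points are assumptions made only for illustration: the text does not give the initial value of the weight, so it is passed in as `base_weight`; the "number of repetitions" is taken to be the number of occurrences; and for a K-mer shared across species the sketch returns one weight per species. The function name `assign_weights` is hypothetical.

```python
from collections import Counter


def assign_weights(species_hits, base_weight):
    """Return {species: weight} for one K-mer.

    species_hits -- species label of each occurrence of the K-mer in the references
    base_weight  -- assumed initial weight (not specified in the text)
    """
    counts = Counter(species_hits)
    n = len(species_hits)  # assumption: repetitions = number of occurrences
    if n == 1:
        # Case 1: the K-mer is unique, so it keeps the base weight.
        return {species_hits[0]: base_weight}
    if len(counts) == 1:
        # Case 2: repeated n times within a single species.
        (species,) = counts
        return {species: base_weight - n}
    # Case 3: repeated across k species, with n_i occurrences in species i.
    k = len(counts)
    return {species: (base_weight - n_i) / k for species, n_i in counts.items()}
```

For example, with an assumed base weight of 100, a K-mer seen twice in the human genome and once in a pathogen genome would receive (100 - 2)/2 = 49 for the human attribute and (100 - 1)/2 = 49.5 for the pathogen attribute under this reading.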
Alternatively,
the step of obtaining the genome positions and weights corresponding to all K-mers of each sequence comprises:
counting the K-mers of each sequence, wherein if the genome positions corresponding to two K-mers in any sequence are different and the two K-mers have a linear relation, the weight of the two K-mers is $w_i = w_i + w_i/2$.
Alternatively,
the step of obtaining the genomic positions and weights corresponding to all the K-mers of each sequence further comprises:
if the genome positions corresponding to a plurality of K-mers in any sequence are different and the plurality of K-mers have a linear relation, the weight of the plurality of K-mers is $w_i = w_i + w_i/2m$, wherein $m$ represents the number of K-mers having a linear relationship.
Alternatively,
the step of performing classification analysis according to the genome positions corresponding to all K-mers of each sequence and the total weight of each sequence to determine the species attribute of the pathogen comprises the following steps:
screening all K-mers of which the corresponding genome positions are not found in a genome hash table when the species attribute of the pathogen cannot be determined by analyzing the genome positions corresponding to all the K-mers of one sequence and the total weight of one sequence;
finding out all K-mers with preset Hamming distances to all the screened K-mers;
performing hash calculation on all K-mers with the Hamming distances being preset values, and searching in a pre-established genome hash table according to the K-mers after the hash calculation to obtain genome positions and weights corresponding to all K-mers with the Hamming distances being preset values;
calculating the total weight of all the K-mers with the Hamming distance as a preset value according to the weights corresponding to all the K-mers with the Hamming distance as a preset value;
and carrying out classification analysis on the genome positions corresponding to all the K-mers with the preset Hamming distances and the total weight of all the K-mers with the preset Hamming distances to determine the species attributes of the pathogens.
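The fallback described in the steps above can be sketched as follows. This is only an illustration under assumptions: the preset Hamming distance is taken to be 1 by default (the text does not state its value), the genome hash table is represented by a plain `dict` keyed by K-mer string, and the names `hamming_neighbors` and `rescue_lookup` are hypothetical.

```python
from itertools import combinations, product

ALPHABET = "ACGT"


def hamming_neighbors(kmer, distance=1):
    """Yield every K-mer whose Hamming distance from `kmer` is exactly `distance`."""
    for positions in combinations(range(len(kmer)), distance):
        # At each chosen position, try every base other than the original one.
        choices = [[b for b in ALPHABET if b != kmer[p]] for p in positions]
        for replacement in product(*choices):
            candidate = list(kmer)
            for p, base in zip(positions, replacement):
                candidate[p] = base
            yield "".join(candidate)


def rescue_lookup(missing_kmers, table, distance=1):
    """For K-mers absent from the table, look up their neighbours instead.

    Returns {missing_kmer: [(neighbor, table_value), ...]} for neighbours found.
    """
    found = {}
    for kmer in missing_kmers:
        for neighbor in hamming_neighbors(kmer, distance):
            if neighbor in table:
                found.setdefault(kmer, []).append((neighbor, table[neighbor]))
    return found
```

A 21-mer has 21 × 3 = 63 neighbours at distance 1, so the extra lookups stay cheap; the count grows quickly for larger distances, which is one reason to keep the preset value small.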
Alternatively,
the length of the K-mer is 21 bp.
A high-throughput sequencing-based pathogen analysis device, comprising:
the sample sequencing data acquisition module is used for acquiring sequencing data of a sample to be analyzed;
the sample K-mer obtaining module is used for segmenting each sequence of the sequencing data according to the K-mers to obtain a plurality of K-mers;
the position and weight obtaining module is used for carrying out Hash calculation on the plurality of K-mers and searching in a pre-established genome Hash table according to the K-mers subjected to Hash calculation to obtain genome positions and weights corresponding to all the K-mers of each sequence; the pre-established genome hash table is obtained by performing K-mer segmentation on a known sequence of a human genome and a known sequence of a pathogen genome, distributing weights to the K-mers and mapping the weighted K-mers to the hash table by using a hash function; wherein the length of the K-mer is a positive integer;
the sequence total weight calculation module is used for calculating the total weight of each sequence according to the weights corresponding to all the K-mers of each sequence;
and the species attribute determining module is used for performing classification analysis according to the genome positions corresponding to all the K-mers of each sequence and the total weight of each sequence to determine the species attribute of the pathogen.
A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program:
obtaining sequencing data of a sample to be analyzed, and segmenting each sequence of the sequencing data according to K-mers to obtain a plurality of K-mers;
performing Hash calculation on the plurality of K-mers, and searching in a pre-established genome Hash table according to the K-mers subjected to Hash calculation to obtain genome positions and weights corresponding to all the K-mers of each sequence; the pre-established genome hash table is obtained by performing K-mer segmentation on a known sequence of a human genome and a known sequence of a pathogen genome, distributing weights to the K-mers, and mapping the weighted K-mers to the hash table by using a hash function; wherein the length of the K-mer is a positive integer;
calculating the total weight of each sequence according to the weights corresponding to all the K-mers of each sequence;
and performing classification analysis according to the genome positions corresponding to all the K-mers of each sequence and the total weight of each sequence to determine the species attribute of the pathogen.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
obtaining sequencing data of a sample to be analyzed, and segmenting each sequence of the sequencing data according to K-mers to obtain a plurality of K-mers;
performing hash calculation on the plurality of K-mers, and searching in a pre-established genome hash table according to the K-mers subjected to the hash calculation to obtain genome positions and weights corresponding to all the K-mers of each sequence; the pre-established genome hash table is obtained by performing K-mer segmentation on a known sequence of a human genome and a known sequence of a pathogen genome, distributing weights to the K-mers, and mapping the weighted K-mers to the hash table by using a hash function; wherein the length of the K-mer is a positive integer;
calculating the total weight of each sequence according to the weights corresponding to all the K-mers of each sequence;
and performing classification analysis according to the genome positions corresponding to all the K-mers of each sequence and the total weight of each sequence to determine the species attribute of the pathogen.
According to the pathogen analysis method and device based on high-throughput sequencing, the computer equipment and the readable storage medium, sequencing data of a sample to be analyzed are firstly obtained, and each sequence of the sequencing data is segmented according to K-mers to obtain a plurality of K-mers; performing Hash calculation on the plurality of K-mers, and searching in a pre-established genome Hash table according to the K-mers subjected to Hash calculation to obtain genome positions and weights corresponding to all the K-mers of each sequence; the pre-established genome hash table is obtained by performing K-mer segmentation on a known sequence of a human genome and a known sequence of a pathogen genome, distributing weights to the K-mers, and mapping the weighted K-mers to the hash table by using a hash function; calculating the total weight of each sequence according to the weights corresponding to all the K-mers of each sequence; and performing classification analysis according to the genome positions corresponding to all the K-mers of each sequence and the total weight of each sequence to determine the species attribute of the pathogen. 
The analysis method comprises the steps of preprocessing a known biological genome sequence (comprising K-mer segmentation, weight distribution and Hash operation) to obtain a genome Hash table, when a sample to be analyzed needs to be analyzed, carrying out K-mer segmentation on sequencing data of the sample, carrying out Hash operation on the obtained K-mer, then directly carrying out query on the genome Hash table to find a genome position and a weight corresponding to the K-mer, determining the genome position and the weight corresponding to each sequence of the sample according to the genome position and the weight corresponding to the K-mer, and then determining the species attribute of the sequence according to the genome position and the weight corresponding to each sequence; the K-mers of the same species have the same or similar properties, so that the species attribute can be determined only by determining the position and the weight of the K-mer of each sequence of the sample to be analyzed, the accuracy is high, and the comparison performance is greatly improved; in addition, each sequence of the sample does not need to be accurately compared, so that the analysis time is saved, and the analysis efficiency is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 is a schematic diagram of an application environment of a high throughput sequencing-based pathogen analysis method in an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a pathogen analysis method based on high throughput sequencing according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a K-mer in an example of the present invention;
FIG. 4 is a schematic diagram illustrating a sample to be analyzed being partitioned and searched in a pre-established genome hash table according to an embodiment of the present invention;
FIG. 5 is a schematic illustration of determining a species attribute in an embodiment of the present invention;
FIG. 6 is a schematic diagram illustrating a process of creating a genome hash table according to an embodiment of the present invention;
FIG. 7 is a diagram illustrating the structure of the establishment of a genome hash table according to an embodiment of the present invention;
FIG. 8 is a schematic flow chart of a method for analyzing pathogens based on high throughput sequencing according to an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of a pathogen analysis apparatus based on high throughput sequencing according to an embodiment of the present invention;
FIG. 10 is a schematic structural diagram of a computer device in an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Explanation of terms:
1. High-throughput sequencing: also known as "next-generation" sequencing, it is characterized by sequencing hundreds of thousands to millions of DNA molecules in parallel in a single run, generally with short read lengths. Sequencing means determining the base sequence of a specific DNA fragment, i.e., the order of adenine (A), thymine (T), cytosine (C) and guanine (G). The advent of rapid DNA sequencing methods has greatly facilitated biological and medical research and discovery.
2. Hash function: a hash function is a method for creating a small digital "fingerprint" from any kind of data. It compresses a message or data into a digest, reducing the amount of data and fixing its format. The function scrambles the data and creates a fingerprint called a hash value (hash sum, or hash). The hash value is typically represented by a short string of seemingly random letters and numbers. A good hash function rarely produces hash collisions within its input domain. In hash tables and data processing, failing to suppress collisions when distinguishing data makes database records harder to find.
3. Sequence alignment: the arrangement of two or more sequences together to indicate their similarity. Gaps (generally written as dashes "-") may be inserted in the sequences so that identical or similar symbols (A, T (or U), C, G in nucleic acids; one-letter codes for amino acid residues in proteins) line up in the same column. This method is commonly used to study sequences that evolved from a common ancestor, particularly biological sequences such as protein or DNA sequences. In an alignment, mismatches correspond to mutations, while gaps correspond to insertions or deletions. Sequence alignment may also be used in studies of language evolution or of similarity between texts.
4. Bioinformatics: (English: Bioinformatics) studies biological problems using methods that apply mathematics, informatics, statistics, and computer science. The research materials and results of bioinformatics are various biological data, the research tools are computers, and the research methods comprise searching (collecting and screening), processing (editing, sorting, managing and displaying) and utilizing (calculating and simulating) the biological data. The main research directions at present are: sequence alignment, sequence assembly, gene identification, gene recombination, prediction of protein structure, gene expression, prediction of protein response, and creation of evolutionary models.
5. Pathogen detection: high-throughput sequencing is performed directly on the infected specimen, species information on suspected pathogenic microorganisms is obtained by comparison against a dedicated microbial database and intelligent algorithm analysis, and a comprehensive, in-depth report is provided, giving a rapid and accurate diagnostic basis for difficult and severe infections and promoting the rational use of antibiotics. Depending on the type of nucleic acid extracted, it is divided into a DNA detection process and an RNA detection process. The DNA detection process is suitable for detecting intracellular bacteria such as Mycobacterium tuberculosis and thick-walled fungi such as Cryptococcus. The RNA detection process is suitable for detecting RNA viruses such as influenza virus, respiratory syncytial virus and coronavirus.
6. Genome: in the fields of molecular biology and genetics, the genome is the sum of all the genetic material of an organism. These genetic materials include DNA or RNA (viral RNA). The genome comprises coding and non-coding DNA, mitochondrial DNA and chloroplast DNA.
7. K-mer: in bioinformatics, a K-mer is a substring of length k of a known sequence and is mainly used in genome computation and sequence analysis. K-mers are derived from sequences produced by sequencing, consist of nucleotides (such as A, T, C, G), and are often used in genome assembly and alignment. In general, the K-mers of a sequence are the set of all length-k subsequences of that sequence.
8. Statistical classification: statistical classification is an important component of machine learning; its goal is to determine, based on certain characteristics of known samples, which known class a new sample belongs to. Classification is an example of supervised learning, in which discriminant functions for classifying samples are built by computationally selecting feature parameters from the samples of a known training set.
9. Cluster analysis: also known as clustering, it is a technique for statistical data analysis that is widely used in many fields, including machine learning, data mining, pattern recognition, image analysis, and bioinformatics. Clustering divides similar objects into different groups or subsets by static classification, so that the members of the same subset share similar attributes, which commonly means a shorter spatial distance in a coordinate system.
10. Hamming distance: in the information theory, the Hamming distance (Hamming distance) between two equal-length character strings is the number of different characters at the corresponding positions of the two character strings. In other words, it is the number of characters that need to be replaced to convert one string into another.
11. Gene mutation: in biology, a mutation is an alteration of the genetic material in a cell, usually the deoxyribonucleic acid in the cell nucleus. It includes point mutations caused by single-base changes as well as deletions, duplications and insertions of multiple bases. Causes include errors in copying the genetic material during cell division and the effects of chemicals, genotoxic agents, radiation or viruses.
12. Point mutation: (point mutation) is a type of mutation that causes a single base nucleotide to be substituted for another nucleotide in the genetic material DNA or RNA. Generally, the term also includes insertions or deletions that act only on a single base pair.
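The Hamming distance defined in item 10 above can be computed directly from its definition. The following is a minimal sketch; the function name `hamming_distance` is chosen here for illustration.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance is only defined for equal-length strings")
    return sum(x != y for x, y in zip(a, b))
```

Two K-mers that differ by a single point substitution have a Hamming distance of 1, which is why this measure suits comparing a K-mer from a sample with K-mers from a reference genome.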
The method is applied to the terminal 102 in fig. 1, the terminal can be a personal computer, a notebook computer, etc., the terminal 102 is in communication connection with the gene sequencing equipment 104, and the gene sequencing equipment 104 can be a gene sequencer, etc.
When the terminal 102 is connected to the gene sequencing device 104 through the local interface, the gene sequencing device 104 may send sample data after sequencing to the terminal 102. In addition, the terminal 102 may also obtain sample data after the completion of the sequencing in the gene sequencing device 104 by an instruction.
In one embodiment, as shown in fig. 2, a high throughput sequencing-based pathogen analysis method is provided, which is illustrated by the example of the method applied to the terminal in fig. 1, and comprises the following steps:
step S202, obtaining sequencing data of a sample to be analyzed, and segmenting each sequence of the sequencing data according to K-mers to obtain a plurality of K-mers;
the sequencing data of the sample to be analyzed refers to data obtained by performing high-throughput sequencing on any sample needing pathogen analysis. The sequencing data consist of a number of sequences (i.e., a plurality), typically stored as a fastq file.
A mer, in the field of molecular biology, is a monomeric unit. For nucleic acid sequences it stands for nt or bp; for example, a 100-mer DNA is a sequence whose single strand is 100 nt long or whose double strand is 100 bp long. A K-mer is a string of K bases cut from a nucleic acid sequence: substrings of K bases are taken iteratively along a continuous nucleic acid sequence, so a sequence of length L yields L-K+1 K-mers of length K. As shown in fig. 3, for a sequence of length 21 and a selected K-mer length of 7, 21-7+1 = 15 7-mers are obtained.
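The segmentation just described amounts to sliding a window of K bases along the sequence one base at a time. A minimal sketch, with a hypothetical function name and a made-up 21-base read, is:

```python
def split_kmers(sequence: str, k: int) -> list:
    """Return every length-k substring of `sequence`, in order of position."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    # A sequence of length L yields L - k + 1 K-mers (none if L < k).
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


read = "ATCGATCGATCGATCGATCGA"  # 21 bases, illustrative only
kmers = split_kmers(read, 7)
print(len(kmers))  # 15
```

The count printed matches the 21-7+1 = 15 of the example in the text.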
In addition, K in the K-mer is a positive integer. The value of K is not fixed and can in theory be any positive integer; in practice, however, it should be neither too small nor too large. When K is too small, the resulting K-mers are too short and species certainty is poor (the shorter the K-mer, the less effective information it carries and the more likely it is to occur in multiple species). When K is too large, the resulting K-mers are long and species certainty is good, but later analysis and comparison of such long sequences is inefficient. The value of K is therefore critical; it can usually be selected with a probability model and typically lies between the low teens and the twenties.
In an alternative embodiment, the length of the K-mer is 21 bp.
Specifically, pathogen analysis requires the sequence of the human genome, whose mutation rate is about one in a thousand. A length of 21 bp, calculated according to the classical probability model, is taken as the K-mer length; using K-mers of this length can further improve the accuracy of the later comparison.
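As a rough illustration of why very short K-mers carry little species information, one can compare the number of distinct K-mers, 4^K, with the size of the human genome. This is not the probability calculation referred to in the text, which is not reproduced here; the genome size of roughly 3.1 billion bases and the function name `kmer_space` are assumptions made for this sketch.

```python
# Assumption: approximate size of the human genome in bases.
HUMAN_GENOME_BASES = 3_100_000_000


def kmer_space(k: int) -> int:
    """Number of distinct DNA strings of length k over the alphabet A, C, G, T."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    return 4 ** k


for k in (11, 16, 21):
    ratio = kmer_space(k) / HUMAN_GENOME_BASES
    print(f"K={k}: {kmer_space(k)} possible K-mers, {ratio:.3g} per genome base")
```

At K=11 there are far fewer possible K-mers than positions in the genome, so nearly every 11-mer recurs by chance. At K=21 the space of possible K-mers exceeds the genome size by about three orders of magnitude, so a match is much more likely to be informative.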
Step S204, carrying out Hash calculation on a plurality of K-mers, and searching in a pre-established genome Hash table according to the K-mers subjected to Hash calculation to obtain genome positions and weights corresponding to all the K-mers of each sequence; the pre-established genome hash table is obtained by performing K-mer segmentation on a known sequence of a human genome and a known sequence of a pathogen genome, distributing weights to the K-mers and mapping the weighted K-mers to the hash table by using a hash function; wherein the length of the K-mer is a positive integer;
each sequence can be divided into a plurality of K-mers, and hash calculation and hash table lookup can be carried out sequence by sequence: all K-mers obtained from one sequence are hashed, and the pre-established genome hash table is then searched for the genome positions and weights corresponding to those K-mers. Because each K-mer is independent, hash calculation and table lookup can run in parallel; and because lookup goes through a hash table, its time complexity is low and lookup efficiency does not degrade as the database grows. The genome positions and weights of all K-mers in every sequence are obtained in the same way. Referring to fig. 4, the whole process essentially matches the sequencing data of the sample to the genomes (i.e., the known human genome and pathogen genomes).
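The hash-and-lookup step can be sketched as below. The text does not say which hash function is used, so the 2-bit-per-base packing in `encode_kmer` is an assumption chosen for illustration (for a fixed K it maps distinct K-mers to distinct integers); the table layout and the name `lookup_read` are likewise hypothetical.

```python
from typing import Optional

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_kmer(kmer: str) -> Optional[int]:
    """Pack a K-mer into an integer, 2 bits per base.

    Returns None when the K-mer contains a base other than A, C, G, T (e.g. N).
    """
    value = 0
    for base in kmer:
        code = CODE.get(base)
        if code is None:
            return None
        value = (value << 2) | code
    return value


def lookup_read(read: str, k: int, table: dict) -> list:
    """Return (offset_in_read, table_value) for every K-mer of the read found in the table."""
    hits = []
    for offset in range(len(read) - k + 1):
        key = encode_kmer(read[offset:offset + k])
        if key is not None and key in table:
            hits.append((offset, table[key]))
    return hits


# Toy table: one 3-mer with a made-up (species, genome position, weight) value.
toy = {encode_kmer("CGT"): ("pathogen", 5, 1.0)}
print(lookup_read("ACGTN", 3, toy))
```

Each iteration of the loop depends only on its own K-mer, so the lookups could be distributed across workers; the sketch keeps them sequential for clarity.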
In addition, the known human genome refers to the genome published in the human genome project; the known pathogen genome refers to the corresponding genome of all pathogens disclosed so far.
In addition, the value of K used when segmenting the sequencing data of the sample to be analyzed into K-mers is the same as the value used to build the pre-established genome hash table, so that the K-mers of the sample have the same length as the K-mers in the genome hash table, which facilitates later lookup and comparison.
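Step S204 can be sketched in Python as follows. This is a minimal illustration, not the implementation of this application: it assumes overlapping K-mers taken with a step of one base, uses a Python dict (which hashes its keys internally) as the genome hash table, and the names `Hit`, `split_kmers` and `lookup_read` are hypothetical.

```python
from typing import Dict, List, NamedTuple, Optional

K = 21  # K-mer length used in the embodiment above


class Hit(NamedTuple):
    species: str
    position: int
    weight: float


def split_kmers(sequence: str, k: int = K) -> List[str]:
    """Return every overlapping K-mer of the sequence, in order (assumed step of 1)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def lookup_read(
    sequence: str, table: Dict[str, List[Hit]], k: int = K
) -> List[Optional[List[Hit]]]:
    """Look up each K-mer of one sequence; None marks a K-mer absent from the table."""
    return [table.get(kmer) for kmer in split_kmers(sequence, k)]
```

Because each lookup depends only on its own K-mer, the list comprehension in `lookup_read` could be distributed over several workers without changing the result.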
Step S206, calculating the total weight of each sequence according to the weights corresponding to all the K-mers of each sequence;
Specifically, the total weight of each sequence is typically the sum of the weights corresponding to all the K-mers in the sequence.
Step S208, performing classification analysis according to the genome positions corresponding to all the K-mers of each sequence and the total weight of each sequence, and determining the species attribute of the pathogen.
After the genome positions corresponding to all the K-mers of each sequence and the total weight of each sequence are obtained, carrying out classification analysis according to the genome positions corresponding to all the K-mers of each sequence and the total weight of each sequence so as to determine the species attribute corresponding to each sequence, and finally determining the species attribute of the pathogen according to the species attributes corresponding to all the sequences. The classification analysis is usually a statistical analysis, and the specific process is shown in fig. 5.
In addition, the genomic position corresponding to the K-mer can be used to determine which species the sequence belongs to, and the total weight can be used to determine the likelihood that the sequence belongs to a species. It can be seen that the species attribute of the sequence can be substantially determined from these two parameters.
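Steps S206 and S208 can be sketched as follows. The statistical analysis of fig. 5 is not spelled out in the text, so the majority rule below (a species wins when it holds more than an assumed share `min_share` of the total weight) is only one possible reading, and all names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

# One looked-up K-mer hit: (species, genome_position, weight).
Hit = Tuple[str, int, float]


def total_weight(hits: Iterable[Hit]) -> float:
    """Total weight of a sequence: the sum of the weights of its K-mer hits."""
    return sum(weight for _, _, weight in hits)


def classify_read(hits: Iterable[Hit], min_share: float = 0.5) -> Optional[str]:
    """Return the species holding more than min_share of the weight, else None."""
    per_species: Dict[str, float] = defaultdict(float)
    for species, _, weight in hits:
        per_species[species] += weight
    total = sum(per_species.values())
    if total <= 0:
        return None  # no usable hits: species attribute undetermined
    best = max(per_species, key=per_species.get)
    return best if per_species[best] / total > min_share else None
```

A result of `None` corresponds to the case in which the species attribute of the sequence cannot be determined from the positions and total weight alone.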
The pathogen analysis method based on high-throughput sequencing comprises the steps of firstly obtaining sequencing data of a sample to be analyzed, and segmenting each sequence of the sequencing data according to K-mers to obtain a plurality of K-mers; performing Hash calculation on the plurality of K-mers, and searching in a pre-established genome Hash table according to the K-mers subjected to Hash calculation to obtain genome positions and weights corresponding to all the K-mers of each sequence; the pre-established genome hash table is obtained by performing K-mer segmentation on a known sequence of a human genome and a known sequence of a pathogen genome, distributing weights to the K-mers and mapping the weighted K-mers to the hash table by using a hash function; calculating the total weight of each sequence according to the weights corresponding to all the K-mers of each sequence; and performing classification analysis according to the genome positions corresponding to all the K-mers of each sequence and the total weight of each sequence to determine the species attribute of the pathogen. 
The analysis method comprises the steps of preprocessing a known biological genome sequence (comprising K-mer segmentation, weight distribution and Hash operation) to obtain a genome Hash table, when a sample to be analyzed needs to be analyzed, carrying out K-mer segmentation on sequencing data of the sample, carrying out Hash operation on the obtained K-mer, then directly carrying out query on the genome Hash table to find a genome position and a weight corresponding to the K-mer, determining the genome position and the weight corresponding to each sequence of the sample according to the genome position and the weight corresponding to the K-mer, and then determining the species attribute of the sequence according to the genome position and the weight corresponding to each sequence; because the K-mers of the same species have the same or similar properties, the species attribute can be determined only by determining the position and the weight of the K-mer of each sequence of the sample to be analyzed, the accuracy is high, and the comparison performance is greatly improved; in addition, each sequence of the sample does not need to be accurately compared, so that the analysis time is saved, and the analysis efficiency is improved.
In one embodiment, as shown in fig. 6, the establishing of the genome hash table includes:
step S602, obtaining the known human genome sequence and pathogen genome sequence;
step S604, selecting K-mers, and segmenting the sequences of the human genomes and the pathogens according to the K-mers to obtain a plurality of K-mers;
step S606, performing statistical analysis on a plurality of K-mers, and distributing weight to each K-mer;
step S608, mapping each K-mer after the weight is distributed to a hash table by using a hash function, and establishing a genome hash table, wherein a key value of the genome hash table is a K-mer sequence, and a value of the genome hash table is a genome position, a species attribute and a weight corresponding to the K-mer.
Specifically, the establishment of the genome hash table is essentially a process of preprocessing the sequence of a known human genome and the sequences of pathogen genomes. As shown in fig. 7, the specific process is as follows: first, the sequence of the human genome (i.e., Human Genome) and the sequences of the pathogens (i.e., Species A Genome to Species Z Genome) are acquired; the sequence of the human genome and the sequences of the pathogens are then segmented into a series of K-mers, and a weight is assigned to each K-mer; finally, a hash operation is performed on each K-mer to form a hash table, which stores the species attribute, genome position, and weight corresponding to each K-mer.
In addition, because of the similarity of genomes between closely related species, the same K-mer may exist in the genomes of several similar species, in which case the K-mer should be associated with more than one species attribute.
Preprocessing the genome information of organisms in this way makes full use of known genome information, and the preprocessing only needs to be carried out once for pathogen information to be determined quickly thereafter; in addition, the efficiency of the pathogen detection procedure is not reduced when the collection of pathogen genome sequences is expanded.
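The preprocessing of steps S602 to S608 can be sketched as follows. This is an illustrative outline under stated assumptions: genomes are passed as a dict from a species name to its sequence, K-mers overlap with a step of one base, the weighting rule is supplied by the caller as the hypothetical callback `weight_of`, and a Python dict stands in for the hash table.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Entry = Tuple[str, int, float]  # (species, genome_position, weight)


def build_genome_table(
    genomes: Dict[str, str],
    k: int,
    weight_of: Callable[[str, str, Dict[str, List[int]]], float],
) -> Dict[str, List[Entry]]:
    # S604: collect every K-mer occurrence as kmer -> species -> positions.
    occurrences: Dict[str, Dict[str, List[int]]] = defaultdict(
        lambda: defaultdict(list)
    )
    for species, genome in genomes.items():
        for pos in range(len(genome) - k + 1):
            occurrences[genome[pos:pos + k]][species].append(pos)

    # S606 and S608: assign a weight, then store the value under the K-mer key.
    table: Dict[str, List[Entry]] = {}
    for kmer, by_species in occurrences.items():
        table[kmer] = [
            (species, pos, weight_of(kmer, species, by_species))
            for species, positions in by_species.items()
            for pos in positions
        ]
    return table
```

A K-mer shared by closely related species simply ends up with several entries under the same key, one per species and position.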
In one embodiment, the step of statistically analyzing the plurality of K-mers and assigning a weight to each K-mer comprises:
when one K-mer has no repetition in all K-mers, assigning a weight $w_i$ to the K-mer without repetition;
when one K-mer is repeated multiple times among all K-mers and the repeated K-mers belong to one species attribute, assigning a weight $w_i = w_i - n$ to the repeated, same-species K-mer, wherein $n$ is the number of repetitions;
when one K-mer is repeated multiple times among all K-mers and the repeated K-mers belong to different species attributes, assigning a weight $w_i = (w_i - n_i)/k$ to the repeated K-mers of different species, wherein $k$ denotes the number of species attributes, $n$ is the number of repetitions, and $n_i$ denotes the number of repetitions in the $i$-th species attribute, $i = 1, 2, \ldots, k$.
Specifically, when assigning a weight to a K-mer, it is necessary to consider whether the K-mer is repeated, how many times it is repeated, and whether it belongs to different species. If the K-mer is not repeated, its weight is $w_i$, where $w_i$ is an integer representing the weight of the $i$-th K-mer; optionally, $w_i$ may be 10. If the K-mer is repeated $n$ times and is present in only one species, its weight is $w_i = 10 - n$. If the K-mer is repeated $n$ times across $k$ different species, with $n_1, n_2, \ldots, n_k$ repetitions in the respective species, its weight is $w_i = (10 - n_i)/k$.
In this embodiment, a K-mer that occurs only once is assigned a higher weight, while K-mers that occur repeatedly and exist in different species are assigned lower weights. The fewer times a K-mer occurs (in particular, when it occurs only once), the more accurate the result obtained when that K-mer is later used as a reference to detect pathogens in a sample to be analyzed; conversely, the accuracy decreases. Assigning different weights with these situations taken into account therefore further improves the accuracy of the analysis.
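The three weighting cases can be sketched as follows, using the optional base value of 10. Two points are assumptions of this sketch rather than statements of the text: the number of repetitions is read as the number of occurrences of the K-mer, and for a K-mer shared by several species the weight is computed per species from that species' own count. The function name is hypothetical.

```python
from typing import Dict

BASE_WEIGHT = 10  # optional value of w_i given in the text


def kmer_weights(
    counts_by_species: Dict[str, int], base: float = BASE_WEIGHT
) -> Dict[str, float]:
    """Map each species containing the K-mer to the K-mer's weight for that species.

    counts_by_species[s] is the number of times the K-mer occurs in species s.
    """
    k = len(counts_by_species)
    if k == 1:
        (species, n), = counts_by_species.items()
        # A K-mer seen once keeps the base weight; n occurrences give base - n.
        return {species: base if n == 1 else base - n}
    # Shared by k species: (base - n_i) / k for the i-th species.
    return {s: (base - n_s) / k for s, n_s in counts_by_species.items()}
```

For example, `kmer_weights({"A": 2, "B": 4})` yields 4.0 for species A and 3.0 for species B. The sketch does not clamp the result, since the text does not say what happens when a count exceeds the base value.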
In one embodiment, the step of obtaining the genomic positions and weights corresponding to all K-mers of each sequence comprises:
counting the K-mers of each sequence, and if the genome positions corresponding to two K-mers in any sequence are different and the two K-mers have a linear relation, the weight of the two K-mers is $w_i = w_i + w_i/2$.
In one embodiment, the step of obtaining the genomic positions and weights corresponding to all K-mers of each sequence further comprises:
if the genome positions corresponding to a plurality of K-mers in any sequence are different and the plurality of K-mers have a linear relationship, the weight of the plurality of K-mers is $w_i = w_i + w_i/2m$, wherein $m$ represents the number of K-mers having a linear relationship.
Specifically, after the species genome positions and weights corresponding to all the K-mers in a sequence are obtained, the K-mers of the same sequence are counted. If the genome positions corresponding to two K-mers are not repeated (i.e., not identical) and the two K-mers have a linear relationship, the basic weight of each of the two K-mers is increased to $w_i = w_i + w_i/2$; if $m$ K-mers have the same linear relationship, the weight is likewise increased, to $w_i = w_i + w_i/2m$. In this embodiment, the weights of the K-mers are adjusted according to a statistical analysis of the K-mers of the same sequence, so as to improve the accuracy of pathogen determination as far as possible.
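The two-K-mer adjustment can be sketched as follows. The text does not define the linear relationship, so this sketch assumes one reading: two hits of the same sequence are linearly related when they fall in the same species and their genome positions differ by exactly the same amount as their offsets within the sequence. Only the two-K-mer rule $w_i = w_i + w_i/2$ is implemented; groups of more than two related K-mers are left unchanged here. All names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (offset_in_sequence, species, genome_position, weight)
Hit = Tuple[int, str, int, float]


def boost_collinear_pairs(hits: List[Hit]) -> List[Hit]:
    """Raise the weight of hits that form a collinear pair to w + w / 2."""
    # Hits with equal (species, genome_position - offset) lie on one diagonal.
    groups: Dict[Tuple[str, int], List[int]] = defaultdict(list)
    for idx, (offset, species, pos, _) in enumerate(hits):
        groups[(species, pos - offset)].append(idx)

    boosted = list(hits)
    for members in groups.values():
        distinct_positions = {hits[i][2] for i in members}
        if len(members) == 2 and len(distinct_positions) == 2:
            for i in members:
                offset, species, pos, w = hits[i]
                boosted[i] = (offset, species, pos, w + w / 2)
    return boosted
```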
In one embodiment, as shown in fig. 8, the step of performing classification analysis according to the genomic positions corresponding to all K-mers of each sequence and the total weight of each sequence to determine the species attribute of the pathogen includes:
step S802, when the species attribute of the pathogen cannot be determined by analyzing the genome positions corresponding to all the K-mers of one sequence and the total weight of that sequence, screening out all K-mers of the sequence whose corresponding genome positions are not found in the genome hash table;
step S804, finding out all K-mers with preset Hamming distances to all screened K-mers;
step S806, performing hash calculation on all K-mers with Hamming distances of preset values, and searching in a pre-established genome hash table according to the K-mers after the hash calculation to obtain genome positions and weights corresponding to all K-mers with Hamming distances of preset values;
step S808, calculating the total weight of all the K-mers with the Hamming distance as the preset value according to the weights corresponding to all the K-mers with the Hamming distance as the preset value;
step S810, performing classification analysis on the genome positions corresponding to all the K-mers with the Hamming distances being preset values and the total weight of all the K-mers with the Hamming distances being preset values, and determining the species attributes of the pathogens.
Specifically, when the species attribute of a sequence cannot be determined through classification analysis of the positions of all the K-mers in the sequence and the total weight, that is, when it cannot be determined which species the sequence belongs to, a backtracking strategy is adopted for the K-mers for which no position was found. All K-mers whose Hamming distance from those K-mers equals a preset value are taken out for hash lookup, and the position information of the known K-mers found in this way is then used for selection, until a final species judgment or an approximate species judgment is obtained for the sequence (see the fuzzy comparison module part of fig. 5 for details).
The preset value is set in advance and is usually a positive integer; it may be 1. It indicates the correlation or similarity between the selected K-mer and the original K-mer: the smaller the value, the higher the similarity.
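Steps S804 and S806 can be sketched for a preset Hamming distance of 1 as follows. The sketch assumes the four-letter alphabet A, C, G, T and a Python dict as the genome hash table; the function names are hypothetical.

```python
from typing import Dict, Iterator, List, Tuple

BASES = "ACGT"
Hit = Tuple[str, int, float]  # (species, genome_position, weight)


def hamming_neighbors(kmer: str) -> Iterator[str]:
    """Yield every K-mer at Hamming distance exactly 1 from kmer."""
    for i, original in enumerate(kmer):
        for base in BASES:
            if base != original:
                yield kmer[:i] + base + kmer[i + 1:]


def rescue_lookup(
    missing: List[str], table: Dict[str, List[Hit]]
) -> Dict[str, List[Hit]]:
    """For each K-mer not found in the table, collect the hits of its neighbors."""
    rescued: Dict[str, List[Hit]] = {}
    for kmer in missing:
        hits = [hit for nb in hamming_neighbors(kmer) for hit in table.get(nb, [])]
        if hits:
            rescued[kmer] = hits
    return rescued
```

A K-mer of length K has 3K neighbors at distance 1 (63 for K = 21), so the extra lookups stay cheap; the number of candidates grows quickly for larger preset values.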
It should be understood that although the steps in the flowcharts of fig. 2 and 6-7 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in fig. 2 and 6-7 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 9, there is provided a high throughput sequencing-based pathogen analysis device, comprising:
a sample sequencing data obtaining module 902, configured to obtain sequencing data of a sample to be analyzed;
a sample K-mer obtaining module 904, configured to segment each sequence of the sequencing data according to a K-mer to obtain multiple K-mers;
a position and weight obtaining module 906, configured to perform hash calculation on the plurality of K-mers, and look up the hashed K-mers in a pre-established genome hash table to obtain genome positions and weights corresponding to all K-mers of each sequence; the pre-established genome hash table is obtained by performing K-mer segmentation on a known sequence of a human genome and a known sequence of a pathogen genome, distributing weights to the K-mers, and mapping the weighted K-mers to the hash table by using a hash function; wherein the length of the K-mer is a positive integer;
a sequence total weight calculation module 908, configured to calculate a total weight of each sequence according to weights corresponding to all K-mers of each sequence;
and a species attribute determining module 910, configured to perform classification analysis according to the genomic positions corresponding to all K-mers of each sequence and the total weight of each sequence, so as to determine a species attribute of the pathogen.
In one embodiment, the device further comprises:
a genome sequence acquisition module for acquiring a sequence of a known human genome and a sequence of a pathogen genome;
the genome K-mer obtaining module is used for selecting K-mers and segmenting the sequences of human genomes and pathogens according to the K-mers to obtain a plurality of K-mers;
the K-mer weight distribution module is used for carrying out statistical analysis on the plurality of K-mers and distributing weight to each K-mer;
and the hash table establishing module is used for mapping each K-mer after the weight is distributed to the hash table by using a hash function, and establishing a genome hash table, wherein the key value of the genome hash table is a K-mer sequence, and the value of the genome hash table is the genome position, the species attribute and the weight corresponding to the K-mer.
In one embodiment, the K-mer weight assignment module is further configured to assign a weight $w_i$ to a non-repeated K-mer when one K-mer has no repetition in all K-mers;
the K-mer weight assignment module is further configured to, when one K-mer is repeated multiple times among all K-mers and the repeated K-mers belong to the same species attribute, assign a weight $w_i = w_i - n$ to the repeated, same-species K-mer, wherein $n$ is the number of repetitions;
the K-mer weight assignment module is further configured to, when one K-mer is repeated multiple times among all K-mers and the repeated K-mers belong to different species attributes, assign a weight $w_i = (w_i - n_i)/k$ to the repeated K-mers of different species, wherein $k$ denotes the number of species attributes, $n$ is the number of repetitions, and $n_i$ denotes the number of repetitions in the $i$-th species attribute, $i = 1, 2, \ldots, k$.
In one embodiment, the position and weight obtaining module is further configured to count the K-mers of each sequence, and if the genomic positions corresponding to two K-mers in any one sequence are different and the two K-mers have a linear relationship, the weight of the two K-mers is $w_i = w_i + w_i/2$.
In one embodiment, the position and weight obtaining module is further configured such that, if the genomic positions corresponding to a plurality of K-mers in any one sequence are different and the plurality of K-mers have a linear relationship, the weight of the plurality of K-mers is $w_i = w_i + w_i/2m$, wherein $m$ represents the number of K-mers having a linear relationship.
In one embodiment, further comprising:
the K-mer screening module is used for screening out all K-mers of which the corresponding genome positions are not found in the genome hash table when the species attribute of the pathogen cannot be determined by analyzing the genome positions corresponding to all the K-mers of each sequence and the total weight of each sequence;
the K-mer finding module is used for finding out all K-mers of which the Hamming distances from all the screened K-mers are preset values;
the position and weight obtaining module is also used for carrying out Hash calculation on all the K-mers with the Hamming distances being preset values, and searching in a pre-established genome Hash table according to the K-mers after the Hash calculation to obtain genome positions and weights corresponding to all the K-mers with the Hamming distances being preset values;
the sequence total weight calculation module is also used for calculating the total weight of all the K-mers with the Hamming distances being preset values according to the weights corresponding to all the K-mers with the Hamming distances being preset values;
and the species attribute determining module is also used for carrying out classification analysis on the genome positions corresponding to all the K-mers with the preset Hamming distances and the total weight of all the K-mers with the preset Hamming distances so as to determine the species attributes of the pathogens.
In one embodiment, the length of the K-mer is 21 bp.
For specific limitations of the pathogen analysis device based on high-throughput sequencing, reference may be made to the above limitations of the pathogen analysis method based on high-throughput sequencing, which are not described herein again. The various modules in the high-throughput sequencing-based pathogen analysis device described above may be implemented in whole or in part by software, hardware, and combinations thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and whose internal structure may be as shown in fig. 10. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device is used for storing the data involved in the pathogen analysis method, such as the genome hash table. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements a high-throughput sequencing based pathogen analysis method.
Those skilled in the art will appreciate that the architecture shown in fig. 10 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program: obtaining sequencing data of a sample to be analyzed, and segmenting each sequence of the sequencing data according to K-mers to obtain a plurality of K-mers; carrying out Hash calculation on the plurality of K-mers, and searching in a pre-established genome Hash table according to the K-mers subjected to Hash calculation to obtain genome positions and weights corresponding to all the K-mers of each sequence; the pre-established genome hash table is obtained by performing K-mer segmentation on a known sequence of a human genome and a known sequence of a pathogen genome, distributing weights to the K-mers and mapping the weighted K-mers to the hash table by using a hash function; wherein the length of the K-mer is a positive integer; calculating the total weight of each sequence according to the weights corresponding to all the K-mers of each sequence; and performing classification analysis according to the genome positions corresponding to all the K-mers of each sequence and the total weight of each sequence to determine the species attribute of the pathogen.
In one embodiment, the processor when executing the computer program further performs the steps of: establishing a genome hash table, comprising the following steps: obtaining the sequence of a known human genome and the sequence of a pathogen genome; selecting K-mers, and segmenting the sequences of human genomes and pathogens according to the K-mers to obtain a plurality of K-mers; performing statistical analysis on a plurality of K-mers, and distributing weight to each K-mer; mapping each K-mer after the weight is distributed to a hash table by using a hash function, and establishing a genome hash table, wherein the key value of the genome hash table is a K-mer sequence, and the value of the genome hash table is the genome position, the species attribute and the weight corresponding to the K-mer.
In one embodiment, the processor, when executing the computer program, further performs the steps of: the step of performing a statistical analysis on the plurality of K-mers and assigning a weight to each K-mer comprises: when one K-mer has no repetition in all K-mers, assigning a weight $w_i$ to the K-mer without repetition; when one K-mer is repeated multiple times among all K-mers and the repeated K-mers belong to the same species attribute, assigning a weight $w_i = w_i - n$ to the repeated, same-species K-mer, wherein $n$ is the number of repetitions; when one K-mer is repeated multiple times among all K-mers and the repeated K-mers belong to different species attributes, assigning a weight $w_i = (w_i - n_i)/k$ to the repeated K-mers of different species, wherein $k$ denotes the number of species attributes, $n$ is the number of repetitions, and $n_i$ denotes the number of repetitions in the $i$-th species attribute, $i = 1, 2, \ldots, k$.
In one embodiment, the processor, when executing the computer program, further performs the steps of: the step of obtaining the genome positions and weights corresponding to all K-mers of each sequence comprises: counting the K-mers of each sequence, wherein if the genome positions corresponding to two K-mers in any sequence are different and the two K-mers have a linear relation, the weight of the two K-mers is $w_i = w_i + w_i/2$.
In one embodiment, the processor, when executing the computer program, further performs the steps of: the step of obtaining the genomic positions and weights corresponding to all the K-mers of each sequence further comprises: if the genome positions corresponding to a plurality of K-mers in any sequence are different and the plurality of K-mers have a linear relationship, the weight of the plurality of K-mers is $w_i = w_i + w_i/2m$, wherein $m$ represents the number of K-mers having a linear relationship.
In one embodiment, the processor when executing the computer program further performs the steps of: screening all K-mers of which the corresponding genome positions are not found in a genome hash table when the species attribute of the pathogen cannot be determined by analyzing the genome positions corresponding to all the K-mers of one sequence and the total weight of one sequence; finding out all K-mers with preset Hamming distances to all the screened K-mers; performing hash calculation on all K-mers with the Hamming distances being preset values, and searching in a pre-established genome hash table according to the K-mers after the hash calculation to obtain genome positions and weights corresponding to all K-mers with the Hamming distances being preset values; calculating the total weight of all the K-mers with the Hamming distance as a preset value according to the weights corresponding to all the K-mers with the Hamming distance as a preset value; and carrying out classification analysis on the genome positions corresponding to all the K-mers with the preset Hamming distances and the total weight of all the K-mers with the preset Hamming distances to determine the species attributes of the pathogens.
In one embodiment, the processor, when executing the computer program, further performs the steps of: the length of the K-mer is 21 bp.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, performs the steps of: obtaining sequencing data of a sample to be analyzed, and segmenting each sequence of the sequencing data according to K-mers to obtain a plurality of K-mers; carrying out Hash calculation on the plurality of K-mers, and searching in a pre-established genome Hash table according to the K-mers subjected to Hash calculation to obtain genome positions and weights corresponding to all the K-mers of each sequence; the pre-established genome hash table is obtained by performing K-mer segmentation on a known sequence of a human genome and a known sequence of a pathogen genome, distributing weights to the K-mers, and mapping the weighted K-mers to the hash table by using a hash function; wherein the length of the K-mer is a positive integer; calculating the total weight of each sequence according to the weights corresponding to all the K-mers of each sequence; and performing classification analysis according to the genome positions corresponding to all the K-mers of each sequence and the total weight of each sequence to determine the species attribute of the pathogen.
In one embodiment, the computer program when executed by the processor further performs the steps of: the method comprises the steps of establishing a genome hash table, comprising the following steps: obtaining the sequence of a known human genome and the sequence of a pathogen genome; selecting a K-mer, and segmenting a sequence of a human genome and a sequence of a pathogen according to the K-mer to obtain a plurality of K-mers; performing statistical analysis on a plurality of K-mers, and distributing weight to each K-mer; mapping each K-mer after the weight is distributed to a hash table by using a hash function, and establishing a genome hash table, wherein the key value of the genome hash table is a K-mer sequence, and the value of the genome hash table is the genome position, the species attribute and the weight corresponding to the K-mer.
In one embodiment, the computer program when executed by the processor further performs the steps of: the step of performing a statistical analysis on the plurality of K-mers and assigning a weight to each K-mer comprises: when one K-mer has no repetition in all K-mers, assigning a weight $w_i$ to the K-mer without repetition; when one K-mer is repeated multiple times among all K-mers and the repeated K-mers belong to the same species attribute, assigning a weight $w_i = w_i - n$ to the repeated, same-species K-mer, wherein $n$ is the number of repetitions; when one K-mer is repeated multiple times among all K-mers and the repeated K-mers belong to different species attributes, assigning a weight $w_i = (w_i - n_i)/k$ to the repeated K-mers of different species, wherein $k$ denotes the number of species attributes, $n$ is the number of repetitions, and $n_i$ denotes the number of repetitions in the $i$-th species attribute, $i = 1, 2, \ldots, k$.
In one embodiment, the computer program when executed by the processor further performs the steps of: the step of obtaining the genome positions and the weights corresponding to all the K-mers of each sequence comprises the following steps: counting the K-mers of each sequence, and if the genome positions corresponding to two K-mers in any sequence are different and the two K-mers have a linear relation, the weight of the two K-mers is $w_i = w_i + w_i/2$.
In one embodiment, the computer program when executed by the processor further performs the steps of: the step of obtaining the genomic positions and weights corresponding to all the K-mers of each sequence further comprises: if the genome positions corresponding to a plurality of K-mers in any sequence are different and the plurality of K-mers have a linear relationship, the weight of the plurality of K-mers is $w_i = w_i + w_i/2m$, wherein $m$ represents the number of K-mers having a linear relationship.
In one embodiment, the computer program when executed by the processor further performs the steps of: screening all K-mers of which the corresponding genome positions are not found in a genome hash table when the species attribute of the pathogen cannot be determined by analyzing the genome positions corresponding to all the K-mers of one sequence and the total weight of one sequence; finding out all K-mers with preset Hamming distances to all the screened K-mers; performing hash calculation on all K-mers with the Hamming distances being preset values, and searching in a pre-established genome hash table according to the K-mers after the hash calculation to obtain genome positions and weights corresponding to all K-mers with the Hamming distances being preset values; calculating the total weight of all the K-mers with the Hamming distance as a preset value according to the weights corresponding to all the K-mers with the Hamming distance as a preset value; and carrying out classification analysis on the genome positions corresponding to all the K-mers with the preset Hamming distances and the total weight of all the K-mers with the preset Hamming distances to determine the species attributes of the pathogens.
In one embodiment, the computer program when executed by the processor further performs the steps of: the length of the K-mer is 21 bp.
Those skilled in the art will appreciate that all or part of the processes of the methods in the above embodiments can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered to be within the scope of this specification.
The above examples express only several embodiments of the present application, and their description is relatively specific and detailed, but they are not to be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A pathogen analysis method based on high-throughput sequencing is characterized by comprising the following steps:
obtaining sequencing data of a sample to be analyzed, and segmenting each sequence of the sequencing data according to K-mers to obtain a plurality of K-mers;
performing Hash calculation on the plurality of K-mers, and searching in a pre-established genome Hash table according to the K-mers subjected to Hash calculation to obtain genome positions and weights corresponding to all the K-mers of each sequence; the pre-established genome hash table is obtained by performing K-mer segmentation on a known sequence of a human genome and a known sequence of a pathogen genome, distributing weights to the K-mers, and mapping the weighted K-mers to the hash table by using a hash function; wherein the length of the K-mer is a positive integer;
calculating the total weight of each sequence according to the weights corresponding to all the K-mers of each sequence;
and performing classification analysis according to the genome positions corresponding to all the K-mers of each sequence and the total weight of each sequence to determine the species attribute of the pathogen.
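The steps of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the `(position, species, weight)` tuple layout, and the rule of picking the species with the largest summed weight are all assumptions made for illustration. A Python `dict` stands in for the genome hash table, since a dictionary lookup hashes its key internally.

```python
from collections import defaultdict

K = 21  # K-mer length; 21 bp is the value named in claim 7


def split_kmers(read, k=K):
    """Return every overlapping k-mer of a read, in read order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def classify_read(read, genome_table, k=K):
    """Look up each k-mer of a read and sum the matched weights per species.

    genome_table is assumed to map k-mer -> (position, species, weight).
    Returns (best_species, total_weight, hits); best_species is None
    when no k-mer of the read is present in the table.
    """
    totals = defaultdict(float)
    hits = []
    for kmer in split_kmers(read, k):
        entry = genome_table.get(kmer)  # dict lookup hashes the k-mer
        if entry is None:
            continue
        position, species, weight = entry
        totals[species] += weight
        hits.append((kmer, position, species, weight))
    if not totals:
        return None, 0.0, hits
    # Assumed decision rule: the species with the largest total weight wins.
    best = max(totals, key=totals.get)
    return best, totals[best], hits
```

The returned `hits` list keeps the genome positions, so a later step could also take position information into account, as the claim requires.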
2. The method for pathogen analysis based on high-throughput sequencing of claim 1, wherein the step of establishing the genome hash table comprises:
obtaining the sequence of a known human genome and the sequence of a pathogen genome;
selecting K-mers, and segmenting the sequence of the human genome and the sequence of the pathogen according to the K-mers to obtain a plurality of K-mers;
performing statistical analysis on a plurality of the K-mers and assigning a weight to each K-mer;
mapping each K-mer after the weight is distributed to a hash table by using a hash function to obtain the genome hash table, wherein the key value of the genome hash table is a K-mer sequence, and the value of the genome hash table is the genome position, the species attribute and the weight corresponding to the K-mer.
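The table construction in claim 2 might look like the following sketch. The name `build_genome_table`, the `weight_fn` hook, and the choice to store every `(species, position)` occurrence of a K-mer are hypothetical; the claim states only that the key is the K-mer sequence and the value holds the genome position, species attribute, and weight.

```python
from collections import defaultdict


def build_genome_table(genomes, k, weight_fn):
    """Build a k-mer lookup table from reference sequences.

    genomes:   dict mapping species name -> reference sequence
    weight_fn: hypothetical hook that receives the list of
               (species, position) occurrences of one k-mer and
               returns its weight
    """
    occurrences = defaultdict(list)
    for species, seq in genomes.items():
        for pos in range(len(seq) - k + 1):
            occurrences[seq[pos:pos + k]].append((species, pos))
    # Key: k-mer sequence. Value: where it occurs and its weight.
    return {
        kmer: {"occurrences": occ, "weight": weight_fn(occ)}
        for kmer, occ in occurrences.items()
    }
```

Passing the weighting rule in as a function keeps the table layout independent of how weights are computed.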
3. The high throughput sequencing-based pathogen analysis method according to claim 2, wherein the step of performing a statistical analysis on a plurality of said K-mers and assigning a weight to each K-mer comprises:
when a K-mer is not repeated among all K-mers, assigning a weight $w_i$ to the non-repeated K-mer;
when a K-mer is repeated multiple times among all K-mers and the repeated K-mers belong to the same species attribute, assigning the weight $w_i = w_i - n$ to the repeated K-mers of the same species, wherein $n$ is the number of repetitions;
when a K-mer is repeated multiple times among all K-mers and the repeated K-mers belong to different species attributes, assigning the weight $w_i = (w_i - n_i)/k$ to the repeated K-mers of different species, where $k$ denotes the number of species attributes, $n$ is the number of repetitions, $n_i$ denotes the number of repetitions in the $i$-th species attribute, $i = 1, 2, \ldots, k$, and $w_i$ represents the weight of the $i$-th K-mer.
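The three weighting cases of claim 3 can be sketched as follows. Two points are assumptions, because the claim does not fix them: the starting weight is passed in as a hypothetical `base_weight` parameter, and the number of repetitions is read as the number of occurrences of the K-mer.

```python
from collections import Counter


def assign_weights(occurrence_species, base_weight):
    """Return {species: weight} for one k-mer.

    occurrence_species: one species label per occurrence of the k-mer
    base_weight:        starting weight w_i (hypothetical parameter)
    """
    if not occurrence_species:
        raise ValueError("a k-mer needs at least one occurrence")
    counts = Counter(occurrence_species)
    n = len(occurrence_species)  # assumed meaning of "number of repetitions"
    if n == 1:
        # Case 1: the k-mer is unique, so it keeps the base weight.
        return {occurrence_species[0]: base_weight}
    if len(counts) == 1:
        # Case 2: repeated within a single species, w_i = w_i - n.
        return {occurrence_species[0]: base_weight - n}
    # Case 3: repeated across k species, w_i = (w_i - n_i) / k.
    k = len(counts)
    return {species: (base_weight - n_i) / k for species, n_i in counts.items()}
```

Under this reading, a K-mer shared between species is penalized twice: once for its repeats within each species and once by dividing across the species that contain it.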
4. The high throughput sequencing-based pathogen analysis method according to any one of claims 1 to 3, wherein the step of obtaining the genomic positions and weights corresponding to all K-mers of each sequence comprises:
counting the K-mers of each sequence; if the genomic positions corresponding to two K-mers in any sequence are not the same and the two K-mers have a linear relationship, the weight of the two K-mers is $w_i = w_i + w_i/2$.
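The adjustment in claim 4 can be sketched under one loud assumption: the claim does not define "linear relationship", so this example reads it as collinearity, meaning that the distance between the two K-mers on the genome equals their distance within the read. The function name and the `(read_offset, genome_position, weight)` tuple layout are likewise hypothetical.

```python
def boost_collinear_pair(hit_a, hit_b):
    """Apply w_i = w_i + w_i / 2 to two hits if they are collinear.

    Each hit is (read_offset, genome_position, weight).
    "Collinear" is an assumed reading of the claim's "linear
    relationship": different genome positions whose spacing matches
    the spacing of the k-mers in the read.
    """
    off_a, pos_a, w_a = hit_a
    off_b, pos_b, w_b = hit_b
    collinear = pos_a != pos_b and (pos_b - pos_a) == (off_b - off_a)
    if not collinear:
        return w_a, w_b
    return w_a + w_a / 2, w_b + w_b / 2
```

Two hits that land 5 bases apart in both the read and the genome would each gain half their weight, while a pair that maps to unrelated loci would be left unchanged.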
5. The method for pathogen analysis based on high throughput sequencing of claim 4, wherein the step of obtaining genomic positions and weights corresponding to all K-mers of each sequence further comprises:
if the genome positions corresponding to a plurality of K-mers in any sequence are different and the plurality of K-mers have a linear relationship, the weight of the plurality of K-mers is $w_i = w_i + w_i/2m$, wherein $m$ represents the number of K-mers having a linear relationship.
6. The method for pathogen analysis based on high throughput sequencing according to claim 5, wherein the step of performing classification analysis based on genomic positions corresponding to all K-mers of each sequence and total weight of each sequence to determine species attributes of the pathogen comprises:
screening all K-mers of which the corresponding genome positions are not found in a genome hash table when the species attribute of a pathogen cannot be determined by analyzing the genome positions corresponding to all the K-mers of one sequence and the total weight of one sequence;
finding out all K-mers with preset Hamming distances to all the screened K-mers;
performing hash calculation on all K-mers with the Hamming distances being preset values, and searching in a pre-established genome hash table according to the K-mers after the hash calculation to obtain genome positions and weights corresponding to all K-mers with the Hamming distances being preset values;
calculating the total weight of all the K-mers with the Hamming distance as a preset value according to the weights corresponding to all the K-mers with the Hamming distance as a preset value;
and carrying out classification analysis on the genome positions corresponding to all the K-mers with the preset Hamming distances and the total weight of all the K-mers with the preset Hamming distances to determine the species attributes of the pathogens.
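The fallback search of claim 6 can be illustrated with the sketch below. It assumes substitutions over the four-letter DNA alphabet and a table that maps each K-mer to `(position, species, weight)`; the function names and the default distance of 1 are hypothetical, since the claim only refers to a preset Hamming distance.

```python
from itertools import combinations, product

ALPHABET = "ACGT"


def hamming_neighbors(kmer, distance=1):
    """Yield every DNA string at exactly `distance` substitutions from kmer."""
    for positions in combinations(range(len(kmer)), distance):
        # At each chosen position, try every base except the original one.
        choices = [[b for b in ALPHABET if b != kmer[p]] for p in positions]
        for replacement in product(*choices):
            chars = list(kmer)
            for p, base in zip(positions, replacement):
                chars[p] = base
            yield "".join(chars)


def rescue_unmatched(kmers, genome_table, distance=1):
    """Look up the Hamming neighbors of k-mers missing from the table.

    Returns (hits, total_weight), where each hit is
    (neighbor, position, species, weight).
    """
    hits = []
    total = 0.0
    for kmer in kmers:
        if kmer in genome_table:
            continue  # only k-mers without a genome position are rescued
        for neighbor in hamming_neighbors(kmer, distance):
            entry = genome_table.get(neighbor)
            if entry is not None:
                position, species, weight = entry
                hits.append((neighbor, position, species, weight))
                total += weight
    return hits, total
```

For a 21 bp K-mer and a distance of 1, the generator yields 21 × 3 = 63 candidates, so the extra lookups stay cheap; the candidate count grows quickly as the preset distance increases.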
7. The method for pathogen analysis based on high throughput sequencing of claim 6, wherein the length of the K-mer is 21 bp.
8. A pathogen analysis device based on high-throughput sequencing, comprising:
the sample sequencing data acquisition module is used for acquiring sequencing data of a sample to be analyzed;
the sample K-mer obtaining module is used for segmenting each sequence of the sequencing data according to the K-mers to obtain a plurality of K-mers;
the position and weight obtaining module is used for carrying out Hash calculation on the plurality of K-mers and searching in a pre-established genome Hash table according to the K-mers subjected to Hash calculation to obtain genome positions and weights corresponding to all the K-mers of each sequence; the pre-established genome hash table is obtained by performing K-mer segmentation on a known sequence of a human genome and a known sequence of a pathogen genome, distributing weights to the K-mers, and mapping the weighted K-mers to the hash table by using a hash function; wherein the length of the K-mer is a positive integer;
the sequence total weight calculation module is used for calculating the total weight of each sequence according to the weights corresponding to all the K-mers of each sequence;
and the species attribute determining module is used for performing classification analysis according to the genome positions corresponding to all the K-mers of each sequence and the total weight of each sequence to determine the species attribute of the pathogen.
9. A computer arrangement comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method according to any of claims 1-7 are implemented when the computer program is executed by the processor.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
CN202011137959.1A 2020-10-22 2020-10-22 Pathogen analysis method and device based on high-throughput sequencing and computer equipment Active CN112259167B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011137959.1A CN112259167B (en) 2020-10-22 2020-10-22 Pathogen analysis method and device based on high-throughput sequencing and computer equipment

Publications (2)

Publication Number Publication Date
CN112259167A CN112259167A (en) 2021-01-22
CN112259167B true CN112259167B (en) 2022-09-23

Family

ID=74264598

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011137959.1A Active CN112259167B (en) 2020-10-22 2020-10-22 Pathogen analysis method and device based on high-throughput sequencing and computer equipment

Country Status (1)

Country Link
CN (1) CN112259167B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113299345A (en) * 2021-06-30 2021-08-24 中国人民解放军军事科学院军事医学研究院 Virus gene classification method and device and electronic equipment
CN113539378A (en) * 2021-07-16 2021-10-22 明科生物技术(杭州)有限公司 Data analysis method, system, equipment and storage medium of virus database

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018000174A1 (en) * 2016-06-28 2018-01-04 深圳大学 Rapid and parallelstorage-oriented dna sequence matching method and system thereof
CN108710784A (en) * 2018-05-16 2018-10-26 中科政兴(上海)医疗科技有限公司 A kind of genetic transcription variation probability and the algorithm in the direction that makes a variation
CN110610741A (en) * 2019-08-29 2019-12-24 上海伯杰医疗科技有限公司 Human pathogen identification method and device and electronic equipment
CN111020018A (en) * 2019-11-28 2020-04-17 天津金匙医学科技有限公司 Macrogenomics-based pathogenic microorganism detection method and kit
CN111370064A (en) * 2020-03-19 2020-07-03 山东大学 Rapid gene sequence classification method and system based on SIMD hash function

Also Published As

Publication number Publication date
CN112259167A (en) 2021-01-22

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant