CN118173171A

CN118173171A - Low-sequence number nucleic acid identification method and system

Info

Publication number: CN118173171A
Application number: CN202410175612.8A
Authority: CN
Inventors: 马长剑; 董艳; 周文婧; 贡雪; 于潇; 杨思雨
Original assignee: Shenyang Jinyu Medical Testing Institute Co ltd
Current assignee: Shenyang Jinyu Medical Testing Institute Co ltd
Filing date: 2024-02-07
Publication date: 2024-06-11

Abstract

The invention relates to the field of low-sequence nucleic acid identification, and discloses a low-sequence number nucleic acid identification method and a system, wherein the method comprises the following steps: pretreating an infection sample to obtain a mixed sample, sequencing the mixed sample to obtain a parasitic strain sequence and a host sequence, sequentially extracting the parasitic strain sequence, and performing the following operations on the extracted parasitic strain sequences: obtaining the number of the parasitic strain sequences and the number of the host sequences, obtaining an initial ratio average value by using the number of the host sequences and the number of the parasitic strain sequences, calculating an identification index by using a pre-constructed identification index calculation formula, a standard ratio set and the initial ratio average value, obtaining an initial nucleic acid set based on a plurality of identification indexes, and determining the parasitic nucleic acid category corresponding to the extracted parasitic strain sequences by using a nucleic acid amplification technology and the initial nucleic acid set to finish the low-sequence nucleic acid identification. The invention mainly aims to realize the identification of low-sequence number nucleic acid and improve the accuracy of the identification of the low-sequence number nucleic acid.

Description

Low-sequence number nucleic acid identification method and system

Technical Field

The invention relates to a method and a system for identifying low-sequence number nucleic acid, belonging to the field of low-sequence nucleic acid identification.

Background

Nucleic acid detection is a molecular diagnostic technique that analyzes the base sequence in an infected sample to determine and identify parasitic species, and low-sequence-number nucleic acids are low-molecular-weight, low-sequence-complexity nucleic acids that consist of fewer base pairs.

Currently, low-sequence number nucleic acid identification generally uses one of three techniques: microarray technology, which requires a large amount of starting sample as well as special chips and equipment; high-throughput sequencing, which determines the entire genome or transcriptome; and real-time quantitative PCR, which detects specific genes or nucleic acid fragments.

Although microarray technology, high-throughput sequencing and real-time quantitative PCR can all be used to identify low-sequence number nucleic acid, each has drawbacks. Microarray technology requires special chips and equipment, is costly, and identifies low-abundance low-sequence number nucleic acid with low accuracy. High-throughput sequencing has difficulty completely detecting low-abundance low-sequence number nucleic acid, so its identification accuracy for such nucleic acid is also low. Real-time quantitative PCR cannot detect a plurality of genes or nucleic acid fragments at the same time, and is difficult to use on its own when unknown parasitic strains exist in the infected sample.

Disclosure of Invention

The invention provides a method and a system for identifying low-sequence number nucleic acid, which mainly aim to realize the identification of the low-sequence number nucleic acid and improve the accuracy of the identification of the low-sequence number nucleic acid.

In order to achieve the above object, the present invention provides a method for identifying a low-sequence number nucleic acid, comprising:

Obtaining an infection sample, preprocessing the infection sample to obtain a mixed sample, sequencing the mixed sample to obtain one or more parasitic strain sequences and host sequences, sequentially extracting the parasitic strain sequences from the one or more parasitic strain sequences, and performing the following operations on the extracted parasitic strain sequences:

Obtaining a plurality of unit detection samples based on the mixed sample, sequentially extracting the unit detection samples from the plurality of unit detection samples, and performing the following operations on the extracted unit detection samples:

Obtaining the number of the parasitic strain sequences based on the extracted parasitic strain sequences, obtaining the number of the host sequences based on the mixed sample, and obtaining an initial ratio by utilizing the number of the host sequences and the number of the parasitic strain sequences;

summarizing a plurality of initial ratios to obtain an initial ratio set, and acquiring an initial ratio average value of the extracted parasitic strain sequences based on the initial ratio set;

Sequentially extracting standard parasitic strains from a pre-constructed standard parasitic strain sequence library, acquiring a standard ratio set based on the extracted standard parasitic strains, and calculating an identification index by utilizing a pre-constructed identification index calculation formula, the standard ratio set and an initial ratio average value;

Summarizing a plurality of identification indexes to obtain an identification index set, and acquiring an initial nucleic acid subset based on the identification index set, wherein the initial nucleic acid subset comprises one or more initial nucleic acid categories;

and determining the parasitic nucleic acid category corresponding to the extracted parasitic strain sequence by utilizing a pre-constructed nucleic acid amplification technology and an initial nucleic acid set, and completing the low-sequence nucleic acid identification.

Optionally, the obtaining an infection specimen, and preprocessing the infection specimen to obtain a mixed sample includes:

Obtaining three unit specimens based on the infection specimen to obtain a first unit specimen, a second unit specimen and a third unit specimen;

Performing first host removal processing on the first unit specimen based on a preset host DNA removal proportion to obtain a first sample, and performing second host removal processing on the second unit specimen and third host removal processing on the third unit specimen according to the host DNA removal proportion to obtain a second sample and a third sample, wherein the first host removal processing, the second host removal processing and the third host removal processing are three different host removal processing methods;

a mixed sample is obtained based on the first sample, the second sample, and the third sample.

Optionally, the obtaining the initial ratio average value of the extracted parasitic strain sequence based on the initial ratio set includes:

the calculation formula of the initial ratio is as follows:

$$x_i = \frac{a_i}{b_i}$$

Wherein $x_i$ represents the initial ratio of the unit detection sample with the sequence number $i$, $i$ represents the sequence number of the extracted unit detection sample, $a_i$ represents the number of parasitic strain sequences of the unit detection sample with the sequence number $i$, and $b_i$ represents the number of host sequences of the unit detection sample with the sequence number $i$;

the calculation formula of the initial ratio average value is as follows:

$$P = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Wherein $P$ is the average value of the initial ratios, and $n$ is the total number of the initial ratios in the initial ratio set.

Optionally, the sequentially extracting standard parasitic strains from the pre-constructed standard parasitic strain sequence library includes:

Obtaining a plurality of standard parasitic strains, sequentially extracting the standard parasitic strains from the plurality of standard parasitic strains, and executing the following operations on the extracted standard parasitic strains:

obtaining a plurality of standard infection samples based on a standard parasitic strain, sequentially extracting standard infection samples from the plurality of standard infection samples, and performing the following operations on each of the extracted standard infection samples:

obtaining the number of standard parasitic strain sequences, the number of standard host sequences, host information and parasitic strain information based on the standard infection sample, calculating an initial standard ratio according to the number of standard parasitic strain sequences and the number of standard host sequences, and summarizing a plurality of initial standard ratios to obtain an initial standard ratio set;

Obtaining abnormal values in an initial standard ratio set based on the host information, removing the abnormal values in the initial standard ratio set to obtain a first optimized set, obtaining an optimized mean value and an optimized standard deviation according to the first optimized set, and screening the first optimized set based on the optimized mean value and the optimized standard deviation to obtain a standard ratio set;

constructing a standard ratio library corresponding to the standard parasitic strain based on the standard ratio set;

summarizing a standard ratio library corresponding to each standard parasitic strain in the plurality of standard parasitic strains to obtain a standard parasitic strain sequence library;

And sequentially extracting standard parasitic strains from the standard parasitic strain sequence library.

Optionally, the obtaining a standard ratio set based on the extracted standard parasitic strain, and calculating the identification index by using a pre-constructed identification index calculation formula, the standard ratio set and an initial ratio mean value includes:

Constructing an identification index calculation formula, acquiring a standard mean value based on a standard ratio set, and calculating an identification index based on the identification index calculation formula, the standard ratio set and an initial ratio mean value, wherein the identification index calculation formula is as follows:

Wherein V represents the identification index and X represents the standard mean.

Optionally, the obtaining an initial nucleic acid subset based on the identification index set includes:

screening the extracted identification index set by using a preset identification threshold value to obtain an optimization index set;

Sequentially extracting optimization indexes from the optimization index set, and arranging the optimization indexes in the optimization index set in a numerical decreasing manner based on the extracted optimization indexes to obtain a decreasing index set;

sequentially extracting decremental indexes from the decremental index set, and acquiring standard parasitic strains corresponding to the extracted decremental indexes based on the extracted decremental indexes;

Obtaining a standard strain set based on the standard parasitic strains, wherein the standard strain set comprises one or more standard parasitic strains;

and obtaining an initial nucleic acid subset based on the standard strain set.

Optionally, the screening the extracted identification index set by using a preset identification threshold value to obtain an optimization index set includes:

sequentially extracting the identification indexes from the identification index set, and comparing the extracted identification indexes with an identification threshold value;

if the identification index is larger than the identification threshold value, confirming that the identification index is an optimization index;

If the identification index is smaller than the identification threshold value, eliminating the identification index;

and summarizing the plurality of optimization indexes to obtain an optimization index set.

Optionally, the identification threshold value is 50%.
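The screening and ordering steps above can be sketched in Python as follows. This is a minimal illustration, not the claimed implementation: the function name `shortlist_candidates`, the strain labels and the index values are hypothetical, and the identification indexes are assumed to be already computed and expressed as fractions (so 50% is `0.5`). The source does not say how an index exactly equal to the threshold is handled; the sketch assumes it is eliminated.

```python
def shortlist_candidates(indexes: dict[str, float], threshold: float = 0.5) -> list[str]:
    """Return standard strain names ordered by decreasing identification index.

    indexes maps a standard parasitic strain name to its identification index.
    Only indexes strictly greater than the threshold are kept (assumption:
    an index equal to the threshold is eliminated).
    """
    # Screening: keep only the optimization indexes.
    optimization = {name: v for name, v in indexes.items() if v > threshold}
    # Ordering: arrange the kept indexes in numerically decreasing order.
    return sorted(optimization, key=optimization.get, reverse=True)


# Hypothetical identification indexes for four standard strains.
example_indexes = {
    "strain_A": 0.82,
    "strain_B": 0.41,
    "strain_C": 0.67,
    "strain_D": 0.50,
}
standard_strain_set = shortlist_candidates(example_indexes)
print(standard_strain_set)
```

Ordering the shortlist by decreasing index means the most likely candidate is the first one passed to the later amplification step.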

Optionally, the determining the parasitic nucleic acid class corresponding to the extracted parasitic strain sequence using the pre-constructed nucleic acid amplification technique and the initial nucleic acid set comprises:

sequentially extracting initial nucleic acid categories from the initial nucleic acid set, and performing the following operations on each extracted initial nucleic acid category:

Acquiring an amplification primer corresponding to the extracted initial nucleic acid category based on the extracted initial nucleic acid category, and acquiring an amplified sample by using a pre-constructed nucleic acid amplification technology, the amplification primer and a plurality of unit detection samples;

Obtaining an amplified sample set based on the amplified sample, wherein the amplified sample set comprises one or more amplified samples;

and obtaining the parasitic nucleic acid category corresponding to the extracted parasitic strain sequence based on the amplified sample set.
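The bookkeeping around the amplification step can be sketched as below. The amplification itself is a laboratory procedure, so it is represented here by a caller-supplied function `amplify`, which is purely hypothetical, as are the names `confirm_categories`, the primer labels and the sample labels. The source does not state how the amplified sample set is turned into a decision; the sketch assumes a category is confirmed when at least one unit detection sample amplifies with that category's primer.

```python
from typing import Callable, Mapping, Sequence


def confirm_categories(
    candidates: Sequence[str],
    primers: Mapping[str, str],
    unit_samples: Sequence[str],
    amplify: Callable[[str, str], bool],
) -> list[str]:
    """Check each initial nucleic acid category, in the given order.

    amplify(primer, sample) stands in for the wet-lab amplification readout
    and returns True when the sample amplifies with the primer.
    """
    confirmed = []
    for category in candidates:
        primer = primers.get(category)
        if primer is None:
            continue  # no amplification primer available for this category
        # One readout per unit detection sample forms the amplified sample set.
        amplified_sample_set = [amplify(primer, s) for s in unit_samples]
        # Assumed decision rule: any positive readout confirms the category.
        if any(amplified_sample_set):
            confirmed.append(category)
    return confirmed


def fake_amplify(primer: str, sample: str) -> bool:
    """Hypothetical readout used only to exercise the function."""
    return primer == "primer_A" and sample != "unit_3"


result = confirm_categories(
    ["strain_A", "strain_C"],
    {"strain_A": "primer_A", "strain_C": "primer_C"},
    ["unit_1", "unit_2", "unit_3"],
    fake_amplify,
)
print(result)
```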

In order to solve the above problems, the present invention also provides a low-sequence number nucleic acid identification system comprising:

an initial ratio average value acquisition module, which is used for acquiring an infection sample, preprocessing the infection sample to obtain a mixed sample, sequencing the mixed sample to obtain one or more parasitic strain sequences and a host sequence, sequentially extracting the parasitic strain sequences from the one or more parasitic strain sequences, and executing the following operations on the extracted parasitic strain sequences: obtaining a plurality of unit detection samples based on the mixed sample, sequentially extracting the unit detection samples from the plurality of unit detection samples, and performing the following operations on the extracted unit detection samples: obtaining the number of the parasitic strain sequences based on the extracted parasitic strain sequences, obtaining the number of the host sequences based on the mixed sample, and obtaining an initial ratio by utilizing the number of the host sequences and the number of the parasitic strain sequences; summarizing a plurality of initial ratios to obtain an initial ratio set, and acquiring an initial ratio average value of the extracted parasitic strain sequences based on the initial ratio set;

a strain preliminary identification module, which is used for sequentially extracting standard parasitic strains from a pre-constructed standard parasitic strain sequence library, acquiring a standard ratio set based on the extracted standard parasitic strains, and calculating an identification index by utilizing a pre-constructed identification index calculation formula, the standard ratio set and an initial ratio average value; summarizing a plurality of identification indexes to obtain an identification index set, and acquiring an initial nucleic acid subset based on the identification index set, wherein the initial nucleic acid subset comprises one or more initial nucleic acid categories;

a parasitic nucleic acid category acquisition module, which is used for determining the parasitic nucleic acid category corresponding to the extracted parasitic strain sequence by utilizing a pre-constructed nucleic acid amplification technology and the initial nucleic acid set, so as to finish the low-sequence number nucleic acid identification.

In order to solve the above-mentioned problems, the present invention also provides an electronic apparatus including:

at least one processor; and

A memory communicatively coupled to the at least one processor; wherein,

The memory stores instructions executable by the at least one processor to implement the low-sequence number nucleic acid identification method described above.

In order to solve the above-described problems, the present invention also provides a computer-readable storage medium having stored therein at least one instruction that is executed by a processor in an electronic device to implement the low-sequence number nucleic acid identification method described above.

Compared with the problems in the background art, the embodiment of the invention first acquires an infection sample and pretreats it to obtain a mixed sample. Three different host removal processing methods are used during pretreatment, which reduces the influence that the choice of host removal method has on the parasitic strain sequences of the parasitic strains in the infection sample.

The mixed sample is then sequenced to obtain one or more parasitic strain sequences and a host sequence, and the parasitic strain sequences are extracted in turn. For each extracted parasitic strain sequence, the mixed sample is split into a plurality of unit detection samples and each unit detection sample is detected, which reduces the probability of inaccurate detection results caused by improper operation during sequencing. For each extracted unit detection sample, the number of the parasitic strain sequences is obtained based on the extracted parasitic strain sequence, the number of the host sequences is obtained based on the mixed sample, and an initial ratio is obtained by utilizing the number of the host sequences and the number of the parasitic strain sequences. The plurality of initial ratios are summarized to obtain an initial ratio set, and the initial ratio average value of the extracted parasitic strain sequence is obtained based on the initial ratio set. An average value obtained in this way is more representative and reduces the errors that the low sequence number of the parasitic strain introduces when the number of host sequences and the number of parasitic strain sequences are obtained.

Standard parasitic strains are then extracted in turn from a pre-constructed standard parasitic strain sequence library, a standard ratio set is obtained based on each extracted standard parasitic strain, and an identification index is calculated by using a pre-constructed identification index calculation formula, the standard ratio set and the initial ratio average value. The identification index compares the degree of similarity between the standard parasitic strain, represented by the standard mean value, and the parasitic strain corresponding to the extracted parasitic strain sequence, represented by the initial ratio average value: the larger the identification index, the higher the probability that the two are the same parasitic strain. The plurality of identification indexes are summarized to obtain an identification index set, and an initial nucleic acid subset comprising one or more initial nucleic acid categories is obtained based on the identification index set. The embodiment of the invention fully considers the interference caused by similarity between genes and screens the identification indexes; it also considers the influence of improper experimental operation and preferentially judges whether a standard parasitic strain with a larger identification index is the parasitic strain corresponding to the extracted parasitic strain sequence, which improves verification efficiency. Finally, a pre-constructed nucleic acid amplification technology and the initial nucleic acid set are used to determine the parasitic nucleic acid category corresponding to the extracted parasitic strain sequence, completing the low-sequence number nucleic acid identification.

Compared with the background art, the proportion of parasitic strain sequences is increased through pretreatment, which improves detection accuracy in subsequent processing. Meanwhile, the parasitic strains corresponding to the extracted parasitic strain sequences are screened by using the standard parasitic strain sequence library and the identification index, and are then identified again by using PCR amplification technology. This addresses the problems in the background art that low-abundance nucleic acid is difficult to detect, that unknown parasitic strains cannot be detected and that multiple parasitic strains cannot be detected simultaneously, and improves the accuracy of low-sequence number nucleic acid identification. The invention provides a low-sequence number nucleic acid identification method, a system, an electronic device and a computer readable storage medium, which mainly aim to realize low-sequence number nucleic acid identification and improve the accuracy of the low-sequence number nucleic acid identification.

Drawings

FIG. 1 is a flow chart of a low-sequence number nucleic acid identification method according to an embodiment of the invention;

FIG. 2 is a schematic flow chart of sequentially extracting standard parasitic strains from a pre-constructed standard parasitic strain sequence library in a low-sequence number nucleic acid identification method according to an embodiment of the present invention;

FIG. 3 is a functional block diagram of a low-sequence number nucleic acid identification system according to an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of an electronic device for implementing the low-sequence number nucleic acid identification method according to an embodiment of the invention.

The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.

Detailed Description

It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.

The embodiment of the application provides a low-sequence number nucleic acid identification method. The execution subject of the low-sequence number nucleic acid identification method includes, but is not limited to, at least one of a server, a terminal, and the like, which can be configured to execute the method provided by the embodiment of the application. In other words, the low-sequence number nucleic acid identification method may be performed by software or hardware installed in a terminal device or a server device. The server includes but is not limited to: a single server, a server cluster, a cloud server or a cloud server cluster, and the like.

Example 1:

Referring to FIG. 1, a flow chart of a low-sequence number nucleic acid identification method according to an embodiment of the invention is shown. In this embodiment, the low-sequence number nucleic acid identification method includes:

S1, acquiring an infection specimen, and preprocessing the infection specimen to obtain a mixed sample;

It should be noted that, in the embodiment of the present invention, the infection specimen is a specimen obtained from a patient with clinical lower respiratory tract infection, and the parasitic bacteria species in the infection specimen are identified based on the infection specimen;

further, the obtaining an infection specimen, and preprocessing the infection specimen to obtain a mixed sample, includes:

In the embodiment of the invention, the unit specimens are obtained by dividing the infection specimen into three equal parts, namely a first unit specimen, a second unit specimen and a third unit specimen, and the host DNA removal proportion is a proportion preset by the person performing the nucleic acid detection.

It will be appreciated that the host removal treatment described in the embodiment of the present invention is a treatment that removes host DNA from the infection specimen. When the nucleic acid sequences of an infection specimen are detected, if the content of host nucleic acid is too high, the nucleic acid of the parasitic strain is relatively scarce, so the nucleic acid sequence of the parasitic strain is detected with a low sequence number, which affects the reliability of identifying the species of the parasitic strain. Conversely, if excessive host removal treatment is performed, DNA of the parasitic strain is easily removed along with the host DNA, which also affects the reliability of identifying the species of the parasitic strain. The infection specimen therefore needs to be subjected to host removal treatment, and during that treatment the host DNA in the infection specimen is removed according to the host DNA removal proportion.

It should be noted that the first host removal process, the second host removal process and the third host removal process in the embodiment of the present invention are three different host removal processes, where a host removal process removes a portion of the host DNA in the infection specimen, and different host removal processes affect the parasitic strains in the infection specimen differently. In order to reduce the influence that the choice of host removal method has on the parasitic strain sequences of the parasitic strains in the infection specimen, three different host removal processes are selected to pretreat the infection specimen respectively.

Optionally, the first host removal process, the second host removal process and the third host removal process are, respectively: a method of adsorbing and removing host DNA by using protein magnetic beads, a method of preferentially lysing host cells and degrading the host DNA, and a method of removing host DNA by using nested PCR. All three methods are prior art and are not described herein.

It should be noted that the first sample, the second sample, and the third sample are three samples obtained through the first host removal process, the second host removal process, and the third host removal process, and the mixed sample is a sample obtained by mixing equal parts of the first sample, the second sample, and the third sample.

S2, sequencing the mixed sample to obtain one or more parasitic strain sequences and a host sequence, sequentially extracting the parasitic strain sequences from the one or more parasitic strain sequences, and performing the following operations on the extracted parasitic strain sequences: obtaining a plurality of unit detection samples based on the mixed sample, and sequentially extracting the unit detection samples from the plurality of unit detection samples;

It can be appreciated that, because one or more parasitic strains may exist in the mixed sample and different parasitic strains have different parasitic strain sequences, identifying the parasitic strains in the mixed sample requires performing the identification operation on the one or more parasitic strain sequences in turn. A parasitic strain sequence is the DNA sequence of a parasitic strain, and the species of the parasitic strain can be identified by identifying its parasitic strain sequence.

It should be noted that the host sequence is the DNA sequence of the host. Sequencing is a gene detection technique by which the gene sequence can be analyzed and determined from the sample; the technique is prior art and is not described herein. A unit detection sample is one of the plurality of samples obtained by dividing the mixed sample into equal parts. The nucleic acid concentration of the mixed sample is high enough that dividing it into a plurality of unit detection samples does not cause the DNA of the parasitic strains to go undetected in a unit detection sample. The aim of splitting the mixed sample into a plurality of unit detection samples and detecting each of them is to reduce inaccurate detection results caused by improper operation during sequencing.

S3, performing the following operations on all the extracted unit detection samples: obtaining the number of the parasitic strain sequences based on the extracted parasitic strain sequences, obtaining the number of the host sequences based on the mixed sample, and obtaining an initial ratio by utilizing the number of the host sequences and the number of the parasitic strain sequences; summarizing a plurality of initial ratios to obtain an initial ratio set, and acquiring an initial ratio average value of the extracted parasitic strain sequences based on the initial ratio set;

It can be understood that in the embodiment of the present invention, the mixed sample is divided into a plurality of unit detection samples, an initial ratio of the extracted parasitic strain sequence is obtained from each unit detection sample, and the initial ratios corresponding to the unit detection samples are summarized, so that the initial ratio set includes a plurality of initial ratios whose number equals the number of unit detection samples. The beneficial effects of this step are: the initial ratio average value obtained based on the initial ratio set is more representative, so that errors caused by the low sequence number of the parasitic strain in the process of obtaining the number of host sequences and the number of parasitic strain sequences are reduced.

Further, the obtaining the initial ratio average value of the extracted parasitic strain sequence based on the initial ratio set includes:

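the calculation formula of the initial ratio is as follows:

$$x_i = \frac{a_i}{b_i}$$

the calculation formula of the initial ratio average value is as follows:

$$P = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Wherein $x_i$ is the initial ratio of the unit detection sample with the sequence number $i$, $a_i$ and $b_i$ are its number of parasitic strain sequences and its number of host sequences, $P$ is the initial ratio average value, and $n$ is the total number of initial ratios in the initial ratio set.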

It is understood that the number of host sequences is the number of DNA sequence fragments of the host detected in the unit detection sample, the number of parasitic strain sequences is the number of DNA sequence fragments of the parasitic strain detected in the unit detection sample, and both the number of host sequences and the number of parasitic strain sequences can be obtained from the sequencing results.
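As an illustration of step S3, the following Python sketch computes an initial ratio for each unit detection sample and then the initial ratio average value. It assumes the initial ratio is the number of parasitic strain sequences divided by the number of host sequences, in line with how the initial standard ratio is defined for the standard library. The function names and the read counts are hypothetical.

```python
from statistics import fmean


def initial_ratio(parasite_count: int, host_count: int) -> float:
    """Initial ratio x_i = a_i / b_i for one unit detection sample (assumed form)."""
    if host_count <= 0:
        raise ValueError("the number of host sequences must be positive")
    return parasite_count / host_count


def initial_ratio_mean(counts: list[tuple[int, int]]) -> float:
    """Initial ratio average value P over all unit detection samples.

    counts holds one (a_i, b_i) pair per unit detection sample.
    """
    if not counts:
        raise ValueError("at least one unit detection sample is required")
    initial_ratio_set = [initial_ratio(a, b) for a, b in counts]
    return fmean(initial_ratio_set)


# Hypothetical (parasitic strain sequences, host sequences) per unit detection sample.
unit_counts = [(12, 60000), (15, 50000), (9, 45000)]
P = initial_ratio_mean(unit_counts)
print(f"{P:.6f}")
```

Averaging over several unit detection samples dampens the effect of a single low or missing parasitic read count, which is the stated purpose of splitting the mixed sample.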

S4, sequentially extracting standard parasitic strains from a pre-constructed standard parasitic strain sequence library, acquiring a standard ratio set based on the extracted standard parasitic strains, and calculating an identification index by utilizing a pre-constructed identification index calculation formula, the standard ratio set and an initial ratio average value;

It can be understood that the standard parasitic strain sequence library is a database constructed based on known parasitic strains, and the standard ratio set is obtained from the standard parasitic strain sequence library. Optionally, the sources of the data in the standard parasitic strain sequence library, such as the standard parasitic strains, the number of standard parasitic strain sequences and the number of standard host sequences, include: existing literature, existing databases, and other published data.

In another embodiment of the present invention, referring to fig. 2, the sequentially extracting standard parasitic strains from the pre-constructed standard parasitic strain sequence library includes:

S41, obtaining a plurality of standard parasitic strains, sequentially extracting the standard parasitic strains from the plurality of standard parasitic strains, and executing the following operations on the extracted standard parasitic strains:

Obtaining a plurality of standard infection samples based on a standard parasitic strain, sequentially extracting standard infection samples from the plurality of standard infection samples, and performing the following operations on all of the extracted standard infection samples;

S42, acquiring the number of standard parasitic strain sequences, the number of standard host sequences, host information and parasitic strain information based on the standard infection sample, calculating an initial standard ratio according to the number of standard parasitic strain sequences and the number of standard host sequences, and summarizing a plurality of initial standard ratios to obtain an initial standard ratio set;

S43, acquiring abnormal values in an initial standard ratio set based on the host information, removing the abnormal values in the initial standard ratio set to obtain a first optimization set, acquiring an optimization mean and an optimization standard deviation according to the first optimization set, and screening the first optimization set based on the optimization mean and the optimization standard deviation to obtain a standard ratio set;

S44, constructing a standard ratio library corresponding to the standard parasitic strain based on the standard ratio set;

S45, summarizing a standard ratio library corresponding to each standard parasitic strain in the plurality of standard parasitic strains to obtain a standard parasitic strain sequence library;

S46, sequentially extracting standard parasitic strains from the standard parasitic strain sequence library.

It is understood that the standard parasitic strains are known strains obtained from a database. A standard infection sample is a sample infected with a standard parasitic strain. The number of standard parasitic strain sequences is the number of DNA sequence fragments of the standard parasitic strain in the standard infection sample, and the number of standard host sequences is the number of host DNA sequence fragments in the standard infection sample. The host information includes the disease history, sex, age, DNA sequence fragments, etc. of the host, and the parasitic strain information includes the type of the standard parasitic strain, its DNA sequence fragments, etc. The initial standard ratio is the ratio of the number of standard parasitic strain sequences to the number of standard host sequences obtained from the database, and the initial standard ratio set is a set of a plurality of initial standard ratios.

It should be noted that, because the standard parasitic strains and related data in the database come from a wide range of sources, the initial standard ratios obtained for the same standard parasitic strain contain errors. The initial standard ratios for the same standard parasitic strain are therefore summarized into an initial standard ratio set, and data cleaning is performed on that set so that the cleaned set is more representative.

It is understood that an abnormal value (outlier) is an initial standard ratio that is likely to introduce a large error into the nucleic acid identification result, for example because the host has a potentially serious disease, the parasitic strain has mutated, or the host of the standard infection sample and the host of the mixed sample belong to different species. Such initial standard ratios should be removed. The first optimization set is the set of initial standard ratios that remain after the abnormal values are removed from the initial standard ratio set.

It should be noted that the first optimization set must contain enough initial standard ratios that the optimization mean and the optimization standard deviation derived from it are not subject to large errors caused by a small sample size. The optimization mean is the mean of all initial standard ratios in the first optimization set, and the optimization standard deviation is their standard deviation. Optionally, the first optimization set is screened using the 3σ rule of the normal distribution: only the initial standard ratios lying within the interval (μ - 3σ, μ + 3σ) are kept, where μ is the optimization mean and σ is the optimization standard deviation. The aim is to remove gross errors from the first optimization set.
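The two cleaning stages of S43 can be illustrated with a short sketch. It assumes each database record carries a ratio and a host-information dictionary, that abnormal hosts are flagged by a caller-supplied predicate (the text does not specify the criterion), and that the population standard deviation is used. The record layout, the `serious_disease` key and all function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import Callable, Dict, List


def remove_abnormal(records: List[Dict], is_abnormal_host: Callable[[Dict], bool]) -> List[float]:
    """First stage: drop ratios whose host information marks them as abnormal."""
    return [r["ratio"] for r in records if not is_abnormal_host(r["host"])]


def three_sigma_screen(first_optimization_set: List[float]) -> List[float]:
    """Second stage: keep only ratios inside the open interval (mu - 3s, mu + 3s)."""
    mu = mean(first_optimization_set)
    sigma = pstdev(first_optimization_set)  # assumption: population standard deviation
    if sigma == 0:
        # All values are identical, so there is nothing to screen out.
        return list(first_optimization_set)
    return [r for r in first_optimization_set if mu - 3 * sigma < r < mu + 3 * sigma]


# Hypothetical records for one standard parasitic strain.
records = [{"ratio": 0.003, "host": {"serious_disease": False}} for _ in range(20)]
records.append({"ratio": 0.030, "host": {"serious_disease": False}})  # gross error
records.append({"ratio": 0.500, "host": {"serious_disease": True}})   # abnormal host

first_set = remove_abnormal(records, lambda host: host["serious_disease"])
standard_ratio_set = three_sigma_screen(first_set)
```

With a population standard deviation, a single value can lie more than 3σ from the mean only when the set has at least eleven members, which is one concrete reason the first optimization set needs to be large enough.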

In detail, the standard ratio set is a set of all initial standard ratios after the initial standard ratios in the first optimized set are screened, the standard ratio set comprises a plurality of standard ratios, and the standard ratios are initial standard ratios reserved after the initial standard ratios in the first optimized set are screened. The standard ratio library is a database constructed based on a standard ratio set.

Specifically, the obtaining the standard ratio set based on the extracted standard parasitic strain, and calculating the identification index by using a pre-constructed identification index calculation formula, the standard ratio set and an initial ratio mean value includes:

It can be understood that the identification index indicates which parasitic strain an extracted parasitic strain sequence belongs to. It compares the degree of similarity between the standard parasitic strain represented by the standard mean and the parasitic strain, represented by the initial ratio average value, whose sequence was extracted from the mixed sample. The larger the identification index, the higher the probability that the standard parasitic strain and the parasitic strain corresponding to the extracted parasitic strain sequence are the same strain. The standard mean is the mean of all standard ratios in the standard ratio set.

By way of example, the parasitic strain sequence corresponding to an X strain is extracted from the mixed sample, and the initial ratio average value corresponding to the X parasitic strain sequence is obtained. Standard strains A, B and C are then extracted in turn from the standard parasitic strain sequence library. After strain A is extracted, the standard ratio set corresponding to strain A is retrieved from the library and the standard mean of strain A is obtained from it. The identification index of strain A and strain X, which describes the degree of similarity between them, is calculated from the standard mean of strain A, the initial ratio average value of the X parasitic strain sequence and the identification index calculation formula. The same calculation is then performed for strain B and strain C.

S5, summarizing a plurality of identification indexes to obtain an identification index set, and acquiring an initial nucleic acid subset based on the identification index set, wherein the initial nucleic acid subset comprises one or more initial nucleic acid categories;

In the embodiment of the present invention, because the identification index is calculated for each standard parasitic strain in the standard parasitic strain sequence library, there are a plurality of identification indexes, and the identification index set is the set of all of them.

Further, the obtaining an initial nucleic acid subset based on the identification index set includes:

and obtaining an initial nucleic acid subset based on the standard strain set.

It should be noted that the identification threshold is the minimum acceptable value of the identification index. When an identification index is lower than the identification threshold, the corresponding standard parasitic strain is not the parasitic strain corresponding to the extracted parasitic strain sequence; that is, its apparent presence may result from improper operation or contamination of the mixed sample. The optimization index set is the set of identification indexes larger than the identification threshold, and the descending index set is the optimization index set arranged from large to small. Because a larger identification index means a higher probability that the standard parasitic strain and the parasitic strain corresponding to the extracted parasitic strain sequence are the same strain, the descending order allows standard parasitic strains with larger identification indexes to be checked first in the subsequent verification experiment, which improves verification efficiency.

It is understood that the standard strain set is a set of standard parasitic strains; because gene segments of different strains can be similar, it may include one or more standard parasitic strains. The initial nucleic acid subset is the set of standard parasitic strains that the parasitic strain corresponding to the extracted parasitic strain sequence may be.

Further, the screening the extracted identification index set by using a preset identification threshold value to obtain an optimized index set includes:

Further, the discrimination threshold is 50%.

It is understood that the optimization index set is the set of identification indexes obtained by screening the identification indexes in the identification index set.

By way of example, an X parasitic strain is obtained based on an X parasitic strain sequence, and standard strains A, B, C and D are extracted in turn from the standard parasitic strain sequence library. The X-A, X-B, X-C and X-D identification indexes are calculated for strains A, B, C and D respectively. Because the gene fragments of strain A and strain B are similar, the X-A and X-B identification indexes are both 80%, which is higher than the identification threshold. The X-C identification index is 40%, which is lower than the identification threshold, and the X-D identification index is 60%, which is higher than the identification threshold. Strain C is therefore removed, and the optimization index set comprises the X-A and X-B identification indexes (80%) and the X-D identification index (60%). The descending index set arranges these identification indexes from large to small, and the A strain, the B strain and the D strain are obtained in turn from it. The standard strain set comprises the A strain, the B strain and the D strain, and the initial nucleic acid class set likewise comprises the A strain, the B strain and the D strain; that is, the parasitic strain corresponding to the extracted parasitic strain sequence may be the A strain, the B strain or the D strain.
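The screening and ordering in this example can be reproduced with a minimal sketch. It takes the identification indexes as already computed (the calculation formula is not reproduced here), expresses percentages as fractions, and treats an index exactly equal to the threshold as not passing, which is an assumption. The function name is hypothetical.

```python
from typing import Dict, List, Tuple


def screen_and_rank(index_set: Dict[str, float], threshold: float = 0.5) -> List[Tuple[str, float]]:
    """Keep indexes above the threshold, then order them from large to small."""
    # Optimization index set: identification indexes larger than the threshold.
    optimization_set = [(strain, value) for strain, value in index_set.items() if value > threshold]
    # Descending index set: sorted() is stable, so tied strains keep their input order.
    return sorted(optimization_set, key=lambda item: item[1], reverse=True)


# Identification indexes of strain X against standard strains A to D, as in the example.
index_set = {"A": 0.80, "B": 0.80, "C": 0.40, "D": 0.60}
descending_index_set = screen_and_rank(index_set)
standard_strain_set = [strain for strain, _ in descending_index_set]
print(standard_strain_set)  # ['A', 'B', 'D']
```

Strains earlier in the resulting list are the ones to verify first by amplification.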

S6, determining the category of the parasitic nucleic acid corresponding to the extracted parasitic strain sequence by utilizing a pre-constructed nucleic acid amplification technology and an initial nucleic acid set, and completing the identification of the low-sequence nucleic acid;

In the embodiment of the invention, in order to further verify whether the standard parasitic strains in the initial nucleic acid class set are the parasitic strains corresponding to the extracted parasitic strain sequences, a pre-constructed nucleic acid amplification technology is utilized to re-identify the parasitic strains in the mixed sample.

Specifically, the determining the parasitic nucleic acid class corresponding to the extracted parasitic strain sequence by using the pre-constructed nucleic acid amplification technology and the initial nucleic acid set comprises the following steps:

Optionally, the nucleic acid amplification technology is a PCR amplification technology, which is prior art and will not be described in detail herein.

It should be noted that, because the standard parasitic strains in the initial nucleic acid class set have similar genes, an amplification primer is designed for each standard parasitic strain in the set. An amplified sample is a sample obtained using the PCR amplification technique, and the amplified sample set is the set of amplified samples. The parasitic nucleic acid class of the parasitic strain corresponding to the extracted parasitic strain sequence can be determined from the amplified samples; this is prior art and will not be described herein.

Compared with the problems in the background art, the embodiment of the invention first acquires an infection sample and pretreats it to obtain a mixed sample. Three different host removal processing methods are used during pretreatment, which reduces the influence that the choice of host removal method has on the parasitic strain sequences in the infection sample. The mixed sample is sequenced to obtain one or more parasitic strain sequences and a host sequence, and the parasitic strain sequences are extracted in turn. For each extracted parasitic strain sequence, the mixed sample is split into a plurality of unit detection samples and each unit detection sample is detected, which reduces the probability of inaccurate results caused by improper operation during sequencing. For each unit detection sample, the number of parasitic strain sequences is obtained based on the extracted parasitic strain sequence, the number of host sequences is obtained based on the mixed sample, and an initial ratio is obtained from the two numbers. The initial ratios are summarized into an initial ratio set, from which the initial ratio average value of the extracted parasitic strain sequence is obtained; this makes the average value more representative and reduces the errors that the low sequence number of the parasitic strain introduces when the number of host sequences and the number of parasitic strain sequences are obtained.

Standard parasitic strains are then extracted in turn from a pre-constructed standard parasitic strain sequence library, a standard ratio set is obtained based on each extracted standard parasitic strain, and an identification index is calculated using a pre-constructed identification index calculation formula, the standard ratio set and the initial ratio average value. The identification index compares the degree of similarity between the standard parasitic strain represented by the standard mean and the parasitic strain, represented by the initial ratio average value, whose sequence was extracted from the mixed sample; the larger the identification index, the higher the probability that the two are the same parasitic strain. The identification indexes are summarized into an identification index set, and an initial nucleic acid subset comprising one or more initial nucleic acid categories is obtained based on that set. The embodiment of the invention takes into account the interference caused by similarity between genes by screening the identification indexes, takes into account the influence of improper experimental operation, and checks standard parasitic strains with larger identification indexes first, which improves verification efficiency. Finally, a pre-constructed nucleic acid amplification technology and the initial nucleic acid set are used to determine the parasitic nucleic acid category corresponding to the extracted parasitic strain sequence, completing the low-sequence nucleic acid identification.

Compared with the background art, the method increases the proportion of parasitic strain sequences by pretreating the sample, which improves detection accuracy in subsequent processing. The parasitic strains corresponding to the extracted parasitic strain sequences are screened using the standard parasitic strain sequence library and the identification index, and are then identified again using the PCR amplification technique. This addresses the problems noted in the background art, namely that low-abundance strains are difficult to detect, unknown parasitic strains cannot be detected, and multiple parasitic strains cannot be detected simultaneously, and it improves the accuracy of low-sequence number nucleic acid identification.

Example 2:

FIG. 3 is a functional block diagram of a low-sequence-number nucleic acid discrimination system according to an embodiment of the present invention.

The low-sequence number nucleic acid identification system 100 of the present invention may be installed in an electronic device. Depending on the function implemented, the low-sequence number nucleic acid identification system 100 may include an initial ratio average acquisition module 101, a fungus primary identification module 102, and a parasitic nucleic acid class acquisition module 103. The module of the invention, which may also be referred to as a unit, refers to a series of computer program segments, which are stored in the memory of the electronic device, capable of being executed by the processor of the electronic device and of performing a fixed function.

The initial ratio average value obtaining module 101 is configured to obtain an infection sample, pre-process the infection sample to obtain a mixed sample, perform sequencing on the mixed sample to obtain one or more parasitic strain sequences and a host sequence, sequentially extract the parasitic strain sequences from the one or more parasitic strain sequences, and perform the following operations on the extracted parasitic strain sequences: obtaining a plurality of unit detection samples based on the mixed sample, sequentially extracting the unit detection samples from the plurality of unit detection samples, and performing the following operations on the extracted unit detection samples: obtaining the number of the parasitic strain sequences based on the extracted parasitic strain sequences, obtaining the number of the host sequences based on the mixed sample, and obtaining an initial ratio by utilizing the number of the host sequences and the number of the parasitic strain sequences; summarizing a plurality of initial ratios to obtain an initial ratio set, and acquiring an initial ratio average value of the extracted parasitic strain sequences based on the initial ratio set;

The fungus primary identification module 102 is configured to sequentially extract standard parasitic strains from a pre-constructed standard parasitic strain sequence library, obtain a standard ratio set based on the extracted standard parasitic strains, and calculate an identification index by using a pre-constructed identification index calculation formula, the standard ratio set and an initial ratio average; summarizing a plurality of identification indexes to obtain an identification index set, and acquiring an initial nucleic acid subset based on the identification index set, wherein the initial nucleic acid subset comprises one or more initial nucleic acid categories;

The parasitic nucleic acid class obtaining module 103 is configured to determine a parasitic nucleic acid class corresponding to the extracted parasitic strain sequence by using a pre-constructed nucleic acid amplification technology and an initial nucleic acid set, and complete the low-sequence nucleic acid identification.

In detail, the modules in the low-sequence-number nucleic acid identification system 100 in the embodiment of the present invention use the same technical means as the low-sequence-number nucleic acid identification method described in fig. 1 and can produce the same technical effects, and are not described herein.

Example 3:

Fig. 4 is a schematic structural diagram of an electronic device for implementing a low-sequence number nucleic acid identification method according to an embodiment of the invention.

The electronic device 1 may comprise a processor 10, a memory 11, a bus 12 and a communication interface 13, and may further comprise a computer program, such as a low-sequence number nucleic acid discrimination program, stored in the memory 11 and executable on the processor 10.

The memory 11 includes at least one type of readable storage medium, including flash memory, a mobile hard disk, a multimedia card, a card memory (e.g., SD or DX memory, etc.), a magnetic memory, a magnetic disk, an optical disk, etc. The memory 11 may in some embodiments be an internal storage unit of the electronic device 1, such as a removable hard disk of the electronic device 1. The memory 11 may in other embodiments also be an external storage device of the electronic device 1, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card or the like provided on the electronic device 1. Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic device 1. The memory 11 may be used not only for storing application software installed in the electronic device 1 and various types of data, such as the code of a low-sequence number nucleic acid identification program, but also for temporarily storing data that has been output or is to be output.

The processor 10 may in some embodiments be composed of integrated circuits, for example a single packaged integrated circuit, or multiple packaged integrated circuits with the same or different functions, including one or more central processing units (CPUs), microprocessors, digital processing chips, graphics processors, combinations of various control chips, and the like. The processor 10 is the control unit of the electronic device; it connects the parts of the entire electronic device using various interfaces and lines, runs or executes the programs or modules stored in the memory 11 (e.g., a low-sequence number nucleic acid identification program), and invokes data stored in the memory 11 to perform the various functions of the electronic device 1 and to process data.

The bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be classified as an address bus, a data bus, a control bus, etc. The bus is arranged to enable connection and communication between the memory 11, the at least one processor 10 and other components.

Fig. 4 shows only an electronic device with components, it being understood by a person skilled in the art that the structure shown in fig. 4 does not constitute a limitation of the electronic device 1, and may comprise fewer or more components than shown, or may combine certain components, or may be arranged in different components.

For example, although not shown, the electronic device 1 may further include a power source (such as a battery) for supplying power to each component. Preferably, the power source may be logically connected to the at least one processor 10 through a power management device, so that functions such as charge management, discharge management and power consumption management are implemented through the power management device. The power source may also include one or more direct current or alternating current power supplies, recharging devices, power failure detection circuits, power converters or inverters, power status indicators, and the like. The electronic device 1 may further include various sensors, Bluetooth modules, Wi-Fi modules, etc., which will not be described herein.

Further, the electronic device 1 may also comprise a network interface. Optionally, the network interface may comprise a wired interface and/or a wireless interface (e.g. a Wi-Fi interface, a Bluetooth interface, etc.), typically used for establishing a communication connection between the electronic device 1 and other electronic devices.

The electronic device 1 may optionally further comprise a user interface, which may be a display or an input unit such as a keyboard, or a standard wired or wireless interface. Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display may also be referred to as a display screen or display unit, and is used for displaying information processed in the electronic device 1 and for displaying a visual user interface.

It should be understood that the embodiments described are for illustrative purposes only, and that the scope of the patent application is not limited to this configuration.

The low-sequence number nucleic acid discrimination program stored in the memory 11 in the electronic device 1 is a combination of a plurality of instructions that, when executed in the processor 10, can realize:

Specifically, the specific implementation method of the above instructions by the processor 10 may refer to descriptions of related steps in the corresponding embodiments of fig. 1 to 3, which are not repeated herein.

Further, the modules/units integrated in the electronic device 1 may be stored in a computer readable storage medium if implemented in the form of software functional units and sold or used as separate products. The computer readable storage medium may be volatile or nonvolatile. For example, the computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, or a read-only memory (ROM).

The present invention also provides a computer readable storage medium storing a computer program which, when executed by a processor of an electronic device, can implement:

The modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical units, may be located in one place, or may be distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional modules in the embodiments of the present invention may be integrated in one processing unit, each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in the form of hardware, or in the form of hardware plus software functional modules.

It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof.

Finally, it should be noted that the above-mentioned embodiments are merely for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made to the technical solution of the present invention without departing from the spirit and scope of the technical solution of the present invention.

Claims

1. A method for identifying a low-sequence number nucleic acid, comprising

2. The method for identifying a low-sequence number nucleic acid according to claim 1, wherein the steps of obtaining an infected specimen, and pretreating the infected specimen to obtain a mixed sample, comprise:

Performing first host removal processing on a first unit specimen based on a preset host DNA removal proportion to obtain a first specimen, performing second host removal processing on a second unit specimen and performing third host removal processing on a third unit specimen according to the host DNA removal proportion to obtain a second specimen and a third specimen, wherein the first host removal processing, the second host removal processing and the third host removal processing are three different host removal processing methods;

3. The method of claim 1, wherein the obtaining an initial ratio average of the extracted parasitic species sequences based on the initial ratio set comprises:

the calculation formula of the initial ratio is as follows:

the calculation formula of the initial ratio average value is as follows:

4. The method for identifying a low-sequence number nucleic acid according to claim 3, wherein the sequential extraction of the standard parasitic species from the library of pre-constructed standard parasitic species sequences comprises:

5. The method of claim 4, wherein the obtaining a standard ratio set based on the extracted standard parasitic species, and calculating the discrimination index using a pre-constructed discrimination index calculation formula, standard ratio set and initial ratio mean, comprises:

6. The method of low sequence number nucleic acid identification of claim 1, wherein said obtaining an initial nucleic acid subset based on said identification index set comprises:

and obtaining an initial nucleic acid subset based on the standard strain set.

7. The method of claim 6, wherein the screening the extracted discrimination index set with a predetermined discrimination threshold to obtain an optimized index set comprises:

8. The method of identifying a low sequence number nucleic acid of claim 6, wherein the identification threshold is 50%.

9. The method of any one of claims 1 to 8, wherein determining the class of parasitic nucleic acids corresponding to the extracted parasitic species sequence using a pre-constructed nucleic acid amplification technique and an initial nucleic acid set comprises:

10. A low sequence number nucleic acid identification system, the system comprising: