CN113539371B

CN113539371B - Sequence encoding method and device and readable storage medium

Info

Publication number: CN113539371B
Application number: CN202110756922.5A
Authority: CN
Inventors: 李毅; 季强; 樊青远; 张博; 宋昆
Original assignee: Southern University of Science and Technology
Current assignee: Southern University of Science and Technology
Priority date: 2021-07-05
Filing date: 2021-07-05
Publication date: 2023-06-23
Anticipated expiration: 2041-07-05
Also published as: CN113539371A

Abstract

The application provides a sequence encoding method and device and a readable storage medium. The sequence encoding method comprises the following steps: acquiring a plurality of first sequence codes, each first sequence code being a sequence code of a specified number of bits corresponding to bases of a preset number of bits; screening the plurality of first sequence codes according to a preset screening algorithm to obtain a plurality of screened first sequence codes; determining, according to an incremental clustering algorithm, a plurality of representative sequence codes from the screened first sequence codes based on the distances between the sequences corresponding to the screened first sequence codes; splicing the plurality of representative sequence codes to determine a plurality of second sequence codes, the number of bits of each second sequence code being greater than the specified number of bits; and generating a plurality of final sequence codes from the plurality of second sequence codes, the nucleic acid sequence corresponding to each final sequence code being used to label a nucleic acid to be tested. The method effectively generates sequence codes with a low false recognition rate.

Description

Sequence encoding method and device and readable storage medium

Technical Field

The present application relates to the field of nucleic acid sequence encoding technology, and in particular, to a method and apparatus for encoding a sequence, and a readable storage medium.

Background

Third-generation nucleic acid sequencing technology uses known sequence codes to generate the corresponding known sequences and attaches each known sequence to the head of an unknown sequence to be sequenced, so that the unknown sequence can be identified and multiplexing can be achieved.

In the prior art, sequence codes are generated by methods such as local mutation iteration. Such methods may converge to a local optimum rather than the global optimum, so the sequences corresponding to the resulting sequence codes are easily misidentified.

If the optimal sequence set could be found in the solution space from the perspective of global optimization, the problem of false recognition could be solved. However, the storage capacity of current conventional computer architectures is insufficient for searching such a large solution space.

Therefore, the prior art lacks a method for efficiently generating sequence codes with a low false recognition rate.

Disclosure of Invention

An object of the embodiments of the present application is to provide a method and apparatus for encoding a sequence, and a readable storage medium, so as to implement efficient generation of a sequence encoding with low false recognition rate.

In a first aspect, an embodiment of the present application provides a method for encoding a sequence, including: acquiring a plurality of first sequence codes, where each first sequence code is a sequence code of a specified number of bits corresponding to bases of a preset number of bits; screening the plurality of first sequence codes according to a preset screening algorithm to obtain a plurality of screened first sequence codes; determining, according to an incremental clustering algorithm, a plurality of representative sequence codes from the screened plurality of first sequence codes based on the distances between the sequences corresponding to the respective sequence codes in the screened plurality of first sequence codes; splicing the plurality of representative sequence codes to determine a plurality of second sequence codes, where the number of bits of each second sequence code is greater than the specified number of bits; and generating a plurality of final sequence codes from the plurality of second sequence codes, where the nucleic acid sequence corresponding to each final sequence code is used for labeling the nucleic acid to be tested.

Compared with the prior art, the embodiment of the application adopts a global search method based on segmentation and combination: sequence codes of the specified number of bits are screened, representative sequence codes are determined from the screened sequence codes, the representative sequence codes are spliced to obtain sequence codes with more bits, and the final sequence codes are generated from those longer sequence codes. In this process, long-chain sequence codes are generated iteratively by incremental clustering and splicing. This avoids having to store the entire long-chain sequence space, which is too large, and also avoids the tendency of genetic algorithms and similar methods to find local rather than global optima. Sequence codes with a low false recognition rate can therefore be generated effectively, and applying them in multiplexing scenarios enables effective sequencing of multiplexed nucleic acids.

As a possible implementation manner, the determining, according to an incremental clustering algorithm, a plurality of representative sequence codes from the screened plurality of first sequence codes based on distances between sequences corresponding to respective sequence codes in the screened plurality of first sequence codes includes: for the current sequence code to be judged, calculating the distance between the sequence corresponding to the current sequence code to be judged and the sequences respectively corresponding to the plurality of previously determined representative sequence codes; if the distance between the sequence corresponding to the current sequence code to be judged and each sequence in the sequences corresponding to the plurality of the representative sequence codes determined in advance is larger than a preset threshold value, determining the current sequence code to be judged as the representative sequence code; and if the distance between the sequence corresponding to the current sequence code to be judged and at least one sequence in the sequences respectively corresponding to the plurality of the previously determined representative sequence codes is smaller than or equal to a preset threshold value, determining that the current sequence code to be judged is not the representative sequence code.

In the embodiment of the application, when the incremental clustering algorithm is applied, whether a sequence code is a representative sequence code is judged effectively on the basis of the distances between sequences; and because every sequence code is judged in turn by this iterative procedure, a global search is achieved.

As a possible implementation manner, the calculating of the distance between the sequence corresponding to the sequence code currently to be judged and the sequences respectively corresponding to the plurality of previously determined representative sequence codes includes: calculating that distance through a preset distance algorithm and a DTW (Dynamic Time Warping) function.

In the embodiment of the application, the distance can be calculated effectively by a dynamic time warping algorithm built on the distance algorithm, and sequence codes generated from representative sequence codes screened on the basis of this distance perform better in application.

As a possible implementation manner, the calculating, by using a preset distance algorithm and a DTW function, a distance between a sequence corresponding to the currently to-be-determined sequence code and a sequence corresponding to each of a plurality of previously determined representative sequence codes includes: generating current signals respectively corresponding to a sequence corresponding to the current sequence code to be judged and sequences respectively corresponding to a plurality of previously determined representative sequence codes; and determining the distance between the sequence corresponding to the sequence code to be judged currently and the sequences corresponding to the plurality of the representative sequence codes determined in advance according to the preset distance algorithm, the DTW function and the current signal.

In the embodiment of the application, by generating the current signal, the distance can be effectively calculated based on the current signal.

As a possible implementation manner, the preset distance algorithm is the Bhattacharyya distance algorithm or the Euclidean distance algorithm.

In the embodiment of the application, the distance can be calculated effectively by a dynamic time warping algorithm based on the Bhattacharyya distance or the Euclidean distance, and sequence codes generated from representative sequence codes screened on the basis of this distance perform better in application.

As a possible implementation manner, the generating a plurality of final sequence codes according to the plurality of second sequence codes includes: screening the plurality of second sequence codes to obtain a plurality of screened second sequence codes; determining a plurality of new representative sequence codes from the plurality of screened second sequence codes based on distances between sequences corresponding to each sequence code in the plurality of screened second sequence codes according to an incremental clustering algorithm; splicing the plurality of new representative sequence codes to determine a plurality of third sequence codes; the third sequence encodes a greater number of bits than the second sequence; generating a plurality of final sequence codes according to the plurality of third sequence codes.

In the embodiment of the application, after the sequence codes with higher digits are obtained based on the representative sequence code splicing, the global search process can be iterated, so that more final sequence codes are generated, and the method and the device can be better applied to multiplexing application scenes.

As a possible implementation manner, the encoding method further includes: determining a marker sequence code from the plurality of final sequence codes; determining a check code corresponding to the marker sequence code; synthesizing a marker nucleic acid sequence corresponding to the marker sequence code, and synthesizing a check nucleic acid sequence corresponding to the check code; and sequentially connecting the marker nucleic acid sequence, the check nucleic acid sequence and the nucleic acid to be detected to obtain a labeled nucleic acid to be detected.

In the embodiment of the application, when the plurality of final sequence codes are applied, a check code can be determined for the marker sequence code, and the nucleic acid to be detected is then labeled with the sequence corresponding to the marker sequence code and the sequence corresponding to the check code. Because the sequence corresponding to the check code can be used to verify the sequence corresponding to the marker sequence code, the nucleic acid to be detected is labeled effectively.

As a possible implementation manner, the encoding method further includes: when nanopore sequencing is performed on the nucleic acid to be detected, obtaining the nanopore current signal corresponding to the labeled nucleic acid to be detected; separating, from that current signal, the current signal corresponding to the marker nucleic acid sequence and the current signal corresponding to the check nucleic acid sequence; inputting the current signal corresponding to the marker nucleic acid sequence into a pre-trained detection model to obtain the identifier corresponding to the marker nucleic acid sequence; determining a restored marker nucleic acid sequence according to the current signal corresponding to the check nucleic acid sequence; determining the identifier corresponding to the restored marker nucleic acid sequence according to a preset correspondence between final sequence codes and sequence code identifiers; comparing the identifier corresponding to the marker nucleic acid sequence with the identifier corresponding to the restored marker nucleic acid sequence; and processing the current signal corresponding to the marker nucleic acid sequence according to the comparison result.

In the embodiment of the application, when the nucleic acid to be detected is sequenced, the identifier corresponding to the marker nucleic acid sequence is determined from the current signal corresponding to the marker nucleic acid sequence, and the identifier corresponding to the restored marker nucleic acid sequence is determined from the check nucleic acid sequence. The two identifiers are then compared, and the current signal corresponding to the marker nucleic acid sequence can be processed effectively according to the comparison result.

As a possible implementation manner, the processing of the current signal corresponding to the marker nucleic acid sequence according to the comparison result includes: if the identifier corresponding to the marker nucleic acid sequence is inconsistent with the identifier corresponding to the restored marker nucleic acid sequence, storing the identifier corresponding to the marker nucleic acid sequence and the current signal corresponding to the marker nucleic acid sequence; the encoding method further includes: when training the initial detection model, using the identifier corresponding to the marker nucleic acid sequence and the current signal corresponding to the marker nucleic acid sequence as training data.

In the embodiment of the application, if the identifier corresponding to the marker nucleic acid sequence is inconsistent with the identifier corresponding to the restored marker nucleic acid sequence, the marker nucleic acid sequence may have been misidentified. Such cases can be used as training data for the detection model so as to improve its accuracy.

As a possible implementation manner, the screening of the plurality of first sequence codes according to a preset screening algorithm to obtain a plurality of screened first sequence codes includes: screening the plurality of first sequence codes according to a minimum free energy algorithm, and screening the plurality of first sequence codes according to a repeated sequence checking method, to obtain the plurality of screened first sequence codes; the minimum free energy algorithm is used for screening out codes corresponding to sequences with spatial structures, and the repeated sequence checking method is used for screening out codes corresponding to specified repeated sequences.

In the embodiment of the application, codes corresponding to sequences with spatial structures can be screened out by the minimum free energy algorithm, and codes corresponding to specified repeated sequences can be screened out by the repeated sequence checking method, thereby achieving effective screening of the sequence codes.

In a second aspect, an embodiment of the present application provides a coding apparatus for a sequence, including: various functional modules for implementing the coding method of the sequence described in the first aspect and any possible implementation manner of the first aspect.

In a third aspect, embodiments of the present application provide a readable storage medium having stored thereon a computer program which, when executed by a computer, performs the method of encoding a sequence as described in the first aspect and any one of the possible implementations of the first aspect.

Drawings

In order to illustrate the technical solutions of the embodiments of the present application more clearly, the drawings needed in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and should not be regarded as limiting its scope; a person skilled in the art may obtain other related drawings from these drawings without inventive effort.

FIG. 1 is a flow chart of a method of encoding a sequence provided by an embodiment of the present application;

FIG. 2 is a schematic structural diagram of a sequence encoding device according to an embodiment of the present application.

Reference numerals: 200 - sequence encoding device; 210 - acquisition module; 220 - processing module.

Detailed Description

The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.

The sequence encoding method provided by the embodiment of the application can be applied to multiplexing scenarios in third-generation nucleic acid sequencing. In such a scenario, sequence codes for labeling nucleic acids are generated first, and the corresponding marker sequences are then synthesized. When a nucleic acid to be detected is sequenced, the synthesized marker sequence is attached to the head of the nucleic acid to be detected so as to label it. After the corresponding current signal is obtained by nanopore sequencing, the current signal corresponding to the marker sequence and the current signal corresponding to the nucleic acid to be detected are separated from it according to the characteristics of the current signals of the marker sequence and of the nucleic acid to be detected. The identifier corresponding to the marker sequence is determined from the current signal corresponding to the marker sequence, which in turn identifies the nucleic acid to be detected, that is, which of the multiplexed nucleic acids it is. The sequence of the nucleic acid to be detected can be determined from the current signal corresponding to the nucleic acid to be detected.

In the embodiment of the application, on one hand, a sequence coding method is provided to realize the effective generation of the sequence codes with low false recognition rate; on the other hand, an application mode of the generated sequence code is described.

The technical solutions provided by the embodiment of the application may be executed by a code generating device and a nucleic acid sequencing device. The two may be the same device or different devices, and each may be an electronic device with data processing capability, such as a computer.

It is understood that there are four bases in total: A, T, C and G. A sequence code represents only an arrangement of these four bases, and the corresponding sequence can be synthesized according to that arrangement. For example, given the sequence code AACAGGACCAGGCGAAG, synthesizing the four bases in the order given by the sequence code yields the sequence corresponding to the sequence code.

In addition, base strings of different lengths have different numbers of possible arrangements. Taking 40 bases as an example, there are 4⁴⁰ possible arrangements. Searching all 4⁴⁰ arrangements directly for the optimal sequence codes is impractical given the storage limits described in the Background, whereas the method provided by the embodiment of the application can generate the sequence codes effectively.
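
The size of this space can be checked with a few lines of arithmetic. The sketch below assumes a storage cost of 2 bits per base, which is an illustrative assumption and is not stated in the application.

```python
# Size of the full code space for 40-base sequence codes over {A, T, C, G}.
BASES = "ATCG"
LENGTH = 40

total_codes = len(BASES) ** LENGTH        # 4**40, which equals 2**80

# Assumption for illustration: 2 bits per base, so 80 bits = 10 bytes per code.
bytes_per_code = (2 * LENGTH) // 8
total_bytes = total_codes * bytes_per_code

print(f"{total_codes:.3e} codes, about {total_bytes:.3e} bytes to store them all")
```

Under this assumption the complete space is on the order of 10²⁵ bytes, far beyond the storage of a conventional computer, which is why the method works on short codes and splices them.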

Referring next to fig. 1, a flowchart of a coding method of a sequence according to an embodiment of the present application is shown, where the coding method includes:

Step 110: acquire a plurality of first sequence codes, where each first sequence code is a sequence code of a specified number of bits corresponding to bases of a preset number of bits.

Step 120: screen the plurality of first sequence codes according to a preset screening algorithm to obtain a plurality of screened first sequence codes.

Step 130: determine, according to an incremental clustering algorithm, a plurality of representative sequence codes from the plurality of screened first sequence codes based on the distances between the sequences corresponding to the respective sequence codes in the plurality of screened first sequence codes.

Step 140: splice the plurality of representative sequence codes to determine a plurality of second sequence codes, where the number of bits of each second sequence code is greater than the specified number of bits.

Step 150: generate a plurality of final sequence codes from the plurality of second sequence codes, where the nucleic acid sequence corresponding to each final sequence code is used for labeling the nucleic acid to be tested.

Compared with the prior art, the embodiment of the application adopts a global search method based on segmentation and combination: sequence codes of the specified number of bits are screened, representative sequence codes are determined from the screened sequence codes, the representative sequence codes are spliced to obtain sequence codes with more bits, and the final sequence codes are generated from those longer sequence codes. In this process, long-chain sequence codes are generated iteratively by incremental clustering and splicing. This avoids having to store the entire long-chain sequence space, which is too large, and also avoids the tendency of genetic algorithms and similar methods to find local optima. Sequence codes with a low false recognition rate can therefore be generated effectively, and applying them in multiplexing scenarios enables effective sequencing of multiplexed nucleic acids.

Next, detailed embodiments of the encoding method of the sequence will be described.

In step 110, the preset number of bits is the number of bases, and a first sequence code is a sequence code of the specified number of bits among the sequence codes for bases of the preset number of bits. For example, the preset number of bits may be 40, in which case there are 4⁴⁰ possible arrangements of the bases in total. The sequence code of the specified number of bits may be the code of a short sequence; for example, if the preset number of bits is 40, the first sequence codes may be at least one kind of shorter k-mer sequence code, such as 9-mer, 10-mer or 11-mer codes.

There are a plurality of first sequence codes, and the acquisition in step 110 can be understood as determining these shorter sequence codes, which may be done with an established sequence code determination method.
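
One simple way to obtain the short codes, assumed here purely for illustration, is to enumerate every k-mer over the four bases. The application does not prescribe this; the function name `enumerate_kmers` is hypothetical.

```python
from itertools import product

BASES = "ATCG"

def enumerate_kmers(k):
    """Yield every sequence code of length k over the four bases."""
    for combo in product(BASES, repeat=k):
        yield "".join(combo)

# A small k keeps the demonstration short; k = 9 would already give 4**9 = 262144 codes.
first_codes = list(enumerate_kmers(3))
print(len(first_codes), first_codes[:5])
```

Because the number of k-mers grows as 4 to the power k, full enumeration is only feasible for the short lengths used in step 110.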

In step 120, a plurality of first sequence codes are screened. As an alternative embodiment, the step comprises: screening the plurality of first sequence codes according to a minimum free energy algorithm, and screening the plurality of first sequence codes according to a repeated sequence checking method to obtain screened plurality of first sequence codes; the minimum free energy algorithm is used for screening out codes corresponding to sequences with spatial structures, and the repeated sequence checking method is used for screening out codes corresponding to specified repeated sequences.

In this embodiment, codes corresponding to sequences with spatial structures can be screened out by the minimum free energy algorithm, and codes corresponding to specified repeated sequences can be screened out by the repeated sequence checking method, thereby achieving effective screening of the sequence codes.
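
The two screens can be sketched as follows. This is a simplified illustration only: the repeat check treats a run of more than three identical bases as the specified repeated sequence, and the structure check is a crude reverse-complement heuristic standing in for a real minimum free energy calculation. The names `has_repeat`, `may_fold` and `screen`, and the parameters `max_run` and `stem`, are assumptions.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def has_repeat(code, max_run=3):
    """True if any base occurs more than max_run times in a row."""
    run = 1
    for prev, cur in zip(code, code[1:]):
        run = run + 1 if cur == prev else 1
        if run > max_run:
            return True
    return False

def reverse_complement(code):
    return "".join(COMPLEMENT[b] for b in reversed(code))

def may_fold(code, stem=4):
    """Heuristic stand-in for a minimum free energy check: flag a code when a
    stretch of `stem` bases is followed later by its reverse complement,
    since the two stretches could pair into a hairpin."""
    for i in range(len(code) - 2 * stem + 1):
        segment = code[i:i + stem]
        if reverse_complement(segment) in code[i + stem:]:
            return True
    return False

def screen(codes):
    """Keep only codes that pass both checks."""
    return [c for c in codes if not has_repeat(c) and not may_fold(c)]

print(screen(["AAAAT", "ACGTCCACGT", "AACAGGAC"]))
```

In practice the structure check would be replaced by a thermodynamic folding tool with a free energy cut-off; the heuristic here only shows where that check sits in the pipeline.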

After the plurality of first sequence codes after the filtering is obtained in step 120, in step 130, a plurality of representative sequence codes are determined from the plurality of first sequence codes after the filtering according to an incremental clustering algorithm based on distances between sequences corresponding to respective ones of the plurality of first sequence codes after the filtering.

The incremental clustering algorithm is an iterative algorithm. In the embodiment of the application, based on the iterative algorithm, global searching can be performed on a plurality of first sequence codes, and a plurality of representative sequence codes can be determined from the first sequence codes. For ease of understanding, an example of the incremental clustering process is described.

A judgment criterion for representative sequence codes is preset, and a first representative sequence code is determined from the plurality of first sequence codes according to the judgment criterion. Each subsequent sequence code to be judged is then checked in turn against the judgment criterion: if it meets the criterion, it is a representative sequence code; otherwise it is not. By repeating this judgment, all representative sequence codes can be determined.

In the embodiment of the present application, the judgment criteria may be: the distance between the sequence corresponding to the sequence code to be judged and the determined sequence corresponding to the representative sequence code is larger than a preset threshold value. Thus, as an alternative embodiment, step 130 includes: for the current sequence code to be judged, calculating the distance between the sequence corresponding to the current sequence code to be judged and the sequences respectively corresponding to the plurality of previously determined representative sequence codes; if the distance between the sequence corresponding to the current sequence code to be judged and each sequence in the sequences corresponding to the plurality of the previously determined representative sequence codes is larger than a preset threshold value, determining the current sequence code to be judged as the representative sequence code; if the distance between the sequence corresponding to the current sequence code to be judged and at least one sequence in the sequences respectively corresponding to the plurality of the previously determined representative sequence codes is smaller than or equal to a preset threshold value, determining that the current sequence code to be judged is not the representative sequence code.

The previously determined representative sequence codes are those determined before the sequence code currently to be judged. For example, assuming that the sequence code currently to be judged is the 5th sequence code, if the 2nd and 3rd sequence codes have been determined to be representative sequence codes and the 1st and 4th sequence codes have been determined not to be, then the previously determined representative sequence codes are the 2nd and 3rd sequence codes.

Continuing with the above example, one sequence code is first selected at random from the plurality of first sequence codes as a representative sequence code, that is, the first representative sequence code is determined. Then, for a second sequence code to be judged (one of the remaining first sequence codes other than the representative sequence code), the distance between its corresponding sequence and the sequence corresponding to the representative sequence code is calculated. If the distance is greater than the preset distance value, this sequence code is also a representative sequence code; if the distance is less than or equal to the preset distance value, it is not. The judgment then continues with a third sequence code to be judged (one of the remaining first sequence codes other than the representative sequence code and the second one judged), and so on until all of the first sequence codes have been traversed and the plurality of representative sequence codes has been determined.

It can be understood that if the distances between the sequence corresponding to the sequence code currently to be judged and the sequences corresponding to the plurality of previously determined representative sequence codes are all greater than the preset threshold, the sequence code currently to be judged is a representative sequence code; otherwise it is not. For example, suppose there are 5 previously determined representative sequence codes. If the distances between the sequence corresponding to the sequence code currently to be judged and the sequences corresponding to the 5 representative sequence codes are all greater than the preset threshold, the sequence code currently to be judged is a representative sequence code; if at least one of these distances is less than or equal to the preset threshold, it is not.
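
The judgment rule above amounts to a greedy selection loop, sketched below. The Hamming distance is used only as a placeholder so that the example is self-contained; the embodiment uses a DTW-based distance on current signals. The first code in list order is taken as the first representative, whereas the text allows it to be chosen at random. The names `hamming` and `select_representatives` are assumptions.

```python
def hamming(a, b):
    """Placeholder distance between two equal-length codes."""
    return sum(x != y for x, y in zip(a, b))

def select_representatives(codes, distance, threshold):
    """Keep a code only if its distance to every representative kept so far
    is strictly greater than `threshold`."""
    representatives = []
    for code in codes:
        # all() over an empty list is True, so the first code is always kept.
        if all(distance(code, rep) > threshold for rep in representatives):
            representatives.append(code)
    return representatives

codes = ["AAAA", "AAAT", "TTTT", "TTTA", "CCGG"]
print(select_representatives(codes, hamming, threshold=2))
# prints ['AAAA', 'TTTT', 'CCGG']
```

Each candidate is compared only with the representatives found so far, so the full pairwise distance matrix never has to be stored.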

In the incremental clustering process, the calculation of the distance between the sequences corresponding to the sequence codes is important. As an alternative embodiment, the distance between the sequences corresponding to the sequence codes is calculated through a preset distance algorithm and a DTW function; that is, the distance between the sequence corresponding to the sequence code currently to be judged and the sequence corresponding to each of the plurality of previously determined representative sequence codes is calculated through the preset distance algorithm and the DTW function.

The distance algorithm is used to calculate the distance between points, and the DTW function is used to calculate the distance between sequences; combining the two enables effective calculation of the distance between sequences. Here, the distance between points may be the distance between the currents of two sequences at corresponding time points of their current-versus-time series obtained from simulation software.
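
The combination of a point distance and DTW can be sketched with the classic dynamic programming recurrence. This is a minimal textbook version, not necessarily the variant used in the embodiment; the default point distance is the absolute difference, which is the Euclidean distance for one-dimensional current values, and the name `dtw_distance` is an assumption.

```python
def dtw_distance(signal_a, signal_b, point_distance=lambda x, y: abs(x - y)):
    """Dynamic time warping distance between two 1-D signals."""
    n, m = len(signal_a), len(signal_b)
    inf = float("inf")
    # cost[i][j]: minimal accumulated cost of aligning the first i points of
    # signal_a with the first j points of signal_b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = point_distance(signal_a[i - 1], signal_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # signal_b point repeated
                                 cost[i][j - 1],      # signal_a point repeated
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]

print(dtw_distance([1.0, 2.0, 3.0], [1.0, 2.0, 2.0, 3.0]))  # 0.0: same shape, different timing
print(dtw_distance([0.0, 0.0], [1.0, 1.0]))                 # 2.0
```

Because DTW tolerates stretching along the time axis, two traces that differ only in how long each current level is held have distance zero, which suits nanopore signals whose dwell times vary.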

In the embodiment of the present application, the preset distance algorithm may be the Bhattacharyya distance or the Euclidean distance. A dynamic time warping algorithm based on the Bhattacharyya distance enables effective calculation of the distance, and sequence codes generated from representative sequence codes screened on the basis of this distance perform better in application.
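
Assuming each point of a simulated current trace is described by a mean and a standard deviation (a modelling assumption that the text does not state), the Bhattacharyya distance between two such points has the closed form for univariate Gaussians used below, while the Euclidean option reduces to the absolute difference of the means. The function names are hypothetical.

```python
import math

def bhattacharyya_gaussian(mean1, std1, mean2, std2):
    """Bhattacharyya distance between two univariate normal distributions."""
    var1, var2 = std1 ** 2, std2 ** 2
    mean_term = 0.25 * (mean1 - mean2) ** 2 / (var1 + var2)
    spread_term = 0.5 * math.log((var1 + var2) / (2.0 * std1 * std2))
    return mean_term + spread_term

def euclidean_1d(mean1, mean2):
    """Euclidean distance between two scalar current levels."""
    return abs(mean1 - mean2)

print(bhattacharyya_gaussian(80.0, 2.0, 80.0, 2.0))  # 0.0 for identical distributions
print(bhattacharyya_gaussian(0.0, 1.0, 2.0, 1.0))    # 0.5
print(euclidean_1d(80.0, 90.0))                      # 10.0
```

Unlike the Euclidean option, the Bhattacharyya distance takes the noise of each current level into account, so two levels with the same mean but different spread are still separated.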

As an optional implementation manner, calculating, by a preset distance algorithm and a DTW function, a distance between a sequence corresponding to a currently to-be-judged sequence code and a sequence corresponding to each of a plurality of previously determined representative sequence codes, includes: generating current signals respectively corresponding to a sequence corresponding to the current sequence code to be judged and sequences respectively corresponding to the plurality of previously determined representative sequence codes; and determining the distance between the sequence corresponding to the currently-to-be-judged sequence code and the sequence corresponding to the plurality of previously-determined representative sequence codes respectively through a preset distance algorithm, a DTW function and a current signal.

In such an embodiment, generating the current signals allows the distance to be calculated effectively on the basis of the current signals. The algorithm adopted in the embodiment of the present application does not differ from existing distance algorithms, so how the distance is calculated from the current signals is not described in detail in the embodiment of the present application.

After the plurality of representative sequence codes are determined in step 130, the plurality of representative sequence codes are spliced in step 140 to determine a plurality of second sequence codes. The number of bits of each second sequence code is greater than the specified number of bits.

When splicing, the representative sequence codes can be spliced according to preset splicing rules. For example, the plurality of representative sequence codes are spliced into a plurality of long k-mer sequences of unequal length (e.g., 20-mer to 30-mer sequences). Furthermore, representative sequence codes of different lengths can be selected for each splice, so that the sequences obtained by the individual splices differ in length.
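One possible splicing rule is sketched below for illustration only. The function name `splice`, the random choice of representatives, and the bounded number of attempts are assumptions; the present application does not fix a particular splicing rule.

```python
import random

def splice(representatives, min_len, max_len, count, seed=0):
    """Concatenate randomly chosen representative codes into longer codes.

    Hypothetical rule: keep appending representatives until the candidate
    reaches min_len, and keep it only if it does not exceed max_len.
    """
    rng = random.Random(seed)
    spliced = set()
    for _ in range(1000 * count):  # bounded number of attempts
        if len(spliced) >= count:
            break
        candidate = ""
        while len(candidate) < min_len:
            candidate += rng.choice(representatives)
        if len(candidate) <= max_len:
            spliced.add(candidate)
    return sorted(spliced)

# Representatives of unequal length yield spliced codes of unequal length.
long_codes = splice(["ACGTA", "GGTCAT", "TTACGGA"], 20, 30, 5)
```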

Further, in step 150, a plurality of final sequence encodings are generated from the plurality of second sequence encodings. As an alternative embodiment, based on the plurality of second sequence encodings, a global search may be continued, step 150 comprising: screening the plurality of second sequence codes to obtain a plurality of screened second sequence codes; determining a plurality of new representative sequence codes from the plurality of screened second sequence codes based on the distance between sequences corresponding to each sequence code in the plurality of screened second sequence codes according to an incremental clustering algorithm; splicing the plurality of new representative sequence codes to determine a plurality of third sequence codes; the number of bits encoded by the third sequence is greater than the number of bits encoded by the second sequence; a plurality of final sequence encodings is generated from the plurality of third sequence encodings.

In this embodiment, after sequence codes with a higher number of bits are obtained by splicing the representative sequence codes, the global search process can be iterated to generate more final sequence codes, which are better suited to multiplexing application scenarios.

In this embodiment, therefore, the plurality of second sequence codes are in effect used as the short sequence codes of step 110, and are screened, clustered, and spliced into sequence codes with a higher number of bits. The generated third sequence codes can be used as the final result, or can be further screened, clustered and spliced to obtain sequence codes with an even higher number of bits until the requirement is met. Of course, the second sequence codes obtained in step 140 may also be used directly as the final sequence codes; the number of iterations is not limited in the embodiments of the present application.
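The iteration described above can be sketched as a simple loop. This is a schematic illustration under stated assumptions: `iterate_search` and its stopping criterion (a target code length) are hypothetical, and the `screen`, `cluster` and `splice_pairs` functions are toy stand-ins for the screening, incremental clustering and splicing steps.

```python
def iterate_search(codes, screen, cluster, splice, target_len, max_rounds=10):
    """Repeat screen -> cluster -> splice until codes reach target_len."""
    for _ in range(max_rounds):
        if codes and min(len(c) for c in codes) >= target_len:
            break  # requirement met: every code is long enough
        codes = splice(cluster(screen(codes)))
    return codes

# Toy stand-ins, for illustration only.
def screen(codes):
    return [c for c in codes if len(set(c)) > 1]  # drop single-base codes

def cluster(codes):
    return sorted(set(codes))  # placeholder for incremental clustering

def splice_pairs(reps):
    return [a + b for a in reps for b in reps if a != b]

final_codes = iterate_search(["AC", "GT"], screen, cluster, splice_pairs, 8)
```

Here two rounds are needed: the 2-mers are spliced into 4-mers, and the 4-mers into 8-mers.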

The plurality of final sequence codes generated in step 150 may be used to label the nucleic acid to be tested. The application of the final sequence codes to labeling is described next.

As an alternative application, a marker sequence code is determined from the plurality of final sequence codes; a check code corresponding to the marker sequence code is determined; a marker nucleic acid sequence corresponding to the marker sequence code and a check nucleic acid sequence corresponding to the check code are synthesized; and the marker nucleic acid sequence, the check nucleic acid sequence and the nucleic acid to be detected are connected in sequence to obtain the labeled nucleic acid to be detected.

The marker sequence codes can be determined according to the number of nucleic acids to be tested. For example, if there are 3 nucleic acids to be tested in total, three sequence codes can be randomly selected from the final sequence codes.

After the marker sequence code is determined, a corresponding check code may be generated from it. When generating the check code, the marker sequence code may be converted to a digital code, a digital check code may then be generated from the digital code, and the digital check code may finally be converted to a (sequence) check code. In practical application, it is only necessary to ensure that the marker sequence code can be restored from the check code, so the manner of generating the check code is not limited in the embodiment of the application.
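The three conversion steps can be illustrated with a deliberately simple, invertible scheme. The scheme is an assumption made for this sketch and is not the check code of the present application: bases are mapped to the digits 0 to 3, each digit d is replaced by 3 - d, and the result is mapped back to bases. All function names are hypothetical.

```python
BASES = "ACGT"  # assumed digit mapping: A=0, C=1, G=2, T=3

def to_digits(seq):
    """Convert a sequence code to a digital code."""
    return [BASES.index(b) for b in seq]

def to_sequence(digits):
    """Convert a digital code back to a sequence code."""
    return "".join(BASES[d] for d in digits)

def make_check_code(marker_code):
    """Hypothetical check code: digit-wise 3 - d, which is self-inverse."""
    return to_sequence([3 - d for d in to_digits(marker_code)])

def restore_marker(check_code):
    """Restore the marker sequence code from the check code."""
    return to_sequence([3 - d for d in to_digits(check_code)])
```

Any other mapping would serve equally well, provided the marker sequence code can be restored from the check code.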

Corresponding marker nucleic acid sequences can be synthesized based on marker sequence encoding, and corresponding check nucleic acid sequences can be synthesized based on check encoding, embodiments of which are well-established in the art and are not described in detail in the examples of this application.

When the nucleic acid to be detected needs to be sequenced, the marker nucleic acid sequence, the check nucleic acid sequence and the nucleic acid to be detected are connected in sequence to obtain the labeled nucleic acid to be detected; the connection between the sequences can be realized by a ligase. The marker nucleic acid sequence and the check nucleic acid sequence are attached to the head of the nucleic acid to be detected.

In the embodiment of the present application, since the generated final sequence codes also need to be distinguished from one another, a corresponding identifier may be set for each sequence code. For example, with 100 sequence codes in total, the identifiers corresponding to the 100 sequence codes are 001 to 100: the identifier corresponding to the 3rd sequence code is 003, and the identifier corresponding to the 20th sequence code is 020.
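The correspondence between final sequence codes and identifiers can be held in a simple lookup table, as in the following sketch. The function name `assign_identifiers` and the example codes are hypothetical.

```python
def assign_identifiers(final_codes, width=3):
    """Map each final sequence code to a zero-padded identifier (001, 002, ...)."""
    return {code: f"{i:0{width}d}"
            for i, code in enumerate(final_codes, start=1)}

code_to_id = assign_identifiers(["ACGTAC", "TTGACA", "GGCATT"])
```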

Furthermore, when nanopore sequencing is performed on the nucleic acid to be detected: obtaining the nanopore current signal corresponding to the labeled nucleic acid to be detected; separating the current signal corresponding to the marker nucleic acid sequence and the current signal corresponding to the check nucleic acid sequence from that current signal; inputting the current signal corresponding to the marker nucleic acid sequence into a pre-trained detection model to obtain the identifier corresponding to the marker nucleic acid sequence; determining a restored marker nucleic acid sequence according to the current signal corresponding to the check nucleic acid sequence; determining the identifier corresponding to the restored marker nucleic acid sequence according to a preset correspondence between final sequence codes and sequence code identifiers; comparing the identifier corresponding to the marker nucleic acid sequence with the identifier corresponding to the restored marker nucleic acid sequence; and processing the current signal corresponding to the marker nucleic acid sequence according to the comparison result.

In this embodiment, the current signal corresponding to the marker nucleic acid sequence and the current signal corresponding to the check nucleic acid sequence may be separated according to preset current characteristics matched to each of the two current signals; these characteristics rest on the fact that the marker nucleic acid sequence and the check nucleic acid sequence are located at the head of the nucleic acid to be detected. For example, the current signal corresponding to the marker nucleic acid sequence is a current signal that meets a first preset current value within a first time period, and the current signal corresponding to the check nucleic acid sequence is a current signal that meets a second preset current value within a second time period. The second time period is after the first time period, and the first preset current value is smaller than the second preset current value. The specific values of the first time period, the second time period, the first preset current value and the second preset current value can be preset according to the nanopore environment in the specific application scenario and the condition of the nucleic acid to be detected, and are not limited in the embodiment of the application.

The pre-trained detection model is a neural network model, which may be a convolutional neural network, a recurrent neural network, or a Transformer; this is not limited in the embodiments of the present application.

The training data corresponding to the detection model comprise: the current signal corresponding to a marker nucleic acid sequence, separated from the nanopore current signal, and the marker sequence code corresponding to that marker nucleic acid sequence, i.e. the identifier corresponding to the marker nucleic acid sequence, which can also be understood as the class of the marker sequence code corresponding to the marker nucleic acid sequence. In addition to this data, the training data may include negative example data, consisting of the current signal corresponding to an existing marker nucleic acid sequence and the corresponding negative judgment result (i.e., an erroneous judgment result); see the description in the subsequent embodiments. The negative example data can be used directly as training data, or used as a verification data set to verify the accuracy of the detection model, with the detection model adjusted according to the verification result.

In short, with respect to the detection model, by training in advance, after inputting a current signal corresponding to the marker nucleic acid sequence into the model, the detection model may output an identification corresponding to the marker nucleic acid sequence.

In addition to the identifier obtained from the detection model, the marker sequence code can be restored from the check code, as described in the previous embodiments. Thus, the check code may be determined from the current signal corresponding to the check nucleic acid sequence, and the marker sequence code may then be restored from the check code. The identifier corresponding to the restored marker sequence code is then determined according to the preset correspondence between final sequence codes and sequence code identifiers; this identifier is the identifier corresponding to the restored marker nucleic acid sequence.

The identifier corresponding to the marker nucleic acid sequence is compared with the identifier corresponding to the restored marker nucleic acid sequence to obtain a comparison result indicating whether the two are consistent, and the current signal corresponding to the marker nucleic acid sequence is then processed on the basis of the comparison result.

As an alternative embodiment, the processing includes: if the identifier corresponding to the marker nucleic acid sequence is inconsistent with the identifier corresponding to the restored marker nucleic acid sequence, storing the identifier corresponding to the marker nucleic acid sequence and the current signal corresponding to the marker nucleic acid sequence. Correspondingly, the encoding method further comprises: when training the initial detection model, using the identifier corresponding to the marker nucleic acid sequence and the current signal corresponding to the marker nucleic acid sequence as training data.

In this embodiment, if the identifier corresponding to the marker nucleic acid sequence and the identifier corresponding to the restored marker nucleic acid sequence are inconsistent, the marker nucleic acid sequence may have been misrecognized, and this case can be used as training data to train the detection model. That is, the identifier corresponding to the marker nucleic acid sequence and the current signal corresponding to the marker nucleic acid sequence can be used as the negative training data of the detection model described in the foregoing embodiment; in this way, the accuracy of the detection model can be improved.
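The comparison and the collection of negative training data can be sketched as follows. The function `process_read`, its parameters, and the toy restoration function in the usage lines are hypothetical; the detection model and the decoding of the check nucleic acid signal are assumed to have already produced `predicted_id` and `check_code`.

```python
def process_read(signal, predicted_id, check_code,
                 restore_marker, code_to_id, negatives):
    """Compare the model's identifier with the identifier restored from
    the check code; store inconsistent cases as negative training data."""
    restored_id = code_to_id.get(restore_marker(check_code))
    consistent = (predicted_id == restored_id)
    if not consistent:
        # Possible misrecognition: keep the signal and the model's identifier.
        negatives.append((signal, predicted_id))
    return consistent

# Toy usage: the restoration function here simply reverses the check code.
code_to_id = {"ACGT": "001", "TTGA": "002"}
negatives = []
ok = process_read([1.0, 2.0], "001", "TGCA",
                  lambda c: c[::-1], code_to_id, negatives)
bad = process_read([3.0], "002", "TGCA",
                   lambda c: c[::-1], code_to_id, negatives)
```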

In combination with the description of the foregoing embodiments, in the embodiment of the present application, on the one hand, a global segmentation and combination search method is adopted, in which the distance is calculated based on the Bhattacharyya distance and DTW, so that sequence codes with a low false recognition rate are generated effectively. On the other hand, when the sequence codes are applied, training data for the detection model are obtained from the detection results of the sequences corresponding to the sequence codes, so that the detection accuracy for those sequences can be continuously improved and the sequence codes can be used more effectively.

Based on the same inventive concept, please refer to fig. 2, in an embodiment of the present application, there is further provided a sequence encoding apparatus 200, including: an acquisition module 210 and a processing module 220.

The acquisition module 210 is configured to: acquire a plurality of first sequence codes; the first sequence code is a sequence code of a specified number of bits corresponding to bases of a preset number of bits.

The processing module 220 is configured to: screening the plurality of first sequence codes according to a preset screening algorithm to obtain a plurality of screened first sequence codes; determining a plurality of representative sequence codes from the screened plurality of first sequence codes based on the distance between sequences corresponding to each sequence code in the screened plurality of first sequence codes according to an incremental clustering algorithm; splicing the plurality of representative sequence codes to determine a plurality of second sequence codes; the number of bits encoded by the second sequence is greater than the specified number of bits; generating a final sequence code from the plurality of second sequence codes; the final sequence codes for a corresponding nucleic acid sequence for labeling the nucleic acid to be tested.

In the embodiment of the present application, the processing module 220 is specifically configured to: for the current sequence code to be judged, calculating the distance between the sequence corresponding to the current sequence code to be judged and the sequences respectively corresponding to the plurality of previously determined representative sequence codes; if the distance between the sequence corresponding to the current sequence code to be judged and each sequence in the sequences corresponding to the plurality of the representative sequence codes determined in advance is larger than a preset threshold value, determining the current sequence code to be judged as the representative sequence code; and if the distance between the sequence corresponding to the current sequence code to be judged and at least one sequence in the sequences respectively corresponding to the plurality of the previously determined representative sequence codes is smaller than or equal to a preset threshold value, determining that the current sequence code to be judged is not the representative sequence code.
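The judging rule above can be sketched in a few lines. This is an illustrative sketch: the function names are hypothetical, the Hamming distance is only a stand-in for the sequence distance, and treating the first code as a representative (because no representatives exist yet) is an assumption.

```python
def incremental_cluster(codes, distance, threshold):
    """Select representative codes one at a time.

    A code becomes a representative only if its distance to every
    previously determined representative is greater than threshold.
    """
    representatives = []
    for code in codes:
        if all(distance(code, rep) > threshold for rep in representatives):
            representatives.append(code)
    return representatives

def hamming(a, b):
    """Stand-in distance for equal-length codes (assumption)."""
    return sum(x != y for x, y in zip(a, b))

reps = incremental_cluster(["AAAA", "AAAT", "TTTT", "TTTA", "ACGT"],
                           hamming, threshold=1)
```

In this toy run, "AAAT" and "TTTA" are rejected because each lies within the threshold of an earlier representative.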

In the embodiment of the present application, the processing module 220 is specifically configured to: and calculating the distance between the sequence corresponding to the currently-to-be-judged sequence code and the sequences respectively corresponding to the plurality of previously-determined representative sequence codes through a preset distance algorithm and a DTW (Dynamic Time Warping) function.

In the embodiment of the present application, the processing module 220 is specifically further configured to: generating current signals respectively corresponding to a sequence corresponding to the current sequence code to be judged and sequences respectively corresponding to a plurality of previously determined representative sequence codes; and determining the distance between the sequence corresponding to the sequence code to be judged currently and the sequences corresponding to the plurality of the representative sequence codes determined in advance according to the preset distance algorithm, the DTW function and the current signal.

In the embodiment of the present application, the processing module 220 is specifically further configured to: screening the plurality of second sequence codes to obtain a plurality of screened second sequence codes; determining a plurality of new representative sequence codes from the plurality of screened second sequence codes based on distances between sequences corresponding to each sequence code in the plurality of screened second sequence codes according to an incremental clustering algorithm; splicing the plurality of new representative sequence codes to determine a plurality of third sequence codes; the third sequence encodes a greater number of bits than the second sequence; generating a plurality of final sequence codes according to the plurality of third sequence codes.

In an embodiment of the present application, the sequence encoding apparatus 200 further comprises a nucleic acid sequencing module for: determining a marker sequence code from the plurality of final sequence codes; determining a check code corresponding to the marker sequence code; synthesizing a marker nucleic acid sequence corresponding to the marker sequence code, and synthesizing a check nucleic acid sequence corresponding to the check code; and sequentially connecting the marker nucleic acid sequence, the check nucleic acid sequence and the nucleic acid to be detected to obtain labeled nucleic acid to be detected.

In embodiments of the present application, the nucleic acid sequencing module is further configured to: when the nanopore sequencing is carried out on the nucleic acid to be detected, obtaining a current signal of a nanopore corresponding to the labeled nucleic acid to be detected; separating the current signal corresponding to the marker nucleic acid sequence and the current signal corresponding to the check nucleic acid sequence from the current signal; inputting the current signal corresponding to the marker nucleic acid sequence into a pre-trained detection model to obtain an identifier corresponding to the marker nucleic acid sequence; determining a restored marker nucleic acid sequence according to the current signal corresponding to the check nucleic acid sequence; determining the identifier corresponding to the restored marker nucleic acid sequence according to the preset correspondence between final sequence codes and sequence code identifiers; comparing the identifier corresponding to the marker nucleic acid sequence with the identifier corresponding to the restored marker nucleic acid sequence; and processing the current signal corresponding to the marker nucleic acid sequence according to the comparison result.

In embodiments of the present application, the nucleic acid sequencing module is specifically configured to: if the identifier corresponding to the marker nucleic acid sequence is inconsistent with the identifier corresponding to the restored marker nucleic acid sequence, store the identifier corresponding to the marker nucleic acid sequence and the current signal corresponding to the marker nucleic acid sequence. The sequence encoding apparatus 200 further includes a training module, configured to use the identifier corresponding to the marker nucleic acid sequence and the current signal corresponding to the marker nucleic acid sequence as training data when training the initial detection model.

In the embodiment of the present application, the processing module 220 is specifically further configured to: screening the plurality of first sequence codes according to a minimum free energy algorithm, and screening the plurality of first sequence codes according to a repeated sequence checking method to obtain a plurality of screened first sequence codes; the minimum free energy algorithm is used for screening out codes corresponding to sequences with spatial structures, and the repeated sequence checking method is used for screening out codes corresponding to specified repeated sequences.
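The two screening steps can be sketched as a single filter. Several points are assumptions made for illustration: the minimum free energy is supplied by a caller-provided function `mfe` (in practice it would come from a folding tool), a code is removed when its minimum free energy is at or below a hypothetical threshold, and the "specified repeated sequences" are taken to be runs of one base longer than `max_run`.

```python
import re

def has_long_repeat(seq, max_run=3):
    """True if seq contains a run of one base longer than max_run."""
    return re.search(r"(.)\1{%d,}" % max_run, seq) is not None

def screen(codes, mfe, mfe_threshold=-1.0, max_run=3):
    """Keep codes that neither fold stably nor contain long repeats.

    mfe: hypothetical callable returning the minimum free energy of a
    sequence; lower values indicate a more stable spatial structure.
    """
    return [c for c in codes
            if mfe(c) > mfe_threshold and not has_long_repeat(c, max_run)]

# Toy usage with made-up energies standing in for a folding tool.
toy_energy = {"ACGTAC": 0.0, "AAAATC": 0.0, "GGCCGG": -5.0}
kept = screen(list(toy_energy), toy_energy.get)
```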

The sequence encoding apparatus 200 corresponds to the sequence encoding method in the foregoing embodiment, and thus, the embodiments of the respective modules of the sequence encoding apparatus 200 may refer to the embodiments of the respective steps of the sequence encoding method, and will not be described again here.

Based on the same inventive concept, the embodiments of the present application also provide a readable storage medium having stored thereon a computer program which, when executed by a computer, performs the encoding method of the sequences in the foregoing embodiments.

In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. The above-described apparatus embodiments are merely illustrative, for example, the division of the units is merely a logical function division, and there may be other manners of division in actual implementation, and for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be through some communication interface, device or unit indirect coupling or communication connection, which may be in electrical, mechanical or other form.

Further, the units described as separate units may or may not be physically separate, and units displayed as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

Furthermore, functional modules in various embodiments of the present application may be integrated together to form a single portion, or each module may exist alone, or two or more modules may be integrated to form a single portion.

In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.

The foregoing is merely exemplary embodiments of the present application and is not intended to limit the scope of the present application, and various modifications and variations may be suggested to one skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present application should be included in the protection scope of the present application.

Claims

1. A method of encoding a sequence, comprising:

acquiring a plurality of first sequence codes; the first sequence code is a sequence code of a specified number of bits corresponding to bases of a preset number of bits;

screening the plurality of first sequence codes according to a preset screening algorithm to obtain a plurality of screened first sequence codes;

determining a plurality of representative sequence codes from the screened plurality of first sequence codes based on distances between sequences corresponding to each sequence code in the screened plurality of first sequence codes according to an incremental clustering algorithm, comprising:

for the current sequence code to be judged, calculating the distance between the sequence corresponding to the current sequence code to be judged and the sequences respectively corresponding to the plurality of previously determined representative sequence codes; if the distance between the sequence corresponding to the current sequence code to be judged and each sequence in the sequences corresponding to the plurality of the representative sequence codes determined in advance is larger than a preset threshold value, determining the current sequence code to be judged as the representative sequence code;

if the distance between the sequence corresponding to the current sequence code to be judged and at least one sequence in the sequences respectively corresponding to the plurality of the previously determined representative sequence codes is smaller than or equal to a preset threshold value, determining that the current sequence code to be judged is not the representative sequence code;

splicing the plurality of representative sequence codes to determine a plurality of second sequence codes; the number of bits encoded by the second sequence is greater than the specified number of bits;

generating a plurality of final sequence encodings from the plurality of second sequence encodings, comprising: screening the plurality of second sequence codes to obtain a plurality of screened second sequence codes;

determining a plurality of new representative sequence codes from the plurality of screened second sequence codes based on distances between sequences corresponding to each sequence code in the plurality of screened second sequence codes according to an incremental clustering algorithm;

splicing the plurality of new representative sequence codes to determine a plurality of third sequence codes; the third sequence encodes a greater number of bits than the second sequence;

generating a plurality of final sequence encodings from the plurality of third sequence encodings;

the final sequence codes for a corresponding nucleic acid sequence for labeling the nucleic acid to be tested.

2. The encoding method according to claim 1, wherein the calculating the distance between the sequence corresponding to the currently-to-be-determined sequence code and the sequence corresponding to each of the plurality of previously-determined representative sequence codes includes:

and calculating the distance between the sequence corresponding to the current sequence code to be judged and the sequences corresponding to the plurality of previously determined representative sequence codes respectively through a preset distance algorithm and a DTW function.

3. The encoding method according to claim 2, wherein the calculating, by a preset distance algorithm and a DTW function, a distance between a sequence corresponding to the currently to-be-determined sequence code and a sequence corresponding to each of a plurality of previously determined representative sequence codes includes:

generating current signals respectively corresponding to a sequence corresponding to the current sequence code to be judged and sequences respectively corresponding to a plurality of previously determined representative sequence codes;

and determining the distance between the sequence corresponding to the sequence code to be judged currently and the sequences corresponding to the plurality of the representative sequence codes determined in advance according to the preset distance algorithm, the DTW function and the current signal.

4. The encoding method according to claim 2, wherein the preset distance algorithm is: the Bhattacharyya distance algorithm or the Euclidean distance algorithm.

5. The encoding method according to claim 1, characterized in that the encoding method further comprises:

determining a marker sequence code from the plurality of final sequence codes;

determining a check code corresponding to the marker sequence code;

synthesizing a marker nucleic acid sequence corresponding to the marker sequence code, and synthesizing a check nucleic acid sequence corresponding to the check code;

and sequentially connecting the marker nucleic acid sequence, the check nucleic acid sequence and the nucleic acid to be detected to obtain labeled nucleic acid to be detected.

6. The encoding method according to claim 5, characterized in that the encoding method further comprises:

when the nanopore sequencing is carried out on the nucleic acid to be detected, obtaining a current signal of a nanopore corresponding to the labeled nucleic acid to be detected;

separating the current signal corresponding to the marker nucleic acid sequence and the current signal corresponding to the check nucleic acid sequence from the current signal;

inputting the current signal corresponding to the marker nucleic acid sequence into a pre-trained detection model to obtain an identifier corresponding to the marker nucleic acid sequence;

determining a restored marker nucleic acid sequence according to the current signal corresponding to the check nucleic acid sequence;

determining the identifier corresponding to the restored marker nucleic acid sequence according to the preset correspondence between final sequence codes and sequence code identifiers;

comparing the identifier corresponding to the marker nucleic acid sequence with the identifier corresponding to the restored marker nucleic acid sequence;

and processing the current signal corresponding to the marker nucleic acid sequence according to the comparison result.

7. The encoding method according to claim 6, wherein the processing the current signal corresponding to the marker nucleic acid sequence according to the comparison result comprises:

if the identifier corresponding to the marker nucleic acid sequence is inconsistent with the identifier corresponding to the restored marker nucleic acid sequence, storing the identifier corresponding to the marker nucleic acid sequence and the current signal corresponding to the marker nucleic acid sequence;

the encoding method further includes:

and when training the initial detection model, taking the identification corresponding to the marker nucleic acid sequence and the current signal corresponding to the marker nucleic acid sequence as training data.

8. The encoding method according to claim 1, wherein the screening the plurality of first sequence codes according to a preset screening algorithm to obtain a screened plurality of first sequence codes comprises:

screening the plurality of first sequence codes according to a minimum free energy algorithm, and screening the plurality of first sequence codes according to a repeated sequence checking method to obtain a plurality of screened first sequence codes; the minimum free energy algorithm is used for screening out codes corresponding to sequences with space structures, and the repeated sequence checking method is used for screening out codes corresponding to specified repeated sequences.

9. A device for encoding a sequence, comprising:

an acquisition module for acquiring a plurality of first sequence codes; the first sequence code is a sequence code of a specified number of bits corresponding to bases of a preset number of bits;

a processing module for:

determining a plurality of representative sequence codes from the screened plurality of first sequence codes based on the distance between sequences corresponding to each sequence code in the screened plurality of first sequence codes according to an incremental clustering algorithm;

generating a final sequence code from the plurality of second sequence codes; the final sequence codes a corresponding nucleic acid sequence for marking the nucleic acid to be detected;

the processing module is specifically used for calculating the distance between a sequence corresponding to the current sequence code to be judged and sequences respectively corresponding to a plurality of previously determined representative sequence codes aiming at the current sequence code to be judged; if the distance between the sequence corresponding to the current sequence code to be judged and each sequence in the sequences corresponding to the plurality of the representative sequence codes determined in advance is larger than a preset threshold value, determining the current sequence code to be judged as the representative sequence code;

the processing module is specifically used for screening the plurality of second sequence codes to obtain a plurality of screened second sequence codes; determining a plurality of new representative sequence codes from the plurality of screened second sequence codes based on distances between sequences corresponding to each sequence code in the plurality of screened second sequence codes according to an incremental clustering algorithm; splicing the plurality of new representative sequence codes to determine a plurality of third sequence codes; the third sequence encodes a greater number of bits than the second sequence; generating a plurality of final sequence codes according to the plurality of third sequence codes.

10. A readable storage medium, characterized in that the readable storage medium has stored thereon a computer program which, when run by a computer, performs the encoding method of the sequence according to any of claims 1-8.