
CN111326215B - Method and system for searching nucleic acid sequence based on k-tuple frequency

Info

Publication number: CN111326215B
Application number: CN202010083043.6A
Authority: CN
Inventors: 王颖; 白佳兴
Original assignee: Xiamen University
Current assignee: Xiamen University
Priority date: 2020-02-07
Filing date: 2020-02-07
Publication date: 2022-04-29
Anticipated expiration: 2040-02-07
Also published as: CN111326215A

Abstract

The invention discloses a method and a system for searching a nucleic acid sequence based on k-tuple frequency. The method comprises the following steps: determining the frequency X_ij of occurrence of adjacent k-tuple pairs in a nucleic acid sequence to be searched, i = 1, …, 4^k, j = 1, …, 4^k, wherein X_ij represents the number of times k-tuple i appears before k-tuple j; determining a vector representation of X_ij, i = 1, …, 4^k, j = 1, …, 4^k; determining a vector representation of all tuples according to all X_ij, i = 1, …, 4^k, j = 1, …, 4^k, and determining a dimension-reduced representation of the nucleic acid sequence to be searched according to the vector representation of all tuples; calculating the dissimilarity between the dimension-reduced nucleic acid sequence to be searched and each nucleic acid sequence in a gene database, wherein the nucleic acid sequences in the gene database have been dimension-reduced by the same method as the nucleic acid sequence to be searched; and outputting the nucleic acid sequences in the gene database with relatively low dissimilarity to the dimension-reduced nucleic acid sequence to be searched. The invention requires little storage space and is computationally efficient.

Description

Method and system for searching nucleic acid sequence based on k-tuple frequency

Technical Field

The invention relates to the technical field of nucleic acid sequence search, in particular to a method and a system for searching a nucleic acid sequence based on k-tuple frequency.

Background

With the rapid development of sequencing technology, the biological field generates unprecedented amounts of sequence data. One fundamental problem in many biological studies is the comparison of these sequences. Conventional sequence comparison is based on sequence alignment; however, alignment-based methods require a large amount of computing power and time and also rely on large reference sequences.

Alignment-free sequence comparison methods are more computationally efficient than traditional alignment-based methods and have been widely applied to the comparison of genomes and metagenomes. Existing alignment-free methods all compare sequence similarity using vectors formed from the frequencies of fixed-length k-tuples, either directly or after some normalization. However, these methods require computing and storing long k-tuples and k-tuple-based Markov models of a given order, and need long computation times and large storage space to obtain good results, which limits the application of alignment-free methods to large-scale data.

Disclosure of Invention

The invention aims to provide a method and a system for searching nucleic acid sequences based on k-tuple frequency, which have the advantages of small required storage space and high calculation efficiency.

In order to achieve the purpose, the invention provides the following scheme:

a method for searching a nucleic acid sequence based on k-tuple frequency, comprising:

determining the frequency X_ij of occurrence of adjacent k-tuple pairs in a nucleic acid sequence to be searched, wherein X_ij represents the number of times k-tuple i appears before k-tuple j;

determining a vector representation of X_ij, wherein i = 1, …, 4^k and j = 1, …, 4^k; b_i represents the bias on k-tuple i caused by sequence background noise, and a corresponding bias term represents the bias on k-tuple j caused by sequence background noise; w_i represents the weight vector of k-tuple i when it appears before k-tuple j, and a corresponding weight vector represents k-tuple i when it appears after k-tuple j;

determining a vector representation of k-tuple i;

determining a dimension-reduced representation of the nucleic acid sequence to be searched, wherein k is the tuple length and 4^k is the number of tuples;

calculating the dissimilarity degree between the nucleic acid sequence to be searched after dimensionality reduction and each nucleic acid sequence in a gene database, wherein the nucleic acid sequence in the gene database is obtained after dimensionality reduction is carried out by adopting a method the same as the dimensionality reduction method of the nucleic acid sequence to be searched;

and outputting the nucleic acid sequence with relatively low dissimilarity with the nucleic acid sequence to be searched after dimensionality reduction in the gene database.

Optionally, determining the vector representation of X_ij specifically comprises:

solving by the least squares method to obtain the vector representations of all X_ij.

Optionally, the calculating the dissimilarity degree between the nucleic acid sequence to be searched after dimensionality reduction and each nucleic acid sequence in the gene database specifically includes:

calculating the degree of dissimilarity between the nucleic acid sequence y to be searched and a nucleic acid sequence x in the gene database.

Optionally, outputting the nucleic acid sequences in the gene database with relatively low dissimilarity to the dimension-reduced nucleic acid sequence to be searched specifically comprises:

outputting the first n nucleic acid sequences in the gene database with relatively low dissimilarity to the dimension-reduced nucleic acid sequence to be searched, wherein n is a preset value.

Optionally, the nucleic acid sequence is a DNA sequence or an RNA sequence.

The present invention also provides a nucleic acid sequence search system based on k-tuple frequency, comprising:

a k-tuple frequency determination module for determining the frequency X_ij of occurrence of adjacent k-tuple pairs in the nucleic acid sequence to be searched, wherein X_ij represents the number of times k-tuple i appears before k-tuple j;

an X_ij vector representation determination module for determining a vector representation of X_ij;

a k-tuple vector representation module for determining a vector representation of k-tuple i;

a dimension-reduced representation module for determining a dimension-reduced representation of the nucleic acid sequence to be searched, wherein k is the tuple length and 4^k is the number of tuples;

the dissimilarity degree calculation module is used for calculating the dissimilarity degree between the nucleic acid sequence to be searched after dimension reduction and each nucleic acid sequence in the gene database, wherein the nucleic acid sequence in the gene database is obtained after dimension reduction is carried out by adopting the same method as the dimension reduction method of the nucleic acid sequence to be searched;

and the matched nucleic acid output module is used for outputting a nucleic acid sequence with relatively low dissimilarity degree with the nucleic acid sequence to be searched after dimensionality reduction in the gene database.

Optionally, the X_ij vector representation determination module specifically includes:

an X_ij vector representation determination unit for solving by the least squares method to obtain the vector representations of all X_ij.

Optionally, the dissimilarity degree calculating module specifically includes:

a dissimilarity degree calculating unit for calculating the degree of dissimilarity.

Optionally, the matched nucleic acid output module specifically includes:

a matched nucleic acid output unit for outputting the first n nucleic acid sequences in the gene database with relatively low dissimilarity to the dimension-reduced nucleic acid sequence to be searched, wherein n is a preset value.

Optionally, the nucleic acid sequence is a DNA sequence or an RNA sequence.

According to the specific embodiments provided by the invention, the invention discloses the following technical effects: the method and system for searching a nucleic acid sequence based on k-tuple frequency compute a dimension-reduced representation of the nucleic acid sequence to be searched from the co-occurrence frequency of contextually adjacent k-tuples in the sequence, then calculate the dissimilarity between the dimension-reduced sequence to be searched and the nucleic acid sequences in a nucleic acid database, which are likewise stored in dimension-reduced form, and finally output the nucleic acid sequences with relatively low dissimilarity to the sequence to be searched. Because the comparison is carried out on dimension-reduced sequences, computation time is reduced and computational efficiency is improved. Moreover, because the nucleic acid database stores the dimension-reduced nucleic acid sequences, the storage space required by the system is reduced.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without creative efforts.

FIG. 1 is a schematic flow chart of a method for searching a nucleic acid sequence based on k-tuple frequency according to an embodiment of the present invention;

FIG. 2 is a block diagram of a method for searching a nucleic acid sequence based on k-tuple frequency in an embodiment of the present invention;

FIG. 3 is a diagram showing a nucleic acid sequence search system based on k-tuple frequency in an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.

In a first aspect, the present invention provides a method for searching a nucleic acid sequence based on k-tuple frequency, as shown in FIG. 1, the method comprising the steps of:

Step 101: determining the frequency X_ij of occurrence of adjacent k-tuple pairs in a nucleic acid sequence to be searched, wherein X_ij represents the number of times k-tuple i appears before k-tuple j;

Step 102: determining a vector representation of X_ij;

Step 103: determining a vector representation of k-tuple i;

Step 104: determining a dimension-reduced representation of the nucleic acid sequence to be searched, wherein k is the tuple length and 4^k is the number of tuples;

step 105: calculating the dissimilarity degree between the nucleic acid sequence to be searched after dimensionality reduction and each nucleic acid sequence in a gene database, wherein the nucleic acid sequence in the gene database is obtained after dimensionality reduction is carried out by adopting a method the same as the dimensionality reduction method of the nucleic acid sequence to be searched;

step 106: and outputting the nucleic acid sequence with relatively low dissimilarity with the nucleic acid sequence to be searched after dimensionality reduction in the gene database.

On the basis of the above embodiment, as a preferred embodiment of the present invention, step 101 can be implemented by the following method:

for a nucleic acid sequence G, scanning the whole sequence from beginning to end with a sliding window of length 2k and counting how often each pair of adjacent k-tuples co-occurs in the sequence, to obtain a k-tuple co-occurrence matrix, wherein X_ij indicates the number of times k-tuple i appears before k-tuple j, i = 1, …, 4^k, j = 1, …, 4^k.
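The counting in step 101 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function names `kmer_index` and `cooccurrence_matrix`, the lexicographic ordering of k-tuples, the DNA alphabet, and the choice to skip windows containing other characters are all assumptions made for the example.

```python
from itertools import product


def kmer_index(k, alphabet="ACGT"):
    """Map each of the 4^k possible k-tuples to a row/column index (assumed ordering)."""
    return {"".join(p): n for n, p in enumerate(product(alphabet, repeat=k))}


def cooccurrence_matrix(seq, k):
    """Count how often k-tuple i is immediately followed by k-tuple j.

    A window of length 2k slides over the sequence one base at a time;
    the first half of the window is k-tuple i and the second half is k-tuple j.
    Windows containing characters outside the alphabet are skipped (an assumption).
    """
    index = kmer_index(k)
    size = 4 ** k
    X = [[0] * size for _ in range(size)]
    seq = seq.upper()
    for start in range(len(seq) - 2 * k + 1):
        left = seq[start:start + k]
        right = seq[start + k:start + 2 * k]
        if left in index and right in index:
            X[index[left]][index[right]] += 1
    return X, index


# Example: with k = 1 the matrix is 4 x 4 and counts adjacent base pairs.
X, index = cooccurrence_matrix("ACGTACGA", k=1)
print(X[index["A"]][index["C"]])  # number of times A appears directly before C
```

Because the matrix has 4^k rows and 4^k columns, a dense list of lists is only practical for small k; a sparse mapping would be a natural substitute for larger values.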

On the basis of the above embodiment, as a preferred implementation manner of the present invention, step 102 may be implemented by the following method:

solving by the least squares method to obtain the dimension-reduced representation of X_ij, wherein ⟨·,·⟩ represents the inner product between two vectors.

The logarithmic transformation of X_ij avoids the skewed distribution of the non-negative co-occurrence counts. It should be noted that, because the two orderings of a k-tuple pair are distinguished, X_ij and X_ji are asymmetric.
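The least-squares fit of step 102 can be illustrated with a small sketch. The exact objective is not reproduced in this text, so the code assumes a form in which an inner product of two weight vectors plus two bias terms is fitted to log X_ij over the non-zero entries of the co-occurrence matrix. The function names, the plain stochastic gradient descent solver, and the parameters `dim`, `epochs`, `lr`, and `seed` are illustrative assumptions, not the patented procedure.

```python
import math
import random


def fit_pair_model(X, dim=2, epochs=500, lr=0.05, seed=0):
    """Fit <w[i], w_t[j]> + b[i] + b_t[j] to log X[i][j] by SGD (assumed objective)."""
    rng = random.Random(seed)
    n = len(X)
    w = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(n)]
    w_t = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(n)]
    b = [0.0] * n
    b_t = [0.0] * n
    # Only non-zero counts are used, since log(0) is undefined (an assumption).
    pairs = [(i, j, math.log(X[i][j]))
             for i in range(n) for j in range(n) if X[i][j] > 0]
    for _ in range(epochs):
        rng.shuffle(pairs)
        for i, j, target in pairs:
            pred = sum(a * c for a, c in zip(w[i], w_t[j])) + b[i] + b_t[j]
            err = pred - target
            for d in range(dim):
                wi, wj = w[i][d], w_t[j][d]
                w[i][d] -= lr * err * wj
                w_t[j][d] -= lr * err * wi
            b[i] -= lr * err
            b_t[j] -= lr * err
    return w, w_t, b, b_t


def squared_loss(X, w, w_t, b, b_t):
    """Sum of squared residuals over the non-zero entries of X."""
    total = 0.0
    for i, row in enumerate(X):
        for j, count in enumerate(row):
            if count > 0:
                pred = sum(a * c for a, c in zip(w[i], w_t[j])) + b[i] + b_t[j]
                total += (pred - math.log(count)) ** 2
    return total


# Toy 4 x 4 co-occurrence matrix (k = 1), used only for illustration.
X = [[0, 2, 0, 0], [0, 0, 2, 0], [1, 0, 0, 1], [1, 0, 0, 0]]
params = fit_pair_model(X)
print(squared_loss(X, *params))
```

Keeping separate weight vectors for the "before" and "after" roles lets the fitted values differ for the pair (i, j) and the pair (j, i), which matches the asymmetry of the counts.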

On the basis of the above example, as a preferred embodiment of the present invention, step 105 can be implemented by the following method:

calculating the degree of dissimilarity between the nucleic acid sequence y to be searched and a nucleic acid sequence x in the gene database, wherein i = 1, …, 4^k and j = 1, …, 4^k.

On the basis of the above embodiment, as a preferred implementation of the present invention, step 106 can be implemented by the following method:

outputting the first n nucleic acid sequences in the gene database with relatively low dissimilarity to the dimension-reduced nucleic acid sequence to be searched, wherein n is a preset value chosen manually.

On the basis of the above embodiments, as a preferred embodiment of the present invention, the nucleic acid sequence is a DNA sequence or an RNA sequence. In addition, the present invention is also applicable to transcriptomes, metagenomes, and metatranscriptomes.

As shown in FIG. 2, the present invention comprises two key processing steps: low-dimensional vector space representation of nucleic acid sequences, and dissimilarity calculation. Given a genomic nucleic acid sequence as input, the sequence is mapped to a low-dimensional space by learning a low-dimensional representation of it through vector inner product optimization. Adjacent k-tuple pairs that co-occur in context within the nucleic acid sequence are counted using a sliding window with a step size of 1 bp (1 base). Based on the counts of the k-tuple pairs, each k-tuple is converted into a vector by fitting inner products of vectors plus bias terms, so that each sequence in the massive nucleic acid database is compressed into a point in vector space. Thereafter, for newly input query sequencing data or a long DNA sequence, the low-dimensional vector representation of each k-tuple is likewise learned, and the set of database sequences most similar to the query sequence is obtained by calculating the Manhattan distance between the query sequence and every sequence in the database.
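The search stage described above can be sketched in a few lines. The sketch assumes each sequence has already been reduced to a flat list of numbers and uses the Manhattan distance named in the description as the dissimilarity; the names `manhattan` and `search`, the dictionary-based database, and the sample vectors are hypothetical.

```python
def manhattan(u, v):
    """Manhattan (L1) distance between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - c) for a, c in zip(u, v))


def search(query_vec, database, n):
    """Return the n (distance, name) pairs with the lowest dissimilarity."""
    scored = [(manhattan(query_vec, vec), name) for name, vec in database.items()]
    scored.sort()  # ascending distance; ties are broken by name
    return scored[:n]


# Hypothetical database of sequences already reduced to 3-dimensional vectors.
database = {
    "seq_a": [0.1, 0.9, -0.3],
    "seq_b": [0.0, 1.0, -0.2],
    "seq_c": [2.0, -1.0, 0.5],
}
hits = search([0.0, 1.0, -0.25], database, n=2)
for distance, name in hits:
    print(name, round(distance, 3))
```

Since the database stores only the reduced vectors, each query costs one pass over fixed-length vectors and no alignment against the original sequences.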

In a second aspect, the present invention provides a nucleic acid sequence search system based on k-tuple frequency, as shown in FIG. 3, the system comprising:

a k-tuple frequency determination module 301 for determining the frequency X_ij of occurrence of adjacent k-tuple pairs in the nucleic acid sequence to be searched, wherein X_ij represents the number of times k-tuple i appears before k-tuple j;

an X_ij vector representation determination module 302 for determining a vector representation of X_ij;

a k-tuple vector representation module 303 for determining a vector representation of k-tuple i;

a dimension-reduced representation module 304 for determining a dimension-reduced representation of the nucleic acid sequence to be searched, wherein k is the tuple length and 4^k is the number of tuples;

a dissimilarity degree calculating module 305, configured to calculate dissimilarity degrees between the nucleic acid sequence to be searched after dimensionality reduction and each nucleic acid sequence in a gene database, where the nucleic acid sequence in the gene database is a nucleic acid sequence obtained after dimensionality reduction by using a method the same as that for the nucleic acid sequence to be searched;

and a matched nucleic acid output module 306, configured to output the nucleic acid sequences in the gene database that have relatively low dissimilarity to the dimension-reduced nucleic acid sequence to be searched.

On the basis of the above embodiments, as a preferred embodiment of the present invention, the X_ij vector representation determination module 302 specifically includes:

an X_ij vector representation determination unit for solving by the least squares method to obtain the dimension-reduced representation of X_ij.

In addition to the above embodiments, as a preferred embodiment of the present invention, the dissimilarity degree calculating module 305 specifically includes:

i = 1, …, 4^k, j = 1, …, 4^k.

On the basis of the above embodiments, as a preferred embodiment of the present invention, the matched nucleic acid output module 306 specifically includes:

On the basis of the above examples, as a preferred embodiment of the present invention, the nucleic acid sequence is a DNA sequence or an RNA sequence.

The present invention learns a low-dimensional vector space representation of a nucleic acid sequence based on the frequencies of k-tuples in their sequence context, thereby representing a vast database of nucleic acid sequences as vectors in a low-dimensional space. The nucleic acid sequence to be searched is mapped to the low-dimensional vector space through representation learning, and accurate and rapid comparison of nucleic acid sequences is achieved by computing distances between vectors in that space, so that gigabytes of data can be queried within minutes or even seconds.

The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.

The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims

1. A method for searching a nucleic acid sequence based on k-tuple frequency, comprising:

determining a vector representation of X_ij, wherein

b_i represents the bias on k-tuple i caused by sequence background noise, and a corresponding bias term represents the bias on k-tuple j caused by sequence background noise; w_i represents the weight vector of k-tuple i when it appears before k-tuple j;

determining a vector representation of k-tuple i;

k is the tuple length and 4^k is the number of tuples;

2. The method for searching a nucleic acid sequence based on k-tuple frequency according to claim 1, wherein determining the vector representation of X_ij specifically comprises:

solving by the least squares method to obtain the vector representations of all X_ij.

3. The method for searching a nucleic acid sequence based on k-tuple frequency according to claim 1, wherein the step of calculating the degree of dissimilarity between the nucleic acid sequence to be searched after dimensionality reduction and each nucleic acid sequence in a gene database comprises:

according to

4. The method for searching a nucleic acid sequence based on k-tuple frequency according to claim 1, wherein outputting the nucleic acid sequences in the gene database with relatively low dissimilarity to the dimension-reduced nucleic acid sequence to be searched comprises:

5. The method for searching a nucleic acid sequence based on k-tuple frequency according to claim 1, wherein the nucleic acid sequence is a DNA sequence or an RNA sequence.

6. A nucleic acid sequence search system based on k-tuple frequency, comprising:

k is the tuple length and 4^k is the number of tuples;

7. The system for searching a nucleic acid sequence based on k-tuple frequency according to claim 6, wherein the X_ij vector representation determination module specifically includes:

an X_ij vector representation determination unit for solving by the least squares method to obtain the vector representations of all X_ij.

8. The system for searching a nucleic acid sequence based on k-tuple frequency according to claim 6, wherein the dissimilarity calculation module comprises:

9. The k-tuple frequency-based nucleic acid sequence search system according to claim 6, wherein the matched nucleic acid output module comprises:

10. The k-tuple frequency-based nucleic acid sequence search system according to claim 6, wherein the nucleic acid sequence is a DNA sequence or an RNA sequence.