CN111326215B - Method and system for searching nucleic acid sequence based on k-tuple frequency - Google Patents

Method and system for searching nucleic acid sequence based on k-tuple frequency

Info

Publication number
CN111326215B
Authority
CN
China
Prior art keywords
nucleic acid
acid sequence
tuple
searched
dissimilarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010083043.6A
Other languages
Chinese (zh)
Other versions
CN111326215A (en)
Inventor
王颖
白佳兴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen University
Original Assignee
Xiamen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen University
Priority to CN202010083043.6A
Publication of CN111326215A
Application granted
Publication of CN111326215B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10: Sequence alignment; Homology search
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00: ICT programming tools or database systems specially adapted for bioinformatics

Landscapes

  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biophysics (AREA)
  • Medical Informatics (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a system for searching a nucleic acid sequence based on k-tuple frequency. The method comprises the following steps: determining the frequency $X_{ij}$, $i = 1, \dots, 4^k$, $j = 1, \dots, 4^k$, with which adjacent k-tuple pairs occur in a nucleic acid sequence to be searched, where $X_{ij}$ represents the number of times k-tuple i appears before k-tuple j; determining a vector representation of each $X_{ij}$, $i = 1, \dots, 4^k$, $j = 1, \dots, 4^k$; determining, from all $X_{ij}$, a vector representation of every k-tuple, and determining a dimension-reduced representation of the nucleic acid sequence to be searched from the vector representations of all k-tuples; calculating the dissimilarity between the dimension-reduced nucleic acid sequence to be searched and each nucleic acid sequence in a gene database, where the nucleic acid sequences in the gene database have been dimension-reduced by the same method as the nucleic acid sequence to be searched; and outputting the nucleic acid sequences in the gene database that have relatively low dissimilarity to the dimension-reduced nucleic acid sequence to be searched. The invention requires little storage space and is computationally efficient.

Description

Method and system for searching nucleic acid sequence based on k-tuple frequency
Technical Field
The invention relates to the technical field of nucleic acid sequence search, in particular to a method and a system for searching a nucleic acid sequence based on k-tuple frequency.
Background
With the rapid development of sequencing technology, the biological field is generating unprecedented volumes of sequence data. One fundamental problem in many biological studies is the comparison of these sequences. Conventional sequence comparison is based on sequence alignment; however, alignment-based methods require a large amount of computing power and time, and they also rely on large reference sequences.
Alignment-free sequence comparison methods are more computationally efficient than traditional alignment-based methods and have been widely applied to the comparison of genomes and metagenomes. Existing alignment-free methods all compare sequence similarity using a vector formed from the frequencies of fixed-length k-tuples, either directly or after some normalization. However, these methods need to compute and store long k-tuples and k-tuple-based Markov models of a given order, and they need long computation times and very large storage space to give good results, which limits the application of alignment-free methods to large-scale data.
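To make the frequency-vector approach described above concrete, the following is a minimal Python sketch that builds a normalized k-tuple frequency vector. The function name `ktuple_frequency_vector` and the choice to skip k-tuples containing symbols other than A, C, G, T are assumptions made for this example only. The vector has $4^k$ entries, which illustrates why long k-tuples become expensive to compute and store.

```python
from itertools import product

import numpy as np


def ktuple_frequency_vector(seq, k):
    """Return the relative frequency of every k-tuple over A, C, G, T."""
    index = {"".join(p): n for n, p in enumerate(product("ACGT", repeat=k))}
    counts = np.zeros(4 ** k)
    seq = seq.upper()
    for start in range(len(seq) - k + 1):
        word = seq[start:start + k]
        if word in index:  # assumption: skip words with ambiguous bases such as N
            counts[index[word]] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts


# The vector length grows as 4**k: 4, 16, 64, ... for k = 1, 2, 3, ...
vec = ktuple_frequency_vector("ACGTACGT", k=2)
```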
Disclosure of Invention
The invention aims to provide a method and a system for searching nucleic acid sequences based on k-tuple frequency, which have the advantages of small required storage space and high calculation efficiency.
In order to achieve the purpose, the invention provides the following scheme:
a method for searching a nucleic acid sequence based on k-tuple frequency, comprising:
determining the frequency $X_{ij}$ with which adjacent k-tuple pairs occur in a nucleic acid sequence to be searched, wherein $X_{ij}$ represents the number of times k-tuple i appears before k-tuple j;
determining a vector representation of $X_{ij}$, wherein
Figure BDA0002380974210000011
$i = 1, \dots, 4^k$, $j = 1, \dots, 4^k$, $b_i$ represents the bias caused by sequence background noise on k-tuple i,
Figure BDA0002380974210000021
represents the bias caused by sequence background noise on k-tuple j, $w_i$ represents the weight vector of k-tuple i when it occurs before k-tuple j, and
Figure BDA0002380974210000022
represents the weight vector of k-tuple i when it occurs after k-tuple j;
determining a vector representation of k-tuple i
Figure BDA0002380974210000023
determining a dimension-reduced representation of the nucleic acid sequence to be searched
Figure BDA0002380974210000024
where k is the tuple length and $4^k$ is the number of k-tuples;
calculating the dissimilarity degree between the nucleic acid sequence to be searched after dimensionality reduction and each nucleic acid sequence in a gene database, wherein the nucleic acid sequence in the gene database is obtained after dimensionality reduction is carried out by adopting a method the same as the dimensionality reduction method of the nucleic acid sequence to be searched;
and outputting the nucleic acid sequence with relatively low dissimilarity with the nucleic acid sequence to be searched after dimensionality reduction in the gene database.
Optionally, determining the vector representation of $X_{ij}$ specifically comprises:
solving, by the least-squares method,
Figure BDA0002380974210000025
to obtain the vector representations of all $X_{ij}$
Figure BDA0002380974210000026
Optionally, the calculating the dissimilarity degree between the nucleic acid sequence to be searched after dimensionality reduction and each nucleic acid sequence in the gene database specifically includes:
according to
Figure BDA0002380974210000027
Calculating the degree of dissimilarity between the nucleic acid sequence y to be searched and the nucleic acid sequence x in the gene database,
Figure BDA0002380974210000031
Optionally, outputting the nucleic acid sequences in the gene database that have relatively low dissimilarity to the dimension-reduced nucleic acid sequence to be searched specifically comprises:
outputting the first n nucleic acid sequences in the gene database, ranked by lowest dissimilarity to the dimension-reduced nucleic acid sequence to be searched, wherein n is a preset value.
Optionally, the nucleic acid sequence is a DNA sequence or an RNA sequence.
The present invention also provides a nucleic acid sequence search system based on k-tuple frequency, comprising:
a k-tuple frequency determination module for determining the frequency $X_{ij}$ with which adjacent k-tuple pairs occur in the nucleic acid sequence to be searched, wherein $X_{ij}$ represents the number of times k-tuple i appears before k-tuple j;
an $X_{ij}$ vector representation determination module for determining a vector representation of $X_{ij}$, wherein
Figure BDA0002380974210000032
Figure BDA0002380974210000033
$i = 1, \dots, 4^k$, $j = 1, \dots, 4^k$, $b_i$ represents the bias caused by sequence background noise on k-tuple i,
Figure BDA0002380974210000034
represents the bias caused by sequence background noise on k-tuple j, $w_i$ represents the weight vector of k-tuple i when it occurs before k-tuple j, and
Figure BDA0002380974210000035
represents the weight vector of k-tuple i when it occurs after k-tuple j;
a k-tuple vector representation module for determining a vector representation of k-tuple i
Figure BDA0002380974210000036
a dimension-reduced representation module for determining a dimension-reduced representation of the nucleic acid sequence to be searched
Figure BDA0002380974210000037
where k is the tuple length and $4^k$ is the number of k-tuples;
the dissimilarity degree calculation module is used for calculating the dissimilarity degree between the nucleic acid sequence to be searched after dimension reduction and each nucleic acid sequence in the gene database, wherein the nucleic acid sequence in the gene database is obtained after dimension reduction is carried out by adopting the same method as the dimension reduction method of the nucleic acid sequence to be searched;
and the matched nucleic acid output module is used for outputting a nucleic acid sequence with relatively low dissimilarity degree with the nucleic acid sequence to be searched after dimensionality reduction in the gene database.
Optionally, the $X_{ij}$ vector representation determination module specifically comprises:
an $X_{ij}$ vector representation determination unit for solving, by the least-squares method,
Figure BDA0002380974210000041
to obtain the vector representation of $X_{ij}$
Figure BDA0002380974210000042
Optionally, the dissimilarity degree calculating module specifically includes:
a dissimilarity degree calculating unit for calculating a dissimilarity degree based on
Figure BDA0002380974210000043
Calculating the degree of dissimilarity between the nucleic acid sequence y to be searched and the nucleic acid sequence x in the gene database,
Figure BDA0002380974210000044
Optionally, the matched nucleic acid output module specifically comprises:
a matched nucleic acid output unit for outputting the first n nucleic acid sequences in the gene database, ranked by lowest dissimilarity to the dimension-reduced nucleic acid sequence to be searched, wherein n is a preset value.
Optionally, the nucleic acid sequence is a DNA sequence or an RNA sequence.
According to the specific embodiments provided by the invention, the invention discloses the following technical effects. The invention provides a method and a system for searching a nucleic acid sequence based on k-tuple frequency. The nucleic acid sequence to be searched is given a dimension-reduced representation based on how often adjacent k-tuples co-occur in the sequence; the dissimilarity between the dimension-reduced sequence to be searched and each sequence in a nucleic acid database, whose sequences are also dimension-reduced, is then calculated; and finally the nucleic acid sequences with relatively low dissimilarity to the sequence to be searched are output. Because the search compares dimension-reduced sequences, computation time is reduced and computational efficiency is improved. Moreover, because the nucleic acid database stores the dimension-reduced sequences, the storage space required by the system is reduced.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. The drawings described below show only some embodiments of the present invention, and a person skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flow chart of a method for searching a nucleic acid sequence based on k-tuple frequency according to an embodiment of the present invention;
FIG. 2 is a block diagram of a method for searching a nucleic acid sequence based on k-tuple frequency in an embodiment of the present invention;
FIG. 3 is a diagram showing a nucleic acid sequence search system based on k-tuple frequency in an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
In a first aspect, the present invention provides a method for searching a nucleic acid sequence based on k-tuple frequency, as shown in FIG. 1, the method comprising the steps of:
Step 101: determining the frequency $X_{ij}$ with which adjacent k-tuple pairs occur in a nucleic acid sequence to be searched, wherein $X_{ij}$ represents the number of times k-tuple i appears before k-tuple j;
Step 102: determining a vector representation of $X_{ij}$, wherein
Figure BDA0002380974210000051
$i = 1, \dots, 4^k$, $j = 1, \dots, 4^k$, $b_i$ represents the bias caused by sequence background noise on k-tuple i,
Figure BDA0002380974210000052
represents the bias caused by sequence background noise on k-tuple j, $w_i$ represents the weight vector of k-tuple i when it occurs before k-tuple j, and
Figure BDA0002380974210000053
represents the weight vector of k-tuple i when it occurs after k-tuple j;
Step 103: determining a vector representation of k-tuple i
Figure BDA0002380974210000061
Step 104: determining a dimension-reduced representation of the nucleic acid sequence to be searched
Figure BDA0002380974210000062
where k is the tuple length and $4^k$ is the number of k-tuples;
Step 105: calculating the dissimilarity between the dimension-reduced nucleic acid sequence to be searched and each nucleic acid sequence in a gene database, wherein the nucleic acid sequences in the gene database have been dimension-reduced by the same method as the nucleic acid sequence to be searched;
Step 106: outputting the nucleic acid sequences in the gene database that have relatively low dissimilarity to the dimension-reduced nucleic acid sequence to be searched.
On the basis of the above embodiment, as a preferred embodiment of the present invention, step 101 can be implemented by the following method:
For a nucleic acid sequence G, the whole sequence is scanned from start to end with a sliding window of length 2k, and the number of times each pair of adjacent k-tuples co-occurs in the sequence is counted, giving a k-tuple co-occurrence matrix, denoted
Figure BDA0002380974210000063
where $X_{ij}$ is the number of times k-tuple i appears before k-tuple j, $i = 1, \dots, 4^k$, $j = 1, \dots, 4^k$.
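The sliding-window count described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the function names `ktuple_index` and `cooccurrence_matrix` are hypothetical, the window advances one base at a time, k-tuples are indexed in lexicographic order over A, C, G, T, and any window containing another symbol (such as N) is skipped. For an RNA sequence the alphabet string would be "ACGU" instead.

```python
from itertools import product

import numpy as np


def ktuple_index(k):
    """Map every k-tuple over A, C, G, T to an index in 0 .. 4**k - 1."""
    return {"".join(p): n for n, p in enumerate(product("ACGT", repeat=k))}


def cooccurrence_matrix(seq, k):
    """Count adjacent k-tuple pairs with a window of length 2k and step 1."""
    index = ktuple_index(k)
    X = np.zeros((4 ** k, 4 ** k), dtype=np.int64)
    seq = seq.upper()
    for start in range(len(seq) - 2 * k + 1):
        left = seq[start:start + k]            # k-tuple i (first half of the window)
        right = seq[start + k:start + 2 * k]   # k-tuple j (second half of the window)
        # Assumption: windows with symbols outside A, C, G, T are ignored.
        if left in index and right in index:
            X[index[left], index[right]] += 1
    return X


X = cooccurrence_matrix("ACGTAC", k=1)
```

In this toy call the pair (A, C) is counted twice while (C, A) is never seen, so the matrix is not symmetric in general.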
On the basis of the above embodiment, as a preferred implementation manner of the present invention, step 102 may be implemented by the following method:
solving, by the least-squares method,
Figure BDA0002380974210000064
to obtain the dimension-reduced representation of $X_{ij}$
Figure BDA0002380974210000071
where $\langle \cdot, \cdot \rangle$ denotes the inner product of two vectors, i.e.
Figure BDA0002380974210000072
$X_{ij}$ is log-transformed to avoid the skewed distribution of the non-negative co-occurrence counts. It should be noted that, because
Figure BDA0002380974210000073
and
Figure BDA0002380974210000074
are different, $X_{ij}$ and $X_{ji}$ are asymmetric.
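A least-squares fit of this kind can be sketched as follows. The exact objective is given only in the equation images and is not reproduced here; the sketch assumes an unweighted squared error between $w_i \cdot \tilde{w}_j + b_i + \tilde{b}_j$ and $\log X_{ij}$ over the pairs with $X_{ij} > 0$. The function name `fit_ktuple_vectors`, the symbols `Wt` and `bt` for the second set of weights and biases, the embedding dimension, and the use of plain gradient descent with a fixed learning rate are all illustrative assumptions.

```python
import numpy as np


def fit_ktuple_vectors(X, dim=2, steps=2000, lr=0.01, seed=0):
    """Fit w_i . w~_j + b_i + b~_j to log(X_ij) over the observed pairs.

    Assumed objective (not taken from the patent's equation images):
    the unweighted sum of squared residuals over all pairs with X_ij > 0.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(n, dim))    # w_i: k-tuple i in the "before" role
    Wt = rng.normal(scale=0.1, size=(n, dim))   # w~_j: k-tuple j in the "after" role
    b = np.zeros(n)                             # b_i
    bt = np.zeros(n)                            # b~_j
    mask = X > 0                                # log is only defined for positive counts
    logX = np.zeros_like(X)
    logX[mask] = np.log(X[mask])
    losses = []
    for _ in range(steps):
        resid = (W @ Wt.T + b[:, None] + bt[None, :] - logX) * mask
        losses.append(float((resid ** 2).sum()))
        # Gradients of the squared error, computed before any parameter is updated.
        gW, gWt = 2 * resid @ Wt, 2 * resid.T @ W
        gb, gbt = 2 * resid.sum(axis=1), 2 * resid.sum(axis=0)
        W -= lr * gW
        Wt -= lr * gWt
        b -= lr * gb
        bt -= lr * gbt
    return W, Wt, b, bt, losses


counts = np.array([[4, 2, 1, 0], [1, 3, 2, 1], [2, 1, 5, 1], [0, 2, 1, 3]])
W, Wt, b, bt, losses = fit_ktuple_vectors(counts)
```

Because `W @ Wt.T` is not a symmetric matrix, the model can fit $X_{ij} \neq X_{ji}$, which matches the asymmetry noted above. How the two weight vectors of a k-tuple are combined into its final vector is defined in the equation images and is deliberately left out of this sketch.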
On the basis of the above example, as a preferred embodiment of the present invention, step 105 can be implemented by the following method:
according to
Figure BDA0002380974210000075
$i = 1, \dots, 4^k$, $j = 1, \dots, 4^k$, calculating the dissimilarity between the nucleic acid sequence y to be searched and the nucleic acid sequence x in the gene database,
Figure BDA0002380974210000076
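The description states that the comparison uses the Manhattan distance between the query sequence and the database sequences. A minimal sketch of such a dissimilarity follows, assuming each sequence has already been reduced to a numeric vector of the same length. The exact formula is in the equation images and is not reproduced here, and the function name `manhattan_dissimilarity` is hypothetical.

```python
import numpy as np


def manhattan_dissimilarity(v_x, v_y):
    """Sum of absolute coordinate differences between two reduced vectors."""
    v_x = np.asarray(v_x, dtype=float)
    v_y = np.asarray(v_y, dtype=float)
    if v_x.shape != v_y.shape:
        raise ValueError("both sequences must be reduced to the same shape")
    return float(np.abs(v_x - v_y).sum())


d = manhattan_dissimilarity([1, 2, 3], [2, 0, 3])  # |1-2| + |2-0| + |3-3|
```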
On the basis of the above embodiment, as a preferred implementation manner of the present invention, step 106 can be implemented by the following method:
outputting the first n nucleic acid sequences in the gene database, ranked by lowest dissimilarity to the dimension-reduced nucleic acid sequence to be searched, wherein n is a preset value that is set manually.
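Selecting the first n matches can be sketched as below, assuming the database is held as a matrix with one reduced vector per row and that dissimilarity is the Manhattan distance. The function name `top_n_matches` and the argument names are hypothetical, and ties are resolved by database order only because a stable sort is used here.

```python
import numpy as np


def top_n_matches(query_vec, db_vecs, db_ids, n):
    """Return the n database entries with the lowest Manhattan dissimilarity."""
    db = np.asarray(db_vecs, dtype=float)      # shape: (number of sequences, vector length)
    q = np.asarray(query_vec, dtype=float)     # shape: (vector length,)
    dists = np.abs(db - q).sum(axis=1)         # one dissimilarity per database row
    order = np.argsort(dists, kind="stable")[:n]
    return [(db_ids[i], float(dists[i])) for i in order]


matches = top_n_matches([1, 1], [[0, 0], [1, 1], [5, 5]], ["a", "b", "c"], n=2)
```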
On the basis of the above embodiments, as a preferred embodiment of the present invention, the nucleic acid sequence is a DNA sequence or an RNA sequence. In addition, the present invention is also applicable to transcriptomes, metagenomes, and metatranscriptomes.
As shown in FIG. 2, the present invention comprises two key processing steps: low-dimensional vector space representation of nucleic acid sequences and dissimilarity calculation. Given a genomic nucleic acid sequence as input, a low-dimensional representation of the sequence is learned by optimizing vector inner products, which maps the sequence to a low-dimensional space. Adjacent k-tuple pairs that co-occur as context in the nucleic acid sequence are counted using a sliding window with a step size of 1 bp (one base). Based on the counts of the k-tuple pairs, each k-tuple is converted into a vector by fitting inner products of vectors plus bias terms, and each sequence in a massive nucleic acid database is compressed to a point in the vector space. Thereafter, for newly input query sequencing data or a long DNA sequence, the low-dimensional vector representation of each k-tuple is learned in the same way, and the set of database sequences most similar to the query is obtained by calculating the Manhattan distance between the query sequence and every sequence in the database.
In a second aspect, the present invention provides a nucleic acid sequence search system based on k-tuple frequency, as shown in FIG. 3, the system comprising:
a k-tuple frequency determination module 301 for determining the frequency $X_{ij}$ with which adjacent k-tuple pairs occur in the nucleic acid sequence to be searched, wherein $X_{ij}$ represents the number of times k-tuple i appears before k-tuple j;
an $X_{ij}$ vector representation determination module 302 for determining a vector representation of $X_{ij}$, wherein
Figure BDA0002380974210000081
Figure BDA0002380974210000082
$i = 1, \dots, 4^k$, $j = 1, \dots, 4^k$, $b_i$ represents the bias caused by sequence background noise on k-tuple i,
Figure BDA0002380974210000083
represents the bias caused by sequence background noise on k-tuple j, $w_i$ represents the weight vector of k-tuple i when it occurs before k-tuple j, and
Figure BDA0002380974210000084
represents the weight vector of k-tuple i when it occurs after k-tuple j;
a k-tuple vector representation module 303 for determining a vector representation of k-tuple i
Figure BDA0002380974210000085
a dimension-reduced representation module 304 for determining a dimension-reduced representation of the nucleic acid sequence to be searched
Figure BDA0002380974210000086
where k is the tuple length and $4^k$ is the number of k-tuples;
a dissimilarity degree calculating module 305, configured to calculate dissimilarity degrees between the nucleic acid sequence to be searched after dimensionality reduction and each nucleic acid sequence in a gene database, where the nucleic acid sequence in the gene database is a nucleic acid sequence obtained after dimensionality reduction by using a method the same as that for the nucleic acid sequence to be searched;
and a matched nucleic acid output module 306, configured to output the nucleic acid sequences in the gene database that have relatively low dissimilarity to the dimension-reduced nucleic acid sequence to be searched.
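The way these modules fit together can be sketched as a small Python class. This is only an architectural illustration under stated assumptions: the class name `SequenceSearchSystem` and its methods are hypothetical, the reduction step is injected as a callable because the patent's own representation is defined in the equation images, and `base_composition` is a stand-in reducer used purely so the example runs. The point it shows is that the database keeps only reduced vectors, not the raw sequences.

```python
from typing import Callable, Dict, List, Tuple

import numpy as np


class SequenceSearchSystem:
    """Stores only reduced vectors and answers top-n queries."""

    def __init__(self, reduce_sequence: Callable[[str], np.ndarray]):
        self._reduce = reduce_sequence               # plays the role of modules 301 to 304
        self._vectors: Dict[str, np.ndarray] = {}    # the dimension-reduced database

    def add(self, seq_id: str, sequence: str) -> None:
        self._vectors[seq_id] = np.asarray(self._reduce(sequence), dtype=float)

    def search(self, query: str, n: int) -> List[Tuple[str, float]]:
        q = np.asarray(self._reduce(query), dtype=float)
        # Dissimilarity calculation (module 305), here the Manhattan distance.
        scored = [(sid, float(np.abs(v - q).sum())) for sid, v in self._vectors.items()]
        # Matched nucleic acid output (module 306): lowest dissimilarity first.
        scored.sort(key=lambda item: item[1])
        return scored[:n]


def base_composition(sequence: str) -> np.ndarray:
    """Stand-in reducer for the demo only; NOT the patent's representation."""
    s = sequence.upper()
    return np.array([s.count(base) for base in "ACGT"], dtype=float) / max(len(s), 1)


system = SequenceSearchSystem(base_composition)
system.add("s1", "AAAA")
system.add("s2", "ACGT")
system.add("s3", "CCCC")
hits = system.search("AAAC", n=2)
```

Injecting the reducer keeps the query path and the database path on the same reduction method, which is the consistency requirement stated for module 305.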
On the basis of the above embodiment, as a preferred embodiment of the present invention, the $X_{ij}$ vector representation determination module 302 specifically comprises:
an $X_{ij}$ vector representation determination unit for solving, by the least-squares method,
Figure BDA0002380974210000091
to obtain the dimension-reduced representation of $X_{ij}$
Figure BDA0002380974210000092
In addition to the above embodiments, as a preferred embodiment of the present invention, the dissimilarity degree calculating module 305 specifically includes:
a dissimilarity degree calculating unit for calculating a dissimilarity degree based on
Figure BDA0002380974210000093
Calculating the degree of dissimilarity between the nucleic acid sequence y to be searched and the nucleic acid sequence x in the gene database,
Figure BDA0002380974210000094
$i = 1, \dots, 4^k$, $j = 1, \dots, 4^k$.
On the basis of the above embodiments, as a preferred embodiment of the present invention, the matched nucleic acid output module 306 specifically comprises:
a matched nucleic acid output unit for outputting the first n nucleic acid sequences in the gene database, ranked by lowest dissimilarity to the dimension-reduced nucleic acid sequence to be searched, wherein n is a preset value.
On the basis of the above examples, as a preferred embodiment of the present invention, the nucleic acid sequence is a DNA sequence or an RNA sequence.
The present invention learns a low-dimensional vector space representation of a nucleic acid sequence based on the frequency of k-tuples in their sequence context, so that a vast database of nucleic acid sequences is represented as vectors in a low-dimensional space. The nucleic acid sequence to be searched is mapped to the same low-dimensional vector space through representation learning, and nucleic acid sequences are compared accurately and rapidly by computing distances between vectors in that space, so that gigabytes of data can be queried within minutes or even seconds.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims (10)

1. A method for searching a nucleic acid sequence based on k-tuple frequency, comprising:
determining the frequency $X_{ij}$ with which adjacent k-tuple pairs occur in a nucleic acid sequence to be searched, wherein $X_{ij}$ represents the number of times k-tuple i appears before k-tuple j;
determining a vector representation of $X_{ij}$, wherein
Figure FDA0002380974200000017
Figure FDA0002380974200000018
$b_i$ represents the bias caused by sequence background noise on k-tuple i,
Figure FDA0002380974200000012
represents the bias caused by sequence background noise on k-tuple j, $w_i$ represents the weight vector of k-tuple i when it occurs before k-tuple j, and
Figure FDA0002380974200000013
represents the weight vector of k-tuple i when it occurs after k-tuple j;
determining a vector representation of k-tuple i
Figure FDA0002380974200000014
determining a dimension-reduced representation of the nucleic acid sequence to be searched
Figure FDA0002380974200000015
where k is the tuple length and $4^k$ is the number of k-tuples;
calculating the dissimilarity degree between the nucleic acid sequence to be searched after dimensionality reduction and each nucleic acid sequence in a gene database, wherein the nucleic acid sequence in the gene database is obtained after dimensionality reduction is carried out by adopting a method the same as the dimensionality reduction method of the nucleic acid sequence to be searched;
and outputting the nucleic acid sequence with relatively low dissimilarity with the nucleic acid sequence to be searched after dimensionality reduction in the gene database.
2. The method for searching a nucleic acid sequence based on k-tuple frequency according to claim 1, wherein determining the vector representation of $X_{ij}$ specifically comprises:
solving, by the least-squares method,
Figure FDA0002380974200000016
to obtain the vector representations of all $X_{ij}$
Figure FDA0002380974200000021
3. The method for searching a nucleic acid sequence based on k-tuple frequency according to claim 1, wherein the step of calculating the degree of dissimilarity between the nucleic acid sequence to be searched after dimensionality reduction and each nucleic acid sequence in a gene database comprises:
according to
Figure FDA0002380974200000022
Calculating the degree of dissimilarity between the nucleic acid sequence y to be searched and the nucleic acid sequence x in the gene database,
Figure FDA0002380974200000023
4. The method for searching a nucleic acid sequence based on k-tuple frequency according to claim 1, wherein outputting the nucleic acid sequences in the gene database that have relatively low dissimilarity to the dimension-reduced nucleic acid sequence to be searched comprises:
outputting the first n nucleic acid sequences in the gene database, ranked by lowest dissimilarity to the dimension-reduced nucleic acid sequence to be searched, wherein n is a preset value.
5. The method for searching a nucleic acid sequence based on k-tuple frequency according to claim 1, wherein the nucleic acid sequence is a DNA sequence or an RNA sequence.
6. A nucleic acid sequence search system based on k-tuple frequency, comprising:
a k-tuple frequency determination module for determining the frequency $X_{ij}$ with which adjacent k-tuple pairs occur in the nucleic acid sequence to be searched, wherein $X_{ij}$ represents the number of times k-tuple i appears before k-tuple j;
an $X_{ij}$ vector representation determination module for determining a vector representation of $X_{ij}$, wherein
Figure FDA0002380974200000024
Figure FDA0002380974200000025
$b_i$ represents the bias caused by sequence background noise on k-tuple i,
Figure FDA0002380974200000026
represents the bias caused by sequence background noise on k-tuple j, $w_i$ represents the weight vector of k-tuple i when it occurs before k-tuple j, and
Figure FDA0002380974200000031
represents the weight vector of k-tuple i when it occurs after k-tuple j;
a k-tuple vector representation module for determining a vector representation of k-tuple i
Figure FDA0002380974200000032
a dimension-reduced representation module for determining a dimension-reduced representation of the nucleic acid sequence to be searched
Figure FDA0002380974200000033
where k is the tuple length and $4^k$ is the number of k-tuples;
the dissimilarity degree calculation module is used for calculating the dissimilarity degree between the nucleic acid sequence to be searched after dimension reduction and each nucleic acid sequence in the gene database, wherein the nucleic acid sequence in the gene database is obtained after dimension reduction is carried out by adopting the same method as the dimension reduction method of the nucleic acid sequence to be searched;
and the matched nucleic acid output module is used for outputting a nucleic acid sequence with relatively low dissimilarity degree with the nucleic acid sequence to be searched after dimensionality reduction in the gene database.
7. The system for searching a nucleic acid sequence based on k-tuple frequency according to claim 6, wherein the $X_{ij}$ vector representation determination module specifically comprises:
an $X_{ij}$ vector representation determination unit for solving, by the least-squares method,
Figure FDA0002380974200000034
to obtain the vector representations of all $X_{ij}$
Figure FDA0002380974200000035
8. The system for searching a nucleic acid sequence based on k-tuple frequency according to claim 6, wherein the dissimilarity calculation module comprises:
a dissimilarity degree calculating unit for calculating a dissimilarity degree based on
Figure FDA0002380974200000036
Calculating the degree of dissimilarity between the nucleic acid sequence y to be searched and the nucleic acid sequence x in the gene database,
Figure FDA0002380974200000041
9. The k-tuple frequency-based nucleic acid sequence search system according to claim 6, wherein the matched nucleic acid output module comprises:
a matched nucleic acid output unit for outputting the first n nucleic acid sequences in the gene database, ranked by lowest dissimilarity to the dimension-reduced nucleic acid sequence to be searched, wherein n is a preset value.
10. The k-tuple frequency-based nucleic acid sequence search system according to claim 6, wherein the nucleic acid sequence is a DNA sequence or an RNA sequence.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010083043.6A CN111326215B (en) 2020-02-07 2020-02-07 Method and system for searching nucleic acid sequence based on k-tuple frequency

Publications (2)

Publication Number Publication Date
CN111326215A CN111326215A (en) 2020-06-23
CN111326215B true CN111326215B (en) 2022-04-29

Family

ID=71168902

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010083043.6A Active CN111326215B (en) 2020-02-07 2020-02-07 Method and system for searching nucleic acid sequence based on k-tuple frequency

Country Status (1)

Country Link
CN (1) CN111326215B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112863593B (en) * 2021-02-05 2024-02-20 Xiamen University Identification feature extraction method and system based on skin metagenome data
CN113921082B (en) * 2021-10-27 2023-04-07 云舟生物科技(广州)股份有限公司 Gene search weight adjustment method, computer storage medium, and electronic device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2002027024A2 (en) * 2000-09-28 2002-04-04 Office Of The Staff Judge Advocate U.S. Army Medical Research And Material Command Automated method of identifying and archiving nucleic acid sequences
CN1598821A (en) * 2004-09-07 2005-03-23 Southeast University Searching method of genome sequence data based on characteristic
CN105950707A (en) * 2016-03-30 2016-09-21 广州精科生物技术有限公司 Method and system for determining nucleic acid sequence
CN106202999A (en) * 2016-07-21 2016-12-07 Xiamen University Microorganism high-throughput sequencing data analysis protocol based on different-scale tuple word frequency

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Comparison of Metatranscriptomic Samples Based on k-Tuple Frequencies; Wang Ying et al.; PLOS ONE; 20140102; Vol. 9 (No. 1); pp. 1-19 *
Effect of k-tuple length on sample-comparison with high-throughput sequencing data; Wang Ying et al.; Biochemical and Biophysical Research Communications; 20160122; Vol. 469 (No. 4); pp. 1021-1027 *
Microbial community based on k-tuple frequency statistics (基于k-tuple频度统计的微生物群落); Liu Lin; China Excellent Doctoral and Master's Theses Full-text Database (Master's), Basic Sciences; 20140815 (No. 8); pp. A006-197 *

Also Published As

Publication number Publication date
CN111326215A (en) 2020-06-23

Similar Documents

Publication Publication Date Title
US20200058374A1 (en) Systems and methods for adaptive local alignment for graph genomes
EP2499569B1 (en) Clustering method and system
Sato et al. RNA secondary structural alignment with conditional random fields
CN105183833B (en) Microblog text recommendation method and device based on user model
WO2015081754A1 (en) Genome compression and decompression
CN111326215B (en) Method and system for searching nucleic acid sequence based on k-tuple frequency
CN110597844B (en) Unified access method for heterogeneous database data and related equipment
CN108399268B (en) Incremental heterogeneous graph clustering method based on game theory
CN111026877A (en) Knowledge verification model construction and analysis method based on probability soft logic
Saini et al. Probabilistic expression of spatially varied amino acid dimers into general form of Chou׳ s pseudo amino acid composition for protein fold recognition
CN113722512A (en) Text retrieval method, device and equipment based on language model and storage medium
EP3703061A1 (en) Image retrieval
CN111309930A (en) Medical knowledge graph entity alignment method based on representation learning
CN112818091A (en) Object query method, device, medium and equipment based on keyword extraction
Ni et al. Applying frequency chaos game representation with perceptual image hashing to gene sequence phylogenetic analyses
CN117235137B (en) Professional information query method and device based on vector database
CN114266249A (en) Mass text clustering method based on birch clustering
CN118095278A (en) Co-reference resolution document level relation extraction method based on pre-training model
CN110377721B (en) Automatic question answering method, device, storage medium and electronic equipment
CN105162648B (en) Corporations' detection method based on backbone network extension
US20200142910A1 (en) Data clustering apparatus and method based on range query using cf tree
CN115357691A (en) Semantic retrieval method, system, equipment and computer readable storage medium
CN112632287B (en) Electric power knowledge graph construction method and device
CN107608972B (en) Multi-text quick summarization method
CN111275201A (en) Sub-graph division based distributed implementation method for semi-supervised learning of graph

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
OL01 Intention to license declared