CN111326215B - Method and system for searching nucleic acid sequence based on k-tuple frequency - Google Patents
- Publication number
- CN111326215B (application number CN202010083043.6A)
- Authority
- CN
- China
- Prior art keywords
- nucleic acid
- acid sequence
- tuple
- searched
- dissimilarity
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/10—Sequence alignment; Homology search
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B50/00—ICT programming tools or database systems specially adapted for bioinformatics
Landscapes
- Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Engineering & Computer Science (AREA)
- Evolutionary Biology (AREA)
- General Health & Medical Sciences (AREA)
- Theoretical Computer Science (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Biotechnology (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Biophysics (AREA)
- Medical Informatics (AREA)
- Bioethics (AREA)
- Databases & Information Systems (AREA)
- Chemical & Material Sciences (AREA)
- Analytical Chemistry (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method and a system for searching a nucleic acid sequence based on k-tuple frequency. The method comprises the following steps: determining the frequency $X_{ij}$, $i = 1, \dots, 4^k$, $j = 1, \dots, 4^k$, of occurrence of adjacent k-tuple pairs in a nucleic acid sequence to be searched, wherein $X_{ij}$ represents the number of times k-tuple i appears before k-tuple j; determining a vector representation of $X_{ij}$, $i = 1, \dots, 4^k$, $j = 1, \dots, 4^k$; determining a vector representation of all tuples according to all $X_{ij}$, and determining a dimension-reduced representation of the nucleic acid sequence to be searched according to the vector representation of all tuples; calculating the dissimilarity between the dimension-reduced nucleic acid sequence to be searched and each nucleic acid sequence in a gene database, wherein each nucleic acid sequence in the gene database has been dimension-reduced by the same method as the nucleic acid sequence to be searched; and outputting the nucleic acid sequences in the gene database that have relatively low dissimilarity with the dimension-reduced nucleic acid sequence to be searched. The invention requires little storage space and is computationally efficient.
Description
Technical Field
The invention relates to the technical field of nucleic acid sequence search, in particular to a method and a system for searching a nucleic acid sequence based on k-tuple frequency.
Background
With the rapid development of sequencing technology, the biological field is generating unprecedented amounts of sequence data. One fundamental problem in many biological studies is the comparison of these sequences. Conventional sequence comparison is based on sequence alignment; however, alignment-based methods require a large amount of computing power and time, and also rely on large reference sequences.
Alignment-free sequence comparison methods are more computationally efficient than traditional alignment-based methods and have been widely applied to the comparison of genomes and metagenomes. Existing alignment-free methods all rely on the frequencies of fixed-length k-tuples, which form a vector, either directly or after some normalization, that is used to compare sequence similarity. However, these methods require computing and storing long k-tuples and k-tuple-based Markov-order models, and need long computation times and huge storage space to obtain good results, which limits the application of alignment-free methods to large-scale data.
Disclosure of Invention
The invention aims to provide a method and a system for searching nucleic acid sequences based on k-tuple frequency, which have the advantages of small required storage space and high calculation efficiency.
In order to achieve the purpose, the invention provides the following scheme:
a method for searching a nucleic acid sequence based on k-tuple frequency, comprising:
determining the frequency $X_{ij}$ of occurrence of adjacent k-tuple pairs in a nucleic acid sequence to be searched, wherein $X_{ij}$ represents the number of times k-tuple i appears before k-tuple j;
determining a vector representation of $X_{ij}$, wherein $i = 1, \dots, 4^k$ and $j = 1, \dots, 4^k$; $b_i$ represents the bias that sequence background noise causes for k-tuple i, and a second bias term represents the bias that sequence background noise causes for k-tuple j; $w_i$ represents the weight vector used when k-tuple i occurs before k-tuple j, and a second weight vector is used when k-tuple i appears after k-tuple j;
determining a dimension-reduced representation of the nucleic acid sequence to be searched, wherein k is the tuple length and $4^k$ is the number of tuples;
calculating the dissimilarity degree between the nucleic acid sequence to be searched after dimensionality reduction and each nucleic acid sequence in a gene database, wherein the nucleic acid sequence in the gene database is obtained after dimensionality reduction is carried out by adopting a method the same as the dimensionality reduction method of the nucleic acid sequence to be searched;
and outputting the nucleic acid sequence with relatively low dissimilarity with the nucleic acid sequence to be searched after dimensionality reduction in the gene database.
Optionally, said determining a vector representation of $X_{ij}$ specifically comprises:
Optionally, said calculating the dissimilarity between the dimension-reduced nucleic acid sequence to be searched and each nucleic acid sequence in the gene database specifically comprises:
calculating the degree of dissimilarity between the nucleic acid sequence y to be searched and the nucleic acid sequence x in the gene database.
Optionally, said outputting the nucleic acid sequences in the gene database that have relatively low dissimilarity with the dimension-reduced nucleic acid sequence to be searched specifically comprises:
outputting the n nucleic acid sequences in the gene database with the lowest dissimilarity to the dimension-reduced nucleic acid sequence to be searched, wherein n is a preset value.
Optionally, the nucleic acid sequence is a DNA sequence or an RNA sequence.
The present invention also provides a nucleic acid sequence search system based on k-tuple frequency, comprising:
a k-tuple frequency determination module for determining the frequency $X_{ij}$ of occurrence of adjacent k-tuple pairs in the nucleic acid sequence to be searched, wherein $X_{ij}$ represents the number of times k-tuple i appears before k-tuple j;
an $X_{ij}$ vector representation determination module for determining a vector representation of $X_{ij}$, wherein $i = 1, \dots, 4^k$ and $j = 1, \dots, 4^k$; $b_i$ represents the bias that sequence background noise causes for k-tuple i, and a second bias term represents the bias that sequence background noise causes for k-tuple j; $w_i$ represents the weight vector used when k-tuple i occurs before k-tuple j, and a second weight vector is used when k-tuple i appears after k-tuple j;
a dimension-reduced representation module for determining a dimension-reduced representation of the nucleic acid sequence to be searched, wherein k is the tuple length and $4^k$ is the number of tuples;
the dissimilarity degree calculation module is used for calculating the dissimilarity degree between the nucleic acid sequence to be searched after dimension reduction and each nucleic acid sequence in the gene database, wherein the nucleic acid sequence in the gene database is obtained after dimension reduction is carried out by adopting the same method as the dimension reduction method of the nucleic acid sequence to be searched;
and the matched nucleic acid output module is used for outputting a nucleic acid sequence with relatively low dissimilarity degree with the nucleic acid sequence to be searched after dimensionality reduction in the gene database.
Optionally, the $X_{ij}$ vector representation determination module specifically includes:
an $X_{ij}$ vector representation determination unit for solving by the least squares method to obtain the vector representation of $X_{ij}$.
Optionally, the dissimilarity degree calculating module specifically includes:
a dissimilarity degree calculating unit for calculating the degree of dissimilarity between the nucleic acid sequence y to be searched and the nucleic acid sequence x in the gene database.
Optionally, the matched nucleic acid output module specifically includes:
a matched nucleic acid output unit for outputting the n nucleic acid sequences in the gene database with the lowest dissimilarity to the dimension-reduced nucleic acid sequence to be searched, wherein n is a preset value.
Optionally, the nucleic acid sequence is a DNA sequence or an RNA sequence.
According to the specific embodiments provided by the invention, the invention discloses the following technical effects. The invention provides a method and a system for searching a nucleic acid sequence based on k-tuple frequency. The nucleic acid sequence to be searched is given a dimension-reduced representation based on how often contextually adjacent k-tuples co-occur in the sequence; the dissimilarity between the dimension-reduced query sequence and each nucleic acid sequence in a nucleic acid database, which is likewise stored in dimension-reduced form, is then calculated; finally, the nucleic acid sequences with relatively low dissimilarity to the query are output. Because the comparison is carried out on dimension-reduced sequences, computation time is reduced and computational efficiency is improved. Moreover, because the nucleic acid database stores dimension-reduced sequences, the storage space required by the system is reduced.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flow chart of a method for searching a nucleic acid sequence based on k-tuple frequency according to an embodiment of the present invention;
FIG. 2 is a block diagram of a method for searching a nucleic acid sequence based on k-tuple frequency in an embodiment of the present invention;
FIG. 3 is a diagram showing a nucleic acid sequence search system based on k-tuple frequency in an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
In a first aspect, the present invention provides a method for searching a nucleic acid sequence based on k-tuple frequency, as shown in FIG. 1, the method comprising the steps of:
Step 101: determining the frequency $X_{ij}$ of occurrence of adjacent k-tuple pairs in a nucleic acid sequence to be searched, wherein $X_{ij}$ represents the number of times k-tuple i appears before k-tuple j;
Step 102: determining a vector representation of $X_{ij}$, wherein $i = 1, \dots, 4^k$ and $j = 1, \dots, 4^k$; $b_i$ represents the bias that sequence background noise causes for k-tuple i, and a second bias term represents the bias that sequence background noise causes for k-tuple j; $w_i$ represents the weight vector used when k-tuple i occurs before k-tuple j, and a second weight vector is used when k-tuple i appears after k-tuple j;
Step 104: determining a dimension-reduced representation of the nucleic acid sequence to be searched, wherein k is the tuple length and $4^k$ is the number of tuples;
Step 105: calculating the dissimilarity between the dimension-reduced nucleic acid sequence to be searched and each nucleic acid sequence in a gene database, wherein each nucleic acid sequence in the gene database has been dimension-reduced by the same method as the nucleic acid sequence to be searched;
Step 106: outputting the nucleic acid sequences in the gene database that have relatively low dissimilarity with the dimension-reduced nucleic acid sequence to be searched.
On the basis of the above embodiment, as a preferred embodiment of the present invention, step 101 can be implemented by the following method:
For a nucleic acid sequence G, the whole sequence is scanned from beginning to end with a sliding window of length 2k, and the number of times each pair of adjacent k-tuples co-occurs in the sequence is counted. This gives a k-tuple co-occurrence matrix whose entry $X_{ij}$ indicates the number of times k-tuple i appears before k-tuple j, $i = 1, \dots, 4^k$, $j = 1, \dots, 4^k$.
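The counting in step 101 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `kmer_index` and `cooccurrence_matrix`, the lexicographic ordering of k-tuples, and the choice to skip windows containing characters outside the alphabet (such as N) are all assumptions made for the example.

```python
from itertools import product

import numpy as np


def kmer_index(k, alphabet="ACGT"):
    """Map every k-tuple over the alphabet to an index in 0 .. 4**k - 1."""
    return {"".join(p): n for n, p in enumerate(product(alphabet, repeat=k))}


def cooccurrence_matrix(seq, k, alphabet="ACGT"):
    """Count how often k-tuple i is immediately followed by k-tuple j.

    A window of length 2k slides over the sequence one base at a time;
    its first half is k-tuple i and its second half is k-tuple j.
    Windows containing characters outside the alphabet are skipped (assumption).
    """
    index = kmer_index(k, alphabet)
    X = np.zeros((len(index), len(index)), dtype=np.int64)
    seq = seq.upper()
    for start in range(len(seq) - 2 * k + 1):
        i = index.get(seq[start:start + k])
        j = index.get(seq[start + k:start + 2 * k])
        if i is not None and j is not None:
            X[i, j] += 1
    return X


# Toy example: with k = 1 the pairs are AC, CG, GT, TA, AC, CG, GT.
X = cooccurrence_matrix("ACGTACGT", k=1)
```

For an RNA sequence the same sketch can be called with `alphabet="ACGU"`. The matrix has $4^k \times 4^k$ entries, so a dense array is only practical for small k.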
On the basis of the above embodiment, as a preferred implementation manner of the present invention, step 102 may be implemented by the following method:
The vector representation is solved by the least squares method to obtain the dimension-reduced representation of $X_{ij}$, where $\langle \cdot, \cdot \rangle$ denotes the inner product between two vectors. $X_{ij}$ is log-transformed to avoid the skewed distribution of the non-negative co-occurrence counts. Note that because the weight vector a k-tuple takes when it precedes its neighbour differs from the one it takes when it follows, $X_{ij}$ and $X_{ji}$ are asymmetric.
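The exact objective of step 102 is not reproduced in this text, so the sketch below rests on an assumption: that the fit has the familiar form in which the inner product of two k-tuple vectors plus two bias terms approximates $\log X_{ij}$, minimized in the least-squares sense. The function name `fit_ktuple_vectors`, the names `W_ctx` and `b_ctx` for the second weight vectors and biases, the embedding dimension, the plain gradient-descent solver, and the decision to fit only pairs with non-zero counts are all hypothetical choices for illustration.

```python
import numpy as np


def fit_ktuple_vectors(X, dim=8, epochs=300, lr=0.1, seed=0):
    """Least-squares fit of <w_i, w_ctx_j> + b_i + b_ctx_j to log(X_ij).

    Only pairs with X_ij > 0 enter the fit (assumption: log(0) is undefined).
    Returns both vector tables, both bias vectors and the loss history.
    """
    X = np.asarray(X)
    rows, cols = np.nonzero(X)
    if rows.size == 0:
        raise ValueError("co-occurrence matrix has no non-zero entries")
    target = np.log(X[rows, cols])

    rng = np.random.default_rng(seed)
    n = X.shape[0]
    W = rng.normal(scale=0.1, size=(n, dim))      # vectors for the leading k-tuple
    W_ctx = rng.normal(scale=0.1, size=(n, dim))  # vectors for the following k-tuple
    b = np.zeros(n)
    b_ctx = np.zeros(n)

    losses = []
    step = 2.0 * lr / rows.size  # learning rate times the factor from d/dx mean(err**2)
    for _ in range(epochs):
        pred = np.einsum("nd,nd->n", W[rows], W_ctx[cols]) + b[rows] + b_ctx[cols]
        err = pred - target
        losses.append(float(np.mean(err ** 2)))

        # Accumulate per-k-tuple gradients over all observed pairs.
        grad_W = np.zeros_like(W)
        grad_W_ctx = np.zeros_like(W_ctx)
        grad_b = np.zeros(n)
        grad_b_ctx = np.zeros(n)
        np.add.at(grad_W, rows, err[:, None] * W_ctx[cols])
        np.add.at(grad_W_ctx, cols, err[:, None] * W[rows])
        np.add.at(grad_b, rows, err)
        np.add.at(grad_b_ctx, cols, err)

        W -= step * grad_W
        W_ctx -= step * grad_W_ctx
        b -= step * grad_b
        b_ctx -= step * grad_b_ctx
    return W, W_ctx, b, b_ctx, losses


# Stand-in for a k = 2 co-occurrence matrix (16 x 16), random counts for the demo.
demo_rng = np.random.default_rng(1)
X_demo = demo_rng.integers(1, 50, size=(16, 16))
W, W_ctx, b, b_ctx, losses = fit_ktuple_vectors(X_demo)
```

Keeping two vector tables is what allows the fitted value for the pair (i, j) to differ from that for (j, i), matching the asymmetry noted in the text.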
On the basis of the above example, as a preferred embodiment of the present invention, step 105 can be implemented by the following method:
calculating the degree of dissimilarity between the nucleic acid sequence y to be searched and the nucleic acid sequence x in the gene database, with $i = 1, \dots, 4^k$ and $j = 1, \dots, 4^k$.
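The dissimilarity formula itself is not reproduced in this text, but the description later names the Manhattan distance between the query and each database sequence. A minimal sketch under that reading is given below; the function name `manhattan_dissimilarity` and the choice to flatten each dimension-reduced representation into one vector before comparison are assumptions.

```python
import numpy as np


def manhattan_dissimilarity(x, y):
    """Sum of absolute element-wise differences between two representations.

    Both inputs are flattened first, so matrices of equal shape are accepted.
    """
    x = np.asarray(x, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()
    if x.shape != y.shape:
        raise ValueError("representations must have the same number of elements")
    return float(np.abs(x - y).sum())


# Toy representations: |1-0| + |2-2| + |3-5| + |4-1| = 6
d = manhattan_dissimilarity([[1, 2], [3, 4]], [[0, 2], [5, 1]])
```

A value of 0 means the two representations are identical; larger values mean greater dissimilarity.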
On the basis of the above embodiment, as a preferred implementation manner of the present invention, step 106 can be implemented by the following method:
outputting the n nucleic acid sequences in the gene database with the lowest dissimilarity to the dimension-reduced nucleic acid sequence to be searched, wherein n is a preset value chosen by the user.
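Selecting the n least dissimilar database sequences can be sketched with the standard library. The function name `top_n_matches` and the dictionary layout (sequence identifier mapped to its dissimilarity with the query) are hypothetical; the scores are made-up illustration values.

```python
import heapq


def top_n_matches(dissimilarities, n):
    """Return the n (identifier, dissimilarity) pairs with the lowest dissimilarity.

    `dissimilarities` maps a database sequence identifier to its dissimilarity
    with the query; results are ordered from most to least similar.
    """
    return heapq.nsmallest(n, dissimilarities.items(), key=lambda item: item[1])


scores = {"seqA": 3.2, "seqB": 0.4, "seqC": 1.7}  # illustration values only
best = top_n_matches(scores, 2)
```

`heapq.nsmallest` avoids sorting the whole database when n is small relative to the number of stored sequences.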
On the basis of the above examples, as a preferred embodiment of the present invention, the nucleic acid sequence is a DNA sequence or an RNA sequence. In addition, the present invention is also applicable to transcriptomes, metagenomes, and metatranscriptomes.
As shown in FIG. 2, the present invention comprises two key processing steps: low-dimensional vector space representation of nucleic acid sequences and dissimilarity calculation. Given a genomic nucleic acid sequence as input, the sequence is mapped to a low-dimensional space by learning a low-dimensional representation through vector inner product optimization. Adjacent k-tuple pairs that co-occur in context in the nucleic acid sequence are counted using a sliding window with a step size of 1 bp (1 base). Based on the counts of the k-tuple pairs, each k-tuple is converted into a vector by fitting an inner product of vectors plus bias terms, so that each sequence in a massive nucleic acid database is compressed and represented as a point in vector space. Thereafter, for newly input query sequencing data or a long DNA sequence, the low-dimensional vector representation of each k-tuple is learned in the same way, and the set of database sequences most similar to the query is obtained by calculating the Manhattan distance between the query sequence and every sequence in the database.
In a second aspect, the present invention provides a nucleic acid sequence search system based on k-tuple frequency, as shown in FIG. 3, the system comprising:
a k-tuple frequency determination module 301 for determining the frequency $X_{ij}$ of occurrence of adjacent k-tuple pairs in the nucleic acid sequence to be searched, wherein $X_{ij}$ represents the number of times k-tuple i appears before k-tuple j;
an $X_{ij}$ vector representation determination module 302 for determining a vector representation of $X_{ij}$, wherein $i = 1, \dots, 4^k$ and $j = 1, \dots, 4^k$; $b_i$ represents the bias that sequence background noise causes for k-tuple i, and a second bias term represents the bias that sequence background noise causes for k-tuple j; $w_i$ represents the weight vector used when k-tuple i occurs before k-tuple j, and a second weight vector is used when k-tuple i appears after k-tuple j;
a dimension-reduced representation module 304 for determining a dimension-reduced representation of the nucleic acid sequence to be searched, wherein k is the tuple length and $4^k$ is the number of tuples;
a dissimilarity degree calculating module 305 for calculating the dissimilarity between the dimension-reduced nucleic acid sequence to be searched and each nucleic acid sequence in a gene database, wherein each nucleic acid sequence in the gene database has been dimension-reduced by the same method as the nucleic acid sequence to be searched;
and a matched nucleic acid output module 306 for outputting the nucleic acid sequences in the gene database that have relatively low dissimilarity with the dimension-reduced nucleic acid sequence to be searched.
On the basis of the above embodiments, as a preferred embodiment of the present invention, the $X_{ij}$ vector representation determination module 302 specifically includes:
an $X_{ij}$ vector representation determination unit for solving by the least squares method to obtain the dimension-reduced representation of $X_{ij}$.
On the basis of the above embodiments, as a preferred embodiment of the present invention, the dissimilarity degree calculating module 305 specifically includes:
a dissimilarity degree calculating unit for calculating the degree of dissimilarity between the nucleic acid sequence y to be searched and the nucleic acid sequence x in the gene database, with $i = 1, \dots, 4^k$ and $j = 1, \dots, 4^k$.
On the basis of the above embodiments, as a preferred embodiment of the present invention, the matched nucleic acid output module 306 specifically includes:
and the matched nucleic acid output unit is used for outputting the first n nucleic acid sequences with relatively low dissimilarity with the nucleic acid sequences to be searched after dimensionality reduction in the gene database, wherein n is a preset numerical value.
On the basis of the above examples, as a preferred embodiment of the present invention, the nucleic acid sequence is a DNA sequence or an RNA sequence.
The present invention learns a low-dimensional vector space representation of a nucleic acid sequence based on the frequencies of contextual k-tuples in the sequence, thereby representing a vast database of nucleic acid sequences as vectors in a low-dimensional space. The nucleic acid sequences to be searched are mapped to the low-dimensional vector space through representation learning, and accurate, rapid comparison of nucleic acid sequences is achieved by computing distances between vectors in that space, so that gigabytes of data can be queried within minutes or even seconds.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.
Claims (10)
1. A method for searching a nucleic acid sequence based on k-tuple frequency, comprising:
determining the frequency $X_{ij}$ of occurrence of adjacent k-tuple pairs in a nucleic acid sequence to be searched, wherein $X_{ij}$ represents the number of times k-tuple i appears before k-tuple j;
determining a vector representation of $X_{ij}$, wherein $b_i$ represents the bias that sequence background noise causes for k-tuple i, and a second bias term represents the bias that sequence background noise causes for k-tuple j; $w_i$ represents the weight vector used when k-tuple i occurs before k-tuple j, and a second weight vector is used when k-tuple i appears after k-tuple j;
determining a dimension-reduced representation of the nucleic acid sequence to be searched, wherein k is the tuple length and $4^k$ is the number of tuples;
calculating the dissimilarity degree between the nucleic acid sequence to be searched after dimensionality reduction and each nucleic acid sequence in a gene database, wherein the nucleic acid sequence in the gene database is obtained after dimensionality reduction is carried out by adopting a method the same as the dimensionality reduction method of the nucleic acid sequence to be searched;
and outputting the nucleic acid sequence with relatively low dissimilarity with the nucleic acid sequence to be searched after dimensionality reduction in the gene database.
3. The method for searching a nucleic acid sequence based on k-tuple frequency according to claim 1, wherein the step of calculating the degree of dissimilarity between the nucleic acid sequence to be searched after dimensionality reduction and each nucleic acid sequence in a gene database comprises:
4. The method for searching a nucleic acid sequence based on k-tuple frequency according to claim 1, wherein outputting the nucleic acid sequences in the gene database that have relatively low dissimilarity with the dimension-reduced nucleic acid sequence to be searched comprises:
outputting the first n nucleic acid sequences with relatively low dissimilarity with the nucleic acid sequences to be searched after dimensionality reduction in the gene database, wherein n is a preset numerical value.
5. The method for searching a nucleic acid sequence based on k-tuple frequency according to claim 1, wherein the nucleic acid sequence is a DNA sequence or an RNA sequence.
6. A nucleic acid sequence search system based on k-tuple frequency, comprising:
a k-tuple frequency determination module for determining the frequency $X_{ij}$ of occurrence of adjacent k-tuple pairs in the nucleic acid sequence to be searched, wherein $X_{ij}$ represents the number of times k-tuple i appears before k-tuple j;
an $X_{ij}$ vector representation determination module for determining a vector representation of $X_{ij}$, wherein $b_i$ represents the bias that sequence background noise causes for k-tuple i, and a second bias term represents the bias that sequence background noise causes for k-tuple j; $w_i$ represents the weight vector used when k-tuple i occurs before k-tuple j, and a second weight vector is used when k-tuple i appears after k-tuple j;
a dimension-reduced representation module for determining a dimension-reduced representation of the nucleic acid sequence to be searched, wherein k is the tuple length and $4^k$ is the number of tuples;
the dissimilarity degree calculation module is used for calculating the dissimilarity degree between the nucleic acid sequence to be searched after dimension reduction and each nucleic acid sequence in the gene database, wherein the nucleic acid sequence in the gene database is obtained after dimension reduction is carried out by adopting the same method as the dimension reduction method of the nucleic acid sequence to be searched;
and the matched nucleic acid output module is used for outputting a nucleic acid sequence with relatively low dissimilarity degree with the nucleic acid sequence to be searched after dimensionality reduction in the gene database.
8. The system for searching a nucleic acid sequence based on k-tuple frequency according to claim 6, wherein the dissimilarity calculation module comprises:
9. The k-tuple frequency based nucleic acid sequence search system according to claim 6, wherein the matching nucleic acid output module comprises:
and the matched nucleic acid output unit is used for outputting the first n nucleic acid sequences with relatively low dissimilarity with the nucleic acid sequences to be searched after dimensionality reduction in the gene database, wherein n is a preset numerical value.
10. The k-tuple frequency-based nucleic acid sequence search system according to claim 6, wherein the nucleic acid sequence is a DNA sequence or an RNA sequence.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010083043.6A CN111326215B (en) | 2020-02-07 | 2020-02-07 | Method and system for searching nucleic acid sequence based on k-tuple frequency |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111326215A (en) | 2020-06-23 |
CN111326215B (en) | 2022-04-29 |
Family
ID=71168902
Family Applications (1)
Application Number | Status | Priority Date | Filing Date | Title |
---|---|---|---|---|
CN202010083043.6A (CN111326215B) | Active | 2020-02-07 | 2020-02-07 | Method and system for searching nucleic acid sequence based on k-tuple frequency |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111326215B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112863593B (en) * | 2021-02-05 | 2024-02-20 | 厦门大学 | Identification feature extraction method and system based on skin metagenome data |
CN113921082B (en) * | 2021-10-27 | 2023-04-07 | 云舟生物科技(广州)股份有限公司 | Gene search weight adjustment method, computer storage medium, and electronic device |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2002027024A2 (en) * | 2000-09-28 | 2002-04-04 | Office Of The Staff Judge Advocate U.S. Army Medical Research And Material Command | Automated method of identifying and archiving nucleic acid sequences |
CN1598821A (en) * | 2004-09-07 | 2005-03-23 | 东南大学 | Seaching method of genome sequence data based on characteristic |
CN105950707A (en) * | 2016-03-30 | 2016-09-21 | 广州精科生物技术有限公司 | Method and system for determining nucleic acid sequence |
CN106202999A (en) * | 2016-07-21 | 2016-12-07 | 厦门大学 | Microorganism high-pass sequencing data based on different scale tuple word frequency analyzes agreement |
Non-Patent Citations (3)
Title |
---|
Comparison of Metatranscriptomic Samples Based on k-Tuple Frequencies; Wang Ying et al.; PLOS ONE; 2014-01-02; Vol. 9, No. 1; pp. 1-19 * |
Effect of k-tuple length on sample-comparison with high-throughput sequencing data; Wang Ying et al.; Biochemical and Biophysical Research Communications; 2016-01-22; Vol. 469, No. 4; pp. 1021-1027 * |
Microbial communities based on k-tuple frequency statistics (基于k-tuple频度统计的微生物群落); Liu Lin; China Excellent Doctoral and Master's Theses Full-text Database (Master's), Basic Sciences; 2014-08-15; No. 8; pp. A006-197 * |
Also Published As
Publication number | Publication date |
---|---|
CN111326215A (en) | 2020-06-23 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
OL01 | Intention to license declared | ||