WO2020211466A1 - Non-redundant gene set clustering method, system and electronic device - Google Patents

Non-redundant gene set clustering method, system and electronic device

Info

Publication number
WO2020211466A1
Authority
WO
WIPO (PCT)
Prior art keywords
gene
gene set
redundant
clustering
pairs
Prior art date
Application number
PCT/CN2019/130563
Other languages
English (en)
French (fr)
Inventor
郑志春
郭宁
魏彦杰
Original Assignee
中国科学院深圳先进技术研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国科学院深圳先进技术研究院
Priority to EP19925504.3A (EP3955256A4)
Publication of WO2020211466A1
Priority to US17/477,471 (US20220005546A1)

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/30 Unsupervised data analysis

Definitions

  • This application belongs to the technical field of genetic data processing, and particularly relates to a non-redundant gene set clustering method, system and electronic equipment.
  • With the rapid development of next-generation sequencing (NGS) technology, the amount of biological sequence data has grown explosively. It is generally held that if two sequences satisfy a given similarity threshold, they are regarded as the same sequence or as redundant with respect to each other. A large number of redundant sequences not only slows down genome analysis and increases memory consumption, but can also introduce errors that affect the final experimental results.
  • Hobohm and Sander [Hobohm U, Scharf M, Schneider R, et al. Selection of representative protein data sets [J]. Protein Science, 1992, 1(3): 409-417; Hobohm U, Sander C. Enlarged representative set of protein structures [J]. Protein Science, 1994, 3(3): 522-524.] proposed the earliest clustering algorithm for building non-redundant gene sequence sets. The basic idea is to first partition the collection of gene sequences into several classes and then pick one sequence from each class to represent it; the set formed by these representative sequences is the non-redundant reference gene set.
  • The software for removing redundancy from biological gene data mainly includes NRDB90 [Holm L, Sander C. Removing near-neighbour redundancy from large protein sequence collections [J]. Bioinformatics, 1998, 14(5): 423-429.], CD-HIT [Li W, Jaroszewski L, Godzik A. Clustering of highly homologous sequences to reduce the size of large protein databases [J]. Bioinformatics, 2001, 17(3): 282-283; Li W, Jaroszewski L, Godzik A. Tolerating some redundancy significantly speeds up clustering of large protein databases [J]. Bioinformatics, 2002, 18(1): 77-82; Li W. Fast Program for Clustering and Comparing Large Sets of Protein or Nucleotide Sequences [M]. Springer US, 2015.] and PISCES [Wang G, Dunbrack R L Jr. PISCES: a protein sequence culling server [J]. Bioinformatics, 2003, 19(12): 1589.]. Each has its own characteristics, and all consist of two parts: sequence alignment and selection of the final redundant sequences.
  • At present, CD-HIT is the most widely used tool for redundancy removal in research.
  • CD-HIT is software developed by the Burnham Institute in the United States to address redundancy in large-scale protein sequence data. It can construct a non-redundant reference gene set in a relatively short time.
  • Its working principle is as follows: first sort all sequences by length, then take the longest sequence to form the first sequence class, and then process the remaining sequences in turn. If the similarity between a new sequence and the representative sequence of an existing sequence class is at or above the cutoff, the sequence is added to that class; otherwise it forms a new sequence class.
  • CD-HIT is fast mainly for two reasons. The first is its word filtering method: if the similarity between two sequences is 80% (assuming a sequence length of 100), then they share at least 60 identical words of length 2, at least 40 identical words of length 3, and at least 20 identical words of length 4. Based on this principle, when a new sequence is processed, if the number of words it shares with an existing sequence cannot meet these requirements, no alignment is needed, which greatly reduces time consumption. The second is its use of an index table, which makes it possible to count the identical words between sequences quickly.
  • Although CD-HIT is highly efficient and can construct a non-redundant reference gene set in a short time, every alignment compares the new sequence only with the representative sequence of the current sequence class, so the other sequences in that class contribute nothing to the decision. For example, take three gene sequences A, B and C, ordered from longest to shortest. Under the CD-HIT clustering method, A first forms a class of its own, and then B and C are taken out in turn for alignment. If A and B are similar in length, A and B reach the threshold, and A and C do not, the result is two classes, AB and C, whereas C should in fact also be regarded as a sequence similar to A.
  • In addition, the word-filter-based method limits the level of redundancy that each word length can handle; for example, words of length 3 can only yield sequence classes with a similarity above 66.7%.
  • This application provides a non-redundant gene set clustering method, system and electronic device, which aim to solve one of the above technical problems in the prior art at least to a certain extent.
  • a non-redundant gene set clustering method including the following steps:
  • Step a Perform a comparison operation on the original gene set to obtain gene pairs that meet the similarity threshold in the original gene set;
  • Step b Construct a union-find (disjoint-set) forest based on the obtained gene pairs;
  • Step c Obtain gene clustering results of all classes in the original gene set according to the union-find forest;
  • Step d Based on the gene clustering result, the longest sequence in each category is selected as the representative sequence of each category to obtain a non-redundant reference genome.
  • The technical solution adopted in the embodiment of the present application further includes: in step a, the comparison operation on the original gene set specifically includes: setting a similarity threshold and aligning the original gene set against itself with BLAT; post-processing the BLAT output to eliminate duplicate information and remove identical sequences; and finally deleting the unneeded columns, retaining only the sequence names of each gene pair and their respective lengths.
  • The technical solution adopted in the embodiment of the present application further includes: in step b, constructing the union-find forest based on the obtained gene pairs specifically includes: for any two gene pairs, first looking up the root information of the two gene pairs through the Find operation; if the root information of the two gene pairs is the same, merging the trees represented by the two gene pairs into one tree through the Union operation and updating the root information; if the root information of the two gene pairs is not the same, not performing the Union operation.
  • The technical solution adopted in the embodiment of the present application further includes: step b also includes: performing path optimization on the union-find forest through a path compression operation so that the child nodes of each tree point directly to the root node, and, when merging trees, merging the tree with fewer nodes into the tree with more nodes, to obtain the optimized union-find forest.
  • a non-redundant gene set clustering system including:
  • Gene comparison module used to perform a comparison operation on the original gene set, and obtain gene pairs that meet the similarity threshold in the original gene set;
  • Union-find construction module used to construct a union-find forest based on the obtained gene pairs;
  • Gene clustering module used to obtain gene clustering results of all classes in the original gene set according to the union-find forest;
  • Result output module Based on the gene clustering result, the longest sequence in each category is selected as the representative sequence of each category to obtain a non-redundant reference genome.
  • The technical solution adopted in the embodiment of the present application further includes: the gene comparison module performs the comparison operation on the original gene set specifically by: setting a similarity threshold and aligning the original gene set against itself with BLAT; post-processing the BLAT output to eliminate duplicate information and remove identical sequences; and finally deleting the unneeded columns, retaining only the sequence names of each gene pair and their respective lengths.
  • The technical solution adopted in the embodiment of the present application further includes: the union-find construction module constructs the union-find forest based on the obtained gene pairs specifically by: for any two gene pairs, first looking up the root information of the two gene pairs through the Find operation; if the root information of the two gene pairs is the same, merging the trees represented by the two gene pairs into one tree through the Union operation and updating the root information; if the root information of the two gene pairs is not the same, not performing the Union operation.
  • The technical solution adopted in the embodiment of the present application further includes a union-find optimization module, which is used to perform path optimization on the union-find forest through a path compression operation so that the child nodes of each tree point directly to the root node, and, when merging trees, to merge the tree with fewer nodes into the tree with more nodes, to obtain the optimized union-find forest.
  • an electronic device including:
  • at least one processor; and
  • a memory communicatively connected with the at least one processor; wherein,
  • the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the following operations of the non-redundant gene set clustering method described above:
  • Step a Perform a comparison operation on the original gene set to obtain gene pairs that meet the similarity threshold in the original gene set;
  • Step b Construct a union-find forest based on the obtained gene pairs;
  • Step c Obtain gene clustering results of all classes in the original gene set according to the union-find forest;
  • Step d Based on the gene clustering result, the longest sequence in each category is selected as the representative sequence of each category to obtain a non-redundant reference genome.
  • The beneficial effects produced by the embodiments of the present application are: the non-redundant gene set clustering method, system and electronic device of the embodiments of the present application cluster non-redundant gene sets by using BLAT alignment and a union-find data structure, which takes the similarity between more genes into account and improves the accuracy of redundancy removal.
  • At the same time, based on the union-find data structure and with further path compression optimization, the construction of the non-redundant gene set can be completed very quickly, which improves the construction efficiency of the non-reference gene set.
  • FIG. 1 is a flowchart of a non-redundant gene set clustering method according to an embodiment of the present application
  • FIG. 2 is a schematic structural diagram of a non-redundant gene set clustering system according to an embodiment of the present application
  • FIG. 3 is a schematic diagram of the hardware device structure of the non-redundant gene set clustering method provided by an embodiment of the present application.
  • FIG. 1 is a flowchart of a non-redundant gene set clustering method according to an embodiment of the present application.
  • the non-redundant gene set clustering method of the embodiment of the present application includes the following steps:
  • Step 100 Perform a comparison operation on the original gene set through the gene comparison software BLAT, and obtain the gene pairs that meet the similarity threshold in the original gene set;
  • In step 100, the gene pairs meeting the similarity threshold are obtained as follows: first set the similarity threshold and align the original gene set against itself with the gene comparison software BLAT; then post-process the BLAT output. Because self-alignment is used, each pair of sequences is aligned twice; in the embodiment of the present application, the duplicate information is eliminated and sequences with 100% similarity (i.e. identical sequences) are removed. Finally, some unneeded columns are deleted, and only the sequence names of each gene pair and their respective lengths are retained.
  • Step 200 Based on the obtained gene pairs, construct the union-find forest through the Find and Union operations of the union-find structure;
  • In step 200, after the comparison operation on the gene set is completed, a series of gene pairs is obtained, and the union-find forest can then be constructed.
  • The union-find algorithm mainly includes two operations, Find and Union, specifically:
  • Find: determines which subset an element belongs to, and can be used to determine whether two elements belong to the same subset;
  • Union: merges two subsets into a single set.
  • In the embodiment of the present application, the union-find forest is constructed as follows: for any two gene pairs, first look up the root information of the two gene pairs through the Find operation; if the root information of the two gene pairs is the same, merge the trees represented by the two gene pairs into the same tree through the Union operation and update the root information. If the root information of the two gene pairs is not the same, the Union operation is not performed. As the number of gene pairs increases, the union-find forest is obtained.
  • Step 300 Perform path optimization on the union-find forest through a path compression operation so that the child nodes of each tree point directly to the root node, and, when merging trees, merge the tree with fewer nodes into the tree with more nodes, to obtain the optimized union-find forest;
  • In step 300, as the number of gene pairs increases, the height of the merged trees grows, which affects subsequent find and merge operations.
  • To solve the low query efficiency caused by excessive tree depth, this application optimizes the paths of the union-find forest through path compression, which can greatly improve the clustering efficiency of the non-reference gene set.
  • Step 400 Obtain gene clustering results of all classes in the original gene set according to the optimized union-find forest;
  • Step 500 Based on the gene clustering result, the longest sequence in each category is selected as the representative sequence of each category, and the final non-redundant reference genome is obtained.
  • In step 500, once the union-find forest has been constructed, all the classes into which the original gene set has been clustered are available. Using the stored length information, the longest sequence in each class is selected as its representative sequence to form the final non-redundant reference genome.
  • FIG. 2 is a schematic structural diagram of a non-redundant gene set clustering system according to an embodiment of the present application.
  • The non-redundant gene set clustering system of the embodiment of the present application includes a gene comparison module, a union-find construction module, a union-find optimization module, a gene clustering module, and a result output module.
  • Gene comparison module used to perform the comparison operation on the original gene set through the gene comparison software BLAT to obtain the gene pairs in the original gene set that meet the similarity threshold. These gene pairs are obtained as follows: first set the similarity threshold and align the original gene set against itself with BLAT; then post-process the BLAT output. Because self-alignment is used, each pair of sequences is aligned twice; in the embodiment of the present application, the duplicate information is eliminated and sequences with 100% similarity (i.e. identical sequences) are removed. Finally, some unneeded columns are deleted, and only the sequence names of each gene pair and their respective lengths are retained.
  • Union-find construction module used to construct and optimize the union-find forest based on the obtained gene pairs through the Find and Union operations of the union-find structure. After the comparison operation on the gene set is completed, a series of gene pairs is obtained, and the union-find forest can then be constructed.
  • The union-find algorithm mainly includes two operations, Find and Union, specifically:
  • Find: determines which subset an element belongs to, and can be used to determine whether two elements belong to the same subset;
  • Union: merges two subsets into a single set.
  • In the embodiment of the present application, the union-find forest is constructed as follows: for any two gene pairs, first look up the root information of the two gene pairs through the Find operation; if the root information of the two gene pairs is the same, merge the trees represented by the two gene pairs into the same tree through the Union operation and update the root information. If the root information of the two gene pairs is not the same, the Union operation is not performed. As the number of gene pairs increases, the union-find forest is obtained.
  • Union-find optimization module used to perform path optimization on the union-find forest through a path compression operation so that the child nodes of each tree point directly to the root node, and, when merging trees, to merge the tree with fewer nodes into the tree with more nodes, to obtain the optimized union-find forest. As the number of gene pairs increases, the height of the merged trees grows, which affects subsequent find and merge operations.
  • To solve the low query efficiency caused by excessive tree depth, this application optimizes the paths of the union-find forest through path compression, which can greatly improve the clustering efficiency of the non-reference gene set.
  • Gene clustering module used to obtain the gene clustering results of all classes in the original gene set according to the optimized union-find forest;
  • Result output module Based on the gene clustering results, the longest sequence in each category is selected as the representative sequence of each category to obtain the final non-redundant reference genome. Once the union-find forest has been constructed, all the classes into which the original gene set has been clustered are available. Using the stored length information, the longest sequence in each class is selected as its representative sequence to form the final non-redundant reference genome.
  • FIG. 3 is a schematic diagram of the hardware device structure of the non-redundant gene set clustering method provided by an embodiment of the present application.
  • As shown in FIG. 3, the device includes one or more processors and a memory. Taking one processor as an example, the device may also include an input system and an output system.
  • The processor, the memory, the input system, and the output system may be connected by a bus or by other means; FIG. 3 takes connection by a bus as an example.
  • As a non-transitory computer-readable storage medium, the memory can be used to store non-transitory software programs, non-transitory computer-executable programs, and modules.
  • the processor executes various functional applications and data processing of the electronic device by running non-transitory software programs, instructions, and modules stored in the memory, that is, realizing the processing methods of the foregoing method embodiments.
  • the memory may include a program storage area and a data storage area, where the program storage area can store an operating system and an application program required by at least one function; the data storage area can store data and the like.
  • the memory may include a high-speed random access memory, and may also include a non-transitory memory, such as at least one magnetic disk storage device, a flash memory device, or other non-transitory solid state storage devices.
  • the storage may optionally include storage remotely arranged with respect to the processor, and these remote storages may be connected to the processing system through a network. Examples of the aforementioned networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.
  • the input system can receive input digital or character information, and generate signal input.
  • the output system may include display devices such as a display screen.
  • the one or more modules are stored in the memory, and when executed by the one or more processors, the following operations of any of the foregoing method embodiments are performed:
  • Step a Perform a comparison operation on the original gene set to obtain gene pairs that meet the similarity threshold in the original gene set;
  • Step b Construct a union-find forest based on the obtained gene pairs;
  • Step c Obtain gene clustering results of all classes in the original gene set according to the union-find forest;
  • Step d Based on the gene clustering result, the longest sequence in each category is selected as the representative sequence of each category to obtain a non-redundant reference genome.
  • the embodiment of the present application provides a non-transitory (nonvolatile) computer storage medium, the computer storage medium stores computer executable instructions, and the computer executable instructions can perform the following operations:
  • Step a Perform a comparison operation on the original gene set to obtain gene pairs that meet the similarity threshold in the original gene set;
  • Step b Construct a union-find forest based on the obtained gene pairs;
  • Step c Obtain gene clustering results of all classes in the original gene set according to the union-find forest;
  • Step d Based on the gene clustering result, the longest sequence in each category is selected as the representative sequence of each category to obtain a non-redundant reference genome.
  • The embodiment of the present application provides a computer program product, which includes a computer program stored on a non-transitory computer-readable storage medium; the computer program includes program instructions which, when executed by a computer, cause the computer to perform the following operations:
  • Step a Perform a comparison operation on the original gene set to obtain gene pairs that meet the similarity threshold in the original gene set;
  • Step b Construct a union-find forest based on the obtained gene pairs;
  • Step c Obtain gene clustering results of all classes in the original gene set according to the union-find forest;
  • Step d Based on the gene clustering result, the longest sequence in each category is selected as the representative sequence of each category to obtain a non-redundant reference genome.
  • The non-redundant gene set clustering method, system and electronic device of the embodiments of the present application cluster non-redundant gene sets by using BLAT alignment and a union-find data structure, which takes the similarity between more genes into account and improves the accuracy of redundancy removal.
  • At the same time, based on the union-find data structure and with further path compression optimization, the construction of the non-redundant gene set can be completed very quickly, which improves the construction efficiency of the non-reference gene set.

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Epidemiology (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Bioethics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Chemical & Material Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

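This application relates to a non-redundant gene set clustering method, system and electronic device. The method includes: step a: performing an alignment operation on an original gene set to obtain the gene pairs in the original gene set that meet a similarity threshold; step b: constructing a union-find forest based on the obtained gene pairs; step c: obtaining the gene clustering results of all classes in the original gene set from the union-find forest; step d: based on the gene clustering results, selecting the longest sequence in each class as the representative sequence of that class to obtain a non-redundant reference genome. By using BLAT alignment and a union-find data structure to cluster non-redundant gene sets, this application can take the similarity between more genes into account and improve the accuracy of redundancy removal.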

Description

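Non-redundant gene set clustering method, system and electronic device
Technical Field
This application belongs to the technical field of genetic data processing, and in particular relates to a non-redundant gene set clustering method, system and electronic device.
Background
With the rapid development of next-generation sequencing (NGS) technology, the amount of biological sequence data has grown explosively. It is generally held that if two sequences satisfy a given similarity threshold, they are regarded as the same sequence or as redundant with respect to each other. A large number of redundant sequences not only slows down genome analysis and increases memory consumption, but can also introduce errors that affect the final experimental results.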
Hobohm and Sander [Hobohm U, Scharf M, Schneider R, et al. Selection of representative protein data sets [J]. Protein Science, 1992, 1(3): 409-417; Hobohm U, Sander C. Enlarged representative set of protein structures [J]. Protein Science, 1994, 3(3): 522-524.] proposed the earliest clustering algorithm for building non-redundant gene sequence sets. The basic idea is to first partition the collection of gene sequences into several classes and then pick one sequence from each class to represent it; the set formed by these representative sequences is the non-redundant reference gene set.
The software for removing redundancy from biological gene data mainly includes NRDB90 [Holm L, Sander C. Removing near-neighbour redundancy from large protein sequence collections [J]. Bioinformatics, 1998, 14(5): 423-429.], CD-HIT [Li W, Jaroszewski L, Godzik A. Clustering of highly homologous sequences to reduce the size of large protein databases [J]. Bioinformatics, 2001, 17(3): 282-283; Li W, Jaroszewski L, Godzik A. Tolerating some redundancy significantly speeds up clustering of large protein databases [J]. Bioinformatics, 2002, 18(1): 77-82; Li W. Fast Program for Clustering and Comparing Large Sets of Protein or Nucleotide Sequences [M]. Springer US, 2015.] and PISCES [Wang G, Dunbrack R L Jr. PISCES: a protein sequence culling server [J]. Bioinformatics, 2003, 19(12): 1589.]. Each has its own characteristics, and all consist of two parts: sequence alignment and selection of the final redundant sequences.
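At present, CD-HIT is the most widely used tool for redundancy removal in research. CD-HIT is software developed by the Burnham Institute in the United States to address redundancy in large-scale protein sequence data, and it can construct a non-redundant reference gene set in a relatively short time. Its working principle is as follows: first sort all sequences by length, then take the longest sequence to form the first sequence class, and then process the remaining sequences in turn. If the similarity between a new sequence and the representative sequence of an existing sequence class is at or above the cutoff, the sequence is added to that class; otherwise it forms a new sequence class.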
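The greedy procedure described above can be sketched as follows. This is a minimal illustration of the idea only, not CD-HIT's actual implementation: the function names and the toy `toy_identity` similarity measure are assumptions introduced here, and a real tool would use a proper alignment score.

```python
from typing import Callable, Dict, List


def greedy_cluster(
    seqs: Dict[str, str],
    similarity: Callable[[str, str], float],
    cutoff: float,
) -> Dict[str, List[str]]:
    """Greedy incremental clustering: representative name -> member names."""
    clusters: Dict[str, List[str]] = {}
    # Process sequences from longest to shortest (ties broken by name).
    for name in sorted(seqs, key=lambda n: (-len(seqs[n]), n)):
        for rep in clusters:
            # Compare only against the representative of each existing class.
            if similarity(seqs[name], seqs[rep]) >= cutoff:
                clusters[rep].append(name)
                break
        else:
            # No representative is similar enough: start a new class.
            clusters[name] = [name]
    return clusters


def toy_identity(a: str, b: str) -> float:
    """Hypothetical similarity: matching positions / length of the shorter sequence."""
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n if n else 0.0


seqs = {"A": "ACGTACGTAC", "B": "ACGTACGTA", "C": "TTTTTTTT"}
print(greedy_cluster(seqs, toy_identity, cutoff=0.8))
```

Because each new sequence is compared only with class representatives, the number of comparisons stays low, which is the source of both the speed and the limitation discussed in this background section.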
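CD-HIT is fast mainly for two reasons. The first is its word filtering method: if the similarity between two sequences is 80% (assuming a sequence length of 100), then they share at least 60 identical words of length 2, at least 40 identical words of length 3, and at least 20 identical words of length 4. Based on this principle, when a new sequence is processed, if the number of words it shares with an existing sequence cannot meet these requirements, no alignment is needed, which greatly reduces time consumption. The second is its use of an index table, which makes it possible to count the identical words between sequences quickly.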
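The figures quoted above are consistent with a simple counting argument: each mismatched position can destroy at most k of the words of length k that cover it. The sketch below uses the simplified bound `length - k * mismatches`, which reproduces the numbers in the text; the function name is hypothetical, and an exact count would start from `length - k + 1` words and so be slightly smaller.

```python
def min_shared_words(length: int, identity_pct: int, k: int) -> int:
    """Simplified lower bound on the number of identical words of length k
    shared by two aligned sequences of the given length and percent identity."""
    mismatches = length * (100 - identity_pct) // 100
    # Each mismatch can invalidate at most k overlapping words.
    return max(0, length - k * mismatches)


for k in (2, 3, 4):
    print(k, min_shared_words(100, 80, k))
```

Under this bound, the guaranteed number of shared words of length 3 drops to zero once identity falls to roughly two thirds, so a filter built on words of length 3 cannot screen candidates below that level.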
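Although CD-HIT is highly efficient and can construct a non-redundant reference gene set in a short time, every alignment compares the new sequence only with the representative sequence of the current sequence class, so the other sequences in that class contribute nothing to the decision. For example, take three gene sequences A, B and C, ordered from longest to shortest. Under the CD-HIT clustering method, A first forms a class of its own, and then B and C are taken out in turn for alignment. If A and B are similar in length, A and B reach the threshold, and A and C do not, the result is two classes, AB and C, whereas C should in fact also be regarded as a sequence similar to A. In addition, the word-filter-based method limits the level of redundancy that each word length can handle; for example, words of length 3 can only yield sequence classes with a similarity above 66.7%.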
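The A/B/C example can be made concrete with a small comparison. The sketch assumes, beyond what the text states, that B and C also meet the threshold; under that assumption, comparing against representatives only leaves C on its own, while grouping by any chain of similar pairs places all three together. Both function names are hypothetical.

```python
from typing import Dict, FrozenSet, List, Set


def representative_only(order: List[str], similar: Set[FrozenSet[str]]) -> List[List[str]]:
    """Assign each sequence to the first class whose representative it matches."""
    classes: List[List[str]] = []
    for name in order:
        for members in classes:
            if frozenset((name, members[0])) in similar:
                members.append(name)
                break
        else:
            classes.append([name])
    return classes


def connected_groups(order: List[str], similar: Set[FrozenSet[str]]) -> List[List[str]]:
    """Group sequences that are linked by any chain of similar pairs."""
    label: Dict[str, int] = {name: i for i, name in enumerate(order)}
    for pair in similar:
        a, b = tuple(pair)
        old, new = label[a], label[b]
        if old != new:
            for name in order:
                if label[name] == old:
                    label[name] = new
    groups: Dict[int, List[str]] = {}
    for name in order:
        groups.setdefault(label[name], []).append(name)
    return list(groups.values())


order = ["A", "B", "C"]  # longest to shortest
similar = {frozenset(("A", "B")), frozenset(("B", "C"))}  # B-C is an assumption
print(representative_only(order, similar))
print(connected_groups(order, similar))
```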
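In view of the above problems, it is necessary to provide a new non-redundant gene set clustering method that improves the accuracy and efficiency of gene redundancy removal while eliminating redundant genes as precisely as possible.
Summary
This application provides a non-redundant gene set clustering method, system and electronic device, which aim to solve, at least to a certain extent, one of the above technical problems in the prior art.
To solve the above problems, this application provides the following technical solutions:
A non-redundant gene set clustering method, including the following steps:
Step a: perform an alignment operation on the original gene set to obtain the gene pairs in the original gene set that meet the similarity threshold;
Step b: construct a union-find forest based on the obtained gene pairs;
Step c: obtain the gene clustering results of all classes in the original gene set from the union-find forest;
Step d: based on the gene clustering results, select the longest sequence in each class as the representative sequence of that class to obtain a non-redundant reference genome.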
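The technical solution adopted in the embodiments of the present application further includes: in step a, the alignment operation on the original gene set is specifically: setting a similarity threshold and aligning the original gene set against itself with BLAT; post-processing the BLAT output to eliminate duplicate information and remove identical sequences; and finally deleting the unneeded columns, retaining the sequence names of each gene pair and their respective lengths.
The technical solution adopted in the embodiments of the present application further includes: in step b, constructing the union-find forest based on the obtained gene pairs specifically includes: for any two gene pairs, first looking up the root information of the two gene pairs through the Find operation; if the root information of the two gene pairs is the same, merging the trees represented by the two gene pairs into one tree through the Union operation and updating the root information; if the root information of the two gene pairs is not the same, not performing the Union operation.
The technical solution adopted in the embodiments of the present application further includes: step b also includes: performing path optimization on the union-find forest through a path compression operation so that the child nodes of each tree point directly to the root node, and, when merging trees, merging the tree with fewer nodes into the tree with more nodes, to obtain the optimized union-find forest.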
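Another technical solution adopted in the embodiments of the present application is: a non-redundant gene set clustering system, including:
Gene alignment module: used to perform an alignment operation on the original gene set to obtain the gene pairs in the original gene set that meet the similarity threshold;
Union-find construction module: used to construct a union-find forest based on the obtained gene pairs;
Gene clustering module: used to obtain the gene clustering results of all classes in the original gene set from the union-find forest;
Result output module: used to select, based on the gene clustering results, the longest sequence in each class as the representative sequence of that class to obtain a non-redundant reference genome.
The technical solution adopted in the embodiments of the present application further includes: the gene alignment module performs the alignment operation on the original gene set specifically by: setting a similarity threshold and aligning the original gene set against itself with BLAT; post-processing the BLAT output to eliminate duplicate information and remove identical sequences; and finally deleting the unneeded columns, retaining the sequence names of each gene pair and their respective lengths.
The technical solution adopted in the embodiments of the present application further includes: the union-find construction module constructs the union-find forest based on the obtained gene pairs specifically by: for any two gene pairs, first looking up the root information of the two gene pairs through the Find operation; if the root information of the two gene pairs is the same, merging the trees represented by the two gene pairs into one tree through the Union operation and updating the root information; if the root information of the two gene pairs is not the same, not performing the Union operation.
The technical solution adopted in the embodiments of the present application further includes a union-find optimization module, which is used to perform path optimization on the union-find forest through a path compression operation so that the child nodes of each tree point directly to the root node, and, when merging trees, to merge the tree with fewer nodes into the tree with more nodes, to obtain the optimized union-find forest.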
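Yet another technical solution adopted in the embodiments of the present application is: an electronic device, including:
at least one processor; and
a memory communicatively connected with the at least one processor; wherein,
the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the following operations of the non-redundant gene set clustering method described above:
Step a: perform an alignment operation on the original gene set to obtain the gene pairs in the original gene set that meet the similarity threshold;
Step b: construct a union-find forest based on the obtained gene pairs;
Step c: obtain the gene clustering results of all classes in the original gene set from the union-find forest;
Step d: based on the gene clustering results, select the longest sequence in each class as the representative sequence of that class to obtain a non-redundant reference genome.
Compared with the prior art, the beneficial effects produced by the embodiments of the present application are: the non-redundant gene set clustering method, system and electronic device of the embodiments of the present application cluster non-redundant gene sets by using BLAT alignment and a union-find data structure, which takes the similarity between more genes into account and improves the accuracy of redundancy removal. At the same time, based on the union-find data structure and with further path compression optimization, the construction of the non-redundant gene set can be completed very quickly, which improves the construction efficiency of the non-reference gene set.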
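Brief Description of the Drawings
FIG. 1 is a flowchart of the non-redundant gene set clustering method according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of the non-redundant gene set clustering system according to an embodiment of the present application;
FIG. 3 is a schematic diagram of the hardware device structure for the non-redundant gene set clustering method provided by an embodiment of the present application.
Detailed Description
To make the purpose, technical solutions and advantages of this application clearer, this application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain this application and are not intended to limit it.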
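Refer to FIG. 1, which is a flowchart of the non-redundant gene set clustering method according to an embodiment of the present application. The non-redundant gene set clustering method of the embodiment of the present application includes the following steps:
Step 100: perform an alignment operation on the original gene set with the gene alignment software BLAT to obtain the gene pairs in the original gene set that meet the similarity threshold;
In step 100, the gene pairs meeting the similarity threshold are obtained as follows: first set the similarity threshold and align the original gene set against itself with the gene alignment software BLAT; then post-process the BLAT output. Because self-alignment is used, each pair of sequences is aligned twice; in the embodiment of the present application, the duplicate information is eliminated and sequences with 100% similarity (i.e. identical sequences) are removed. Finally, some unneeded columns are deleted, and only the sequence names of each gene pair and their respective lengths are retained.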
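The post-processing in step 100 might look like the sketch below. It works on already-parsed hits rather than on BLAT's raw output, so the `Hit` tuple layout, the `identity` field and the function name are all assumptions made for illustration. The sketch also interprets "sequences with 100% similarity" as each sequence's alignment to itself, which is one possible reading of the text.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical parsed hit: (query, query_len, target, target_len, identity in [0, 1])
Hit = Tuple[str, int, str, int, float]
Pair = Tuple[str, int, str, int]


def filter_pairs(hits: Iterable[Hit], threshold: float) -> List[Pair]:
    """Keep one record per unordered gene pair that meets the threshold."""
    kept: Dict[Tuple[str, str], Pair] = {}
    for q, q_len, t, t_len, identity in hits:
        if q == t:
            continue  # a sequence aligned to itself (assumed meaning of "100%")
        if identity < threshold:
            continue  # below the similarity threshold
        # Self-alignment reports (q, t) and (t, q); normalise to one key.
        if q < t:
            key, record = (q, t), (q, q_len, t, t_len)
        else:
            key, record = (t, q), (t, t_len, q, q_len)
        kept.setdefault(key, record)
    # Only names and lengths are retained, as the text describes.
    return list(kept.values())


hits = [
    ("g1", 300, "g1", 300, 1.0),
    ("g1", 300, "g2", 280, 0.97),
    ("g2", 280, "g1", 300, 0.97),
    ("g3", 200, "g1", 300, 0.50),
]
print(filter_pairs(hits, threshold=0.95))
```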
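Step 200: based on the obtained gene pairs, constructing the union-find forest through the Find and Union operations of the union-find structure;
In step 200, once the alignment of the gene set is complete, a series of gene pairs is available and the union-find forest can be constructed. The union-find algorithm consists mainly of two operations, Find and Union:
Find: determines which subset an element belongs to; it can be used to determine whether two elements belong to the same subset;
Union: merges two subsets into a single set.
In this embodiment, the union-find forest is constructed as follows: for the two genes of any gene pair, the root of each gene is first looked up through the Find operation. If the two roots are different, the Union operation merges the two trees to which the genes belong into a single tree and updates the root information. If the two roots are the same, the two genes already belong to the same tree and no Union operation is performed. As gene pairs are processed one after another, the union-find forest is obtained.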
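By way of example only, the Find and Union operations may be sketched as below. In the conventional formulation of union-find, Union merges two trees only when Find returns different roots, and the sketch follows that convention. The class name `GeneForest` and the sample gene names are hypothetical, and no optimization is applied at this stage.

```python
class GeneForest:
    """Disjoint-set forest keyed by gene name (unoptimized)."""

    def __init__(self):
        self.parent = {}  # gene name -> parent gene name

    def find(self, gene):
        # A gene seen for the first time is the root of its own tree.
        self.parent.setdefault(gene, gene)
        while self.parent[gene] != gene:
            gene = self.parent[gene]
        return gene

    def union(self, gene_a, gene_b):
        root_a, root_b = self.find(gene_a), self.find(gene_b)
        if root_a != root_b:
            # Different roots: attach one tree under the other.
            self.parent[root_b] = root_a
        # Same root: the genes already share a tree, nothing to do.


forest = GeneForest()
for a, b in [("g1", "g2"), ("g3", "g4"), ("g2", "g3"), ("g5", "g6")]:
    forest.union(a, b)
```

After these four pairs are processed, g1 to g4 share one root and g5 and g6 share another, so the forest contains two trees.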
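Step 300: performing path optimization on the union-find forest through a path compression operation so that the child nodes of every tree point directly to the root node, and, when merging trees, merging the tree with fewer nodes into the tree with more nodes, to obtain an optimized union-find forest;
In step 300, as the number of gene pairs grows, the merged trees become taller and taller, which slows subsequent Find and Union operations. To address the low query efficiency caused by excessive tree depth, the present application optimizes the paths in the union-find forest by path compression, which can greatly improve the clustering efficiency for the non-redundant gene set.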
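The two optimizations of step 300 may, for instance, be sketched as follows. Path compression repoints every node visited during a Find directly at the root, and union by size attaches the smaller tree under the larger one. Used together, these two techniques are known to give an amortized cost per operation that is nearly constant (bounded by the inverse Ackermann function). The class name `OptimizedGeneForest` and its `size` bookkeeping are assumptions made for illustration.

```python
class OptimizedGeneForest:
    """Disjoint-set forest with path compression and union by size."""

    def __init__(self):
        self.parent = {}  # gene name -> parent gene name
        self.size = {}    # root gene name -> number of nodes in its tree

    def find(self, gene):
        if gene not in self.parent:
            self.parent[gene] = gene
            self.size[gene] = 1
            return gene
        # First pass: walk up to the root.
        root = gene
        while self.parent[root] != root:
            root = self.parent[root]
        # Second pass (path compression): repoint every node on the
        # walked path directly at the root.
        while gene != root:
            next_gene = self.parent[gene]
            self.parent[gene] = root
            gene = next_gene
        return root

    def union(self, gene_a, gene_b):
        root_a, root_b = self.find(gene_a), self.find(gene_b)
        if root_a == root_b:
            return root_a  # already in the same tree
        # Union by size: the tree with fewer nodes goes under the larger.
        if self.size[root_a] < self.size[root_b]:
            root_a, root_b = root_b, root_a
        self.parent[root_b] = root_a
        self.size[root_a] += self.size[root_b]
        return root_a
```

The size of a tree is only meaningful at its root; entries left behind for former roots are simply ignored after a merge.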
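Step 400: obtaining the gene clustering results for all classes in the original gene set according to the optimized union-find forest;
Step 500: based on the gene clustering results, selecting the longest sequence in each class as the representative sequence of that class, to obtain the final non-redundant reference genome.
In step 500, once the union-find forest has been constructed, all classes into which the original gene set has been clustered are available. Using the stored length information, the longest sequence in each class is selected as its representative sequence, forming the final non-redundant reference genome.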
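Steps 400 and 500 may be illustrated, without limitation, by the sketch below. It assumes that the forest has already been flattened into a mapping from each gene to its root (`root_of`) and that the lengths kept in step 100 are available as a mapping (`length_of`). Both names, the function `pick_representatives`, and the tie-breaking rule are hypothetical; the application does not say how ties in length are resolved.

```python
from collections import defaultdict


def pick_representatives(root_of, length_of):
    """Group genes by root, then pick the longest gene in each group."""
    clusters = defaultdict(list)
    for gene, root in root_of.items():
        clusters[root].append(gene)
    # The longest sequence wins. On a tie, the lexicographically larger
    # name is taken: an arbitrary but deterministic choice.
    representatives = {
        root: max(members, key=lambda g: (length_of[g], g))
        for root, members in clusters.items()
    }
    return dict(clusters), representatives


root_of = {"g1": "g1", "g2": "g1", "g3": "g1", "g5": "g5", "g6": "g5"}
length_of = {"g1": 900, "g2": 1200, "g3": 1100, "g5": 300, "g6": 300}
clusters, representatives = pick_representatives(root_of, length_of)
```

A gene that never appears in a retained gene pair is absent from the forest. Presumably such a gene forms a class of its own and is its own representative, although the text does not state this explicitly.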
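Referring to FIG. 2, which is a schematic structural diagram of the non-redundant gene set clustering system according to an embodiment of the present application, the system includes a gene alignment module, a union-find construction module, a union-find optimization module, a gene clustering module, and a result output module.
Gene alignment module: configured to align the original gene set using the gene alignment software BLAT and obtain the gene pairs in the original gene set that satisfy the similarity threshold. The gene pairs satisfying the similarity threshold are obtained as follows. First, the similarity threshold is set, and the original gene set is aligned against itself using BLAT. Next, the BLAT output is cleaned up: because the set is aligned against itself, each pair of sequences is reported twice, so in this embodiment the duplicate records are removed, as are sequences with 100% similarity (that is, completely identical sequences). Finally, columns that are not needed are deleted, keeping only the sequence names of each gene pair and their respective lengths.
Union-find construction module: configured to construct and optimize the union-find forest, based on the obtained gene pairs, through the Find and Union operations of the union-find structure. Once the alignment of the gene set is complete, a series of gene pairs is available and the union-find forest can be constructed. The union-find algorithm consists mainly of two operations, Find and Union:
Find: determines which subset an element belongs to; it can be used to determine whether two elements belong to the same subset;
Union: merges two subsets into a single set.
In this embodiment, the union-find forest is constructed as follows: for the two genes of any gene pair, the root of each gene is first looked up through the Find operation. If the two roots are different, the Union operation merges the two trees to which the genes belong into a single tree and updates the root information. If the two roots are the same, the two genes already belong to the same tree and no Union operation is performed. As gene pairs are processed one after another, the union-find forest is obtained.
Union-find optimization module: configured to perform path optimization on the union-find forest through a path compression operation so that the child nodes of every tree point directly to the root node, and, when merging trees, to merge the tree with fewer nodes into the tree with more nodes, to obtain an optimized union-find forest. As the number of gene pairs grows, the merged trees become taller and taller, which slows subsequent Find and Union operations. To address the low query efficiency caused by excessive tree depth, the present application optimizes the paths in the union-find forest by path compression, which can greatly improve the clustering efficiency for the non-redundant gene set.
Gene clustering module: configured to obtain the gene clustering results for all classes in the original gene set according to the optimized union-find forest;
Result output module: configured to select, based on the gene clustering results, the longest sequence in each class as the representative sequence of that class, to obtain the final non-redundant reference genome. Once the union-find forest has been constructed, all classes into which the original gene set has been clustered are available. Using the stored length information, the longest sequence in each class is selected as its representative sequence, forming the final non-redundant reference genome.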
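FIG. 3 is a schematic structural diagram of the hardware device for the non-redundant gene set clustering method according to an embodiment of the present application. As shown in FIG. 3, the device includes one or more processors and a memory. Taking one processor as an example, the device may further include an input system and an output system.
The processor, the memory, the input system, and the output system may be connected by a bus or by other means; FIG. 3 takes connection by a bus as an example.
The memory, as a non-transitory computer-readable storage medium, can be used to store non-transitory software programs, non-transitory computer-executable programs, and modules. By running the non-transitory software programs, instructions, and modules stored in the memory, the processor executes the various functional applications and data processing of the electronic device, that is, it implements the processing method of the method embodiments described above.
The memory may include a program storage area and a data storage area, where the program storage area may store an operating system and the application programs required by at least one function, and the data storage area may store data and the like. In addition, the memory may include high-speed random access memory and may also include non-transitory memory, for example at least one magnetic disk storage device, a flash memory device, or another non-transitory solid-state storage device. In some embodiments, the memory optionally includes memory located remotely from the processor, and such remote memory may be connected to the processing system through a network. Examples of such networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.
The input system can receive input numeric or character information and generate signal inputs. The output system may include a display device such as a display screen.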
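The one or more modules are stored in the memory and, when executed by the one or more processors, perform the following operations of any of the method embodiments described above:
Step a: performing an alignment operation on an original gene set to obtain the gene pairs in the original gene set that satisfy a similarity threshold;
Step b: constructing a union-find forest based on the obtained gene pairs;
Step c: obtaining the gene clustering results for all classes in the original gene set according to the union-find forest;
Step d: based on the gene clustering results, selecting the longest sequence in each class as the representative sequence of that class, to obtain a non-redundant reference genome.
The above product can execute the method provided in the embodiments of the present application and has the functional modules and beneficial effects corresponding to that method. For technical details not described exhaustively in this embodiment, refer to the method provided in the embodiments of the present application.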
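An embodiment of the present application provides a non-transitory (non-volatile) computer storage medium storing computer-executable instructions that can perform the following operations:
Step a: performing an alignment operation on an original gene set to obtain the gene pairs in the original gene set that satisfy a similarity threshold;
Step b: constructing a union-find forest based on the obtained gene pairs;
Step c: obtaining the gene clustering results for all classes in the original gene set according to the union-find forest;
Step d: based on the gene clustering results, selecting the longest sequence in each class as the representative sequence of that class, to obtain a non-redundant reference genome.
An embodiment of the present application provides a computer program product including a computer program stored on a non-transitory computer-readable storage medium, the computer program including program instructions that, when executed by a computer, cause the computer to perform the following operations:
Step a: performing an alignment operation on an original gene set to obtain the gene pairs in the original gene set that satisfy a similarity threshold;
Step b: constructing a union-find forest based on the obtained gene pairs;
Step c: obtaining the gene clustering results for all classes in the original gene set according to the union-find forest;
Step d: based on the gene clustering results, selecting the longest sequence in each class as the representative sequence of that class, to obtain a non-redundant reference genome.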
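By using BLAT alignment together with a union-find data structure to cluster the non-redundant gene set, the non-redundant gene set clustering method, system, and electronic device of the embodiments of the present application take the similarity among more genes into account and improve the accuracy of redundancy removal. In addition, with further path compression optimization of the union-find data structure, the non-redundant gene set can be constructed in very little time, which improves the efficiency of building the non-redundant gene set.
The above are only preferred embodiments of the present application. It should be noted that those of ordinary skill in the art may make improvements and refinements without departing from the principles of the present application, and such improvements and refinements shall also be regarded as falling within the scope of protection of the present application.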

Claims (9)

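  1. A non-redundant gene set clustering method, characterized by including the following steps:
    Step a: performing an alignment operation on an original gene set to obtain the gene pairs in the original gene set that satisfy a similarity threshold;
    Step b: constructing a union-find forest based on the obtained gene pairs;
    Step c: obtaining the gene clustering results for all classes in the original gene set according to the union-find forest;
    Step d: based on the gene clustering results, selecting the longest sequence in each class as the representative sequence of that class, to obtain a non-redundant reference genome.
  2. The non-redundant gene set clustering method according to claim 1, characterized in that, in step a, performing the alignment operation on the original gene set specifically includes: setting a similarity threshold and aligning the original gene set against itself using BLAT; cleaning up the BLAT output by removing duplicate records and removing completely identical sequences; and finally deleting unneeded columns, keeping only the sequence names of each gene pair and their respective lengths.
  3. The non-redundant gene set clustering method according to claim 1 or 2, characterized in that, in step b, constructing the union-find forest based on the obtained gene pairs specifically includes: for the two genes of any gene pair, first looking up the root of each gene through the Find operation; if the two roots are different, merging the two trees to which the genes belong into a single tree through the Union operation and updating the root information; if the two roots are the same, performing no Union operation.
  4. The non-redundant gene set clustering method according to claim 3, characterized in that step b further includes: performing path optimization on the union-find forest through a path compression operation so that the child nodes of every tree point directly to the root node, and, when merging trees, merging the tree with fewer nodes into the tree with more nodes, to obtain an optimized union-find forest.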
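  5. A non-redundant gene set clustering system, characterized by including:
    Gene alignment module: configured to perform an alignment operation on an original gene set and obtain the gene pairs in the original gene set that satisfy a similarity threshold;
    Union-find construction module: configured to construct a union-find forest based on the obtained gene pairs;
    Gene clustering module: configured to obtain the gene clustering results for all classes in the original gene set according to the union-find forest;
    Result output module: configured to select, based on the gene clustering results, the longest sequence in each class as the representative sequence of that class, to obtain a non-redundant reference genome.
  6. The non-redundant gene set clustering system according to claim 5, characterized in that the gene alignment module performs the alignment operation on the original gene set by: setting a similarity threshold and aligning the original gene set against itself using BLAT; cleaning up the BLAT output by removing duplicate records and removing completely identical sequences; and finally deleting unneeded columns, keeping only the sequence names of each gene pair and their respective lengths.
  7. The non-redundant gene set clustering system according to claim 5 or 6, characterized in that the union-find construction module constructs the union-find forest based on the obtained gene pairs by: for the two genes of any gene pair, first looking up the root of each gene through the Find operation; if the two roots are different, merging the two trees to which the genes belong into a single tree through the Union operation and updating the root information; if the two roots are the same, performing no Union operation.
  8. The non-redundant gene set clustering system according to claim 7, characterized by further including a union-find optimization module, configured to perform path optimization on the union-find forest through a path compression operation so that the child nodes of every tree point directly to the root node, and, when merging trees, to merge the tree with fewer nodes into the tree with more nodes, to obtain an optimized union-find forest.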
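  9. An electronic device, including:
    at least one processor; and
    a memory communicatively connected to the at least one processor; wherein
    the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the following operations of the non-redundant gene set clustering method according to any one of claims 1 to 4:
    Step a: performing an alignment operation on an original gene set to obtain the gene pairs in the original gene set that satisfy a similarity threshold;
    Step b: constructing a union-find forest based on the obtained gene pairs;
    Step c: obtaining the gene clustering results for all classes in the original gene set according to the union-find forest;
    Step d: based on the gene clustering results, selecting the longest sequence in each class as the representative sequence of that class, to obtain a non-redundant reference genome.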
PCT/CN2019/130563 2019-04-16 2019-12-31 Non-redundant gene set clustering method and system, and electronic device WO2020211466A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP19925504.3A EP3955256A4 (en) 2019-04-16 2019-12-31 METHOD AND SYSTEM FOR NON-REDUNDANT GENE CLUSTERIZATION AND ELECTRONIC DEVICE
US17/477,471 US20220005546A1 (en) 2019-04-16 2021-09-16 Non-redundant gene set clustering method and system, and electronic device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910303390.2A CN110060740A (zh) 2019-04-16 2019-04-16 Non-redundant gene set clustering method and system, and electronic device
CN201910303390.2 2019-04-16

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/477,471 Continuation US20220005546A1 (en) 2019-04-16 2021-09-16 Non-redundant gene set clustering method and system, and electronic device

Publications (1)

Publication Number Publication Date
WO2020211466A1 true WO2020211466A1 (zh) 2020-10-22

Family

ID=67319187

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/130563 WO2020211466A1 (zh) 2019-04-16 2019-12-31 Non-redundant gene set clustering method and system, and electronic device

Country Status (4)

Country Link
US (1) US20220005546A1 (zh)
EP (1) EP3955256A4 (zh)
CN (1) CN110060740A (zh)
WO (1) WO2020211466A1 (zh)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110060740A (zh) * 2019-04-16 2019-07-26 中国科学院深圳先进技术研究院 Non-redundant gene set clustering method and system, and electronic device
CN111026920A (zh) * 2019-12-17 2020-04-17 深圳云天励飞技术有限公司 Archive merging method and apparatus, electronic device, and storage medium
US20240248628A1 (en) * 2023-01-24 2024-07-25 VMware LLC Tiered memory data structures and algorithms for union-find
CN117037912B (zh) * 2023-09-13 2024-06-18 青岛极智医学检验实验室有限公司 Pan-genome construction method, terminal device, and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
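CN108197434A (zh) * 2018-01-16 2018-06-22 深圳市泰康吉音生物科技研发服务有限公司 Method for removing human-derived gene sequences from metagenomic sequencing data
WO2018186740A1 (en) * 2017-04-04 2018-10-11 Skylinedx B.V Method for identifying gene expression signatures
CN108846259A (zh) * 2018-04-26 2018-11-20 河南师范大学 Gene classification method and system based on clustering and random forest algorithms
CN110060740A (zh) * 2019-04-16 2019-07-26 中国科学院深圳先进技术研究院 Non-redundant gene set clustering method and system, and electronic device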

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060052943A1 (en) * 2004-07-28 2006-03-09 Karthik Ramani Architectures, queries, data stores, and interfaces for proteins and drug molecules
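KR20080094347A (ko) * 2007-04-20 2008-10-23 인하대학교 산학협력단 SPRi method for specific DNA and proteins on a gold surface
US9553997B2 (en) * 2014-11-01 2017-01-24 Somos, Inc. Toll-free telecommunications management platform
CN106971091B (zh) * 2017-03-03 2020-08-28 江苏大学 Tumor identification method based on deterministic particle swarm optimization and support vector machine
CN107577923B (zh) * 2017-09-26 2018-12-04 广东美格基因科技有限公司 Method for identifying and classifying highly similar microorganisms
CN109243531B (zh) * 2018-07-24 2021-11-26 江苏省农业科学院 Method for batch calculation of SNP sites in genome coding regions between closely related species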

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
HOBOHM U, SANDER C: "Enlarged representative set of protein structures", PROTEIN SCIENCE, vol. 3, no. 3, 1994, pages 522 - 524
HOBOHM U, SCHARF M, SCHNEIDER R ET AL.: "Selection of representative protein data sets", PROTEIN SCIENCE, vol. 1, no. 3, 1992, pages 409 - 417
HOLM L, SANDER C: "Removing near-neighbour redundancy from large protein sequence collections", BIOINFORMATICS, vol. 14, no. 5, 1998, pages 423 - 429, XP003000272, DOI: 10.1093/bioinformatics/14.5.423
LI W: "Fast Program for Clustering and Comparing Large Sets of Protein or Nucleotide Sequences", SPRINGER US, 2015
LI W, JAROSZEWSKI L, GODZIK A: "Clustering of highly homologous sequences to reduce the size of large protein databases", BIOINFORMATICS, vol. 17, no. 3, 2001, pages 282 - 283
LI W, JAROSZEWSKI L, GODZIK A: "Tolerating some Redundancy Significantly Speeds up Clustering of Large Protein Databases", BIOINFORMATICS, vol. 18, no. 1, 2002, pages 77 - 82
WANG G, DUNBRACK R L JR.: "PISCES: a protein sequence culling server", BIOINFORMATICS, vol. 19, no. 12, 2003, pages 1589

Also Published As

Publication number Publication date
CN110060740A (zh) 2019-07-26
US20220005546A1 (en) 2022-01-06
EP3955256A4 (en) 2022-06-22
EP3955256A1 (en) 2022-02-16

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application. Ref document number: 19925504; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase. Ref country code: DE
ENP Entry into the national phase. Ref document number: 2019925504; Country of ref document: EP; Effective date: 20211111