CN112863607A - Large-scale gene data-oriented same identification system and optimization processing method - Google Patents
Large-scale gene data-oriented same identification system and optimization processing method
- Publication number
- CN112863607A
- Authority
- CN
- China
- Prior art keywords
- sequence
- gene
- data
- symbol
- symbols
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B50/00—ICT programming tools or database systems specially adapted for bioinformatics
- G16B50/30—Data warehousing; Computing architectures
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B25/00—ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
- G16B25/10—Gene or protein expression profiling; Expression-ratio estimation or normalisation
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/10—Sequence alignment; Homology search
Landscapes
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Biophysics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Biotechnology (AREA)
- Evolutionary Biology (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Genetics & Genomics (AREA)
- Molecular Biology (AREA)
- Bioethics (AREA)
- Databases & Information Systems (AREA)
- Chemical & Material Sciences (AREA)
- Analytical Chemistry (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention belongs to the field of computing and relates in particular to an identity determination system for large-scale gene data and an optimization processing method. The system combines internal memory and external storage to provide an efficient data organization basis and optimized query guarantees for gene data and its DNA locus information. The indexing method is partition-based and supports prefix retrieval, so that tolerance-based DNA matching over the given symbols is free of both missed filtering and erroneous filtering. The prefix-based pruning and filtering method makes effective use of the pruning power of low-frequency symbols: using the inverted lists of low-frequency symbols, it prunes most DNA sequences unrelated to the query sequence while ensuring efficient and error-free DNA sequence matching. The method for building the association information database on external storage links effectively to the DNA sequence IDs held in memory, adapts flexibly to different external storage architectures, and ensures that DNA matching results can be associated with individual information along multiple dimensions.
Description
Technical Field
To address the identity determination problem for large-scale gene data, the invention provides an optimization processing method based on filtering, pruning, and multidimensional association. With a view to efficient pruning-based matching and association, it provides a hierarchical data organization method for managing gene data and the associated gene locus data. Matching efficiency is improved by extracting characteristic symbols from gene sequences and indexing them in an in-memory inverted index, and a KeyValue organization of large-scale information provides efficient organization of, and fast, accurate access to, the original gene data in storage. With the system and processing method of this patent, gene data at the billion-record scale can be identified and queried quickly: provided the memory capacity is sufficient, a single retrieval on a single thread takes no more than 10 milliseconds. Iterative association along multiple dimensions can then be performed from the data identifiers in the query results, which provides a data organization strategy and optimized processing support for identifying persons involved in cases and for other application scenarios that require accurate gene matching.
Background
1) With the rapid development of genetic engineering and related technologies in recent years, the theories, methods, and applications of biological computing have received growing attention from related fields. As an interdisciplinary field, biological computing plays an increasingly important role. The acquisition, storage, and analysis of biological information such as proteins, proteomes, genes, and genomes, together with its biological significance, are important research topics in biological computing and, more broadly, in biology and medicine. Identity recognition based on genetic information has gradually spread into many areas of public security, and China has built DNA sample libraries that span time periods, administrative divisions, and professional groups and that hold a large amount of historically accumulated DNA data. With the standardization of forensic DNA typing techniques, increasingly powerful computers, and increasingly mature network technologies, DNA databases have emerged and developed. In the broad sense, a DNA database contains DNA data obtained in any field of biological research and focuses on processing information about genes and their related DNA sequences. In the narrow sense, a DNA database is a forensic science DNA database, at present also called a public security DNA database. DNA databases in China are mainly divided into a crime scene library, a prior-offender library, and a missing persons library. The crime scene library stores the DNA typing data of evidence collected at criminal case scenes together with the case information; its samples are mainly DNA samples from biological evidence found at the scene, and information on unidentified bodies is also entered into this library.
The prior-offender library (the library of persons with prior criminal records) is a DNA database that stores the DNA typing data and information codes of offenders; its samples are mainly DNA samples from persons with prior criminal records. The missing persons library is a DNA database that stores the DNA typing data and related information of the parents, spouses, and children of missing persons, and of suspected missing persons; its samples are mainly DNA samples from those relatives and from the missing persons themselves.
2) Classified from different angles, DNA databases support several comparison modes. Tolerance comparison, a preset mode in which a hit is still reported when one or more loci are inconsistent between two records, is the most common mode in current databases; it is mainly used to avoid missed hits caused by gene mutation, data entry errors, and the like. In 1998, the department of justice applied for the project "DNA database pattern library for crime in china", in which 13 STR loci of 2,500 offenders were examined and the resulting genetic data were analyzed statistically; this marked the start of large-scale DNA database construction in China. Personal identification ability is the probability that two individuals drawn at random from a population have different phenotypes for a genetic marker; it measures the system efficacy of a genetic marker system used for identity determination. In general, the higher the genetic polymorphism of a marker, the higher its system efficacy, and the more markers are used, the higher the system efficacy. Kits commonly used for forensic DNA testing currently cover 19 loci, that is, 19 genetic markers; their cumulative individual identification ability ensures that a single person is identified with only a small probability of error, which theoretically justifies their use for identity determination.
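The definition of personal identification ability above can be made concrete with the standard population-genetics formulas, given here as a minimal sketch rather than as part of the patent. For one marker, the value is one minus the sum of squared genotype frequencies; across markers, the values combine as one minus the product of the per-marker complements, which assumes the loci are statistically independent. The function names are hypothetical.

```python
from math import prod
from typing import Iterable


def discrimination_power(genotype_frequencies: Iterable[float]) -> float:
    """Probability that two randomly drawn individuals differ at one marker."""
    # Two random individuals share genotype g with probability p_g ** 2.
    return 1.0 - sum(p * p for p in genotype_frequencies)


def cumulative_discrimination_power(per_locus: Iterable[float]) -> float:
    """Combine per-locus values, ASSUMING the loci are independent."""
    # The pair is indistinguishable only if it matches at every locus.
    return 1.0 - prod(1.0 - dp for dp in per_locus)
```

This illustrates why adding markers raises the cumulative value: each additional locus multiplies the probability of a chance match by a factor below one.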
In practice, DNA identity determination involves several key steps. After biological evidence collected at a crime scene is tested by a DNA laboratory, the resulting STR typing is entered into the DNA database and compared directly against the prior-offender library to find identical records. Alternatively, after a biological sample from a person with a prior record is tested by a DNA laboratory, the STR typing is entered into the database and compared against the crime scene library. Given the STR typings in the crime scene library, a key suspect at a crime scene can be identified on the basis of the identity determination result.
3) Once STR typing is determined, the identity determination problem can be defined as deciding whether the overlap between two DNA sequences is sufficient within a given tolerance. The traditional methods are as follows: 1) scan the DNA sequences in the historical library one by one and compare the high and low values of each locus of the query sequence with those of each historical sequence; stop as soon as the number of matched loci reaches the threshold and report a successful match; 2) as an optimization, given a tolerance threshold, compare each historical DNA sequence with the query locus by locus in order; stop as soon as the number of mismatched loci exceeds the tolerance threshold and report a failed match. Because most DNA sequences in the historical library do not match a given query sequence, method 2) effectively reduces the time needed to match a single DNA sequence when the data volume is large. However, when the historical DNA sequences number in the hundreds of millions, even a single locus comparison per sequence adds up to hundreds of millions of comparisons per query, which reduces the usability of the algorithm.
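The optimized scan in method 2) can be sketched as follows. This is an illustrative baseline, not code from the patent: the `Profile` representation, the function names, and the choice to count a locus missing from a historical record as a mismatch are all assumptions.

```python
from typing import Dict, List, Tuple

Genotype = Tuple[float, float]   # (low value, high value) at one locus
Profile = Dict[str, Genotype]    # locus name -> genotype (hypothetical layout)


def matches_within_tolerance(query: Profile, record: Profile, tolerance: int) -> bool:
    """Compare locus by locus and stop once mismatches exceed the tolerance."""
    mismatches = 0
    for locus, genotype in query.items():
        # Assumption: a locus absent from the record counts as a mismatch.
        if record.get(locus) != genotype:
            mismatches += 1
            if mismatches > tolerance:
                return False  # early termination, as in method 2)
    return True


def linear_scan(query: Profile, history: Dict[int, Profile], tolerance: int) -> List[int]:
    """Baseline: every historical record is visited once per query."""
    return [seq_id for seq_id, record in history.items()
            if matches_within_tolerance(query, record, tolerance)]
```

Even with early termination, `linear_scan` touches every record, which is the cost the prefix pruning approach is meant to avoid.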
The database field offers a prefix pruning method that can be applied to gene comparison, as follows. A set of low-frequency locus values is determined by frequency; candidate DNA sequences that match the query DNA sequence to a certain degree are then selected quickly by matching on these low-frequency values, without scanning all historical DNA sequences. The "potentially matching" candidate sequences left after pruning are verified in a later stage: each candidate is compared with the query locus by locus, and the sequences that satisfy the matching condition are returned. The motivation for this approach is as follows: by exploiting the differences among the high and low values of the loci, most historical DNA data that cannot match are filtered out using the low-frequency locus values; the filtering can be driven by inverted indexes, and the algorithm ensures "no filtering omission"; the final verification stage then yields all historical DNA sequences that satisfy the matching rule. To be usable in practice, the algorithm must be guaranteed at the system level to be free of both missed filtering and erroneous filtering. Here, the absence of missed filtering (filter leakage) means that every sequence in the final result satisfies the given matching requirement, so that no unmatched result is added; the absence of erroneous filtering is especially important and means that the result set is no smaller than the set defined by the given conditions. In summary, efficient DNA sequence matching that is free of both missed filtering and erroneous filtering is technically challenging.
DNA-enabled technology is gradually maturing, and extracting gene loci allows identity to be determined with high probability. This provides a means of identifying biological traits in economic and daily life, and a primary basis for determining individual identity in public security. DNA-based identity determination can effectively improve the working efficiency of the public security system and solve the problem of identifying key suspects. For this problem, similarity matching techniques from the database field offer an approach worth trying. This patent provides a prefix pruning method for accurate DNA comparison: data partitioning is used to build DNA sequence partitions for different locus sets, and an inverted index is built for each partition, which gives prefix pruning an in-memory data organization basis. The prefix pruning algorithm markedly improves comparison efficiency in large-scale data environments and guarantees that the comparison process is free of erroneous filtering. On this basis, a verification method ensures that the final result is free of filter leakage. The patent application also systematically provides a scheme for distributing and organizing data across internal memory and external storage, builds several ways of associating individuals on top of efficient identification, and can effectively support extended applications centered on DNA identification.
Disclosure of Invention
This patent provides key processing steps and introduces new techniques at two core levels of DNA identification, offering efficient solutions for the identity determination data organization and processing system and for the filter-and-verify comparison method, respectively.
Data sets are partitioned by locus set: the different locus sets that occur in the historical data are divided into several sets, and each historical record belongs to the partition of its locus set. Within each partition, the high and low values of each locus are combined into a symbol, and an in-memory inverted index records, for each symbol, the IDs of the DNA sequences that contain it together with the position information of the symbol. The mapping from every DNA sequence ID to its symbol set is stored in an in-memory hash table, and the original gene sequences and association relations are stored in external storage. This data organization supports fast symbol-based DNA matching in memory, and the record IDs allow external information held in external storage to be associated quickly.
The resulting comparison algorithm improves retrieval efficiency for large-scale gene sequences in two stages. In the filtering stage, low-frequency symbols are searched first in each partition using the inverted list structure, so that most data sharing no symbol with the query sequence at the given positions are filtered out. In the verification stage, the symbol sets of the small number of remaining candidate DNA sequences are fetched from the hash table by ID and verified, which ensures that every sequence in the final result set satisfies the given matching condition with respect to the query sequence.
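The two stages can be illustrated with a simplified sketch. It assumes a single partition in which every sequence has the same number of symbols as the query, a tolerance `t` counted in mismatched symbols, and a plain symbol-to-ID posting list without the positional information the patent also stores. Under these assumptions, any sequence that shares at least `n - t` of the query's `n` symbols must share at least one of the query's first `t + 1` symbols in global frequency order, so only those posting lists are scanned. All names below are hypothetical.

```python
from typing import Dict, List, Set


def filter_and_verify(query: Set[str],
                      order: Dict[str, int],
                      postings: Dict[str, List[int]],
                      symbols_by_id: Dict[int, Set[str]],
                      tolerance: int) -> List[int]:
    """Prefix filtering on rare symbols, then exact verification by ID."""
    if not 0 <= tolerance < len(query):
        raise ValueError("tolerance must satisfy 0 <= tolerance < len(query)")
    # Rarest symbols first; symbols never seen in the history rank before all others.
    ordered = sorted(query, key=lambda s: (order.get(s, -1), s))
    # Filtering stage: scan only the posting lists of the first t + 1 symbols.
    candidates: Set[int] = set()
    for symbol in ordered[: tolerance + 1]:
        candidates.update(postings.get(symbol, ()))
    # Verification stage: fetch each candidate's symbol set from the hash table.
    required = len(query) - tolerance
    return sorted(seq_id for seq_id in candidates
                  if len(query & symbols_by_id[seq_id]) >= required)
```

The filtering stage may admit sequences that fail the threshold, which is why verification is needed; it cannot drop a sequence that meets the threshold, because such a sequence misses at most `t` of the query's symbols.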
As the final application stage of the system, data association is performed in two orthogonal directions. First, the individual identity information corresponding to each result can be extracted from the ID of the final result sequence, and other information related to that individual can then be associated through the identity information. Second, given the ID of a result sequence, the original gene information can be obtained from the gene library; other gene information sharing the same feature set as the individual's genes can be found through the gene pattern, and individuals are associated iteratively through the IDs of the associated genes.
The following technical scheme is specifically provided:
An identity determination system for large-scale gene data, characterized in that an efficient indexing and storage system is constructed to match large-scale gene data efficiently, the system comprising:
a client: sends a request for the gene sequence to be queried to the server;
a server: receives the gene sequence request from the client user, calls a service algorithm to extract the high and low locus values of the requested gene sequence at the characteristic loci, and obtains, by querying the server, the individual information identified as identical to the gene sequence, wherein the server comprises:
an index library: because the historical library may contain gene data of different lengths in different contexts, the data are first partitioned, and the loci of all historical sequences are symbolized to support the subsequent indexing process; the genotype of each individual gene sequence at the characteristic loci is extracted, and an inverted index is built as an optimized access structure for identity determination; the index structure builds index symbols by combining the high and low values of each locus and gives fast access to the primary keys of gene sequences through a mapping from symbols to gene sequences, where a primary key uniquely identifies one gene sequence; to speed up extraction of the gene sequence for a given gene ID, a hash table is built that maintains the mapping from gene ID to gene symbol sequence.
A storage library: used to store the gene locus values of each gene sequence and the original gene sequence itself; by default, the values of 19 gene loci are extracted from each original gene sequence to represent the individual, and all subsequent claim content also supports a user-defined set of gene loci supplied by the user. The data are stored in two ways;
wherein the characteristic locus values of each sequence are stored in memory, while the original gene sequences may be stored in external storage as files or in a database;
the gene locus values of each gene sequence are accessed quickly through the primary key of the gene sequence, in order to extract the original sequence and its associated individual information;
an external association library: used to obtain information such as the identity and social relationships of the subject of a successfully matched gene sequence. Using the contact information or personal identity information recorded when the gene data were collected, it can be joined with existing related information to obtain external association information.
A matching invocation unit: receives the gene sequence request from the client user and, based on the inverted index and the historical gene data, calls a service algorithm to extract the high and low locus values of the requested gene sequence at the characteristic loci and obtains, by querying the server, the individual information identified as identical to the gene sequence;
in the above identity determination system for large-scale gene data, the index library combines the high and low values of each locus into symbols; the differences in symbol frequency allow a large number of irrelevant gene sequences to be filtered out in the filtering stage, so that the verification cost for the candidate set of each query gene sequence is small, and the specific combination steps are as follows:
step 2.1, if the given DNA data are already a converted numerical representation of the genotype, skip directly to step 2.3. If the given DNA data are a numerical genotype representation that has not been converted, the characteristic loci must be extracted. In sexually reproducing organisms, DNA is the genetic material and carries the hereditary and trait information of the organism. For the segments of human DNA that carry genetic information, the genotype formed by the two alleles is recorded under the locus, that is, the site name assigned by international standards. The high and low values of each locus are defined to be arranged in ascending numerical order, and two genotypes at the same locus are equal if, after sorting, their low values and their high values are respectively equal.
And 2.2, if the given DNA data are only an original gene sequence composed of the bases A, C, T, and G, the high and low values of the characteristic loci are extracted according to the international locus rules. The characteristic loci may be any subset of the following complete set of loci, or a superset that contains the following loci:
{"", "MT1", "MT2", "AMEL", "D8S1179", "D21S11", "D7S820", "CSF1PO", "D3S1358", "TH01", "D13S317", "D16S539", "D2S1338", "D19S433", "vWA", "TPOX", "D18S51", "D5S818", "FGA", "ABOGROUP", "Penta D", "Pentax E", "DYS19", "DYS385", "DYS389I", "DYS389II", "DYS390", "DYS391", "DYS392", "DYS393", "DYS437", "DYS438", "DYS439", "DYS448", "DYS456", "DYS458", "DYS635", "DY_GATA_H4", "OLDMAKER", "FESFPS", "F13A01", "Penta E", "D19S253", "DES", "PLA2A", "D12S391", "MT HIV I", "D6S1043", "DYS385AB", "GATA H4", "MIX", "B_DYS389", "B_DYS389I", "B_DYS389II", "B_DYS390", "B_DYS390II", "B_DYS456", "DYF387S1ab", "DYS385a/b", "DYS388", "DYS389 I", "DYS389 II", "DYS444", "DYS447", "DYS449", "DYS460", "DYS481", "DYS518", "DYS522", "DYS527", "DYS527a/b", "DYS527ab", "DYS533", "DYS549", "DYS570", "DYS576", "DYS627", "DYS643", "DYS64310", "G_DYS19", "G_DYS1915", "G_DYS385", "G_DYS458", "R_DYS437", "R_DYS438", "R_DYS448", "R_Y_GATA_H", "R_Y_GATA_H4", "Y-DYS392", "YGATAH4", "Yindel", "Y_DYS385", "Y_DYS391", "Y_DYS392", "Y_DYS393", "Y_DYS439", "Y_DYS635", "Y_GATA_H", "Y_GATA_H4", "rel"}
and 2.3, combine each genotype of a DNA sequence in the form locus-high value/low value to obtain the symbol set of the sequence.
And 2.4, compute the frequency of every symbol obtained from all gene sequences in step 2.3 and assign each symbol a sequence number, with the numbers assigned in ascending order of symbol frequency. This number is defined as the global sequence number of the symbol and supports the subsequent inverted index organization and the pruning-based filtering process.
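Steps 2.3 and 2.4 can be sketched as follows, assuming genotypes are already available as numeric pairs (the output of steps 2.1 and 2.2). The exact symbol string layout, the function names, and the alphabetical tie-break between equally frequent symbols are assumptions for illustration; the text only requires a locus-high/low combination and numbering from low to high frequency.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple

Profile = Dict[str, Tuple[float, float]]  # locus name -> two allele values


def symbolize(profile: Profile) -> Set[str]:
    """Step 2.3: one symbol per locus, in the form locus-high/low."""
    symbols = set()
    for locus, (a, b) in profile.items():
        low, high = sorted((a, b))  # normalise so (9, 6) and (6, 9) coincide
        symbols.add(f"{locus}-{high:g}/{low:g}")
    return symbols


def global_order(symbol_sets: Iterable[Set[str]]) -> Dict[str, int]:
    """Step 2.4: number symbols from lowest to highest frequency."""
    freq = Counter(sym for symbols in symbol_sets for sym in symbols)
    # Assumption: ties are broken alphabetically to keep the order deterministic.
    ranked = sorted(freq, key=lambda sym: (freq[sym], sym))
    return {sym: rank for rank, sym in enumerate(ranked)}
```

Giving the rarest symbols the smallest numbers means that sorting a sequence's symbols by global number places its most selective symbols in the prefix.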
In the above identity determination system for large-scale gene data, the indexing method builds an inverted list for the combined symbol of each pair of high and low locus values. Each entry in an inverted list contains the ID of a gene sequence that includes the symbol and the position of the symbol in the ordered symbol set of that sequence; with this information, the candidate set can be pruned without accessing the full symbol sets of the candidate gene sequences, and the specific construction steps are:
step 3.1, consider each gene data set in turn and convert it into symbols to form an ordered symbol set, specifically as follows:
step 3.1a, if the gene data are in their original form, call step 2.2 to extract the genotypes of the predefined loci, naming each locus according to the international convention;
step 3.1b, symbolize the loci of each DNA sequence according to step 2.3, and order the symbols of each DNA sequence by the global sequence numbers determined in step 2.4;
3.2, if all sequences have the same length, go directly to step 3.3; otherwise, assign a primary key ID to each converted DNA sequence and sort the DNA sequences from short to long by number of loci;
step 3.3, scan the ordered symbol set of each DNA sequence one by one and, for each symbol, create a triple <#p, #t, ID> that records the position of the symbol in the ordered set, the number of symbols in the sequence, and the sequence ID, and put the triple into the inverted list of that symbol; if all symbol sets have the same size, the procedure ends here; otherwise the following step is executed.
And 3.4, if the DNA length information was maintained in step 3.2, record jump information for the length intervals in the inverted lists. Once the ordered symbol sets of all DNA sequences have been scanned, the creation of the inverted index is complete.
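The construction in steps 3.2 and 3.3 can be sketched as follows, assuming the sequences are already symbolized and a global symbol order is available. Positions are zero-based here, ties in length are broken by ID, and the length-interval jump information of step 3.4 is not modeled; these choices and the names are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Posting = Tuple[int, int, int]  # (#p position, #t symbol count, sequence ID)


def build_inverted_index(symbols_by_id: Dict[int, Set[str]],
                         order: Dict[str, int]) -> Dict[str, List[Posting]]:
    """Create one inverted list of <#p, #t, ID> triples per symbol."""
    index: Dict[str, List[Posting]] = defaultdict(list)
    # Step 3.2: visit sequences from short to long (ties broken by ID).
    by_length = sorted(symbols_by_id, key=lambda i: (len(symbols_by_id[i]), i))
    for seq_id in by_length:
        # Step 3.1b: order the sequence's symbols by global sequence number.
        ordered = sorted(symbols_by_id[seq_id], key=lambda s: order[s])
        # Step 3.3: one triple per symbol, appended to that symbol's list.
        for position, symbol in enumerate(ordered):
            index[symbol].append((position, len(ordered), seq_id))
    return dict(index)
```

Because sequences are visited from short to long, the triples in each inverted list end up grouped by sequence length, which is what would make per-length jump pointers possible.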
In the above identity determination system for large-scale gene data, the inverted index ultimately points directly to the primary keys (IDs) of the massive set of gene sequences, and the in-memory hash structure of high and low gene values allows all high and low locus values of a sequence to be fetched quickly by ID, so that association along multiple dimensions can be performed through the IDs and the high and low locus values, and the specific construction steps are as follows:
step 4.1, using the sequence IDs determined in step 3.2 and their high and low genotypes, maintain a hash table from the ID of each sequence to its genotype set;
and 4.2, using the sequence IDs determined in step 3.2 and the original gene data, store the gene data in a maintained gene library in one of two ways:
4.2a, use a database or a KeyValue store to give fast access to each piece of gene data with the ID as the key;
step 4.2b, store all the gene data sequentially on disk while maintaining a clustered BTree that indexes each sequence ID to the starting disk position of its gene data;
4.3, maintain the association between gene sequences and identities through the ID of the subject corresponding to each gene sequence in the association data set;
and 4.4, through each sequence ID, all genotypes of the sequence can be obtained quickly on the basis of step 4.1, and through these genotypes other gene sequences matching the given genotypes can be found by pattern in other external gene libraries, which realizes gene pattern association based on gene loci.
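The sequential layout of step 4.2b can be sketched as an append-only store with an ID-to-offset index. This is a minimal illustration only: a Python dict stands in for the clustered BTree, an in-memory byte stream stands in for the disk file, and the class and method names are hypothetical.

```python
import io
from typing import BinaryIO, Dict, Optional, Tuple


class SequentialGeneStore:
    """Append raw sequences back to back and index them by sequence ID."""

    def __init__(self, stream: Optional[BinaryIO] = None) -> None:
        # Assumption: any seekable binary stream works; BytesIO mimics a file.
        self._stream = stream if stream is not None else io.BytesIO()
        # Stand-in for the clustered BTree: ID -> (start offset, byte length).
        self._offsets: Dict[int, Tuple[int, int]] = {}

    def put(self, seq_id: int, raw_sequence: str) -> None:
        payload = raw_sequence.encode("ascii")
        self._stream.seek(0, io.SEEK_END)  # always append at the end
        self._offsets[seq_id] = (self._stream.tell(), len(payload))
        self._stream.write(payload)

    def get(self, seq_id: int) -> str:
        offset, length = self._offsets[seq_id]
        self._stream.seek(offset)  # jump straight to the stored start position
        return self._stream.read(length).decode("ascii")
```

A real deployment would pass a file opened in binary read/write mode and persist the index; the point here is only that the ID resolves to a start position, so a matched ID can be turned into its original sequence without scanning.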
In the same identification system for large-scale gene data, a unified individual identifier is used to maintain both the original gene data of an individual and the social information related to that identity, and the gene patterns and social information of individuals are searched and queried based on the individual gene identification results. The specific construction steps of the external association library are as follows:
step 5.1, storing the original gene sequences of all gene data in the form <ID, locus symbol, locus value>;
step 5.2, using the gene ID of step 5.1 as the primary key, storing the gene fragments of the original sequences in the historical gene library, and at the same time storing, for each gene fragment, the IDs of all genes containing that fragment, where each ID corresponds to a gene in the gene sequence library; maintaining in the gene pattern library all fragment information of genes that are not in the gene library, and organizing these genes and the gene data of other existing gene libraries under a unified pattern through the ID;
step 5.3, using the gene ID of step 5.1 as the primary key, maintaining the identity primary key of the subject corresponding to the gene, and storing the attribute information of the subject together with the subject identity.
An optimization processing method based on the same identification system for large-scale gene data, characterized in that, based on a prune-and-verify matching algorithm centred on in-memory matching, the locus values of a gene sequence to be queried are matched through the two stages of filtering and verification, and, together with the constructed external association library, the query passes in sequence through symbolization, filtering, verification and association. Based on the given global symbol frequencies, a symbol set is first constructed for the sequence to be queried and sorted into an ordered symbol set suited to prefix pruning; based on the given inverted index structure of the historical sequences, the prefix symbol subset of the query is scanned and the large number of historical sequences that cannot match the query symbol set are filtered out; based on the created hash structure, the symbol sets of the candidate historical genes are verified to obtain the exact set of sequences that meet the matching requirement; based on the structure of the association database, the individual identities and gene patterns of the result gene sequences are associated. The method specifically comprises the following steps:
In the above optimization processing method, constructing the prefix symbol set gives exactly the minimum subset of each query sequence that must be matched, and filtering on this subset guarantees that any ordered DNA symbol set whose prefix is disjoint from the query sequence cannot have the same identity as the query sequence. The specific method of symbol conversion in step 1 is:
step 1.1, combining the high and low locus values of each site of the sequence to be queried: the high and low locus values are extracted from the sites of the sequence to be queried, or, if the query sequence directly gives the high and low locus values of its sites, these values are combined directly;
step 1.2, sorting the combined symbols by frequency from low to high to form the ordered symbol set of the query sequence (the frequency being the global frequency of the symbols);
step 1.3, calculating the prefix length and the tolerance threshold of the ordered symbol set according to the tolerance range given by the query, and entering the prefix pruning stage:
step 1.3a, if the given query condition is exact identity, i.e. no differing high and low genotypes are allowed between the query sequence and the result sequence, the prefix length is defined as 1;
step 1.3b, if the result sequence and the query sequence are allowed to differ in the high and low genotypes of t sites, then, given the number #q of symbols of the query sequence, the prefix length is defined as t + 1;
step 1.3c, given the number n of symbols in the query sequence symbol set, and assuming that the minimum number of symbols in the data is l and the maximum is u with l < n < u, if the containment count between a result sequence of length r and the query sequence of length n is required to be larger than c, the prefix length can be defined as n - c;
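The ordering of step 1.2 and the prefix lengths of step 1.3 can be sketched as follows. The function names ordered_symbols and prefix_length are hypothetical; breaking frequency ties by symbol text and clamping the prefix length to the range 1..n are assumptions added here so the sketch is deterministic and safe on short inputs.

```python
from typing import Dict, Iterable, List, Optional


def ordered_symbols(symbols: Iterable[str], global_freq: Dict[str, int]) -> List[str]:
    """Step 1.2: sort symbols from low to high global frequency.

    Ties are broken by symbol text (an assumption, for determinism).
    """
    return sorted(symbols, key=lambda s: (global_freq.get(s, 0), s))


def prefix_length(n: int, t: int = 0, c: Optional[int] = None) -> int:
    """Step 1.3: prefix length for a query with n symbols.

    t: number of sites allowed to differ (t = 0 is exact identity, step 1.3a).
    c: if given, more than c symbols must be shared (step 1.3c).
    """
    length = n - c if c is not None else t + 1
    return max(1, min(n, length))  # clamping is an added safeguard
```

For example, a query with 19 symbols and a tolerance of t = 2 only needs its 3 rarest symbols scanned during filtering.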
In the above optimization processing method, the imbalance of symbol frequencies is exploited by pruning the candidate set on low-frequency symbols first, which reduces the time complexity of filtering and verification. The algorithm pruning in step 2 selects between two options depending on whether the DNA data set has the same site set as the query sequence. The specific method is as follows:
Option one, performed if the DNA data set has the same site set as the query sequence:
step 2.1, given the ordered query sequence and the large-scale ordered DNA symbol sets on which the inverted index has been constructed, scanning the prefix symbols of the query sequence within the prefix range;
step 2.2, constructing a candidate-set hash table whose key is the ID of a candidate DNA sequence and whose value is the number of identical symbols accumulated so far between that candidate sequence and the query sequence, and performing the following for each symbol of the query sequence:
step 2.2a, adding the items in the inverted list of the current symbol to the candidate set one by one, and performing one of the following steps depending on whether a new candidate record must be added:
step A, if the candidate set does not contain the sequence ID of the current item, adding a new record ID<#p, #q, c> to the hash table, recording the position #q of the symbol in the query record, the position #p of the symbol in the data sequence, and the number c of symbols currently shared by the query and the data sequence, with c set to 1;
step B, if the candidate set already contains the sequence ID of the current item, extracting the triple <#p, #q, c> from the candidate-set hash table, updating the position #p of the symbol in the data sequence and the position #q in the query sequence, and incrementing the number c of shared symbols by one;
step 2.2b, if the records in the data set have different numbers of symbols, updating the candidate set according to the positions #q and #p of the current prefix symbol, judged as follows:
step A, for the current state <#p, #q, c> of a candidate, if n < min(n - #q, u - #p) + c + t, removing the candidate;
step B, maintaining the removed candidate data in a culled hash set at all times, ensuring that any candidate generated by subsequent scanning has not previously been removed;
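The accumulation of steps 2.1 to 2.2a can be sketched as below. This is a minimal illustration under stated assumptions: the function name accumulate_candidates is hypothetical, positions are 0-based, and each inverted-list entry is taken to be a triple (#p, #t, ID). The removal rule of step 2.2b is deliberately left out.

```python
from typing import Dict, List, Tuple

# Assumed inverted-list entry: (position in data sequence, symbol count, ID)
Posting = Tuple[int, int, int]


def accumulate_candidates(
    query_prefix: List[str],
    inverted_index: Dict[str, List[Posting]],
) -> Dict[int, Tuple[int, int, int]]:
    """Scan the query prefix and build the candidate table ID -> (#p, #q, c)."""
    candidates: Dict[int, Tuple[int, int, int]] = {}
    for q, symbol in enumerate(query_prefix):
        for p, _total, seq_id in inverted_index.get(symbol, ()):
            if seq_id not in candidates:
                # step A: first shared symbol for this sequence
                candidates[seq_id] = (p, q, 1)
            else:
                # step B: update positions and increment the shared count
                _, _, c = candidates[seq_id]
                candidates[seq_id] = (p, q, c + 1)
    return candidates
```

Sequences that never appear in a scanned inverted list are never touched, which is where the filtering effect comes from.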
Option two, performed if the DNA data set has a different site set from the query sequence:
step 2.3, steps 2.1 and 2.2 assume that the DNA data set has the same site set as the query sequence; if the site selection of the historical DNA sequences differs, there are three possible relationships between the symbol set of the data set and the symbol set of the query sequence, defined over a query site set Q and a data site set D:
In any of these cases, the following steps are inserted: step 1-0 before step 1 above, and step 1-1 after step 1:
step 1-0, before constructing the inverted index, partitioning the data set by site set so that the data sequences in each data subset contain the same site set; step 1 is then performed on each data subset in turn; note that the symbol order of the index is determined by the occurrence of the symbols within each data subset, so the symbol order may differ between data subsets;
step 1-1, before determining the prefix length, first computing I = Q ∩ D for the sequence set whose site set is D, taking I as the query symbol set, keeping the tolerance unchanged, calculating the prefix length from I, and substituting it into step 2. Here, each time a symbol set is queried, the order is computed according to the symbol set D to be matched.
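Steps 1-0 and 1-1 can be sketched as follows, assuming each sequence is represented as a dictionary from locus name to a "high/low" string. The function names partition_by_sites and query_symbols_for_partition are hypothetical.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, List, Set


def partition_by_sites(
    sequences: Dict[int, Dict[str, str]],
) -> Dict[FrozenSet[str], List[int]]:
    """Step 1-0: group sequence IDs so each group shares one site set."""
    parts: Dict[FrozenSet[str], List[int]] = defaultdict(list)
    for seq_id, genotype in sequences.items():
        parts[frozenset(genotype)].append(seq_id)
    return dict(parts)


def query_symbols_for_partition(
    query: Dict[str, str], data_sites: Iterable[str]
) -> Set[str]:
    """Step 1-1: restrict the query to I = Q ∩ D and emit its symbols."""
    shared = set(query) & set(data_sites)
    return {f"{locus}-{query[locus]}" for locus in shared}
```

The prefix length for a partition is then computed from the size of the restricted symbol set rather than from the full query.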
In the above optimization processing method, the pruning result reduces the scale of verification and the pruning position reduces the time complexity of each verification. The specific method of candidate verification in step 3 is:
step 3.1, verification is carried out data set by data set: after one data set has been scanned and filtered based on its corresponding symbol set, the triple ID<#p, #q, c> of each piece of data to be verified in the candidate set is recorded, the symbol set of the record is extracted from the record hash table by ID, and the symbols of the record are sorted according to the symbol order of the current data set;
step 3.2, locating the record position #p and the query position #q and scanning onward through the symbols in sequence, each scanned symbol being handled according to one of two cases:
step 3.2a, if the symbols at position #p + i of the current candidate data symbol set and position #q + j of the query symbol set are the same, incrementing i and j by one and repeating step 3.2; if they differ, performing the following step;
step 3.2b, if the symbols differ, invoking step 2.2b to judge the candidate record: if the given constraint is not satisfied, verification is exited and the record is rejected, after which the following judgment step is performed;
step 3.3, if the number of symbols shared by any piece of data and the query exceeds |Q ∩ D| - t, the data is a qualified matching record, its ID is added to the result set, and step 3.1 is performed to verify the next candidate.
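The suffix scan of step 3.2 amounts to a merge over two symbol lists sorted in the same order. The sketch below makes several assumptions: the names shared_count and is_match are hypothetical, both lists are sorted by the same (rank, symbol) key, last_p and last_q are the 0-based positions of the last symbol already counted during filtering, and the query list is already restricted to I = Q ∩ D. The early exit of step 3.2b is omitted. The acceptance test reads "exceeds |Q ∩ D| - t" as "at least |Q ∩ D| - t", since with t = 0 an exact match shares exactly |Q ∩ D| symbols.

```python
from typing import Dict, List


def shared_count(
    data_syms: List[str],
    query_syms: List[str],
    rank: Dict[str, int],
    last_p: int = -1,
    last_q: int = -1,
    c: int = 0,
) -> int:
    """Count shared symbols, resuming after positions (last_p, last_q)."""
    def key(s: str):
        # Same ordering key as assumed for sorting both lists.
        return (rank.get(s, float("inf")), s)

    i, j = last_p + 1, last_q + 1
    while i < len(data_syms) and j < len(query_syms):
        a, b = data_syms[i], query_syms[j]
        if a == b:
            c += 1
            i += 1
            j += 1
        elif key(a) < key(b):
            i += 1  # data symbol cannot appear later in the query
        else:
            j += 1  # query symbol cannot appear later in the data
    return c


def is_match(data_syms, query_syms, rank, t, last_p=-1, last_q=-1, c=0) -> bool:
    """Step 3.3 under the 'at least |I| - t' reading described above."""
    total = shared_count(data_syms, query_syms, rank, last_p, last_q, c)
    return total >= len(query_syms) - t
```

Resuming from the recorded positions avoids rescanning the part of both lists that filtering already covered.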
In the above optimization processing method, association in multiple dimensions is performed based on the result data IDs and the gene data of their original sequences. The specific method of association extraction in step 4 is:
step 4.1, based on the constructed association table from ID to identity, extracting identity information from the individual identity library through the IDs of the successfully matched result data, and joining with other tables on the identity information to obtain the association information of the individual whose identity corresponds to the ID;
step 4.2, finding the original gene data in the gene library based on the data IDs successfully matched in step 4.1, performing feature extraction on each original result gene data through gene patterns, and finding other gene data with the given features in the library based on the extraction result;
step 4.3, using step 4.1 in reverse to obtain the gene data IDs of associated individuals, and then recursively finding the associated genes of those individuals in the gene library based on step 4.2, the recursion depth being specified by the user; or first finding the gene data set of the associated gene pattern based on step 4.2, and then recursively finding the associated identity information in the identity library based on the gene data IDs of that data set.
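The user-bounded recursion of step 4.3 can be sketched as a level-by-level expansion. Everything here is a hypothetical stand-in: identity_of plays the role of the ID-to-identity association table, related_identities the role of the joins with other tables, and ids_of_identity the reverse lookup from identity to gene data IDs.

```python
from typing import Dict, Iterable, List, Set


def associate(
    result_ids: Iterable[int],
    identity_of: Dict[int, str],
    related_identities: Dict[str, List[str]],
    ids_of_identity: Dict[str, List[int]],
    depth: int,
) -> Set[int]:
    """Expand matched IDs to associated gene data IDs, at most `depth` levels."""
    seen: Set[int] = set(result_ids)
    frontier: Set[int] = set(seen)
    for _ in range(depth):
        next_frontier: Set[int] = set()
        for seq_id in frontier:
            person = identity_of.get(seq_id)
            for other in related_identities.get(person, ()):
                for other_id in ids_of_identity.get(other, ()):
                    if other_id not in seen:
                        seen.add(other_id)
                        next_frontier.add(other_id)
        frontier = next_frontier
    return seen
```

Tracking the visited IDs keeps the expansion from looping when association relations are cyclic.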
The invention has the following advantages: 1. By combining internal and external memory, the system provides an efficient data organization basis and optimized query guarantees for gene data and its DNA locus information; the indexing method is partitioned and supports prefix retrieval, so that tolerance-based DNA matching on given symbols is guaranteed to be free of both missed filtering and erroneous filtering. 2. The prefix-based pruning and filtering method makes effective use of the pruning power of low-frequency symbols: based on the inverted lists of low-frequency symbols it prunes a significant share of the DNA sequences irrelevant to the query sequence, ensuring efficient and error-free DNA sequence matching. 3. The external-memory method for building the association information database links effectively to the DNA sequence IDs in internal memory and adapts flexibly to different external memory architectures, ensuring that DNA matching results can be associated with individual information in multiple dimensions.
Drawings
FIG. 1 shows the core flow of building the storage structures and performing the same identification process for a given partition. The dotted lines in the figure show the supporting relationship between the storage structures and the identification process, and the flow hierarchy shows the storage and operation relationships among the inverted index, the hashed DNA data and the external-memory DNA data after preprocessing. The storage in the third layer of the figure corresponds to step 3 of content (2), and the identification process of the third layer corresponds to steps 1-3 of content (3); in the fourth-layer verification stage, the storage corresponds to step 4 of content (2) and the identification process corresponds to steps 4-5 of content (3); in the final sixth-layer association stage, the storage corresponds to steps 5-6 of content (2) and the identification process corresponds to steps 7-9 of content (3).
FIG. 2 shows a schematic diagram of the frequency order of the inverted index and the structure of the inverted lists of the symbols. Each entry of a symbol's inverted list holds the ID of the corresponding DNA sequence (such as ID-1, ID-2), the position of the symbol (shown by the top blue rectangle in the figure) in that sequence, and the total number of symbols in the sequence. Note that in partition mode, since all DNA sequences in a partition have the same set of loci, the number of sites is the same and #t in the figure can be omitted.
FIG. 3 shows the architecture and core association design of the system at different storage levels. Most gene data that do not meet the matching requirements are filtered out through the inverted index, the remainder are verified through the hash index structure maintained in memory, and the IDs of the matching results obtained by verification serve as the basis for identity association and gene pattern association respectively.
Detailed Description
The following is further described with reference to the accompanying drawings.
(1) Large-scale gene data-oriented same identification system
The processing idea is as follows: constructing an identity recognition system for large-scale DNA sequences requires effective organization of the index structures and associated information, and must guarantee the availability and efficiency of the system throughout the processing flow. An inverted index access structure optimized for same identification is constructed by extracting the genotype of each individual gene sequence at the characteristic sites. The index structure builds index symbols by combining the high and low locus values of each site, and quickly accesses the primary keys of the gene sequences through a mapping from symbols to gene sequences (a primary key uniquely identifies one gene sequence). The locus values of a gene sequence and its original gene sequence are stored in two ways: the values of the characteristic loci of each sequence are stored in memory, while the original gene sequences can be stored in external memory as files or in a database. Through the primary key of a gene sequence, the locus values of that sequence can be quickly accessed, and the original sequence and the association information of the related individual can be further extracted.
Once a partition is determined, the scanning order of the prefix and of the query symbol sequence can be determined from the symbol frequency order within the partition, the ID set of the result sequences is obtained by the prefix pruning method and its verification process, and association can then be carried out in different dimensions based on that ID set. FIG. 1 shows the data architecture and processing flow of the system, which are described below by way of example. The construction of the storage structures and the identification process are described as follows:
a. Extracting the values of sequence loci. DNA contains the genetic and trait information of an organism and serves as genetic material evidence. For the segments carrying genetic information on a human DNA strand, the genotype formed by two alleles can be recorded using the internationally standardized locus names (i.e. site names). For example, the D8S1179 locus has 9 alleles: 10, 11, 12, 13, 14, 15, 16, 17, 18. A pair of alleles forms a genotype; for example, the genotype 10/11 of D8S1179 indicates that the typing at the D8S1179 locus has the value 10 for the high locus and 11 for the low locus. For two groups of high and low values at a locus to be the same, it is sufficient (not necessary) that the sorted high and low values are respectively equal. Given a query gene sequence, if it is not already expressed in terms of loci, the same locus extraction method is applied to the query sequence.
b. Combining the high and low locus values. Each genotype of a DNA sequence is combined in the form "locus"-"high locus value"/"low locus value", yielding the symbol set of the sequence. For example, the sequence <D8S1179(17/12), D21S11(33.2/28), D7S820(13/14), …> (where each parenthesized part gives the high and low locus values of the sequence at that site) is combined into the symbol set {D8S1179-17/12, D21S11-33.2/28, D7S820-13/14, …}. Based on the frequency of the combined symbols in the data, the symbols are assigned an order from low frequency to high frequency, and the symbol sequences of all data are sorted in this order. Given a query symbol sequence, the symbol order of the data is applied to the query so that the data and the query have the same symbol order.
c. Constructing the inverted index. The converted DNA sequences are assigned primary key IDs and are ordered from short to long by their number of loci (this step can be skipped when all sequences have equal length). The ordered symbol set of each DNA sequence is scanned in turn; for each scanned symbol a triple <#p, #t, ID> is created, whose components are the position of the symbol in the ordered set, the number of symbols of the sequence, and the sequence ID, and the triple is appended to the inverted list of that symbol. After the symbols of all data have been scanned, the inverted list of each symbol holds the IDs and prefix position information of all data containing that symbol. For a given symbol of a given query, the inverted list can be scanned for the other DNA data that also contain the symbol, so that data sequences not containing the query's symbols are pruned, reducing the time complexity of verification.
d. Maintaining a hash index from sequence ID to symbols. The in-memory hash structure of high and low gene values maintains the mapping from sequence ID to symbols, so that the high and low locus values of all sites of a sequence can be quickly extracted by ID. Specifically, the symbols formed from these high and low locus values are stored together in the index, and the order in which symbols appear may differ between partitions. Since an ID uniquely identifies a data sequence and a data sequence belongs to only one partition, each ID belongs to only one partition, and the order of its data symbols is independent of other partitions. In the verification stage, given a query, a symbol order and a symbol intersection are uniquely determined on each partition from the symbol set of the query; although the candidate sequence IDs remaining after pruning belong to different partitions, the ordered symbol set of each candidate sequence can be located by ID in the uniformly maintained hash table, and verification is then completed.
e. Constructing an index from sequence ID to the original gene data. The original gene data can be stored in memory or maintained in external memory. In the former case, the mapping between the ID and the original gene sequence is simply maintained in a hash table. In the latter case, either a database or key-value store is used to provide fast access to each piece of gene data with the ID as the key, or a clustered index is constructed: all gene data are stored sequentially on disk, and a clustered B-tree index is maintained from each sequence ID to the starting disk position of its gene data. Associations between gene sequences and their identities are maintained through the ID of the subject corresponding to each gene sequence in the association data set.
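Items b and c above can be sketched together for a single partition. The names to_symbols and build_index are hypothetical; each sequence is assumed to be a dictionary from locus name to a "high/low" string, positions are 0-based, and frequency ties are broken by symbol text so the order is deterministic.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Set, Tuple


def to_symbols(genotype: Dict[str, str]) -> Set[str]:
    """Item b: combine each locus with its high/low values into one symbol."""
    return {f"{locus}-{value}" for locus, value in genotype.items()}


def build_index(sequences: Dict[int, Dict[str, str]]):
    """Item c: return (ordered symbol lists by ID, inverted lists by symbol)."""
    symbol_sets = {sid: to_symbols(g) for sid, g in sequences.items()}
    freq = Counter(s for syms in symbol_sets.values() for s in syms)
    # Rarest symbols first; ties broken by symbol text (an assumption).
    ordered = {
        sid: sorted(syms, key=lambda s: (freq[s], s))
        for sid, syms in symbol_sets.items()
    }
    inverted: Dict[str, List[Tuple[int, int, int]]] = defaultdict(list)
    # Visit sequences from short to long, as described in item c.
    for sid in sorted(ordered, key=lambda k: len(ordered[k])):
        syms = ordered[sid]
        for p, s in enumerate(syms):
            inverted[s].append((p, len(syms), sid))  # triple <#p, #t, ID>
    return ordered, dict(inverted)
```

The ordered lists double as the ID-to-symbols hash index of item d, since they are keyed by sequence ID.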
(2) Large-scale gene data-oriented same identification processing method
The processing idea is as follows: given the sequence partitions, records that cannot match at the loci of the symbols can be accurately identified based on the intersection symbols of the ordered query; on this basis, the inverted lists corresponding to the query symbols can be scanned to filter out unmatched records. In this process, since the query symbol set is ordered from low to high symbol frequency, a large number of candidate DNA sequences can be removed during the prefix scan, so that the time complexity of the verification stage is significantly reduced. In the verification stage, the data sets are processed in turn: after one data set has been scanned and filtered (the symbol set used depends on the data set), the ID and shared-symbol position information of each piece of data to be verified in the candidate set are recorded, the symbol set of the record is extracted from the record hash table by ID, and finally the symbols of the record are sorted according to the symbol order of the current data set and the matching degree of the symbols is verified in sequence.
Matching is carried out on the inverted index shown in FIG. 2. Pruning passes through three stages in sequence: scanning the inverted lists of the symbols, accumulating shared symbols, and judging whether each candidate is legal. The verification phase can likewise be described as three stages: scanning suffix symbols, accumulating shared symbols, and determining whether the result is legal.
The processing flow is as follows:
a. Given the ordered query sequence and the large-scale ordered DNA symbol sets on which the inverted index has been constructed, the prefix symbols of the query sequence are scanned within the prefix range. In the figure, if the query contains the symbol D8S1179-9/9, two sequences (id-1 and id-2) are obtained; that is, sequences carrying any other symbol (other high and low locus values) at locus D8S1179, such as id-5, id-6 and id-8, which correspond to D8S1179-8/9, cannot share high and low locus values with the current query at that locus.
b. Accumulating the prefix information and shared-symbol information of the data containing each shared symbol. In the figure, once the symbol D8S1179-9/9 has been shared, the prefix information of id-1 and id-2 and the number of symbols each shares with the current query (currently 1) are maintained to facilitate subsequent pruning. The candidate-set hash table is used to maintain this information:
I. if the candidate set does not contain the sequence ID of the current item, a new record ID<#p, #q, c> is added to the hash table, recording the position #q of the symbol in the query record, the position #p of the symbol in the data sequence, and the number c of symbols currently shared by the two (c is set to 1);
II. if the candidate set already contains the sequence ID of the current item, the triple <#p, #q, c> is extracted from the candidate-set hash table, the position #p of the symbol in the data sequence and the position #q of the symbol in the query sequence are updated, and the number c of shared symbols is incremented by one.
c. If the records in the data set have different numbers of symbols, the candidate set is updated according to the positions #q and #p of the current prefix symbol, judged as follows:
I. for the current state <#p, #q, c> of a candidate, if n < min(n - #q, u - #p) + c + t, the candidate is removed;
II. the removed candidate data are maintained in a culled hash set at all times, ensuring that any candidate generated by subsequent scanning has not previously been removed.
d. Verification is carried out data set by data set: after one data set has been scanned and filtered (the symbol set used depends on the data set), the triple ID<#p, #q, c> of each piece of data to be verified in the candidate set is recorded, the symbol set of the record is extracted from the record hash table by ID, and the symbols of the record are sorted according to the symbol order of the current data set.
e. The record position #p and the query position #q are located and the symbols are scanned onward in sequence, each scanned symbol being handled according to one of two cases:
I. if the symbols at the current positions #p + i (in the candidate data symbol set) and #q + j (in the query symbol set) are the same, i and j are each incremented by one and step e is repeated;
II. if the symbols differ, the candidate record is judged: if the given constraint is not satisfied, verification is exited and the record is rejected.
f. If the number of symbols shared by any piece of data and the query exceeds |Q ∩ D| - t, the data is a qualified matching record and is added to the result set.
(3) Correlation method based on large-scale gene data same identification
The processing idea is as follows: the mapping from each DNA sequence ID to its symbol set is stored in an in-memory hash table, and the original gene sequences and association relations are stored in external memory. This data organization supports fast symbol-based DNA matching in memory, and the record IDs allow exogenous information in external memory to be associated quickly. As the final application stage of the system, data association is performed in two orthogonal directions: the individual identity information corresponding to each result can be extracted based on the ID of the final result sequence, and other information related to the individual is then associated through that identity information; in addition, given the ID of a result sequence, the original gene information can be obtained from the gene library, other gene information with the same feature set as the individual's genes can be found through gene patterns, and individual association is carried out iteratively through the IDs of the associated genes. The core association process is shown in FIG. 3, where the filtering and verification are described in detail in (2) above. Based on the verified result set IDs, the processing flow is as follows:
a. Each verified result sequence ID identifies an original DNA sequence that satisfies same identification with the sequence to be matched, and the original sequence can be extracted from external memory by this ID; likewise, the identity association information corresponding to the ID can be extracted from the associated individual identity library; in addition, based on other patterns of the original DNA sequence, other gene information matching the result sequence at given loci can be extracted via feature subsequences.
b. Based on the constructed association table from ID to identity, the identity information (such as the identity card number of the individual with a given ID) is extracted from the individual identity library through the IDs of the successfully matched result data, and the association information of the individual whose identity corresponds to the ID is obtained by joining with other tables on the identity information.
c. The original gene data are found in the gene library based on the successfully matched data IDs, feature extraction is performed on each original result gene data through gene patterns, and other gene data with the given features are found in the gene library based on the extraction result.
d. The gene data IDs of associated individuals can be obtained by using step b in reverse, and the associated genes of those individuals are then found recursively in the gene library based on step c (the recursion depth is specified by the user); or the gene data set of the associated gene pattern is first found based on step c, and the associated identity information is then found recursively in the identity library based on the gene data IDs of that data set.
The three core technical methods above can be used in sequence to solve the problems of unified storage and management of large-scale gene data, efficient prefix-based pruning and verification, and multi-dimensional association based on gene data.
The above is only the core flow of the present invention and is not intended to limit its scope of protection. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall fall within the scope of protection of the present invention.
Claims (10)
1. The same identification system for large-scale gene data is characterized in that an efficient indexing and storage system is constructed to efficiently match the large-scale gene data, and comprises the following steps:
a client: sending a gene sequence request to be inquired to a server;
a server: receiving gene sequence request information of a client user, calling a service algorithm to extract values of high and low loci of the gene sequence request information on characteristic sites, and obtaining the same identified individual information of the gene sequence through a request server, wherein the server comprises:
an index library: according to different contexts, partitioning is firstly carried out on the condition that a historical library contains gene data with different lengths, and the sites of all historical sequences are symbolized to support the subsequent indexing process; extracting the genotype of each monomer gene sequence on the characteristic site, and constructing an inverted index optimized access structure for the same identity identification; the index structure constructs index symbols by combining the values of high and low loci of each locus, and quickly accesses the main keys of the gene sequences through mapping from the symbols to the gene sequences, wherein the main keys are used for uniquely identifying one gene sequence; in order to accelerate the gene sequence extraction of a given gene ID, a hash table result is constructed, and the mapping from the gene ID to a gene symbol sequence is maintained;
a storage bank: used for storing the gene locus values and the original gene sequences; by default, the values of 19 gene loci are extracted from each original gene sequence to represent the individual, and all subsequent claims also support a user-defined gene locus set; the data are stored in two ways;
wherein the values of the characteristic loci of each sequence are stored in memory, while the original gene sequences can be stored in external memory by file or database means;
quickly accessing the value of the gene locus of each gene sequence through the main key of the gene sequence so as to extract the original sequence and the related individual association information of the original sequence;
an external association library: the method is used for acquiring information such as related identity and social relationship of successfully matched gene subjects; based on the contact information or the personal identity information generated when the gene extraction information is generated, the connection with the existing related information can be carried out to obtain the external related information;
a calling matching unit: receiving gene sequence request information of a client user, calling a service algorithm to extract values of high and low loci of the gene sequence request information on characteristic sites based on the inverted index and historical gene data thereof, and obtaining the same identified individual information of the gene sequence through a request server.
2. The system of claim 1, wherein the high and low locus values in the index library are combined to obtain symbols, and the difference in symbol frequencies allows a large number of irrelevant gene sequences to be filtered out in the filtering stage, so that the verification cost for the candidate set of each gene sequence to be queried is low, and the combination steps are as follows:
step 2.1, if the given DNA data is already the digital representation of the converted genotype, directly skipping to step 2.3; if the given DNA data has not yet been converted into a digital representation of the genotype, the characteristic sites need to be extracted; in a sexually reproducing organism, the DNA contains the inheritance and trait information of the organism and serves as genetic material evidence; on a fragment of the human DNA strand that carries genetic information, a genotype consisting of two alleles is recorded using loci with international standard names, i.e., site names; the high and low locus values of each locus are defined to be arranged from small to large by numerical value, and two genotypes at the same locus are equal if their sorted low and high locus values are respectively equal;
step 2.2, if the given DNA data is only an original gene sequence formed by an ACTG sequence, extracting the high and low locus values of the characteristic loci based on the international locus rules; a characteristic locus set is any subset of the following complete set of loci, or a superset containing the following loci:
{"","MT1","MT2","AMEL","D8S1179","D21S11","D7S820","CSF1PO","D3S1358","TH01","D13S317","D16S539","D2S1338","D19S433","vWA","TPOX","D18S51","D5S818","FGA","ABOGROUP","Penta D","Pentax E","DYS19","DYS385","DYS389I","DYS389II","DYS390","DYS391","DYS392","DYS393","DYS437","DYS438","DYS439","DYS448","DYS456","DYS458","DYS635","DY_GATA_H4","OLDMAKER","FESFPS","F13A01","Penta E","D19S253","DES","PLA2A","D12S391","MT HIV I","D6S1043","DYS385AB","GATA H4","MIX","B_DYS389 ","B_DYS389I","B_DYS389II","B_DYS390","B_DYS390II","B_DYS456","DYF387S1ab","DYS385a/b","DYS388","DYS389 I","DYS389 II","DYS444","DYS447","DYS449","DYS460","DYS481","DYS518","DYS522","DYS527","DYS527a/b","DYS527ab","DYS533","DYS549","DYS570","DYS576","DYS627","DYS643","DYS64310","G_DYS19","G_DYS1915","G_DYS385","G_DYS458","R_DYS437","R_DYS438","R_DYS448","R_Y_GATA_H","R_Y_GATA_H4","Y-DYS392","YGATAH4","Yindel","Y_DYS385","Y_DYS391","Y_DYS392","Y_DYS393","Y_DYS439","Y_DYS635","Y_GATA_H","Y_GATA_H4","rel"}
step 2.3, combining each genotype of a DNA sequence by adopting a gene locus-high gene locus value/low gene locus value mode to obtain a symbol set of the sequence;
step 2.4, carrying out frequency statistics on the symbols obtained by all genes through the step 2.3, and assigning a sequence number to each symbol, wherein the sequence numbers are generated from low to high according to the frequency of the symbols; defining the sequence number as a global sequence number of the symbol, wherein the sequence number is used for supporting the subsequent reverse index organization and pruning filtering process.
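Steps 2.3 and 2.4 can be illustrated with a minimal Python sketch. The function names, the `locus-low/high` string format, and the alphabetical tie-breaking between equally frequent symbols are assumptions made for illustration; the claim only fixes that the two allele values are sorted and that sequence numbers run from low to high frequency.

```python
from collections import Counter

def symbolize(profile):
    """Turn {locus: (allele_a, allele_b)} into a set of 'locus-low/high' symbols (step 2.3)."""
    symbols = set()
    for locus, (a, b) in profile.items():
        low, high = sorted((a, b))  # low value first, as defined in step 2.1
        symbols.add(f"{locus}-{low}/{high}")
    return symbols

def global_order(profiles):
    """Assign each symbol a global sequence number, rarest symbol first (step 2.4).

    Ties are broken alphabetically; this tie-breaking rule is an assumption.
    """
    counts = Counter(s for p in profiles for s in symbolize(p))
    ranked = sorted(counts, key=lambda s: (counts[s], s))
    return {s: i for i, s in enumerate(ranked)}

# Hypothetical toy data: three profiles over two loci.
profiles = [
    {"TH01": (9, 6), "vWA": (17, 17)},
    {"TH01": (6, 9), "vWA": (16, 18)},
    {"TH01": (7, 9), "vWA": (17, 17)},
]
order = global_order(profiles)
```

In this toy data the two symbols that occur once (`TH01-7/9`, `vWA-16/18`) receive sequence numbers 0 and 1, and the two symbols that occur twice receive 2 and 3, so rare symbols end up at the front of every ordered symbol set.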
3. The system for identifying identity of large-scale genetic data according to claim 2, wherein the indexing method constructs inverted tables for the combined symbols of each pair of high and low locus values, each inverted table entry contains the gene sequence ID of the combined symbol and the position of the symbol in the gene sequence ordered symbol set, and pruning of the candidate symbol set is completed based on the information without accessing all symbol sets of the candidate gene sequence, and the specific construction steps are as follows:
step 3.1, sequentially considering each gene data set, and performing symbolic conversion on the form of the data set to form a symbolic ordered set, which is specifically as follows:
step 3.1a, if the gene data is in an original form, calling step 2.2 to extract the genotype of the predefined locus, and naming the genotype of the locus according to an international method;
step 3.1b, symbolizing the locus of each DNA sequence according to step 2.3, and ordering the symbols of each DNA sequence based on the global sequence determined in step 2.4;
step 3.2, if all sequences are equal in length, directly executing step 3.3; otherwise, assigning a primary key ID to each converted DNA sequence and sorting the DNA sequences from short to long by number of loci;
step 3.3, scanning the ordered symbol set of each DNA sequence one by one, creating a triple <#p, #t, ID> for each symbol, whose components are respectively the position of the symbol in the ordered set, the number of symbols of the sequence, and the sequence ID, and putting the triple into the inverted list of that symbol; if all symbol sets are equal in length, ending the step, otherwise executing the following step;
step 3.4, if the length information of the DNA is maintained in the step 3.2, recording jump information of the length interval in an inverted list; and after all the symbol ordered sets of all the DNA sequences are scanned, the creation of the inverted index is completed.
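The construction in steps 3.2 and 3.3 can be sketched as follows. This is a simplified illustration, not the claimed implementation: the function name and the dictionary-of-lists layout are assumptions, and the jump information for length intervals of step 3.4 is omitted.

```python
from collections import defaultdict

def build_inverted_index(ordered_sets):
    """Build {symbol: [(p, t, seq_id), ...]} from {seq_id: [symbols in global order]}.

    p is the position of the symbol in the sequence's ordered symbol set,
    t is the number of symbols of that sequence (the triple of step 3.3).
    """
    index = defaultdict(list)
    # Shorter sequences are processed first (step 3.2), so every inverted
    # list ends up sorted by sequence length; ties are broken by ID.
    by_length = sorted(ordered_sets.items(), key=lambda kv: (len(kv[1]), kv[0]))
    for seq_id, symbols in by_length:
        t = len(symbols)
        for p, sym in enumerate(symbols):
            index[sym].append((p, t, seq_id))
    return dict(index)

# Hypothetical ordered symbol sets; letters stand in for locus symbols.
index = build_inverted_index({1: ["a", "c", "d"], 2: ["b", "c"], 3: ["a", "d"]})
```

Because each entry carries the position and the set size, the filtering stage can reason about a candidate without loading its full symbol set, which is the point made in the preamble of this claim.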
4. The large-scale gene data-oriented identity identification system according to claim 1, wherein the inverted index finally points to the primary keys (IDs) of the massive gene sequences, the in-memory hash structure of high and low locus values allows all high and low locus values of a sequence to be extracted rapidly based on its ID, and multi-dimensional association can further be performed using the ID and the high and low locus values, and the specific construction steps are as follows:
step 4.1, maintaining a hash table from the ID of each sequence to the genotype set thereof based on the sequence IDs determined in the step 3.2 and the high and low genotypes thereof;
and 4.2, storing the gene data in a maintained gene library based on the sequence ID determined in the step 3.2 and the original gene data thereof by adopting one of two modes:
4.2a, a database mode or a KeyValue storage mode can be adopted to construct a quick access mode with the Key as the ID on each piece of gene data;
step 4.2b, storing all the gene data sequentially in disk space, while maintaining a clustered B-tree index from each sequence ID to the starting disk position of its gene data;
4.3, maintaining the association relationship between the gene sequences and the identities thereof through the ID corresponding to the main body of each gene sequence in the association data set;
and 4.4, all genotypes of the sequences can be quickly obtained on the basis of the step 4.1 through each sequence ID, and other gene sequences matched with the given genotypes can be found in other gene foreign libraries based on the patterns through the genotypes, so that the gene pattern association based on the gene loci is realized.
5. The large-scale gene data-oriented identity authentication system according to claim 1, wherein unified individual identification is used to maintain original gene data related to individuals and social information related to the identities, and based on the individual gene identification result, the individual gene pattern and the individual social related information are searched, and the specific construction of the external association library comprises the following steps:
step 5.1, storing original gene sequences of all gene data according to the mode of < ID, locus symbol and locus value >;
step 5.2, using the gene ID in the step 5.1 as a main key, storing the gene segments of the original sequences in the historical gene library, and simultaneously storing the gene ID information containing all the gene segments according to the gene segments, wherein the ID corresponds to a certain gene in the gene sequence library; maintaining all fragment information of genes which are not in the gene library in the gene pattern library, and carrying out unified pattern organization on the genes and gene data in other existing gene libraries through ID;
and 5.3, taking the gene ID in the 5.1 as a main key, maintaining a main body identity main key of the main body corresponding to the gene, and storing attribute information related to the main body along with the main body identity.
6. An optimization processing method for the identity identification system for large-scale gene data according to claim 1, characterized in that, based on a pruning-verification matching algorithm that focuses on in-memory matching, the gene locus values of the gene sequence to be queried are matched through the two stages of filtering and verification, and, together with the construction of the external association library, symbolization, filtering, verification and association processing are performed in sequence; based on the given global symbol frequency, a symbol set is first constructed for the sequence to be queried, and the converted symbol set is sorted to form an ordered symbol set suited to prefix pruning; based on the given inverted index structure of the historical sequences, the prefix symbol subset of the query is scanned and a large number of historical sequences that cannot match the query symbol set are filtered out; based on the created hash structure, the symbol sets of the candidate historical genes are verified to obtain the exact set of sequences meeting the matching requirement; based on the structure of the association database, the result gene sequences are associated with individual identities and gene patterns; the method specifically comprises the following steps:
step 1, symbol conversion: combining the values of the high and low loci of each locus of the sequence to be queried; then, ordering according to the global frequency of the combined symbols from low to high to form an ordered symbol set of the query sequence; calculating the prefix length and the tolerance threshold of the ordered symbol set according to the query given tolerance range, and entering a prefix pruning stage;
step 2, algorithm pruning: controlling the number of scanning symbols based on the inquired prefix length, and scanning the inverted list according to the symbol sequence; constructing a temporary result of candidate set maintenance, and adding one to the number of public symbols of the existing candidate sequences in the candidate set every time one symbol is scanned; after the prefix symbol set is scanned, the number of public symbols of the candidate set is rechecked, and candidate gene sequences with difference exceeding a certain number with the gene sequence to be inquired are eliminated based on a tolerance threshold; entering a verification stage by carrying the sequence ID in the candidate set;
step 3, candidate verification: extracting values of all gene loci from a memory according to the last remaining gene sequence ID in the candidate set; obtaining the initial position of verification scanning according to the number of public symbols of each candidate sequence and the sequence to be matched and the position information of the last shared symbol, which are maintained in the candidate set; scanning the ordered symbol set suffixes of the candidate sequence and the sequence to be matched from the starting position; adding one to the number of public symbols every time a new shared symbol is found until the last prefix symbol is detected to obtain a candidate set, directly updating the candidate set according to the number of suffix symbols, and eliminating candidate sequences which cannot meet the threshold condition;
step 4, association extraction: constructing an external memory individual association database on the basis of the association database construction method given by the claim 5, wherein the external memory individual association database covers other individual gene information, gene pattern matching information of the other individual gene information and social relationship information associated with individual identities; the verified result sequence ID is used for identifying an original DNA sequence which meets the same identification with the sequence to be matched, and the original sequence of the ID can be extracted from an external memory based on the individual ID in the matching result; meanwhile, identity association information corresponding to the ID can be extracted from the associated individual identity based on the ID; also, based on other patterns of the original sequence DNA sequence, other genetic information can be extracted that matches the result sequence at a given locus information based on the feature subsequence.
7. The optimization processing method according to claim 6, wherein the prefix symbol set is constructed to accurately give the minimum subset to be matched for each query sequence, and filtering based on the subset can ensure that the DNA ordered symbol set whose prefix does not intersect with the query sequence is "impossible" to be identical to the query sequence, and the specific method of symbol transformation in step 1 is:
step 1.1, combining the values of high and low loci of each locus of a sequence to be queried; extracting the values of the high-low locus from the locus based on the sequence to be inquired, and directly combining the values of the high-low locus if the inquiry sequence directly gives the values of the high-low locus of the locus;
step 1.2, then, sequencing the combined symbols from low to high to form an ordered symbol set of a query sequence, wherein the frequency is the global frequency of the symbols;
step 1.3, calculating the prefix length and tolerance threshold of the ordered symbol set according to the tolerance range given by inquiry, and entering a prefix pruning stage:
step 1.3a, if the given query conditions are determined to be exact identity, namely different high and low genotypes are not allowed to exist between the query sequence and the result sequence, defining the prefix length as 1;
step 1.3b, if t high and low genotypes of different sites of the result sequence and the query sequence are allowed, giving the number # q of symbols of the query sequence, and defining the prefix length as t + 1;
and 1.3c, giving the number n of the query sequence symbol sets, and assuming that the minimum number of symbols of the data is l, the maximum number of symbols is u, and l < n < u, if the maximum inclusion number between the result sequence with the length of r and the query sequence with the length of n is allowed to be larger than c, defining the prefix length as n-c.
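The three cases of step 1.3 reduce to a small calculation, sketched below in Python. The function and parameter names are hypothetical. The rule rests on a counting argument: if two symbol sets of size n may differ in at most t symbols, they share at least n − t symbols, so any t + 1 symbols of the query must contain at least one shared symbol.

```python
def prefix_length(n, t=0, min_shared_exclusive=None):
    """Number of leading symbols of the ordered query set that must be scanned.

    n: number of symbols in the query set.
    t: number of differing genotypes tolerated (steps 1.3a and 1.3b).
    min_shared_exclusive: the bound c of step 1.3c, meaning that more than
        c shared symbols are required.
    """
    if min_shared_exclusive is not None:
        length = n - min_shared_exclusive  # step 1.3c
    else:
        length = t + 1                     # step 1.3a is the case t == 0
    # Clamping to the range [1, n] is an added safeguard, not part of the claim.
    return max(1, min(length, n))
```

For a 19-locus profile, an exact-identity query scans a single symbol, and a query tolerating two differing genotypes scans three. Since the ordered set starts with the globally rarest symbols, these few inverted lists tend to be short.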
8. The optimization processing method according to claim 6, wherein the imbalance of frequencies is used to preferentially prune the candidate set based on the low frequency symbols, and reduce the time complexity of filtering and verification, and the algorithm pruning in step 2 adopts selective execution steps according to the same or different site sets of the DNA data set as the query sequence, and the specific method is as follows:
option one: if the DNA data set has the same site set as the query sequence, the following steps are performed:
Step 2.1, giving the ordered query sequence and constructing a large-scale DNA ordered symbol set of the inverted index, and scanning prefix symbols of the query sequence in a prefix range;
step 2.2, based on the constructed hash table of the candidate set, the key of the hash table is the ID of the candidate DNA sequence, the value is the number of the same symbols accumulated between the current candidate sequence and the query sequence, and the execution is performed for the symbol of each query sequence:
step 2.2a, sequentially adding the items in the inverted arrangement list of the current symbol into the candidate set, and selectively executing the following steps according to whether the candidate records are required to be newly added or not:
step A, if the candidate set does not have the sequence ID in the current item, adding a new record ID < # p, # q, c > into the hash table, recording the position # q of the symbol in the query record, the position # p of the symbol in the data sequence and the number of the current sharing symbols of the symbol and the data sequence, and recording that c is 1;
step B, if the candidate set has the sequence ID recorded currently, extracting triples < # p, # q, c > from the hash table of the candidate set, updating the position # p of the symbol in the data sequence and the position # q in the query sequence, and adding one to the number c of the shared symbols;
step 2.2b, if the number of the recorded symbols in the data set is different, updating the candidate set according to the positions # q and # p of the current prefix symbols, and judging as follows:
step A, for the current state < # p, # q, c > of a certain candidate set, if n < min (n- # q, u- # p) + c + t, removing the candidate set;
step B, keeping the removed candidates in a "removed" hash set at all times, so that a sequence removed earlier is never added again as a candidate by subsequent scanning;
option two: if the site set of the DNA data set differs from that of the query sequence, the following steps are performed:
step 2.3, steps 2.1 and 2.2 assume that the DNA data set has the same site set as the query sequence; if the site selection of the historical DNA sequences is different, three situations exist for the symbol set in the data set relative to the symbol set of the query sequence, where a query site set Q and a data site set D are defined:
In each case, the following steps are inserted: step 1-0 before step 1 above, and step 1-1 after step 1:
1-0, before constructing the inverted index, segmenting the data sets according to the site sets and ensuring that the site sets contained in the data sequences in each group of data sets are the same; to this end, step 1 is performed in each data subset in a round-robin manner; it is noted that the symbol order of the index is determined by the occurrence order of the symbols within the data sets, the symbol order may be different between different data sets;
step 1-1, before determining the prefix length, for a sequence set whose site set is D, first computing I = Q ∩ D and taking I as the query symbol set; keeping the tolerance unchanged, calculating the prefix length from I and substituting it into step 2; wherein, for each query, the symbol order is the one computed for the symbol set D to be matched.
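For the case in which the data set and the query share the same site set (steps 2.1 and 2.2a), candidate generation can be sketched as below. This is a simplified illustration under stated assumptions: the function name and the tuple layout are hypothetical, and the position-based removal of step 2.2b and the "removed" hash set are left out.

```python
def prefix_filter(index, query, t):
    """Scan the first t + 1 symbols of the ordered query and collect candidates.

    index: {symbol: [(p, set_size, seq_id), ...]} inverted lists.
    query: query symbols, already sorted by the global symbol order.
    Returns {seq_id: (p, q, c)}: the last matched position in the data
    sequence, the last matched position in the query, and the shared count.
    """
    candidates = {}
    for q, sym in enumerate(query[: t + 1]):
        for p, _size, seq_id in index.get(sym, []):
            if seq_id in candidates:
                _, _, c = candidates[seq_id]
                candidates[seq_id] = (p, q, c + 1)  # step B: update positions, c + 1
            else:
                candidates[seq_id] = (p, q, 1)      # step A: new record with c = 1
    return candidates

# Hypothetical inverted lists; letters stand in for locus symbols.
toy_index = {
    "a": [(0, 2, 3), (0, 3, 1)],
    "b": [(0, 2, 2)],
    "c": [(1, 2, 2), (1, 3, 1)],
    "d": [(1, 2, 3), (2, 3, 1)],
}
exact = prefix_filter(toy_index, ["a", "c", "d"], t=0)
tolerant = prefix_filter(toy_index, ["a", "c", "d"], t=1)
```

With `t=0` only the inverted list of `a` is read, so sequence 2 never becomes a candidate; with `t=1` the list of `c` is read as well and sequence 2 enters the candidate set with one shared symbol.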
9. The optimization processing method according to claim 8, wherein the verification scale is reduced based on the pruning result, and the time complexity of each verification is reduced based on the pruning position, and the specific method of candidate verification in step 3 is:
step 3.1, verification is carried out successively aiming at different data sets, namely after scanning and filtering are carried out on one data set based on a corresponding symbol set, triple IDs < # p, # q and c > of each data to be verified in a candidate set are recorded, the recorded symbol set is extracted from a record hash table according to the IDs, and the recorded symbols are sequenced based on the symbol sequence of the current data set;
step 3.2, positioning the recorded position # p and the inquired position # q, and scanning the symbol backwards in sequence, wherein the symbol is scanned at each time and is processed differently according to two conditions:
step 3.2a, if the symbols at the position # p + i of the current candidate data symbol set and the position # q + j of the query symbol set are the same, adding one to i and j respectively and executing the step 3.2 in an iterative manner, and if the symbols at the position # p + i (in the candidate data symbol set) and the position # q + j (in the query symbol set) are different, executing the following steps;
step 3.2b, if the symbols are different, calling step 2.2b to judge the candidate record: if the given constraint is not satisfied, verification of the record is terminated and the record is eliminated, after which the following judgment step is executed;
and 3.3, if the number of symbols shared between a data record and the query exceeds |Q ∩ D| − t, the record is a qualified match; its ID is added to the result set, and step 3.1 is carried out to verify the next candidate.
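The verification of a single candidate can be sketched as a merge over two rank lists that resumes from the positions recorded during filtering. Several points are assumptions rather than statements of the claim: symbols are represented by their global sequence numbers, query and candidate share one site set (so |Q ∩ D| equals the query length), the acceptance threshold is treated as inclusive (at least n − t shared symbols, consistent with the prefix length t + 1 of step 1.3b), and the early exit uses the usual positional upper bound in place of the exact condition of step 2.2b.

```python
def verify(query_ranks, cand_ranks, p, q, c, t):
    """Return True if the candidate shares at least len(query_ranks) - t symbols.

    query_ranks, cand_ranks: ascending lists of global symbol sequence numbers.
    p, q, c: last matched position in the candidate, last matched position in
        the query, and shared-symbol count, as recorded by the filtering stage.
    """
    need = len(query_ranks) - t
    i, j = p + 1, q + 1  # resume right after the last shared symbol
    while i < len(cand_ranks) and j < len(query_ranks):
        # Upper bound on the final shared count; stop early if it is too low.
        if c + min(len(cand_ranks) - i, len(query_ranks) - j) < need:
            return False
        if cand_ranks[i] == query_ranks[j]:
            c += 1
            i += 1
            j += 1
        elif cand_ranks[i] < query_ranks[j]:
            i += 1
        else:
            j += 1
    return c >= need

# Hypothetical example: the two sets differ in exactly one symbol.
query_ranks = [0, 2, 3, 5]
cand_ranks = [0, 1, 3, 5]
```

Starting from the recorded state `p=0, q=0, c=1`, the candidate is accepted when one difference is tolerated and rejected under exact identity, where the bound check ends the scan after a single step.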
10. The optimization processing method according to claim 6, wherein the result data ID and the gene data of the original sequence thereof are associated in multiple dimensions, and the specific method of association extraction in step 4 is as follows:
step 4.1, based on the constructed association table from the ID to the identity, extracting identity information from an individual identity library through the ID of the successfully matched result data, and colliding with other tables based on the identity information to obtain the association information of the individual with the identity corresponding to the ID;
step 4.2, original gene data of the data are found in a gene library based on the data ID successfully matched in the step 4.1, feature extraction is carried out on each original result gene data through a gene mode, and other gene data with certain feature are found in the library based on the extraction result;
4.3, reversely obtaining the gene data ID of the related individual by using the step 4.1, further recursively finding the related gene of the related individual in the gene library based on the step 4.2, and the recursive hierarchy is designated by the user; or firstly, a gene data set of the associated gene pattern is found based on the step 2, and the associated identity information is found in the identity library by recursion based on the gene data ID corresponding to the data set.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011476095.6A CN112863607B (en) | 2020-12-14 | 2020-12-14 | Large-scale gene data-oriented identity identification system and optimization processing method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112863607A true CN112863607A (en) | 2021-05-28 |
CN112863607B CN112863607B (en) | 2024-03-22 |
Family
ID=75997240
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011476095.6A Active CN112863607B (en) | 2020-12-14 | 2020-12-14 | Large-scale gene data-oriented identity identification system and optimization processing method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112863607B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116701549A (en) * | 2023-06-21 | 2023-09-05 | 黑龙江禹桥科技开发有限公司 | Big data multi-scale fusion supervision system and method based on blockchain |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104156635A (en) * | 2014-07-08 | 2014-11-19 | 华南师范大学 | OPSM mining method of gene chip expression data based on common sub-sequences |
CN104182460A (en) * | 2014-07-18 | 2014-12-03 | 浙江大学 | Time sequence similarity query method based on inverted indexes |
WO2018000174A1 (en) * | 2016-06-28 | 2018-01-04 | 深圳大学 | Rapid and parallelstorage-oriented dna sequence matching method and system thereof |
US10545960B1 (en) * | 2019-03-12 | 2020-01-28 | The Governing Council Of The University Of Toronto | System and method for set overlap searching of data lakes |
CN111899855A (en) * | 2020-07-16 | 2020-11-06 | 武汉大学 | Individual health and public health data space-time aggregation visualization construction method and platform |
Non-Patent Citations (1)
Title |
---|
李文海;冯玉才;吕泽华;马晓鸣;: "基于粗集约简的索引枚举选择方法", 华中科技大学学报(自然科学版), no. 11, 15 November 2008 (2008-11-15) * |
Also Published As
Publication number | Publication date |
---|---|
CN112863607B (en) | 2024-03-22 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||