CN112364080A - Rapid retrieval system and method for massive vector library - Google Patents
Rapid retrieval system and method for massive vector library
- Publication number
- CN112364080A (application number CN202011269580.6A)
- Authority
- CN
- China
- Prior art keywords
- sample
- signal data
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06F16/2471 — Information retrieval of structured data; special types of queries: distributed queries
- G06F16/2465 — Query processing support for facilitating data mining operations in structured databases
- G06F16/285 — Databases characterised by their database models: clustering or classification
Abstract
The invention provides a rapid retrieval system and method for a massive vector library. The system comprises a central control unit and a plurality of subsystem units; the central control unit is responsible for extracting signal feature vectors and for distributing tasks and merging results. Each subsystem unit builds a data structure that requires no clustering and supports incremental updates: constructing the data structure is simpler, the constructed data structure is independent of the distribution of the data set, massive vector sets can be retrieved quickly, and samples can be inserted and deleted dynamically, so the system satisfies more practical scene requirements. The invention randomly selects split nodes from the sample set to be split, which simplifies the computation and ensures that the constructed data structure does not depend on the original data distribution; samples can be added to and deleted from the data structure dynamically, without rebuilding the data structure model after every addition or deletion.
Description
Technical Field
The invention belongs to the field of massive vector retrieval, and particularly relates to a massive vector library-oriented rapid retrieval system and method.
Background
Current massive-vector retrieval methods fall into three categories. The first is distributed search based on frameworks such as Hadoop: the target feature vector is dispatched to different subsystem units, each subsystem unit completes its own retrieval task independently, and the individual retrieval results are finally merged into the final result. The second is based on a data structure: the massive vector features are first partitioned by clustering, and a data structure model is then built from the clustering result; at retrieval time, the data structure is used to quickly find the cluster to which the target feature vector belongs, and all samples in that cluster are traversed to complete the retrieval. The third is the cascade method: samples are first filtered with simple features to narrow the retrieval range, and accurate retrieval is then performed within the reduced range.
The major disadvantages of the Hadoop-style framework are its large amount of computation, high resource consumption and low retrieval efficiency: it performs a brute-force search, so the sample to be retrieved must be matched against every sample in the sample library.
The main drawbacks of the data-structure-based method are that the sample library cannot be updated dynamically and that the memory requirement is high. The sample data must be clustered in advance and the data structure model generated from the clustering result; when the data volume is large, both the feature clustering and the construction of the data structure model are time-consuming. During search, the whole data structure model is loaded into memory, and the size of the model is proportional to the number of samples.
The main drawbacks of the cascade method are low precision and low retrieval efficiency. Simple features cannot fully describe the real information of a sample, so screening and filtering with simple features may degrade performance; moreover, the filtering stage must still compute similarities against all samples, and although simple features reduce the cost per comparison, the sample set is huge, the time consumed cannot be ignored, and efficiency remains low.
In summary, the main technical problems of existing massive-vector retrieval methods are as follows:
Low retrieval efficiency: the retrieval time is proportional to the size of the sample library, and when the library is very large (millions of samples or more), the vector retrieval speed cannot meet real-time requirements.
No dynamic insertion or deletion of samples: existing rapid massive-vector retrieval first clusters the sample library and then builds a specific data structure model on top of the clustering. Once the data structure model is built, samples cannot be added or deleted.
High resource occupancy: if a 512-dimensional vector describes one sample, each sample needs about 2 KB of space, and with more than 100 million samples the required storage exceeds 200 GB. To achieve fast retrieval, all of this data usually has to be loaded into memory, which consumes enormous resources.
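The storage figure quoted above follows from simple arithmetic; a minimal check in Python, assuming 4-byte (float32) vector components, which the patent does not state explicitly:

```python
# Back-of-the-envelope check of the storage estimate quoted above.
DIM = 512              # vector dimension per sample
BYTES_PER_DIM = 4      # assumed float32 components; the patent only gives the ~2 KB total
NUM_SAMPLES = 100_000_000

bytes_per_sample = DIM * BYTES_PER_DIM               # 2048 bytes, i.e. about 2 KB
total_gb = NUM_SAMPLES * bytes_per_sample / 1e9      # decimal gigabytes

print(f"{bytes_per_sample} B per sample, about {total_gb:.0f} GB in total")
# -> 2048 B per sample, about 205 GB in total (consistent with "exceeds 200 GB")
```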
Disclosure of Invention
In order to solve the technical problems, the invention provides a rapid retrieval system and a rapid retrieval method for a massive vector library.
To solve the above technical problems in existing massive vector library retrieval, the invention adopts the following system: it comprises a central control unit and a plurality of subsystem units, and the central control unit is connected to each of the subsystem units in turn.
The technical scheme of the invention is a rapid retrieval method oriented to a massive vector library, characterized by comprising the following steps:
Step 1: the central control unit extracts the feature vector of each original signal and manually labels the original signal data; it combines the feature vector and the label into a sample of the original signal data, sorts the samples by label to obtain the sorted signal data samples, and merges sorted samples that share the same label, obtaining the merged signal data sample set.
Step 2: the central control unit allocates a data sample set to each subsystem unit. Each subsystem unit reorganizes the allocated set and selects from it the sample set to be split; it randomly selects one sample from the set to be split as the split node, computes the similarity between every sample in the set and the split node, sorts the set by this similarity, and takes the similarity of the middle sample as the threshold; the code of each sample to be split is then updated according to the relation between its similarity and the threshold. Each resulting cluster is split again in the same way until every cluster contains fewer than the specified number of samples; the split nodes are updated along the way and added to the split node set.
Step 3: the central control unit extracts the feature vector of the target signal to be retrieved, packages it into a form containing the feature vector and a cluster code, and distributes it to each subsystem unit. Each subsystem unit starts from the split node whose cluster code is -1 in its split node set, computes the similarity between that split node and the sample to be retrieved, and updates the cluster code of the sample to be retrieved according to the similarity; this is repeated until no split node with the same code as the sample to be retrieved can be found in the subsystem unit. At that point all samples in the subsystem unit whose cluster code matches that of the sample to be retrieved are taken out, sorted by similarity, and the m samples with the highest similarity are selected as the retrieval result of that subsystem unit and uploaded to the central control unit. The central control unit merges the retrieval results uploaded by the subsystem units, sorts them by similarity to obtain a merged set, removes samples with duplicate labels, and selects the k samples with the highest similarity as the final output, i.e. the topK.
Step 4: the central control unit extracts the feature vector of the original information of the sample to be inserted and manually labels it, and inserts the sample into the corresponding data set of the central control unit in label order, using binary search on the label attribute. One subsystem unit is then selected at random as the subsystem unit to be inserted into, and the cluster code of the sample to be inserted is updated as in step 3 until no corresponding split node can be found in that subsystem unit. The number of samples in the subsystem unit whose cluster code equals that of the sample to be inserted is then counted; if it exceeds the specified threshold, one sample of the cluster is selected as a split node, the similarity between every sample in the cluster and the split node is computed, the cluster codes of all samples in the cluster are updated, and the split node is added to the split node set.
Step 5: the label of the sample to be deleted is given manually, and the central control unit uses binary search on the label to find, in its corresponding data set, the merged data of the sample to be deleted; each feature vector of that data is then used as a query and retrieved in each subsystem unit as in step 3, and retrieved samples carrying the label to be deleted are removed; if no sample with the same cluster code as the removed sample remains, the split nodes with that cluster code are also deleted from the subsystem unit's split node set.
Preferably, the feature vector of the original signal extracted in step 1 is:
f_k = F(x_k), k ∈ [0, n]
where x_k is the k-th original signal data, n is the number of original signal data, F denotes the feature extractor, and f_k is the feature vector of the k-th original signal data.
The label of the original signal data in step 1 is Id_k, where Id_k denotes the label of the k-th original signal data. The sample of the k-th original signal data is:
data_k = {Id_k, f_k}, k ∈ [0, n]
where data_k denotes the sample of the k-th original signal data and n is the number of original signal data.
Sorting the samples by the label of the original signal data, i.e. by Id_k in ascending order, gives the sorted signal data samples:
data'_i = {Id_{k_i}, f_{k_i}}, i ∈ [1, n], k_i ∈ [1, n]
where data'_i denotes the i-th sorted signal data sample, i.e. the sample of the k_i-th original signal data, n is the number of original signal data, Id_{k_i} is the label in the i-th sorted signal data sample, and f_{k_i} is the feature vector in the i-th sorted signal data sample.
If Id_{k_i} and Id_{k_j} are the same and i ≠ j, data'_i and data'_j are merged into the same merged signal data sample.
The merged signal data sample set in step 1 is:
φ = {φ_1, φ_2, ..., φ_L}, φ_u = {Id_{k_u}, f_{k_{u,1}}, ..., f_{k_{u,L_u}}}, u ∈ [1, L]
where φ denotes the merged signal data sample set, φ_u the u-th merged signal data sample, L_u the number of feature vectors in the u-th merged sample, Id_{k_u} the label in the u-th merged sample (i.e. the label of the k_u-th original signal data sample), and f_{k_{u,v}} the v-th feature vector in the u-th merged sample (i.e. the feature vector of the k_{u,v}-th original signal data sample), v ∈ [1, L_u].
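A minimal Python sketch of step 1, under stated assumptions: the feature extractor F is passed in as a plain callable (here a toy identity) and the merged set φ is represented as a label-keyed dictionary; the patent does not prescribe concrete container types.

```python
def build_merged_set(raw_signals, labels, extract_features):
    """Step 1 sketch: extract features, sort samples by label, merge equal labels.

    raw_signals      -- list of raw signal data items x_k
    labels           -- manually assigned labels Id_k, one per signal
    extract_features -- assumed stand-in for the feature extractor F
    Returns phi: an insertion-ordered dict mapping label -> list of feature vectors.
    """
    samples = [{"Id": lab, "f": extract_features(x)}
               for x, lab in zip(raw_signals, labels)]
    samples.sort(key=lambda s: s["Id"])      # sort by label, ascending

    phi = {}                                 # merged sample set (label order preserved)
    for s in samples:
        phi.setdefault(s["Id"], []).append(s["f"])
    return phi


if __name__ == "__main__":
    signals = [[3.0, 1.0], [0.5, 2.5], [1.0, 1.0]]
    ids = [7, 3, 7]
    print(build_merged_set(signals, ids, extract_features=list))
    # {3: [[0.5, 2.5]], 7: [[3.0, 1.0], [1.0, 1.0]]}
```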
Preferably, in step 2 the data sample set that the central control unit allocates to the i-th subsystem unit is:
φ^i = {φ^i_1, φ^i_2, ..., φ^i_c}, u ∈ [1, c]
where φ^i denotes the data sample set allocated by the central control unit to the i-th subsystem unit, φ^i_u the u-th merged signal data sample in the i-th subsystem unit, L_u the number of feature vectors in the u-th merged sample, Id_{k_u} the label in the u-th merged sample (i.e. the label of the k_u-th original signal data sample), f_{k_{u,v}} the v-th feature vector in the u-th merged sample (i.e. the feature vector of the k_{u,v}-th original signal data sample), v ∈ [1, L_u], and c the number of sample labels contained in each subsystem unit.
Each subsystem unit reorganizes its allocated set into individual sample records:
d^i_j = {Id_j, f_j, node, sim, code}, j ∈ [1, L]
where d^i_j denotes the j-th reorganized signal data sample of the i-th subsystem unit, L the number of reorganized samples in the i-th subsystem unit, Id_j the label of the j-th reorganized sample, f_j its feature vector, node whether the sample has been selected as a split node (default false), sim the similarity of the sample to the split node, and code the cluster code to which the sample belongs (initial value -1).
The sample set to be split of the i-th subsystem unit, i.e. the set of reorganized samples with code = M, is indexed by j ∈ [1, l], where l is the number of samples with code = M in the i-th subsystem unit and s_j denotes the similarity of the j-th such sample to the split node.
The similarity sim_AB denotes the similarity of the vectors A and B, where A and B are two vectors, A_i and B_i are the i-th dimension values of A and B, n is the dimension of the vectors, and i ∈ [1, n].
The samples of the set to be split are sorted by sim, with i ∈ [1, N] (i indexing the subsystem units) and k_j ∈ [1, l]: the j-th sorted sample corresponds to the k_j-th sample of the set to be split in the i-th subsystem unit, l is the number of samples in that set, and each sorted sample carries its label and feature vector.
The similarity of the middle sample of the sorted set to be split is taken as the threshold of the split node.
The cluster code of each sample to be split is updated according to the relation between sim and thresh, where code denotes the cluster code to which the sample belongs, sim the similarity computed between the sample and the split node, and thresh the threshold of the split node.
After each update of the cluster codes, the number of samples in the newly formed cluster is counted; if it exceeds the specified number, one sample is selected at random from the new cluster as a split node, the similarity between all samples of the new cluster and the new split node is computed, and the cluster codes of all samples in the new cluster are updated; otherwise the splitting ends.
A split node of the i-th subsystem unit is recorded together with its feature vector as {M, thresh}, where M denotes the region code to which the split node belongs and thresh the threshold of the split node.
The split node set of the i-th subsystem unit is:
ψ_i = {n^i_1, n^i_2, ..., n^i_q}, j ∈ [1, q]
where ψ_i denotes the split node set of the i-th subsystem unit, n^i_j the j-th split node in the i-th subsystem unit, M_j the code of that split node, T_j its threshold, and q the number of split nodes of the subsystem unit.
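A compact Python sketch of the splitting in step 2. It is illustrative only: cosine similarity is assumed for sim_AB (the patent does not reproduce the formula), and the cluster codes of the two children of a cluster with code M are assumed to be 2M+2 and 2M+3, a collision-free binary coding compatible with the initial code -1; the patent does not give the code-update rule explicitly.

```python
import math
import random


def cosine_sim(a, b):
    # assumed similarity measure; the patent only says sim is computed over the n dimensions
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def build_split_nodes(samples, max_cluster_size=10, rng=None):
    """samples: list of records {"Id", "f", "node", "sim", "code"} with code = -1 initially.
    Returns the split node set: a list of {"f", "M", "thresh"} records."""
    rng = rng or random.Random(0)
    split_nodes = []
    pending = [-1]                                    # cluster codes still to examine
    while pending:
        M = pending.pop()
        cluster = [s for s in samples if s["code"] == M]
        if len(cluster) < max_cluster_size:           # small enough: stays a leaf cluster
            continue
        pivot = rng.choice(cluster)                   # random split node
        for s in cluster:
            s["sim"] = cosine_sim(s["f"], pivot["f"])
        cluster.sort(key=lambda s: s["sim"])
        thresh = cluster[len(cluster) // 2]["sim"]    # middle sample's similarity
        high = [s for s in cluster if s["sim"] >= thresh]
        if not high or len(high) == len(cluster):     # degenerate split: keep as a leaf
            continue
        pivot["node"] = True
        for s in cluster:                             # assumed binary code update
            s["code"] = 2 * M + 2 if s["sim"] >= thresh else 2 * M + 3
        split_nodes.append({"f": pivot["f"], "M": M, "thresh": thresh})
        pending.extend([2 * M + 2, 2 * M + 3])
    return split_nodes
```

Retrieval (step 3), insertion (step 4) and deletion (step 5) walk these split nodes with the same code-update rule, starting from code -1.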
Preferably, the feature vector of the target signal to be retrieved extracted in step 3 is:
f_k = F(x_k), k ∈ [0, n]
where x_k is the k-th target signal data to be retrieved, n is the number of target signal data to be retrieved, F denotes the feature extractor, and f_k is the feature vector of the k-th target signal data to be retrieved.
In step 3 the query is packaged into a form containing the feature vector and a cluster code, distributed to each subsystem unit, and retrieval starts from the split node whose cluster code is -1 in the split node set of each subsystem unit. The sample of the target signal data to be retrieved is:
data_k = {f_k, code_k = -1}, k ∈ [0, n]
where data_k denotes the sample of the k-th target signal data to be retrieved, code_k the cluster code of that sample with initial value -1, and n the number of target signal data to be retrieved.
In step 3 the similarity between the split node and the sample to be retrieved is computed, and the cluster code of the sample to be retrieved is updated according to the similarity. The similarity sim_AB denotes the similarity of the vectors A and B, where A and B are two vectors, A_i and B_i are the i-th dimension values of A and B, n is the dimension of the vectors, and i ∈ [1, n]. The cluster code of the sample to be retrieved is updated according to the relation between sim and thresh, where code denotes the cluster code to which the sample belongs, sim the similarity computed between the sample and the split node, and thresh the threshold of the split node.
Step 3 repeats this until no split node with the same code as the sample to be retrieved can be found in the subsystem unit: after each update of the cluster code of the sample to be retrieved, the split node set is searched again for a split node with the same code as the current sample to be retrieved; if one exists, the similarity between the sample to be retrieved and that split node is computed and the cluster code of the sample to be retrieved is updated; otherwise the descent ends.
All samples in the subsystem unit whose cluster code matches that of the sample to be retrieved are then taken out: the codes of all samples in the subsystem unit's sample set are traversed, and all samples with the same cluster code (code = M) as the sample to be retrieved are collected into a set indexed by j ∈ [1, l], where the j-th element is the j-th sample with code = M in the i-th subsystem unit and l is the number of samples in the set.
These samples are sorted by their similarity to the sample to be retrieved, with k_j ∈ [1, l]: the j-th sorted sample corresponds to the k_j-th sample of the collected set in the i-th subsystem unit, l is the number of samples in the set, and each sorted sample carries its label and feature vector.
The central control unit merges the retrieval results uploaded by the subsystem units into a merged set indexed by k_j ∈ [1, m × N], where φ_tar denotes the merged data set of the subsystem results sorted by sim (m results from each of the N subsystem units).
In step 3, samples with duplicate labels are removed, and the k samples with the highest similarity are selected as the final output:
Out_tar = {out_1, out_2, ..., out_k}, i ∈ [1, k]
where Out_tar denotes the output retrieval set and out_i the i-th sample with the highest similarity to the sample to be retrieved in the retrieval result.
Preferably, the feature vector of the original information of the sample to be inserted extracted in step 4 is:
f_k = F(x_k), k ∈ [0, n]
where x_k is the k-th original signal data to be inserted, n is the number of original signal data to be inserted, F denotes the feature extractor, and f_k is the feature vector of the k-th original signal data to be inserted.
In step 4 the label of the original signal data to be inserted is Id_k, where Id_k denotes the label of the k-th original signal data to be inserted. The sample of the original signal data to be inserted is:
data_k = {Id_k, f_k}, k ∈ [0, n]
where data_k denotes the sample of the k-th original signal data to be inserted and n is the number of original signal data to be inserted.
In step 4 the sample to be inserted is inserted, in label order, into the data set corresponding to the central control unit (as shown in FIG. 2), using binary search on the label attribute in the central control unit's data set.
If the label of the sample to be inserted is found in the central control unit's data set, the feature vector of the sample to be inserted is added to the feature vectors with the same label in the data set; the data set after insertion has the form:
φ = {φ_1, φ_2, ..., φ_L}, u ∈ [1, L]
where φ denotes the merged signal data sample set of the central control unit, φ_u the merged signal data sample with the same label as the sample to be inserted, L_u the number of feature vectors in that merged sample, Id_{k_u} the label in the merged sample equal to the label of the sample to be inserted (i.e. the label of the k_u-th original signal data sample), and f_{k_{u,v}} the v-th feature vector in that merged sample (i.e. the feature vector of the k_{u,v}-th original signal data sample), v ∈ [1, L_u]; the feature vector of the sample to be inserted is appended to this merged sample.
If the label of the sample to be inserted cannot be found in the central control unit's data set, the insertion position is determined by the binary search and the sample to be inserted is inserted into the central data set; the data set after insertion is:
φ = {φ_1, φ_2, ..., φ_L, φ_{L+1}}, u ∈ [1, L+1]
where φ denotes the merged signal data sample set of the central control unit, φ_u the u-th merged signal data sample, L_u the number of feature vectors in the u-th merged sample, Id_{k_u} the label in the u-th merged sample (i.e. the label of the k_u-th original signal data sample), f_{k_{u,v}} the v-th feature vector in the u-th merged sample (i.e. the feature vector of the k_{u,v}-th original signal data sample), v ∈ [1, L_u], and φ_{L+1} the data sample formed from the sample to be inserted.
In step 4 one subsystem unit is selected at random as the subsystem unit to be inserted into, and the cluster code of the sample to be inserted is updated as in step 3. The sample of the signal data to be inserted is packaged as:
data_k = {f_k, code_k}, k ∈ [0, n]
where data_k denotes the sample of the k-th target signal data to be inserted, code_k the cluster code of that sample with initial value -1, and n the number of target signal data to be inserted.
The similarity between the split node and the sample to be inserted is computed, and the cluster code of the sample to be inserted is updated according to the similarity. The similarity sim_AB denotes the similarity of the vectors A and B, where A and B are two vectors, A_i and B_i are the i-th dimension values of A and B, n is the dimension of the vectors, and i ∈ [1, n]. The cluster code of the sample to be inserted is updated according to the relation between sim and thresh, where code denotes the cluster code to which the sample belongs, sim the similarity computed between the sample and the split node, and thresh the threshold of the split node.
Step 4 continues until no corresponding split node can be found in the subsystem unit: after each update of the cluster code of the sample to be inserted, the split node set is searched again for a split node with the same code as the sample to be inserted; if one exists, the similarity between the sample to be inserted and that split node is computed and the cluster code of the sample to be inserted is updated; otherwise the descent ends.
The split node set ψ_i of step 4 is:
ψ_i = {n^i_1, n^i_2, ..., n^i_q}, j ∈ [1, q]
where ψ_i denotes the split node set of the i-th subsystem unit, n^i_j the j-th split node in the i-th subsystem unit, M_j the code of that split node, T_j its threshold, and q the number of split nodes of the subsystem unit.
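A Python sketch of the subsystem-side part of step 4, under the same assumptions as the sketches above (cosine similarity, 2M+2 / 2M+3 child codes, dict records). The central-unit side, a binary-search insert into the label-ordered data set, is an ordinary sorted insertion (e.g. via Python's bisect module) and is omitted here.

```python
import math
import random


def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def insert_sample(new_id, new_f, samples, split_nodes, max_cluster_size=10, rng=None):
    """Route the new sample to a leaf cluster, then split that cluster if it grew too large."""
    rng = rng or random.Random(0)
    by_code = {n["M"]: n for n in split_nodes}
    code = -1
    while code in by_code:                                   # step-3 style descent
        node = by_code[code]
        sim = cosine_sim(new_f, node["f"])
        code = 2 * code + 2 if sim >= node["thresh"] else 2 * code + 3
    samples.append({"Id": new_id, "f": new_f, "node": False, "sim": 0.0, "code": code})

    cluster = [s for s in samples if s["code"] == code]
    if len(cluster) <= max_cluster_size:                     # still within the size limit
        return
    pivot = rng.choice(cluster)                              # split the overgrown cluster
    for s in cluster:
        s["sim"] = cosine_sim(s["f"], pivot["f"])
    cluster.sort(key=lambda s: s["sim"])
    thresh = cluster[len(cluster) // 2]["sim"]
    high = [s for s in cluster if s["sim"] >= thresh]
    if not high or len(high) == len(cluster):                # degenerate split: keep as leaf
        return
    pivot["node"] = True
    for s in cluster:
        s["code"] = 2 * code + 2 if s["sim"] >= thresh else 2 * code + 3
    split_nodes.append({"f": pivot["f"], "M": code, "thresh": thresh})
```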
Preferably, in step 5 the central control unit uses binary search on the label to find, in its corresponding data set, the merged data of the sample to be deleted. If merged data with the same label as the sample to be deleted is found in the central control unit's data set, the merged signal data of the label to be deleted is extracted as:
φ_u = {Id_k, f_1, f_2, ..., f_L}, v ∈ [1, L]
where φ_u denotes the merged signal data sample found in the central control unit with the same label as the sample to be deleted, L the number of feature vectors in that merged sample (i.e. the number of features to be deleted), Id_k the label of the sample to be deleted, and f_v the v-th feature vector to be deleted.
In step 5 each feature vector of the data to be deleted is used as the feature vector to be retrieved and is searched in each subsystem unit with the method of step 3 to obtain the top1 result. The feature vector of each signal data item with the label to be deleted is packaged as data to be deleted:
data_k = {Id_k, f_k, code_k}, k ∈ [0, n]
where data_k denotes the k-th data item to be deleted, Id_k the label of the target signal to be deleted, f_k the k-th feature vector of the target signal data to be deleted held by the central control unit, code_k the cluster code of the k-th data item to be deleted with initial value -1, and n the number of target signal data items to be deleted.
Taking the feature vector of each data item to be deleted as the feature vector to be retrieved, retrieval is performed in each subsystem unit to obtain the top1 result, consistently with step 3: the similarity between the feature vector of the data to be deleted and the split node is computed and the cluster code of the data to be deleted is updated, until no split node with the same cluster code as the sample to be deleted can be found in the split node set. The similarity sim_AB denotes the similarity of the vectors A and B, where A and B are two vectors, A_i and B_i are the i-th dimension values of A and B, n is the dimension of the vectors, and i ∈ [1, n]. The cluster code of the sample to be deleted is updated according to the relation between sim and thresh, where code denotes the cluster code to which the sample belongs, sim the similarity computed between the sample and the split node, and thresh the threshold of the split node.
In step 5, if the label of the retrieval result is the same as the label of the sample to be deleted, the retrieved sample is deleted from the subsystem unit: data_k denotes the data to be deleted, and d^i_m denotes the top1 data output by the i-th subsystem unit when data_k is used as the sample to be retrieved, i.e. the sample with serial number m in the i-th subsystem unit; if the label of d^i_m equals Id_k, then d^i_m is deleted.
In step 5 the number of samples remaining in the subsystem unit with the same cluster code as the retrieved result sample is then counted; if that number is zero, the split nodes with the same cluster code as the retrieved result sample are deleted from the split node set of the subsystem unit.
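A Python sketch of the subsystem-side deletion of step 5, under the same assumptions as the sketches above. The final pruning line follows the patent's wording (drop split nodes carrying the now-empty cluster code); exactly which node that is depends on the patent's own, unspecified code assignment, so treat it as illustrative.

```python
import math


def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def delete_sample(delete_id, delete_f, samples, split_nodes):
    """Retrieve the top1 match of one feature vector to be deleted; remove it if the
    label matches, and prune split nodes whose cluster has become empty."""
    by_code = {n["M"]: n for n in split_nodes}
    code = -1
    while code in by_code:                                   # step-3 style descent
        node = by_code[code]
        sim = cosine_sim(delete_f, node["f"])
        code = 2 * code + 2 if sim >= node["thresh"] else 2 * code + 3
    cluster = [s for s in samples if s["code"] == code]
    if not cluster:
        return
    top1 = max(cluster, key=lambda s: cosine_sim(delete_f, s["f"]))
    if top1["Id"] == delete_id:                              # delete only on a label match
        samples.remove(top1)
    if not any(s["code"] == code for s in samples):          # cluster is now empty:
        split_nodes[:] = [n for n in split_nodes if n["M"] != code]
```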
The invention has the following advantages:
Data structure construction: no clustering is needed; samples within a class are selected at random as central nodes and organized into a tree structure, which simplifies the computation, makes the constructed data structure independent of the original data distribution, and allows samples to be added and deleted dynamically.
Fine-grained leaf nodes: a leaf node is not necessarily a single sample but holds a certain number of samples, which can improve retrieval precision.
Dynamic data structure: the data structure built in a subsystem unit is dynamic rather than a fixed model, so samples can be handled more flexibly.
Data in the data structure can be added and deleted: samples can be added to and deleted from a subsystem unit's data structure directly, so the retrieval system can satisfy more requirements.
Drawings
FIG. 1: implementation scenario of the invention.
FIG. 2: central control unit database.
FIG. 3: flow chart of a subsystem unit of the invention.
FIG. 4: subsystem unit data partitioning.
FIG. 5: tree structure of the subsystem unit features.
FIG. 6: sample retrieval flow chart.
FIG. 7: sample insertion flow chart.
FIG. 8: sample deletion flow chart.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are merely illustrative of the invention and are not intended to limit it. In addition, the technical features involved in the embodiments described below may be combined with each other as long as they do not conflict. Parameters that must be set according to the actual situation follow the parameter-setting methods noted above and are not repeated here.
As shown in fig. 1, which is a schematic view of an implementation scenario of the present invention, the method of the present invention includes a central control unit and a plurality of subsystem units, and the central control unit is sequentially connected to the plurality of subsystem units.
The following describes the embodiments of the present invention with reference to fig. 1 to 8:
step 1: the central control unit extracts a feature vector of an original signal, manually marks a label of original signal data, constructs a sample of the original signal data by combining the feature vector of the original signal, sorts the sample of the original signal data according to the label of the original signal data to obtain sorted signal data samples, combines the sorted signal data samples corresponding to the labels in the same sorted signal data samples to obtain a combined signal data sample set, and the form of the combined data sample set is shown in FIG. 2;
fk=F(xk)
k∈[0,n]
wherein x iskIs the number of kth original signalAccording to n is the number of raw signal data, F represents a feature extractor, FkA feature vector of the kth original signal data;
the label of the original signal data in the step 1 is as follows: idk,IdkA tag representing kth original signal data;
datak={Idk,fk}
k∈[0,n]
wherein, the datakA sample representing kth original signal data, n being the number of original signal data;
from the label, i.e. Id, of the original signal datakSequencing from small to large, and obtaining samples of the sequenced signal data as follows:
i∈[1,n]
ki∈[1,n]
wherein,representing the ith sequenced signal data sample, i.e. corresponding to the kthiSamples of the original signal data, n being the number of original signal data,representing the label in the ith sorted signal data sample,representing a feature vector in the ith sorted signal data sample;
if it isAndsame, i ≠ j, then it willAndmerging the signals into the same merged signal data sample set;
in step 1, the combined signal data sample set is:
φ={φ1,φ2,...,φL}
u∈[1,L]
where φ represents a set of combined signal data samples, φuRepresents the u-th combined signal data sample, LuRepresenting the number of feature vectors in the u-th combined signal data sample,indicating the label in the u-th combined signal data sample, i.e. corresponding to the k-thuThe label of the sample of the original signal data,representing the v-th eigenvector in the u-th combined signal data sample, i.e. corresponding to the k-th eigenvectoru,vFeature vector of a sample of raw signal data, v ∈ [1, L ]u];
Step 2: the central control unit allocates a data sample set to each subsystem unit, each subsystem unit recombines the allocated data sample sets, selects a sample set to be split from the recombined sample sets, randomly selects a sample from the sample set to be split as a splitting node, calculates the similarity between all samples in the sample set to be split and the splitting node, sorts the sample sets to be split by using the similarity, selects the similarity of an intermediate sample as a threshold, updates the code of each sample to be split according to the relation between the similarity and the threshold, continuously and repeatedly splits each cluster until the number of samples contained in each cluster is less than a specified number, updates the splitting node and adds the splitting node into the splitting node set, the subsystem unit data structure construction flow chart is shown in figure 3, the set splitting chart is shown in figure 4, the constructed tree data structure is shown in FIG. 5;
u∈[1,c]
wherein phi isiRepresenting the set of data samples assigned by the central control unit to the i-th subsystem element,represents the u-th combined signal data sample, L, in the i-th subsystem elementuRepresenting the number of feature vectors in the u-th combined signal data sample,indicating the label in the u-th combined signal data sample, i.e. corresponding to the k-thuThe label of the sample of the original signal data,representing the v-th eigenvector in the u-th combined signal data sample, i.e. corresponding to the k-th eigenvectoru,vFeature vector of a sample of raw signal data, v ∈ [1, L ]u]C represents the number of sample tags contained in each subsystem unit;
j∈[1,L]
wherein,representing the recombined data sample set of the ith subsystem element,represents the j signal data sample after being recombined in the i subsystem unit, L represents the number of the sample after being recombined in the i subsystem unit, and IdjLabel representing the j-th signal data sample after recombination, fjAnd representing a feature vector of the j-th signal data sample after recombination, wherein a node represents whether the sample is selected as a splitting node or not, the default is false, sim represents the similarity of the sample and the splitting node, code represents the cluster code to which the sample belongs, and the initial value is-1.
j∈[1,l]
wherein,represents the recombined to-be-split sample set of the ith subsystem unit (i.e. code-M data sample set),represents the j (M) th code after being recombined in the i (th) subsystem unit, l represents the number of the (M) th code after being recombined in the i (th) subsystem unit, and sjTo representSimilarity to split nodes.
wherein simABRepresenting the degree of similarity of the vectors a, B,
a and B respectively represent two vectors, Ai,BiRepresents the ith dimension value of the vectors a, B,
n denotes the dimension of the vector, i ∈ [1, n ]
i∈[1,N]
kj∈[1,l]
wherein, the first and second guide rollers are arranged in a row,represents the jth sample sorted according to sim, i represents the ith subsystem element, i.e., corresponds toMiddle (k) thjA sample, < i > isThe number of the medium samples is the same as the number of the medium samples,a label representing the ith sorted sample,representing the feature vector of the ith sorted sample.
wherein,indicating the degree of similarity ordering of the split sample setsThe similarity of the samples is used as a threshold value of the splitting node;
wherein code represents the cluster code to which the sample belongs,
sim represents the similarity of the sample to the split node calculation,
thresh represents the threshold for the split node
after updating the cluster codes of the samples to be retrieved each time, counting the number of the updated cluster samples in the sample set, if the number exceeds the specified number, randomly selecting one sample from the new cluster as a split node, calculating the similarity between all samples in the new cluster and the new split node, and updating the cluster codes of all samples in the new cluster; otherwise, ending.
wherein,the splitting node with the number of M in the ith subsystem unit is represented, M represents the region code to which the splitting node belongs, and thresh represents the threshold value of the splitting node;
j∈[1,q]
wherein psiiRepresenting the ith set of subsystem unit split nodes,denotes the jth split node, M, in the ith subsystem elementjCode representing the split node, TjRepresenting the split node threshold and q representing the number of split nodes for the subsystem unit.
And step 3: extracting a characteristic vector of a target signal to be retrieved from a central control unit, packaging the characteristic vector into a form containing the characteristic vector and a cluster code, distributing the characteristic vector and the cluster code to each subsystem unit, finding out a node with the cluster code of-1 from a split set of each subsystem unit, calculating the similarity between the split node and a sample to be retrieved, updating the sample code to be retrieved according to the similarity, repeating the steps until the split node which is the same as the sample to be retrieved cannot be found in the subsystem unit, taking out all samples which are consistent with the cluster code of the sample to be retrieved in the subsystem unit at the moment, sequencing the similarity of all the taken samples, and selecting m samples with the maximum similarity as a retrieval result of the subsystem unit and uploading the retrieval result to the central control unit; the central control unit merges the retrieval results uploaded by each subsystem unit, sorts the retrieval results according to the similarity to obtain a merged set, eliminates samples with consistent labels, selects k samples with the maximum similarity as final output, namely topK, and a sample retrieval flow chart is shown in fig. 6.
And 3, extracting the characteristic vector of the target signal to be retrieved:
fk=F(xk)
k∈[0,n]
wherein x iskIs the kth target signal data to be retrieved, n is the number of target signal data to be retrieved, F represents the feature extractor, FkCharacteristic vectors of kth target signal data to be retrieved;
step 3, the information is packaged into a form containing the characteristic vector and the cluster code, the information is distributed to each subsystem unit, and the node with the cluster code of-1 is found out from the split set of each subsystem unit and begins to be:
Idk,Idka tag representing kth original signal data;
and step 3, the sample of the target signal data to be retrieved is as follows:
datak={fk,codek=-1}
k∈[0,n]
wherein, the datakSamples, codes, representing the kth target signal data to be retrievedkRepresenting the clustering code of the sample of the kth target signal data to be retrieved, wherein the initial value is-1, and n is the number of the target signal data to be retrieved;
and 3, calculating the similarity between the split node and the sample to be retrieved, and updating the cluster code of the sample to be retrieved according to the similarity:
and step 3, the similarity calculation formula is as follows:
wherein simABRepresenting the degree of similarity of the vectors a, B,
a and B respectively represent two vectors, Ai,BiRepresents the ith dimension value of the vectors a, B,
n denotes the dimension of the vector, i ∈ [1, n ]
And step 3, updating the cluster codes of the samples to be retrieved according to the similarity as follows:
wherein code represents the cluster code to which the sample belongs,
sim represents the similarity of the sample to the split node calculation,
thresh represents the threshold for the split node
And 3, repeating the steps until the same split node as the sample to be retrieved cannot be found in the subsystem unit.
Updating the cluster codes of the samples to be retrieved each time, searching whether the split samples identical to the cluster codes of the current samples to be retrieved exist in the split sample set again, if so, calculating the similarity between the samples to be retrieved and the new split nodes, and updating the cluster codes of the samples to be retrieved; otherwise, ending.
And 3, taking out all samples which are in the same clustering code with the sample to be retrieved in the subsystem unit at the moment.
Traversing the codes of all samples in the sample set of the subsystem unit, and taking all cluster coded samples which are the same as the samples to be retrieved as a set, wherein the form is as follows:
j∈[1,l]
wherein,a sample set representing code M in the ith subsystem unit,a sample representing the jth code ═ M, l isNumber of medium samples.
And 3, sequencing all samples consistent with the cluster codes of the samples to be retrieved in the subsystem unit according to the similarity, wherein the sequenced sample set is as follows:
kj∈[1,l]
wherein,represents the jth sample sorted according to sim, i represents the ith subsystem element, i.e., corresponds toMiddle (k) thjA sample, < i > isThe number of the medium samples is the same as the number of the medium samples,a label representing the ith sorted sample,representing the feature vector of the ith sorted sample.
And 3, the central control unit combines the retrieval results uploaded by each subsystem unit to obtain a combined set, wherein the combined set is as follows:
kj∈[1,m×N]
wherein phitarAnd representing the data set after the subsystem units are combined and sorted according to sim.
Step 3, removing samples with consistent labels:
Step 3, selecting the k samples with the maximum similarity as the final output:
i∈[1,k]
wherein: outtarA search set representing the output is generated,and the ith sample with the maximum similarity to the sample to be retrieved in the retrieval result is represented.
And 4, step 4: the central control unit extracts a characteristic vector of original information of a sample to be inserted, manually marks a label of the sample to be inserted, and inserts the sample to be inserted into a data set corresponding to the central control unit according to the label sequence in the data set of the central control unit by adopting a binary search method for the label attribute; and (3) randomly selecting a subsystem unit as a subsystem unit to be inserted, and updating the cluster code of the sample to be inserted according to the step (3) until the corresponding split node cannot be found in the subsystem unit. Traversing the number of all samples in the sub-system unit to be inserted, which are the same as the cluster code of the sample to be inserted, if the number is greater than a specified threshold value, selecting one sample from the cluster as a split node, calculating the similarity between all samples in the cluster and the split node, updating the cluster code of all nodes in the cluster, adding the split node into a split node set, and inserting the samples as shown in fig. 7.
And 4, extracting the characteristic vector of the original information of the sample to be inserted:
fk=F(xk)
k∈[0,n]
wherein x iskIs the kth original signal data to be inserted, n is the number of original signal data to be inserted, F represents the feature extractor, FkA characteristic vector of the kth original signal data to be inserted is obtained;
step 4, the label to be inserted with the original signal data is: idk,IdkA tag representing a kth original signal data to be inserted;
step 4, the samples of the original signal data to be inserted are:
datak={Idk,fk}
k∈[0,n]
wherein, the datakRepresenting a sample of kth original signal data to be inserted, wherein n is the number of the original signal data to be inserted;
step 4, inserting the sample to be inserted into the data set (as shown in fig. 2) corresponding to the central control unit according to the label sequence by adopting a binary search method for the label attribute in the data set of the central control unit:
if the sample label to be inserted is found in the data set of the central control unit, inserting the characteristic vector of the sample to be inserted into the characteristic vector with the same label in the data set, wherein the form of the data set after insertion is as follows:
φ={φ1,φ2,...,φL}
u∈[1,L]
wherein phi represents a signal data sample set after the central control unit is combined, phiuRepresenting the combined signal data samples, L, identical to the sample labels to be inserteduIs shown andthe number of feature vectors in the combined signal data samples to be inserted with the same sample label,indicating the same label in the combined signal data sample as the sample label to be inserted, i.e. corresponding to the kthuThe label of the sample of the original signal data,representing the v-th eigenvector in the combined signal data sample identical to the sample label to be inserted, i.e. corresponding to the k-th eigenvectoru,vFeature vector of a sample of raw signal data, v ∈ [1, L ]u];Representing the feature vector of the sample to be inserted.
If the to-be-inserted sample label cannot be found in the central control unit data set, the position of the to-be-inserted sample is found according to the binary search, the to-be-inserted sample is inserted into the central data set, and the inserted data set is as follows:
φ={φ1,φ2,...,φL,φL+1}
u∈[1,L+1]
wherein phi represents a signal data sample set after the central control unit is combined, phiuRepresents the u-th combined signal data sample, LuRepresenting the number of feature vectors in the u-th combined signal data sample,indicating the label in the u-th combined signal data sample, i.e. corresponding to the k-thuThe label of the sample of the original signal data,denotes the u-th sumAnd the v-th feature vector in the post-signal data sample corresponds to the k-th feature vectoru,vFeature vector of a sample of raw signal data, v ∈ [1, L ]u];φL+1Representing the data sample to be inserted into the sample.
And 4, randomly selecting one subsystem unit as the subsystem unit to be inserted, and updating the cluster code of the sample to be inserted according to the step 3.
Encapsulating samples of signal data to be inserted as:
datak={fk,codek}
k∈[0,n]
wherein, the datakSamples, codes, representing the kth target signal data to be insertedkRepresenting the clustering code of the sample of the kth target signal data to be inserted, wherein the initial value is-1, and n is the number of the target signal data to be inserted;
calculating the similarity between the splitting node and the sample to be inserted, and updating the cluster code of the sample to be inserted according to the similarity:
the similarity calculation formula is as follows:
wherein simABRepresenting the degree of similarity of the vectors a, B,
a and B respectively represent two vectors, Ai,BiRepresents the ith dimension value of the vectors a, B,
n denotes the dimension of the vector, i ∈ [1, n ]
And updating the cluster code of the sample to be inserted according to the similarity, wherein the cluster code updating formula is as follows:
wherein code represents the cluster code to which the sample belongs,
sim represents the similarity of the sample to the split node calculation,
thresh represents the threshold for the split node
And 4, until the corresponding split node cannot be found in the subsystem unit.
The description is as follows: updating the cluster codes of the samples to be inserted each time, searching whether the split nodes identical to the cluster codes of the samples to be inserted exist in the split sample set again, if so, calculating the similarity between the samples to be inserted and the new split nodes, and updating the cluster codes of the samples to be inserted; otherwise, ending.
Step 4, splitting the node set psiiComprises the following steps:
j∈[1,q]
wherein psiiRepresenting the ith set of subsystem unit split nodes,denotes the jth split node, M, in the ith subsystem elementjCode representing the split node, TjRepresenting the split node threshold and q representing the number of split nodes for the subsystem unit.
And 5: manually giving a label of a sample to be deleted, and searching data obtained by combining the sample to be deleted in a data set corresponding to a central control unit by the central control unit through a binary search method for the label; if the search result is zero, deleting the split nodes which are in the same cluster code as the search result sample in the split node set of the subsystem unit, wherein the sample deletion is as shown in FIG. 8.
Step 5, searching the data obtained by combining the sample to be deleted in the data set corresponding to the central control unit by the central control unit through a binary search method for the label, and if the combined data with the same label as the sample to be deleted is found in the data set of the central control unit, extracting the signal data obtained by combining the label to be deleted as follows:
u∈[1,L]
wherein phi isuIndicating that a combined signal data sample set which is the same as the sample label to be deleted is found in the central control unit, wherein L indicates the number of characteristic vectors (namely the number of characteristics to be deleted) in combined signal data samples which are the same as the sample label to be deleted in the central control unit, and IdkIndicating that the sample label is to be deleted,representing the v-th feature vector to be deleted.
Step 5, searching each feature vector in each subsystem unit by using the feature vector of each data to be deleted as the feature vector to be searched by the method in step 3 to obtain top 1;
packaging the characteristic vector of each label signal data to be deleted into data to be deleted:
datak={Idk fk,codek}
k∈[0,n]
wherein, the datakDenotes the kth data to be deleted, IdkA label indicating a target signal to be deleted, fkK-th eigenvector, code, representing target signal data to be deleted by the central control unitkIndicating a target message to be deletedClustering coding of kth data of the number data, wherein the initial value is-1, and n is the number of target signal data to be deleted;
and (3) taking the feature vector of each data to be deleted as the feature vector to be retrieved, retrieving in each subsystem unit to obtain top1, wherein the method is consistent with the step 3, namely, the similarity between the feature vector of the data to be deleted and the splitting node is calculated, and the data code to be deleted is updated until the splitting node which is the same as the sample cluster code to be deleted cannot be found in the splitting node set.
Wherein the similarity calculation formula is as follows:
wherein simABRepresenting the degree of similarity of the vectors a, B,
a and B respectively represent two vectors, Ai,BiRepresents the ith dimension value of the vectors a, B,
n denotes the dimension of the vector, i ∈ [1, n ]
Updating the cluster code of the sample to be deleted, wherein the cluster code updating formula is as follows:
wherein code represents the cluster code to which the sample belongs,
sim represents the similarity of the sample to the split node calculation,
thresh represents the threshold for the split node
Step 5, if the label of the retrieval result is the same as the label of the sample to be deleted, deleting the retrieval result sample in the subsystem unit is as follows:
datakindicating that the data is to be deleted,representation datakTop1 data output in the ith subsystem element as a sample to be retrieved,the serial number in the ith subsystem unit is m, ifThen delete it
In step 5, the number of samples in the subsystem unit whose cluster code equals that of the retrieval-result sample is counted; if this number is zero, the split nodes with that cluster code are deleted from the split-node set of the subsystem unit, as follows:
data_k denotes the datum to be deleted; in the split-node set of the i-th subsystem unit, the split nodes whose cluster code equals that of data_k are located; the number of samples in the i-th subsystem unit whose cluster code equals that of data_k is counted, and if this number is zero, those split nodes are deleted.
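A hedged sketch of this clean-up step follows: after removing the matched sample, any split node whose cluster no longer contains samples is pruned. The sample and split-node containers are illustrative assumptions, not the patent's data structures.

```python
def delete_and_prune(samples, split_nodes, target_label, target_code):
    """Remove samples matching the retrieval result, then prune empty clusters.

    samples: list of dicts with 'label' and 'code' keys (illustrative layout).
    split_nodes: dict mapping a cluster code to its split-node record.
    """
    # Delete every sample whose label and cluster code match the retrieval result.
    samples[:] = [s for s in samples
                  if not (s["label"] == target_label and s["code"] == target_code)]

    # If the cluster is now empty, drop the split node carrying the same code.
    remaining = sum(1 for s in samples if s["code"] == target_code)
    if remaining == 0:
        split_nodes.pop(target_code, None)
    return samples, split_nodes
```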
It should be understood that parts of the application not described in detail are prior art.
It should be understood that the above description of the preferred embodiments is given for clearness of understanding and no unnecessary limitations should be understood therefrom, and all changes and modifications may be made by those skilled in the art without departing from the scope of the invention as defined by the appended claims.
Claims (6)
1. A retrieval method based on a rapid retrieval system facing a massive vector library, characterized in that:
the rapid retrieval system facing the massive vector library comprises: a central control unit and a plurality of subsystem units;
the central control unit is sequentially connected with the plurality of subsystem units;
the retrieval method comprises the following steps:
Step 1: the central control unit extracts a feature vector of an original signal, manually marks a label of the original signal data, constructs a sample of the original signal data by combining the feature vector of the original signal, sorts the samples of the original signal data according to their labels to obtain sorted signal data samples, and merges sorted signal data samples having the same label to obtain a combined signal data sample set;
Step 2: the central control unit distributes a data sample set to each subsystem unit; each subsystem unit recombines the distributed data sample set, selects a sample set to be split from the recombined sample set, randomly selects a sample from the sample set to be split as a split node, calculates the similarity between every sample in the sample set to be split and the split node, sorts the sample set to be split by similarity, selects the similarity of the middle sample as a threshold, updates the code of each sample to be split according to the relation between its similarity and the threshold, and repeatedly splits each cluster, updating the split node and adding it to the split-node set, until the number of samples contained in every cluster is smaller than a specified number;
Step 3: the central control unit extracts a feature vector of a target signal to be retrieved, packages it into a form containing the feature vector and a cluster code, and distributes it to each subsystem unit; each subsystem unit finds the node with cluster code -1 in its split-node set, calculates the similarity between the split node and the sample to be retrieved, and updates the code of the sample to be retrieved according to the similarity; this is repeated until no split node matching the sample to be retrieved can be found in the subsystem unit; at that point all samples in the subsystem unit whose cluster code matches that of the sample to be retrieved are taken out, the taken samples are sorted by similarity, and the m samples with the largest similarity are selected as the retrieval result of the subsystem unit and uploaded to the central control unit; the central control unit merges the retrieval results uploaded by the subsystem units, sorts them by similarity to obtain a merged set, removes samples with identical labels, and selects the k samples with the largest similarity as the final output, namely topK;
Step 4: the central control unit extracts a feature vector of the original information of a sample to be inserted and manually marks a label of the sample to be inserted, and the sample to be inserted is inserted into the data set corresponding to the central control unit in label order, using binary search on the label attribute over the data set of the central control unit; a subsystem unit is randomly selected as the subsystem unit to be inserted, and the cluster code of the sample to be inserted is updated according to step 3 until no corresponding split node can be found in the subsystem unit; the number of samples in the subsystem unit to be inserted whose cluster code equals that of the sample to be inserted is counted, and if the number is larger than a specified threshold, one sample is selected from the cluster as a split node, the similarity between every sample in the cluster and the split node is calculated, the cluster codes of all nodes in the cluster are updated, and the split node is added to the split-node set;
Step 5: a label of the sample to be deleted is given manually, and the central control unit searches, by binary search on the label, for the combined data of the sample to be deleted in the data set corresponding to the central control unit; if the search result is zero, the split nodes with the same cluster code as the retrieval-result sample are deleted from the split-node set of the subsystem unit.
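For orientation only, the sketch below mirrors the division of labour described in claim 1 between a central control unit and several subsystem units; the class and method names, the dict-based sample layout, and the cosine similarity are assumptions, and the per-shard search shown here is a naive stand-in for the split-node descent sketched after claim 4.

```python
import math

def cosine(a, b):
    """Assumed similarity measure; the claim only names 'similarity'."""
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = math.sqrt(sum(x * x for x in a)), math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class SubsystemUnit:
    """Holds one shard of the vector library plus its split-node set."""

    def __init__(self, samples):
        self.samples = samples        # list of {'label', 'feature', 'code'} dicts
        self.split_nodes = {}         # cluster code -> (split vector, threshold)

    def search_top_m(self, feature, m):
        # Naive stand-in: score every sample; the claimed method instead
        # descends the split nodes first (see the sketch after claim 4).
        for s in self.samples:
            s["sim"] = cosine(feature, s["feature"])
        return sorted(self.samples, key=lambda s: s["sim"], reverse=True)[:m]

class CentralControlUnit:
    """Keeps the label-sorted data set and merges the shard results."""

    def __init__(self, subsystems):
        self.subsystems = subsystems

    def retrieve_top_k(self, feature, k, m):
        partial = []
        for unit in self.subsystems:
            partial.extend(unit.search_top_m(feature, m))
        partial.sort(key=lambda s: s["sim"], reverse=True)
        seen, merged = set(), []
        for s in partial:                      # drop samples with duplicate labels
            if s["label"] not in seen:
                seen.add(s["label"])
                merged.append(s)
        return merged[:k]                      # topK

# Toy usage with two single-sample shards.
units = [SubsystemUnit([{"label": 1, "feature": [0.9, 0.1], "code": -1}]),
         SubsystemUnit([{"label": 2, "feature": [0.1, 0.9], "code": -1}])]
print(CentralControlUnit(units).retrieve_top_k([1.0, 0.0], k=1, m=1))
```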
2. The retrieval method based on the rapid retrieval system facing the massive vector library according to claim 1, characterized in that:
step 1, extracting the feature vector of the original signal is as follows:
f_k = F(x_k), k ∈ [0, n]
where x_k is the k-th original signal data, n is the number of original signal data, F denotes the feature extractor, and f_k is the feature vector of the k-th original signal data;
the label of the original signal data in step 1 is Id_k, where Id_k denotes the label of the k-th original signal data;
in step 1, the samples of the original signal data are:
data_k = {Id_k, f_k}, k ∈ [0, n]
where data_k denotes the sample of the k-th original signal data and n is the number of original signal data;
in step 1, the samples of the original signal data are sorted by their labels Id_k from small to large, the sorted signal data samples satisfying i ∈ [1, n], k_i ∈ [1, n],
where the i-th sorted signal data sample corresponds to the k_i-th sample of the original signal data, n is the number of original signal data, and the label and the feature vector of the i-th sorted signal data sample are those of that original sample;
in step 1, sorted signal data samples having the same label are merged: if the label of the i-th sorted sample equals the label of the j-th sorted sample, i ≠ j, the two samples are merged into the same combined signal data sample set;
in step 1, the combined signal data sample set is:
φ = {φ_1, φ_2, ..., φ_L}, u ∈ [1, L]
where φ denotes the set of combined signal data samples, φ_u denotes the u-th combined signal data sample, L_u denotes the number of feature vectors in the u-th combined signal data sample, the label of the u-th combined signal data sample corresponds to the label of the k_u-th original signal data sample, the v-th feature vector of the u-th combined signal data sample corresponds to the feature vector of the k_{u,v}-th original signal data sample, and v ∈ [1, L_u].
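The following sketch shows one way the step 1 procedure of claim 2 could be carried out in code: features are paired with labels, sorted by label, and samples sharing a label are merged into one combined record. The feature extractor and the record layout are illustrative assumptions.

```python
from itertools import groupby
from operator import itemgetter

def build_combined_set(signals, labels, feature_extractor):
    """Build the label-sorted, label-merged sample set of step 1.

    signals: raw signal data x_k; labels: manually assigned Id_k.
    Returns a list of (label, [feature vectors]) tuples sorted by label.
    """
    samples = [(lab, feature_extractor(sig)) for sig, lab in zip(signals, labels)]
    samples.sort(key=itemgetter(0))                          # sort by label Id_k
    combined = []
    for lab, group in groupby(samples, key=itemgetter(0)):
        combined.append((lab, [feat for _, feat in group]))  # merge same-label samples
    return combined

# Toy usage with a stand-in feature extractor.
def extract(sig):
    return [float(v) for v in sig]

print(build_combined_set([[1, 0], [0, 1], [1, 1]], [2, 1, 2], extract))
```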
3. The retrieval method based on the rapid retrieval system facing the massive vector library according to claim 1, characterized in that:
in step 2, the central control unit allocates a data sample set to each subsystem unit, u ∈ [1, c],
where φ_i denotes the data sample set allocated by the central control unit to the i-th subsystem unit, its u-th combined signal data sample contains L_u feature vectors, the label of that combined sample corresponds to the label of the k_u-th original signal data sample, its v-th feature vector corresponds to the feature vector of the k_{u,v}-th original signal data sample, v ∈ [1, L_u], and c denotes the number of sample labels contained in each subsystem unit;
in step 2, each subsystem unit recombines the allocated data sample set, j ∈ [1, L],
where the recombined data sample set of the i-th subsystem unit contains L samples, and the j-th recombined signal data sample carries a label Id_j, a feature vector f_j, a flag node indicating whether the sample has been selected as a split node (default false), a value sim denoting the similarity between the sample and the split node, and the cluster code code to which the sample belongs, with an initial value of -1;
in step 2, a sample set to be split is selected from the recombined sample set, j ∈ [1, l],
where the set to be split of the i-th subsystem unit is the set of recombined samples whose code equals M, its j-th element is the j-th recombined sample with code = M in the i-th subsystem unit, l denotes the number of recombined samples with code = M in the i-th subsystem unit, and s_j denotes the similarity of that sample to the split node;
in step 2, the similarity between a sample to be split and the split node is calculated as sim_AB, where A and B denote the two vectors, A_i and B_i denote the i-th dimension values of A and B, n denotes the dimension of the vectors, and i ∈ [1, n];
in step 2, all samples in the sample set to be split are sorted by similarity, the sorted sample set satisfying i ∈ [1, N], k_j ∈ [1, l],
where the j-th sample sorted by sim in the i-th subsystem unit corresponds to the k_j-th sample of the set to be split, l is the number of samples in that set, and each sorted sample carries its own label and feature vector;
in step 2, the similarity of the middle sample is selected as the threshold: the similarity of the sample in the middle position of the similarity-sorted sample set to be split is taken as the threshold of the split node;
in step 2, the code of each sample to be split is updated according to the relation between its similarity and the threshold, where code denotes the cluster code to which the sample belongs, sim denotes the similarity computed between the sample and the split node, and thresh denotes the threshold of the split node;
in step 2, the splitting of each cluster is repeated until the number of samples contained in every cluster is smaller than the specified number, specifically: after the cluster codes of the samples are updated each time, the number of samples in the updated cluster is counted; if it exceeds the specified number, one sample is randomly selected from the new cluster as a split node, the similarity between every sample in the new cluster and the new split node is calculated, and the cluster codes of all samples in the new cluster are updated; otherwise, the process ends;
in step 2, each split node is updated in the following form: the split node numbered M in the i-th subsystem unit carries the region code M to which it belongs and the threshold thresh of the split node;
in step 2, the split-node set ψ_i is the set of split nodes of the i-th subsystem unit, j ∈ [1, q].
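A hedged sketch of the cluster-splitting loop of claim 3 follows. Cosine similarity, the child-code numbering, the tie guard, and the container layout are assumptions standing in for formulas not reproduced in this text; the middle similarity of the sorted cluster is used as the split threshold, as the claim describes.

```python
import math
import random

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = math.sqrt(sum(x * x for x in a)), math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def split_cluster(samples, split_nodes, code, max_size):
    """Recursively split cluster `code` until it holds fewer than max_size samples.

    samples: list of {'feature', 'code'} dicts; split_nodes: code -> (vector, thresh).
    """
    cluster = [s for s in samples if s["code"] == code]
    if len(cluster) < max_size:
        return
    pivot = random.choice(cluster)["feature"]            # random sample as split node
    sims = sorted(cosine(pivot, s["feature"]) for s in cluster)
    thresh = sims[len(sims) // 2]                        # middle similarity as threshold
    if sims[0] == thresh:
        return                                           # too many ties to split on this pivot
    split_nodes[code] = (pivot, thresh)
    left, right = 2 * code + 2, 2 * code + 3             # hypothetical child codes
    for s in cluster:
        s["code"] = left if cosine(pivot, s["feature"]) >= thresh else right
    split_cluster(samples, split_nodes, left, max_size)
    split_cluster(samples, split_nodes, right, max_size)

# Build the split-node set of one subsystem unit, starting from cluster code -1.
data = [{"feature": [random.random(), random.random()], "code": -1} for _ in range(20)]
nodes = {}
split_cluster(data, nodes, -1, max_size=4)
```

The recursion stops either when a cluster is small enough or when the chosen pivot cannot separate it, which keeps the sketch terminating on duplicate feature vectors.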
4. The retrieval method based on the rapid retrieval system facing the massive vector library according to claim 1, characterized in that:
in step 3, the feature vector of the target signal to be retrieved is extracted as:
f_k = F(x_k), k ∈ [0, n]
where x_k is the k-th target signal data to be retrieved, n is the number of target signal data to be retrieved, F denotes the feature extractor, and f_k is the feature vector of the k-th target signal data to be retrieved;
in step 3, the feature vector is packaged into a form containing the feature vector and a cluster code and distributed to each subsystem unit, and the node with cluster code -1 is found in the split-node set of each subsystem unit as the starting point: Id_k, where Id_k denotes the label of the k-th original signal data;
in step 3, the sample of the target signal data to be retrieved is:
data_k = {f_k, code_k = -1}, k ∈ [0, n]
where data_k denotes the sample of the k-th target signal data to be retrieved, code_k denotes the cluster code of that sample with an initial value of -1, and n is the number of target signal data to be retrieved;
in step 3, the similarity between the split node and the sample to be retrieved is calculated and the cluster code of the sample to be retrieved is updated according to the similarity; the similarity sim_AB between the two vectors A and B is calculated from their components A_i and B_i, where n denotes the dimension of the vectors and i ∈ [1, n]; the cluster code of the sample to be retrieved is updated according to the similarity, where code denotes the cluster code to which the sample belongs, sim denotes the similarity computed between the sample and the split node, and thresh denotes the threshold of the split node;
this is repeated until no split node with the same cluster code as the sample to be retrieved can be found in the subsystem unit: each time the cluster code of the sample to be retrieved is updated, the split-node set is searched again for a split node with the same cluster code as the current sample to be retrieved; if one exists, the similarity between the sample to be retrieved and the new split node is calculated and the cluster code of the sample to be retrieved is updated; otherwise, the process ends;
in step 3, all samples in the subsystem unit whose cluster code matches that of the sample to be retrieved at this moment are taken out: the codes of all samples in the sample set of the subsystem unit are traversed, and all samples with the same cluster code as the sample to be retrieved form a set, j ∈ [1, l],
where that set is the set of samples with code = M in the i-th subsystem unit, its j-th element is the j-th sample with code = M, and l is the number of samples in the set;
in step 3, all samples in the subsystem unit whose cluster code matches that of the sample to be retrieved are sorted by similarity, the sorted sample set satisfying k_j ∈ [1, l],
where the j-th sample sorted by sim in the i-th subsystem unit corresponds to the k_j-th sample of that set, l is the number of samples in the set, and each sorted sample carries its own label and feature vector;
in step 3, the central control unit merges the retrieval results uploaded by each subsystem unit to obtain a merged set, k_j ∈ [1, m×N],
where φ_tar denotes the data set obtained by merging the results of all subsystem units and sorting them by sim;
in step 3, samples with identical labels are removed from the merged set;
in step 3, the k samples with the largest similarity are selected as the final output, i ∈ [1, k].
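The per-subsystem search of claim 4 could look like the sketch below, which completes the naive stand-in used in the earlier skeleton: descend the split nodes to a leaf cluster code, collect the samples of that cluster, sort them by similarity, and return the m best. The cosine similarity and child-code numbering remain assumptions.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = math.sqrt(sum(x * x for x in a)), math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def search_subsystem_top_m(feature, samples, split_nodes, m):
    """Return the m most similar samples of the cluster the query falls into.

    samples: list of {'label', 'feature', 'code'} dicts.
    split_nodes: cluster code -> (split vector, threshold).
    """
    code = -1
    while code in split_nodes:                       # descend to a leaf cluster
        vec, thresh = split_nodes[code]
        code = 2 * code + 2 if cosine(feature, vec) >= thresh else 2 * code + 3
    cluster = [dict(s, sim=cosine(feature, s["feature"]))
               for s in samples if s["code"] == code]
    cluster.sort(key=lambda s: s["sim"], reverse=True)
    return cluster[:m]                               # uploaded to the central unit
```

The central control unit would then merge these per-shard lists, drop duplicate labels, and keep the k most similar entries, as in the skeleton after claim 1.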
5. The retrieval method based on the rapid retrieval system facing the massive vector library according to claim 1, characterized in that:
in step 4, the feature vector of the original information of the sample to be inserted is extracted as:
f_k = F(x_k), k ∈ [0, n]
where x_k is the k-th original signal data to be inserted, n is the number of original signal data to be inserted, F denotes the feature extractor, and f_k is the feature vector of the k-th original signal data to be inserted;
in step 4, the label of the original signal data to be inserted is Id_k, where Id_k denotes the label of the k-th original signal data to be inserted;
in step 4, the samples of the original signal data to be inserted are:
data_k = {Id_k, f_k}, k ∈ [0, n]
where data_k denotes the sample of the k-th original signal data to be inserted and n is the number of original signal data to be inserted;
in step 4, the sample to be inserted is inserted into the data set corresponding to the central control unit (as shown in FIG. 2) in label order, using binary search on the label attribute over the data set of the central control unit:
if the label of the sample to be inserted is found in the data set of the central control unit, the feature vector of the sample to be inserted is appended to the feature vectors with the same label in the data set, and the data set after insertion has the form:
φ = {φ_1, φ_2, ..., φ_L}, u ∈ [1, L]
where φ denotes the combined signal data sample set of the central control unit, φ_u denotes the combined signal data sample with the same label as the sample to be inserted, L_u denotes the number of feature vectors in that combined sample, its label corresponds to the label of the k_u-th original signal data sample, its v-th feature vector corresponds to the feature vector of the k_{u,v}-th original signal data sample, v ∈ [1, L_u], and the feature vector of the sample to be inserted is appended to it;
if the label of the sample to be inserted cannot be found in the data set of the central control unit, the insertion position is located by binary search, the sample to be inserted is inserted into the central data set, and the data set after insertion is:
φ = {φ_1, φ_2, ..., φ_L, φ_{L+1}}, u ∈ [1, L+1]
where φ denotes the combined signal data sample set of the central control unit, φ_u denotes the u-th combined signal data sample, L_u denotes the number of feature vectors in the u-th combined signal data sample, its label corresponds to the label of the k_u-th original signal data sample, its v-th feature vector corresponds to the feature vector of the k_{u,v}-th original signal data sample, v ∈ [1, L_u], and φ_{L+1} denotes the data sample of the sample to be inserted;
in step 4, a subsystem unit is randomly selected as the subsystem unit to be inserted, and the cluster code of the sample to be inserted is updated according to step 3;
the samples of signal data to be inserted are encapsulated as:
data_k = {f_k, code_k}, k ∈ [0, n]
where data_k denotes the sample of the k-th target signal data to be inserted, code_k denotes the cluster code of that sample with an initial value of -1, and n is the number of target signal data to be inserted;
the similarity between the split node and the sample to be inserted is calculated and the cluster code of the sample to be inserted is updated according to the similarity; the similarity sim_AB between the two vectors A and B is calculated from their components A_i and B_i, where n denotes the dimension of the vectors and i ∈ [1, n]; the cluster code of the sample to be inserted is updated according to the similarity, where code denotes the cluster code to which the sample belongs, sim denotes the similarity computed between the sample and the split node, and thresh denotes the threshold of the split node;
in step 4, this continues until no corresponding split node can be found in the subsystem unit, namely: each time the cluster code of the sample to be inserted is updated, the split-node set is searched again for a split node with the same cluster code as the sample to be inserted; if one exists, the similarity between the sample to be inserted and the new split node is calculated and the cluster code of the sample to be inserted is updated; otherwise, the process ends;
in step 4, the split-node set ψ_i is the set of split nodes of the i-th subsystem unit, j ∈ [1, q].
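An illustrative sketch of the insertion path of claim 5: binary-insert the new sample into the label-sorted central set, descend the split nodes of one randomly chosen subsystem unit to assign its cluster code, and re-split the cluster if it has grown past the size limit. The helpers `cosine` and `split_cluster` refer to the earlier sketches; the data layouts and child-code numbering are assumptions.

```python
from bisect import bisect_left

def insert_sample(central_set, unit_samples, split_nodes, label, feature,
                  max_cluster_size, split_cluster, cosine):
    """Insert one labelled feature vector into the central set and one shard.

    `cosine` and `split_cluster` are the helpers from the earlier sketches;
    `unit_samples` and `split_nodes` belong to one randomly chosen subsystem unit.
    """
    # Central control unit: binary search over the label-sorted combined set.
    labels = [entry[0] for entry in central_set]
    pos = bisect_left(labels, label)
    if pos < len(labels) and labels[pos] == label:
        central_set[pos][1].append(feature)              # label exists: append feature
    else:
        central_set.insert(pos, (label, [feature]))      # binary insertion by label

    # Subsystem unit: descend the split nodes to assign a cluster code.
    code = -1
    while code in split_nodes:
        vec, thresh = split_nodes[code]
        code = 2 * code + 2 if cosine(feature, vec) >= thresh else 2 * code + 3
    unit_samples.append({"label": label, "feature": feature, "code": code})

    # Re-split the cluster if it has grown past the allowed size.
    if sum(1 for s in unit_samples if s["code"] == code) > max_cluster_size:
        split_cluster(unit_samples, split_nodes, code, max_cluster_size)
```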
6. The retrieval method based on the rapid retrieval system facing the massive vector library according to claim 1, characterized in that:
in step 5, the central control unit searches, by binary search on the label, for the combined data of the sample to be deleted in its corresponding data set; if combined data with the same label as the sample to be deleted is found in the data set of the central control unit, the combined signal data of the label to be deleted is extracted, u ∈ [1, L],
where φ_u denotes the combined signal data sample found in the central control unit whose label equals the label of the sample to be deleted, L denotes the number of feature vectors in that combined sample (i.e. the number of features to be deleted), Id_k denotes the label of the sample to be deleted, and the v-th feature vector to be deleted is taken from that combined sample;
in step 5, each feature vector to be deleted is used as the feature vector to be retrieved and is searched in each subsystem unit by the method of step 3 to obtain the top1 result;
the feature vector of each signal datum of the label to be deleted is packaged as data to be deleted:
data_k = {Id_k, f_k, code_k}, k ∈ [0, n]
where data_k denotes the k-th datum to be deleted, Id_k denotes the label of the target signal to be deleted, f_k denotes the k-th feature vector of the target signal data to be deleted in the central control unit, code_k denotes the cluster code of the k-th datum of the target signal data to be deleted with an initial value of -1, and n is the number of target signal data to be deleted;
the feature vector of each datum to be deleted is used as the feature vector to be retrieved, and top1 retrieval is performed in each subsystem unit in the same way as step 3, namely the similarity between the feature vector of the datum to be deleted and the split node is calculated and the code of the datum to be deleted is updated, until no split node with the same cluster code as the sample to be deleted can be found in the split-node set;
the similarity sim_AB between the two vectors A and B is calculated from their components A_i and B_i, where n denotes the dimension of the vectors and i ∈ [1, n];
the cluster code of the sample to be deleted is updated, where code denotes the cluster code to which the sample belongs, sim denotes the similarity computed between the sample and the split node, and thresh denotes the threshold of the split node;
in step 5, if the label of the retrieval result equals the label of the sample to be deleted, the retrieval-result sample is deleted from the subsystem unit as follows:
data_k denotes the datum to be deleted; the top1 datum output by the i-th subsystem unit when data_k is used as the sample to be retrieved has serial number m in the i-th subsystem unit; if its label equals the label of data_k, the sample with serial number m is deleted from the i-th subsystem unit;
in step 5, the number of samples in the subsystem unit whose cluster code equals that of the retrieval-result sample is counted, and if this number is zero, the split nodes with the same cluster code as the retrieval-result sample are deleted from the split-node set of the subsystem unit.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011269580.6A CN112364080B (en) | 2020-11-13 | 2020-11-13 | Rapid retrieval system and method for massive vector libraries |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112364080A true CN112364080A (en) | 2021-02-12 |
CN112364080B CN112364080B (en) | 2024-04-09 |
Family
ID=74514764
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011269580.6A Active CN112364080B (en) | 2020-11-13 | 2020-11-13 | Rapid retrieval system and method for massive vector libraries |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112364080B (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102156728A (en) * | 2011-03-31 | 2011-08-17 | 河南理工大学 | Improved personalized summary system based on user interest model |
CN103699648A (en) * | 2013-12-26 | 2014-04-02 | 成都市卓睿科技有限公司 | Tree-form data structure used for quick retrieval and implementation method of tree-form data structure |
CN107563715A (en) * | 2017-07-19 | 2018-01-09 | 天津云脉三六五科技有限公司 | Foreign trade set-off marketing system and method |
CN109918529A (en) * | 2019-02-25 | 2019-06-21 | 重庆邮电大学 | An Image Retrieval Method Based on Tree Clustering Vector Quantization |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116595065A (en) * | 2023-05-09 | 2023-08-15 | 上海任意门科技有限公司 | Content duplicate identification method, device, system and storage medium |
CN116595065B (en) * | 2023-05-09 | 2024-04-02 | 上海任意门科技有限公司 | Content duplicate identification method, device, system and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN112364080B (en) | 2024-04-09 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |