CN112364080A - Rapid retrieval system and method for massive vector library - Google Patents


Info

Publication number
CN112364080A
Authority
CN
China
Prior art keywords
sample
signal data
samples
data
representing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011269580.6A
Other languages
Chinese (zh)
Other versions
CN112364080B (en
Inventor
谢建
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
WUHAN YANGTZE COMMUNICATIONS INDUSTRY GROUP CO LTD
Wuhan Yangtze Communications Zhilian Technology Co ltd
Original Assignee
WUHAN YANGTZE COMMUNICATIONS INDUSTRY GROUP CO LTD
Wuhan Yangtze Communications Zhilian Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by WUHAN YANGTZE COMMUNICATIONS INDUSTRY GROUP CO LTD, Wuhan Yangtze Communications Zhilian Technology Co ltd filed Critical WUHAN YANGTZE COMMUNICATIONS INDUSTRY GROUP CO LTD
Priority to CN202011269580.6A priority Critical patent/CN112364080B/en
Publication of CN112364080A publication Critical patent/CN112364080A/en
Application granted granted Critical
Publication of CN112364080B publication Critical patent/CN112364080B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2471 Distributed queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Fuzzy Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a rapid retrieval system and method for a massive vector library. The system comprises a central control unit and a plurality of subsystem units; the central control unit is responsible for signal feature vector extraction, task distribution and result merging. Each subsystem unit builds an incremental data structure that requires no clustering, which simplifies construction of the data structure, makes the constructed data structure independent of the distribution of the data set, allows massive vectors to be retrieved quickly, supports dynamic insertion and deletion of samples, and meets the needs of more real-world scenarios. The invention randomly selects a node from the sample set to be split, which simplifies the calculation and ensures that the constructed data structure does not depend on the original data distribution; samples can be added to and deleted from the data structure dynamically, without rebuilding the data structure model after each addition or deletion.

Description

Rapid retrieval system and method for massive vector library
Technical Field
The invention belongs to the field of massive vector retrieval, and particularly relates to a massive vector library-oriented rapid retrieval system and method.
Background
Current massive vector retrieval methods fall into three categories. Methods based on Hadoop and similar frameworks distribute the target feature vector to different subsystem units; each subsystem unit completes its own retrieval task independently, and the individual retrieval results are finally merged to obtain the final result. Data-structure-based methods first partition the massive vector features by clustering and then build a data structure model from the clustering result; during retrieval, the data structure is used to quickly find the cluster to which the target feature vector belongs, and all samples in that cluster are then traversed to complete the retrieval. Cascade-based methods first filter the samples using simple features to narrow the retrieval range and then perform exact retrieval within that smaller range.
The main disadvantages of methods based on Hadoop and similar frameworks are heavy computation, high resource consumption and low retrieval efficiency. They rely on brute-force search: the sample to be retrieved must be matched against every sample in the sample library.
The main disadvantages of data-structure-based methods are that samples cannot be dynamically added to or deleted from the sample library and that the memory requirement is high. The sample data must be clustered in advance and a data structure model generated from the clustering result; when the data volume is large, both the feature clustering and the construction of the data structure model are time-consuming. During search, the whole data structure model is loaded into memory, and the size of the model is proportional to the number of samples.
The main disadvantages of cascade-based methods are low precision and low retrieval efficiency. Simple features cannot fully describe the real information of a sample, so screening with them may degrade performance. Moreover, the screening step must still compute a similarity against all samples; although simple features make each computation cheaper, the sample data volume is so large that the time cost cannot be ignored, and efficiency remains low.
In summary, the main technical problems of existing massive vector retrieval methods are as follows:
Low retrieval efficiency: the retrieval time is proportional to the size of the sample library, and when the sample library is very large (more than millions of samples), the vector retrieval speed cannot meet real-time requirements.
No dynamic addition or deletion of samples: conventional fast massive vector retrieval clusters the sample library and then builds a specific data structure model on top of the clustering. Once the data structure model has been built, samples cannot be added or deleted.
High resource occupancy: if a 512-dimensional vector is used to describe one sample, each sample requires about 2 KB, and when the number of samples exceeds 100 million, the required storage space exceeds 200 GB. To achieve fast retrieval, all of this data often has to be loaded into memory, so resource consumption is huge.
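The storage figures quoted above can be checked with a few lines of arithmetic. The sketch below assumes each vector component is stored as a 4-byte float; the text does not state the component width, so `bytes_per_component` is an assumption, and `library_size_bytes` is a hypothetical helper name.

```python
def library_size_bytes(num_samples, dim=512, bytes_per_component=4):
    """Raw storage for num_samples vectors of the given dimension.

    bytes_per_component=4 (32-bit float) is an assumption, not stated in the text.
    """
    return num_samples * dim * bytes_per_component


per_sample = library_size_bytes(1)            # bytes for one 512-dimensional sample
total = library_size_bytes(100_000_000)       # bytes for 100 million samples

print(per_sample)       # 2048
print(total / 1e9)      # 204.8
```

Under that assumption one sample takes 2048 bytes (about 2 KB) and 100 million samples take roughly 204.8 GB, which agrees with the "more than 200 GB" estimate.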
Disclosure of Invention
In order to solve the technical problems, the invention provides a rapid retrieval system and a rapid retrieval method for a massive vector library.
To address the above technical problems in existing massive vector library retrieval, the technical scheme of the system of the invention is as follows:
the system comprises a central control unit and a plurality of subsystem units, and the central control unit is connected to each of the subsystem units in turn.
The technical scheme of the method of the invention is a rapid retrieval method for a massive vector library, characterized by comprising the following steps:
Step 1: the central control unit extracts the feature vector of each original signal, and the label of each original signal data is marked manually; a sample of the original signal data is constructed from the label and the feature vector; the samples are sorted by label to obtain sorted signal data samples; and sorted samples that carry the same label are merged to obtain a merged signal data sample set;
Step 2: the central control unit distributes a data sample set to each subsystem unit; each subsystem unit reorganizes its distributed data sample set, selects a sample set to be split from the reorganized sample set, randomly selects one sample from the sample set to be split as a split node, calculates the similarity between every sample in the sample set to be split and the split node, sorts the sample set to be split by similarity, takes the similarity of the middle sample as a threshold, and updates the cluster code of each sample to be split according to the relationship between its similarity and the threshold; each cluster is split repeatedly in this way until the number of samples in every cluster is less than a specified number, and each split node is updated and added to the split node set;
Step 3: the central control unit extracts the feature vector of the target signal to be retrieved, packages it into a sample containing the feature vector and a cluster code, and distributes it to each subsystem unit; each subsystem unit finds the split node whose cluster code is -1 in its split node set, calculates the similarity between that split node and the sample to be retrieved, and updates the cluster code of the sample to be retrieved according to the similarity; this is repeated until no split node with the same cluster code as the sample to be retrieved can be found in the subsystem unit; all samples in the subsystem unit whose cluster code matches that of the sample to be retrieved are then taken out and sorted by similarity, and the m samples with the highest similarity are selected as the retrieval result of the subsystem unit and uploaded to the central control unit; the central control unit merges the retrieval results uploaded by the subsystem units, sorts them by similarity to obtain a merged set, eliminates samples with duplicate labels, and selects the k samples with the highest similarity as the final output, namely the topK.
Step 4: the central control unit extracts the feature vector of the original information of the sample to be inserted, and the label of the sample to be inserted is marked manually; using binary search on the label attribute, the sample to be inserted is inserted into the data set of the central control unit at the position given by the label order; one subsystem unit is randomly selected as the subsystem unit to be inserted into, and the cluster code of the sample to be inserted is updated as in step 3 until no corresponding split node can be found in that subsystem unit. The number of samples in that subsystem unit that have the same cluster code as the sample to be inserted is then counted; if the number is larger than a specified threshold, one sample is selected from the cluster as a split node, the similarity between every sample in the cluster and the split node is calculated, the cluster codes of all samples in the cluster are updated, and the split node is added to the split node set.
Step 5: the label of the sample to be deleted is given manually, and the central control unit uses binary search on the label to look up the merged data of the sample to be deleted in the data set of the central control unit; if the search result is zero, the split nodes in the split node set of the subsystem unit that have the same cluster code as the search result sample are deleted.
Preferably, the feature vector of the original signal extracted in step 1 is:
f_k = F(x_k)
k ∈ [0, n]
where x_k is the k-th original signal data, n is the number of original signal data, F denotes the feature extractor, and f_k is the feature vector of the k-th original signal data;
the label of the original signal data in step 1 is Id_k, where Id_k denotes the label of the k-th original signal data;
in step 1, the samples of the original signal data are:
data_k = {Id_k, f_k}
k ∈ [0, n]
where data_k denotes the sample of the k-th original signal data and n is the number of original signal data;
in step 1, the samples of the original signal data are sorted as follows:
the samples are sorted by the label of the original signal data, i.e. Id_k, from small to large, and the sorted signal data samples are:
Figure BDA0002777231930000041
Figure BDA0002777231930000042
i ∈ [1, n]
k_i ∈ [1, n]
where
Figure BDA0002777231930000043
denotes the i-th sorted signal data sample, i.e. the k_i-th sample of the original signal data, n is the number of original signal data,
Figure BDA0002777231930000044
denotes the label in the i-th sorted signal data sample, and
Figure BDA0002777231930000045
denotes the feature vector in the i-th sorted signal data sample;
in step 1, sorted signal data samples that carry the same label are merged as follows:
if
Figure BDA0002777231930000046
and
Figure BDA0002777231930000047
are the same and i ≠ j, then
Figure BDA0002777231930000048
and
Figure BDA0002777231930000049
are merged into the same merged signal data sample set;
in step 1, the merged signal data sample set is:
φ = {φ_1, φ_2, ..., φ_L}
Figure BDA00027772319300000410
u ∈ [1, L]
where φ denotes the merged signal data sample set, φ_u denotes the u-th merged signal data sample, L_u denotes the number of feature vectors in the u-th merged signal data sample,
Figure BDA00027772319300000411
denotes the label in the u-th merged signal data sample, i.e. the label of the k_u-th sample of the original signal data, and
Figure BDA00027772319300000412
denotes the v-th feature vector in the u-th merged signal data sample, i.e. the feature vector of the k_{u,v}-th sample of the original signal data, v ∈ [1, L_u];
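The sort-then-merge procedure of step 1 can be sketched as follows. This is a minimal illustration, not the patented implementation: samples are modelled as `(label, feature_vector)` pairs, and the function name `build_merged_set` and the toy data are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def build_merged_set(samples):
    """Sort (label, feature) pairs by label, then merge equal labels.

    Returns a list of (label, [feature, ...]) entries in ascending label order,
    mirroring the merged set phi = {phi_1, ..., phi_L} described in step 1.
    """
    ordered = sorted(samples, key=itemgetter(0))  # stable sort by label only
    return [
        (label, [feature for _, feature in group])
        for label, group in groupby(ordered, key=itemgetter(0))
    ]


# Toy data (hypothetical): two samples share label 7.
samples = [(7, [0.1, 0.9]), (3, [0.5, 0.5]), (7, [0.2, 0.8])]
merged = build_merged_set(samples)
print(merged)
# [(3, [[0.5, 0.5]]), (7, [[0.1, 0.9], [0.2, 0.8]])]
```

Keeping the merged set sorted by label is what later allows the central control unit to locate a label by binary search when inserting or deleting samples.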
Preferably, in step 2, the central control unit distributes a data sample set to each subsystem unit as follows:
Figure BDA00027772319300000413
Figure BDA00027772319300000414
u ∈ [1, c]
where φ^i denotes the data sample set distributed by the central control unit to the i-th subsystem unit,
Figure BDA00027772319300000415
denotes the u-th merged signal data sample in the i-th subsystem unit, L_u denotes the number of feature vectors in the u-th merged signal data sample,
Figure BDA00027772319300000416
denotes the label in the u-th merged signal data sample, i.e. the label of the k_u-th sample of the original signal data,
Figure BDA0002777231930000051
denotes the v-th feature vector in the u-th merged signal data sample, i.e. the feature vector of the k_{u,v}-th sample of the original signal data, v ∈ [1, L_u], and c denotes the number of sample labels contained in each subsystem unit;
in step 2, each subsystem unit reorganizes its distributed data sample set into:
Figure BDA0002777231930000052
Figure BDA0002777231930000053
j ∈ [1, L]
where
Figure BDA0002777231930000054
denotes the reorganized data sample set of the i-th subsystem unit,
Figure BDA0002777231930000055
denotes the j-th reorganized signal data sample in the i-th subsystem unit, L denotes the number of reorganized samples in the i-th subsystem unit, Id_j denotes the label of the j-th reorganized signal data sample, f_j denotes the feature vector of the j-th reorganized signal data sample, node indicates whether the sample has been selected as a split node (default false), sim denotes the similarity between the sample and the split node, and code denotes the cluster code to which the sample belongs (initial value -1).
In step 2, the sample set to be split is selected from the reorganized sample set as follows:
Figure BDA0002777231930000056
Figure BDA0002777231930000057
j ∈ [1, l]
where
Figure BDA0002777231930000058
denotes the sample set to be split of the i-th subsystem unit after reorganization (i.e. the set of data samples with code = M),
Figure BDA0002777231930000059
denotes the j-th sample with code = M after reorganization in the i-th subsystem unit, l denotes the number of samples with code = M after reorganization in the i-th subsystem unit, and s_j denotes the similarity of
Figure BDA00027772319300000510
to the split node.
In step 2, the similarity between a sample to be split and the split node is calculated as:
Figure BDA00027772319300000511
where sim_AB denotes the similarity of the vectors A and B,
A and B denote the two vectors, and A_i, B_i denote the i-th dimension values of A and B,
n denotes the dimension of the vectors, i ∈ [1, n].
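The similarity formula itself survives here only as an image placeholder, so its exact form cannot be confirmed from this text. The symbols that are defined (two vectors A and B, their components A_i and B_i, and the dimension n) are consistent with cosine similarity, which the following sketch uses purely as an assumption; `cosine_similarity` is a hypothetical helper name.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (an assumed choice of metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

Any other similarity with the same interface (larger value means more similar) could be substituted without changing the splitting and routing logic described in the rest of step 2.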
In step 2, all samples in the sample set to be split are sorted by similarity, and the sorted sample set is:
Figure BDA0002777231930000061
Figure BDA0002777231930000062
i ∈ [1, N]
k_j ∈ [1, l]
where
Figure BDA0002777231930000063
denotes the j-th sample after sorting by sim, where i denotes the i-th subsystem unit; it corresponds to the k_j-th sample of
Figure BDA0002777231930000064
and l is the number of samples in
Figure BDA0002777231930000065
while
Figure BDA0002777231930000066
denotes the label of the i-th sorted sample, and
Figure BDA0002777231930000067
denotes the feature vector of the i-th sorted sample.
In step 2, the similarity of the middle sample is selected as the threshold, in the form:
Figure BDA0002777231930000068
where
Figure BDA0002777231930000069
denotes the similarity of the sample
Figure BDA00027772319300000610
in the similarity-sorted sample set to be split, which is taken as the threshold of the split node;
in step 2, the cluster code of each sample to be split is updated according to the relationship between its similarity and the threshold by the formula:
Figure BDA00027772319300000611
where code denotes the cluster code to which the sample belongs,
sim denotes the similarity calculated between the sample and the split node, and
thresh denotes the threshold of the split node.
In step 2, each cluster is split repeatedly until the number of samples in every cluster is less than the specified number, specifically:
each time the cluster codes of the samples are updated, the number of samples in each updated cluster is counted; if the number exceeds the specified number, one sample is randomly selected from the new cluster as a split node, the similarity between every sample in the new cluster and the new split node is calculated, and the cluster codes of all samples in the new cluster are updated; otherwise, splitting ends.
In step 2, each split node is updated as follows:
Figure BDA0002777231930000071
where
Figure BDA0002777231930000072
denotes the split node with code M in the i-th subsystem unit, M denotes the cluster code to which the split node belongs, and thresh denotes the threshold of the split node;
in step 2, the split node set ψ^i is:
Figure BDA0002777231930000073
Figure BDA0002777231930000074
j ∈ [1, q]
where ψ^i denotes the split node set of the i-th subsystem unit,
Figure BDA0002777231930000075
denotes the j-th split node in the i-th subsystem unit, M_j denotes the code of the split node, T_j denotes the threshold of the split node, and q denotes the number of split nodes of the subsystem unit.
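The splitting procedure of step 2 can be sketched end to end as below. Several details are assumptions rather than the patent's definitions, because the corresponding formulas are only image placeholders here: cosine similarity is used as the metric, and the code-update rule is a hypothetical binary-tree numbering in which a cluster with code M splits into children 2M+2 (similarity at or above the threshold) and 2M+3 (below it), so the root code -1 splits into 0 and 1. The names `build_index` and `max_cluster_size` are also illustrative.

```python
import math
import random
from collections import Counter


def cosine(a, b):
    # Assumed similarity metric; the original formula is not recoverable here.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def build_index(features, max_cluster_size, rng):
    """Repeatedly split clusters around random pivots until all are small.

    Returns (codes, split_nodes): codes[j] is the cluster code of sample j and
    split_nodes maps a cluster code M to its (pivot_vector, threshold).
    """
    codes = [-1] * len(features)          # every sample starts with code -1
    split_nodes = {}
    pending = [-1]                        # cluster codes that may still need splitting
    while pending:
        m = pending.pop()
        members = [j for j, c in enumerate(codes) if c == m]
        if len(members) < max_cluster_size:
            continue                      # cluster is already small enough
        pivot = features[rng.choice(members)]            # random split node
        sims = [cosine(features[j], pivot) for j in members]
        thresh = sorted(sims)[len(sims) // 2]            # similarity of the middle sample
        if min(sims) == thresh:
            continue                      # degenerate case: nothing falls below the threshold
        split_nodes[m] = (pivot, thresh)
        for j, s in zip(members, sims):
            # Hypothetical code-update rule (binary-tree numbering).
            codes[j] = 2 * m + 2 if s >= thresh else 2 * m + 3
        pending.extend((2 * m + 2, 2 * m + 3))
    return codes, split_nodes


rng = random.Random(0)
features = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(200)]
codes, split_nodes = build_index(features, max_cluster_size=16, rng=rng)
sizes = Counter(codes)
print(len(split_nodes), max(sizes.values()))
```

Because the pivot is chosen at random and the threshold is the median similarity, each split roughly halves a cluster without any prior clustering pass, which is the property the text emphasizes.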
Preferably, the feature vector of the target signal to be retrieved is extracted in step 3 as:
f_k = F(x_k)
k ∈ [0, n]
where x_k is the k-th target signal data to be retrieved, n is the number of target signal data to be retrieved, F denotes the feature extractor, and f_k is the feature vector of the k-th target signal data to be retrieved;
in step 3, the feature vector is packaged into a sample containing the feature vector and the cluster code and distributed to each subsystem unit, and each subsystem unit starts from the node whose cluster code is -1 in its split node set:
Id_k, where Id_k denotes the label of the k-th original signal data;
in step 3, the sample of the target signal data to be retrieved is:
data_k = {f_k, code_k = -1}
k ∈ [0, n]
where data_k denotes the sample of the k-th target signal data to be retrieved, code_k denotes the cluster code of that sample, whose initial value is -1, and n is the number of target signal data to be retrieved;
in step 3, the similarity between the split node and the sample to be retrieved is calculated, and the cluster code of the sample to be retrieved is updated according to the similarity;
in step 3, the similarity is calculated as:
Figure BDA0002777231930000081
where sim_AB denotes the similarity of the vectors A and B,
A and B denote the two vectors, and A_i, B_i denote the i-th dimension values of A and B,
n denotes the dimension of the vectors, i ∈ [1, n].
In step 3, the cluster code of the sample to be retrieved is updated according to the similarity as:
Figure BDA0002777231930000082
where code denotes the cluster code to which the sample belongs,
sim denotes the similarity calculated between the sample and the split node, and
thresh denotes the threshold of the split node.
In step 3, the above is repeated until no split node with the same cluster code as the sample to be retrieved can be found in the subsystem unit:
each time the cluster code of the sample to be retrieved is updated, the split node set is searched again for a split node whose cluster code equals the current cluster code of the sample to be retrieved; if one exists, the similarity between the sample to be retrieved and the new split node is calculated and the cluster code of the sample to be retrieved is updated; otherwise, the process ends.
In step 3, all samples in the subsystem unit whose cluster code matches that of the sample to be retrieved are then taken out:
the codes of all samples in the sample set of the subsystem unit are traversed, and all samples with the same cluster code as the sample to be retrieved are collected into a set of the form:
Figure BDA0002777231930000083
Figure BDA0002777231930000084
j ∈ [1, l]
where
Figure BDA0002777231930000085
denotes the set of samples with code = M in the i-th subsystem unit,
Figure BDA0002777231930000086
denotes the j-th sample with code = M, and l is the number of samples in the set
Figure BDA0002777231930000087
In step 3, all samples in the subsystem unit whose cluster code matches that of the sample to be retrieved are sorted by similarity, and the sorted sample set is:
Figure BDA0002777231930000091
Figure BDA0002777231930000092
k_j ∈ [1, l]
where
Figure BDA0002777231930000093
denotes the j-th sample after sorting by sim, where i denotes the i-th subsystem unit; it corresponds to the k_j-th sample of
Figure BDA0002777231930000094
and l is the number of samples in
Figure BDA0002777231930000095
while
Figure BDA0002777231930000096
denotes the label of the i-th sorted sample, and
Figure BDA0002777231930000097
denotes the feature vector of the i-th sorted sample.
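The per-subsystem part of step 3 (route the query through the split nodes, then rank the samples of the cluster it lands in) can be sketched as follows. As before, cosine similarity and the code-update rule (children 2M+2 and 2M+3 of a cluster with code M) are assumptions standing in for formulas that appear only as image placeholders, and `route`, `search_subsystem` and the toy index are hypothetical.

```python
import math


def cosine(a, b):
    # Assumed similarity metric.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def route(query, split_nodes):
    """Follow split nodes from code -1 until no split node matches the current code."""
    code = -1
    while code in split_nodes:
        pivot, thresh = split_nodes[code]
        # Hypothetical code-update rule, mirroring the one assumed for index construction.
        code = 2 * code + 2 if cosine(query, pivot) >= thresh else 2 * code + 3
    return code


def search_subsystem(query, samples, codes, split_nodes, m):
    """Return the m most similar (similarity, label) pairs from the query's cluster."""
    leaf = route(query, split_nodes)
    hits = [
        (cosine(query, feature), label)
        for (label, feature), code in zip(samples, codes)
        if code == leaf
    ]
    hits.sort(key=lambda hit: hit[0], reverse=True)
    return hits[:m]


# Toy subsystem (hypothetical): one split node at the root with threshold 0.5.
split_nodes = {-1: ([1.0, 0.0], 0.5)}
samples = [("a", [1.0, 0.1]), ("b", [0.9, 0.2]), ("c", [0.0, 1.0]), ("d", [0.1, 0.9])]
codes = [0, 0, 1, 1]

hits = search_subsystem([1.0, 0.0], samples, codes, split_nodes, m=2)
print([label for _, label in hits])  # ['a', 'b']
```

Only the samples in one cluster are compared with the query, so the cost per subsystem depends on the cluster size limit rather than on the total number of samples held by that subsystem.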
In step 3, the central control unit merges the retrieval results uploaded by the subsystem units to obtain the merged set:
Figure BDA0002777231930000098
Figure BDA0002777231930000099
k_j ∈ [1, m×N]
where φ_tar denotes the data set obtained by merging the results of the subsystem units and sorting them by sim.
In step 3, samples with duplicate labels are eliminated as follows:
if
Figure BDA00027772319300000910
and
Figure BDA00027772319300000911
are the same, i ≠ j, and
Figure BDA00027772319300000912
then delete
Figure BDA00027772319300000913
otherwise delete
Figure BDA00027772319300000914
In step 3, the k samples with the highest similarity are selected as the final output:
Figure BDA00027772319300000915
Figure BDA00027772319300000916
i ∈ [1, k]
where Out_tar denotes the output retrieval set, and
Figure BDA00027772319300000917
denotes the sample with the i-th highest similarity to the sample to be retrieved in the retrieval result.
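The merging stage at the central control unit can be sketched as below. The exact condition used to decide which of two same-label results is deleted is an image placeholder in this text; the sketch assumes the copy with the higher similarity is kept. The function name `merge_results` and the toy result lists are hypothetical.

```python
def merge_results(per_subsystem, k):
    """Merge per-subsystem (similarity, label) lists, drop duplicate labels, keep top k.

    Assumption: when two results share a label, the one with the higher
    similarity is kept.
    """
    merged = sorted(
        (hit for hits in per_subsystem for hit in hits),
        key=lambda hit: hit[0],
        reverse=True,
    )
    seen, output = set(), []
    for sim, label in merged:
        if len(output) >= k:
            break
        if label in seen:
            continue  # a more similar result with this label was already kept
        seen.add(label)
        output.append((sim, label))
    return output


# Toy results (hypothetical) from two subsystem units, m = 2 each.
results = [[(0.99, "a"), (0.80, "b")], [(0.95, "a"), (0.90, "c")]]
top = merge_results(results, k=2)
print(top)  # [(0.99, 'a'), (0.9, 'c')]
```

With N subsystem units each uploading m results, the central control unit sorts at most m×N candidates, which is small compared with the size of the library.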
Preferably, the feature vector of the original information of the sample to be inserted is extracted in step 4 as:
f_k = F(x_k)
k ∈ [0, n]
where x_k is the k-th original signal data to be inserted, n is the number of original signal data to be inserted, F denotes the feature extractor, and f_k is the feature vector of the k-th original signal data to be inserted;
in step 4, the label of the original signal data to be inserted is Id_k, where Id_k denotes the label of the k-th original signal data to be inserted;
in step 4, the samples of the original signal data to be inserted are:
data_k = {Id_k, f_k}
k ∈ [0, n]
where data_k denotes the sample of the k-th original signal data to be inserted and n is the number of original signal data to be inserted;
in step 4, binary search on the label attribute is used to insert the sample to be inserted into the data set of the central control unit at the position given by the label order (as shown in fig. 2):
if the label of the sample to be inserted is found in the data set of the central control unit, the feature vector of the sample to be inserted is added to the feature vectors that carry the same label in the data set, and the data set after insertion has the form:
φ = {φ_1, φ_2, ..., φ_L}
Figure BDA0002777231930000101
u ∈ [1, L]
where φ denotes the merged signal data sample set of the central control unit, φ_u denotes the merged signal data sample with the same label as the sample to be inserted, L_u denotes the number of feature vectors in that merged signal data sample,
Figure BDA0002777231930000102
denotes the label of the merged signal data sample that matches the label of the sample to be inserted, i.e. the label of the k_u-th sample of the original signal data,
Figure BDA0002777231930000103
denotes the v-th feature vector in that merged signal data sample, i.e. the feature vector of the k_{u,v}-th sample of the original signal data, v ∈ [1, L_u], and
Figure BDA0002777231930000104
denotes the feature vector of the sample to be inserted.
If the label of the sample to be inserted cannot be found in the data set of the central control unit, the insertion position is determined by the binary search, the sample to be inserted is inserted into the central data set, and the data set after insertion is:
φ = {φ_1, φ_2, ..., φ_L, φ_{L+1}}
Figure BDA0002777231930000105
u ∈ [1, L+1]
where φ denotes the merged signal data sample set of the central control unit, φ_u denotes the u-th merged signal data sample, L_u denotes the number of feature vectors in the u-th merged signal data sample,
Figure BDA0002777231930000106
denotes the label in the u-th merged signal data sample, i.e. the label of the k_u-th sample of the original signal data,
Figure BDA0002777231930000107
denotes the v-th feature vector in the u-th merged signal data sample, i.e. the feature vector of the k_{u,v}-th sample of the original signal data, v ∈ [1, L_u], and φ_{L+1} denotes the data sample of the sample to be inserted.
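The two insertion cases at the central control unit (label already present, label new) can be sketched with Python's standard `bisect` module. The data layout, two parallel lists `labels` and `vectors`, and the function name `insert_sample` are hypothetical choices made for the illustration.

```python
import bisect


def insert_sample(labels, vectors, label, feature):
    """Insert (label, feature) into a label-sorted data set using binary search.

    labels is a sorted list of distinct labels; vectors[i] holds the feature
    vectors merged under labels[i]. Returns the position used.
    """
    pos = bisect.bisect_left(labels, label)
    if pos < len(labels) and labels[pos] == label:
        vectors[pos].append(feature)       # label exists: extend its merged sample
    else:
        labels.insert(pos, label)          # new label: insert at the sorted position
        vectors.insert(pos, [feature])
    return pos


# Toy central data set (hypothetical).
labels = [3, 7]
vectors = [[[0.5, 0.5]], [[0.1, 0.9]]]

insert_sample(labels, vectors, 7, [0.2, 0.8])   # existing label
insert_sample(labels, vectors, 5, [0.3, 0.7])   # new label
print(labels)   # [3, 5, 7]
print(vectors)  # [[[0.5, 0.5]], [[0.3, 0.7]], [[0.1, 0.9], [0.2, 0.8]]]
```

The lookup is logarithmic in the number of labels; inserting into a Python list is linear, so a production system would likely use a different container, which the text does not specify.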
In step 4, one subsystem unit is randomly selected as the subsystem unit to be inserted into, and the cluster code of the sample to be inserted is updated as in step 3.
The sample of the signal data to be inserted is packaged as:
data_k = {f_k, code_k}
k ∈ [0, n]
where data_k denotes the sample of the k-th target signal data to be inserted, code_k denotes the cluster code of that sample, whose initial value is -1, and n is the number of target signal data to be inserted;
the similarity between the split node and the sample to be inserted is calculated, and the cluster code of the sample to be inserted is updated according to the similarity;
the similarity is calculated as:
Figure BDA0002777231930000111
where sim_AB denotes the similarity of the vectors A and B,
A and B denote the two vectors, and A_i, B_i denote the i-th dimension values of A and B,
n denotes the dimension of the vectors, i ∈ [1, n].
The cluster code of the sample to be inserted is updated according to the similarity by the formula:
Figure BDA0002777231930000112
where code denotes the cluster code to which the sample belongs,
sim denotes the similarity calculated between the sample and the split node, and
thresh denotes the threshold of the split node.
In step 4, this is repeated until no corresponding split node can be found in the subsystem unit.
That is: each time the cluster code of the sample to be inserted is updated, the split node set is searched again for a split node whose cluster code equals that of the sample to be inserted; if one exists, the similarity between the sample to be inserted and the new split node is calculated and the cluster code of the sample to be inserted is updated; otherwise, the process ends.
In step 4, the split node set ψ^i is:
Figure BDA0002777231930000121
Figure BDA0002777231930000122
j ∈ [1, q]
where ψ^i denotes the split node set of the i-th subsystem unit,
Figure BDA0002777231930000123
denotes the j-th split node in the i-th subsystem unit, M_j denotes the code of the split node, T_j denotes the threshold of the split node, and q denotes the number of split nodes of the subsystem unit.
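The subsystem side of step 4 (route the new sample to its cluster, then split that cluster once if it has grown too large) can be sketched as follows. Cosine similarity and the code-update rule (children 2M+2 and 2M+3 of a cluster with code M) remain assumptions, and `insert_into_subsystem` and `max_cluster_size` are hypothetical names.

```python
import math
import random


def cosine(a, b):
    # Assumed similarity metric.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def insert_into_subsystem(feature, features, codes, split_nodes, max_cluster_size, rng):
    """Insert one feature vector without rebuilding the index; returns its final code."""
    code = -1
    while code in split_nodes:                       # route as in step 3
        pivot, thresh = split_nodes[code]
        code = 2 * code + 2 if cosine(feature, pivot) >= thresh else 2 * code + 3
    features.append(feature)
    codes.append(code)

    members = [j for j, c in enumerate(codes) if c == code]
    if len(members) > max_cluster_size:              # cluster grew past the limit: split once
        pivot = features[rng.choice(members)]
        sims = [cosine(features[j], pivot) for j in members]
        thresh = sorted(sims)[len(sims) // 2]        # similarity of the middle sample
        if min(sims) < thresh:                       # skip degenerate splits
            split_nodes[code] = (pivot, thresh)
            for j, s in zip(members, sims):
                # Hypothetical code-update rule (binary-tree numbering).
                codes[j] = 2 * code + 2 if s >= thresh else 2 * code + 3
    return codes[-1]


# Toy subsystem (hypothetical): three samples, no split nodes yet.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
codes = [-1, -1, -1]
split_nodes = {}

insert_into_subsystem([0.1, 0.9], features, codes, split_nodes, 3, random.Random(0))
print(codes, sorted(split_nodes))
```

Only the cluster that received the new sample is touched, which is how the structure avoids a full rebuild after each insertion.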
Preferably, in step 5, the central control unit uses binary search on the labels to look up, in its own data set, the combined data for the sample to be deleted; if combined data with the same label as the sample to be deleted is found, the combined signal data for that label is extracted as:
Figure BDA0002777231930000124
u∈[1,L]
wherein φ_u denotes the combined signal data sample found in the central control unit whose label is the same as that of the sample to be deleted, L denotes the number of feature vectors in that combined signal data sample (that is, the number of features to be deleted), Id_k denotes the label of the sample to be deleted,
Figure BDA0002777231930000125
representing the v-th feature vector to be deleted.
Step 5, each feature vector to be deleted is used as a query and retrieved in each subsystem unit by the method of step 3 to obtain the top-1 result;
the feature vector of each signal data item to be deleted is packaged as data to be deleted:
data_k = {Id_k, f_k, code_k}
k∈[0,n]
wherein data_k denotes the k-th data item to be deleted, Id_k denotes the label of the target signal to be deleted, f_k denotes the k-th feature vector of the target signal data to be deleted, as held by the central control unit, code_k denotes the cluster code of the k-th data item to be deleted, whose initial value is -1, and n is the number of target signal data items to be deleted;
each feature vector to be deleted is used as the query and retrieved in each subsystem unit to obtain the top-1 result, in the same way as step 3: the similarity between the feature vector to be deleted and the split node is calculated and the cluster code of the data to be deleted is updated, until no split node with the same cluster code as the sample to be deleted can be found in the split node set.
Wherein the similarity calculation formula is as follows:
Figure BDA0002777231930000131
wherein sim_AB represents the similarity of the vectors A and B,
A and B are the two vectors, and A_i and B_i are their i-th components,
n denotes the dimension of the vectors, i ∈ [1, n].
The cluster code of the sample to be deleted is updated using the following formula:
Figure BDA0002777231930000132
wherein code represents the cluster code to which the sample belongs,
sim represents the computed similarity between the sample and the split node,
thresh represents the threshold of the split node.
Step 5, if the label of the retrieval result is the same as the label of the sample to be deleted, the retrieval result sample is deleted from the subsystem unit as follows:
data_k denotes the data to be deleted,
Figure BDA0002777231930000133
denotes the top-1 result returned by the i-th subsystem unit when data_k is used as the sample to be retrieved,
Figure BDA0002777231930000134
has serial number m in the i-th subsystem unit; if
Figure BDA0002777231930000135
then delete
Figure BDA0002777231930000136
Step 5, the samples in the subsystem unit whose cluster code is the same as that of the retrieval result sample are counted; if the count is zero, the split node with that cluster code is deleted from the split node set of the subsystem unit, as follows:
data_k denotes the data to be deleted,
Figure BDA0002777231930000137
denotes the split node in the split node set of the i-th subsystem unit whose cluster code is the same as that of data_k; the samples in the i-th subsystem unit with the same cluster code as data_k are counted, and if the count is zero, delete
Figure BDA0002777231930000138
The invention has the advantages that:
Building the data structure: the invention requires no clustering; it randomly selects samples within a class as central nodes and organizes the samples into a tree structure, which simplifies computation, makes the constructed data structure independent of the distribution of the original data, and allows samples to be added and deleted dynamically.
Granularity of leaf-node partitioning: leaf nodes need not be split down to single samples; each holds a set of a certain number of samples, which can improve retrieval precision.
Dynamic data structure: the data structure established in the subsystem unit is dynamic and is not a fixed model, and the processing of the sample is more flexible.
Data in the data structure can be added or deleted: the system can directly add and delete samples to the data structure of the subsystem unit, so that the retrieval system can meet more requirements.
Drawings
FIG. 1: implementation scenario of the invention.
FIG. 2: central control unit database.
FIG. 3: flow chart of data structure construction in a subsystem unit.
FIG. 4: splitting of subsystem unit data.
FIG. 5: tree structure of subsystem unit features.
FIG. 6: sample retrieval flow chart.
FIG. 7: sample insertion flow chart.
FIG. 8: sample deletion flow chart.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other. For parameters that must be chosen according to the actual situation, the setting method has been noted above and is not repeated here.
As shown in fig. 1, which is a schematic view of an implementation scenario of the present invention, the method of the present invention includes a central control unit and a plurality of subsystem units, and the central control unit is sequentially connected to the plurality of subsystem units.
The following describes the embodiments of the present invention with reference to fig. 1 to 8:
Step 1: the central control unit extracts the feature vectors of the original signals; the labels of the original signal data are marked manually, and each label is combined with the corresponding feature vector to construct a sample of the original signal data; the samples are sorted by label to obtain sorted signal data samples, and sorted samples that share the same label are merged to obtain the combined signal data sample set, whose form is shown in FIG. 2;
step 1, extracting the feature vector of the original signal is as follows:
f_k = F(x_k)
k∈[0,n]
wherein x_k is the k-th original signal data, n is the number of original signal data, F represents the feature extractor, and f_k is the feature vector of the k-th original signal data;
step 1, the label of the original signal data is Id_k, where Id_k represents the label of the k-th original signal data;
step 1, the samples of the original signal data are:
data_k = {Id_k, f_k}
k∈[0,n]
wherein data_k represents the sample of the k-th original signal data, and n is the number of original signal data;
step 1, the samples of the original signal data are sorted as follows:
the samples are sorted in ascending order of the label Id_k of the original signal data, giving the sorted signal data samples:
Figure BDA0002777231930000151
Figure BDA0002777231930000152
i∈[1,n]
ki∈[1,n]
wherein,
Figure BDA0002777231930000153
represents the i-th sorted signal data sample, which corresponds to the k_i-th sample of the original signal data, n being the number of original signal data,
Figure BDA0002777231930000154
representing the label in the ith sorted signal data sample,
Figure BDA0002777231930000155
representing a feature vector in the ith sorted signal data sample;
step 1, sorted signal data samples with the same label are merged as follows:
if
Figure BDA0002777231930000156
and
Figure BDA0002777231930000157
are the same, with i ≠ j, then
Figure BDA0002777231930000158
and
Figure BDA0002777231930000159
are merged into the same combined signal data sample;
in step 1, the combined signal data sample set is:
φ = {φ_1, φ_2, ..., φ_L}
Figure BDA0002777231930000161
u∈[1,L]
where φ represents the set of combined signal data samples, φ_u represents the u-th combined signal data sample, L_u represents the number of feature vectors in the u-th combined signal data sample,
Figure BDA0002777231930000162
represents the label of the u-th combined signal data sample, that is, the label of the k_u-th sample of the original signal data,
Figure BDA0002777231930000163
represents the v-th feature vector in the u-th combined signal data sample, that is, the feature vector of the k_{u,v}-th sample of the original signal data, v ∈ [1, L_u];
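The sort-and-merge of step 1 can be sketched as follows. This is only an illustration under assumed names: `merge_by_label` and the `(label, feature_vector)` tuple layout are hypothetical and do not appear in the patent.

```python
from itertools import groupby
from operator import itemgetter


def merge_by_label(samples):
    """Sort (label, feature_vector) pairs by label and merge equal labels.

    Returns a list of (label, [feature_vectors]) entries in ascending
    label order, mirroring the combined signal data sample set phi.
    """
    # sorted() is stable, so vectors under one label keep their input order.
    ordered = sorted(samples, key=itemgetter(0))
    return [
        (label, [vec for _, vec in group])
        for label, group in groupby(ordered, key=itemgetter(0))
    ]


merged = merge_by_label([(3, [0.1]), (1, [0.2]), (3, [0.3])])
```

Each entry of the result plays the role of one φ_u: a label followed by its L_u feature vectors.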
Step 2: the central control unit allocates a data sample set to each subsystem unit, and each subsystem unit recombines its allocated data sample set. The subsystem unit selects a sample set to be split from the recombined sample set, randomly selects one sample from it as the split node, calculates the similarity between every sample in the set and the split node, sorts the set by similarity, and takes the similarity of the middle sample as the threshold. The code of each sample to be split is updated according to how its similarity compares with the threshold, and each resulting cluster is split again in the same way until every cluster contains fewer than a specified number of samples; each split node is updated and added to the split node set. The flow chart for constructing the subsystem unit data structure is shown in FIG. 3, the set splitting diagram in FIG. 4, and the resulting tree data structure in FIG. 5;
step 2, the central control unit allocates a data sample set for each subsystem unit as follows:
Figure BDA0002777231930000164
Figure BDA0002777231930000165
u∈[1,c]
wherein φ_i represents the data sample set assigned by the central control unit to the i-th subsystem unit,
Figure BDA0002777231930000166
represents the u-th combined signal data sample in the i-th subsystem unit, L_u represents the number of feature vectors in the u-th combined signal data sample,
Figure BDA0002777231930000167
represents the label of the u-th combined signal data sample, that is, the label of the k_u-th sample of the original signal data,
Figure BDA0002777231930000168
represents the v-th feature vector in the u-th combined signal data sample, that is, the feature vector of the k_{u,v}-th sample of the original signal data, v ∈ [1, L_u], and c represents the number of sample labels contained in each subsystem unit;
step 2, each subsystem unit recombines the distributed data sample sets into:
Figure BDA0002777231930000171
Figure BDA0002777231930000172
j∈[1,L]
wherein,
Figure BDA0002777231930000173
representing the recombined data sample set of the ith subsystem element,
Figure BDA0002777231930000174
represents the j-th signal data sample after recombination in the i-th subsystem unit, L represents the number of samples after recombination in the i-th subsystem unit, Id_j represents the label of the j-th recombined signal data sample, f_j represents its feature vector, node indicates whether the sample has been selected as a split node (false by default), sim represents the similarity between the sample and the split node, and code represents the cluster code to which the sample belongs, with initial value -1.
Step 2, selecting a sample set to be split from the recombined sample set as follows:
Figure BDA0002777231930000175
Figure BDA0002777231930000176
j∈[1,l]
wherein,
Figure BDA0002777231930000177
represents the sample set to be split of the i-th subsystem unit after recombination (that is, the set of data samples with code = M),
Figure BDA0002777231930000178
represents the j-th sample with code = M after recombination in the i-th subsystem unit, l represents the number of samples with code = M after recombination in the i-th subsystem unit, and s_j represents the similarity of
Figure BDA0002777231930000179
to the split node.
Step 2, the similarity between a sample to be split and the split node is calculated as:
Figure BDA00027772319300001710
wherein sim_AB represents the similarity of the vectors A and B,
A and B are the two vectors, and A_i and B_i are their i-th components,
n denotes the dimension of the vectors, i ∈ [1, n].
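The similarity formula itself survives here only as a figure reference. The sketch below assumes it is the cosine similarity, which is consistent with the symbols defined above (two n-dimensional vectors A and B with components A_i and B_i) but is not confirmed by the text; the function name and the zero-vector convention are likewise assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors.

    ASSUMPTION: the patent's sim_AB formula is only available as an image;
    cosine similarity is used here as a plausible stand-in.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch, not from the patent
    return dot / (norm_a * norm_b)
```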
Step 2, all samples in the sample set to be split are sorted by similarity; the sorted sample set is:
Figure BDA0002777231930000181
Figure BDA0002777231930000182
i∈[1,N]
kj∈[1,l]
wherein,
Figure BDA0002777231930000183
represents the j-th sample after sorting by sim in the i-th subsystem unit, that is, the k_j-th sample of
Figure BDA0002777231930000184
where
Figure BDA0002777231930000185
contains l samples,
Figure BDA0002777231930000186
a label representing the ith sorted sample,
Figure BDA0002777231930000187
representing the feature vector of the ith sorted sample.
Step 2, the similarity of the middle sample is selected as the threshold, in the following form:
Figure BDA0002777231930000188
wherein,
Figure BDA0002777231930000189
denotes the similarity of the sample at position
Figure BDA00027772319300001810
of the sample set to be split after sorting by similarity, which is used as the threshold of the split node;
step 2, the code of each sample to be split is updated according to the relationship between its similarity and the threshold, using the formula:
Figure BDA00027772319300001811
wherein code represents the cluster code to which the sample belongs,
sim represents the computed similarity between the sample and the split node,
thresh represents the threshold of the split node.
Step 2, each cluster is split repeatedly until the number of samples it contains is less than the specified number, specifically:
each time the cluster codes of the samples to be split are updated, the number of samples in each updated cluster is counted; if the number exceeds the specified number, one sample is randomly selected from the new cluster as a split node, the similarity between every sample in the new cluster and the new split node is calculated, and the cluster codes of all samples in the new cluster are updated; otherwise the process ends.
Step 2, each split node is updated as follows:
Figure BDA00027772319300001812
wherein,
Figure BDA0002777231930000191
represents the split node with code M in the i-th subsystem unit, M represents the code of the region to which the split node belongs, and thresh represents the threshold of the split node;
step 2, the split node set ψ_i is:
Figure BDA0002777231930000192
Figure BDA0002777231930000193
j∈[1,q]
wherein ψ_i represents the split node set of the i-th subsystem unit,
Figure BDA0002777231930000194
denotes the j-th split node in the i-th subsystem unit, M_j represents the code of the split node, T_j represents the threshold of the split node, and q represents the number of split nodes in the subsystem unit.
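The construction in step 2 can be sketched roughly as below. Two things are assumptions because the original formulas are available only as figures: cosine similarity stands in for the similarity measure, and `child_codes` is a hypothetical code-update rule (-1 splits into 0 and 1, 0 into 2 and 3, and so on). The dictionary layout of samples and split nodes is also illustrative.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity; assumed stand-in for the patent's sim formula."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def child_codes(code):
    """HYPOTHETICAL code update: -1 -> (0, 1), 0 -> (2, 3), 1 -> (4, 5), ..."""
    low = 2 * (code + 1)
    return low, low + 1


def build_split_tree(samples, limit, rng):
    """Assign a leaf cluster code to every sample; return the split node set.

    samples: list of dicts with keys 'id' (label) and 'f' (feature vector).
    limit:   a cluster with fewer than `limit` samples is not split further.
    Returns {code: (split_node_vector, thresh)}.
    """
    for s in samples:
        s["code"] = -1                          # initial cluster code
    split_nodes = {}
    pending = [-1]                              # cluster codes still to examine
    while pending:
        code = pending.pop()
        cluster = [s for s in samples if s["code"] == code]
        if len(cluster) < limit:
            continue                            # small enough: stays a leaf
        pivot = rng.choice(cluster)             # random sample as split node
        sims = [cosine(s["f"], pivot["f"]) for s in cluster]
        thresh = sorted(sims)[len(sims) // 2]   # similarity of the middle sample
        if min(sims) == thresh:
            continue                            # nothing below thresh: keep as leaf
        low, high = child_codes(code)
        for s, sim in zip(cluster, sims):
            s["code"] = high if sim >= thresh else low
        split_nodes[code] = (pivot["f"], thresh)
        pending.extend((low, high))
    return split_nodes
```

The guard against a split that leaves one side empty is an addition of this sketch; the text does not say how such ties are handled.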
Step 3: the central control unit extracts the feature vector of the target signal to be retrieved, packages it into a form containing the feature vector and a cluster code, and distributes it to each subsystem unit. Each subsystem unit finds the node with cluster code -1 in its split node set, calculates the similarity between that split node and the sample to be retrieved, and updates the cluster code of the sample to be retrieved according to the similarity; this is repeated until no split node with the same cluster code as the sample to be retrieved can be found in the subsystem unit. The subsystem unit then takes out all of its samples whose cluster code matches that of the sample to be retrieved, sorts them by similarity, selects the m samples with the highest similarity as its retrieval result, and uploads them to the central control unit. The central control unit merges the retrieval results uploaded by the subsystem units, sorts them by similarity to obtain a combined set, removes samples with duplicate labels, and selects the k samples with the highest similarity as the final output, namely topK. The sample retrieval flow chart is shown in FIG. 6.
Step 3, the feature vector of the target signal to be retrieved is extracted as:
f_k = F(x_k)
k∈[0,n]
wherein x_k is the k-th target signal data to be retrieved, n is the number of target signal data to be retrieved, F represents the feature extractor, and f_k is the feature vector of the k-th target signal data to be retrieved;
step 3, the feature vector is packaged into a form containing the feature vector and the cluster code and distributed to each subsystem unit, and the search starts from the node with cluster code -1 in the split node set of each subsystem unit:
Id_k, where Id_k represents the label of the k-th original signal data;
step 3, the sample of the target signal data to be retrieved is:
data_k = {f_k, code_k = -1}
k∈[0,n]
wherein data_k represents the sample of the k-th target signal data to be retrieved, code_k represents the cluster code of that sample, whose initial value is -1, and n is the number of target signal data to be retrieved;
Step 3, the similarity between the split node and the sample to be retrieved is calculated, and the cluster code of the sample to be retrieved is updated according to the similarity:
step 3, the similarity calculation formula is:
Figure BDA0002777231930000201
wherein sim_AB represents the similarity of the vectors A and B,
A and B are the two vectors, and A_i and B_i are their i-th components,
n denotes the dimension of the vectors, i ∈ [1, n].
Step 3, the cluster code of the sample to be retrieved is updated according to the similarity as follows:
Figure BDA0002777231930000202
wherein code represents the cluster code to which the sample belongs,
sim represents the computed similarity between the sample and the split node,
thresh represents the threshold of the split node.
Step 3, this is repeated until no split node with the same cluster code as the sample to be retrieved can be found in the subsystem unit.
Each time the cluster code of the sample to be retrieved is updated, the split node set is searched again for a split node with the same cluster code as the current sample to be retrieved; if one exists, the similarity between the sample to be retrieved and this new split node is calculated and the cluster code of the sample is updated again; otherwise the process ends.
Step 3, all samples in the subsystem unit whose cluster code now matches that of the sample to be retrieved are taken out.
The codes of all samples in the sample set of the subsystem unit are traversed, and all samples with the same cluster code as the sample to be retrieved are collected into a set of the form:
Figure BDA0002777231930000211
Figure BDA0002777231930000212
j∈[1,l]
wherein,
Figure BDA0002777231930000213
represents the set of samples with code = M in the i-th subsystem unit,
Figure BDA0002777231930000214
represents the j-th sample with code = M, and
Figure BDA0002777231930000215
contains l samples.
Step 3, all samples in the subsystem unit whose cluster code matches that of the sample to be retrieved are sorted by similarity; the sorted sample set is:
Figure BDA0002777231930000216
Figure BDA0002777231930000217
kj∈[1,l]
wherein,
Figure BDA0002777231930000218
represents the j-th sample after sorting by sim in the i-th subsystem unit, that is, the k_j-th sample of
Figure BDA0002777231930000219
where
Figure BDA00027772319300002110
contains l samples,
Figure BDA00027772319300002111
a label representing the ith sorted sample,
Figure BDA00027772319300002112
representing the feature vector of the ith sorted sample.
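The per-subsystem part of step 3 (descending through the split nodes, then ranking the samples of the matching cluster) might look like the sketch below. It assumes cosine similarity and a hypothetical child-code rule (code c becomes 2(c+1) or 2(c+1)+1), since the patent's similarity and code-update formulas are only figure references; `descend`, `search_unit` and the data layout are illustrative names.

```python
import math


def cosine(a, b):
    """Cosine similarity; assumed stand-in for the patent's sim formula."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def descend(split_nodes, query):
    """Start at code -1 and update the code until no split node matches it.

    split_nodes: {code: (split_node_vector, thresh)}.
    """
    code = -1
    while code in split_nodes:
        pivot, thresh = split_nodes[code]
        low = 2 * (code + 1)                    # HYPOTHETICAL child-code scheme
        code = low + 1 if cosine(query, pivot) >= thresh else low
    return code


def search_unit(split_nodes, samples, query, m):
    """Return the m most similar (label, similarity) pairs in the query's cluster."""
    leaf = descend(split_nodes, query)
    hits = [(s["id"], cosine(query, s["f"])) for s in samples if s["code"] == leaf]
    hits.sort(key=lambda h: h[1], reverse=True)  # highest similarity first
    return hits[:m]
```

Only the samples in one leaf cluster are compared against the query, which is where the speed-up over a full scan of the subsystem unit comes from.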
Step 3, the central control unit merges the retrieval results uploaded by the subsystem units into a combined set:
Figure BDA00027772319300002113
Figure BDA00027772319300002114
kj∈[1,m×N]
wherein φ_tar represents the data set obtained by merging the results of all subsystem units and sorting by sim.
Step 3, samples with the same label are removed:
if
Figure BDA00027772319300002115
and
Figure BDA00027772319300002116
are identical, with i ≠ j,
Figure BDA00027772319300002117
then delete
Figure BDA00027772319300002118
otherwise delete
Figure BDA00027772319300002119
Step 3, selecting the k samples with the maximum similarity as the final output:
Figure BDA00027772319300002120
Figure BDA00027772319300002121
i∈[1,k]
wherein out_tar represents the output retrieval set, and
Figure BDA00027772319300002122
represents the sample with the i-th highest similarity to the sample to be retrieved in the retrieval result.
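The merging performed by the central control unit can be sketched as follows. The function name and the `(label, similarity)` tuple layout are hypothetical, and keeping the higher-similarity entry when two results share a label is an assumption, since the deletion condition is given only as a figure.

```python
def merge_top_k(per_unit_results, k):
    """Merge per-subsystem results, drop duplicate labels, keep the top k.

    per_unit_results: one list of (label, similarity) pairs per subsystem unit.
    """
    merged = sorted(
        (hit for hits in per_unit_results for hit in hits),
        key=lambda h: h[1],
        reverse=True,                 # highest similarity first
    )
    seen = set()
    unique = []
    for label, sim in merged:
        if label in seen:
            continue                  # ASSUMED rule: keep the best hit per label
        seen.add(label)
        unique.append((label, sim))
    return unique[:k]


top2 = merge_top_k([[("a", 0.9), ("b", 0.8)], [("a", 0.95), ("c", 0.7)]], 2)
```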
Step 4: the central control unit extracts the feature vector of the original information of the sample to be inserted, and the label of the sample to be inserted is marked manually; using binary search on the label attribute of its data set, the central control unit inserts the sample into that data set in label order. One subsystem unit is then selected at random as the subsystem unit to be inserted into, and the cluster code of the sample to be inserted is updated as in step 3 until no corresponding split node can be found in that subsystem unit. The samples in that subsystem unit with the same cluster code as the sample to be inserted are then counted; if the count is greater than a specified threshold, one sample is selected from the cluster as a split node, the similarity between every sample in the cluster and the split node is calculated, the cluster codes of all samples in the cluster are updated, and the split node is added to the split node set. The sample insertion flow is shown in FIG. 7.
Step 4, the feature vector of the original information of the sample to be inserted is extracted as:
f_k = F(x_k)
k∈[0,n]
wherein x_k is the k-th original signal data to be inserted, n is the number of original signal data to be inserted, F represents the feature extractor, and f_k is the feature vector of the k-th original signal data to be inserted;
step 4, the label of the original signal data to be inserted is Id_k, where Id_k represents the label of the k-th original signal data to be inserted;
step 4, the samples of the original signal data to be inserted are:
data_k = {Id_k, f_k}
k∈[0,n]
wherein data_k represents the sample of the k-th original signal data to be inserted, and n is the number of original signal data to be inserted;
step 4, using binary search on the label attribute of its data set, the central control unit inserts the sample to be inserted into that data set (as shown in FIG. 2) in label order:
if the label of the sample to be inserted is found in the data set of the central control unit, the feature vector of the sample is added to the feature vectors stored under the same label; the data set after insertion has the form:
φ = {φ_1, φ_2, ..., φ_L}
Figure BDA0002777231930000221
u∈[1,L]
wherein φ represents the combined signal data sample set of the central control unit, φ_u represents the combined signal data sample whose label is the same as that of the sample to be inserted, L_u represents the number of feature vectors in that combined signal data sample,
Figure BDA0002777231930000222
represents the label of the combined signal data sample whose label is the same as that of the sample to be inserted, that is, the label of the k_u-th sample of the original signal data,
Figure BDA0002777231930000231
represents the v-th feature vector in that combined signal data sample, that is, the feature vector of the k_{u,v}-th sample of the original signal data, v ∈ [1, L_u];
Figure BDA0002777231930000232
represents the feature vector of the sample to be inserted.
If the label of the sample to be inserted cannot be found in the data set of the central control unit, the insertion position is determined by the binary search and the sample is inserted into the central data set; the data set after insertion is:
φ = {φ_1, φ_2, ..., φ_L, φ_{L+1}}
Figure BDA0002777231930000233
u∈[1,L+1]
wherein φ represents the combined signal data sample set of the central control unit, φ_u represents the u-th combined signal data sample, L_u represents the number of feature vectors in the u-th combined signal data sample,
Figure BDA0002777231930000234
represents the label of the u-th combined signal data sample, that is, the label of the k_u-th sample of the original signal data,
Figure BDA0002777231930000235
represents the v-th feature vector in the u-th combined signal data sample, that is, the feature vector of the k_{u,v}-th sample of the original signal data, v ∈ [1, L_u]; φ_{L+1} represents the data sample of the sample to be inserted.
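The two insertion cases at the central control unit (label already present, label new) can be sketched with Python's standard `bisect` module. The parallel-list layout (`labels` kept sorted, `groups[i]` holding the feature vectors of `labels[i]`) and the function name are assumptions made for the illustration.

```python
import bisect


def insert_by_label(labels, groups, label, vector):
    """Insert a feature vector into the label-sorted central data set.

    labels: sorted list of labels; groups[i]: vectors stored under labels[i].
    Returns the index of the entry that received the vector.
    """
    pos = bisect.bisect_left(labels, label)     # binary search on the label
    if pos < len(labels) and labels[pos] == label:
        groups[pos].append(vector)              # label exists: extend its vectors
    else:
        labels.insert(pos, label)               # new label: insert in label order
        groups.insert(pos, [vector])
    return pos


labels, groups = [1, 3], [[[0.1]], [[0.3]]]
insert_by_label(labels, groups, 3, [0.4])       # existing label
insert_by_label(labels, groups, 2, [0.2])       # new label, lands between 1 and 3
```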
Step 4, one subsystem unit is selected at random as the subsystem unit to be inserted into, and the cluster code of the sample to be inserted is updated as in step 3.
The sample of the signal data to be inserted is packaged as:
data_k = {f_k, code_k}
k∈[0,n]
wherein data_k represents the sample of the k-th target signal data to be inserted, code_k represents the cluster code of that sample, whose initial value is -1, and n is the number of target signal data to be inserted;
calculating the similarity between the splitting node and the sample to be inserted, and updating the cluster code of the sample to be inserted according to the similarity:
the similarity calculation formula is as follows:
Figure BDA0002777231930000241
wherein sim_AB represents the similarity of the vectors A and B,
A and B are the two vectors, and A_i and B_i are their i-th components,
n denotes the dimension of the vectors, i ∈ [1, n].
The cluster code of the sample to be inserted is then updated according to the similarity, using the following update formula:
Figure BDA0002777231930000242
wherein code represents the cluster code to which the sample belongs,
sim represents the computed similarity between the sample and the split node,
thresh represents the threshold of the split node.
Step 4, this is repeated until no corresponding split node can be found in the subsystem unit.
Specifically: each time the cluster code of the sample to be inserted is updated, the split node set is searched again for a split node with the same cluster code as the sample; if one exists, the similarity between the sample and this new split node is calculated and the cluster code of the sample is updated again; otherwise the process ends.
Step 4, the split node set ψ_i is:
Figure BDA0002777231930000243
Figure BDA0002777231930000244
j∈[1,q]
wherein ψ_i represents the split node set of the i-th subsystem unit,
Figure BDA0002777231930000245
denotes the j-th split node in the i-th subsystem unit, M_j represents the code of the split node, T_j represents the threshold of the split node, and q represents the number of split nodes in the subsystem unit.
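Insertion into a subsystem unit, as described in step 4, can be sketched as a descent followed by an optional split of the receiving cluster. As in the other sketches, cosine similarity and the child-code rule (code c becomes 2(c+1) or 2(c+1)+1) are assumptions standing in for formulas that are only figure references, and all names are illustrative.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity; assumed stand-in for the patent's sim formula."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def insert_into_unit(split_nodes, samples, label, vector, limit, rng):
    """Insert one sample; split its cluster if it grows beyond `limit`.

    split_nodes: {code: (split_node_vector, thresh)}; samples: list of dicts.
    Returns the cluster code found by the descent (before any new split).
    """
    code = -1
    while code in split_nodes:                  # same descent as retrieval
        pivot, thresh = split_nodes[code]
        low = 2 * (code + 1)                    # HYPOTHETICAL child-code scheme
        code = low + 1 if cosine(vector, pivot) >= thresh else low
    samples.append({"id": label, "f": vector, "code": code})

    cluster = [s for s in samples if s["code"] == code]
    if len(cluster) > limit:                    # cluster too large: split once
        pivot = rng.choice(cluster)
        sims = [cosine(s["f"], pivot["f"]) for s in cluster]
        thresh = sorted(sims)[len(sims) // 2]
        if min(sims) < thresh:                  # skip a split with an empty side
            low = 2 * (code + 1)
            for s, sim in zip(cluster, sims):
                s["code"] = low + 1 if sim >= thresh else low
            split_nodes[code] = (pivot["f"], thresh)
    return code
```

Because only the receiving cluster is touched, an insertion does not require rebuilding the rest of the tree.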
Step 5: the label of the sample to be deleted is given manually, and the central control unit uses binary search on the labels to look up the combined data for that sample in its data set; each feature vector found is retrieved in the subsystem units, and a retrieval result whose label matches is deleted; if the number of remaining samples sharing the cluster code of the retrieval result sample is zero, the split node with that cluster code is deleted from the split node set of the subsystem unit. The sample deletion flow is shown in FIG. 8.
Step 5, the central control unit uses binary search on the labels to look up, in its own data set, the combined data for the sample to be deleted; if combined data with the same label as the sample to be deleted is found, the combined signal data for that label is extracted as:
Figure BDA0002777231930000251
u∈[1,L]
wherein φ_u denotes the combined signal data sample found in the central control unit whose label is the same as that of the sample to be deleted, L denotes the number of feature vectors in that combined signal data sample (that is, the number of features to be deleted), Id_k denotes the label of the sample to be deleted,
Figure BDA0002777231930000252
representing the v-th feature vector to be deleted.
Step 5, each feature vector to be deleted is used as a query and retrieved in each subsystem unit by the method of step 3 to obtain the top-1 result;
the feature vector of each signal data item to be deleted is packaged as data to be deleted:
data_k = {Id_k, f_k, code_k}
k∈[0,n]
wherein data_k denotes the k-th data item to be deleted, Id_k denotes the label of the target signal to be deleted, f_k denotes the k-th feature vector of the target signal data to be deleted, as held by the central control unit, code_k denotes the cluster code of the k-th data item to be deleted, whose initial value is -1, and n is the number of target signal data items to be deleted;
each feature vector to be deleted is used as the query and retrieved in each subsystem unit to obtain the top-1 result, in the same way as step 3: the similarity between the feature vector to be deleted and the split node is calculated and the cluster code of the data to be deleted is updated, until no split node with the same cluster code as the sample to be deleted can be found in the split node set.
Wherein the similarity calculation formula is as follows:
Figure BDA0002777231930000261
wherein simABRepresenting the degree of similarity of the vectors a, B,
a and B respectively represent two vectors, Ai,BiRepresents the ith dimension value of the vectors a, B,
n denotes the dimension of the vector, i ∈ [1, n ]
The cluster code of the sample to be deleted is then updated, and the cluster code updating formula is as follows:
Figure BDA0002777231930000262
where code denotes the cluster code to which the sample belongs,
sim denotes the similarity calculated between the sample and the split node,
and thresh denotes the threshold of the split node.
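The descent through the split nodes, which step 3 defines and step 5 reuses, can be sketched as follows. The exact code updating formula is only available as a figure, so the rule used here (a cluster with code c splits into 2c+2 and 2c+3) is a hypothetical stand-in chosen because it gives every cluster a unique code starting from -1. The names `descend` and `split_nodes` are likewise illustrative.

```python
def descend(query, split_nodes, similarity):
    """Follow split nodes from code -1 until no node with the current code exists.

    split_nodes maps a cluster code to (split-node vector, threshold).
    Returns the final cluster code of the query.
    """
    code = -1                                   # initial cluster code of every query
    while code in split_nodes:
        node_vector, thresh = split_nodes[code]
        sim = similarity(query, node_vector)
        # Hypothetical update rule; the patent's formula is not recoverable here.
        code = 2 * code + 2 if sim >= thresh else 2 * code + 3
    return code


def dot(a, b):
    # Plain dot product, used as a stand-in similarity for unit vectors.
    return sum(x * y for x, y in zip(a, b))


nodes = {-1: ([1.0, 0.0], 0.5), 0: ([0.0, 1.0], 0.5)}
code_a = descend([1.0, 0.0], nodes, dot)
code_b = descend([0.0, 1.0], nodes, dot)
```

Each query costs one similarity computation per level of the split tree, which is the source of the speed-up over comparing against every stored vector.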
In step 5, if the label of the retrieval result is the same as the label of the sample to be deleted, the retrieval result sample is deleted from the subsystem unit as follows:
data_k denotes the data to be deleted,
Figure BDA0002777231930000263
denotes the top-1 data output by the i-th subsystem unit when data_k is used as the sample to be retrieved,
Figure BDA0002777231930000264
denotes the sample with serial number m in the i-th subsystem unit; if
Figure BDA0002777231930000265
then delete
Figure BDA0002777231930000266
In step 5, the samples in the subsystem unit whose cluster code is the same as that of the retrieval result sample are counted; if the count is zero, the split node with that cluster code is deleted from the split node set of the subsystem unit, as follows:
data_k denotes the data to be deleted,
Figure BDA0002777231930000267
denotes the split node in the split node set of the i-th subsystem unit whose cluster code is the same as that of data_k; the number of samples in the i-th subsystem unit with the same cluster code as data_k is counted, and if that number is zero, delete
Figure BDA0002777231930000268
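The two deletion rules above can be sketched together. This is a minimal illustration that takes the text at face value; the dictionary layout (`label`, `code`) and the function name `delete_from_subsystem` are assumptions made for the example.

```python
def delete_from_subsystem(samples, split_nodes, top1_index, label_to_delete):
    """Delete the top-1 hit if its label matches, then prune an empty cluster.

    samples: list of dicts with keys 'label' and 'code' (one subsystem unit).
    split_nodes: list of dicts with key 'code' (split node set of that unit).
    Returns True if a sample was deleted.
    """
    hit = samples[top1_index]
    if hit["label"] != label_to_delete:
        return False                            # labels differ: nothing to delete
    code = hit["code"]
    del samples[top1_index]
    # Count the remaining samples with the same cluster code.
    remaining = sum(1 for s in samples if s["code"] == code)
    if remaining == 0:
        # No sample left in this cluster: drop split nodes carrying this code.
        split_nodes[:] = [n for n in split_nodes if n["code"] != code]
    return True
```

Both lists are modified in place, so the caller's view of the subsystem unit stays consistent after each deletion.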
It should be understood that parts of the application not described in detail are prior art.
It should be understood that the above description of the preferred embodiments is given for clearness of understanding and no unnecessary limitations should be understood therefrom, and all changes and modifications may be made by those skilled in the art without departing from the scope of the invention as defined by the appended claims.

Claims (6)

1. A retrieval method based on a rapid retrieval system facing a mass vector library is characterized in that:
the rapid retrieval system facing the massive vector library comprises: a central control unit and a plurality of subsystem units;
the central control unit is sequentially connected with the plurality of subsystem units;
the retrieval method comprises the following steps:
step 1: the central control unit extracts a feature vector of an original signal, manually marks a label of original signal data, constructs a sample of the original signal data by combining the feature vector of the original signal, sorts the sample of the original signal data according to the label of the original signal data to obtain sorted signal data samples, and combines the sorted signal data samples corresponding to the labels in the same sorted signal data samples to obtain a combined signal data sample set;
step 2: the central control unit distributes a data sample set to each subsystem unit, each subsystem unit recombines the distributed data sample sets, selects a sample set to be split from the recombined sample set, randomly selects a sample from the sample set to be split as a splitting node, calculates the similarity between all samples in the sample set to be split and the splitting node, sorts the sample set to be split by utilizing the similarity, selects the similarity of an intermediate sample as a threshold, updates the code of each sample to be split according to the relation between the similarity and the threshold, continuously and repeatedly splits each cluster until the number of the samples contained in each cluster is less than the designated number, updates the splitting node at the same time, and adds the splitting node into the splitting node set;
step 3: the central control unit extracts a feature vector of a target signal to be retrieved, packages it into a form containing the feature vector and a cluster code, and distributes it to each subsystem unit; each subsystem unit finds the node with cluster code -1 in its split node set, calculates the similarity between the split node and the sample to be retrieved, and updates the cluster code of the sample to be retrieved according to the similarity, repeating these steps until no split node with the same cluster code as the sample to be retrieved can be found in the subsystem unit; all samples in the subsystem unit whose cluster code is consistent with that of the sample to be retrieved are then taken out and sorted by similarity, and the m samples with the greatest similarity are selected as the retrieval result of the subsystem unit and uploaded to the central control unit; the central control unit combines the retrieval results uploaded by the subsystem units, sorts them by similarity to obtain a combined set, eliminates samples with consistent labels, and selects the k samples with the greatest similarity as the final output, namely topK;
step 4: the central control unit extracts a feature vector of original information of a sample to be inserted, manually marks a label of the sample to be inserted, and inserts the sample to be inserted into the data set of the central control unit in label order, using binary search on the label attribute; a subsystem unit is randomly selected as the subsystem unit to be inserted into, and the cluster code of the sample to be inserted is updated according to step 3 until no corresponding split node can be found in the subsystem unit; the samples in that subsystem unit with the same cluster code as the sample to be inserted are counted, and if the number is larger than a specified threshold, one sample is selected from the cluster as a split node, the similarity between all samples in the cluster and the split node is calculated, the cluster codes of all nodes in the cluster are updated, and the split node is added to the split node set;
step 5: a label of a sample to be deleted is given manually, and the central control unit searches the data set corresponding to the central control unit, by binary search on the label, for the combined data of the sample to be deleted; if the search result is zero, the split nodes in the split node set of the subsystem unit that have the same cluster code as the search result sample are deleted.
2. The retrieval method based on the rapid retrieval system facing the massive vector library according to claim 1, characterized in that:
in step 1, the feature vector of the original signal is extracted as follows:
f_k = F(x_k)
k ∈ [0, n]
where x_k is the k-th original signal data, n is the number of original signal data, F denotes the feature extractor, and f_k is the feature vector of the k-th original signal data;
in step 1, the label of the original signal data is Id_k, where Id_k denotes the label of the k-th original signal data;
in step 1, the samples of the original signal data are:
data_k = {Id_k, f_k}
k ∈ [0, n]
where data_k denotes the sample of the k-th original signal data, and n is the number of original signal data;
in step 1, the samples of the original signal data are sorted as follows:
the samples are sorted by the label of the original signal data, i.e. Id_k, from small to large, and the samples of the sorted signal data are:
Figure FDA0002777231920000031
Figure FDA0002777231920000032
i ∈ [1, n]
k_i ∈ [1, n]
where
Figure FDA0002777231920000033
denotes the i-th sorted signal data sample, i.e. the sample corresponding to the k_i-th original signal data, n being the number of original signal data,
Figure FDA0002777231920000034
denotes the label in the i-th sorted signal data sample, and
Figure FDA0002777231920000035
denotes the feature vector in the i-th sorted signal data sample;
in step 1, the sorted signal data samples that carry the same label are merged as follows:
if
Figure FDA0002777231920000036
and
Figure FDA0002777231920000037
are the same, with i ≠ j, then
Figure FDA0002777231920000038
and
Figure FDA0002777231920000039
are merged into the same combined signal data sample;
in step 1, the combined signal data sample set is:
φ = {φ_1, φ_2, ..., φ_L}
Figure FDA00027772319200000310
u ∈ [1, L]
where φ denotes the set of combined signal data samples, φ_u denotes the u-th combined signal data sample, L_u denotes the number of feature vectors in the u-th combined signal data sample,
Figure FDA00027772319200000311
denotes the label in the u-th combined signal data sample, i.e. the label of the k_u-th original signal data sample, and
Figure FDA00027772319200000312
denotes the v-th feature vector in the u-th combined signal data sample, i.e. the feature vector of the k_{u,v}-th original signal data sample, v ∈ [1, L_u].
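The sort-then-merge procedure of step 1 can be sketched as follows, assuming each sample is a `(label, feature_vector)` pair with comparable labels. The function name `merge_by_label` and the tuple layout are illustrative assumptions.

```python
from itertools import groupby
from operator import itemgetter


def merge_by_label(samples):
    """Sort samples by label, then merge samples that share a label.

    samples: iterable of (label, feature_vector) pairs.
    Returns a list of (label, [feature_vector, ...]) in ascending label order.
    """
    ordered = sorted(samples, key=itemgetter(0))        # sort by Id_k, small to large
    return [
        (label, [vector for _, vector in group])        # one combined sample per label
        for label, group in groupby(ordered, key=itemgetter(0))
    ]


combined = merge_by_label([(3, [0.1]), (1, [0.2]), (3, [0.3])])
```

Sorting first is what makes the later binary search on labels possible, since the combined set stays ordered by label.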
3. The retrieval method based on the rapid retrieval system facing the massive vector library according to claim 1, characterized in that:
in step 2, the central control unit allocates a data sample set to each subsystem unit as follows:
Figure FDA00027772319200000313
Figure FDA00027772319200000314
u ∈ [1, c]
where φ_i denotes the set of data samples assigned by the central control unit to the i-th subsystem unit,
Figure FDA00027772319200000315
denotes the u-th combined signal data sample in the i-th subsystem unit, L_u denotes the number of feature vectors in the u-th combined signal data sample,
Figure FDA00027772319200000316
denotes the label in the u-th combined signal data sample, i.e. the label of the k_u-th original signal data sample,
Figure FDA0002777231920000041
denotes the v-th feature vector in the u-th combined signal data sample, i.e. the feature vector of the k_{u,v}-th original signal data sample, v ∈ [1, L_u], and c denotes the number of sample labels contained in each subsystem unit;
in step 2, each subsystem unit recombines the distributed data sample set into:
Figure FDA0002777231920000042
Figure FDA0002777231920000043
j ∈ [1, L]
where
Figure FDA0002777231920000044
denotes the recombined data sample set of the i-th subsystem unit,
Figure FDA0002777231920000045
denotes the j-th signal data sample after recombination in the i-th subsystem unit, L denotes the number of samples after recombination in the i-th subsystem unit, Id_j denotes the label of the j-th recombined signal data sample, f_j denotes the feature vector of the j-th recombined signal data sample, node indicates whether the sample is selected as a split node and defaults to false, sim denotes the similarity between the sample and the split node, and code denotes the cluster code to which the sample belongs, with initial value -1;
in step 2, the sample set to be split is selected from the recombined sample set as follows:
Figure FDA0002777231920000046
Figure FDA0002777231920000047
j ∈ [1, l]
where
Figure FDA0002777231920000048
denotes the sample set to be split of the i-th subsystem unit after recombination (i.e. the set of data samples with code = M),
Figure FDA0002777231920000049
denotes the j-th sample with code = M after recombination in the i-th subsystem unit, l denotes the number of samples with code = M after recombination in the i-th subsystem unit, and s_j denotes the similarity between
Figure FDA00027772319200000410
and the split node;
in step 2, the formula for calculating the similarity between a sample to be split and the split node is as follows:
Figure FDA00027772319200000411
where sim_AB denotes the similarity of the vectors A and B,
A and B denote the two vectors, A_i and B_i denote the i-th components of A and B,
and n denotes the dimension of the vectors, i ∈ [1, n];
in step 2, all samples in the sample set to be split are sorted by similarity, and the sorted sample set is:
Figure FDA0002777231920000051
Figure FDA0002777231920000052
i ∈ [1, N]
k_j ∈ [1, l]
where
Figure FDA0002777231920000053
denotes the j-th sample after sorting by sim, i denoting the i-th subsystem unit, i.e. it corresponds to the k_j-th sample in
Figure FDA0002777231920000054
and l is the number of samples in
Figure FDA0002777231920000055
Figure FDA0002777231920000056
denotes the label of the i-th sorted sample, and
Figure FDA0002777231920000057
denotes the feature vector of the i-th sorted sample;
in step 2, the similarity of the middle sample is selected as the threshold, in the following form:
Figure FDA0002777231920000058
where
Figure FDA0002777231920000059
denotes the similarity, in the similarity-sorted sample set to be split, of sample number
Figure FDA00027772319200000510
which is used as the threshold of the split node;
in step 2, the code of each sample to be split is updated according to the relationship between the similarity and the threshold by the following formula:
Figure FDA00027772319200000511
where code denotes the cluster code to which the sample belongs,
sim denotes the similarity calculated between the sample and the split node,
and thresh denotes the threshold of the split node;
in step 2, the splitting of each cluster is repeated until the number of samples contained in each cluster is less than the specified number, specifically:
each time the cluster codes of the samples to be split are updated, the number of samples in each updated cluster of the sample set is counted; if the number exceeds the specified number, one sample is randomly selected from the new cluster as a split node, the similarity between all samples in the new cluster and the new split node is calculated, and the cluster codes of all samples in the new cluster are updated; otherwise, the splitting ends;
in step 2, each split node is updated as follows:
Figure FDA0002777231920000061
where
Figure FDA0002777231920000062
denotes the split node numbered M in the i-th subsystem unit, M denotes the region code to which the split node belongs, and thresh denotes the threshold of the split node;
in step 2, the split node set ψ_i is:
Figure FDA0002777231920000063
Figure FDA0002777231920000064
j ∈ [1, q]
where ψ_i denotes the split node set of the i-th subsystem unit,
Figure FDA0002777231920000065
denotes the j-th split node in the i-th subsystem unit, M_j denotes the code of the split node, T_j denotes the threshold of the split node, and q denotes the number of split nodes of the subsystem unit.
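The repeated splitting of step 2 can be sketched as follows. Several details are assumptions because the formulas survive only as figure references: cosine similarity is assumed as the measure, the child-code rule (2c+2 and 2c+3 for a cluster with code c) is a hypothetical stand-in for the patent's code updating formula, and the tie guard that stops splitting when no sample falls below the threshold is added here only to guarantee termination. The names `build_split_nodes`, `limit` and `seed` are illustrative.

```python
import math
import random


def cosine(a, b):
    # Assumed similarity measure; returns 0.0 if either vector has zero norm.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def build_split_nodes(vectors, limit, seed=0):
    """Return (codes, split_nodes) for one subsystem unit.

    codes[i] is the final cluster code of vectors[i]; split_nodes maps a
    cluster code M to (split-node vector, threshold T).
    """
    rng = random.Random(seed)
    codes = [-1] * len(vectors)               # every sample starts with code -1
    split_nodes = {}
    pending = [-1]                            # clusters that may still need splitting
    while pending:
        code = pending.pop()
        members = [i for i, c in enumerate(codes) if c == code]
        if len(members) < limit:
            continue                          # cluster is already small enough
        node = vectors[rng.choice(members)]   # random sample as split node
        sims = {i: cosine(vectors[i], node) for i in members}
        ordered = sorted(sims.values())
        thresh = ordered[len(ordered) // 2]   # similarity of the middle sample
        if not any(sims[i] < thresh for i in members):
            continue                          # ties only: no useful split, stop here
        for i in members:
            # Hypothetical code rule: children of c are 2c+2 and 2c+3.
            codes[i] = 2 * code + 2 if sims[i] >= thresh else 2 * code + 3
        split_nodes[code] = (node, thresh)
        pending.extend([2 * code + 2, 2 * code + 3])
    return codes, split_nodes
```

Choosing the middle similarity as the threshold splits each cluster roughly in half, so the depth of the resulting tree grows about logarithmically with the number of samples in the unit.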
4. The retrieval method based on the rapid retrieval system facing the massive vector library according to claim 1, characterized in that:
in step 3, the feature vector of the target signal to be retrieved is extracted as follows:
f_k = F(x_k)
k ∈ [0, n]
where x_k is the k-th target signal data to be retrieved, n is the number of target signal data to be retrieved, F denotes the feature extractor, and f_k is the feature vector of the k-th target signal data to be retrieved;
in step 3, the feature vector is packaged into a form containing the feature vector and the cluster code and distributed to each subsystem unit, and the search starts from the node with cluster code -1 in the split node set of each subsystem unit:
Id_k, where Id_k denotes the label of the k-th original signal data;
in step 3, the sample of the target signal data to be retrieved is:
data_k = {f_k, code_k = -1}
k ∈ [0, n]
where data_k denotes the sample of the k-th target signal data to be retrieved, code_k denotes the cluster code of the sample of the k-th target signal data to be retrieved, with initial value -1, and n is the number of target signal data to be retrieved;
in step 3, the similarity between the split node and the sample to be retrieved is calculated, and the cluster code of the sample to be retrieved is updated according to the similarity;
in step 3, the similarity calculation formula is as follows:
Figure FDA0002777231920000071
where sim_AB denotes the similarity of the vectors A and B,
A and B denote the two vectors, A_i and B_i denote the i-th components of A and B,
and n denotes the dimension of the vectors, i ∈ [1, n];
in step 3, the cluster code of the sample to be retrieved is updated according to the similarity as follows:
Figure FDA0002777231920000072
where code denotes the cluster code to which the sample belongs,
sim denotes the similarity calculated between the sample and the split node,
and thresh denotes the threshold of the split node;
these steps are repeated until no split node with the same cluster code as the sample to be retrieved can be found in the subsystem unit:
each time the cluster code of the sample to be retrieved is updated, the split node set is searched again for a split node whose cluster code is the same as the current cluster code of the sample to be retrieved; if one exists, the similarity between the sample to be retrieved and the new split node is calculated and the cluster code of the sample to be retrieved is updated; otherwise, the search ends;
in step 3, all samples in the subsystem unit whose cluster code is the same as that of the sample to be retrieved are then taken out:
the codes of all samples in the sample set of the subsystem unit are traversed, and all samples with the same cluster code as the sample to be retrieved are collected into a set of the following form:
Figure FDA0002777231920000073
Figure FDA0002777231920000074
j ∈ [1, l]
where
Figure FDA0002777231920000075
denotes the set of samples with code = M in the i-th subsystem unit,
Figure FDA0002777231920000076
denotes the j-th sample with code = M, and l is the number of samples in
Figure FDA0002777231920000077
in step 3, all samples in the subsystem unit whose cluster code is consistent with that of the sample to be retrieved are sorted by similarity, and the sorted sample set is:
Figure FDA0002777231920000081
Figure FDA0002777231920000082
k_j ∈ [1, l]
where
Figure FDA0002777231920000083
denotes the j-th sample after sorting by sim, i denoting the i-th subsystem unit, i.e. it corresponds to the k_j-th sample in
Figure FDA0002777231920000084
and l is the number of samples in
Figure FDA0002777231920000085
Figure FDA0002777231920000086
denotes the label of the i-th sorted sample, and
Figure FDA0002777231920000087
denotes the feature vector of the i-th sorted sample;
in step 3, the central control unit combines the retrieval results uploaded by each subsystem unit to obtain a combined set, as follows:
Figure FDA0002777231920000088
Figure FDA0002777231920000089
k_j ∈ [1, m×N]
where φ_tar denotes the data set obtained by merging the results of all subsystem units and sorting them by sim;
in step 3, samples with consistent labels are removed as follows:
if
Figure FDA00027772319200000810
and
Figure FDA00027772319200000811
are the same, with i ≠ j, and
Figure FDA00027772319200000812
then delete
Figure FDA00027772319200000813
otherwise delete
Figure FDA00027772319200000814
in step 3, the k samples with the greatest similarity are selected as the final output:
Figure FDA00027772319200000815
Figure FDA00027772319200000816
i ∈ [1, k]
where out_tar denotes the output retrieval set, and
Figure FDA00027772319200000817
denotes the sample in the retrieval result with the i-th greatest similarity to the sample to be retrieved.
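The merging stage at the central control unit can be sketched as follows. The condition that decides which of two same-label samples is deleted is only available as a figure; this sketch assumes the one with the lower similarity is dropped. The function name `merge_topk` and the `(label, sim)` tuple layout are illustrative.

```python
def merge_topk(per_unit_results, k):
    """Merge per-subsystem results, keep one sample per label, return the top k.

    per_unit_results: list (one entry per subsystem unit) of lists of
    (label, sim) pairs, each holding that unit's m best samples.
    """
    merged = sorted(
        (result for unit in per_unit_results for result in unit),
        key=lambda result: result[1],
        reverse=True,                       # highest similarity first
    )
    seen_labels = set()
    output = []
    for label, sim in merged:
        if len(output) >= k:
            break
        if label in seen_labels:
            continue                        # assumed rule: keep the most similar per label
        seen_labels.add(label)
        output.append((label, sim))
    return output


top2 = merge_topk([[("a", 0.9), ("b", 0.5)], [("a", 0.8), ("c", 0.7)]], 2)
```

With N subsystem units each returning m samples, the central control unit sorts at most m×N entries, which matches the index range k_j ∈ [1, m×N] given for the combined set.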
5. The retrieval method based on the rapid retrieval system facing the massive vector library according to claim 1, characterized in that:
in step 4, the feature vector of the original information of the sample to be inserted is extracted as follows:
f_k = F(x_k)
k ∈ [0, n]
where x_k is the k-th original signal data to be inserted, n is the number of original signal data to be inserted, F denotes the feature extractor, and f_k is the feature vector of the k-th original signal data to be inserted;
in step 4, the label of the original signal data to be inserted is Id_k, where Id_k denotes the label of the k-th original signal data to be inserted;
in step 4, the samples of the original signal data to be inserted are:
data_k = {Id_k, f_k}
k ∈ [0, n]
where data_k denotes the sample of the k-th original signal data to be inserted, and n is the number of original signal data to be inserted;
in step 4, the sample to be inserted is inserted into the data set of the central control unit in label order (as shown in fig. 2), using binary search on the label attribute:
if the label of the sample to be inserted is found in the data set of the central control unit, the feature vector of the sample to be inserted is added to the feature vectors with the same label in the data set, and the data set after insertion has the form:
φ = {φ_1, φ_2, ..., φ_L}
Figure FDA0002777231920000091
u ∈ [1, L]
where φ denotes the combined signal data sample set of the central control unit, φ_u denotes the combined signal data sample whose label is the same as that of the sample to be inserted, L_u denotes the number of feature vectors in that combined signal data sample,
Figure FDA0002777231920000092
denotes the label of the combined signal data sample that is the same as the label of the sample to be inserted, i.e. the label of the k_u-th original signal data sample,
Figure FDA0002777231920000093
denotes the v-th feature vector in the combined signal data sample whose label is the same as that of the sample to be inserted, i.e. the feature vector of the k_{u,v}-th original signal data sample, v ∈ [1, L_u], and
Figure FDA0002777231920000094
denotes the feature vector of the sample to be inserted;
if the label of the sample to be inserted cannot be found in the data set of the central control unit, the position of the sample to be inserted is determined by the binary search, the sample is inserted into the central data set at that position, and the data set after insertion is:
φ = {φ_1, φ_2, ..., φ_L, φ_{L+1}}
Figure FDA0002777231920000095
u ∈ [1, L+1]
where φ denotes the combined signal data sample set of the central control unit, φ_u denotes the u-th combined signal data sample, L_u denotes the number of feature vectors in the u-th combined signal data sample,
Figure FDA0002777231920000096
denotes the label in the u-th combined signal data sample, i.e. the label of the k_u-th original signal data sample,
Figure FDA0002777231920000101
denotes the v-th feature vector in the u-th combined signal data sample, i.e. the feature vector of the k_{u,v}-th original signal data sample, v ∈ [1, L_u], and φ_{L+1} denotes the data sample of the sample to be inserted;
in step 4, a subsystem unit is randomly selected as the subsystem unit to be inserted into, and the cluster code of the sample to be inserted is updated according to step 3;
the samples of the signal data to be inserted are packaged as:
data_k = {f_k, code_k}
k ∈ [0, n]
where data_k denotes the sample of the k-th target signal data to be inserted, code_k denotes the cluster code of the sample of the k-th target signal data to be inserted, with initial value -1, and n is the number of target signal data to be inserted;
the similarity between the split node and the sample to be inserted is calculated, and the cluster code of the sample to be inserted is updated according to the similarity:
the similarity calculation formula is as follows:
Figure FDA0002777231920000102
where sim_AB denotes the similarity of the vectors A and B,
A and B denote the two vectors, A_i and B_i denote the i-th components of A and B,
and n denotes the dimension of the vectors, i ∈ [1, n];
the cluster code of the sample to be inserted is updated according to the similarity, and the cluster code updating formula is as follows:
Figure FDA0002777231920000103
where code denotes the cluster code to which the sample belongs,
sim denotes the similarity calculated between the sample and the split node,
and thresh denotes the threshold of the split node;
in step 4, the update is repeated until no corresponding split node can be found in the subsystem unit;
that is, each time the cluster code of the sample to be inserted is updated, the split node set is searched again for a split node whose cluster code is the same as that of the sample to be inserted; if one exists, the similarity between the sample to be inserted and the new split node is calculated and the cluster code of the sample to be inserted is updated; otherwise, the update ends;
in step 4, the split node set ψ_i is:
Figure FDA0002777231920000111
Figure FDA0002777231920000112
j ∈ [1, q]
where ψ_i denotes the split node set of the i-th subsystem unit,
Figure FDA0002777231920000113
denotes the j-th split node in the i-th subsystem unit, M_j denotes the code of the split node, T_j denotes the threshold of the split node, and q denotes the number of split nodes of the subsystem unit.
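The label-ordered insertion at the central control unit can be sketched with Python's standard `bisect` module. The parallel-list layout (`labels`, `groups`) and the function name `insert_sample` are assumptions made for the example.

```python
from bisect import bisect_left


def insert_sample(labels, groups, label, vector):
    """Insert a feature vector into a label-sorted data set using binary search.

    labels: sorted list of distinct labels.
    groups: parallel list; groups[u] holds the feature vectors for labels[u].
    Returns the position of the label after insertion.
    """
    pos = bisect_left(labels, label)            # binary search on the label attribute
    if pos < len(labels) and labels[pos] == label:
        groups[pos].append(vector)              # label exists: extend its vectors
    else:
        labels.insert(pos, label)               # new label: insert at the sorted position
        groups.insert(pos, [vector])
    return pos


labels = [1, 3]
groups = [[[0.1]], [[0.3]]]
pos_new = insert_sample(labels, groups, 2, [0.2])
pos_existing = insert_sample(labels, groups, 3, [0.9])
```

The two branches correspond to the two cases in the claim: an existing label keeps L combined samples, while a new label grows the set to L+1.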
6. The retrieval method based on the rapid retrieval system facing the massive vector library according to claim 1, characterized in that:
in step 5, the central control unit searches the data set corresponding to the central control unit, by binary search on the label, for the combined data of the sample to be deleted, and if combined data with the same label as the sample to be deleted is found in the data set of the central control unit, the combined signal data for the label to be deleted is extracted as follows:
Figure FDA0002777231920000114
u ∈ [1, L]
where φ_u denotes the combined signal data sample found in the central control unit whose label is the same as that of the sample to be deleted, L denotes the number of feature vectors in that combined signal data sample (that is, the number of features to be deleted), Id_k denotes the label of the sample to be deleted, and
Figure FDA0002777231920000115
denotes the v-th feature vector to be deleted;
in step 5, the feature vector of each item to be deleted is used as the feature vector to be retrieved and is searched in every subsystem unit by the method of step 3 to obtain the top-1 result;
the feature vector of each signal data item carrying the label to be deleted is packaged as data to be deleted:
data_k = {Id_k, f_k, code_k}
k ∈ [0, n]
where data_k denotes the k-th item to be deleted, Id_k denotes the label of the target signal to be deleted, f_k denotes the k-th feature vector of the target signal data to be deleted by the central control unit, code_k denotes the cluster code of the k-th item of the target signal data to be deleted, with initial value -1, and n is the number of target signal data items to be deleted;
the retrieval is consistent with step 3: the similarity between the feature vector of the item to be deleted and the split node is calculated, and the cluster code of the item to be deleted is updated, until no split node with the same cluster code as the sample to be deleted can be found in the split node set;
the similarity calculation formula is as follows:
Figure FDA0002777231920000121
where sim_AB denotes the similarity of the vectors A and B,
A and B denote the two vectors, A_i and B_i denote the i-th components of A and B,
and n denotes the dimension of the vectors, i ∈ [1, n];
the cluster code of the sample to be deleted is updated, and the cluster code updating formula is as follows:
Figure FDA0002777231920000122
where code denotes the cluster code to which the sample belongs,
sim denotes the similarity calculated between the sample and the split node,
and thresh denotes the threshold of the split node;
in step 5, if the label of the retrieval result is the same as the label of the sample to be deleted, the retrieval result sample is deleted from the subsystem unit as follows:
data_k denotes the data to be deleted,
Figure FDA0002777231920000123
denotes the top-1 data output by the i-th subsystem unit when data_k is used as the sample to be retrieved,
Figure FDA0002777231920000124
denotes the sample with serial number m in the i-th subsystem unit; if
Figure FDA0002777231920000125
then delete
Figure FDA0002777231920000126
in step 5, the samples in the subsystem unit whose cluster code is the same as that of the retrieval result sample are counted; if the count is zero, the split node with that cluster code is deleted from the split node set of the subsystem unit, as follows:
data_k denotes the data to be deleted,
Figure FDA0002777231920000131
denotes the split node in the split node set of the i-th subsystem unit whose cluster code is the same as that of data_k; the number of samples in the i-th subsystem unit with the same cluster code as data_k is counted, and if that number is zero, delete
Figure FDA0002777231920000132
CN202011269580.6A 2020-11-13 2020-11-13 Rapid retrieval system and method for massive vector libraries Active CN112364080B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011269580.6A CN112364080B (en) 2020-11-13 2020-11-13 Rapid retrieval system and method for massive vector libraries

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011269580.6A CN112364080B (en) 2020-11-13 2020-11-13 Rapid retrieval system and method for massive vector libraries

Publications (2)

Publication Number Publication Date
CN112364080A true CN112364080A (en) 2021-02-12
CN112364080B CN112364080B (en) 2024-04-09

Family

ID=74514764

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011269580.6A Active CN112364080B (en) 2020-11-13 2020-11-13 Rapid retrieval system and method for massive vector libraries

Country Status (1)

Country Link
CN (1) CN112364080B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102156728A (en) * 2011-03-31 2011-08-17 河南理工大学 Improved personalized summary system based on user interest model
CN103699648A (en) * 2013-12-26 2014-04-02 成都市卓睿科技有限公司 Tree-form data structure used for quick retrieval and implementation method of tree-form data structure
CN107563715A (en) * 2017-07-19 2018-01-09 天津云脉三六五科技有限公司 Foreign trade set-off marketing system and method
CN109918529A (en) * 2019-02-25 2019-06-21 重庆邮电大学 An Image Retrieval Method Based on Tree Clustering Vector Quantization

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116595065A (en) * 2023-05-09 2023-08-15 上海任意门科技有限公司 Content duplicate identification method, device, system and storage medium
CN116595065B (en) * 2023-05-09 2024-04-02 上海任意门科技有限公司 Content duplicate identification method, device, system and storage medium

Also Published As

Publication number Publication date
CN112364080B (en) 2024-04-09

Similar Documents

Publication Publication Date Title
US6792163B2 (en) Method and apparatus for searching, browsing and summarizing moving image data using fidelity of tree-structured moving image hierarchy
CN112307762B (en) Search result sorting method and device, storage medium and electronic device
CN111611488B (en) Information recommendation method and device based on artificial intelligence and electronic equipment
CN113298197B (en) Data clustering method, device, equipment and readable storage medium
JP2023502863A (en) Image incremental clustering method and apparatus, electronic device, storage medium and program product
CN106874445A (en) Cloud image recognition method based on vocabulary tree retrieval and similarity verification
CN113590898A (en) Data retrieval method and device, electronic equipment, storage medium and computer product
CN112364080B (en) Rapid retrieval system and method for massive vector libraries
CN109344309A (en) Large-scale document image classification method and system based on convolutional neural network stacking
CN110110120B (en) Image retrieval method and device based on deep learning
JPH08305718A (en) Method and device for processing information
CN108256083A (en) Content recommendation method based on deep learning
CN116304213B (en) Subgraph matching query optimization method for RDF graph database based on graph neural network
CN113204676B (en) Compression storage method based on graph structure data
CN112905792B (en) Text clustering method, device, equipment and storage medium based on non-text scene
CN102369525A (en) System for searching visual information
CN116680325A (en) Time-series record link data matching method and device based on attribute correlation
CN115329133A (en) Remote sensing video hash retrieval method based on key frame fusion and attention mechanism
CN108256086A (en) Data characteristics statistical analysis technique
CN108280176A (en) Data mining optimization method based on MapReduce
JP3497713B2 (en) Information classification method, apparatus and system
CN118093659B (en) Database high-dimensional query method based on three-input network and high-point tree
CN105205172A (en) Database retrieval method
CN116821404B (en) Data retrieval method, device, apparatus, storage medium and program product
CN106844725A (en) Cloud image database generation and recognition method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant