CN112364080A - Rapid retrieval system and method for massive vector library - Google Patents
Rapid retrieval system and method for massive vector library
- Publication number
- CN112364080A (application number CN202011269580.6A)
- Authority
- CN
- China
- Prior art keywords
- sample
- signal data
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06F16/2471 — Information retrieval of structured data; special types of queries: distributed queries
- G06F16/2465 — Query processing support for facilitating data mining operations in structured databases
- G06F16/285 — Databases characterised by their database models: clustering or classification
Abstract
The invention provides a rapid retrieval system and method for a massive vector library. The system comprises a central control unit and a plurality of subsystem units; the central control unit is responsible for extracting signal feature vectors and for distributing tasks and merging results. Each subsystem unit builds a data structure that requires no clustering and supports incremental updates: constructing the data structure is simpler, the constructed data structure is independent of the distribution of the data set, massive vector sets can be retrieved quickly, and samples can be inserted and deleted dynamically, so the system satisfies more practical scene requirements. The invention randomly selects split nodes from the sample set to be split, which simplifies the computation and ensures that the constructed data structure does not depend on the original data distribution; samples can be added to and deleted from the data structure dynamically, without rebuilding the data structure model after every addition or deletion.
Description
Technical Field
The invention belongs to the field of massive vector retrieval, and particularly relates to a massive vector library-oriented rapid retrieval system and method.
Background
Current massive-vector retrieval methods fall into three categories. The first is distributed search based on frameworks such as Hadoop: the target feature vector is dispatched to different subsystem units, each subsystem unit completes its own retrieval task independently, and the individual retrieval results are finally merged into the final result. The second is based on a data structure: the massive vector features are first partitioned by clustering, and a data structure model is then built from the clustering result; at retrieval time, the data structure is used to quickly find the cluster to which the target feature vector belongs, and all samples in that cluster are traversed to complete the retrieval. The third is the cascade method: samples are first filtered with simple features to narrow the retrieval range, and accurate retrieval is then performed within the reduced range.
The major disadvantages of the Hadoop-style framework are its large amount of computation, high resource consumption and low retrieval efficiency: it performs a brute-force search, so the sample to be retrieved must be matched against every sample in the sample library.
The main drawbacks of the data-structure-based method are that the sample library cannot be updated dynamically and that the memory requirement is high. The sample data must be clustered in advance and the data structure model generated from the clustering result; when the data volume is large, both the feature clustering and the construction of the data structure model are time-consuming. During search, the whole data structure model is loaded into memory, and the size of the model is proportional to the number of samples.
The main drawbacks of the cascade method are low precision and low retrieval efficiency. Simple features cannot fully describe the real information of a sample, so screening and filtering with simple features may degrade performance; moreover, the filtering stage must still compute similarities against all samples, and although simple features reduce the cost per comparison, the sample set is huge, the time consumed cannot be ignored, and efficiency remains low.
In summary, the main technical problems of existing massive-vector retrieval methods are as follows:
Low retrieval efficiency: the retrieval time is proportional to the size of the sample library, and when the library is very large (millions of samples or more), the vector retrieval speed cannot meet real-time requirements.
No dynamic insertion or deletion of samples: existing rapid massive-vector retrieval first clusters the sample library and then builds a specific data structure model on top of the clustering. Once the data structure model is built, samples cannot be added or deleted.
High resource occupancy: if a 512-dimensional vector describes one sample, each sample needs about 2 KB of space, and with more than 100 million samples the required storage exceeds 200 GB. To achieve fast retrieval, all of this data usually has to be loaded into memory, which consumes enormous resources.
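The storage figure quoted above follows from simple arithmetic; a minimal check in Python, assuming 4-byte (float32) vector components, which the patent does not state explicitly:

```python
# Back-of-the-envelope check of the storage estimate quoted above.
DIM = 512              # vector dimension per sample
BYTES_PER_DIM = 4      # assumed float32 components; the patent only gives the ~2 KB total
NUM_SAMPLES = 100_000_000

bytes_per_sample = DIM * BYTES_PER_DIM               # 2048 bytes, i.e. about 2 KB
total_gb = NUM_SAMPLES * bytes_per_sample / 1e9      # decimal gigabytes

print(f"{bytes_per_sample} B per sample, about {total_gb:.0f} GB in total")
# -> 2048 B per sample, about 205 GB in total (consistent with "exceeds 200 GB")
```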
Disclosure of Invention
In order to solve the technical problems, the invention provides a rapid retrieval system and a rapid retrieval method for a massive vector library.
To solve the above technical problems in existing massive vector library retrieval, the invention adopts the following system: it comprises a central control unit and a plurality of subsystem units, and the central control unit is connected to each of the subsystem units in turn.
The technical scheme of the invention is a rapid retrieval method oriented to a massive vector library, characterized by comprising the following steps:
Step 1: the central control unit extracts the feature vector of each original signal and manually labels the original signal data; it combines the feature vector and the label into a sample of the original signal data, sorts the samples by label to obtain the sorted signal data samples, and merges sorted samples that share the same label, obtaining the merged signal data sample set.
Step 2: the central control unit allocates a data sample set to each subsystem unit. Each subsystem unit reorganizes the allocated set and selects from it the sample set to be split; it randomly selects one sample from the set to be split as the split node, computes the similarity between every sample in the set and the split node, sorts the set by this similarity, and takes the similarity of the middle sample as the threshold; the code of each sample to be split is then updated according to the relation between its similarity and the threshold. Each resulting cluster is split again in the same way until every cluster contains fewer than the specified number of samples; the split nodes are updated along the way and added to the split node set.
Step 3: the central control unit extracts the feature vector of the target signal to be retrieved, packages it into a form containing the feature vector and a cluster code, and distributes it to each subsystem unit. Each subsystem unit starts from the split node whose cluster code is -1 in its split node set, computes the similarity between that split node and the sample to be retrieved, and updates the cluster code of the sample to be retrieved according to the similarity; this is repeated until no split node with the same code as the sample to be retrieved can be found in the subsystem unit. At that point all samples in the subsystem unit whose cluster code matches that of the sample to be retrieved are taken out, sorted by similarity, and the m samples with the highest similarity are selected as the retrieval result of that subsystem unit and uploaded to the central control unit. The central control unit merges the retrieval results uploaded by the subsystem units, sorts them by similarity to obtain a merged set, removes samples with duplicate labels, and selects the k samples with the highest similarity as the final output, i.e. the topK.
Step 4: the central control unit extracts the feature vector of the original information of the sample to be inserted and manually labels it, and inserts the sample into the corresponding data set of the central control unit in label order, using binary search on the label attribute. One subsystem unit is then selected at random as the subsystem unit to be inserted into, and the cluster code of the sample to be inserted is updated as in step 3 until no corresponding split node can be found in that subsystem unit. The number of samples in the subsystem unit whose cluster code equals that of the sample to be inserted is then counted; if it exceeds the specified threshold, one sample of the cluster is selected as a split node, the similarity between every sample in the cluster and the split node is computed, the cluster codes of all samples in the cluster are updated, and the split node is added to the split node set.
Step 5: the label of the sample to be deleted is given manually, and the central control unit uses binary search on the label to find, in its corresponding data set, the merged data of the sample to be deleted; each feature vector of that data is then used as a query and retrieved in each subsystem unit as in step 3, and retrieved samples carrying the label to be deleted are removed; if no sample with the same cluster code as the removed sample remains, the split nodes with that cluster code are also deleted from the subsystem unit's split node set.
Preferably, the feature vector of the original signal extracted in step 1 is:
f_k = F(x_k), k ∈ [0, n]
where x_k is the k-th original signal data, n is the number of original signal data, F denotes the feature extractor, and f_k is the feature vector of the k-th original signal data.
The label of the original signal data in step 1 is Id_k, where Id_k denotes the label of the k-th original signal data. The sample of the k-th original signal data is:
data_k = {Id_k, f_k}, k ∈ [0, n]
where data_k denotes the sample of the k-th original signal data and n is the number of original signal data.
Sorting the samples by the label of the original signal data, i.e. by Id_k in ascending order, gives the sorted signal data samples:
data'_i = {Id_{k_i}, f_{k_i}}, i ∈ [1, n], k_i ∈ [1, n]
where data'_i denotes the i-th sorted signal data sample, i.e. the sample of the k_i-th original signal data, n is the number of original signal data, Id_{k_i} is the label in the i-th sorted signal data sample, and f_{k_i} is the feature vector in the i-th sorted signal data sample.
If Id_{k_i} and Id_{k_j} are the same and i ≠ j, data'_i and data'_j are merged into the same merged signal data sample.
The merged signal data sample set in step 1 is:
φ = {φ_1, φ_2, ..., φ_L}, φ_u = {Id_{k_u}, f_{k_{u,1}}, ..., f_{k_{u,L_u}}}, u ∈ [1, L]
where φ denotes the merged signal data sample set, φ_u the u-th merged signal data sample, L_u the number of feature vectors in the u-th merged sample, Id_{k_u} the label in the u-th merged sample (i.e. the label of the k_u-th original signal data sample), and f_{k_{u,v}} the v-th feature vector in the u-th merged sample (i.e. the feature vector of the k_{u,v}-th original signal data sample), v ∈ [1, L_u].
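A minimal Python sketch of step 1, under stated assumptions: the feature extractor F is passed in as a plain callable (here a toy identity) and the merged set φ is represented as a label-keyed dictionary; the patent does not prescribe concrete container types.

```python
def build_merged_set(raw_signals, labels, extract_features):
    """Step 1 sketch: extract features, sort samples by label, merge equal labels.

    raw_signals      -- list of raw signal data items x_k
    labels           -- manually assigned labels Id_k, one per signal
    extract_features -- assumed stand-in for the feature extractor F
    Returns phi: an insertion-ordered dict mapping label -> list of feature vectors.
    """
    samples = [{"Id": lab, "f": extract_features(x)}
               for x, lab in zip(raw_signals, labels)]
    samples.sort(key=lambda s: s["Id"])      # sort by label, ascending

    phi = {}                                 # merged sample set (label order preserved)
    for s in samples:
        phi.setdefault(s["Id"], []).append(s["f"])
    return phi


if __name__ == "__main__":
    signals = [[3.0, 1.0], [0.5, 2.5], [1.0, 1.0]]
    ids = [7, 3, 7]
    print(build_merged_set(signals, ids, extract_features=list))
    # {3: [[0.5, 2.5]], 7: [[3.0, 1.0], [1.0, 1.0]]}
```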
Preferably, in step 2 the data sample set that the central control unit allocates to the i-th subsystem unit is:
φ^i = {φ^i_1, φ^i_2, ..., φ^i_c}, u ∈ [1, c]
where φ^i denotes the data sample set allocated by the central control unit to the i-th subsystem unit, φ^i_u the u-th merged signal data sample in the i-th subsystem unit, L_u the number of feature vectors in the u-th merged sample, Id_{k_u} the label in the u-th merged sample (i.e. the label of the k_u-th original signal data sample), f_{k_{u,v}} the v-th feature vector in the u-th merged sample (i.e. the feature vector of the k_{u,v}-th original signal data sample), v ∈ [1, L_u], and c the number of sample labels contained in each subsystem unit.
Each subsystem unit reorganizes its allocated set into individual sample records:
d^i_j = {Id_j, f_j, node, sim, code}, j ∈ [1, L]
where d^i_j denotes the j-th reorganized signal data sample of the i-th subsystem unit, L the number of reorganized samples in the i-th subsystem unit, Id_j the label of the j-th reorganized sample, f_j its feature vector, node whether the sample has been selected as a split node (default false), sim the similarity of the sample to the split node, and code the cluster code to which the sample belongs (initial value -1).
The sample set to be split of the i-th subsystem unit, i.e. the set of reorganized samples with code = M, is indexed by j ∈ [1, l], where l is the number of samples with code = M in the i-th subsystem unit and s_j denotes the similarity of the j-th such sample to the split node.
The similarity sim_AB denotes the similarity of the vectors A and B, where A and B are two vectors, A_i and B_i are the i-th dimension values of A and B, n is the dimension of the vectors, and i ∈ [1, n].
The samples of the set to be split are sorted by sim, with i ∈ [1, N] (i indexing the subsystem units) and k_j ∈ [1, l]: the j-th sorted sample corresponds to the k_j-th sample of the set to be split in the i-th subsystem unit, l is the number of samples in that set, and each sorted sample carries its label and feature vector.
The similarity of the middle sample of the sorted set to be split is taken as the threshold of the split node.
The cluster code of each sample to be split is updated according to the relation between sim and thresh, where code denotes the cluster code to which the sample belongs, sim the similarity computed between the sample and the split node, and thresh the threshold of the split node.
After each update of the cluster codes, the number of samples in the newly formed cluster is counted; if it exceeds the specified number, one sample is selected at random from the new cluster as a split node, the similarity between all samples of the new cluster and the new split node is computed, and the cluster codes of all samples in the new cluster are updated; otherwise the splitting ends.
A split node of the i-th subsystem unit is recorded together with its feature vector as {M, thresh}, where M denotes the region code to which the split node belongs and thresh the threshold of the split node.
The split node set of the i-th subsystem unit is:
ψ_i = {n^i_1, n^i_2, ..., n^i_q}, j ∈ [1, q]
where ψ_i denotes the split node set of the i-th subsystem unit, n^i_j the j-th split node in the i-th subsystem unit, M_j the code of that split node, T_j its threshold, and q the number of split nodes of the subsystem unit.
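A compact Python sketch of the splitting in step 2. It is illustrative only: cosine similarity is assumed for sim_AB (the patent does not reproduce the formula), and the cluster codes of the two children of a cluster with code M are assumed to be 2M+2 and 2M+3, a collision-free binary coding compatible with the initial code -1; the patent does not give the code-update rule explicitly.

```python
import math
import random


def cosine_sim(a, b):
    # assumed similarity measure; the patent only says sim is computed over the n dimensions
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def build_split_nodes(samples, max_cluster_size=10, rng=None):
    """samples: list of records {"Id", "f", "node", "sim", "code"} with code = -1 initially.
    Returns the split node set: a list of {"f", "M", "thresh"} records."""
    rng = rng or random.Random(0)
    split_nodes = []
    pending = [-1]                                    # cluster codes still to examine
    while pending:
        M = pending.pop()
        cluster = [s for s in samples if s["code"] == M]
        if len(cluster) < max_cluster_size:           # small enough: stays a leaf cluster
            continue
        pivot = rng.choice(cluster)                   # random split node
        for s in cluster:
            s["sim"] = cosine_sim(s["f"], pivot["f"])
        cluster.sort(key=lambda s: s["sim"])
        thresh = cluster[len(cluster) // 2]["sim"]    # middle sample's similarity
        high = [s for s in cluster if s["sim"] >= thresh]
        if not high or len(high) == len(cluster):     # degenerate split: keep as a leaf
            continue
        pivot["node"] = True
        for s in cluster:                             # assumed binary code update
            s["code"] = 2 * M + 2 if s["sim"] >= thresh else 2 * M + 3
        split_nodes.append({"f": pivot["f"], "M": M, "thresh": thresh})
        pending.extend([2 * M + 2, 2 * M + 3])
    return split_nodes
```

Retrieval (step 3), insertion (step 4) and deletion (step 5) walk these split nodes with the same code-update rule, starting from code -1.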
Preferably, the feature vector of the target signal to be retrieved extracted in step 3 is:
f_k = F(x_k), k ∈ [0, n]
where x_k is the k-th target signal data to be retrieved, n is the number of target signal data to be retrieved, F denotes the feature extractor, and f_k is the feature vector of the k-th target signal data to be retrieved.
In step 3 the query is packaged into a form containing the feature vector and a cluster code, distributed to each subsystem unit, and retrieval starts from the split node whose cluster code is -1 in the split node set of each subsystem unit. The sample of the target signal data to be retrieved is:
data_k = {f_k, code_k = -1}, k ∈ [0, n]
where data_k denotes the sample of the k-th target signal data to be retrieved, code_k the cluster code of that sample with initial value -1, and n the number of target signal data to be retrieved.
In step 3 the similarity between the split node and the sample to be retrieved is computed, and the cluster code of the sample to be retrieved is updated according to the similarity. The similarity sim_AB denotes the similarity of the vectors A and B, where A and B are two vectors, A_i and B_i are the i-th dimension values of A and B, n is the dimension of the vectors, and i ∈ [1, n]. The cluster code of the sample to be retrieved is updated according to the relation between sim and thresh, where code denotes the cluster code to which the sample belongs, sim the similarity computed between the sample and the split node, and thresh the threshold of the split node.
Step 3 repeats this until no split node with the same code as the sample to be retrieved can be found in the subsystem unit: after each update of the cluster code of the sample to be retrieved, the split node set is searched again for a split node with the same code as the current sample to be retrieved; if one exists, the similarity between the sample to be retrieved and that split node is computed and the cluster code of the sample to be retrieved is updated; otherwise the descent ends.
All samples in the subsystem unit whose cluster code matches that of the sample to be retrieved are then taken out: the codes of all samples in the subsystem unit's sample set are traversed, and all samples with the same cluster code (code = M) as the sample to be retrieved are collected into a set indexed by j ∈ [1, l], where the j-th element is the j-th sample with code = M in the i-th subsystem unit and l is the number of samples in the set.
These samples are sorted by their similarity to the sample to be retrieved, with k_j ∈ [1, l]: the j-th sorted sample corresponds to the k_j-th sample of the collected set in the i-th subsystem unit, l is the number of samples in the set, and each sorted sample carries its label and feature vector.
The central control unit merges the retrieval results uploaded by the subsystem units into a merged set indexed by k_j ∈ [1, m × N], where φ_tar denotes the merged data set of the subsystem results sorted by sim (m results from each of the N subsystem units).
In step 3, samples with duplicate labels are removed, and the k samples with the highest similarity are selected as the final output:
Out_tar = {out_1, out_2, ..., out_k}, i ∈ [1, k]
where Out_tar denotes the output retrieval set and out_i the i-th sample with the highest similarity to the sample to be retrieved in the retrieval result.
Preferably, the feature vector of the original information of the sample to be inserted extracted in step 4 is:
f_k = F(x_k), k ∈ [0, n]
where x_k is the k-th original signal data to be inserted, n is the number of original signal data to be inserted, F denotes the feature extractor, and f_k is the feature vector of the k-th original signal data to be inserted.
In step 4 the label of the original signal data to be inserted is Id_k, where Id_k denotes the label of the k-th original signal data to be inserted. The sample of the original signal data to be inserted is:
data_k = {Id_k, f_k}, k ∈ [0, n]
where data_k denotes the sample of the k-th original signal data to be inserted and n is the number of original signal data to be inserted.
In step 4 the sample to be inserted is inserted, in label order, into the data set corresponding to the central control unit (as shown in FIG. 2), using binary search on the label attribute in the central control unit's data set.
If the label of the sample to be inserted is found in the central control unit's data set, the feature vector of the sample to be inserted is added to the feature vectors with the same label in the data set; the data set after insertion has the form:
φ = {φ_1, φ_2, ..., φ_L}, u ∈ [1, L]
where φ denotes the merged signal data sample set of the central control unit, φ_u the merged signal data sample with the same label as the sample to be inserted, L_u the number of feature vectors in that merged sample, Id_{k_u} the label in the merged sample equal to the label of the sample to be inserted (i.e. the label of the k_u-th original signal data sample), and f_{k_{u,v}} the v-th feature vector in that merged sample (i.e. the feature vector of the k_{u,v}-th original signal data sample), v ∈ [1, L_u]; the feature vector of the sample to be inserted is appended to this merged sample.
If the label of the sample to be inserted cannot be found in the central control unit's data set, the insertion position is determined by the binary search and the sample to be inserted is inserted into the central data set; the data set after insertion is:
φ = {φ_1, φ_2, ..., φ_L, φ_{L+1}}, u ∈ [1, L+1]
where φ denotes the merged signal data sample set of the central control unit, φ_u the u-th merged signal data sample, L_u the number of feature vectors in the u-th merged sample, Id_{k_u} the label in the u-th merged sample (i.e. the label of the k_u-th original signal data sample), f_{k_{u,v}} the v-th feature vector in the u-th merged sample (i.e. the feature vector of the k_{u,v}-th original signal data sample), v ∈ [1, L_u], and φ_{L+1} the data sample formed from the sample to be inserted.
In step 4 one subsystem unit is selected at random as the subsystem unit to be inserted into, and the cluster code of the sample to be inserted is updated as in step 3. The sample of the signal data to be inserted is packaged as:
data_k = {f_k, code_k}, k ∈ [0, n]
where data_k denotes the sample of the k-th target signal data to be inserted, code_k the cluster code of that sample with initial value -1, and n the number of target signal data to be inserted.
The similarity between the split node and the sample to be inserted is computed, and the cluster code of the sample to be inserted is updated according to the similarity. The similarity sim_AB denotes the similarity of the vectors A and B, where A and B are two vectors, A_i and B_i are the i-th dimension values of A and B, n is the dimension of the vectors, and i ∈ [1, n]. The cluster code of the sample to be inserted is updated according to the relation between sim and thresh, where code denotes the cluster code to which the sample belongs, sim the similarity computed between the sample and the split node, and thresh the threshold of the split node.
Step 4 continues until no corresponding split node can be found in the subsystem unit: after each update of the cluster code of the sample to be inserted, the split node set is searched again for a split node with the same code as the sample to be inserted; if one exists, the similarity between the sample to be inserted and that split node is computed and the cluster code of the sample to be inserted is updated; otherwise the descent ends.
The split node set ψ_i of step 4 is:
ψ_i = {n^i_1, n^i_2, ..., n^i_q}, j ∈ [1, q]
where ψ_i denotes the split node set of the i-th subsystem unit, n^i_j the j-th split node in the i-th subsystem unit, M_j the code of that split node, T_j its threshold, and q the number of split nodes of the subsystem unit.
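A Python sketch of the subsystem-side part of step 4, under the same assumptions as the sketches above (cosine similarity, 2M+2 / 2M+3 child codes, dict records). The central-unit side, a binary-search insert into the label-ordered data set, is an ordinary sorted insertion (e.g. via Python's bisect module) and is omitted here.

```python
import math
import random


def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def insert_sample(new_id, new_f, samples, split_nodes, max_cluster_size=10, rng=None):
    """Route the new sample to a leaf cluster, then split that cluster if it grew too large."""
    rng = rng or random.Random(0)
    by_code = {n["M"]: n for n in split_nodes}
    code = -1
    while code in by_code:                                   # step-3 style descent
        node = by_code[code]
        sim = cosine_sim(new_f, node["f"])
        code = 2 * code + 2 if sim >= node["thresh"] else 2 * code + 3
    samples.append({"Id": new_id, "f": new_f, "node": False, "sim": 0.0, "code": code})

    cluster = [s for s in samples if s["code"] == code]
    if len(cluster) <= max_cluster_size:                     # still within the size limit
        return
    pivot = rng.choice(cluster)                              # split the overgrown cluster
    for s in cluster:
        s["sim"] = cosine_sim(s["f"], pivot["f"])
    cluster.sort(key=lambda s: s["sim"])
    thresh = cluster[len(cluster) // 2]["sim"]
    high = [s for s in cluster if s["sim"] >= thresh]
    if not high or len(high) == len(cluster):                # degenerate split: keep as leaf
        return
    pivot["node"] = True
    for s in cluster:
        s["code"] = 2 * code + 2 if s["sim"] >= thresh else 2 * code + 3
    split_nodes.append({"f": pivot["f"], "M": code, "thresh": thresh})
```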
Preferably, in step 5 the central control unit uses binary search on the label to find, in its corresponding data set, the merged data of the sample to be deleted. If merged data with the same label as the sample to be deleted is found in the central control unit's data set, the merged signal data of the label to be deleted is extracted as:
φ_u = {Id_k, f_1, f_2, ..., f_L}, v ∈ [1, L]
where φ_u denotes the merged signal data sample found in the central control unit with the same label as the sample to be deleted, L the number of feature vectors in that merged sample (i.e. the number of features to be deleted), Id_k the label of the sample to be deleted, and f_v the v-th feature vector to be deleted.
In step 5 each feature vector of the data to be deleted is used as the feature vector to be retrieved and is searched in each subsystem unit with the method of step 3 to obtain the top1 result. The feature vector of each signal data item with the label to be deleted is packaged as data to be deleted:
data_k = {Id_k, f_k, code_k}, k ∈ [0, n]
where data_k denotes the k-th data item to be deleted, Id_k the label of the target signal to be deleted, f_k the k-th feature vector of the target signal data to be deleted held by the central control unit, code_k the cluster code of the k-th data item to be deleted with initial value -1, and n the number of target signal data items to be deleted.
Taking the feature vector of each data item to be deleted as the feature vector to be retrieved, retrieval is performed in each subsystem unit to obtain the top1 result, consistently with step 3: the similarity between the feature vector of the data to be deleted and the split node is computed and the cluster code of the data to be deleted is updated, until no split node with the same cluster code as the sample to be deleted can be found in the split node set. The similarity sim_AB denotes the similarity of the vectors A and B, where A and B are two vectors, A_i and B_i are the i-th dimension values of A and B, n is the dimension of the vectors, and i ∈ [1, n]. The cluster code of the sample to be deleted is updated according to the relation between sim and thresh, where code denotes the cluster code to which the sample belongs, sim the similarity computed between the sample and the split node, and thresh the threshold of the split node.
In step 5, if the label of the retrieval result is the same as the label of the sample to be deleted, the retrieved sample is deleted from the subsystem unit: data_k denotes the data to be deleted, and d^i_m denotes the top1 data output by the i-th subsystem unit when data_k is used as the sample to be retrieved, i.e. the sample with serial number m in the i-th subsystem unit; if the label of d^i_m equals Id_k, then d^i_m is deleted.
In step 5 the number of samples remaining in the subsystem unit with the same cluster code as the retrieved result sample is then counted; if that number is zero, the split nodes with the same cluster code as the retrieved result sample are deleted from the split node set of the subsystem unit.
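A Python sketch of the subsystem-side deletion of step 5, under the same assumptions as the sketches above. The final pruning line follows the patent's wording (drop split nodes carrying the now-empty cluster code); exactly which node that is depends on the patent's own, unspecified code assignment, so treat it as illustrative.

```python
import math


def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def delete_sample(delete_id, delete_f, samples, split_nodes):
    """Retrieve the top1 match of one feature vector to be deleted; remove it if the
    label matches, and prune split nodes whose cluster has become empty."""
    by_code = {n["M"]: n for n in split_nodes}
    code = -1
    while code in by_code:                                   # step-3 style descent
        node = by_code[code]
        sim = cosine_sim(delete_f, node["f"])
        code = 2 * code + 2 if sim >= node["thresh"] else 2 * code + 3
    cluster = [s for s in samples if s["code"] == code]
    if not cluster:
        return
    top1 = max(cluster, key=lambda s: cosine_sim(delete_f, s["f"]))
    if top1["Id"] == delete_id:                              # delete only on a label match
        samples.remove(top1)
    if not any(s["code"] == code for s in samples):          # cluster is now empty:
        split_nodes[:] = [n for n in split_nodes if n["M"] != code]
```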
The invention has the following advantages:
Data structure construction: no clustering is needed; samples within a class are selected at random as central nodes and organized into a tree structure, which simplifies the computation, makes the constructed data structure independent of the original data distribution, and allows samples to be added and deleted dynamically.
Fine-grained leaf nodes: a leaf node is not necessarily a single sample but holds a certain number of samples, which can improve retrieval precision.
Dynamic data structure: the data structure built in a subsystem unit is dynamic rather than a fixed model, so samples can be handled more flexibly.
Data in the data structure can be added and deleted: samples can be added to and deleted from a subsystem unit's data structure directly, so the retrieval system can satisfy more requirements.
Drawings
FIG. 1: implementation scenario of the invention.
FIG. 2: central control unit database.
FIG. 3: flow chart of a subsystem unit of the invention.
FIG. 4: subsystem unit data partitioning.
FIG. 5: tree structure of the subsystem unit features.
FIG. 6: sample retrieval flow chart.
FIG. 7: sample insertion flow chart.
FIG. 8: sample deletion flow chart.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are merely illustrative of the invention and are not intended to limit it. In addition, the technical features involved in the embodiments described below may be combined with each other as long as they do not conflict. Parameters that must be set according to the actual situation follow the parameter-setting methods noted above and are not repeated here.
As shown in fig. 1, which is a schematic view of an implementation scenario of the present invention, the method of the present invention includes a central control unit and a plurality of subsystem units, and the central control unit is sequentially connected to the plurality of subsystem units.
The following describes the embodiments of the present invention with reference to fig. 1 to 8:
step 1: the central control unit extracts a feature vector of an original signal, manually marks a label of original signal data, constructs a sample of the original signal data by combining the feature vector of the original signal, sorts the sample of the original signal data according to the label of the original signal data to obtain sorted signal data samples, combines the sorted signal data samples corresponding to the labels in the same sorted signal data samples to obtain a combined signal data sample set, and the form of the combined data sample set is shown in FIG. 2;
fk=F(xk)
k∈[0,n]
wherein x iskIs the number of kth original signalAccording to n is the number of raw signal data, F represents a feature extractor, FkA feature vector of the kth original signal data;
the label of the original signal data in the step 1 is as follows: idk,IdkA tag representing kth original signal data;
datak={Idk,fk}
k∈[0,n]
wherein, the datakA sample representing kth original signal data, n being the number of original signal data;
from the label, i.e. Id, of the original signal datakSequencing from small to large, and obtaining samples of the sequenced signal data as follows:
i∈[1,n]
ki∈[1,n]
wherein,representing the ith sequenced signal data sample, i.e. corresponding to the kthiSamples of the original signal data, n being the number of original signal data,representing the label in the ith sorted signal data sample,representing a feature vector in the ith sorted signal data sample;
if it isAndsame, i ≠ j, then it willAndmerging the signals into the same merged signal data sample set;
in step 1, the combined signal data sample set is:
φ={φ1,φ2,...,φL}
u∈[1,L]
where φ represents a set of combined signal data samples, φuRepresents the u-th combined signal data sample, LuRepresenting the number of feature vectors in the u-th combined signal data sample,indicating the label in the u-th combined signal data sample, i.e. corresponding to the k-thuThe label of the sample of the original signal data,representing the v-th eigenvector in the u-th combined signal data sample, i.e. corresponding to the k-th eigenvectoru,vFeature vector of a sample of raw signal data, v ∈ [1, L ]u];
Step 2: the central control unit allocates a data sample set to each subsystem unit, each subsystem unit recombines the allocated data sample sets, selects a sample set to be split from the recombined sample sets, randomly selects a sample from the sample set to be split as a splitting node, calculates the similarity between all samples in the sample set to be split and the splitting node, sorts the sample sets to be split by using the similarity, selects the similarity of an intermediate sample as a threshold, updates the code of each sample to be split according to the relation between the similarity and the threshold, continuously and repeatedly splits each cluster until the number of samples contained in each cluster is less than a specified number, updates the splitting node and adds the splitting node into the splitting node set, the subsystem unit data structure construction flow chart is shown in figure 3, the set splitting chart is shown in figure 4, the constructed tree data structure is shown in FIG. 5;
u∈[1,c]
wherein phi isiRepresenting the set of data samples assigned by the central control unit to the i-th subsystem element,represents the u-th combined signal data sample, L, in the i-th subsystem elementuRepresenting the number of feature vectors in the u-th combined signal data sample,indicating the label in the u-th combined signal data sample, i.e. corresponding to the k-thuThe label of the sample of the original signal data,representing the v-th eigenvector in the u-th combined signal data sample, i.e. corresponding to the k-th eigenvectoru,vFeature vector of a sample of raw signal data, v ∈ [1, L ]u]C represents the number of sample tags contained in each subsystem unit;
j∈[1,L]
wherein,representing the recombined data sample set of the ith subsystem element,represents the j signal data sample after being recombined in the i subsystem unit, L represents the number of the sample after being recombined in the i subsystem unit, and IdjLabel representing the j-th signal data sample after recombination, fjAnd representing a feature vector of the j-th signal data sample after recombination, wherein a node represents whether the sample is selected as a splitting node or not, the default is false, sim represents the similarity of the sample and the splitting node, code represents the cluster code to which the sample belongs, and the initial value is-1.
j∈[1,l]
wherein,represents the recombined to-be-split sample set of the ith subsystem unit (i.e. code-M data sample set),represents the j (M) th code after being recombined in the i (th) subsystem unit, l represents the number of the (M) th code after being recombined in the i (th) subsystem unit, and sjTo representSimilarity to split nodes.
wherein simABRepresenting the degree of similarity of the vectors a, B,
a and B respectively represent two vectors, Ai,BiRepresents the ith dimension value of the vectors a, B,
n denotes the dimension of the vector, i ∈ [1, n ]
i∈[1,N]
kj∈[1,l]
wherein, the first and second guide rollers are arranged in a row,represents the jth sample sorted according to sim, i represents the ith subsystem element, i.e., corresponds toMiddle (k) thjA sample, < i > isThe number of the medium samples is the same as the number of the medium samples,a label representing the ith sorted sample,representing the feature vector of the ith sorted sample.
wherein,indicating the degree of similarity ordering of the split sample setsThe similarity of the samples is used as a threshold value of the splitting node;
wherein code represents the cluster code to which the sample belongs,
sim represents the similarity of the sample to the split node calculation,
thresh represents the threshold for the split node
after updating the cluster codes of the samples to be retrieved each time, counting the number of the updated cluster samples in the sample set, if the number exceeds the specified number, randomly selecting one sample from the new cluster as a split node, calculating the similarity between all samples in the new cluster and the new split node, and updating the cluster codes of all samples in the new cluster; otherwise, ending.
wherein,the splitting node with the number of M in the ith subsystem unit is represented, M represents the region code to which the splitting node belongs, and thresh represents the threshold value of the splitting node;
j∈[1,q]
wherein psiiRepresenting the ith set of subsystem unit split nodes,denotes the jth split node, M, in the ith subsystem elementjCode representing the split node, TjRepresenting the split node threshold and q representing the number of split nodes for the subsystem unit.
And step 3: extracting a characteristic vector of a target signal to be retrieved from a central control unit, packaging the characteristic vector into a form containing the characteristic vector and a cluster code, distributing the characteristic vector and the cluster code to each subsystem unit, finding out a node with the cluster code of-1 from a split set of each subsystem unit, calculating the similarity between the split node and a sample to be retrieved, updating the sample code to be retrieved according to the similarity, repeating the steps until the split node which is the same as the sample to be retrieved cannot be found in the subsystem unit, taking out all samples which are consistent with the cluster code of the sample to be retrieved in the subsystem unit at the moment, sequencing the similarity of all the taken samples, and selecting m samples with the maximum similarity as a retrieval result of the subsystem unit and uploading the retrieval result to the central control unit; the central control unit merges the retrieval results uploaded by each subsystem unit, sorts the retrieval results according to the similarity to obtain a merged set, eliminates samples with consistent labels, selects k samples with the maximum similarity as final output, namely topK, and a sample retrieval flow chart is shown in fig. 6.
And 3, extracting the characteristic vector of the target signal to be retrieved:
fk=F(xk)
k∈[0,n]
wherein x iskIs the kth target signal data to be retrieved, n is the number of target signal data to be retrieved, F represents the feature extractor, FkCharacteristic vectors of kth target signal data to be retrieved;
step 3, the information is packaged into a form containing the characteristic vector and the cluster code, the information is distributed to each subsystem unit, and the node with the cluster code of-1 is found out from the split set of each subsystem unit and begins to be:
Idk,Idka tag representing kth original signal data;
and step 3, the sample of the target signal data to be retrieved is as follows:
datak={fk,codek=-1}
k∈[0,n]
wherein, the datakSamples, codes, representing the kth target signal data to be retrievedkRepresenting the clustering code of the sample of the kth target signal data to be retrieved, wherein the initial value is-1, and n is the number of the target signal data to be retrieved;
and 3, calculating the similarity between the split node and the sample to be retrieved, and updating the cluster code of the sample to be retrieved according to the similarity:
and step 3, the similarity calculation formula is as follows:
wherein simABRepresenting the degree of similarity of the vectors a, B,
a and B respectively represent two vectors, Ai,BiRepresents the ith dimension value of the vectors a, B,
n denotes the dimension of the vector, i ∈ [1, n ]
And step 3, updating the cluster codes of the samples to be retrieved according to the similarity as follows:
wherein code represents the cluster code to which the sample belongs,
sim represents the similarity of the sample to the split node calculation,
thresh represents the threshold for the split node
And 3, repeating the steps until the same split node as the sample to be retrieved cannot be found in the subsystem unit.
Updating the cluster codes of the samples to be retrieved each time, searching whether the split samples identical to the cluster codes of the current samples to be retrieved exist in the split sample set again, if so, calculating the similarity between the samples to be retrieved and the new split nodes, and updating the cluster codes of the samples to be retrieved; otherwise, ending.
And 3, taking out all samples which are in the same clustering code with the sample to be retrieved in the subsystem unit at the moment.
Traversing the codes of all samples in the sample set of the subsystem unit, and taking all cluster coded samples which are the same as the samples to be retrieved as a set, wherein the form is as follows:
j∈[1,l]
wherein,a sample set representing code M in the ith subsystem unit,a sample representing the jth code ═ M, l isNumber of medium samples.
And 3, sequencing all samples consistent with the cluster codes of the samples to be retrieved in the subsystem unit according to the similarity, wherein the sequenced sample set is as follows:
kj∈[1,l]
wherein,represents the jth sample sorted according to sim, i represents the ith subsystem element, i.e., corresponds toMiddle (k) thjA sample, < i > isThe number of the medium samples is the same as the number of the medium samples,a label representing the ith sorted sample,representing the feature vector of the ith sorted sample.
And 3, the central control unit combines the retrieval results uploaded by each subsystem unit to obtain a combined set, wherein the combined set is as follows:
kj∈[1,m×N]
wherein phitarAnd representing the data set after the subsystem units are combined and sorted according to sim.
Step 3, removing samples with consistent labels:
Step 3, selecting the k samples with the maximum similarity as the final output:
i∈[1,k]
wherein: outtarA search set representing the output is generated,and the ith sample with the maximum similarity to the sample to be retrieved in the retrieval result is represented.
And 4, step 4: the central control unit extracts a characteristic vector of original information of a sample to be inserted, manually marks a label of the sample to be inserted, and inserts the sample to be inserted into a data set corresponding to the central control unit according to the label sequence in the data set of the central control unit by adopting a binary search method for the label attribute; and (3) randomly selecting a subsystem unit as a subsystem unit to be inserted, and updating the cluster code of the sample to be inserted according to the step (3) until the corresponding split node cannot be found in the subsystem unit. Traversing the number of all samples in the sub-system unit to be inserted, which are the same as the cluster code of the sample to be inserted, if the number is greater than a specified threshold value, selecting one sample from the cluster as a split node, calculating the similarity between all samples in the cluster and the split node, updating the cluster code of all nodes in the cluster, adding the split node into a split node set, and inserting the samples as shown in fig. 7.
And 4, extracting the characteristic vector of the original information of the sample to be inserted:
fk=F(xk)
k∈[0,n]
wherein x iskIs the kth original signal data to be inserted, n is the number of original signal data to be inserted, F represents the feature extractor, FkA characteristic vector of the kth original signal data to be inserted is obtained;
step 4, the label to be inserted with the original signal data is: idk,IdkA tag representing a kth original signal data to be inserted;
step 4, the samples of the original signal data to be inserted are:
datak={Idk,fk}
k∈[0,n]
wherein, the datakRepresenting a sample of kth original signal data to be inserted, wherein n is the number of the original signal data to be inserted;
step 4, inserting the sample to be inserted into the data set (as shown in fig. 2) corresponding to the central control unit according to the label sequence by adopting a binary search method for the label attribute in the data set of the central control unit:
if the sample label to be inserted is found in the data set of the central control unit, inserting the characteristic vector of the sample to be inserted into the characteristic vector with the same label in the data set, wherein the form of the data set after insertion is as follows:
φ={φ1,φ2,...,φL}
u∈[1,L]
wherein phi represents a signal data sample set after the central control unit is combined, phiuRepresenting the combined signal data samples, L, identical to the sample labels to be inserteduIs shown andthe number of feature vectors in the combined signal data samples to be inserted with the same sample label,indicating the same label in the combined signal data sample as the sample label to be inserted, i.e. corresponding to the kthuThe label of the sample of the original signal data,representing the v-th eigenvector in the combined signal data sample identical to the sample label to be inserted, i.e. corresponding to the k-th eigenvectoru,vFeature vector of a sample of raw signal data, v ∈ [1, L ]u];Representing the feature vector of the sample to be inserted.
If the to-be-inserted sample label cannot be found in the central control unit data set, the position of the to-be-inserted sample is found according to the binary search, the to-be-inserted sample is inserted into the central data set, and the inserted data set is as follows:
φ={φ1,φ2,...,φL,φL+1}
u∈[1,L+1]
wherein phi represents a signal data sample set after the central control unit is combined, phiuRepresents the u-th combined signal data sample, LuRepresenting the number of feature vectors in the u-th combined signal data sample,indicating the label in the u-th combined signal data sample, i.e. corresponding to the k-thuThe label of the sample of the original signal data,denotes the u-th sumAnd the v-th feature vector in the post-signal data sample corresponds to the k-th feature vectoru,vFeature vector of a sample of raw signal data, v ∈ [1, L ]u];φL+1Representing the data sample to be inserted into the sample.
And 4, randomly selecting one subsystem unit as the subsystem unit to be inserted, and updating the cluster code of the sample to be inserted according to the step 3.
Encapsulating samples of signal data to be inserted as:
datak={fk,codek}
k∈[0,n]
wherein, the datakSamples, codes, representing the kth target signal data to be insertedkRepresenting the clustering code of the sample of the kth target signal data to be inserted, wherein the initial value is-1, and n is the number of the target signal data to be inserted;
calculating the similarity between the splitting node and the sample to be inserted, and updating the cluster code of the sample to be inserted according to the similarity:
the similarity calculation formula is as follows:
wherein simABRepresenting the degree of similarity of the vectors a, B,
a and B respectively represent two vectors, Ai,BiRepresents the ith dimension value of the vectors a, B,
n denotes the dimension of the vector, i ∈ [1, n ]
And updating the cluster code of the sample to be inserted according to the similarity, wherein the cluster code updating formula is as follows:
wherein code represents the cluster code to which the sample belongs,
sim represents the similarity of the sample to the split node calculation,
thresh represents the threshold for the split node
And 4, until the corresponding split node cannot be found in the subsystem unit.
The description is as follows: updating the cluster codes of the samples to be inserted each time, searching whether the split nodes identical to the cluster codes of the samples to be inserted exist in the split sample set again, if so, calculating the similarity between the samples to be inserted and the new split nodes, and updating the cluster codes of the samples to be inserted; otherwise, ending.
Step 4, splitting the node set psiiComprises the following steps:
j∈[1,q]
wherein psiiRepresenting the ith set of subsystem unit split nodes,denotes the jth split node, M, in the ith subsystem elementjCode representing the split node, TjRepresenting the split node threshold and q representing the number of split nodes for the subsystem unit.
And 5: manually giving a label of a sample to be deleted, and searching data obtained by combining the sample to be deleted in a data set corresponding to a central control unit by the central control unit through a binary search method for the label; if the search result is zero, deleting the split nodes which are in the same cluster code as the search result sample in the split node set of the subsystem unit, wherein the sample deletion is as shown in FIG. 8.
Step 5, searching the data obtained by combining the sample to be deleted in the data set corresponding to the central control unit by the central control unit through a binary search method for the label, and if the combined data with the same label as the sample to be deleted is found in the data set of the central control unit, extracting the signal data obtained by combining the label to be deleted as follows:
u∈[1,L]
wherein phi isuIndicating that a combined signal data sample set which is the same as the sample label to be deleted is found in the central control unit, wherein L indicates the number of characteristic vectors (namely the number of characteristics to be deleted) in combined signal data samples which are the same as the sample label to be deleted in the central control unit, and IdkIndicating that the sample label is to be deleted,representing the v-th feature vector to be deleted.
Step 5, searching each feature vector in each subsystem unit by using the feature vector of each data to be deleted as the feature vector to be searched by the method in step 3 to obtain top 1;
packaging the characteristic vector of each label signal data to be deleted into data to be deleted:
datak={Idk fk,codek}
k∈[0,n]
wherein, the datakDenotes the kth data to be deleted, IdkA label indicating a target signal to be deleted, fkK-th eigenvector, code, representing target signal data to be deleted by the central control unitkIndicating a target message to be deletedClustering coding of kth data of the number data, wherein the initial value is-1, and n is the number of target signal data to be deleted;
and (3) taking the feature vector of each data to be deleted as the feature vector to be retrieved, retrieving in each subsystem unit to obtain top1, wherein the method is consistent with the step 3, namely, the similarity between the feature vector of the data to be deleted and the splitting node is calculated, and the data code to be deleted is updated until the splitting node which is the same as the sample cluster code to be deleted cannot be found in the splitting node set.
Wherein the similarity calculation formula is as follows:
wherein simABRepresenting the degree of similarity of the vectors a, B,
a and B respectively represent two vectors, Ai,BiRepresents the ith dimension value of the vectors a, B,
n denotes the dimension of the vector, i ∈ [1, n ]
Updating the cluster code of the sample to be deleted, wherein the cluster code updating formula is as follows:
wherein code represents the cluster code to which the sample belongs,
sim represents the similarity of the sample to the split node calculation,
thresh represents the threshold for the split node
Step 5, if the label of the retrieval result is the same as the label of the sample to be deleted, deleting the retrieval result sample in the subsystem unit is as follows:
datakindicating that the data is to be deleted,representation datakTop1 data output in the ith subsystem element as a sample to be retrieved,the serial number in the ith subsystem unit is m, ifThen delete it
In step 5, the number of samples in the subsystem unit whose cluster code equals that of the retrieval-result sample is counted; if this number is zero, the split nodes with that cluster code are deleted from the split-node set of the subsystem unit, as follows:
data_k denotes the datum to be deleted; in the split-node set of the i-th subsystem unit, the split nodes whose cluster code equals that of data_k are located; the number of samples in the i-th subsystem unit whose cluster code equals that of data_k is counted, and if this number is zero, those split nodes are deleted.
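A hedged sketch of this clean-up step follows: after removing the matched sample, any split node whose cluster no longer contains samples is pruned. The sample and split-node containers are illustrative assumptions, not the patent's data structures.

```python
def delete_and_prune(samples, split_nodes, target_label, target_code):
    """Remove samples matching the retrieval result, then prune empty clusters.

    samples: list of dicts with 'label' and 'code' keys (illustrative layout).
    split_nodes: dict mapping a cluster code to its split-node record.
    """
    # Delete every sample whose label and cluster code match the retrieval result.
    samples[:] = [s for s in samples
                  if not (s["label"] == target_label and s["code"] == target_code)]

    # If the cluster is now empty, drop the split node carrying the same code.
    remaining = sum(1 for s in samples if s["code"] == target_code)
    if remaining == 0:
        split_nodes.pop(target_code, None)
    return samples, split_nodes
```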
It should be understood that parts of the application not described in detail are prior art.
It should be understood that the above description of the preferred embodiments is given for clearness of understanding and no unnecessary limitations should be understood therefrom, and all changes and modifications may be made by those skilled in the art without departing from the scope of the invention as defined by the appended claims.
Claims (6)
1. A retrieval method based on a rapid retrieval system facing a massive vector library, characterized in that:
the rapid retrieval system facing the massive vector library comprises: a central control unit and a plurality of subsystem units;
the central control unit is sequentially connected with the plurality of subsystem units;
the retrieval method comprises the following steps:
Step 1: the central control unit extracts a feature vector of an original signal, manually marks a label of the original signal data, constructs a sample of the original signal data by combining the feature vector of the original signal, sorts the samples of the original signal data according to their labels to obtain sorted signal data samples, and merges sorted signal data samples having the same label to obtain a combined signal data sample set;
Step 2: the central control unit distributes a data sample set to each subsystem unit; each subsystem unit recombines the distributed data sample set, selects a sample set to be split from the recombined sample set, randomly selects a sample from the sample set to be split as a split node, calculates the similarity between every sample in the sample set to be split and the split node, sorts the sample set to be split by similarity, selects the similarity of the middle sample as a threshold, updates the code of each sample to be split according to the relation between its similarity and the threshold, and repeatedly splits each cluster, updating the split node and adding it to the split-node set, until the number of samples contained in every cluster is smaller than a specified number;
Step 3: the central control unit extracts a feature vector of a target signal to be retrieved, packages it into a form containing the feature vector and a cluster code, and distributes it to each subsystem unit; each subsystem unit finds the node with cluster code -1 in its split-node set, calculates the similarity between the split node and the sample to be retrieved, and updates the code of the sample to be retrieved according to the similarity; this is repeated until no split node matching the sample to be retrieved can be found in the subsystem unit; at that point all samples in the subsystem unit whose cluster code matches that of the sample to be retrieved are taken out, the taken samples are sorted by similarity, and the m samples with the largest similarity are selected as the retrieval result of the subsystem unit and uploaded to the central control unit; the central control unit merges the retrieval results uploaded by the subsystem units, sorts them by similarity to obtain a merged set, removes samples with identical labels, and selects the k samples with the largest similarity as the final output, namely topK;
Step 4: the central control unit extracts a feature vector of the original information of a sample to be inserted and manually marks a label of the sample to be inserted, and the sample to be inserted is inserted into the data set corresponding to the central control unit in label order, using binary search on the label attribute over the data set of the central control unit; a subsystem unit is randomly selected as the subsystem unit to be inserted, and the cluster code of the sample to be inserted is updated according to step 3 until no corresponding split node can be found in the subsystem unit; the number of samples in the subsystem unit to be inserted whose cluster code equals that of the sample to be inserted is counted, and if the number is larger than a specified threshold, one sample is selected from the cluster as a split node, the similarity between every sample in the cluster and the split node is calculated, the cluster codes of all nodes in the cluster are updated, and the split node is added to the split-node set;
Step 5: a label of the sample to be deleted is given manually, and the central control unit searches, by binary search on the label, for the combined data of the sample to be deleted in the data set corresponding to the central control unit; if the search result is zero, the split nodes with the same cluster code as the retrieval-result sample are deleted from the split-node set of the subsystem unit.
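For orientation only, the sketch below mirrors the division of labour described in claim 1 between a central control unit and several subsystem units; the class and method names, the dict-based sample layout, and the cosine similarity are assumptions, and the per-shard search shown here is a naive stand-in for the split-node descent sketched after claim 4.

```python
import math

def cosine(a, b):
    """Assumed similarity measure; the claim only names 'similarity'."""
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = math.sqrt(sum(x * x for x in a)), math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class SubsystemUnit:
    """Holds one shard of the vector library plus its split-node set."""

    def __init__(self, samples):
        self.samples = samples        # list of {'label', 'feature', 'code'} dicts
        self.split_nodes = {}         # cluster code -> (split vector, threshold)

    def search_top_m(self, feature, m):
        # Naive stand-in: score every sample; the claimed method instead
        # descends the split nodes first (see the sketch after claim 4).
        for s in self.samples:
            s["sim"] = cosine(feature, s["feature"])
        return sorted(self.samples, key=lambda s: s["sim"], reverse=True)[:m]

class CentralControlUnit:
    """Keeps the label-sorted data set and merges the shard results."""

    def __init__(self, subsystems):
        self.subsystems = subsystems

    def retrieve_top_k(self, feature, k, m):
        partial = []
        for unit in self.subsystems:
            partial.extend(unit.search_top_m(feature, m))
        partial.sort(key=lambda s: s["sim"], reverse=True)
        seen, merged = set(), []
        for s in partial:                      # drop samples with duplicate labels
            if s["label"] not in seen:
                seen.add(s["label"])
                merged.append(s)
        return merged[:k]                      # topK

# Toy usage with two single-sample shards.
units = [SubsystemUnit([{"label": 1, "feature": [0.9, 0.1], "code": -1}]),
         SubsystemUnit([{"label": 2, "feature": [0.1, 0.9], "code": -1}])]
print(CentralControlUnit(units).retrieve_top_k([1.0, 0.0], k=1, m=1))
```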
2. The retrieval method based on the rapid retrieval system facing the massive vector library according to claim 1, characterized in that:
step 1, extracting the feature vector of the original signal is as follows:
f_k = F(x_k), k ∈ [0, n]
where x_k is the k-th original signal data, n is the number of original signal data, F denotes the feature extractor, and f_k is the feature vector of the k-th original signal data;
the label of the original signal data in step 1 is Id_k, where Id_k denotes the label of the k-th original signal data;
in step 1, the samples of the original signal data are:
data_k = {Id_k, f_k}, k ∈ [0, n]
where data_k denotes the sample of the k-th original signal data and n is the number of original signal data;
in step 1, the samples of the original signal data are sorted by their labels Id_k from small to large, the sorted signal data samples satisfying i ∈ [1, n], k_i ∈ [1, n],
where the i-th sorted signal data sample corresponds to the k_i-th sample of the original signal data, n is the number of original signal data, and the label and the feature vector of the i-th sorted signal data sample are those of that original sample;
in step 1, sorted signal data samples having the same label are merged: if the label of the i-th sorted sample equals the label of the j-th sorted sample, i ≠ j, the two samples are merged into the same combined signal data sample set;
in step 1, the combined signal data sample set is:
φ = {φ_1, φ_2, ..., φ_L}, u ∈ [1, L]
where φ denotes the set of combined signal data samples, φ_u denotes the u-th combined signal data sample, L_u denotes the number of feature vectors in the u-th combined signal data sample, the label of the u-th combined signal data sample corresponds to the label of the k_u-th original signal data sample, the v-th feature vector of the u-th combined signal data sample corresponds to the feature vector of the k_{u,v}-th original signal data sample, and v ∈ [1, L_u].
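The following sketch shows one way the step 1 procedure of claim 2 could be carried out in code: features are paired with labels, sorted by label, and samples sharing a label are merged into one combined record. The feature extractor and the record layout are illustrative assumptions.

```python
from itertools import groupby
from operator import itemgetter

def build_combined_set(signals, labels, feature_extractor):
    """Build the label-sorted, label-merged sample set of step 1.

    signals: raw signal data x_k; labels: manually assigned Id_k.
    Returns a list of (label, [feature vectors]) tuples sorted by label.
    """
    samples = [(lab, feature_extractor(sig)) for sig, lab in zip(signals, labels)]
    samples.sort(key=itemgetter(0))                          # sort by label Id_k
    combined = []
    for lab, group in groupby(samples, key=itemgetter(0)):
        combined.append((lab, [feat for _, feat in group]))  # merge same-label samples
    return combined

# Toy usage with a stand-in feature extractor.
def extract(sig):
    return [float(v) for v in sig]

print(build_combined_set([[1, 0], [0, 1], [1, 1]], [2, 1, 2], extract))
```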
3. The retrieval method based on the rapid retrieval system facing the massive vector library according to claim 1, characterized in that:
in step 2, the central control unit allocates a data sample set to each subsystem unit, u ∈ [1, c],
where φ_i denotes the data sample set allocated by the central control unit to the i-th subsystem unit, its u-th combined signal data sample contains L_u feature vectors, the label of that combined sample corresponds to the label of the k_u-th original signal data sample, its v-th feature vector corresponds to the feature vector of the k_{u,v}-th original signal data sample, v ∈ [1, L_u], and c denotes the number of sample labels contained in each subsystem unit;
in step 2, each subsystem unit recombines the allocated data sample set, j ∈ [1, L],
where the recombined data sample set of the i-th subsystem unit contains L samples, and the j-th recombined signal data sample carries a label Id_j, a feature vector f_j, a flag node indicating whether the sample has been selected as a split node (default false), a value sim denoting the similarity between the sample and the split node, and the cluster code code to which the sample belongs, with an initial value of -1;
in step 2, a sample set to be split is selected from the recombined sample set, j ∈ [1, l],
where the set to be split of the i-th subsystem unit is the set of recombined samples whose code equals M, its j-th element is the j-th recombined sample with code = M in the i-th subsystem unit, l denotes the number of recombined samples with code = M in the i-th subsystem unit, and s_j denotes the similarity of that sample to the split node;
in step 2, the similarity between a sample to be split and the split node is calculated as sim_AB, where A and B denote the two vectors, A_i and B_i denote the i-th dimension values of A and B, n denotes the dimension of the vectors, and i ∈ [1, n];
in step 2, all samples in the sample set to be split are sorted by similarity, the sorted sample set satisfying i ∈ [1, N], k_j ∈ [1, l],
where the j-th sample sorted by sim in the i-th subsystem unit corresponds to the k_j-th sample of the set to be split, l is the number of samples in that set, and each sorted sample carries its own label and feature vector;
in step 2, the similarity of the middle sample is selected as the threshold: the similarity of the sample in the middle position of the similarity-sorted sample set to be split is taken as the threshold of the split node;
in step 2, the code of each sample to be split is updated according to the relation between its similarity and the threshold, where code denotes the cluster code to which the sample belongs, sim denotes the similarity computed between the sample and the split node, and thresh denotes the threshold of the split node;
in step 2, the splitting of each cluster is repeated until the number of samples contained in every cluster is smaller than the specified number, specifically: after the cluster codes of the samples are updated each time, the number of samples in the updated cluster is counted; if it exceeds the specified number, one sample is randomly selected from the new cluster as a split node, the similarity between every sample in the new cluster and the new split node is calculated, and the cluster codes of all samples in the new cluster are updated; otherwise, the process ends;
in step 2, each split node is updated in the following form: the split node numbered M in the i-th subsystem unit carries the region code M to which it belongs and the threshold thresh of the split node;
in step 2, the split-node set ψ_i is the set of split nodes of the i-th subsystem unit, j ∈ [1, q].
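A hedged sketch of the cluster-splitting loop of claim 3 follows. Cosine similarity, the child-code numbering, the tie guard, and the container layout are assumptions standing in for formulas not reproduced in this text; the middle similarity of the sorted cluster is used as the split threshold, as the claim describes.

```python
import math
import random

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = math.sqrt(sum(x * x for x in a)), math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def split_cluster(samples, split_nodes, code, max_size):
    """Recursively split cluster `code` until it holds fewer than max_size samples.

    samples: list of {'feature', 'code'} dicts; split_nodes: code -> (vector, thresh).
    """
    cluster = [s for s in samples if s["code"] == code]
    if len(cluster) < max_size:
        return
    pivot = random.choice(cluster)["feature"]            # random sample as split node
    sims = sorted(cosine(pivot, s["feature"]) for s in cluster)
    thresh = sims[len(sims) // 2]                        # middle similarity as threshold
    if sims[0] == thresh:
        return                                           # too many ties to split on this pivot
    split_nodes[code] = (pivot, thresh)
    left, right = 2 * code + 2, 2 * code + 3             # hypothetical child codes
    for s in cluster:
        s["code"] = left if cosine(pivot, s["feature"]) >= thresh else right
    split_cluster(samples, split_nodes, left, max_size)
    split_cluster(samples, split_nodes, right, max_size)

# Build the split-node set of one subsystem unit, starting from cluster code -1.
data = [{"feature": [random.random(), random.random()], "code": -1} for _ in range(20)]
nodes = {}
split_cluster(data, nodes, -1, max_size=4)
```

The recursion stops either when a cluster is small enough or when the chosen pivot cannot separate it, which keeps the sketch terminating on duplicate feature vectors.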
4. The retrieval method based on the rapid retrieval system facing the massive vector library according to claim 1, characterized in that:
in step 3, the feature vector of the target signal to be retrieved is extracted as:
f_k = F(x_k), k ∈ [0, n]
where x_k is the k-th target signal data to be retrieved, n is the number of target signal data to be retrieved, F denotes the feature extractor, and f_k is the feature vector of the k-th target signal data to be retrieved;
in step 3, the feature vector is packaged into a form containing the feature vector and a cluster code and distributed to each subsystem unit, and the node with cluster code -1 is found in the split-node set of each subsystem unit as the starting point: Id_k, where Id_k denotes the label of the k-th original signal data;
in step 3, the sample of the target signal data to be retrieved is:
data_k = {f_k, code_k = -1}, k ∈ [0, n]
where data_k denotes the sample of the k-th target signal data to be retrieved, code_k denotes the cluster code of that sample with an initial value of -1, and n is the number of target signal data to be retrieved;
in step 3, the similarity between the split node and the sample to be retrieved is calculated and the cluster code of the sample to be retrieved is updated according to the similarity; the similarity sim_AB between the two vectors A and B is calculated from their components A_i and B_i, where n denotes the dimension of the vectors and i ∈ [1, n]; the cluster code of the sample to be retrieved is updated according to the similarity, where code denotes the cluster code to which the sample belongs, sim denotes the similarity computed between the sample and the split node, and thresh denotes the threshold of the split node;
this is repeated until no split node with the same cluster code as the sample to be retrieved can be found in the subsystem unit: each time the cluster code of the sample to be retrieved is updated, the split-node set is searched again for a split node with the same cluster code as the current sample to be retrieved; if one exists, the similarity between the sample to be retrieved and the new split node is calculated and the cluster code of the sample to be retrieved is updated; otherwise, the process ends;
in step 3, all samples in the subsystem unit whose cluster code matches that of the sample to be retrieved at this moment are taken out: the codes of all samples in the sample set of the subsystem unit are traversed, and all samples with the same cluster code as the sample to be retrieved form a set, j ∈ [1, l],
where that set is the set of samples with code = M in the i-th subsystem unit, its j-th element is the j-th sample with code = M, and l is the number of samples in the set;
in step 3, all samples in the subsystem unit whose cluster code matches that of the sample to be retrieved are sorted by similarity, the sorted sample set satisfying k_j ∈ [1, l],
where the j-th sample sorted by sim in the i-th subsystem unit corresponds to the k_j-th sample of that set, l is the number of samples in the set, and each sorted sample carries its own label and feature vector;
in step 3, the central control unit merges the retrieval results uploaded by each subsystem unit to obtain a merged set, k_j ∈ [1, m×N],
where φ_tar denotes the data set obtained by merging the results of all subsystem units and sorting them by sim;
in step 3, samples with identical labels are removed from the merged set;
in step 3, the k samples with the largest similarity are selected as the final output, i ∈ [1, k].
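The per-subsystem search of claim 4 could look like the sketch below, which completes the naive stand-in used in the earlier skeleton: descend the split nodes to a leaf cluster code, collect the samples of that cluster, sort them by similarity, and return the m best. The cosine similarity and child-code numbering remain assumptions.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = math.sqrt(sum(x * x for x in a)), math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def search_subsystem_top_m(feature, samples, split_nodes, m):
    """Return the m most similar samples of the cluster the query falls into.

    samples: list of {'label', 'feature', 'code'} dicts.
    split_nodes: cluster code -> (split vector, threshold).
    """
    code = -1
    while code in split_nodes:                       # descend to a leaf cluster
        vec, thresh = split_nodes[code]
        code = 2 * code + 2 if cosine(feature, vec) >= thresh else 2 * code + 3
    cluster = [dict(s, sim=cosine(feature, s["feature"]))
               for s in samples if s["code"] == code]
    cluster.sort(key=lambda s: s["sim"], reverse=True)
    return cluster[:m]                               # uploaded to the central unit
```

The central control unit would then merge these per-shard lists, drop duplicate labels, and keep the k most similar entries, as in the skeleton after claim 1.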
5. The retrieval method based on the rapid retrieval system facing the massive vector library according to claim 1, characterized in that:
in step 4, the feature vector of the original information of the sample to be inserted is extracted as:
f_k = F(x_k), k ∈ [0, n]
where x_k is the k-th original signal data to be inserted, n is the number of original signal data to be inserted, F denotes the feature extractor, and f_k is the feature vector of the k-th original signal data to be inserted;
in step 4, the label of the original signal data to be inserted is Id_k, where Id_k denotes the label of the k-th original signal data to be inserted;
in step 4, the samples of the original signal data to be inserted are:
data_k = {Id_k, f_k}, k ∈ [0, n]
where data_k denotes the sample of the k-th original signal data to be inserted and n is the number of original signal data to be inserted;
in step 4, the sample to be inserted is inserted into the data set corresponding to the central control unit (as shown in FIG. 2) in label order, using binary search on the label attribute over the data set of the central control unit:
if the label of the sample to be inserted is found in the data set of the central control unit, the feature vector of the sample to be inserted is appended to the feature vectors with the same label in the data set, and the data set after insertion has the form:
φ = {φ_1, φ_2, ..., φ_L}, u ∈ [1, L]
where φ denotes the combined signal data sample set of the central control unit, φ_u denotes the combined signal data sample with the same label as the sample to be inserted, L_u denotes the number of feature vectors in that combined sample, its label corresponds to the label of the k_u-th original signal data sample, its v-th feature vector corresponds to the feature vector of the k_{u,v}-th original signal data sample, v ∈ [1, L_u], and the feature vector of the sample to be inserted is appended to it;
if the label of the sample to be inserted cannot be found in the data set of the central control unit, the insertion position is located by binary search, the sample to be inserted is inserted into the central data set, and the data set after insertion is:
φ = {φ_1, φ_2, ..., φ_L, φ_{L+1}}, u ∈ [1, L+1]
where φ denotes the combined signal data sample set of the central control unit, φ_u denotes the u-th combined signal data sample, L_u denotes the number of feature vectors in the u-th combined signal data sample, its label corresponds to the label of the k_u-th original signal data sample, its v-th feature vector corresponds to the feature vector of the k_{u,v}-th original signal data sample, v ∈ [1, L_u], and φ_{L+1} denotes the data sample of the sample to be inserted;
in step 4, a subsystem unit is randomly selected as the subsystem unit to be inserted, and the cluster code of the sample to be inserted is updated according to step 3;
the samples of signal data to be inserted are encapsulated as:
data_k = {f_k, code_k}, k ∈ [0, n]
where data_k denotes the sample of the k-th target signal data to be inserted, code_k denotes the cluster code of that sample with an initial value of -1, and n is the number of target signal data to be inserted;
the similarity between the split node and the sample to be inserted is calculated and the cluster code of the sample to be inserted is updated according to the similarity; the similarity sim_AB between the two vectors A and B is calculated from their components A_i and B_i, where n denotes the dimension of the vectors and i ∈ [1, n]; the cluster code of the sample to be inserted is updated according to the similarity, where code denotes the cluster code to which the sample belongs, sim denotes the similarity computed between the sample and the split node, and thresh denotes the threshold of the split node;
in step 4, this continues until no corresponding split node can be found in the subsystem unit, namely: each time the cluster code of the sample to be inserted is updated, the split-node set is searched again for a split node with the same cluster code as the sample to be inserted; if one exists, the similarity between the sample to be inserted and the new split node is calculated and the cluster code of the sample to be inserted is updated; otherwise, the process ends;
in step 4, the split-node set ψ_i is the set of split nodes of the i-th subsystem unit, j ∈ [1, q].
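An illustrative sketch of the insertion path of claim 5: binary-insert the new sample into the label-sorted central set, descend the split nodes of one randomly chosen subsystem unit to assign its cluster code, and re-split the cluster if it has grown past the size limit. The helpers `cosine` and `split_cluster` refer to the earlier sketches; the data layouts and child-code numbering are assumptions.

```python
from bisect import bisect_left

def insert_sample(central_set, unit_samples, split_nodes, label, feature,
                  max_cluster_size, split_cluster, cosine):
    """Insert one labelled feature vector into the central set and one shard.

    `cosine` and `split_cluster` are the helpers from the earlier sketches;
    `unit_samples` and `split_nodes` belong to one randomly chosen subsystem unit.
    """
    # Central control unit: binary search over the label-sorted combined set.
    labels = [entry[0] for entry in central_set]
    pos = bisect_left(labels, label)
    if pos < len(labels) and labels[pos] == label:
        central_set[pos][1].append(feature)              # label exists: append feature
    else:
        central_set.insert(pos, (label, [feature]))      # binary insertion by label

    # Subsystem unit: descend the split nodes to assign a cluster code.
    code = -1
    while code in split_nodes:
        vec, thresh = split_nodes[code]
        code = 2 * code + 2 if cosine(feature, vec) >= thresh else 2 * code + 3
    unit_samples.append({"label": label, "feature": feature, "code": code})

    # Re-split the cluster if it has grown past the allowed size.
    if sum(1 for s in unit_samples if s["code"] == code) > max_cluster_size:
        split_cluster(unit_samples, split_nodes, code, max_cluster_size)
```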
6. The retrieval method based on the rapid retrieval system facing the massive vector library according to claim 1, characterized in that:
in step 5, the central control unit searches, by binary search on the label, for the combined data of the sample to be deleted in its corresponding data set; if combined data with the same label as the sample to be deleted is found in the data set of the central control unit, the combined signal data of the label to be deleted is extracted, u ∈ [1, L],
where φ_u denotes the combined signal data sample found in the central control unit whose label equals the label of the sample to be deleted, L denotes the number of feature vectors in that combined sample (i.e. the number of features to be deleted), Id_k denotes the label of the sample to be deleted, and the v-th feature vector to be deleted is taken from that combined sample;
in step 5, each feature vector to be deleted is used as the feature vector to be retrieved and is searched in each subsystem unit by the method of step 3 to obtain the top1 result;
the feature vector of each signal datum of the label to be deleted is packaged as data to be deleted:
data_k = {Id_k, f_k, code_k}, k ∈ [0, n]
where data_k denotes the k-th datum to be deleted, Id_k denotes the label of the target signal to be deleted, f_k denotes the k-th feature vector of the target signal data to be deleted in the central control unit, code_k denotes the cluster code of the k-th datum of the target signal data to be deleted with an initial value of -1, and n is the number of target signal data to be deleted;
the feature vector of each datum to be deleted is used as the feature vector to be retrieved, and top1 retrieval is performed in each subsystem unit in the same way as step 3, namely the similarity between the feature vector of the datum to be deleted and the split node is calculated and the code of the datum to be deleted is updated, until no split node with the same cluster code as the sample to be deleted can be found in the split-node set;
the similarity sim_AB between the two vectors A and B is calculated from their components A_i and B_i, where n denotes the dimension of the vectors and i ∈ [1, n];
the cluster code of the sample to be deleted is updated, where code denotes the cluster code to which the sample belongs, sim denotes the similarity computed between the sample and the split node, and thresh denotes the threshold of the split node;
in step 5, if the label of the retrieval result equals the label of the sample to be deleted, the retrieval-result sample is deleted from the subsystem unit as follows:
data_k denotes the datum to be deleted; the top1 datum output by the i-th subsystem unit when data_k is used as the sample to be retrieved has serial number m in the i-th subsystem unit; if its label equals the label of data_k, the sample with serial number m is deleted from the i-th subsystem unit;
in step 5, the number of samples in the subsystem unit whose cluster code equals that of the retrieval-result sample is counted, and if this number is zero, the split nodes with the same cluster code as the retrieval-result sample are deleted from the split-node set of the subsystem unit.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011269580.6A CN112364080B (en) | 2020-11-13 | 2020-11-13 | Rapid retrieval system and method for massive vector libraries |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112364080A true CN112364080A (en) | 2021-02-12 |
CN112364080B CN112364080B (en) | 2024-04-09 |
Family
ID=74514764
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011269580.6A Active CN112364080B (en) | 2020-11-13 | 2020-11-13 | Rapid retrieval system and method for massive vector libraries |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112364080B (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102156728A (en) * | 2011-03-31 | 2011-08-17 | 河南理工大学 | Improved personalized summary system based on user interest model |
CN103699648A (en) * | 2013-12-26 | 2014-04-02 | 成都市卓睿科技有限公司 | Tree-form data structure used for quick retrieval and implementation method of tree-form data structure |
CN107563715A (en) * | 2017-07-19 | 2018-01-09 | 天津云脉三六五科技有限公司 | Foreign trade set-off marketing system and method |
CN109918529A (en) * | 2019-02-25 | 2019-06-21 | 重庆邮电大学 | An Image Retrieval Method Based on Tree Clustering Vector Quantization |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116595065A (en) * | 2023-05-09 | 2023-08-15 | 上海任意门科技有限公司 | Content duplicate identification method, device, system and storage medium |
CN116595065B (en) * | 2023-05-09 | 2024-04-02 | 上海任意门科技有限公司 | Content duplicate identification method, device, system and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN112364080B (en) | 2024-04-09 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |