CN114817717A

CN114817717A - Search method, search device, computer equipment and storage medium

Info

Publication number: CN114817717A
Application number: CN202210423020.4A
Authority: CN
Inventors: 袁俊杰; 王波; 潘彭丹; 吴潇; 段云; 裴军
Original assignee: Guoke Huadun Beijing Technology Co ltd
Current assignee: Guoke Huadun Beijing Technology Co ltd
Priority date: 2022-04-21
Filing date: 2022-04-21
Publication date: 2022-07-29

Abstract

The present application relates to a search method, apparatus, computer device, storage medium and computer program product. The method comprises the following steps: in response to a query request for a target object, determining a feature vector to be queried corresponding to the target object; dividing the feature vector to be queried according to a preset vector segment division rule to obtain a plurality of segments of sub-vectors to be queried, and performing clustering calculation on the sub-vectors to be queried based on a pre-stored codebook database to obtain sub-distances between each sub-vector to be queried and a plurality of cluster centers corresponding to its segment; determining a target index distance according to the sub-distances corresponding to the plurality of segments of sub-vectors to be queried, and determining a target feature vector corresponding to the target index distance according to the target index distance and the index sub-vectors corresponding to the cluster centers; and searching according to the identification information of the target feature vector to obtain the media data corresponding to the target object. With this method, newly added objects can be retrieved quickly and in real time from massive data.

Description

Search method, search device, computer equipment and storage medium

Technical Field

The present application relates to the field of artificial intelligence technologies, and in particular, to a search method, an apparatus, a computer device, a storage medium, and a computer program product.

Background

With the rapid development of Internet technology and of fifth-generation mobile communication technology (5G), mainstream applications such as Weibo, WeChat, Douyin, and Kuaishou generate massive amounts of media data. This media data may include various types of data, such as picture data, voice data, and video data, and has become the main carrier of information dissemination. As massive media data is generated, data types become increasingly diverse, the adversarial nature of multimedia technology intensifies rapidly, the demand for analyzing media data in service scenarios keeps growing, and the requirements on the dimensions and means of media retrieval become ever higher.

In the related art, media analysis and retrieval methods train a media engine in advance for customized scenes, such as detection of a specific person, a specific environment, or a specific object. Real-time data is analyzed, processed, and scored against these customized scenes to determine whether a scene is satisfied, and data that satisfies a scene is marked. An upper-layer service application system then searches the data according to the marks. As a result, such methods cannot retrieve a newly added object.

Disclosure of Invention

In view of the above, it is necessary to provide a search method, an apparatus, a computer device, a computer-readable storage medium and a computer program product capable of retrieving a newly added object.

In a first aspect, the present application provides a search method. The method comprises the following steps:

responding to a query request aiming at a target object, and determining a feature vector to be queried corresponding to the target object;

dividing the characteristic vector to be queried according to a preset vector segment division rule to obtain a plurality of segments of sub-vectors to be queried;

aiming at the sub-vector to be queried corresponding to each section, carrying out clustering calculation on the sub-vector to be queried based on a pre-stored codebook database to obtain sub-distances between the sub-vector to be queried and a plurality of clustering centers corresponding to each section;

determining a target index distance according to the sub-distances corresponding to the multiple segments of sub-vectors to be queried, and determining a target feature vector corresponding to the target index distance according to the target index distance and the index sub-vectors corresponding to the cluster centers;

and searching according to the identification information of the target feature vector to obtain media data corresponding to the target object.

In one embodiment, the method further comprises:

calculating a feature value of the feature vector to be queried, and querying the feature value in a pre-stored codebook database;

if the feature value is found, determining a target feature vector according to the feature value, wherein the pre-stored codebook database comprises the target feature vector and the feature value of the target feature vector;

and if the feature value is not found, executing the step of dividing the feature vector to be queried according to a preset vector segment division rule to obtain a plurality of segments of sub-vectors to be queried.

In one embodiment, the determining a target index distance according to the sub-distances corresponding to the multiple segments of sub-vectors to be queried includes:

obtaining a plurality of initial index distances according to the to-be-queried subvectors corresponding to the segments and the sub-distances corresponding to the plurality of segments of the to-be-queried subvectors; the initial index distance is obtained by adding the sub-distances of one cluster center corresponding to each segment;

and screening according to a preset screening condition in the plurality of initial index distances, and taking the minimum first number of initial index distances as target index distances.

In one embodiment, the method further comprises:

acquiring media data in a preset message queue, wherein the media data comprises data of at least one target object, and the target object comprises at least one attribute dimension;

determining a target attribute dimension of the media data in the at least one attribute dimension through a vector calculation engine manager, and calculating target feature vectors corresponding to the media data in the target attribute dimension respectively;

aiming at each target feature vector, dividing the target feature vector according to a preset vector segment division rule to obtain a plurality of segments of feature sub-vectors;

aiming at a plurality of feature sub-vectors corresponding to a plurality of target feature vectors in each section, performing clustering calculation on the plurality of feature sub-vectors corresponding to the plurality of target feature vectors in each section through a preset clustering algorithm to obtain a plurality of clustering centers;

calculating a target clustering center corresponding to the characteristic sub-vector, and determining an index of the target clustering center;

obtaining an index sequence of the target feature vector according to indexes corresponding to a plurality of feature sub-vectors contained in the target feature vector;

and adding the target characteristic vector, the identification information of the target characteristic vector, the index sequence corresponding to the target characteristic vector and the identification information of an index file to a preset codebook database, wherein the index file comprises the index sequence of the target characteristic vector.

In one embodiment, the vector calculation engine manager comprises vector calculation engines corresponding to the attribute dimensions respectively;

the determining, by the vector calculation engine manager, a target attribute dimension of the media data in the at least one attribute dimension, and calculating target feature vectors corresponding to the media data in the target attribute dimension, respectively, includes:

determining, by a vector calculation engine manager, a target attribute dimension of the media data among the at least one attribute dimension;

and calculating a target characteristic vector corresponding to the media data in the target attribute dimension through a vector calculation engine corresponding to the target attribute dimension.

In one embodiment, the calculating a target clustering center corresponding to the feature subvector and determining an index of the target clustering center includes:

calculating the distance between the characteristic sub-vector and each clustering center;

and determining a target clustering center corresponding to the characteristic sub-vector and an index of the target clustering center according to the distance between the characteristic sub-vector and each clustering center.

In a second aspect, the present application further provides a search apparatus. The device comprises:

the response module is used for responding to a query request aiming at a target object and determining a feature vector to be queried corresponding to the target object;

the dividing module is used for dividing the characteristic vector to be queried according to a preset vector segment dividing rule to obtain a plurality of segments of sub-vectors to be queried;

the cluster calculation module is used for carrying out cluster calculation on the to-be-queried subvectors aiming at the to-be-queried subvectors corresponding to each section based on a pre-stored codebook database to obtain the sub-distances between the to-be-queried subvectors and a plurality of cluster centers corresponding to each section;

the determining module is used for determining a target index distance according to the sub-distances corresponding to the multi-segment sub-vectors to be queried, and determining a target feature vector corresponding to the target index distance according to the target index distance and the index sub-vectors corresponding to the clustering centers;

and the searching module is used for searching according to the identification information of the target characteristic vector to obtain the media data corresponding to the target object.

In one embodiment, the apparatus further comprises:

the query module is used for calculating a feature value of the feature vector to be queried and querying the feature value in a pre-stored codebook database; if the feature value is found, determining a target feature vector according to the feature value, wherein the pre-stored codebook database comprises the target feature vector and the feature value of the target feature vector; and if the feature value is not found, executing the step of dividing the feature vector to be queried according to a preset vector segment division rule to obtain a plurality of segments of sub-vectors to be queried.

In one embodiment, the determining module is specifically configured to:

In one embodiment, the apparatus further comprises:

the storage module is used for acquiring media data in a preset message queue, wherein the media data comprises data of at least one target object, and the target object comprises at least one attribute dimension; determining a target attribute dimension of the media data in the at least one attribute dimension through a vector calculation engine manager, and calculating target feature vectors corresponding to the media data in the target attribute dimension respectively; aiming at each target feature vector, dividing the target feature vector according to a preset vector segment division rule to obtain a plurality of segments of feature sub-vectors; aiming at a plurality of feature sub-vectors corresponding to a plurality of target feature vectors in each section, performing clustering calculation on the plurality of feature sub-vectors corresponding to the plurality of target feature vectors in each section through a preset clustering algorithm to obtain a plurality of clustering centers; calculating a target clustering center corresponding to the characteristic sub-vector, and determining an index of the target clustering center; obtaining an index sequence of the target feature vector according to indexes corresponding to a plurality of feature sub-vectors contained in the target feature vector; and adding the target characteristic vector, the identification information of the target characteristic vector, the index sequence corresponding to the target characteristic vector and the identification information of an index file to a preset codebook database, wherein the index file comprises the index sequence of the target characteristic vector.

the storage module is specifically configured to:

In one embodiment, the storage module is specifically configured to:

In a third aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor implementing the following steps when executing the computer program:

In a fourth aspect, the present application further provides a computer-readable storage medium. The computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of:

In a fifth aspect, the present application further provides a computer program product. The computer program product comprising a computer program which when executed by a processor performs the steps of:

dividing the characteristic vector to be queried according to a preset vector segment division rule to obtain a plurality of segments of sub vectors to be queried;

and searching according to the identification information of the target characteristic vector to obtain media data corresponding to the target object.

According to the search method, the search apparatus, the computer device, the storage medium and the computer program product, a feature vector to be queried corresponding to a target object is determined in response to a query request for the target object; the feature vector to be queried is divided according to a preset vector segment division rule to obtain a plurality of segments of sub-vectors to be queried; for the sub-vector to be queried corresponding to each segment, clustering calculation is performed on the sub-vector to be queried based on a pre-stored codebook database to obtain sub-distances between the sub-vector to be queried and a plurality of cluster centers corresponding to that segment; a target index distance is determined according to the sub-distances corresponding to the plurality of segments of sub-vectors to be queried, and a target feature vector corresponding to the target index distance is determined according to the target index distance and the index sub-vectors corresponding to the cluster centers; and a search is performed according to the identification information of the target feature vector to obtain the media data corresponding to the target object. With this method, newly added objects can be retrieved quickly and in real time from massive data.

Drawings

FIG. 1 is a schematic flow chart diagram of a search method in one embodiment;

FIG. 2 is a schematic flow chart diagram illustrating the feature value lookup step in one embodiment;

FIG. 3 is a flowchart illustrating the step of obtaining a target index distance in one embodiment;

FIG. 4 is a flowchart illustrating the storing step of the preset codebook database in one embodiment;

FIG. 5 is a flowchart illustrating the step of calculating a target feature vector in one embodiment;

FIG. 6 is a diagram illustrating the structure of calculating a target feature vector according to one embodiment;

FIG. 7 is a flowchart illustrating the step of obtaining a target cluster center in one embodiment;

FIG. 8 is a schematic structural diagram of a search method in another embodiment;

FIG. 9 is a block diagram showing the structure of a search apparatus according to an embodiment;

FIG. 10 is a diagram showing an internal structure of a computer device according to an embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

In an embodiment, as shown in fig. 1, a search method is provided. This embodiment is illustrated by applying the method to a terminal. It is to be understood that the method may also be applied to a server, or to a system including the terminal and the server, in which case it is implemented through interaction between the terminal and the server. The terminal may be, but is not limited to, a personal computer, notebook computer, smart phone, tablet computer, internet of things device, or portable wearable device. The internet of things device may be a smart speaker, smart television, smart air conditioner, smart vehicle-mounted device, and the like. The portable wearable device may be a smart watch, a smart bracelet, a head-mounted device, and the like. The server may be implemented as an independent server or as a server cluster formed by a plurality of servers. In this embodiment, the search method includes the following steps:

Step 102, responding to the query request for the target object, and determining the feature vector to be queried corresponding to the target object.

Specifically, a user may send a query request to a terminal through a retrieval interface in a background device of the service system, where the query request is used to retrieve a target object. The terminal responds to the query request, analyzes the query request, and determines the feature vector to be queried of one attribute dimension or a plurality of attribute dimensions corresponding to the target object to be queried, which is contained in the query request.

In one example, the terminal may perform a parsing operation on the query request through a vector calculation engine manager, and the vector calculation engine manager may return one or more feature vectors to be queried corresponding to the target object included in the query request to the terminal.

In another example, the vector calculation engine manager may have multiple versions. The version information of an engine may include the version number of the engine, the attribute dimensions of the feature vector, online time information, and offline time information. According to the version information of the vector calculation engine manager corresponding to the feature vector to be queried, the terminal may select the corresponding vector calculation engine manager to process the query request, determine the preset vector segment division rule, and so on.
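As a minimal sketch of such version-based selection, the following Python example picks the engine version that is online at a given time for a given attribute dimension. The `EngineVersion` record, the `select_version` helper, the `num_segments` field, and the sample data are hypothetical names introduced for illustration; the application does not specify how version information is stored or matched.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class EngineVersion:
    version: str                      # version number of the engine
    attribute_dimension: str          # attribute dimension of the feature vector
    online_at: datetime               # online time information
    offline_at: Optional[datetime]    # offline time information; None = still online
    num_segments: int                 # assumed: segment count used by the division rule


def select_version(versions: List[EngineVersion], dimension: str,
                   when: datetime) -> Optional[EngineVersion]:
    """Return the version serving `dimension` that was online at time `when`."""
    for v in versions:
        still_online = v.offline_at is None or when < v.offline_at
        if v.attribute_dimension == dimension and v.online_at <= when and still_online:
            return v
    return None


versions = [
    EngineVersion("v1", "face", datetime(2021, 1, 1), datetime(2022, 1, 1), 4),
    EngineVersion("v2", "face", datetime(2022, 1, 1), None, 8),
]
chosen = select_version(versions, "face", datetime(2022, 3, 1))
```

Here the selected version also carries the segment count, so the same record can drive the choice of vector segment division rule.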

Step 104, dividing the feature vector to be queried according to a preset vector segment division rule to obtain a plurality of segments of sub-vectors to be queried.

Specifically, the preset vector segment division rule may be a rule that divides according to the length of the feature vector, or a rule that divides according to a user selection. Therefore, the terminal can divide the feature vector to be queried to obtain a plurality of sections of sub-vectors to be queried.

In one example, the target feature vectors stored in the preset codebook database are also divided according to the preset vector segment division rule; that is, the number of feature sub-vectors of a target feature vector stored in the codebook database is the same as the number of sub-vectors to be queried of the feature vector to be queried.
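A length-based division rule can be sketched as follows. This is an illustrative Python example that assumes equal-length contiguous segments; the function name `split_vector` and the 8-dimensional sample vector are hypothetical.

```python
from typing import List, Sequence


def split_vector(vec: Sequence[float], num_segments: int) -> List[List[float]]:
    """Split `vec` into `num_segments` contiguous sub-vectors of equal length."""
    if num_segments <= 0 or len(vec) % num_segments != 0:
        raise ValueError("vector length must be a positive multiple of num_segments")
    seg_len = len(vec) // num_segments
    return [list(vec[i * seg_len:(i + 1) * seg_len]) for i in range(num_segments)]


# An 8-dimensional query vector divided into 4 segments of 2 dimensions each.
query = [float(i) for i in range(8)]
sub_vectors = split_vector(query, 4)
```

Applying the same function to stored vectors and to query vectors keeps the segment counts aligned, which is what allows segment-by-segment comparison later.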

Step 106, for the sub-vector to be queried corresponding to each segment, performing clustering calculation on the sub-vector to be queried based on a pre-stored codebook database to obtain sub-distances between the sub-vector to be queried and a plurality of cluster centers corresponding to that segment.

Specifically, the terminal performs clustering calculation on the sub-vector to be queried corresponding to each segment through a preset clustering algorithm based on the pre-stored codebook database. The clustering calculation may proceed as follows: first, the terminal obtains, from the pre-stored codebook database, the plurality of cluster centers corresponding to the segment in which the sub-vector to be queried is located; then, the terminal calculates the sub-distance between the sub-vector to be queried and each of these cluster centers.

In one example, the terminal divides the vector A to be queried, and the resulting segments of sub-vectors to be queried may be A1, A2, A3, and A4. For example, the sub-vector to be queried corresponding to the first segment may be A1, and the plurality of cluster centers corresponding to the first segment may be Q1, …, Qq. The terminal then calculates, through the preset clustering algorithm, the sub-distances between the sub-vector A1 to be queried and each of the cluster centers Q1, …, Qq of the first segment. The sub-distances between A2, A3, and A4 and the cluster centers of their respective segments are calculated similarly and are not described again here.
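The sub-distance calculation for one segment can be sketched as below. The application does not name the distance metric, so Euclidean distance is an assumption here, and the helper name `sub_distances` and the sample values are hypothetical.

```python
import math
from typing import List, Sequence


def sub_distances(sub_vector: Sequence[float],
                  centers: Sequence[Sequence[float]]) -> List[float]:
    """Distance from one query sub-vector to every cluster center of its segment.

    Euclidean distance is an assumption; the source does not specify the metric.
    """
    return [math.dist(sub_vector, center) for center in centers]


# Query sub-vector A1 and two cluster centers Q1, Q2 of the first segment.
a1 = [0.0, 0.0]
first_segment_centers = [[3.0, 4.0], [0.0, 1.0]]
d1 = sub_distances(a1, first_segment_centers)
```

Repeating the call for every segment yields one row of sub-distances per segment, which is the input to the index distance calculation in step 108.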

Step 108, determining a target index distance according to the sub-distances corresponding to the plurality of segments of sub-vectors to be queried, and determining a target feature vector corresponding to the target index distance according to the target index distance and the index sub-vectors corresponding to the cluster centers.

Specifically, the terminal may extract one sub-distance for the sub-vector to be queried of each segment, and add the extracted sub-distances across the segments to obtain the target index distance. The terminal can also obtain, from the pre-stored codebook database, the index sub-vectors corresponding to the plurality of cluster centers of each segment. The terminal can therefore obtain the target feature vector corresponding to each target index distance by combining the index sub-vectors of the cluster centers that make up that target index distance.

Step 110, searching according to the identification information of the target feature vector to obtain media data corresponding to the target object.

Specifically, the terminal may determine the identification information of the target feature vector and the identification information of the index file of the target feature vector according to the target feature vector, so that the terminal may return the target feature vector, the identification information of the target feature vector, and the identification information of the index file of the target feature vector to the background device (user) of the service system. The terminal can also search based on the identification information of the target feature vector to obtain one or more pieces of media data corresponding to the target object in the query request.
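The final lookup can be sketched as follows. This illustrative Python example assumes the codebook database can be viewed as two in-memory mappings: one from an index sequence to the identification information of the feature vectors that share it, and one from that identification information to media data. The names `search_media`, `ids_by_sequence`, and `media_by_id`, and the sample records, are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def search_media(target_sequences: Sequence[Sequence[int]],
                 ids_by_sequence: Dict[Tuple[int, ...], List[str]],
                 media_by_id: Dict[str, str]) -> List[str]:
    """Resolve matched index sequences to vector IDs, then to media data."""
    results: List[str] = []
    for sequence in target_sequences:
        for vector_id in ids_by_sequence.get(tuple(sequence), []):
            results.append(media_by_id[vector_id])
    return results


ids_by_sequence = {(1, 0): ["vec-7"], (0, 0): ["vec-2", "vec-9"]}
media_by_id = {"vec-2": "clip_a.mp4", "vec-7": "photo_b.jpg", "vec-9": "clip_c.mp4"}
found = search_media([(1, 0), (0, 0), (1, 1)], ids_by_sequence, media_by_id)
```

Index sequences with no stored vector, such as `(1, 1)` here, simply contribute no results.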

In the above search method, a feature vector to be queried corresponding to a target object is determined in response to a query request for the target object; the feature vector to be queried is divided according to a preset vector segment division rule to obtain a plurality of segments of sub-vectors to be queried; for the sub-vector to be queried corresponding to each segment, clustering calculation is performed on the sub-vector to be queried based on a pre-stored codebook database to obtain sub-distances between the sub-vector to be queried and a plurality of cluster centers corresponding to that segment; a target index distance is determined according to the sub-distances corresponding to the plurality of segments of sub-vectors to be queried, and a target feature vector corresponding to the target index distance is determined according to the target index distance and the index sub-vectors corresponding to the cluster centers; and a search is performed according to the identification information of the target feature vector to obtain the media data corresponding to the target object. With this method, newly added objects can be retrieved quickly and in real time from massive data.

In one embodiment, as shown in fig. 2, the search method further includes:

Step 202, calculating the feature value of the feature vector to be queried, and querying the feature value in a pre-stored codebook database.

The feature value of the feature vector to be queried may be a hash value, an MD5 value, or the like.

Specifically, the terminal calculates the feature value of the feature vector to be queried through a preset feature value algorithm, and queries the feature value in the pre-stored codebook database. If the terminal finds the feature value in the pre-stored codebook database, the target feature vector corresponding to the feature value is stored in the codebook database, and the terminal may execute step 204. If the terminal cannot find the feature value in the pre-stored codebook database, the codebook database does not store the target feature vector corresponding to the feature value, and the terminal may execute step 206.

Step 204, if the feature value is found, determining a target feature vector according to the feature value, wherein the pre-stored codebook database comprises the target feature vector and the feature value of the target feature vector.

Step 206, if the feature value is not found, executing the step of dividing the feature vector to be queried according to a preset vector segment division rule to obtain a plurality of segments of sub-vectors to be queried.

Specifically, if the terminal cannot find the feature value in the pre-stored codebook database, the terminal may perform steps 104 to 110 of the above embodiment.

In this embodiment, the terminal looks up the vector through the correspondence, stored in memory in advance, between the feature value (for example, an MD5 value) and the vector, so that the vector search process can be accelerated and the search efficiency improved.
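A minimal sketch of this exact-match fast path is given below. It assumes the feature value is the MD5 digest of the vector packed as little-endian doubles; the packing scheme, the `ExactMatchCache` class, and the sample IDs are assumptions made for illustration.

```python
import hashlib
import struct
from typing import Dict, Optional, Sequence


def feature_value(vec: Sequence[float]) -> str:
    """MD5 digest of the vector packed as little-endian 64-bit floats (assumed encoding)."""
    packed = struct.pack(f"<{len(vec)}d", *vec)
    return hashlib.md5(packed).hexdigest()


class ExactMatchCache:
    """In-memory map from feature value to the ID of the stored feature vector."""

    def __init__(self) -> None:
        self._id_by_value: Dict[str, str] = {}

    def add(self, vector_id: str, vec: Sequence[float]) -> None:
        self._id_by_value[feature_value(vec)] = vector_id

    def lookup(self, vec: Sequence[float]) -> Optional[str]:
        # A hit corresponds to step 204; None means falling back to steps 104 to 110.
        return self._id_by_value.get(feature_value(vec))


cache = ExactMatchCache()
cache.add("vec-1", [0.1, 0.2, 0.3, 0.4])
hit = cache.lookup([0.1, 0.2, 0.3, 0.4])
miss = cache.lookup([0.1, 0.2, 0.3, 0.5])
```

A hash lookup only matches vectors that are identical bit for bit, so it complements the segment-based search rather than replacing it.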

In an embodiment, as shown in fig. 3, a specific process of step 108 "determining a target index distance according to sub-distances corresponding to a plurality of sub-vectors to be queried" includes:

Step 302, obtaining a plurality of initial index distances according to the sub-vectors to be queried corresponding to the segments and the sub-distances corresponding to the plurality of segments of sub-vectors to be queried.

The initial index distance is obtained by adding the sub-distances of one cluster center corresponding to each segment.

Specifically, the terminal may extract the sub-distance of each segment from the sub-distances of the to-be-queried sub-vector corresponding to each segment and the plurality of clustering centers corresponding to each segment, and the terminal may add the extracted sub-distances to obtain an initial index distance. In this way, by repeating the process of the above embodiment, the terminal can obtain a plurality of initial index distances.

Step 304, screening the plurality of initial index distances according to a preset screening condition, and taking the first number of smallest initial index distances as target index distances.

Specifically, the preset filtering condition may be to perform filtering according to the size of each initial index distance, and extract a first number of minimum initial index distances as the target index distance of the feature vector to be queried.

In one example, the terminal sorts the initial index distances by size and takes the top-k initial index distances with the smallest values as the target index distances.
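Steps 302 and 304 can be sketched together as follows. The example enumerates every combination of one cluster center per segment, sums the corresponding sub-distances to form an initial index distance, and keeps the k smallest. The function name `top_k_index_distances` and the sample sub-distances are hypothetical, and exhaustive enumeration is used only for clarity: with q centers per segment and m segments it visits q^m combinations.

```python
import heapq
import itertools
from typing import List, Sequence, Tuple


def top_k_index_distances(sub_dists: Sequence[Sequence[float]],
                          k: int) -> List[Tuple[float, Tuple[int, ...]]]:
    """Return the k smallest (index distance, center index per segment) pairs.

    sub_dists[s][c] is the sub-distance between the query sub-vector of
    segment s and cluster center c of that segment.
    """
    center_choices = [range(len(row)) for row in sub_dists]
    candidates = (
        (sum(sub_dists[s][c] for s, c in enumerate(combo)), combo)
        for combo in itertools.product(*center_choices)
    )
    return heapq.nsmallest(k, candidates)


# Two segments with two cluster centers each.
sub_dists = [[1.0, 4.0], [2.0, 0.5]]
best = top_k_index_distances(sub_dists, k=2)
```

Each returned tuple of center indexes is an index sequence that can be matched against the index sequences stored for the target feature vectors.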

In one embodiment, as shown in fig. 4, the search method further includes:

Step 402, acquiring media data in a preset message queue.

Wherein the media data comprises data of at least one target object, the target object comprising at least one attribute dimension.

Specifically, the terminal obtains a preset message queue, where the message queue may be a real-time in-and-out queue, the message queue includes at least one media data of at least one target object, and the attribute dimension of the media data may include a face attribute dimension, an entity attribute dimension, a text attribute dimension, a keyword attribute dimension in text, and the like.

Step 404, determining a target attribute dimension of the media data in at least one attribute dimension through the vector calculation engine manager, and calculating target feature vectors corresponding to the media data in the target attribute dimension.

Specifically, the terminal determines a target attribute dimension of the media data through the vector calculation engine manager, so that the terminal can calculate a plurality of target feature vectors of the media data in the target attribute dimension. Likewise, there may be a plurality of target attribute dimensions, with target feature vectors calculated for each of them.
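The dispatch performed by the vector calculation engine manager can be sketched as below. The `VectorEngineManager` class and its methods are hypothetical, and the two registered engines are toy placeholders that stand in for real feature extractors.

```python
from typing import Callable, Dict, List

# An engine maps raw media bytes to a feature vector (assumed interface).
Engine = Callable[[bytes], List[float]]


class VectorEngineManager:
    """Holds one vector calculation engine per attribute dimension."""

    def __init__(self) -> None:
        self._engines: Dict[str, Engine] = {}

    def register(self, dimension: str, engine: Engine) -> None:
        self._engines[dimension] = engine

    def target_dimensions(self, requested: List[str]) -> List[str]:
        """Keep only the requested dimensions that have a registered engine."""
        return [d for d in requested if d in self._engines]

    def compute(self, dimension: str, media: bytes) -> List[float]:
        return self._engines[dimension](media)


manager = VectorEngineManager()
# Placeholder engines: real ones would run face or text models.
manager.register("face", lambda media: [float(len(media)), 0.0])
manager.register("text", lambda media: [0.0, float(media.count(b" "))])

media = b"hello world"
dimensions = manager.target_dimensions(["face", "text", "audio"])
vectors = {d: manager.compute(d, media) for d in dimensions}
```

Keeping one engine per attribute dimension means a new dimension can be supported by registering another engine without touching the rest of the pipeline.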

Step 406, for each target feature vector, dividing the target feature vector according to a preset vector segment division rule to obtain a plurality of segments of feature sub-vectors.

Specifically, because each target feature vector is a multidimensional feature vector, the terminal may divide each target feature vector into multiple segments according to its length, so as to obtain the plurality of feature sub-vectors corresponding to that target feature vector.

In one example, the target feature vector may be an n-dimensional feature vector, and thus, the terminal may divide the target feature vector into m segments, i.e., each target feature vector includes m segments of feature sub-vectors. Alternatively, n may be 128 and m may be 4.

Step 408, for a plurality of feature sub-vectors corresponding to the plurality of target feature vectors in each segment, performing cluster calculation on the plurality of feature sub-vectors corresponding to the plurality of target feature vectors in each segment through a preset clustering algorithm to obtain a plurality of cluster centers.

Specifically, the preset clustering algorithm may be a k-means clustering algorithm. The terminal can obtain a plurality of target feature vectors, and thus, for the feature sub-vectors respectively corresponding to each target feature vector in each segment, the terminal can perform clustering calculation on the feature sub-vectors respectively corresponding to each target feature vector in each segment through a preset clustering algorithm (k-means clustering algorithm) to obtain a preset number of clustering centers.
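The per-segment clustering can be sketched with a small pure-Python k-means (Lloyd's algorithm). The function names `kmeans` and `train_codebook`, the fixed iteration count, the random seed, and the toy 4-dimensional vectors are assumptions made for illustration; a production system would more likely use an optimized library.

```python
import math
import random
from typing import List


def kmeans(points: List[List[float]], k: int,
           iterations: int = 20, seed: int = 0) -> List[List[float]]:
    """Plain Lloyd's k-means; returns k cluster centers."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        groups: List[List[List[float]]] = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            groups[nearest].append(p)
        for c, members in enumerate(groups):
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers


def train_codebook(vectors: List[List[float]], num_segments: int,
                   k: int) -> List[List[List[float]]]:
    """Cluster the sub-vectors of each segment separately."""
    seg_len = len(vectors[0]) // num_segments
    return [
        kmeans([v[s * seg_len:(s + 1) * seg_len] for v in vectors], k)
        for s in range(num_segments)
    ]


vectors = [
    [0.0, 0.0, 10.0, 10.0],
    [0.0, 1.0, 10.0, 11.0],
    [5.0, 5.0, 0.0, 0.0],
    [5.0, 6.0, 0.0, 1.0],
]
codebook = train_codebook(vectors, num_segments=2, k=2)
```

The result holds one list of cluster centers per segment, so `codebook[s][c]` is center `c` of segment `s`.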

Step 410, calculating a target cluster center corresponding to the feature sub-vector and determining an index of the target cluster center.

Specifically, the terminal may calculate the distance between the feature sub-vector and each cluster center through the preset clustering algorithm, select the target cluster center of the feature sub-vector according to these distances, and determine the index of the target cluster center, where the index is the number of the target cluster center among the preset number of cluster centers of its segment.

Optionally, the terminal may number the preset number of clustering centers according to the sequence obtained by clustering.

Step 412, obtaining an index sequence of the target feature vector according to the indexes corresponding to the plurality of feature sub-vectors included in the target feature vector.

Specifically, the target feature vector includes a plurality of segments of feature sub-vectors, and the terminal may respectively calculate the index of the target clustering center corresponding to the feature sub-vector of each segment in the target feature vector by using the method in the foregoing embodiment, so that the terminal may determine the index sequence of the target feature vector according to the index of the target clustering center corresponding to the feature sub-vector of each segment.

In one example, the target feature vector A may include a plurality of feature sub-vectors A1, A2, A3 and A4, where the feature sub-vector A1 corresponds to index a1, A2 corresponds to index a2, A3 corresponds to index a3, and A4 corresponds to index a4, so that the index sequence of the target feature vector may be [a1, a2, a3, a4].
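Steps 410 and 412 can be sketched together as follows. The sketch assumes squared Euclidean distance, equal-length segments and zero-based numbering of the clustering centers; the function name `encode` and the toy codebook are hypothetical.

```python
from typing import List

Vector = List[float]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def encode(vector: Vector, codebook: List[List[Vector]]) -> List[int]:
    """Return the index sequence: for each segment, the number of its nearest center."""
    seg_len = len(vector) // len(codebook)
    sequence = []
    for s, centers in enumerate(codebook):
        sub = vector[s * seg_len:(s + 1) * seg_len]
        sequence.append(
            min(range(len(centers)), key=lambda c: squared_distance(sub, centers[c]))
        )
    return sequence


# Toy codebook: 2 segments, 2 centers per segment (illustrative values only).
codebook = [
    [[0.0, 0.0], [5.0, 5.0]],
    [[10.0, 10.0], [20.0, 20.0]],
]
index_sequence = encode([4.0, 6.0, 11.0, 9.0], codebook)
print(index_sequence)  # [1, 0]
```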

Step 414, adding the target feature vector, the identification information of the target feature vector, the index sequence corresponding to the target feature vector, and the identification information of the index file to the preset codebook database, wherein the index file includes the index sequence of the target feature vector.

Specifically, the terminal stores the target feature vector and its corresponding index sequence in the preset codebook database. Each target feature vector has identification information (ID), and the index sequence of each target feature vector is stored in an index file, so the terminal also stores the identification information of the index file and the identification information of the target feature vector in the preset codebook database.
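One possible in-memory shape for such a store is sketched below. The application does not specify the storage layout, so the class `CodebookStore`, its methods and the choice of keying the inverted index by the whole index sequence are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class Entry:
    vector_id: str       # identification information of the target feature vector
    index_file_id: str   # identification information of the index file


class CodebookStore:
    """Hypothetical inverted index: index sequence -> entries sharing that sequence."""

    def __init__(self) -> None:
        self._postings: Dict[Tuple[int, ...], List[Entry]] = defaultdict(list)

    def add(self, vector_id: str, index_file_id: str, index_sequence: Sequence[int]) -> None:
        self._postings[tuple(index_sequence)].append(Entry(vector_id, index_file_id))

    def lookup(self, index_sequence: Sequence[int]) -> List[Entry]:
        return list(self._postings.get(tuple(index_sequence), []))


store = CodebookStore()
store.add("vec-001", "file-A", [1, 0, 3, 2])
store.add("vec-002", "file-A", [1, 0, 3, 2])
store.add("vec-003", "file-B", [0, 0, 1, 1])
print([e.vector_id for e in store.lookup([1, 0, 3, 2])])  # ['vec-001', 'vec-002']
```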

In this embodiment, feature vectors are calculated separately for the multiple attribute dimensions of the media data, and the vectors and index files corresponding to the media data are stored by means of an inverted index and time-sliced data, so that the feature vectors can be stored in a distributed manner, which improves the read-write efficiency and the disaster tolerance capability of the data.

In one embodiment, the vector calculation engine manager includes a vector calculation engine corresponding to each attribute dimension.

Accordingly, as shown in fig. 5, the specific processing procedure of "determining a target attribute dimension of the media data in at least one attribute dimension by the vector calculation engine manager, and calculating target feature vectors corresponding to the target attribute dimensions of the media data respectively" includes:

Step 502, determining, by the vector calculation engine manager, a target attribute dimension of the media data among the at least one attribute dimension.

Specifically, the terminal may calculate the media data through the vector calculation engine manager, and determine one or more attribute dimensions corresponding to the media data.

Step 504, a target feature vector corresponding to the media data in the target attribute dimension is calculated through a vector calculation engine corresponding to the target attribute dimension.

Specifically, the terminal may calculate the media data through a vector calculation engine corresponding to each target attribute dimension, respectively, to obtain a target feature vector of the media data in the target attribute dimension.

In one example, as shown in fig. 6, a vector calculation engine manager (vector manager) may be connected with a plurality of vector calculation engines (vector engines), including vector engine A, vector engine B and vector engine C. The terminal obtains the media data from the message queue and sends it to the vector manager, and the vector manager sends the media data to the corresponding vector calculation engine according to the attribute dimension of the media data. For example, the terminal can send face data to the engine of the face dimension to extract a face feature vector; the terminal may also send text data to the engine of the semantic dimension to extract a semantic feature vector, and so on.
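The dispatch described in this example can be sketched as a registry keyed by attribute dimension. The class `VectorManager` and the two toy engines are hypothetical stand-ins; real engines would run feature-extraction models rather than the trivial functions used here.

```python
from typing import Callable, Dict, List

Engine = Callable[[bytes], List[float]]


class VectorManager:
    """Hypothetical engine manager that routes media data by attribute dimension."""

    def __init__(self) -> None:
        self._engines: Dict[str, Engine] = {}

    def register(self, dimension: str, engine: Engine) -> None:
        self._engines[dimension] = engine

    def compute(self, dimension: str, media: bytes) -> List[float]:
        if dimension not in self._engines:
            raise KeyError(f"no vector engine registered for dimension {dimension!r}")
        return self._engines[dimension](media)


# Toy engines (placeholders, not real feature extractors).
def face_engine(media: bytes) -> List[float]:
    return [float(len(media))]


def semantic_engine(media: bytes) -> List[float]:
    return [float(len(media.split()))]


manager = VectorManager()
manager.register("face", face_engine)
manager.register("semantic", semantic_engine)
print(manager.compute("semantic", b"hello vector search"))  # [3.0]
```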

In an embodiment, as shown in fig. 7, the specific processing procedure of "calculating a target clustering center corresponding to the feature subvector and determining an index of the target clustering center" includes:

Step 602, calculating the distance between the feature sub-vector and each cluster center.

Specifically, for a corresponding feature sub-vector in each segment of the target feature vector, the terminal calculates the distances between the feature sub-vector and a plurality of clustering centers obtained by clustering in the segment through a preset clustering algorithm.

Step 604, determining a target clustering center corresponding to the feature sub-vector and an index of the target clustering center according to the distance between the feature sub-vector and each clustering center.

Specifically, the terminal may use a cluster center corresponding to the minimum distance as a target cluster center of the feature sub-vector, and use an index of the target cluster center to represent the feature sub-vector. Wherein the index of the target cluster center may be the number of the cluster center.

In one example, the terminal performs clustering calculation on the feature sub-vectors of the plurality of target feature vectors in each segment to obtain a plurality of clustering centers in that segment, and the terminal may number each clustering center in the order in which it is obtained, so as to obtain the index of each clustering center.

The actual implementation of the above search method is described in detail below with reference to fig. 8:

Step 1, engine management: for each registered vector engine, the engine manager stores the version information of that engine, including the version number, the feature vector dimension, the online time, the offline time and other information. Vector engines of different versions differ in algorithm and vector dimension, so a query is matched against vectors of the same version according to the version information, which allows the vector data corresponding to different versions to be queried.
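A minimal sketch of such version bookkeeping is given below. The field names, the use of integer timestamps and the selection rule (the version whose online/offline interval covers a given time) are assumptions; the application only lists which items the version information contains.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class EngineVersion:
    version: str
    dimension: int                      # feature vector dimension
    online_time: int                    # e.g. a Unix timestamp (assumption)
    offline_time: Optional[int] = None  # None means the version is still online


def version_at(versions: List[EngineVersion], timestamp: int) -> Optional[EngineVersion]:
    """Return the version that was online at the given time, if any."""
    for v in versions:
        still_online = v.offline_time is None or timestamp < v.offline_time
        if v.online_time <= timestamp and still_online:
            return v
    return None


versions = [
    EngineVersion("v1", dimension=128, online_time=0, offline_time=100),
    EngineVersion("v2", dimension=256, online_time=100),
]
print(version_at(versions, 50).version, version_at(versions, 150).dimension)  # v1 256
```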

Step 2, vector calculation: data can be accessed through a message queue. Specifically, the engine manager may acquire the media data in the message queue through a gRPC/HTTP interface and issue the media data to the vector engine corresponding to each attribute dimension to extract the vector features of the target object. The attribute dimensions may include a face, an entity, text, keywords in the text, and the like. The engine scheduler (engine manager) sends the media data to engines of different dimensions according to the attributes of the target object; for example, face data is sent to the engine of the face dimension to extract a face feature vector, and text data is sent to the engine of the semantic dimension to extract a semantic feature vector.

In addition, the vector query process can be accelerated by caching in memory the correspondence between an md5 value and its vector (the target feature vector). After the engines complete the vector calculation, the resulting feature vectors of the different dimensions are pushed onward for further processing.
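The caching idea can be sketched as follows, assuming that the md5 value is computed over the raw media bytes (the application does not say what is hashed). The wrapper class `CachedEngine` and the toy engine are hypothetical.

```python
import hashlib
from typing import Callable, Dict, List


class CachedEngine:
    """Wraps an engine and caches md5(media bytes) -> feature vector in memory."""

    def __init__(self, engine: Callable[[bytes], List[float]]) -> None:
        self._engine = engine
        self._cache: Dict[str, List[float]] = {}
        self.engine_calls = 0  # counts how often the wrapped engine actually ran

    def compute(self, media: bytes) -> List[float]:
        key = hashlib.md5(media).hexdigest()
        if key not in self._cache:
            self.engine_calls += 1
            self._cache[key] = self._engine(media)
        return self._cache[key]


def toy_engine(media: bytes) -> List[float]:
    return [float(b) for b in media[:4]]  # placeholder for real feature extraction


cached = CachedEngine(toy_engine)
first = cached.compute(b"same picture")
second = cached.compute(b"same picture")  # served from the cache
print(cached.engine_calls)  # 1
```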

Step 3, vector indexing: the results of the vector engines are partitioned into tables according to the different dimensions. For example, a 128-dimensional feature vector can be divided into 4 segments, k-means clustering is performed on the data of each segment, and 256 cluster centers are obtained for each segment, so that each segment of the original feature vector can be re-encoded with the serial number of a cluster center of that segment. Because one of 256 serial numbers fits in a single byte, only 4 bytes are needed to represent one original feature vector. The mapping between the serial numbers of the cluster centers and the segmented vectors is called a codebook. Vectors are stored by means of an inverted index; the identification information (ID) of the vector and the ID of its index file are stored for each vector, and the file ID and the index ID are returned during searching.
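The storage arithmetic can be checked with a few lines of Python. The comparison with the uncompressed size assumes that the original vector is stored as 32-bit floats, which the application does not state; the code values are arbitrary examples.

```python
codes = [17, 255, 0, 42]   # hypothetical cluster-center serial numbers, one per segment
packed = bytes(codes)      # each serial number in 0..255 occupies exactly one byte
raw_size = 128 * 4         # 128 dimensions x 4 bytes, ASSUMING float32 elements
print(len(packed), raw_size, raw_size // len(packed))  # 4 512 128
```

Under that assumption the encoded form is 128 times smaller than the original vector, at the cost of representing each segment only by its nearest cluster center.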

In addition, the vector index monitors and communicates the service state of each vector store.

Step 4, vector retrieval: a Top-K retrieval service for high-dimensional feature vectors is provided, which supports queries across different vector space versions and time partitions and returns the vector IDs within the queried time together with the index file IDs of the corresponding vectors. The specific search process may be as follows. The back end of the service system is accessed through a web page and sends a feature query request to the engine manager; the engine manager receives the request, computes the corresponding vector and returns it. After the interface obtains this vector, it sends it as the vector to be queried through the retrieval interface. The vector to be queried is divided into 4 segments for clustering calculation, which yields the distance between each segment and each index in the preset codebook database, the original index vectors being obtained through the codebook. The four groups of distances between the four segments and the index are added to determine the distance between the vector to be queried and each index vector. The distances are then sorted to obtain the Top-K closest vectors to be matched, and the ID of each vector to be matched and the ID of its index file are returned. In this way, the media data of the target object can be screened from the basic information base through the ID of each vector to be matched and the ID of its index file.
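The retrieval procedure can be sketched as below. The sketch assumes squared Euclidean sub-distances (so that adding them per segment gives the squared distance to the reconstructed vector), equal-length segments, and a flat list of stored records; the names `sub_distance_table` and `search` and the toy data are hypothetical.

```python
import heapq
from typing import List, Sequence, Tuple

Codebook = List[List[List[float]]]        # codebook[segment][center] -> center vector
Record = Tuple[str, str, Sequence[int]]   # (vector ID, index file ID, index sequence)


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def sub_distance_table(query: Sequence[float], codebook: Codebook) -> List[List[float]]:
    """table[s][c]: distance between query segment s and center c of that segment."""
    seg_len = len(query) // len(codebook)
    return [
        [squared_distance(query[s * seg_len:(s + 1) * seg_len], center) for center in centers]
        for s, centers in enumerate(codebook)
    ]


def search(query: Sequence[float], codebook: Codebook,
           records: List[Record], top_k: int) -> List[Tuple[float, str, str]]:
    """Add the per-segment sub-distances for every record and keep the top_k closest."""
    table = sub_distance_table(query, codebook)
    scored = (
        (sum(table[s][i] for s, i in enumerate(seq)), vector_id, file_id)
        for vector_id, file_id, seq in records
    )
    return heapq.nsmallest(top_k, scored)


# Toy setup: 2 segments with 2 centers each (the example above uses 4 segments).
codebook = [
    [[0.0, 0.0], [5.0, 5.0]],
    [[10.0, 10.0], [20.0, 20.0]],
]
records = [("v1", "f1", [0, 0]), ("v2", "f1", [1, 0]), ("v3", "f2", [1, 1])]
results = search([4.0, 6.0, 11.0, 9.0], codebook, records, top_k=2)
print(results)  # [(4.0, 'v2', 'f1'), (54.0, 'v1', 'f1')]
```

The sub-distance table is computed once per query, so scoring each stored vector costs only one table lookup per segment instead of a full distance computation.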

It should be understood that, although the steps in the flowcharts related to the above embodiments are shown in sequence as indicated by the arrows, the steps are not necessarily executed in that sequence. Unless explicitly stated otherwise, the execution of the steps is not strictly limited to the order shown, and the steps may be performed in other orders. Moreover, at least a part of the steps in the flowcharts related to the above embodiments may include multiple sub-steps or multiple stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least a part of the sub-steps or stages of other steps.

Based on the same inventive concept, the embodiment of the present application further provides a search apparatus for implementing the search method mentioned above. The solution provided by the apparatus is similar to the solution described in the method, so for the specific limitations in one or more embodiments of the search apparatus provided below, reference may be made to the limitations on the search method above, which are not repeated here.

In one embodiment, as shown in fig. 9, there is provided a search apparatus 700, including:

a response module 701, configured to determine, in response to a query request for a target object, a feature vector to be queried corresponding to the target object;

a dividing module 702, configured to divide the feature vector to be queried according to a preset vector segment dividing rule to obtain multiple segments of sub-vectors to be queried;

a cluster calculation module 703, configured to perform cluster calculation on the to-be-queried subvectors based on a pre-stored codebook database for the to-be-queried subvectors corresponding to each segment, so as to obtain sub-distances between the to-be-queried subvectors and multiple cluster centers corresponding to each segment;

a determining module 704, configured to determine a target index distance according to the sub-distances corresponding to the multiple segments of sub-vectors to be queried, and determine a target feature vector corresponding to the target index distance according to the target index distance and the index sub-vectors corresponding to each cluster center;

a searching module 705, configured to search according to the identification information of the target feature vector to obtain media data corresponding to the target object.

In one embodiment, the apparatus further comprises:

In one embodiment, the determining module is specifically configured to:

In one embodiment, the apparatus further comprises:

the storage module is specifically configured to:

In one embodiment, the storage module is specifically configured to:

The modules in the search apparatus 700 may be implemented in whole or in part by software, hardware, or a combination thereof. The modules can be embedded in, or be independent of, a processor in the computer device in hardware form, or can be stored in a memory in the computer device in software form, so that the processor can call and execute the operations corresponding to the modules.

In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 10. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing feature vector data as well as media data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a search method.

Those skilled in the art will appreciate that the architecture shown in fig. 10 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or fewer components than those shown, or may combine certain components, or have a different arrangement of components.

In one embodiment, a computer device is further provided, which includes a memory and a processor, the memory stores a computer program, and the processor implements the steps of the above method embodiments when executing the computer program.

In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.

In an embodiment, a computer program product is provided, comprising a computer program which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.

It should be noted that, the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data for analysis, stored data, presented data, etc.) referred to in the present application are information and data authorized by the user or sufficiently authorized by each party.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, Resistive Random Access Memory (ReRAM), Magnetic Random Access Memory (MRAM), Ferroelectric Random Access Memory (FRAM), Phase Change Memory (PCM), graphene memory, and the like. Volatile memory can include Random Access Memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others. The databases referred to in various embodiments provided herein may include at least one of relational and non-relational databases. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processors referred to in the embodiments provided herein may be general purpose processors, central processing units, graphics processors, digital signal processors, programmable logic devices, quantum computing based data processing logic devices, etc., without limitation.

The technical features of the above embodiments can be arbitrarily combined. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction between the combined technical features, such combinations should be considered to be within the scope of the present specification.

The above-mentioned embodiments only express several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the present application. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present application shall be subject to the appended claims.

Claims

1. A method of searching, the method comprising:

2. The method of claim 1, further comprising:

3. The method according to claim 1, wherein the determining a target index distance according to the sub-distances corresponding to the plurality of segments of sub-vectors to be queried comprises:

4. The method of claim 1, further comprising:

5. The method of claim 4, wherein the vector calculation engine manager comprises a vector calculation engine corresponding to each attribute dimension;

6. The method according to claim 4, wherein the calculating a target cluster center corresponding to the feature subvector and determining an index of the target cluster center comprises:

7. A search apparatus, characterized in that the apparatus comprises:

8. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 6.

9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.

10. A computer program product comprising a computer program, characterized in that the computer program realizes the steps of the method of any one of claims 1 to 6 when executed by a processor.