CN113326388A - Data retrieval method, system, medium and device based on inverted list - Google Patents


Publication number
CN113326388A
Authority
CN
China
Prior art keywords
data
input
inverted
retrieval
label
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110554146.0A
Other languages
Chinese (zh)
Inventor
杨乔
田国栋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Yunconghuilin Artificial Intelligence Technology Co ltd
Original Assignee
Shanghai Yunconghuilin Artificial Intelligence Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Yunconghuilin Artificial Intelligence Technology Co ltd filed Critical Shanghai Yunconghuilin Artificial Intelligence Technology Co ltd
Priority to CN202110554146.0A priority Critical patent/CN113326388A/en
Publication of CN113326388A publication Critical patent/CN113326388A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/51 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/71 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172 Classification, e.g. identification

Abstract

The invention belongs to the technical field of data retrieval, and particularly relates to a data retrieval method, system, medium and device based on an inverted list. The invention aims to solve the problem that existing methods, which encode the temporal and spatial attributes of a face photo into its features, lengthen the face features, raise the storage requirements on the device and slow down the retrieval process. To this end, the invention performs data retrieval in all inverted lists based on an input feature vector and input label information, wherein the inverted lists store the feature vectors and the label information of the data, and the label information has spatio-temporal attributes. The method can therefore support large-scale vector similarity retrieval with time/space range requirements while reducing the length of the feature vectors of the data and the required storage capacity of the device. It can also reduce the number of feature vectors in the inverted lists that need to be matched against the input feature vector, making data retrieval more efficient.

Description

Data retrieval method, system, medium and device based on inverted list
Technical Field
The invention belongs to the technical field of data retrieval, and particularly relates to a data retrieval method, a data retrieval system, a data retrieval medium and a data retrieval device based on an inverted list.
Background
Smart communities, intelligent security, AI cities and similar applications all involve large-scale retrieval of picture or video data, and retrieval based on an inverted index is a common method in large-scale vector similarity retrieval.
At present, taking large-scale face feature retrieval as an example, a face photo is shot by a specific camera in a specific area and therefore has a spatial attribute, and its shooting time can serve as its temporal attribute. When the inverted list is constructed, the temporal and spatial attributes of the face photo are generally encoded into the face feature and treated as part of it, and the face feature is then added to the inverted list and thereby to the database.
However, encoding the temporal and spatial attributes of face photos into the features makes the face features longer, raises the storage requirements on the device, and lengthens the retrieval process.
Accordingly, there is a need in the art for an improved inverted table-based data retrieval method that addresses the above-mentioned problems.
Disclosure of Invention
The invention provides a data retrieval method, system, medium and device based on an inverted list in order to solve, or at least partially solve, the following problem: existing methods that encode the temporal and spatial attributes of a face photo into its features lengthen the face features, raise the storage requirements on the device, and lengthen the retrieval process.
In a first aspect, the present invention provides a data retrieval method based on an inverted table, including: acquiring input feature vectors and input label information required by data retrieval; performing data retrieval in all inverted lists based on the input feature vectors and the input label information; the inverted list stores characteristic vectors of data and label information, and the label information has a time-space attribute.
As a preferable aspect of the data search method according to the present invention, the step of "performing data search in all inverted tables based on the input feature vector and the input label information" includes: respectively carrying out similarity retrieval on the input feature vector and the cluster centers of all inverted lists to obtain a plurality of similar inverted lists similar to the input feature vector; respectively searching data in the similar inverted lists according to the initial label and the end label of the input label information to find a plurality of first feature vectors corresponding to the label information in the interval of the initial label and the end label; and respectively carrying out vector similarity retrieval on the input feature vector and the plurality of first feature vectors to determine target data corresponding to topK second feature vectors which are most similar to the input feature vector.
As a preferable aspect of the data retrieval method according to the present invention, the step of "performing vector similarity retrieval on the input feature vector and the plurality of first feature vectors, respectively, to determine target data corresponding to topK second feature vectors that are most similar to the input feature vector" specifically includes: scanning the plurality of first feature vectors in each similar inverted list and comparing them with the input feature vector, collecting all second feature vectors through a max-heap/min-heap of size topK according to the comparison results, then heap-sorting all the second feature vectors, and determining the target data corresponding to all the second feature vectors according to the heap-sort result.
As a preferred technical solution of the data retrieval method provided by the present invention, the input feature vector is a feature vector of data to be newly added, and the input tag information is tag information of the data to be newly added; the step of "performing data retrieval in all inverted tables based on the input feature vector and the input tag information" includes: respectively carrying out similarity retrieval on the input feature vector and the cluster centers of all inverted lists to obtain a most similar inverted list which is most similar to the input feature vector; and searching the position to be inserted of the data to be newly added in the most similar inverted list based on the input label information, and storing the data to be newly added to the position to be inserted.
As a preferable aspect of the data retrieval method provided by the present invention, the method for generating the inverted list includes: acquiring data containing feature vectors and label information in a training sample; clustering the feature vectors of all data in the training sample according to the preset number of cluster centers; and sorting the data in each resulting cluster in ascending order of label information to form a plurality of inverted lists.
As a preferable aspect of the data retrieval method provided by the present invention, the method for deleting data in the inverted list includes: performing binary search in all the inverted lists based on the starting label and the ending label of the data to be deleted; and removing from the inverted lists all label information and feature vectors of the data to be deleted that fall within the interval between the starting label and the ending label.
In a second aspect, the present invention further provides an inverted table-based data retrieval system, including: the first acquisition module is used for acquiring input feature vectors and input label information required by data retrieval; the retrieval module is used for retrieving data in all the inverted lists based on the input feature vectors and the input label information; the inverted list stores characteristic vectors of data and label information, and the label information has a time-space attribute.
As a preferable technical solution of the data retrieval system provided by the present invention, the retrieval module specifically includes: the first similarity retrieval module is used for respectively carrying out similarity retrieval on the input feature vector and the cluster centers of all the inverted lists to obtain a plurality of similar inverted lists similar to the input feature vector; the first searching module is used for respectively searching data in the similar inverted lists according to the starting label and the ending label of the input label information so as to find a plurality of first feature vectors corresponding to the label information in the interval of the starting label and the ending label; and the vector similarity retrieval module is used for respectively carrying out vector similarity retrieval on the input feature vector and the first feature vectors so as to determine target data corresponding to topK second feature vectors which are most similar to the input feature vector.
As a preferred technical solution of the data retrieval system provided by the present invention, the vector similarity retrieval module is specifically configured to: scan the plurality of first feature vectors in each similar inverted list and compare them with the input feature vector, collect all second feature vectors through a max-heap/min-heap of size topK according to the comparison results, then heap-sort all the second feature vectors, and determine the target data corresponding to all the second feature vectors according to the heap-sort result.
As a preferred technical solution of the data retrieval system provided by the present invention, the input feature vector is a feature vector of data to be newly added, and the input tag information is tag information of the data to be newly added; the retrieval module specifically further comprises: the second similarity retrieval module is used for respectively carrying out similarity retrieval on the input feature vector and the cluster centers of all the inverted lists to obtain a most similar inverted list which is most similar to the input feature vector; and the second searching module is used for searching the position to be inserted of the data to be newly added in the most similar inverted list based on the input label information and storing the data to be newly added to the position to be inserted.
As a preferred technical solution of the above data retrieval system provided by the present invention, the data retrieval system further comprises a training module for establishing the inverted list; the training module specifically comprises: the second acquisition module is used for acquiring data containing the characteristic vectors and the label information in the training samples; the clustering module is used for clustering the feature vectors of all data in the training sample according to the number of the preset cluster centers; and the sorting module is used for sorting the data in each clustered cluster in an ascending mode of label information to form a plurality of inverted lists.
In a third aspect, the present invention further provides a computer-readable storage medium, in which a plurality of program codes are stored, and the program codes are adapted to be loaded and executed by a processor to perform the data retrieval method in any of the foregoing first aspects.
In a fourth aspect, the present invention further provides an apparatus for data retrieval based on an inverted table, including a processor and a memory, where the memory stores a plurality of program codes, and the program codes are adapted to be loaded and executed by the processor to perform the data retrieval method according to any one of the foregoing first aspects.
In the data retrieval method based on the inverted list, data retrieval is carried out in all inverted lists based on input feature vectors and input label information, wherein the inverted lists store the feature vectors and the label information of the data, and the label information has space-time attributes. Therefore, the method can support large-scale vector similarity retrieval with time/space range requirements, reduce the length of the feature vector of the data and reduce the requirements on the storage capacity of the equipment. Meanwhile, data retrieval is carried out in all inverted lists based on the input feature vectors and the input label information, the number of the feature vectors needing to be matched with the input feature vectors in the inverted lists can be reduced, and the data retrieval process is more efficient.
Further, in the data retrieval method based on the inverted list provided by the invention, on one hand, similarity retrieval is respectively carried out on the input feature vector and cluster centers of all inverted lists to narrow the data retrieval range; then, data searching is respectively carried out in a plurality of similar inverted lists obtained by similarity searching according to the initial label and the end label of the input label information, and the data searching range is further reduced; and finally, carrying out vector similarity retrieval on the input feature vectors and a plurality of first feature vectors obtained by data searching to obtain target data. Therefore, the efficiency of data retrieval is greatly improved.
Drawings
Specific embodiments of the present embodiments are described below with reference to the accompanying drawings, in which:
FIG. 1 is a schematic main flow chart of a data retrieval method based on an inverted list according to the present embodiment;
FIG. 2 is a detailed flowchart of the data retrieving method based on the inverted list according to the present embodiment;
FIG. 3 is a schematic diagram of an inverted table used in the inverted table-based data retrieval method according to the embodiment;
FIG. 4 is a schematic diagram of a hardware structure of a first terminal device provided in this embodiment;
FIG. 5 is a schematic diagram of a hardware structure of a second terminal device provided in this embodiment.
Detailed Description
Some embodiments of the invention are described below with reference to the accompanying drawings. It should be understood by those skilled in the art that these embodiments are only for explaining the technical principle of the present invention, and are not intended to limit the scope of the present invention.
In the description of the present invention, a "module" or "processor" may include hardware, software, or a combination of both. A module may comprise hardware circuitry, various suitable sensors, communication ports, memory, may comprise software components such as program code, or may be a combination of software and hardware. The processor may be a central processing unit, microprocessor, image processor, digital signal processor, or any other suitable processor. The processor has data and/or signal processing functionality. The processor may be implemented in software, hardware, or a combination thereof. Non-transitory computer readable storage media include any suitable medium that can store program code, such as magnetic disks, hard disks, optical disks, flash memory, read-only memory, random-access memory, and the like.
The term "A and/or B" denotes all possible combinations of A and B, such as A alone, B alone, or A and B. The term "at least one A or B" or "at least one of A and B" has a meaning similar to "A and/or B" and may include only A, only B, or both A and B. The singular forms "a", "an" and "the" may include the plural forms as well. Of course, the alternative embodiments and the preferred embodiments described herein can also be combined with one another to form new embodiments suited to more specific application scenarios.
The embodiment provides a data retrieval method, system, medium and device based on an inverted list in order to solve, or at least partially solve, the following problem: existing methods that encode the temporal and spatial attributes of a face photo into its features lengthen the face features, raise the storage requirements on the device, and lengthen the retrieval process.
First aspect
The embodiment provides a data retrieval method based on an inverted list, as shown in fig. 1, the method includes:
S1, acquiring input feature vectors and input label information required by data retrieval.
S2, data retrieval is carried out in all inverted lists based on the input feature vectors and the input label information; the inverted list stores characteristic vectors of data and label information, and the label information has space-time attributes.
As shown in fig. 3, the inverted table includes a feature container storing the feature vectors of the data and a tag container storing the tag information of the data, and the tag information has a spatiotemporal attribute. Specifically, fig. 3 shows n inverted tables, each corresponding to one cluster center. The tag container is an ordered container: the data in each inverted table are arranged in ascending order of tag, and the feature vectors in the feature container follow the same order as the tag information. For example, fig. 3 shows that inverted table 1 contains the 1st to the m-th data, where "id_1_1" denotes the tag information of the 1st data in the tag container of inverted table 1, and "feature1_1" denotes the feature vector of the 1st data in the feature container of inverted table 1.
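The layout described above can be sketched as a small data structure. This is only an illustration under stated assumptions: the class name `InvertedList`, its field names, and the `is_consistent` helper are hypothetical and do not appear in the patent.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class InvertedList:
    """One inverted table: a cluster center plus two parallel containers.

    All names here are illustrative assumptions, not taken from the patent.
    """
    center: List[float]
    # Tag container: kept in ascending order.
    tags: List[int] = field(default_factory=list)
    # Feature container: features[i] belongs to tags[i].
    features: List[List[float]] = field(default_factory=list)

    def is_consistent(self) -> bool:
        """Check that both containers line up and that tags are ascending."""
        same_len = len(self.tags) == len(self.features)
        ascending = all(a <= b for a, b in zip(self.tags, self.tags[1:]))
        return same_len and ascending


table_1 = InvertedList(
    center=[0.0, 0.0],
    tags=[3, 7, 9],
    features=[[0.1, 0.2], [0.0, 0.3], [0.2, 0.1]],
)
```

Keeping the two containers parallel means that an index range found in the tag container can be applied directly to the feature container.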
For example, the data to which the inverted table-based data retrieval method of the present embodiment applies may be video data, picture data, sound data, text data, or any other data that can be represented by a feature vector, and such data generally carries tag information (ID). For example, the number of the camera monitoring a specific area can represent the spatial information of the photo or video data acquired by that camera, and the recorded capture time can represent the time information of the photo or video data. Specifically, the tag information of the data may be represented by a 64-bit binary number in which the first 32 bits represent spatial information (e.g., a camera number) and the last 32 bits represent time information, so that the tag information has spatiotemporal properties. By storing the spatio-temporal information in the tag information of the data rather than in the feature vector, the embodiment shortens the feature vector and can support large-scale vector similarity retrieval with time/space range requirements.
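A minimal sketch of such a 64-bit tag follows. It assumes that "the first 32 bits" means the high-order bits and that the time is a non-negative integer that fits in 32 bits (for instance a Unix timestamp in seconds); the function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def pack_tag(camera_id: int, timestamp: int) -> int:
    """Pack a camera number (assumed high 32 bits) and a time (low 32 bits)."""
    if not (0 <= camera_id <= MASK32 and 0 <= timestamp <= MASK32):
        raise ValueError("both fields must fit in 32 unsigned bits")
    return (camera_id << 32) | timestamp


def unpack_tag(tag: int) -> tuple:
    """Split a 64-bit tag back into (camera_id, timestamp)."""
    return tag >> 32, tag & MASK32


def camera_time_range(camera_id: int, t_start: int, t_end: int) -> tuple:
    """Start tag and end tag covering one camera's time window."""
    return pack_tag(camera_id, t_start), pack_tag(camera_id, t_end)
```

Under this assumed layout, ascending tag order groups records by camera first and by time second, so one camera's time window corresponds to a single contiguous tag interval.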
It can be understood that, in the data retrieval method based on the inverted table provided in this embodiment, data retrieval is performed in all inverted tables based on the input feature vector and the input tag information, where the inverted tables store the feature vector of the data and the tag information, and the tag information has a spatiotemporal attribute. Therefore, the method can support large-scale vector similarity retrieval with time/space range requirements, reduce the length of the feature vector of the data and reduce the requirements on the storage capacity of the equipment. Meanwhile, data retrieval is carried out in all inverted lists based on the input feature vectors and the input label information, the number of the feature vectors needing to be matched with the input feature vectors in the inverted lists can be reduced, and the data retrieval process is more efficient.
As a preferred implementation of the foregoing data retrieval method provided in this embodiment, consider the case where data with certain characteristics within a certain time and/or space range needs to be retrieved, for example, when the movement track of a person at different times needs to be found from an input face photo. In this case the input feature vector is the feature vector corresponding to the face photo of that person and, as shown in fig. 2, step S2 specifically includes:
S211, respectively carrying out similarity retrieval on the input feature vectors and the cluster centers of all the inverted lists to obtain a plurality of similar inverted lists similar to the input feature vectors.
Illustratively, each inverted table corresponds to a cluster, whose cluster center can be the average feature vector of all data in the cluster. The distances from the input feature vector to the cluster centers of the N inverted tables are calculated, the cluster centers are sorted from nearest to farthest, and the inverted tables corresponding to the S cluster centers closest to the input feature vector are taken as the similar inverted tables.
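The coarse selection of S similar inverted tables might look like the sketch below. The patent only speaks of a "distance", so the use of Euclidean distance here is an assumption, as is the function name `nearest_lists`.

```python
import heapq
import math


def nearest_lists(query, centers, s):
    """Return the indices of the s cluster centers closest to the query.

    Euclidean distance is an assumption; the result is ordered nearest first.
    """
    return heapq.nsmallest(
        s, range(len(centers)), key=lambda i: math.dist(query, centers[i])
    )


centers = [[0.0, 0.0], [10.0, 10.0], [1.0, 1.0], [5.0, 5.0]]
similar = nearest_lists([0.9, 0.9], centers, 2)
```

Only the inverted tables whose indices are returned need to be visited in the later steps.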
S212, respectively searching data in the similar inverted lists according to the starting label and the ending label of the input label information to find a plurality of first feature vectors corresponding to the label information in the interval of the starting label and the ending label.
Illustratively, the S similar inverted tables are traversed and a binary search is performed on the tag container of each one to find the data whose tags lie within the interval between the start tag and the end tag, together with the corresponding first feature vectors.
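Because the tag container is sorted, the interval lookup can be sketched with Python's standard `bisect` module. Treating both the start tag and the end tag as inclusive is an assumption, and the function names are hypothetical.

```python
from bisect import bisect_left, bisect_right


def tag_range(tags, start_tag, end_tag):
    """Return (lo, hi) such that tags[lo:hi] lie in [start_tag, end_tag].

    tags must be sorted ascending; inclusive bounds are an assumption.
    """
    lo = bisect_left(tags, start_tag)
    hi = bisect_right(tags, end_tag)
    return lo, hi


def first_feature_vectors(tags, features, start_tag, end_tag):
    """Slice the parallel feature container with the range found in the tags."""
    lo, hi = tag_range(tags, start_tag, end_tag)
    return features[lo:hi]


tags = [10, 20, 20, 30, 40]
features = [[0.1], [0.2], [0.3], [0.4], [0.5]]
```

Two binary searches bound the candidate range, so the cost of this step grows with the logarithm of the list length rather than with the list length itself.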
S213, respectively carrying out vector similarity retrieval on the input feature vector and the first feature vectors to determine target data corresponding to topK second feature vectors which are most similar to the input feature vector.
Step S213 may specifically include: scanning the plurality of first feature vectors in each similar inverted list and comparing them with the input feature vector, collecting all second feature vectors through a max-heap/min-heap of size topK according to the comparison results, then heap-sorting all the second feature vectors, and determining the target data corresponding to all the second feature vectors according to the heap-sort result.
Illustratively, when the first feature vectors in each similar inverted table are scanned and compared with the input feature vector, a similarity score is obtained for each first feature vector; in each similar inverted table, the topK most similar data can then be selected and heap-sorted according to their scores.
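The bounded-heap collection can be sketched as follows. This assumes an inner-product similarity where a higher score means more similar, which the patent does not specify, and it uses Python's built-in `sorted` for the final ordering in place of an explicit heap sort; the function names are hypothetical.

```python
import heapq


def dot(a, b):
    """Inner product, used here as an assumed similarity score."""
    return sum(x * y for x, y in zip(a, b))


def top_k(query, candidates, k):
    """Collect the k best (score, tag) pairs from (tag, feature) candidates.

    A min-heap of size k keeps the weakest retained score at heap[0], so a
    new candidate only enters if it beats that score.
    """
    if k <= 0:
        return []
    heap = []
    for tag, feat in candidates:
        score = dot(query, feat)
        if len(heap) < k:
            heapq.heappush(heap, (score, tag))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, tag))
    # Final ordering, best first.
    return sorted(heap, reverse=True)


candidates = [(1, [0.9, 0.1]), (2, [0.1, 0.9]), (3, [0.5, 0.5]), (4, [1.0, 0.0])]
best = top_k([1.0, 0.0], candidates, 2)
```

For a distance metric where smaller is better, the same idea applies with a max-heap, which may be what the "maximum heap/minimum heap" wording refers to.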
It can be understood that, in the data retrieval method based on the inverted table provided in this embodiment, on one hand, similarity retrieval is performed on the input feature vector and cluster centers of all inverted tables, respectively, so as to narrow the data retrieval range; then, data searching is respectively carried out in a plurality of similar inverted lists obtained by similarity searching according to the initial label and the end label of the input label information, and the data searching range is further reduced; and finally, carrying out vector similarity retrieval on the input feature vectors and a plurality of first feature vectors obtained by data searching to obtain target data. Therefore, the efficiency of data retrieval is greatly improved.
As a preferred implementation of the data retrieval method provided in this embodiment, the input feature vector is a feature vector of data to be newly added, and the input tag information is tag information of the data to be newly added; for example, when a new photo of a vehicle needs to be added, the inverted list storing that vehicle type needs to be found and the photo inserted at the appropriate position. In this case, step S2 may specifically include:
S221, respectively carrying out similarity retrieval on the input feature vector and the cluster centers of all the inverted lists to obtain the most similar inverted list which is most similar to the input feature vector.
For example, the distances from the input feature vector to the N cluster centers may be calculated, and the inverted table corresponding to the cluster center closest to the input feature vector may be found as the most similar inverted table after sorting according to the distances.
S222, searching a to-be-inserted position of the to-be-added data in the most similar inverted list based on the input label information, and storing the to-be-added data to the to-be-inserted position. For example, the position at which the data to be newly added should be inserted into the inverted list can be found according to the ascending order of the tags. At the position to be inserted, the feature vector of the data to be newly added is stored in the feature container and its tag information is stored in the tag container.
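Steps S221 and S222 can be sketched together as below. The dictionary layout with `"tags"` and `"features"` keys, the Euclidean distance, and the function name `insert_item` are all assumptions made for illustration.

```python
import math
from bisect import bisect_right


def insert_item(centers, lists, feature, tag):
    """Insert (tag, feature) into the list of the nearest cluster center.

    lists[i] is assumed to be a dict with parallel, tag-sorted 'tags' and
    'features' entries belonging to centers[i]. Returns (list index, position).
    """
    # S221: find the most similar inverted list (nearest center).
    idx = min(range(len(centers)), key=lambda i: math.dist(feature, centers[i]))
    # S222: binary-search the insertion position that keeps tags ascending.
    pos = bisect_right(lists[idx]["tags"], tag)
    lists[idx]["tags"].insert(pos, tag)
    lists[idx]["features"].insert(pos, feature)
    return idx, pos


centers = [[0.0, 0.0], [10.0, 10.0]]
lists = [
    {"tags": [1, 5, 9], "features": [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]]},
    {"tags": [], "features": []},
]
where = insert_item(centers, lists, [0.5, 0.5], 6)
```

Inserting into both containers at the same position keeps them aligned, so later range lookups on the tags still select the right feature vectors.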
As a preferred implementation of the data retrieval method provided in this embodiment, the method for generating the inverted list includes:
S101, obtaining data containing feature vectors and label information in training samples.
Illustratively, the data in the training samples must have label information (ID) and feature vectors. In this step, the number of data in the training sample is determined from the preset number of inverted lists (i.e., the number of cluster centers) and the maximum number of data allowed to participate in each cluster center. For example, if the number of cluster centers N is 4096 and at most 256 data participate in each cluster center, the total amount of data participating in training does not exceed 4096 × 256 = 1,048,576. When too much training data is available, it must first be sampled. When too little training data is available, more must first be collected so that the generated cluster centers are more representative.
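The size cap and the sampling step can be illustrated with a short sketch. Uniform random sampling without replacement is an assumption, since the text only says that sampling is required; the function name and the `seed` parameter are hypothetical.

```python
import random


def sample_training_set(data, n_centers, max_per_center, seed=0):
    """Cap the training set at n_centers * max_per_center items.

    Uniform sampling without replacement is an assumption.
    """
    cap = n_centers * max_per_center
    if len(data) <= cap:
        return list(data)
    return random.Random(seed).sample(data, cap)


# The cap from the example in the text: 4096 centers, 256 items each.
cap_in_text = 4096 * 256
small = sample_training_set(list(range(100)), n_centers=4, max_per_center=10)
```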
S102, clustering the feature vectors of all data in the training sample according to the number of the preset cluster centers.
Illustratively, the specific implementation procedure of step S102 may be as follows:
1) initialization
N data are randomly selected from the training sample as the initial cluster centers; the data are normalized, and each dimension is then rounded.
2) Iteratively adjusting cluster centers
Each round of the iterative algorithm comprises the following steps:
2.1 data clustering: for each data item in the training sample, all cluster centers are traversed, the distance from the current data to each cluster center is calculated, and the closest cluster center is found; the current data is then considered to belong to that cluster and is stored in it.
2.2 update cluster center: the generated clusters are traversed, the training data belonging to the same cluster are accumulated, their average feature vector is computed, and that average is taken as the new cluster center.
2.3 large cluster splitting: because of the initialization or for other reasons during iteration, some clusters may end up with too much data and others with very little. To keep the generated inverted lists from being so unbalanced in length that concurrent retrieval suffers, a large cluster with too much data may be split into two clusters.
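The initialization and steps 2.1 and 2.2 follow the familiar k-means pattern and can be sketched as below. This sketch deliberately omits the normalization and rounding of step 1) and the large-cluster splitting of step 2.3, whose details the text does not give; Euclidean distance, the fixed number of rounds, and all names are assumptions.

```python
import math
import random


def kmeans(data, n_centers, rounds=10, seed=0):
    """Plain k-means sketch: random initial centers, then assign and update.

    Returns (centers, clusters), where clusters holds the assignment computed
    in the final round, before that round's center update.
    """
    rng = random.Random(seed)
    # 1) Initialization: pick n_centers distinct data items as centers.
    centers = [list(v) for v in rng.sample(data, n_centers)]
    clusters = [[] for _ in centers]
    for _ in range(rounds):
        # 2.1 Assign every item to its nearest center.
        clusters = [[] for _ in centers]
        for v in data:
            nearest = min(
                range(len(centers)), key=lambda i: math.dist(v, centers[i])
            )
            clusters[nearest].append(v)
        # 2.2 Replace each center by the mean of its members.
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous center
                dim = len(members[0])
                centers[i] = [
                    sum(m[d] for m in members) / len(members) for d in range(dim)
                ]
    return centers, clusters


data = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centers, clusters = kmeans(data, 2, rounds=5, seed=1)
```

A production version would typically stop when assignments no longer change rather than after a fixed number of rounds.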
S103, sorting the data in each cluster in ascending order of label information to form a plurality of inverted lists.
It can be understood that the feature vectors of all the data in the same clustered cluster are stored in the feature container, the label information of all the data in the same cluster is stored in the label container, and all the data in the same clustered cluster are sorted in an ascending manner according to the label information to form an inverted list, as shown in fig. 3. After iterative convergence in the clustering process, N inverted tables are formed, and N cluster centers are used as the entries of the inverted tables.
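Turning a finished clustering into inverted lists can be sketched as follows. The input format (a list of `(tag, feature)` pairs plus one cluster index per item) and the dictionary layout of the output are assumptions made for illustration.

```python
def build_inverted_lists(items, assignments, n_centers):
    """Group (tag, feature) items by cluster and sort each group by tag.

    assignments[i] is the cluster index of items[i]. Each returned list is a
    dict with parallel 'tags' and 'features' containers in ascending tag order.
    """
    buckets = [[] for _ in range(n_centers)]
    for (tag, feat), cluster in zip(items, assignments):
        buckets[cluster].append((tag, feat))
    lists = []
    for bucket in buckets:
        bucket.sort(key=lambda pair: pair[0])
        lists.append({
            "tags": [t for t, _ in bucket],
            "features": [f for _, f in bucket],
        })
    return lists


items = [(9, [1.0]), (3, [2.0]), (5, [3.0]), (1, [4.0])]
inverted = build_inverted_lists(items, assignments=[0, 1, 0, 0], n_centers=2)
```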
As a preferred implementation of the data retrieval method provided in this embodiment, when data in a specific time-space interval needs to be deleted, for example photos or videos taken by a certain camera during a certain time period, the method for deleting data in the inverted list includes:
S41: performing a binary search in all inverted lists based on the start label and the end label of the data to be deleted.
S42: removing from the inverted list all label information and feature vectors of the data to be deleted that fall within the interval between the start label and the end label.
The inverted list comprises a feature container for storing the feature vectors of the data and a label container for storing the label information of the data, and the label information has a spatiotemporal attribute.
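Because the label container is sorted, the deletion steps amount to two binary searches followed by removing one contiguous slice from both containers. The sketch below assumes comparable labels such as integer timestamps and an interval that includes both the start label and the end label; the function name `delete_range` is hypothetical.

```python
from bisect import bisect_left, bisect_right


def delete_range(labels, features, start_label, end_label):
    """Remove every entry whose label lies in [start_label, end_label].

    `labels` must be sorted in ascending order and index-aligned with
    `features`. Returns the number of removed entries.
    """
    lo = bisect_left(labels, start_label)    # first position with label >= start
    hi = bisect_right(labels, end_label)     # first position with label > end
    del labels[lo:hi]
    del features[lo:hi]
    return hi - lo
```

To delete a time-space interval from the whole index, this function would be applied to each inverted list in turn.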
It can be understood that, because the inverted list used by the data retrieval and deletion method of this embodiment includes both a feature container and a label container, data may also be retrieved using only the interval between a start label and an end label; when the label information of the data to be deleted lies within that interval, the data in the retrieval result can be deleted directly. The inverted-list-based data retrieval and deletion method of this embodiment can therefore support large-scale vector similarity retrieval with time/space range requirements, meet the corresponding retrieval needs of users, and make the retrieval process more efficient.
It should be noted that although the steps of the method of this embodiment are described in detail above, those skilled in the art can combine, split, or reorder these steps without departing from the basic principle of this embodiment; technical solutions modified in this way do not change the basic concept of this embodiment and therefore also fall within its protection scope. For example, in fig. 3 and steps S211 to S213 of the above embodiment, a similarity search is first performed between the input feature vector and the cluster centers of all inverted lists, and a data search is then performed in the inverted lists according to the start label and the end label of the input label information. However, it is also possible to first perform the data search in the inverted lists according to the start label and the end label of the input label information, and then perform the similarity search between the input feature vector and the cluster centers of the inverted lists on the basis of that search result; this still falls within the protection scope of the present invention.
Second aspect of the invention
This embodiment also provides an inverted-list-based data retrieval system, including: a first acquisition module for acquiring the input feature vector and input label information required for data retrieval; and a retrieval module for retrieving data in all inverted lists based on the input feature vector and the input label information. The inverted list stores the feature vectors and label information of the data, and the label information has a spatiotemporal attribute.
As a preferred implementation manner of the data retrieval system provided in this embodiment, the retrieval module specifically includes: the first similarity retrieval module is used for respectively carrying out similarity retrieval on the input feature vectors and the cluster centers of all the inverted lists to obtain a plurality of similar inverted lists similar to the input feature vectors; the first searching module is used for respectively searching data in the similar inverted lists according to the starting label and the ending label of the input label information so as to find a plurality of first feature vectors corresponding to the label information in the interval of the starting label and the ending label; and the vector similarity retrieval module is used for respectively carrying out vector similarity retrieval on the input feature vector and the first feature vectors so as to determine target data corresponding to topK second feature vectors which are most similar to the input feature vector.
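The first two modules can be sketched as a coarse search over the cluster centers followed by a range lookup in each selected list. The dictionary layout, the function names, the inner product as similarity measure and the `n_probe` parameter (how many similar lists to visit) are assumptions for illustration only; the text does not fix them.

```python
from bisect import bisect_left, bisect_right


def dot(a, b):
    """Inner product, used here as an assumed similarity measure."""
    return sum(x * y for x, y in zip(a, b))


def candidates_in_range(inverted_lists, query, start_label, end_label, n_probe=2):
    """Return (label, feature) pairs from the n_probe most similar lists
    whose labels lie in [start_label, end_label]."""
    # Coarse step: rank the lists by similarity of their center to the query.
    ranked = sorted(inverted_lists,
                    key=lambda lst: dot(query, lst["center"]),
                    reverse=True)
    found = []
    for lst in ranked[:n_probe]:
        # Range step: two binary searches on the sorted label container.
        lo = bisect_left(lst["labels"], start_label)
        hi = bisect_right(lst["labels"], end_label)
        found.extend(zip(lst["labels"][lo:hi], lst["features"][lo:hi]))
    return found
```

The returned pairs play the role of the first feature vectors, which are then compared with the input feature vector in the final similarity step.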
As a preferred implementation of the data retrieval system provided in this embodiment, the vector similarity retrieval module is specifically configured to: scan the plurality of first feature vectors in each similar inverted list and compare them with the input feature vector; collect all second feature vectors, according to the comparison results, through a max-heap/min-heap of size topK; then heap-sort all the second feature vectors and determine the target data corresponding to them according to the heap-sort result.
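A bounded heap keeps only the topK best candidates while the first feature vectors are scanned. The sketch below assumes that a larger inner product means a more similar vector, so a min-heap of size k is used and its smallest entry is evicted; for a distance measure the comparison would be reversed. The function name `top_k` is hypothetical, and the final ordering uses Python's built-in sort on the k collected entries rather than an explicit heap sort.

```python
import heapq


def top_k(query, candidates, k):
    """candidates: iterable of (label, feature) pairs.

    Returns up to k (label, score) pairs, most similar first.
    """
    heap = []  # min-heap of (score, -position, label); heap[0] is the weakest kept entry
    for position, (label, feature) in enumerate(candidates):
        score = sum(q * f for q, f in zip(query, feature))
        entry = (score, -position, label)
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif entry > heap[0]:
            # Better than the weakest kept entry: replace it.
            heapq.heapreplace(heap, entry)
    # Order the k survivors from most to least similar.
    return [(label, score) for score, _, label in sorted(heap, reverse=True)]
```

With n candidates this costs O(n log k) comparisons, and memory stays bounded by k regardless of how many inverted lists are scanned.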
As a preferred implementation manner of the data retrieval system provided in this embodiment, the input feature vector is a feature vector of data to be newly added, and the input tag information is tag information of the data to be newly added; the retrieval module further comprises: the second similarity retrieval module is used for respectively retrieving the similarity of the input feature vector and the cluster centers of all the inverted lists to obtain the most similar inverted list which is most similar to the input feature vector; and the second searching module is used for searching the position to be inserted of the data to be newly added in the most similar inverted list based on the input label information and storing the data to be newly added to the position to be inserted.
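Adding new data follows the same pattern: choose the most similar inverted list, then binary-search the label container for the insertion position so that the ascending order is preserved. The dictionary layout, the function name `insert_item` and the inner product as similarity measure are assumptions for illustration.

```python
from bisect import bisect_right


def insert_item(inverted_lists, feature, label):
    """Insert (feature, label) into the most similar list.

    Returns (list_index, position) of the inserted entry.
    """
    def similarity(index):
        center = inverted_lists[index]["center"]
        return sum(x * y for x, y in zip(feature, center))

    # Most similar list by cluster center.
    best = max(range(len(inverted_lists)), key=similarity)
    target = inverted_lists[best]
    # Position that keeps the label container in ascending order.
    position = bisect_right(target["labels"], label)
    target["labels"].insert(position, label)
    target["features"].insert(position, feature)
    return best, position
```

Using `bisect_right` places a new item after existing items with an equal label, so insertion order is kept among equal labels.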
As a preferred implementation manner of the data retrieval system provided in this embodiment, the data retrieval system further includes a training module for establishing an inverted list; the training module specifically comprises: the second acquisition module is used for acquiring data containing the characteristic vectors and the label information in the training samples; the clustering module is used for clustering the feature vectors of all data in the training sample according to the number of the preset cluster centers; and the sorting module is used for sorting the data in each clustered cluster in an ascending mode of the label information to form a plurality of inverted lists.
It should be noted that the data retrieval system based on the inverted list provided in this embodiment corresponds to the data retrieval method based on the inverted list in the first aspect, so that details of the system in this embodiment are not repeated, and for the description of the system, refer to the contents in the first aspect.
It should be further noted that the inverted-list-based data retrieval system provided in the foregoing embodiment is illustrated only with the above division of functional modules (such as the first acquisition module and the retrieval module). In practical applications, these functions may be assigned to different functional modules as needed; that is, the functional modules in the embodiment of the present invention may be further decomposed or combined. For example, the functional modules in the foregoing embodiment may be combined into one functional module, or further split into multiple sub-modules, to complete all or part of the functions described above. The names of the functional modules in the embodiment of the present invention serve only to distinguish them and are not to be construed as an improper limitation on the embodiment.
Third aspect of the invention
It will be appreciated by those skilled in the art that the present embodiment provides a computer readable storage medium, wherein a plurality of program codes are stored, and the program codes are suitable for being loaded and executed by a processor to execute the data retrieval method in any of the foregoing embodiments of the first aspect.
The storage medium includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) or a processor to perform some steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Fourth aspect of the invention
The present embodiment also provides an apparatus for data retrieval based on an inverted table, the apparatus includes a processor and a memory, the memory stores a plurality of program codes, and the program codes are suitable for being loaded and executed by the processor to perform the data retrieval method in any of the foregoing first aspect embodiments.
Fifth aspect of the invention
The embodiment further explains the implementation of the data retrieval method based on the inverted list, mainly by applying the method to a scene of a terminal device. The hardware structure of the terminal device is shown in fig. 4. The terminal device may include: an input device 1100, a first processor 1101, an output device 1102, a first memory 1103, and at least one communication bus 1104. The communication bus 1104 is used to implement communication connections between the elements. The first memory 1103 may include a high-speed RAM memory, and may also include a non-volatile storage NVM, such as at least one disk memory, and the first memory 1103 may store various programs for performing various processing functions and implementing the method steps of the present embodiment.
Alternatively, the first processor 1101 may be, for example, a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), a Digital Signal Processor (DSP), a Digital Signal Processing Device (DSPD), a Programmable Logic Device (PLD), a Field Programmable Gate Array (FPGA), a controller, a microcontroller, a microprocessor, or other electronic components, and the first processor 1101 is coupled to the input device 1100 and the output device 1102 through a wired or wireless connection.
Optionally, the input device 1100 may include a variety of input devices, such as at least one of a user-facing user interface, a device-facing device interface, a software programmable interface, a camera, and a sensor. Optionally, the device-facing device interface may be a wired interface for data transmission between devices, or a hardware plug-in interface (e.g., a USB interface, a serial port, etc.) for data transmission between devices; optionally, the user-facing user interface may be, for example, a user-facing control key, a voice input device for receiving voice input, or a touch sensing device (e.g., a touch screen with a touch sensing function, a touch pad, etc.) for receiving user touch input; optionally, the software programmable interface may be, for example, an entry through which a user edits or modifies a program, such as an input pin interface or an input interface of a chip. The output device 1102 may include output devices such as a display and audio equipment. In this embodiment, the processor of the terminal device includes functions for executing each module of the data retrieval system described above; for the specific functions and technical effects, refer to the above embodiments, which are not repeated here.
Fig. 5 is a schematic hardware structure diagram of a terminal device according to another embodiment of the present application. Fig. 5 is a specific embodiment of the implementation process of fig. 4. As shown in fig. 5, the terminal device of the present embodiment may include a second processor 1201 and a second memory 1202.
The second processor 1201 executes the computer program code stored in the second memory 1202 to implement the inverted-list-based data retrieval method of the first aspect, as shown in figs. 1 and 2. The second memory 1202 is configured to store various types of data to support operations at the terminal device. Examples of such data include instructions for any application or method operating on the terminal device, as well as messages, pictures, videos, and so forth. The second memory 1202 may include a Random Access Memory (RAM) and may also include a non-volatile memory, such as at least one disk memory.
Optionally, a second processor 1201 is provided in the processing module 1200. The terminal device may further include: a communication module 1203, a power module 1204, a multimedia module 1205, a voice module 1206, input/output interfaces 1207, and/or a sensor module 1208. The specific components included in the terminal device are set according to actual requirements, which is not limited in this embodiment.
The processing module 1200 generally controls the overall operation of the terminal device. The processing module 1200 may include one or more second processors 1201 to execute instructions to perform all or part of the steps of the method shown in fig. 1 described above. Further, the processing module 1200 may include one or more modules that facilitate interaction between the processing module 1200 and other components. For example, the processing module 1200 may include a multimedia module to facilitate interaction between the multimedia module 1205 and the processing module 1200. A power module 1204 provides power to the various components of the terminal device. The power module 1204 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the terminal devices. The multimedia module 1205 includes a display screen that provides an output interface between the terminal device and the user. In some embodiments, the display screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the display screen includes a touch panel, the display screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. The voice module 1206 is configured to output and/or input a voice signal. For example, the voice module 1206 includes a Microphone (MIC) configured to receive an external voice signal when the terminal device is in an operational mode, such as a voice recognition mode. The received speech signal may further be stored in the second memory 1202 or transmitted via the communication module 1203. 
In some embodiments, the voice module 1206 further includes a speaker for outputting voice signals.
Input/output interface 1207 provides an interface between processing module 1200 and peripheral interface modules, which may be click wheels, buttons, etc. These buttons may include, but are not limited to: a volume button, a start button, and a lock button.
The sensor module 1208 includes one or more sensors for providing various aspects of status assessment for the terminal device. For example, the sensor module 1208 may detect an open/closed status of the terminal device, relative positioning of components, presence or absence of user contact with the terminal device. The sensor module 1208 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact, including detecting the distance between the user and the terminal device. In some embodiments, the sensor module 1208 may also include a camera or the like.
The communication module 1203 is configured to facilitate communication between the terminal device and other devices in a wired or wireless manner. The terminal device may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In one embodiment, the terminal device may include a SIM card slot for inserting a SIM card, so that the terminal device may log onto a GPRS network and establish communication with the server via the internet.
As can be seen from the above, the communication module 1203, the voice module 1206, the input/output interface 1207, and the sensor module 1208 in the embodiment of fig. 5 may be implemented as the input device in the embodiment of fig. 4.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the claims of the present invention, any of the claimed embodiments may be used in any combination.
So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.

Claims (13)

1. A data retrieval method based on an inverted list is characterized by comprising the following steps:
acquiring input feature vectors and input label information required by data retrieval;
performing data retrieval in all inverted lists based on the input feature vectors and the input label information;
the inverted list stores characteristic vectors of data and label information, and the label information has a time-space attribute.
2. The data retrieval method of claim 1, wherein the step of performing data retrieval in all inverted lists based on the input feature vector and the input tag information comprises:
respectively carrying out similarity retrieval on the input feature vector and the cluster centers of all inverted lists to obtain a plurality of similar inverted lists similar to the input feature vector;
respectively searching data in the similar inverted lists according to the initial label and the end label of the input label information to find a plurality of first feature vectors corresponding to the label information in the interval of the initial label and the end label;
and respectively carrying out vector similarity retrieval on the input feature vector and the plurality of first feature vectors to determine target data corresponding to topK second feature vectors which are most similar to the input feature vector.
3. The data retrieval method according to claim 2, wherein the step of performing vector similarity retrieval on the input feature vector and the plurality of first feature vectors respectively to determine target data corresponding to topK second feature vectors most similar to the input feature vector specifically comprises:
scanning the plurality of first feature vectors in each similar inverted list and comparing them with the input feature vector, collecting all second feature vectors through a max-heap/min-heap of size topK according to the comparison results, then heap-sorting all the second feature vectors, and determining the target data corresponding to all the second feature vectors according to the heap-sort result.
4. The data retrieval method according to claim 1, wherein the input feature vector is a feature vector of data to be newly added, and the input tag information is tag information of the data to be newly added; the step of "performing data retrieval in all inverted tables based on the input feature vector and the input tag information" includes:
respectively carrying out similarity retrieval on the input feature vector and the cluster centers of all inverted lists to obtain a most similar inverted list which is most similar to the input feature vector;
and searching the position to be inserted of the data to be newly added in the most similar inverted list based on the input label information, and storing the data to be newly added to the position to be inserted.
5. The data retrieval method of claim 1, wherein the method of generating the inverted table comprises:
acquiring data containing feature vectors and label information in a training sample;
clustering the characteristic vectors of all data in the training sample according to the number of preset cluster centers;
and sequencing the data in each clustered cluster in an ascending mode of label information to form a plurality of inverted lists.
6. The data retrieval method of claim 1, wherein the method of deleting the data in the inverted list comprises: performing binary search in all the inverted lists based on the starting label and the ending label of the data to be deleted;
and removing from the inverted list all label information and feature vectors of the data to be deleted that fall within the interval between the starting label and the ending label.
7. An inverted table-based data retrieval system, comprising:
the first acquisition module is used for acquiring input feature vectors and input label information required by data retrieval;
the retrieval module is used for retrieving data in all the inverted lists based on the input feature vectors and the input label information;
the inverted list stores characteristic vectors of data and label information, and the label information has a time-space attribute.
8. The data retrieval system of claim 7, wherein the retrieval module specifically comprises:
the first similarity retrieval module is used for respectively carrying out similarity retrieval on the input feature vector and the cluster centers of all the inverted lists to obtain a plurality of similar inverted lists similar to the input feature vector;
the first searching module is used for respectively searching data in the similar inverted lists according to the starting label and the ending label of the input label information so as to find a plurality of first feature vectors corresponding to the label information in the interval of the starting label and the ending label;
and the vector similarity retrieval module is used for respectively carrying out vector similarity retrieval on the input feature vector and the first feature vectors so as to determine target data corresponding to topK second feature vectors which are most similar to the input feature vector.
9. The data retrieval system of claim 8, wherein the vector similarity retrieval module is specifically configured to:
scanning the plurality of first feature vectors in each similar inverted list and comparing them with the input feature vector, collecting all second feature vectors through a max-heap/min-heap of size topK according to the comparison results, then heap-sorting all the second feature vectors, and determining the target data corresponding to all the second feature vectors according to the heap-sort result.
10. The data retrieval system of claim 7, wherein the input feature vector is a feature vector of data to be added, and the input tag information is tag information of data to be added; the retrieval module specifically further comprises:
the second similarity retrieval module is used for respectively carrying out similarity retrieval on the input feature vector and the cluster centers of all the inverted lists to obtain a most similar inverted list which is most similar to the input feature vector;
and the second searching module is used for searching the position to be inserted of the data to be newly added in the most similar inverted list based on the input label information and storing the data to be newly added to the position to be inserted.
11. The data retrieval system of claim 7, further comprising a training module for building the inverted list; the training module specifically comprises:
the second acquisition module is used for acquiring data containing the characteristic vectors and the label information in the training samples;
the clustering module is used for clustering the feature vectors of all data in the training sample according to the number of the preset cluster centers;
and the sorting module is used for sorting the data in each clustered cluster in an ascending mode of label information to form a plurality of inverted lists.
12. A computer readable storage medium having stored therein a plurality of program codes, characterized in that the program codes are adapted to be loaded and run by a processor to perform the data retrieval method of any one of claims 1 to 6.
13. An apparatus for inverted table based data retrieval, comprising a processor and a memory, the memory having stored therein a plurality of program codes, wherein the program codes are adapted to be loaded and executed by the processor to perform the data retrieval method of any one of claims 1 to 6.
CN202110554146.0A 2021-05-20 2021-05-20 Data retrieval method, system, medium and device based on inverted list Pending CN113326388A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110554146.0A CN113326388A (en) 2021-05-20 2021-05-20 Data retrieval method, system, medium and device based on inverted list

Publications (1)

Publication Number Publication Date
CN113326388A true CN113326388A (en) 2021-08-31

Family

ID=77416115

Country Status (1)

Country Link
CN (1) CN113326388A (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103294813A (en) * 2013-06-07 2013-09-11 北京捷成世纪科技股份有限公司 Sensitive image search method and device
CN103955543A (en) * 2014-05-20 2014-07-30 电子科技大学 Multimode-based clothing image retrieval method
CN104298747A (en) * 2014-10-13 2015-01-21 福建星海通信科技有限公司 Storage method and retrieval method of massive images
CN106991102A (en) * 2016-01-21 2017-07-28 腾讯科技(深圳)有限公司 The processing method and processing system of key-value pair in inverted index
CN110609916A (en) * 2019-09-25 2019-12-24 四川东方网力科技有限公司 Video image data retrieval method, device, equipment and storage medium
CN110825902A (en) * 2019-09-20 2020-02-21 深圳云天励飞技术有限公司 Method and device for realizing feature similarity search, electronic equipment and storage medium
CN111104540A (en) * 2019-12-26 2020-05-05 深圳云天励飞技术有限公司 Image searching method, device, equipment and computer readable storage medium
CN111444363A (en) * 2020-03-02 2020-07-24 高新兴科技集团股份有限公司 Picture retrieval method and device, terminal equipment and storage medium
CN112417381A (en) * 2020-12-11 2021-02-26 中国搜索信息科技股份有限公司 Method and device for rapidly positioning infringement image applied to image copyright protection
WO2021051521A1 (en) * 2019-09-18 2021-03-25 平安科技(深圳)有限公司 Response information obtaining method and apparatus, computer device, and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
田方杰等: "GeoSOT时空编码的海量照片组织检索方法", 测绘科学, pages 129 - 131 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination