WO2020048145A1 - Method and apparatus for data retrieval - Google Patents

Method and apparatus for data retrieval

Info

Publication number
WO2020048145A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
type
retrieval
cluster
classification processing
Prior art date
Application number
PCT/CN2019/085483
Other languages
English (en)
French (fr)
Inventor
王正
赵章云
傅蓉蓉
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to EP19856658.0A (patent EP3835976A4)
Publication of WO2020048145A1 (patent WO2020048145A1)
Priority to US17/190,188 (patent US11816117B2)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2468 Fuzzy queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/53 Querying
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/55 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2134 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on separation criteria, e.g. independent component analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761 Proximity, similarity or dissimilarity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V20/54 Surveillance or monitoring of activities, e.g. for recognising suspicious objects of traffic, e.g. cars on the road, trains or boats

Definitions

  • the present application relates to the field of computers, and in particular, to a method and an apparatus for data retrieval.
  • image-based retrieval methods are used in more and more fields.
  • the public security system uses real-time collected portrait data to compare with data in the database to identify criminals.
  • the traffic system uses real-time collection of license plate information to accurately locate the vehicle trajectory, and is used to find the vehicle that caused the accident.
  • a large amount of data is stored in the database, and with the continuous increase of collected data, more and more data needs to be compared in the retrieval process, which brings great challenges to the speed and accuracy of image retrieval. How to provide a method that can ensure the accuracy of image retrieval and improve the speed of image retrieval has become an urgent problem in the field of image retrieval.
  • This application provides a method and device for data retrieval, which can ensure the retrieval speed and improve the retrieval accuracy.
  • a data retrieval method includes: dividing N data in a database into a first type of data and a second type of data, where N ≥ 2.
  • a first search range is determined in the first type of data, and the data to be retrieved is searched in the first search range to obtain a first search result, where the data in the first search range is a subset of the first type of data.
  • the data to be retrieved is retrieved in the entire range of the second type of data to obtain a second retrieval result.
  • the final retrieval result of the data to be retrieved is determined from the first retrieval result and the second retrieval result.
  • the data in the original database is divided into the first type of data and the second type of data.
  • the first type of data is data that affects the retrieval speed
  • the second type of data is data that affects the retrieval accuracy.
  • a narrowed search is performed in the first type of data, a brute-force search is performed in the second type of data, and the final search result is obtained based on the search results of the two types of data, so as to ensure both the accuracy of the search and the speed of the search.
  • the N data is divided into M clusters according to a clustering algorithm, each data corresponds to one cluster, each cluster has a center point, each data has a high similarity to the center point of the cluster to which it belongs, M ≥ 2, and the cluster index of each cluster is used to uniquely identify the cluster.
  • the second type of data is the set of data at the edge points of each cluster
  • the first type of data is the set of data other than the second type of data in the original database.
  • an edge point is a data item for which the range centered on that data item, with the first threshold as its radius, contains data of two or more clusters.
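  • The edge-point rule above can be illustrated with a minimal sketch. It assumes Euclidean distance and that cluster labels have already been assigned; the function name `split_edge_points` and the sample values are hypothetical and are not taken from the application.

```python
import math

def split_edge_points(points, labels, first_threshold):
    """Return (first_type, second_type) lists of point indexes.

    A point is treated as an edge point (second type) when the ball of radius
    `first_threshold` centered on it contains points of two or more clusters.
    """
    first_type, second_type = [], []
    for i, p in enumerate(points):
        # Collect the cluster labels of every point inside the ball around p.
        clusters_in_ball = {
            labels[j]
            for j, q in enumerate(points)
            if math.dist(p, q) <= first_threshold
        }
        if len(clusters_in_ball) >= 2:
            second_type.append(i)
        else:
            first_type.append(i)
    return first_type, second_type

# Hypothetical sample: points 2 and 3 sit on the border between clusters 0 and 1.
points = [(0.0, 0.0), (0.2, 0.0), (1.0, 0.0), (1.2, 0.0), (3.0, 0.0)]
labels = [0, 0, 0, 1, 1]
first_type, second_type = split_edge_points(points, labels, first_threshold=0.5)
```

  • The pairwise scan is quadratic in the number of points, which is acceptable for an illustration but would likely need an index structure on a large database.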
  • the first type of data may also be divided into multiple layers according to a preset algorithm, and each layer includes at least one first type of data, and each first type of data belongs to one layer.
  • the layer index of each layer is used to uniquely identify a layer.
  • the first type of data is divided into multiple layers.
  • the search range can be selected by combining the clusters and layers of the first type of data.
  • the search range can be narrowed and the search time reduced, which improves retrieval efficiency.
  • dividing the N data into the first type of data and the second type of data includes: selecting a comparison cluster from the M clusters; then selecting z reference data from the N data, 1 ≤ z ≤ N; and, for each reference data, performing the following data classification processing: the data to be classified is retrieved from the database according to the current reference data, where the data to be classified is data with a high similarity to the current reference data; it is then determined whether the data to be classified belongs to the comparison cluster; if yes, the data to be classified is assigned to the first type of data, and if not, it is assigned to the second type of data.
  • the original data can be divided into a set of the first type of data and the second type of data.
  • different retrieval methods can be used for the different types of data to ensure retrieval accuracy while improving retrieval speed.
  • retrieving the data to be classified from the database, where the data to be classified is data with a high similarity to the current reference data, includes: calculating the similarity between the current reference data and the other N-1 data, and obtaining, according to the calculated similarity, the first m data sorted from highest to lowest similarity to the current reference data, 1 ≤ m ≤ N-1; or, calculating the similarity between the current reference data and the center points of the M clusters, determining the first m clusters sorted from highest to lowest similarity to the current reference data, and using the data in the m clusters as the data to be classified, 1 ≤ m ≤ N-1.
  • the data in the original database can be divided into the first type of data and the second type of data.
  • the first type of data is divided into multiple layers based on the combination of the first type of data and the second type of data.
  • the data to be retrieved is retrieved within a reduced search range, which reduces the retrieval time and improves retrieval efficiency.
  • after each of the z data has been used as the current reference data to complete one round of data classification processing, each of the z data is used as the reference data again and the next round of data classification processing is executed.
  • within one round of data classification processing the same comparison clusters are used for every reference data, and the number of comparison clusters selected in the next round of data classification processing is greater than the number selected in the previous round.
  • the comparison clusters selected in the previous round of data classification processing are a subset of the comparison clusters selected in the next round of data classification processing. When the number of comparison clusters reaches a preset maximum value, the data classification processing ends.
  • the first-type data may be divided into different layers by further combining the above-mentioned combination of the first-type data and the second-type data.
  • the search index of the first type of data is determined by combining the layer index and the cluster index, and a narrow search is performed to reduce the search time and improve the search efficiency.
  • the P-th data is any one of the N data.
  • once the data has been identified as the second type of data in a round of data classification processing, it does not need to participate in the subsequent processing of that round, which reduces the amount of computation and the time of data classification processing.
  • each round of data classification processing obtains a set of classification results; the X-th classification result, obtained in the X-th round of data classification processing, is the final classification result, where the X-th round is the last round of data classification processing;
  • a cluster index and a layer index are configured for each data according to a clustering result and a hierarchical result of each of the N data.
  • the first type of data has a cluster index and a layer index, and indication information of a first search range is set in advance, where the indication information is used to specify the clusters and layers included in the first search range; the cluster indexes of the clusters and the layer indexes of the layers included in the first search range are determined according to the data to be retrieved and the indication information.
  • one or more first-type data whose cluster indexes belong to the A clusters and whose layer indexes belong to the B layers are selected as the first search range, where 1 ≤ A ≤ a, 1 ≤ B ≤ b, a is the total number of clusters divided in the database, b is the total number of layers into which the first type of data is divided, a ≥ 1, b ≥ 1, and a, b, A, and B are positive integers.
  • one or more first-type data in the C clusters whose layer indexes belong to the B layers are selected as the first search range;
  • C = A + first offset value
  • the first offset value ≥ 1, 1 ≤ A ≤ a, 1 ≤ B ≤ b, a is the total number of clusters divided in the database
  • b is the total number of layers into which the first type of data is divided
  • a ≥ 1, b ≥ 1, and a, b, A, B, C, and the first offset value are all positive integers.
  • a ≥ 1, b ≥ 1, and a, b, A, B, and D are all positive integers.
  • one or more first-type data in the E clusters whose layer indexes belong to the F layers are selected as the first search range;
  • E = A + first offset value
  • F = B + second offset value
  • first offset value ≥ 1, second offset value ≥ 1, and both the first offset value and the second offset value are positive integers
  • 1 ≤ A ≤ a, 1 ≤ B ≤ b, a is the total number of clusters divided in the database
  • b is the total number of layers into which the first type of data is divided
  • a ≥ 1, b ≥ 1, and a, b, A, B, E, and F are all positive integers.
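  • One possible reading of the search-range rules above, as a sketch only. It assumes that clusters are ranked by the Euclidean distance of their center points to the query and that "B layers" means layer indexes 1 to B; the text here does not fix either choice. The function `first_search_range`, its parameters, and the 0-based cluster indexes are hypothetical.

```python
import math

def first_search_range(query, centers, records, A, B,
                       first_offset=0, second_offset=0):
    """Select first-type records for the narrowed search.

    records: list of (vector, cluster_index, layer_index) for first-type data.
    With both offsets at 0 this is the "A clusters, B layers" variant; a
    non-zero offset widens the range (C or E = A + first offset value,
    D or F = B + second offset value).
    """
    n_clusters = min(A + first_offset, len(centers))
    n_layers = B + second_offset
    # Assumption: the clusters closest to the query are the ones searched.
    ranked = sorted(range(len(centers)),
                    key=lambda c: math.dist(query, centers[c]))
    chosen = set(ranked[:n_clusters])
    return [r for r in records if r[1] in chosen and r[2] <= n_layers]

# Hypothetical sample data: three clusters, layer indexes starting at 1.
centers = [(0.0, 0.0), (10.0, 0.0), (20.0, 0.0)]
records = [((0.0, 1.0), 0, 1), ((1.0, 0.0), 0, 2),
           ((10.0, 1.0), 1, 1), ((20.0, 1.0), 2, 1)]
base = first_search_range((1.0, 1.0), centers, records, A=1, B=1)
wider = first_search_range((1.0, 1.0), centers, records, A=1, B=1, first_offset=1)
```

  • Raising either offset trades retrieval speed for a larger candidate set, which matches the role of the first type of data as the part of the database that governs speed.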
  • an image to be retrieved is received; then, feature information of the image to be retrieved is extracted as the data to be retrieved.
  • the data of the database is divided into the first type of data that affects the retrieval speed and the second type of data that affects the retrieval accuracy, and then the data of the first type of data is divided into multiple layers.
  • the second type of data is retrieved using brute force retrieval to ensure retrieval accuracy; the first type of data is retrieved in a manner that narrows the scope of the retrieval to ensure the speed of retrieval.
  • the data that affect the retrieval accuracy mainly refers to the data in the database that has a low degree of similarity with other data, which directly affects the retrieval accuracy.
  • the data that affects the retrieval speed refers to the data that has a direct impact on the retrieval speed.
  • This part of the data indirectly affects the retrieval accuracy.
  • the amount of data selected in this part of the data affects the retrieval speed.
  • the original data is divided into the first type of data and the second type of data, and different retrieval methods are used in the first type of data and the second type of data.
  • the combination of the two ensures the accuracy and speed of the entire retrieval process, reduces the time consumed by the retrieval process, and improves the efficiency of the retrieval process.
  • the present application provides a data retrieval apparatus, and the retrieval apparatus includes various modules for executing the data retrieval method in the first aspect or any possible implementation manner of the first aspect.
  • the present application provides a device for data retrieval.
  • the device includes a processor, a memory, a communication interface, and a bus.
  • the processor, the memory, and the communication interface are connected by a bus and complete communication with each other.
  • the memory is used to store program code, and when the device is running, the processor executes the program code in the memory to use the hardware resources in the device to perform the steps of the method described in the first aspect or any possible implementation of the first aspect.
  • the present application provides a heterogeneous device system for data retrieval, the heterogeneous device system includes a first device and a second device, the first device and the second device communicate over a network, and the second device is configured to assist the first device in performing data retrieval processing.
  • the first device includes a first processor, a memory, a communication interface, and a bus.
  • the first processor, the memory, and the communication interface are connected through a bus and complete communication with each other.
  • the memory is used to store program code.
  • the second device includes a first processor, a second processor, a memory, a communication interface, and a bus.
  • the first processor, the second processor, the memory, and the communication interface are connected by a bus and complete communication with each other.
  • the memory is used to store program code.
  • the first processor of the first device is configured to execute the program code in the memory of the first device and send the data to be retrieved to the second device; after the first processor of the second device receives the data to be retrieved, the second processor executes the program code in the second device to complete the retrieval of the data to be retrieved, and the result is sent to the first processor of the first device.
  • the first processor in the first device and in the second device includes a CPU, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • the second processor includes a graphics processing unit (GPU) and a neural network processing unit (NPU).
  • when the second processor is a GPU, the second device further includes a video memory.
  • the present application provides a computer-readable storage medium having instructions stored in the computer-readable storage medium, which when executed on a computer, causes the computer to execute the methods described in the above aspects.
  • the present application provides a computer program product containing instructions that, when run on a computer, causes the computer to perform the methods described in the above aspects.
  • FIG. 1 is a schematic structural diagram of a data retrieval system provided by the present application.
  • FIG. 2 is a schematic flowchart of a data retrieval method provided by the present application.
  • FIG. 3 is a schematic flowchart of a data classification processing method provided by the present application.
  • FIG. 5 is a schematic structural diagram of a data retrieval device provided by this application.
  • FIG. 6 is a schematic structural diagram of another data retrieval device provided by this application.
  • FIG. 7 is a schematic structural diagram of a data retrieval heterogeneous device system provided by the present application.
  • FIG. 1 is a schematic structural diagram of a retrieval system according to an embodiment of the present application.
  • the retrieval system includes a retrieval server 100 and a database cluster 200, and the retrieval server 100 and the database cluster 200 communicate through a network 300.
  • the retrieval server 100 is configured to receive a retrieval task and search for matching data in the database cluster 200.
  • the system architecture shown in FIG. 1 may include one or more retrieval servers 100; FIG. 1 uses a retrieval system that includes only one retrieval server 100 as an example.
  • the database cluster 200 includes at least one database server 201.
  • the database application can be deployed in a distributed manner, or each server can deploy different types of database applications according to business requirements and store different data.
  • feature information of each image stored in the database can be extracted according to preset rules.
  • Each image includes multiple feature information.
  • the multiple feature information identifying the same image can also be called a feature vector.
  • an image can be composed of multiple pixels, and each pixel is regarded as a feature of the image.
  • the information used to identify each pixel can be called a feature information of the image.
  • the collection of feature information is called a feature vector.
  • the feature vector can be represented by a matrix, and each element in the matrix can represent a feature information of the image. Further, the feature vectors of the same image include a long feature vector and a short feature vector.
  • the long feature vector includes all feature information of the same image
  • the short feature vector includes only part of feature information of the same image. That is, the same image can be represented by one long feature vector, or multiple short feature vectors.
  • each image is described by using a long feature vector identifier as an example.
  • the feature vectors in the following description of the embodiment of the present application all represent long feature vectors.
  • the retrieval server 100 and the database server 201 in the database cluster 200 may also be deployed together. That is, in addition to storing the original data, the database server 201 is also used for processing retrieval tasks.
  • the database server storing data in FIG. 1 may also be directly deployed using one or more database servers.
  • each database server in the retrieval system can deploy distributed database applications, non-distributed database applications, or other forms of database applications, which are not limited in this application.
  • the retrieval server can be co-deployed with one of the database servers.
  • the data retrieval described in this application includes text, image, audio, and video retrieval.
  • image retrieval is taken as an example for further description in the following embodiments of the present application.
  • the data stored in the database is image data
  • the data to be retrieved is also image data.
  • the following description uses, as an example, a database cluster deployed in the retrieval system shown in FIG. 1, where the database cluster includes three database servers 201.
  • the raw data stored in the database cluster can be stored in the database cluster 200 at one time, or can be updated periodically or in real time. For example, after an application server (not shown in FIG. 1) obtains image data through a camera, the data is stored in the database server 201 of the database cluster 200.
  • an embodiment of the present application provides a data retrieval method, which specifically includes two parts, a data preparation process and a retrieval process.
  • in the data preparation process, the original data of the database is divided into the first type of data and the second type of data.
  • in the retrieval process, a narrowed search is performed in the first type of data, that is, a first search range is selected in the first type of data, and then the data to be retrieved is searched in the first search range.
  • a brute force search is performed in the second type of data, that is, the data to be retrieved is searched in the entire range of the second type of data.
  • the final search results are determined based on the search results of the first type of data and the search results of the second type of data.
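  • The retrieval process described above can be sketched end to end as follows. This is a minimal illustration under stated assumptions: Euclidean distance stands in for similarity, the first search range is the m clusters whose center points are closest to the query, and the name `retrieve` and the sample data are hypothetical.

```python
import heapq
import math

def retrieve(query, first_type, centers, second_type, m, n):
    """Combine a narrowed search and a brute-force search.

    first_type:  dict cluster_index -> list of first-type vectors
    centers:     dict cluster_index -> center point of that cluster
    second_type: list of second-type vectors (searched in full)
    """
    def dist(v):
        return math.dist(query, v)

    # Narrowed search: only the m clusters nearest to the query are scanned.
    nearest = heapq.nsmallest(m, first_type, key=lambda c: dist(centers[c]))
    first_range = [v for c in nearest for v in first_type[c]]
    first_result = heapq.nsmallest(n, first_range, key=dist)

    # Brute-force search over the entire second type of data.
    second_result = heapq.nsmallest(n, second_type, key=dist)

    # Final result: the n closest items across both partial results.
    return heapq.nsmallest(n, first_result + second_result, key=dist)

# Hypothetical sample: the query lies between two clusters, where edge data lives.
centers = {0: (0.0, 0.0), 1: (10.0, 0.0)}
first_type = {0: [(0.0, 0.0), (1.0, 0.0)], 1: [(10.0, 0.0), (11.0, 0.0)]}
second_type = [(5.0, 0.0), (4.5, 0.0)]
result = retrieve((4.0, 0.0), first_type, centers, second_type, m=1, n=2)
```

  • In this sample the two best matches come from the second type of data, which a narrowed search restricted to one cluster would have missed.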
  • the process of dividing the data in the original database into the first type of data and the second type of data includes: first dividing the original data into multiple clusters, where each cluster has a cluster center point and the cluster index is used to uniquely identify a cluster. Then, the data at the edge points of each cluster is assigned to the second type of data, that is, the second type of data is the set of data at the edge points of each cluster. An edge point is a data item for which the range centered on that data item, with the first threshold as its radius, contains data of two or more clusters. The first type of data is then the data other than the second type of data in the original database.
  • Brute force search refers to calculating the similarity between the data to be retrieved and each piece of data in the second type of data, and selecting the first n pieces of data, sorted from highest to lowest similarity to the data to be retrieved, as the matching result, where n is an integer greater than 1.
  • determining the similarity of two data includes calculating the distance between the two data.
  • the specific methods of calculating the distance include the Euclidean distance formula and the cosine distance formula. For ease of description, this application uses the distance between two data to represent the similarity of the two data as an example for illustration.
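  • For reference, the two distance measures named above can be written as follows. This is a plain sketch using the usual textbook definitions (cosine distance taken as one minus cosine similarity); the function names are illustrative.

```python
import math

def euclidean_distance(u, v):
    """Square root of the summed squared component differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_distance(u, v):
    """One minus the cosine of the angle between u and v (non-zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

d_euclid = euclidean_distance((0.0, 0.0), (3.0, 4.0))
d_cosine = cosine_distance((1.0, 0.0), (0.0, 1.0))
```

  • With either measure, a smaller distance corresponds to a higher similarity, which is why the text sorts distances from low to high.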
  • the retrieval process includes: first, calculating the distance between the data to be retrieved and all the data in the retrieval range according to the Euclidean distance formula; then sorting by distance and taking the first n pieces of data, in order of distance from low to high, as the matching result.
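  • The brute-force step described above amounts to a top-n selection by distance. A minimal sketch, assuming Euclidean distance; `brute_force_search` is an illustrative name.

```python
import heapq
import math

def brute_force_search(query, vectors, n):
    """Return the indexes of the n vectors closest to the query, nearest first."""
    return heapq.nsmallest(
        n, range(len(vectors)), key=lambda i: math.dist(query, vectors[i])
    )

vectors = [(5.0, 5.0), (1.0, 1.0), (0.0, 0.0), (2.0, 2.0)]
top2 = brute_force_search((0.0, 0.0), vectors, n=2)
```

  • Every vector in the range is compared with the query, so the cost grows linearly with the size of the range.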
  • A narrowed search refers to first clustering the original data by similarity according to a preset clustering algorithm.
  • Each clustered data set is called a cluster, and each cluster has a center point.
  • the preset clustering algorithms include the KMEANS algorithm. Then, the distance between the data to be retrieved and the center point of each cluster is calculated, and the data in the m clusters whose center points are most similar to the data to be retrieved are used as the search range. Then, the distance between the data to be retrieved and each data within the selected search range is calculated, and the search result is determined according to the distance order.
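  • A narrowed search of the kind described above can be sketched as a two-stage lookup. The sketch assumes Euclidean distance and clusters that were computed beforehand; `narrowed_search` and the sample data are hypothetical.

```python
import heapq
import math

def narrowed_search(query, centers, clusters, m, n):
    """Search only the m clusters whose center points are closest to the query.

    centers[c] is the center point of cluster c; clusters[c] lists its vectors.
    """
    # Stage 1: rank clusters by the distance from the query to their centers.
    nearest = heapq.nsmallest(
        m, range(len(centers)), key=lambda c: math.dist(query, centers[c])
    )
    # Stage 2: compare the query only with the data inside those clusters.
    candidates = [v for c in nearest for v in clusters[c]]
    return heapq.nsmallest(n, candidates, key=lambda v: math.dist(query, v))

centers = [(0.0, 0.0), (10.0, 10.0)]
clusters = [[(0.0, 1.0), (1.0, 1.0)], [(9.0, 9.0), (10.0, 11.0)]]
hits = narrowed_search((0.4, 0.9), centers, clusters, m=1, n=1)
```

  • A larger m scans more data and is slower, while a smaller m risks skipping the cluster that holds the true nearest neighbor.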
  • the data retrieval method includes two parts: a data preparation process and a retrieval process, wherein steps S110 to S120 are data preparation processes, and steps S130 to S160 are retrieval processes.
  • the data preparation process can be completed during the system initialization process, or the data in the database can be updated in real time or periodically according to business needs.
  • the method includes:
  • the data preparation process needs to be completed first, that is, the data is first divided into the first type of data and the second type of data.
  • the feature vectors in the database are divided into two or more sets. Each set can be called a cluster, and each cluster has a center point.
  • the feature vectors divided into the same cluster are feature vectors whose similarity to the center point of the cluster satisfies a second threshold; each cluster has a globally unique identifier, and the cluster index of each cluster is used to uniquely identify the cluster.
  • the second threshold value can be set manually according to business requirements or can be calculated based on historical data of the retrieval system, which is not limited in this application.
  • the preset clustering algorithm includes KMEANS clustering algorithm.
  • the principle of KMEANS clustering algorithm includes the following steps:
  • Repeat steps 2) and 3) until the k center points are the same as the averages, calculated in step 3), of all the feature vectors in each cluster. At this time, the obtained k feature vectors are the center points of the k clusters.
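  • As an illustration of the iteration that ends with the stopping rule above, here is a sketch of a standard K-means loop (Lloyd's algorithm). It is not taken from the application: using the first k vectors as the initial center points and capping the loop at `max_iter` are assumptions made for a deterministic example.

```python
import math

def kmeans(vectors, k, max_iter=100):
    """Return (centers, labels) after iterating until the centers stop moving."""
    centers = [tuple(v) for v in vectors[:k]]  # assumption: simple initialization
    labels = []
    for _ in range(max_iter):
        # Assign each feature vector to the cluster with the nearest center.
        labels = [
            min(range(k), key=lambda c: math.dist(v, centers[c]))
            for v in vectors
        ]
        # Recompute each center as the average of the vectors assigned to it.
        new_centers = []
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                new_centers.append(
                    tuple(sum(xs) / len(members) for xs in zip(*members))
                )
            else:
                new_centers.append(centers[c])  # keep the center of an empty cluster
        if new_centers == centers:  # centers equal the averages: converged
            break
        centers = new_centers
    return centers, labels

vectors = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (10.0, 11.0)]
centers, labels = kmeans(vectors, k=2)
```

  • The returned label of each vector can serve as its cluster index in the later steps.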
  • each feature vector in the database is assigned to the cluster whose center point has a similarity to it that meets the second threshold.
  • the identifier of that cluster is the cluster index of the feature vector.
  • the cluster index and the cluster identifier can be represented by the cluster number.
  • the process of determining the cluster and the cluster index to which each feature vector belongs needs to process the original data in the database one by one, so it occupies considerable computing resources of the processor in the retrieval server.
  • This processing can be done using multiple processors or graphics processing units (GPUs).
  • Each round of data classification processing divides the multiple feature vectors in the database into a set of first-type data and second-type data; based on the combination of at least one set of first-type data and second-type data, the first type of data obtained in the last round of data classification processing is divided into multiple layers, each layer includes at least one first-type data, and the layer index is used to uniquely identify a layer.
  • the original data can be divided into different layers according to the following steps, and a layer index is added to the feature vector in each layer.
  • S1201. Perform at least one round of data classification processing, and each round of data classification processing divides multiple feature vectors in the database into first-type data and second-type data, respectively.
  • Method 1: the original data is divided into the first type of data and the second type of data by brute-force search.
  • using each feature vector in the original database as reference data, calculate the distance between that feature vector and the other feature vectors in the original database according to the Euclidean distance formula, and determine the data to be classified for the current reference data according to these distances. That is, determine the first n feature vectors sorted from highest to lowest similarity to the feature vector, i.e., the feature vectors corresponding to the first n distances sorted from smallest to largest. Then, determine whether the data to be classified belongs to the comparison data of the current comparison cluster, that is, whether the n feature vectors belong to the clusters selected for the current round of data classification processing.
  • if x of the n feature vectors belong to the selected clusters, the x feature vectors are considered to be first-type data that affects speed during this round of data classification processing; if y of the n feature vectors do not belong to the selected clusters, the y feature vectors are considered to be second-type data that affects accuracy during this round of data classification processing. This completes a round of data classification processing.
  • x ≤ n, y ≤ n, and x + y = n; n, x, and y are all positive integers.
  • k_choose represents the number of clusters selected during the current round of data classification processing.
  • For example, when k_choose is 1, cluster 1 is selected as the comparison cluster, and all feature vectors in cluster 1 are used as the comparison data for the current round of data classification processing.
  • Each feature vector in the original database is used in turn as reference data, and a brute-force search is performed for it. That is, the distances between the reference feature vector and all other feature vectors in the original database are calculated and sorted, and the first n feature vectors in order of similarity from high to low are determined, which are the feature vectors corresponding to the first n distances sorted from low to high. Then determine whether each of the n feature vectors belongs to the cluster with cluster index 1. If a feature vector belongs to that cluster, it is classified as first-type data 1, which affects retrieval speed in the current round; if it does not, it is classified as second-type data 1, which affects retrieval accuracy in the current round. After this round of data classification processing, the original data is divided into two types: first-type data 1 and second-type data 1.
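  • The brute-force round described above can be sketched in Python as follows. This is only one illustrative reading of Method 1: the function name `classify_round_brute_force`, the dictionary layout, and the toy two-dimensional vectors are assumptions made for the sketch and do not come from the application.

```python
import math

def classify_round_brute_force(vectors, cluster_of, comparison_clusters, n):
    """One round of Method 1 (hypothetical helper, not from the application).

    vectors: {vector_id: tuple of floats}
    cluster_of: {vector_id: cluster index}
    comparison_clusters: set of cluster indexes chosen for this round
    n: number of nearest neighbours taken as the data to be classified
    """
    first_type, second_type = set(), set()
    for ref_id, ref in vectors.items():
        others = [vid for vid in vectors if vid != ref_id]
        # Smallest Euclidean distance first, i.e. highest similarity first.
        others.sort(key=lambda vid: math.dist(ref, vectors[vid]))
        for vid in others[:n]:
            if cluster_of[vid] in comparison_clusters:
                first_type.add(vid)    # affects retrieval speed
            else:
                second_type.add(vid)   # affects retrieval accuracy
    return first_type, second_type

# Toy data (assumed): two tight clusters, cluster 1 is the comparison cluster.
vectors = {"V1": (0.0, 0.0), "V2": (0.0, 1.0), "V3": (5.0, 5.0), "V4": (5.0, 6.0)}
cluster_of = {"V1": 1, "V2": 1, "V3": 2, "V4": 2}
first_1, second_1 = classify_round_brute_force(vectors, cluster_of, {1}, n=1)
```

  • With k_choose = 1 and n = 1, the neighbours that fall inside cluster 1 end up in `first_1` and the remaining neighbours in `second_1`. The cost is quadratic in the number of feature vectors, which is consistent with this step being done during data preparation rather than at query time.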
  • Method 2: The original data is divided into first-type data and second-type data by narrowing the search range.
  • Use each feature vector in the original database in turn as reference data, calculate the distance between the reference data and the center point of each cluster divided in step S110, and select the feature vectors of the first m clusters, ordered by distance from the current reference data from low to high, as the data to be classified. Then determine whether each of the m clusters is a comparison cluster selected in the current round of data classification processing.
  • If x of the m clusters are comparison clusters selected in this round, the feature vectors in the x clusters are considered first-type data, which affects retrieval speed; if y of the m clusters are not comparison clusters selected in this round, the feature vectors in the y clusters are considered second-type data, which affects retrieval accuracy.
  • k_choose represents the number of comparison clusters selected during the current round of data classification processing.
  • For example, when k_choose is 1, one cluster is selected as the comparison cluster in this round of data classification processing, and all feature vectors of that cluster are used as the comparison data.
  • Suppose cluster 1 is selected as the comparison cluster, so that all feature vectors of cluster 1 are the comparison data for the current round.
  • Each feature vector in the original database is used in turn as reference data, and a search with a narrowed search range is performed for it.
  • The distances between the current reference data and the center points of the clusters divided in step S110 are calculated, and the first m clusters, ordered by the distance of their center points from the current reference data from low to high, are selected as the data to be classified. Then it is determined whether each of the m clusters is a comparison cluster selected in the current round.
  • If x of the m clusters are comparison clusters selected in this round, the feature vectors in the x clusters are first-type data 1, which affects retrieval speed in this round; if y of the m clusters are not comparison clusters selected in this round, the feature vectors in the y clusters are second-type data 1, which affects retrieval accuracy in this round.
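  • A minimal Python sketch of the narrowed-range round described above is given here for illustration. The helper name `classify_round_by_centers`, the `centers` mapping, and the toy data are assumptions; the sketch classifies whole clusters by comparing each reference vector only against cluster center points.

```python
import math

def classify_round_by_centers(vectors, cluster_of, centers, comparison_clusters, m):
    """One round of Method 2 (hypothetical helper, not from the application).

    centers: {cluster index: center point}
    m: number of nearest clusters taken as the data to be classified
    """
    # Group vector ids by cluster so a whole cluster can be classified at once.
    members = {}
    for vid, cid in cluster_of.items():
        members.setdefault(cid, set()).add(vid)

    first_type, second_type = set(), set()
    for ref in vectors.values():
        # The m clusters whose center points are closest to the reference data.
        nearest = sorted(centers, key=lambda cid: math.dist(ref, centers[cid]))[:m]
        for cid in nearest:
            if cid in comparison_clusters:
                first_type.update(members.get(cid, set()))
            else:
                second_type.update(members.get(cid, set()))
    return first_type, second_type

# Toy data (assumed): cluster 1 is the comparison cluster, m = 1.
vectors = {"V1": (0.0, 0.0), "V2": (0.0, 1.0), "V3": (5.0, 5.0), "V4": (5.0, 6.0)}
cluster_of = {"V1": 1, "V2": 1, "V3": 2, "V4": 2}
centers = {1: (0.0, 0.5), 2: (5.0, 5.5)}
first_1, second_1 = classify_round_by_centers(vectors, cluster_of, centers, {1}, m=1)
```

  • Compared with the brute-force round, each reference vector is compared with the cluster center points only, so the number of distance computations per reference vector equals the number of clusters rather than the number of feature vectors.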
  • With either Method 1 or Method 2 described above, the original data can be divided into one group of first-type data and second-type data.
  • One, two, three, ..., or i clusters can be selected as the comparison data to perform a round of data classification processing, and each round divides the original data into one group of first-type data and second-type data. In the end, i groups of first-type data and second-type data are obtained.
  • The comparison clusters in the current round of data classification processing are obtained by adding one or more new clusters to the comparison clusters selected in the previous round.
  • In other words, the comparison data in the current round consists of the comparison data selected in the previous round plus the data of the newly added clusters.
  • The value range of i is 1 ≤ i ≤ n, where i is a positive integer and n is the total number of clusters into which the original database is divided in step S110.
  • For example, as shown in FIG. 3, the feature vectors of 1, 2, 3, ..., 10 clusters are selected in turn as the comparison data for one round of data classification processing each, so that 10 rounds of data classification processing are performed.
  • When k_choose is 1, all feature vectors of cluster 1 are selected as the comparison data for the current round. After the processing of Method 1 or Method 2 in step S1201, first-type data 1 and second-type data 1 are obtained.
  • When k_choose is 2, all feature vectors of cluster 1 and cluster 2 are selected as the comparison data for the current round. After the processing of Method 1 or Method 2 in step S1201, first-type data 2 and second-type data 2 are obtained.
  • When k_choose is 3, all feature vectors of cluster 1, cluster 2, and cluster 3 are selected as the comparison data for the current round. After the processing of Method 1 or Method 2 in step S1201, first-type data 3 and second-type data 3 are obtained.
  • By analogy, when k_choose is 10, all feature vectors of cluster 1, cluster 2, cluster 3, ..., cluster 9, and cluster 10 are selected as the comparison data for the current round. After the processing of Method 1 or Method 2 in step S1201, first-type data 10 and second-type data 10 are obtained.
  • In each round of data classification processing, one or more clusters may be selected as the comparison clusters.
  • The comparison data of each round may be selected according to cluster identifiers: at least one cluster is added to the comparison data of the previous round to form the comparison clusters of the current round.
  • That is, the comparison data of the current round consists of the data of the previous round's comparison clusters and the data of the clusters newly added in the current round.
  • The identifier includes a number, a name, other information that can uniquely represent the cluster, or a combination of these.
  • A cluster can also be selected at random in each round, and all feature vectors of that cluster are added to the comparison data selected in the previous round to form the comparison data of the current round.
  • For example, cluster 1 is selected as the comparison cluster in the first round of data classification processing; cluster 1 and cluster 2 are selected as the comparison clusters in the second round; cluster 1, cluster 2, and cluster 3 are selected as the comparison clusters in the third round; and cluster 1, cluster 2, cluster 3, and cluster 5 are selected as the comparison clusters in the fourth round.
  • As another example, cluster 1 and cluster 2 are selected as the comparison clusters in the first round of data classification processing; cluster 1, cluster 2, and cluster 3 in the second round; cluster 1, cluster 2, cluster 3, and cluster 4 in the third round; and cluster 1, cluster 2, cluster 3, cluster 4, and cluster 5 in the fourth round.
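  • The cumulative way in which comparison clusters grow from round to round can be sketched as follows. The helper `comparison_schedule` and its parameters are hypothetical; it assumes that exactly one cluster is added per round after the first, in the order given by `order`.

```python
def comparison_schedule(order, rounds, first_size=1):
    """Return the set of comparison clusters for each round (hypothetical helper).

    order: cluster indexes in the order in which they are added
    rounds: number of rounds of data classification processing
    first_size: number of clusters used in the first round
    """
    if first_size + rounds - 1 > len(order):
        raise ValueError("not enough clusters for the requested number of rounds")
    # Round r (0-based) reuses the previous round's clusters and adds one more.
    return [set(order[:first_size + r]) for r in range(rounds)]

# First example in the text: clusters 1, 2, 3, then 5 are added one per round.
one_first = comparison_schedule([1, 2, 3, 5, 4], rounds=4)
# Second example in the text: the first round already uses clusters 1 and 2.
two_first = comparison_schedule([1, 2, 3, 4, 5], rounds=4, first_size=2)
```

  • A random selection order, as the text also allows, could be obtained by shuffling `order` before calling the helper.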
  • To avoid the problem that the same feature vector may belong to both the first-type data and the second-type data within one round of data classification processing, once a feature vector has been determined to belong to the second-type data, it no longer participates in the data classification processing performed with other feature vectors as reference data.
  • For example, cluster 1 is the comparison cluster for the current round of data classification processing. When feature vector 1 is used as the current reference data, feature vector 7 is determined to belong to the second-type data. When feature vector 2 is then used as the reference data, the similarity between feature vector 2 and feature vector 7 does not need to be calculated.
  • The following description takes the data classification processing shown in FIG. 3 as an example: in the first round, one cluster is used as the comparison data; in each round other than the first, one cluster is added, according to the cluster number, to the comparison clusters of the previous round to form the comparison clusters of the current round.
  • The first-type data obtained in the last round of data classification processing is divided into multiple layers, and each layer includes at least one item of first-type data. The layer index is used to uniquely identify a layer.
  • At least one round of data classification processing may be performed on the original data. For example, as shown in FIG. 3, 1, 2, 3, ..., i clusters (for example, i = 10) are selected in turn as comparison data, and i groups of first-type data and second-type data as shown in FIG. 3 are obtained. The layers of the first-type data can then be obtained according to the following formula (1):
  • First-type data of layer 1 = first-type data obtained in the 1st round of data classification processing;
  • First-type data of layer i = first-type data obtained in the i-th round of data classification processing - first-type data obtained in the (i-1)-th round of data classification processing.
  • That is, the feature vectors of the first-type data in the first layer are the feature vectors of the first-type data obtained in the first round of data classification processing. The feature vectors of the first-type data in the second layer are those of the first-type data obtained in the second round minus those of the first-type data obtained in the first round; in other words, they are the feature vectors that the second round's first-type data contains in addition to the first round's. Here 2 ≤ i ≤ X, where X is the maximum number of rounds of data classification processing.
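  • Formula (1) amounts to a set difference between consecutive rounds. A short Python sketch, with hypothetical helper names and made-up round results, is:

```python
def first_type_layers(first_type_per_round):
    """Apply formula (1): layer 1 is round 1, layer i is round i minus round i-1."""
    layers = {}
    previous = set()
    for i, current in enumerate(first_type_per_round, start=1):
        layers[i] = set(current) - previous
        previous = set(current)
    return layers

def layer_index_of(layers):
    """Invert the layers into {vector_id: layer index} for tagging each vector."""
    return {vid: i for i, members in layers.items() for vid in members}

# Assumed results of three rounds (illustrative ids only).
rounds = [{"V6", "V9"}, {"V6", "V9", "V5"}, {"V6", "V9", "V5", "V19"}]
layers = first_type_layers(rounds)
index = layer_index_of(layers)
```

  • When each round's first-type data contains the previous round's, as in this toy input, the layers are disjoint and their union equals the first-type data of the last round, so every first-type feature vector receives exactly one layer index.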
  • the layer index of the first type of data is used to uniquely identify a layer of the first type of data.
  • the layer index of the first type of data may be identified by i, or may be identified by other characters or a combination of characters and data.
  • The first-type data obtained in the last round can be divided into multiple layers as shown in FIG. 3.
  • FIG. 3 takes 10 rounds of data classification processing in S1201 as an example.
  • The set of all feature vectors corresponding to layer indexes 1 to 10 of the first-type data consists of all feature vectors included in first-type data 10, obtained in the 10th round of data classification processing.
  • If only one round of data classification processing is performed, the first-type data includes only one layer.
  • The second-type data may also be divided into multiple layers based on the multiple groups of first-type data and second-type data obtained by the multiple rounds of data classification processing, specifically according to the following formula (2):
  • Second-type data of layer 1 = second-type data obtained in the 1st round of data classification processing;
  • Second-type data of layer j = second-type data obtained in the j-th round of data classification processing - second-type data obtained in the (j-1)-th round of data classification processing.
  • That is, the feature vectors of the second-type data in the first layer are the feature vectors of the second-type data obtained in the first round of data classification processing. The feature vectors of the second-type data in the second layer are those of the second-type data obtained in the second round minus those of the second-type data obtained in the first round; in other words, they are the feature vectors that the second round's second-type data contains in addition to the first round's. Here 2 ≤ j ≤ X, where X is the maximum number of rounds of data classification processing.
  • the layer index of the second type of data is used to uniquely identify a layer of the second type of data.
  • A coding scheme different from that of the layer index of the first-type data may be used to represent the layers of the second-type data. For example, when the layer index of the first-type data is represented by i, the layer index of the second-type data is represented by 2i + 1.
  • To ensure retrieval accuracy, the second-type data needs to be searched item by item using brute-force search.
  • When the requirement on data retrieval speed is high, a part of the second-type data may instead be selected according to its layer index and cluster index for the brute-force search, thereby improving the speed of data retrieval.
  • The multiple rounds of data classification processing described in step S1201 are independent of each other.
  • They can therefore be processed in parallel. The multiple groups of first-type data and second-type data that are finally obtained are then combined, and the first-type data obtained in the last round of data classification processing is divided into multiple layers according to the method of step S1202.
  • The same processor can schedule the different tasks, or different processors can schedule them.
  • For example, the tasks may be scheduled on central processing units (CPUs) or graphics processing units (GPUs).
  • A layer of the first-type data may also be determined after each round of data classification processing. For example, three rounds of data classification processing are performed during the data preparation phase. When the first round is completed, it can be determined that the first-type data in the first layer is the first-type data obtained in the first round. When the second round is completed, it can be determined that the first-type data in the second layer is the first-type data obtained in the second round minus the first-type data obtained in the first round.
  • When the third round is completed, it can be determined that the first-type data in the third layer is the first-type data obtained in the third round minus the first-type data obtained in the second round.
  • In this way, the first-type data can also be divided into multiple layers.
  • In summary, the feature vectors of one or more clusters are selected as comparison data, the original data is divided into multiple groups of first-type data and second-type data, and formula (1) above is used to divide the first-type data obtained in the last round of data classification processing into different layers, with a layer index added to each feature vector.
  • In the retrieval process, the second-type data obtained in the last round of data classification processing is searched by brute force to ensure retrieval accuracy, and the first-type data obtained in the last round is searched with a narrowed search range to ensure retrieval efficiency. This balances retrieval speed against retrieval accuracy and improves retrieval efficiency while ensuring accuracy.
  • After this processing, each feature vector in the database is additionally tagged with a layer index and a cluster index.
  • Table 1 is an example of a data structure provided in the embodiment of the present application, including an image identifier and a feature vector identifier.
  • The image identifier is used to uniquely identify an image in the database.
  • The feature vector identifier includes the feature information contained in the feature vector, together with the cluster index and the layer index of the feature vector.
  • The database may further include collection information of the recorded data, including the collection time and the collection location.
  • The feature vector identifier shown in Table 1 may further include a long feature vector and a short feature vector, with a cluster index and a layer index for the long feature vector and a cluster index and a layer index for the short feature vector.
  • When retrieval is based on long feature vectors, the retrieval process uses the cluster index and layer index of the long feature vector in Table 1; when retrieval is based on short feature vectors, it uses the cluster index and layer index of the short feature vector in Table 1.
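  • As an illustration of the record layout described above, a single database entry could be modelled as below. The class name `FeatureRecord` and all field names are assumptions for this sketch; the application only states which pieces of information a record carries.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class FeatureRecord:
    """One database entry (hypothetical field names)."""
    image_id: str                          # uniquely identifies an image
    feature: Tuple[float, ...]             # feature information of the vector
    cluster_index: int                     # cluster the vector was assigned to
    layer_index: int                       # layer the vector was assigned to
    collected_at: Optional[str] = None     # optional collection time
    collected_where: Optional[str] = None  # optional collection location

record = FeatureRecord("img-0001", (0.12, 0.80, 0.33), cluster_index=1, layer_index=2)
```

  • A variant with separate long and short feature vectors would simply carry two such (feature, cluster index, layer index) triples per image.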
  • The number of rounds of data classification processing in step S1201 is directly proportional to the subsequent retrieval efficiency: the more rounds of data classification processing are performed, the more groups of first-type data and second-type data are obtained, and the less data is classified in step S1202 as second-type data that affects retrieval accuracy. Correspondingly, in the subsequent retrieval processing, less data needs brute-force search, retrieval takes less time, and accuracy is higher.
  • Moreover, because more rounds of data classification processing yield more groups of first-type data and second-type data, the retrieval range can be further determined within the first-type data according to the hierarchy of these subdivisions, and the data to be retrieved is searched within that range, which reduces the retrieval time.
  • In each cluster, the center point and the feature vectors whose distance from the center point of the cluster is less than or equal to a third threshold may be assigned to the first layer, the feature vectors whose distance from the center point of the cluster is less than or equal to a fourth threshold may be assigned to the second layer, ..., and the feature vectors whose distance from the center point of the cluster is greater than an n-th threshold may be assigned to the n-th layer, where n ≥ 1.
  • Each feature vector is classified as first-type data or second-type data based on the distance between the feature vector and the center point of its cluster.
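  • The threshold-based layering above can be illustrated with a small sketch. The helper `layer_by_distance` and the convention that k ascending thresholds produce k + 1 layers are assumptions made here; the application does not fix how the thresholds are numbered or chosen.

```python
import bisect
import math

def layer_by_distance(vector, center, thresholds):
    """Return a 1-based layer index from the distance to the cluster center.

    thresholds must be sorted in ascending order. A distance less than or
    equal to thresholds[0] gives layer 1, a distance less than or equal to
    thresholds[1] gives layer 2, and a distance greater than the last
    threshold gives the outermost layer.
    """
    distance = math.dist(vector, center)
    return bisect.bisect_left(thresholds, distance) + 1

center = (0.0, 0.0)
thresholds = [1.0, 2.0]  # assumed values
layers = [layer_by_distance(v, center, thresholds)
          for v in [(0.0, 0.0), (1.0, 0.0), (1.5, 0.0), (3.0, 0.0)]]
```

  • `bisect_left` is used so that a distance exactly equal to a threshold stays in the inner layer, matching the "less than or equal to" wording.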
  • A feature vector corresponding to the image to be retrieved is obtained. Then the distance between the feature vector to be retrieved and the center point of each cluster divided in step S110 is calculated, and all clusters are sorted by this distance to obtain the order of similarity between the feature vector to be retrieved and each cluster from high to low, that is, the order of the distance between the feature vector to be retrieved and the center point of each cluster from low to high.
  • One or more clusters can then be selected according to the similarity between the feature vector to be retrieved and each cluster to determine the retrieval range of the current retrieval task, and a retrieval operation is performed to obtain a retrieval result.
  • For example, the original data in the database is divided into three clusters: cluster 1, cluster 2, and cluster 3.
  • The distances between the feature vector to be retrieved and the center points of cluster 1, cluster 2, and cluster 3, calculated according to the Euclidean distance formula, are 3, 4.5, and 3, respectively.
  • The similarity between the feature vector to be retrieved and the three clusters, ordered from high to low, is cluster 3, cluster 1, and cluster 2.
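  • Ranking clusters by the distance between the query and each center point can be sketched as follows. The function name `rank_clusters` and the center coordinates are hypothetical; the coordinates are illustrative values chosen so that no two distances are equal.

```python
import math

def rank_clusters(query, centers):
    """Return cluster indexes sorted from most to least similar to the query.

    Higher similarity means smaller Euclidean distance to the center point.
    """
    return sorted(centers, key=lambda cid: math.dist(query, centers[cid]))

# Assumed center points; distances from the query are 3, 4.5, and 1.
centers = {1: (3.0, 0.0), 2: (4.5, 0.0), 3: (1.0, 0.0)}
ranking = rank_clusters((0.0, 0.0), centers)
```

  • Only one distance per cluster is computed at this stage, so the ranking is cheap compared with scanning every feature vector.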
  • The first-type data refers to the first-type data obtained in the last of the multiple rounds of data classification processing performed in step S120.
  • The A clusters are the first A clusters after the clusters are sorted in step S130 by similarity to the feature vector to be retrieved from high to low.
  • The B layers are selected according to the numbering of the layer indexes.
  • A and B are values preset according to business requirements. For example, A may be 3 and B may be 2.
  • In that case, the three clusters closest to the feature vector to be retrieved are selected, and the feature vectors in those three clusters whose layer index belongs to the two selected layers are used as the first search range.
  • The feature vector in the first search range that is closest to the feature vector to be retrieved is the retrieval result of the first-type data.
  • The search result obtained from the first-type data is referred to as the first search result.
  • The B layers may be selected according to the identifiers of the layer indexes, or may be selected at random.
  • A and B can also be specified manually, which is not limited in this application.
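  • Selecting the first search range from the A nearest clusters and the B chosen layers can be sketched as a simple filter. The helper `first_search_range` and the record layout (dictionaries with `id`, `cluster`, and `layer` keys) are assumptions for illustration.

```python
def first_search_range(records, ranked_clusters, A, layers):
    """Keep first-type records in the A most similar clusters and the chosen layers.

    records: iterable of dicts with "id", "cluster", and "layer" keys
    ranked_clusters: cluster indexes sorted from most to least similar
    layers: set of B selected layer indexes
    """
    chosen = set(ranked_clusters[:A])
    return [r for r in records if r["cluster"] in chosen and r["layer"] in layers]

records = [
    {"id": "V1", "cluster": 1, "layer": 1},
    {"id": "V2", "cluster": 1, "layer": 3},
    {"id": "V3", "cluster": 2, "layer": 2},
    {"id": "V4", "cluster": 3, "layer": 1},
]
# Assumed ranking and parameters: A = 2 clusters, B = 2 layers (layers 1 and 2).
selected = first_search_range(records, ranked_clusters=[3, 1, 2], A=2, layers={1, 2})
selected_ids = [r["id"] for r in selected]
```

  • Only the records that pass both the cluster filter and the layer filter are compared with the query afterwards, which is what keeps this part of the search narrow.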
  • The second-type data refers to the second-type data obtained in the last of the multiple rounds of data classification processing performed in step S120. Because the second-type data is a set of feature vectors with low similarity to the other feature vectors in the original database, whether it is included in the retrieval range affects the retrieval accuracy. Therefore, to improve retrieval accuracy, a brute-force search is performed on all of the second-type data: the distance between the feature vector to be retrieved and each item of second-type data is calculated, and the feature vector closest to the feature vector to be retrieved is determined as the retrieval result of the second-type data, which ensures the accuracy of the retrieval process. To distinguish it from the search result obtained in step S140, the search result obtained by brute-force search on the second-type data is referred to as the second search result.
  • At least one item of second-type data may also be selected from the second-type data, according to the cluster index and the layer index, as a second search range.
  • A brute-force search is then performed in the second search range to obtain at least one second distance as the search result.
  • Step S140 may be performed first and then step S150; step S150 may be performed first and then step S140; or step S140 and step S150 may be performed simultaneously.
  • The process of determining the matching feature vector includes comparing the first retrieval result with the second retrieval result and selecting the one with the highest similarity to the feature vector to be retrieved as the final retrieval result. That is, the first search result and the second search result are sorted by distance, and the feature vector with the smallest distance from the feature vector to be retrieved is selected as the feature vector matching the feature vector to be retrieved.
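  • The comparison of the first and second search results can be sketched as follows. The helper names `nearest` and `retrieve`, and the toy vectors, are assumptions; the sketch takes the first search range and the second-type data as already determined.

```python
import math

def nearest(query, candidates):
    """Brute-force nearest neighbour; returns (distance, id), or None if empty."""
    return min(((math.dist(query, vec), vid) for vid, vec in candidates.items()),
               default=None)

def retrieve(query, first_range, second_type):
    """Pick the closer of the first and second search results."""
    results = [r for r in (nearest(query, first_range), nearest(query, second_type))
               if r is not None]
    return min(results)[1] if results else None

query = (0.0, 0.0)
first_range = {"V1": (1.0, 0.0), "V4": (2.0, 0.0)}   # narrowed first-type data
second_type = {"V7": (0.5, 0.0)}                      # searched exhaustively
match = retrieve(query, first_range, second_type)
```

  • The two calls to `nearest` are independent of each other, which is why the two searches may run in either order or at the same time.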
  • In the data retrieval method described above, the original data is first divided into multiple clusters and then into multiple groups of first-type data and second-type data. Based on these groups, the first-type data of the last round of data classification processing is divided into multiple layers, finally yielding the second-type data, which affects retrieval accuracy, and the first-type data, which affects retrieval speed.
  • In the retrieval process, the retrieval range within the first-type data is determined according to the layer index and the cluster index of each feature vector, and a narrowed-range search is performed on that part of the data.
  • A brute-force search is performed on the second-type data.
  • Finally, the search results obtained from the first-type data and the second-type data are sorted, and the feature vector with the highest similarity to the feature vector to be retrieved is determined as the final search result.
  • In this way, the feature vectors in the original database are first classified to identify the second-type data, which has low similarity to other feature vectors and affects retrieval accuracy, and a brute-force search is performed on it to ensure retrieval accuracy.
  • The feature vectors in the original database that are highly similar to one another and affect retrieval speed are searched within a narrowed retrieval range to ensure retrieval speed. Together, the two ensure the accuracy and speed of the whole retrieval process, reduce its time consumption, and improve its efficiency.
  • A narrowed search may also be performed on the feature vectors that belong to C selected clusters and whose layer index belongs to the B selected layers.
  • C is obtained by adding a first offset value to A, and the first offset value is an integer greater than or equal to 1. Here a is the total number of clusters divided in the database and b is the total number of layers into which the first-type data is divided; a ≥ 1, b ≥ 1, and a, b, A, B, and C are all positive integers.
  • The first offset value is a preset value. In specific implementation, it may be set according to the retrieval accuracy requirement.
  • In other words, C clusters are selected, and the feature vectors in those clusters whose layer index belongs to the B selected layers are used as the retrieval range for a narrowed search.
  • The purpose of narrowing the search range is to improve search speed.
  • A part of the data with high similarity is selected as the search range, and the size of the search range affects search accuracy.
  • Adding the first offset value to the value A selected by the user appropriately expands the user-selected search range, which further improves search accuracy.
  • A clusters may also be selected, and a narrowed search performed on the feature vectors whose layer index belongs to D selected layers.
  • D is obtained by adding a second offset value to B, and the second offset value is an integer greater than or equal to 1. Here 1 ≤ A ≤ a and 1 ≤ B ≤ b, where a is the total number of clusters divided in the database and b is the total number of layers into which the first-type data is divided; a ≥ 1, b ≥ 1, and a, b, A, B, and D are all positive integers.
  • The second offset value is a preset value and may be set according to the retrieval accuracy requirement.
  • In other words, in the retrieval process, A clusters are selected, and the feature vectors whose layer index belongs to the B selected layers plus a further second-offset-value number of layers are used as the search range for a narrowed search.
  • Adding the second offset value to the B layers selected by the user appropriately expands the search range, which can further improve search accuracy.
  • E clusters may also be selected, and a narrowed search performed on the feature vectors whose layer index belongs to F selected layers.
  • E is obtained by adding the first offset value to A, and F is obtained by adding the second offset value to B. Here a is the total number of clusters divided in the database and b is the total number of layers into which the first-type data is divided; a ≥ 1, b ≥ 1, and a, b, A, B, E, and F are all positive integers.
  • The first offset value and the second offset value are both preset integers greater than or equal to 1, and can be set according to the retrieval accuracy requirements.
  • In other words, in the retrieval process, on top of the A clusters and B layers selected by the user, a further first-offset-value number of clusters and second-offset-value number of layers are added, and the feature vectors in the resulting clusters and layers are used as the search range for a narrowed search.
  • Adding the first offset value to the selected clusters and the second offset value to the selected layers appropriately expands the search range beyond the clusters and layers selected by the user, which can further improve search accuracy.
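  • The three variants above (C = A + first offset, D = B + second offset, or both at once) differ only in which offset is non-zero. A hedged sketch is given below; the helper `expanded_counts` is hypothetical, and capping the results at the totals a and b is an assumption added here, since the text does not say what happens when an offset would exceed the number of clusters or layers.

```python
def expanded_counts(A, B, a, b, first_offset=0, second_offset=0):
    """Return (number of clusters, number of layers) after applying the offsets.

    A, B: user-selected numbers of clusters and layers (1 <= A <= a, 1 <= B <= b)
    a, b: total numbers of clusters and of first-type layers
    The min(...) cap is an assumption of this sketch, not stated in the text.
    """
    if not (1 <= A <= a and 1 <= B <= b):
        raise ValueError("A and B must satisfy 1 <= A <= a and 1 <= B <= b")
    return min(A + first_offset, a), min(B + second_offset, b)

C_and_B = expanded_counts(3, 2, a=10, b=10, first_offset=1)                    # C = A + 1
A_and_D = expanded_counts(3, 2, a=10, b=10, second_offset=1)                   # D = B + 1
E_and_F = expanded_counts(3, 2, a=10, b=10, first_offset=1, second_offset=1)   # both
```

  • Larger offsets trade retrieval speed for accuracy, because more clusters or layers are scanned.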
  • In the process of assigning layer indexes and cluster indexes, the original database may also be divided into several sub-databases, with the clustering and the layer division performed on the data in each sub-database.
  • For each feature vector to be retrieved, the processing of steps S130 to S150 described above is applied, with the feature vectors of the selected layers and clusters in each sub-database used as the search range.
  • Finally, as described in step S160, the resulting distances are sorted, and a matching feature vector is determined according to the sorted distances. This process can effectively reduce the time required for data preparation.
  • In addition, different hardware (for example, GPUs) can be used to divide clusters and layers for different sub-databases and then to perform searches within the selected search ranges, which effectively improves the efficiency of data preparation and retrieval. Because each sub-database contains less data than the original database, each round of data classification processing also involves less reference data. When a feature vector in a sub-database is used as reference data to calculate its distance from other data, the amount of calculation is less than for the same reference data in the original database.
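  • Searching several sub-databases and merging the results can be sketched as follows. The function names are hypothetical, each shard is assumed to already be the selected search range of one sub-database, and a thread pool stands in for the different hardware mentioned in the text purely for illustration.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def search_shard(query, shard):
    """Nearest vector in one sub-database; returns (distance, id) or None."""
    return min(((math.dist(query, vec), vid) for vid, vec in shard.items()),
               default=None)

def search_sub_databases(query, shards):
    """Search every shard independently, then sort the partial results by distance."""
    with ThreadPoolExecutor() as pool:
        partial = list(pool.map(lambda shard: search_shard(query, shard), shards))
    return sorted(hit for hit in partial if hit is not None)

shards = [
    {"V1": (1.0, 0.0), "V2": (4.0, 0.0)},
    {"V3": (0.5, 0.0)},
    {},  # an empty sub-database contributes no result
]
hits = search_sub_databases((0.0, 0.0), shards)
best_id = hits[0][1] if hits else None
```

  • The final sort over the per-shard results corresponds to sorting the distances before the matching feature vector is chosen.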
  • the original data in the database is an image. Each image is represented by a feature vector.
  • the first type of data and the first type are determined in step S1201.
  • the second type of data is taken as an example to further introduce the data retrieval method provided in the embodiments of the present application.
  • the original database includes feature vectors V1, V2, ..., V20 corresponding to 20 images.
  • The original data are first divided into cluster 1, cluster 2, and cluster 3 according to a preset clustering algorithm (for example, KMEANS).
  • The center point of cluster 1 is V6, the center point of cluster 2 is V5, and the center point of cluster 3 is V19.
  • Two rounds of data classification processing are performed.
  • In the first round, cluster 1 is selected as the comparison cluster.
  • The data of cluster 1 are the comparison data of the first round of data classification processing, and the feature vectors in the original database are used in turn as reference data. For example, V1 is used as the current reference data.
  • The Euclidean distance between V1 and each of V2, V3, ..., V20 is calculated, and the other feature vectors, ordered by distance to V1 from low to high, are V2, V10, V3, ..., V7. Assume that the top 3 feature vectors in the distance ranking are taken.
  • V20, V9, and V3 are selected as the data to be classified for V1 as the current reference data. It is then judged whether V20, V9, and V3 belong to cluster 1 (the comparison data selected in the first round of data classification processing), and it is finally determined that V20 and V3 are not in cluster 1 and that V9 is in cluster 1.
  • V9 is therefore determined to be first-type data, and V20 and V3 second-type data. V2 is then used as the current reference data, and the Euclidean distance between V2 and the feature vectors in the original data other than V20 and V3 is calculated. Ordering the other feature vectors by distance to V2 from low to high, the top 3 are V6, V3, and V19. Finally, it is determined that V6 belongs to cluster 1 and that V3 and V19 do not, so V6 is first-type data and V3 and V19 are second-type data.
  • After the first round, the original data are divided into first-type data 1 and second-type data 1, where first-type data 1 includes V6 and V9, and second-type data 1 includes all feature vectors other than V6 and V9.
  • The second round of data classification processing uses a method similar to the first round.
  • Clusters 1 and 2 are selected as the comparison clusters in the second round, and all feature vectors in clusters 1 and 2 are used as the comparison data of the second round.
  • Each feature vector is used in turn as reference data to judge whether each of the other feature vectors belongs to first-type data 2.
  • As a result, first-type data 2 includes V1, V2, V3, V4, V5, V6, and V9,
  • and second-type data 2 includes V7, V8, V10, V11, V12, V13, V14, V15, V16, V17, V18, V19, and V20.
  • First-type data 2 is divided into two layers, with layer index 1 and layer index 2 respectively: the first-type data of layer 1 are V6 and V9, and the first-type data of layer 2 are V1, V2, V3, V4, and V5.
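The layer split in this example can be reproduced with a short sketch. It applies the rule stated in this application's summary (layer 1 is the first-type data of round 1, and layer j is the first-type data of round j minus that of round j-1). The function name `assign_layers` and the use of Python sets are illustrative assumptions, and the sketch assumes each round's first-type data contains the previous round's.

```python
def assign_layers(first_type_per_round):
    """Map each first-type item to a 1-based layer index.

    first_type_per_round[j - 1] is the set of first-type data produced by
    round j (hypothetical input format). Layer 1 is round 1's set; layer j
    is round j's set minus round j-1's set.
    """
    layers = {}
    previous = set()
    for j, current in enumerate(first_type_per_round, start=1):
        for item in current - previous:
            layers[item] = j
        previous = current
    return layers


# First-type data of the two rounds in the worked example.
round1 = {"V6", "V9"}
round2 = {"V1", "V2", "V3", "V4", "V5", "V6", "V9"}

layers = assign_layers([round1, round2])
layer1 = sorted(v for v, j in layers.items() if j == 1)
layer2 = sorted(v for v, j in layers.items() if j == 2)
print(layer1)  # ['V6', 'V9']
print(layer2)  # ['V1', 'V2', 'V3', 'V4', 'V5']
```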
  • When data is retrieved, the distance between the data to be retrieved and the center point of each cluster is first calculated, and the clusters are sorted by that distance. Assume that the order from low to high is cluster 1, cluster 2, cluster 3.
  • Assume that two clusters and one layer of data are selected as the first search range. Then, based on first-type data 2 and second-type data 2 obtained in the second round of data classification processing above, the data in first-type data 2 that belong to clusters 1 and 2 and have layer index 1 form the first search range.
  • The data of the first search range therefore include V6 and V9. The distances between the data to be retrieved and V6 and V9 are then calculated; assume they are 6 and 9 respectively. In second-type data 2, brute-force retrieval is used to calculate the distance between the data to be retrieved and each feature vector. Assume that the distances between the data to be retrieved and V7, V8, V10, V11, V12, V13, V14, V15, V16, V17, V18, V19, and V20 are 7, 8, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, and 20 respectively.
  • Sorting these distances shows that V6 is the feature vector with the highest similarity to the data to be retrieved.
  • V6 is therefore determined as the final retrieval result of the data to be retrieved.
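The final selection in this example amounts to taking the smallest distance across both partial results. The following sketch uses the distances assumed in the example; the function name `pick_final` and the dictionary layout are illustrative assumptions rather than part of the application.

```python
def pick_final(first_results, second_results):
    """Return the (name, distance) pair with the smallest distance
    across the first-range results and the brute-force results."""
    candidates = list(first_results.items()) + list(second_results.items())
    return min(candidates, key=lambda pair: pair[1])


# Distances from the worked example: narrowed search over V6 and V9,
# brute-force search over second-type data 2.
first_results = {"V6": 6, "V9": 9}
second_results = {
    f"V{i}": i for i in (7, 8, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20)
}

best = pick_final(first_results, second_results)
print(best)  # ('V6', 6)
```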
  • In the data preparation stage, the original data are divided into the first type of data and the second type of data, and the first type of data is divided into multiple layers.
  • In the retrieval process, a search range is determined in the first type of data according to clusters and layers, and a narrowed-range search is performed in that range, while a brute-force search is performed in the second type of data; the final search result is determined from the results of both searches. This resolves the conflict between accuracy and speed in the traditional technology and further improves retrieval efficiency while ensuring retrieval accuracy.
  • The retrieval method provided by the embodiments of the present application is described in detail above with reference to FIG. 1 to FIG. 4.
  • The apparatus and the heterogeneous server system for data retrieval provided by the embodiments of the present application are described below with reference to FIG. 5 to FIG. 7.
  • FIG. 5 is a schematic structural diagram of a retrieval device according to an embodiment of the present application.
  • The data retrieval apparatus 500 includes a first retrieval unit 501, a second retrieval unit 502, and a determining unit 503.
  • The database stores N pieces of data, and the N pieces of data in the database are divided into first-type data and second-type data, N ≥ 2;
  • a first retrieval unit 501 is configured to determine a first retrieval range in the first type of data, and retrieve data to be retrieved in the first retrieval range to obtain a first retrieval result, wherein the first retrieval range The data in is a subset of the first type of data;
  • a second retrieval unit 502 configured to retrieve the data to be retrieved in the entire range of the second type of data to obtain a second retrieval result
  • a determining unit 503 is configured to determine a final retrieval result of the data to be retrieved from the first retrieval result and the second retrieval result.
  • The apparatus 500 further includes a data classification processing unit 504, which is configured to divide the N pieces of data into M clusters according to a clustering algorithm, where each piece of data corresponds to one cluster, each cluster has a center point, each piece of data is close in similarity to the center point of the cluster to which it belongs, M ≥ 2, and the cluster index of each cluster uniquely identifies the cluster;
  • The second type of data is the set of data at the edge points of the clusters, and the first type of data is the set of data in the original database other than the second type of data.
  • the data classification processing unit 504 is further configured to divide the first type of data into multiple layers according to a preset algorithm, and each layer includes at least one first type of data, and each first type of data It belongs to one layer, and the layer index of each layer is used to uniquely identify a layer.
  • The data classification processing unit 504 is further configured to select comparison clusters from the M clusters; select z reference data from the N pieces of data, 1 ≤ z ≤ N; and perform the following data classification processing for each reference data: retrieve data to be classified from the database according to the current reference data, the data to be classified being data close in similarity to the current reference data; and determine whether the data to be classified belongs to the comparison clusters, and if so, assign the data to be classified to the first type of data, or if not, assign it to the second type of data.
  • The data classification processing unit 504 is further configured to calculate the similarity between the current reference data and the other N-1 pieces of data, sort by the calculated similarity, and take the m pieces of data ranked highest in similarity to the current reference data as the data to be classified, 1 ≤ m ≤ N-1; or to calculate the similarity between the current reference data and the center points of the M clusters, determine the m clusters ranked highest in similarity to the current reference data, and use the data in those m clusters as the data to be classified, 1 ≤ m ≤ N-1.
  • The data classification processing unit 504 is further configured to, after each of the N pieces of data has served as the current reference data and one round of data classification processing is complete, use each of the N pieces of data as reference data again to perform the next round of data classification processing. The same comparison clusters are used throughout each round; the number of comparison clusters selected in the next round is greater than the number selected in the previous round, and the comparison clusters selected in the previous round are a subset of those selected in the next round. The data classification processing ends when the number of comparison clusters reaches a preset maximum value.
  • The data classification processing unit 504 obtains one group of classification results in each round of data classification processing; the X-th group of classification results, obtained in the X-th round, is the final classification result, where the X-th round is the last round of data classification processing;
  • The data classification processing unit 504 is further configured to configure a cluster index and a layer index for each piece of data according to the clustering result and the layering result of each of the N pieces of data.
  • The first type of data has a cluster index and a layer index. The first retrieval unit 501 is further configured to preset indication information of the first retrieval range, where the indication information specifies the clusters and layers included in the first retrieval range;
  • and to determine, according to the data to be retrieved and the indication information, the cluster indexes of the clusters and the layer indexes of the layers included in the first retrieval range.
  • The apparatus 500 in the embodiments of the present application may be implemented by an application-specific integrated circuit (ASIC) or a programmable logic device (PLD).
  • The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), or generic array logic (GAL).
  • The apparatus 500 may perform the method described in the embodiments of the present application, and the above and other operations and/or functions of the units in the apparatus 500 respectively implement the corresponding processes of the method in FIG. 2; for brevity, they are not repeated here.
  • In the data preparation stage, the original data are divided into the first type of data, which affects retrieval speed, and the second type of data, which affects retrieval accuracy, and the first type of data is then divided into multiple layers.
  • In the retrieval process, for the first type of data, the first retrieval range is determined according to the layer index and the cluster index of each feature vector, and a narrowed-range retrieval is performed on that part of the data.
  • For the second type of data, a brute-force search is performed on all of the data.
  • The retrieval results obtained for the first type of data and the second type of data are compared, and the data with the highest similarity to the data to be retrieved is determined as the final retrieval result.
  • In this process, the second type of data, which affects accuracy, is identified during the data preparation stage and searched by brute force to ensure retrieval accuracy.
  • The feature vectors that affect retrieval speed are searched within a narrowed retrieval range to ensure retrieval speed. Together, the two ensure the accuracy and speed of the whole retrieval process, reduce its time consumption, and improve its efficiency.
  • FIG. 6 is a schematic diagram of another data retrieval apparatus according to an embodiment of the present application.
  • the apparatus 600 includes a first processor 601, a memory 602, a communication interface 603, and a bus 604.
  • the first processor 601, the memory 602, and the communication interface 603 communicate through a bus 604, and communication may also be implemented through other means such as wireless transmission.
  • the memory 602 is used to store program code 6021
  • the first processor 601 is used to call the program code 6021 stored in the memory 602 to perform the following operations:
  • a first search range is determined in the first type of data, and data to be retrieved is searched for in the first search range to obtain a first search result, wherein the data in the first search range is a subset of the first type of data.
  • a final retrieval result of the data to be retrieved is determined from the first retrieval result and the second retrieval result.
  • The apparatus 600 may further include a second processor 605. The first processor 601, the memory 602, the communication interface 603, and the second processor 605 communicate through the bus 604, and the second processor 605 is configured to assist the first processor 601 in performing the data retrieval task.
  • The first processor 601 may be a CPU, or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.
  • a general-purpose processor may be a microprocessor or any conventional processor.
  • the second processor 605 may be a dedicated search processor, for example, a GPU, a neural processing unit (NPU), or a CPU, or other general-purpose processors, digital signal processors (DSPs), Application Specific Integrated Circuit (ASIC), Field Programmable Gate Array (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
  • a general-purpose processor may be a microprocessor or any conventional processor.
  • the apparatus 600 further includes a video memory.
  • The first processor 601 and the second processor 605 in the apparatus 600 shown in FIG. 6 are only an example; the number of first processors 601 and second processors 605 and the number of cores of each processor do not limit the embodiments of the present application.
  • the memory 602 may be volatile memory or non-volatile memory, or may include both volatile and non-volatile memory.
  • The non-volatile memory may be a read-only memory (ROM), a programmable read-only memory (programmable ROM, PROM), an erasable programmable read-only memory (erasable PROM, EPROM), an electrically erasable programmable read-only memory (electrically EPROM, EEPROM), or a flash memory.
  • The volatile memory may be a random access memory (RAM), which is used as an external cache.
  • Available forms of RAM include dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchlink dynamic random access memory (SLDRAM), and direct Rambus RAM (DR RAM).
  • the bus 604 may also include a power bus, a control bus, and a status signal bus. However, for the sake of clarity, various buses are marked as the bus 604 in the figure.
  • The apparatus 600 may correspond to the apparatus 500 in the embodiments of the present application and may perform the method shown in FIG. 2; the above and other operations and/or functions of the modules in the apparatus 600 respectively implement the corresponding processes of the method in FIG. 2 and, for brevity, are not repeated here.
  • FIG. 7 is a schematic diagram of a heterogeneous device system according to an embodiment of the present application.
  • The heterogeneous device system includes a first device 700 and a second device 800.
  • The first device 700 and the second device 800 communicate over a network, which includes Ethernet, an optical fiber network, or an InfiniBand (IB) network.
  • the second device 800 is configured to assist the first device 700 in performing a data retrieval task.
  • the first device 700 includes a first processor 701, a memory 702, a communication interface 703, and a bus 704.
  • the first processor 701, the memory 702, and the communication interface 703 communicate through a bus 704, and may also implement communication through other means such as wireless transmission.
  • the memory 702 is used to store program code 7021
  • the first processor 701 is used to call the program code 7021 stored in the memory 702 to schedule the second device 800 to assist it to complete the data retrieval task.
  • the second device 800 includes a first processor 801, a memory 802, a communication interface 803, a bus 804, and a second processor 805.
  • the first processor 801, the memory 802, the communication interface 803, and the second processor 805 communicate through the bus 804, and communication may also be implemented by other means such as wireless transmission.
  • the first processor 801 of the second device 800 is configured to receive the data retrieval task of the first device 700, and instruct the second processor 805 to perform data retrieval.
  • The specific retrieval method is the same as that of the method embodiment described above, and details are not repeated here.
  • the second device 800 may also be referred to as a heterogeneous device.
  • The first processor 701 and the first processor 801 may both be CPUs, or may be other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like.
  • a general-purpose processor may be a microprocessor or any conventional processor.
  • the second processor 805 may be a dedicated search processor, for example, a GPU, a neural processing unit (NPU), or a CPU. It may also be another general-purpose processor, a digital signal processor (DSP), Application Specific Integrated Circuit (ASIC), Field Programmable Gate Array (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
  • a general-purpose processor may be a microprocessor or any conventional processor.
  • the second device when the second processor 805 is a GPU, the second device further includes a video memory (not shown in the figure).
  • The first processor 701, the first processor 801, and the second processor 805 in the heterogeneous device system shown in FIG. 7 are only examples; the number of first processors 701, first processors 801, and second processors 805, and the number of cores of each processor, do not limit the embodiments of the present application.
  • Both the memory 702 and the memory 802 may be volatile memory or non-volatile memory, or may include both volatile and non-volatile memory.
  • The non-volatile memory may be a read-only memory (ROM), a programmable read-only memory (programmable ROM, PROM), an erasable programmable read-only memory (erasable PROM, EPROM), an electrically erasable programmable read-only memory (electrically EPROM, EEPROM), or a flash memory.
  • The volatile memory may be a random access memory (RAM), which is used as an external cache.
  • Available forms of RAM include dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchlink dynamic random access memory (SLDRAM), and direct Rambus RAM (DR RAM).
  • bus 704 and the bus 804 may also include a power bus, a control bus, a status signal bus, and the like. However, for the sake of clarity, various buses are marked as the bus 704 or the bus 804 in the figure.
  • the above embodiments may be implemented in whole or in part by software, hardware, firmware, or any other combination.
  • the above embodiments may be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer instructions.
  • the computer program instructions When the computer program instructions are loaded or executed on a computer, the processes or functions according to the embodiments of the present application are wholly or partially generated.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices.
  • The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or a wireless manner (for example, infrared, wireless, or microwave).
  • the computer-readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server, a data center, and the like, including one or more sets of available media.
  • the usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium.
  • the semiconductor medium may be a solid state drive (SSD).

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Automation & Control Theory (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

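This application discloses a data retrieval method that comprises two parts: a data preparation process and a data retrieval process. In the data preparation process, the original data in a database is divided into first-type data and second-type data. In the data retrieval process, a first retrieval range is determined in the first-type data, and data to be retrieved is searched for in the first retrieval range to obtain a first retrieval result. The data to be retrieved is also searched for in all of the second-type data to obtain a second retrieval result. Finally, a final retrieval result is determined from the first retrieval result and the second retrieval result, which addresses the retrieval speed and retrieval accuracy problems of the data retrieval process.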

Description

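Method and apparatus for data retrieval
Technical Field
This application relates to the computer field, and in particular to a method and an apparatus for data retrieval.
Background
With the development of computer technology, image-based retrieval methods are applied in more and more fields. For example, a public security system can identify criminals by comparing portrait data collected in real time with data in a database. A traffic system can use license plate information collected in real time to accurately locate a vehicle's trajectory and find a vehicle involved in an accident. In these application scenarios, the database stores a large amount of data, and as the collected data keeps growing, more and more data must be compared during retrieval, which poses a great challenge to the speed and accuracy of image retrieval. How to provide a method that both ensures image retrieval accuracy and improves image retrieval speed has become an urgent problem in the image retrieval field.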
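Summary
This application provides a method and an apparatus for data retrieval that can both ensure retrieval speed and improve retrieval accuracy.
According to a first aspect, a data retrieval method is provided. The method includes: dividing N pieces of data in a database into first-type data and second-type data, N ≥ 2; determining a first retrieval range in the first-type data, and searching for data to be retrieved in the first retrieval range to obtain a first retrieval result, where the data in the first retrieval range is a subset of the first-type data; searching for the data to be retrieved in the entire range of the second-type data to obtain a second retrieval result; and finally determining a final retrieval result of the data to be retrieved from the first retrieval result and the second retrieval result. Specifically, the Euclidean distance between the data to be retrieved and each piece of data in the first retrieval range of the first-type data is calculated, and the data closest to the data to be retrieved is determined as the first retrieval result; the Euclidean distance between the data to be retrieved and each piece of second-type data is calculated, and the data closest to the data to be retrieved is determined as the second retrieval result; the retrieval result with the smallest Euclidean distance to the data to be retrieved is then taken as the final retrieval result. In this process, the data in the original database is divided into first-type data and second-type data, where the first-type data is data that affects retrieval speed and the second-type data is data that affects retrieval accuracy. A narrowed-range retrieval is performed in the first-type data, a brute-force retrieval is performed in the second-type data, and the final retrieval result is obtained from the retrieval results of the two types of data, thereby ensuring both retrieval accuracy and retrieval speed.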
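The two-path retrieval of the first aspect can be sketched as follows. This is a minimal illustration under stated assumptions: the function names (`euclidean`, `nearest`, `retrieve`), the dictionary layout, and the sample vectors are hypothetical, and the first retrieval range is passed in already selected.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(query, vectors):
    """Return (key, distance) of the vector closest to query, or None if empty."""
    if not vectors:
        return None
    key = min(vectors, key=lambda k: euclidean(query, vectors[k]))
    return key, euclidean(query, vectors[key])


def retrieve(query, first_range, second_type):
    """Search the first retrieval range and all second-type data,
    then keep whichever partial result is closer to the query."""
    first_result = nearest(query, first_range)    # narrowed-range search
    second_result = nearest(query, second_type)   # brute-force search
    candidates = [r for r in (first_result, second_result) if r is not None]
    return min(candidates, key=lambda r: r[1])


# Hypothetical data: two vectors in the first retrieval range,
# two second-type vectors.
first_range = {"a": (0.0, 0.0), "b": (5.0, 5.0)}
second_type = {"c": (1.0, 1.0), "d": (9.0, 9.0)}

result = retrieve((1.0, 2.0), first_range, second_type)
print(result)  # ('c', 1.0)
```

In this sample the closest vector lies in the second-type data, which is the case the brute-force path exists to cover.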
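In a possible implementation, the N pieces of data are divided into M clusters according to a clustering algorithm, where each piece of data corresponds to one cluster, each cluster has a center point, each piece of data is close in similarity to the center point of the cluster to which it belongs, M ≥ 2, and the cluster index of each cluster uniquely identifies the cluster. The second-type data is then the set of data at the edge points of the clusters, and the first-type data is the set of the other data in the original database. In this way, the original data can be divided into two types, and the second-type data, which affects retrieval accuracy, can be identified.
Optionally, the second-type data is the set of data at the edge points of the clusters. An edge point is a piece of data for which the range centered on it, with a radius equal to a first threshold, contains data of two or more clusters. The first-type data is then the data in the original database other than the second-type data.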
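The edge-point definition above can be sketched directly. The function name `split_by_edge_points`, the input layout, and the choice to treat a distance exactly equal to the radius as inside the range are assumptions made for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def split_by_edge_points(points, cluster_of, radius):
    """Split data into (first_type, second_type) name lists.

    points: dict name -> vector; cluster_of: dict name -> cluster index.
    A point is an edge point (second type) when the ball of the given
    radius around it contains data from two or more clusters. The point
    itself is inside its own ball, so its own cluster always counts.
    """
    first_type, second_type = [], []
    for name, vec in points.items():
        clusters_nearby = {
            cluster_of[other]
            for other, other_vec in points.items()
            if euclidean(vec, other_vec) <= radius
        }
        if len(clusters_nearby) >= 2:
            second_type.append(name)
        else:
            first_type.append(name)
    return first_type, second_type


# Hypothetical one-dimensional layout: p3 and p4 sit on the cluster boundary.
points = {
    "p1": (0.0, 0.0), "p2": (1.0, 0.0), "p3": (4.0, 0.0),
    "p4": (5.0, 0.0), "p5": (8.0, 0.0),
}
cluster_of = {"p1": 1, "p2": 1, "p3": 1, "p4": 2, "p5": 2}

first_type, second_type = split_by_edge_points(points, cluster_of, radius=1.5)
print(first_type)   # ['p1', 'p2', 'p5']
print(second_type)  # ['p3', 'p4']
```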
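In another possible implementation, the first-type data may further be divided into multiple layers according to a preset algorithm, where each layer includes at least one piece of first-type data, each piece of first-type data belongs to one layer, and the layer index of each layer uniquely identifies the layer. With the first-type data divided into multiple layers, the retrieval range can be selected during data retrieval by combining the cluster and layer divisions of the first-type data, so that a narrowed-range retrieval is performed in the first-type data, which reduces retrieval time and improves retrieval efficiency.
In another possible implementation, dividing the N pieces of data into first-type data and second-type data includes: selecting comparison clusters from the M clusters; selecting z reference data from the N pieces of data, 1 ≤ z ≤ N; and performing the following data classification processing for each reference data: retrieving data to be classified from the database according to the current reference data, where the data to be classified is data close in similarity to the current reference data; and determining whether the data to be classified belongs to the comparison clusters, and if so, assigning the data to be classified to the first-type data, or if not, assigning it to the second-type data. Through one round of data classification processing as described, the original data can be divided into one combination of first-type data and second-type data, so that different retrieval methods can be used for the different types of data during retrieval, which improves retrieval speed while ensuring retrieval accuracy.
In another possible implementation, retrieving the data to be classified from the database, where the data to be classified is data close in similarity to the current reference data, includes: calculating the similarity between the current reference data and the other N-1 pieces of data, sorting by the calculated similarity, and obtaining the m pieces of data ranked highest in similarity to the current reference data, 1 ≤ m ≤ N-1; or calculating the similarity between the current reference data and the center points of the M clusters, determining the m clusters ranked highest in similarity to the current reference data, and using the data in the m clusters as the data to be classified, 1 ≤ m ≤ N-1. With this method, one round of data classification processing divides the data in the original database into first-type data and second-type data. The first-type data is subsequently divided into multiple layers based on these combinations of first-type data and second-type data, and finally the data to be retrieved is searched for in the first-type data with a narrowed retrieval range, which reduces retrieval time and improves retrieval efficiency.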
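In another possible implementation, selecting z reference data from the N pieces of data includes: z = N, that is, each of the N pieces of data is selected in turn as the current reference data.
In another possible implementation, after each of the z pieces of data has served as the current reference data and one round of data classification processing is complete, each of the z pieces of data is used as reference data again to perform the next round of data classification processing. The same comparison clusters are used throughout each round; the number of comparison clusters selected in the next round is greater than the number selected in the previous round, and the comparison clusters selected in the previous round are a subset of those selected in the next round. The data classification processing ends when the number of comparison clusters reaches a preset maximum value. Through these multiple rounds, multiple combinations of first-type data and second-type data are obtained, and the first-type data can further be divided into different layers based on these combinations. During data retrieval, the retrieval range in the first-type data is determined by combining layer indexes and cluster indexes, and a narrowed-range retrieval is performed, which reduces retrieval time and improves retrieval efficiency.
In another possible implementation, in each round of data classification processing, if the P-th piece of data, as data to be classified, is assigned to the second-type data, the P-th piece of data no longer takes part in the subsequent data classification processing of this round as data to be classified, where the P-th piece of data is any one of the N pieces of data. To improve classification efficiency in the data preparation stage, once a piece of data is determined to be second-type data, it is marked as second-type data for this round and need not take part in the rest of the round, which reduces the computation and time of data classification processing.
In another possible implementation, each round of data classification processing yields one group of classification results; the X-th group of classification results, obtained in the X-th round, is the final classification result, where the X-th round is the last round of data classification processing. The first-type data obtained in the last round is divided into layers as follows: the first-type data of layer 1 is the first-type data obtained in round 1, and the first-type data of layer j = (the first-type data obtained in round j) - (the first-type data obtained in round j-1), 2 ≤ j ≤ X.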
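In another possible implementation, a cluster index and a layer index are configured for each piece of data according to the clustering result and the layering result of each of the N pieces of data.
In another possible implementation, the first-type data has cluster indexes and layer indexes, and indication information of the first retrieval range is preset, where the indication information specifies the clusters and layers included in the first retrieval range; the cluster indexes of the clusters and the layer indexes of the layers included in the first retrieval range are determined according to the data to be retrieved and the indication information.
In another possible implementation, in the first-type data, one or more pieces of first-type data that are in A clusters and whose layer indexes belong to B layers are selected as the first retrieval range, where 1 ≤ A ≤ a, 1 ≤ B ≤ b, a is the total number of clusters into which the database is divided, b is the total number of layers into which the first-type data is divided, a ≥ 1, b ≥ 1, and a, b, A, and B are all positive integers.
In another possible implementation, in the first-type data, one or more pieces of first-type data that are in C clusters and whose layer indexes belong to B layers are selected as the first retrieval range, where C = A + first offset value, first offset value ≥ 1, 1 ≤ A ≤ a, 1 ≤ B ≤ b, a is the total number of clusters into which the database is divided, b is the total number of layers into which the first-type data is divided, a ≥ 1, b ≥ 1, and a, b, A, B, C, and the first offset value are all positive integers.
In another possible implementation, in the first-type data, one or more pieces of first-type data that are in A clusters and whose layer indexes belong to D layers are selected as the first retrieval range, where D = B + second offset value, the second offset value is a positive integer greater than 1, 1 ≤ A ≤ a, 1 ≤ B ≤ b, a is the total number of clusters into which the database is divided, b is the total number of layers into which the first-type data is divided, a ≥ 1, b ≥ 1, and a, b, A, B, and D are all positive integers.
In another possible implementation, in the first-type data, one or more pieces of first-type data that are in E clusters and whose layer indexes belong to F layers are selected as the first retrieval range, where E = A + first offset value, F = B + second offset value, first offset value ≥ 1, second offset value ≥ 1, the first offset value and the second offset value are both positive integers, 1 ≤ A ≤ a, 1 ≤ B ≤ b, a is the total number of clusters into which the database is divided, b is the total number of layers into which the first-type data is divided, a ≥ 1, b ≥ 1, and a, b, A, B, E, and F are all positive integers.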
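The four range-selection variants above differ only in whether an offset is added to the cluster count, the layer count, or both. The sketch below covers all of them with two optional offsets. It rests on two assumptions that the text here does not spell out: that the selected clusters are those whose center points are closest to the data to be retrieved, and that "B layers" means layers 1 through B. The function name, parameter names, and record layout are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def first_search_range(query, centers, records, A, B,
                       cluster_offset=0, layer_offset=0):
    """Select first-type records for the first retrieval range.

    centers: dict cluster_index -> center vector.
    records: list of (name, cluster_index, layer_index) for first-type data.
    Keeps records that lie in the (A + cluster_offset) clusters nearest to
    the query (assumption) and whose layer index is at most
    B + layer_offset (assumption).
    """
    n_clusters = min(A + cluster_offset, len(centers))
    n_layers = B + layer_offset
    by_distance = sorted(centers, key=lambda c: euclidean(query, centers[c]))
    chosen = set(by_distance[:n_clusters])
    return [name for name, c, layer in records
            if c in chosen and layer <= n_layers]


# Hypothetical first-type records: (name, cluster index, layer index).
centers = {1: (0.0, 0.0), 2: (4.0, 0.0), 3: (10.0, 0.0)}
records = [("x1", 1, 1), ("x2", 1, 2), ("x3", 2, 1), ("x4", 2, 2), ("x5", 3, 1)]
query = (1.0, 0.0)

base = first_search_range(query, centers, records, A=1, B=1)
wider = first_search_range(query, centers, records, A=1, B=1, cluster_offset=1)
widest = first_search_range(query, centers, records, A=1, B=1,
                            cluster_offset=1, layer_offset=1)
print(base)    # ['x1']
print(wider)   # ['x1', 'x3']
print(widest)  # ['x1', 'x2', 'x3', 'x4']
```

Each offset enlarges the range, trading some retrieval time for a lower chance of missing the true nearest neighbor.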
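In another possible implementation, before the first retrieval processing is performed, an image to be retrieved is received, and feature information of the image to be retrieved is then extracted as the data to be retrieved.
Based on the foregoing, in the data preparation process the data in the database is divided into first-type data, which affects retrieval speed, and second-type data, which affects retrieval accuracy, and the first-type data is further divided into multiple layers. In the subsequent retrieval process, the second-type data is searched by brute-force retrieval to ensure retrieval accuracy, and the first-type data is searched with a narrowed retrieval range to ensure retrieval speed. Data that affects retrieval accuracy mainly refers to data in the database that has low similarity to other data and therefore directly affects retrieval accuracy. Data that affects retrieval speed refers to data that directly affects retrieval speed; this data affects retrieval accuracy only indirectly, and how much of it is selected affects retrieval speed. By dividing the original data into first-type data and second-type data and using a different retrieval method for each, the accuracy and speed of the whole retrieval process are ensured, its time consumption is reduced, and its efficiency is improved.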
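According to a second aspect, this application provides an apparatus for data retrieval, where the retrieval apparatus includes modules configured to perform the data retrieval method in the first aspect or any possible implementation of the first aspect.
According to a third aspect, this application provides an apparatus for data retrieval. The apparatus includes a processor, a memory, a communication interface, and a bus. The processor, the memory, and the communication interface are connected through the bus and communicate with one another. The memory is configured to store program code. When the apparatus runs, the processor executes the program code in the memory to perform, using the hardware resources of the apparatus, the operation steps of the method in the first aspect or any possible implementation of the first aspect.
According to a fourth aspect, this application provides a heterogeneous apparatus system for data retrieval. The heterogeneous apparatus system includes a first apparatus and a second apparatus that communicate over a network, and the second apparatus is configured to assist the first apparatus in data retrieval processing. The first apparatus includes a first processor, a memory, a communication interface, and a bus; the first processor, the memory, and the communication interface are connected through the bus and communicate with one another, and the memory is configured to store program code. The second apparatus includes a first processor, a second processor, a memory, a communication interface, and a bus; the first processor, the second processor, the memory, and the communication interface are connected through the bus and communicate with one another, and the memory is configured to store program code. When the first apparatus runs, its first processor executes the program code in the memory of the first apparatus and sends the data to be retrieved to the second apparatus. After the first processor of the second apparatus receives the data to be retrieved, the second processor executes the program code in the second apparatus to complete the retrieval of the data to be retrieved, and the result is sent to the first processor of the first apparatus. The first processors of the first apparatus and the second apparatus include a CPU, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and the like. The second processor includes a graphics processing unit (GPU) or a neural processing unit (NPU).
Optionally, when the second processor is a GPU, the second apparatus further includes a video memory.
According to a fifth aspect, this application provides a computer-readable storage medium storing instructions that, when run on a computer, cause the computer to perform the methods described in the foregoing aspects.
According to a sixth aspect, this application provides a computer program product containing instructions that, when run on a computer, cause the computer to perform the methods described in the foregoing aspects.
On the basis of the implementations provided in the foregoing aspects, this application may further combine them to provide more implementations.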
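Brief Description of Drawings
FIG. 1 is a schematic architectural diagram of a data retrieval system provided by this application;
FIG. 2 is a schematic flowchart of a data retrieval method provided by this application;
FIG. 3 is a schematic flowchart of a data classification processing method provided by this application;
FIG. 4 is a schematic flowchart of another data classification processing method provided by this application;
FIG. 5 is a schematic structural diagram of an apparatus for data retrieval provided by this application;
FIG. 6 is a schematic structural diagram of another apparatus for data retrieval provided by this application;
FIG. 7 is a schematic structural diagram of a heterogeneous apparatus system for data retrieval provided by this application.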
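Detailed Description
The technical solutions in the embodiments of this application are clearly described below with reference to the accompanying drawings in the embodiments of this application.
FIG. 1 is a schematic architectural diagram of a retrieval system provided by an embodiment of this application. As shown in the figure, the retrieval system includes a retrieval server 100 and a database cluster 200, which communicate through a network 300. The retrieval server 100 is configured to receive retrieval tasks and search the database cluster 200 for matching data. The system architecture shown in FIG. 1 may include one or more retrieval servers 100; FIG. 1 is described using a retrieval system with only one retrieval server 100 as an example. The database cluster 200 includes at least one database server 201. Database applications may be deployed in a distributed manner, or each server may deploy a different type of database application according to service requirements and store different data.
The database stores multiple pieces of data, which may also be called original data. For images stored in the database, feature information of each image can be extracted according to a preset rule. Each image includes multiple pieces of feature information, and the pieces of feature information that identify one image may together be called a feature vector. For example, an image may consist of multiple pixels; if each pixel is treated as one feature of the image, the information identifying each pixel can be called one piece of feature information of the image, and the set of feature information of all pixels making up the image is called a feature vector. A feature vector may be represented by a matrix, each element of which represents one piece of feature information of the image. Further, the feature vectors of one image include long feature vectors and short feature vectors. A long feature vector includes all feature information of the image, and a short feature vector includes only part of the feature information of the image. That is, one image can be represented by one long feature vector or by multiple short feature vectors. For ease of description, the following description uses the case in which each image is identified by a long feature vector; unless otherwise specified, "feature vector" in the following description means a long feature vector.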
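As a possible embodiment, in the system architecture shown in FIG. 1, the retrieval server 100 may also be co-deployed with a database server 201 in the database cluster 200. That is, in addition to storing original data, the database server 201 also processes retrieval tasks.
As another possible embodiment, instead of being deployed as a cluster, the database servers that store data in FIG. 1 may be deployed directly as one or more database servers. In this case, each database server in the retrieval system may deploy a distributed database application, a non-distributed database application, or another form of database application, which is not limited in this application. Optionally, the retrieval server may be co-deployed with one of the database servers.
Data retrieval in this application includes retrieval of text, images, audio, and video. For ease of description, the following embodiments use image retrieval as an example. Specifically, the data stored in the database is image data, and the data to be retrieved is also image data.
The embodiments of this application are described using an example in which a database cluster including three database servers 201 is deployed in the retrieval system shown in FIG. 1. The original data may be stored in the database cluster 200 all at once, or may be updated periodically or in real time. For example, after an application server (not shown in FIG. 1) obtains image data through a camera, it stores the data in a database server 201 of the database cluster 200.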
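With reference to the data retrieval system shown in FIG. 1, to address the speed and accuracy problems of data retrieval in the traditional technology, an embodiment of this application provides a data retrieval method that includes a data preparation process and a retrieval process. In the data preparation process, the original data in the database is divided into first-type data and second-type data. In the retrieval process, a narrowed-range retrieval is performed in the first-type data; that is, a first retrieval range is selected in the first-type data, and the data to be retrieved is then searched for in the first retrieval range. A brute-force retrieval is performed in the second-type data; that is, the data to be retrieved is searched for in the entire range of the second-type data. Finally, the final retrieval result is determined from the retrieval result of the first-type data and the retrieval result of the second-type data.
The process of dividing the data in the original database into first-type data and second-type data includes: first dividing the original data into multiple clusters, where each cluster has a center point and a cluster index uniquely identifies a cluster; and then assigning the data at the edge points of each cluster to the second-type data. That is, the second-type data is the set of data at the edge points of the clusters. An edge point is a piece of data for which the range centered on it, with a radius equal to a first threshold, contains data of two or more clusters. The first-type data is then the data in the original database other than the second-type data.
Brute-force retrieval means calculating the similarity between the data to be retrieved and every piece of data in the second-type data, and selecting the top n pieces of data, ranked from highest to lowest similarity to the data to be retrieved, as the matching result, where n is an integer greater than 1. The similarity of two pieces of data can be determined by calculating the distance between them, using, for example, the Euclidean distance formula or the cosine distance formula. For ease of description, this application uses the distance between two pieces of data to indicate their similarity. For example, when the Euclidean distance formula is used, the retrieval process includes: first, calculating the distance between the data to be retrieved and all data in the retrieval range according to the Euclidean distance formula; then, sorting by these distances and selecting the top n pieces of data, ranked from lowest to highest distance, as the matching result.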
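Brute-force retrieval as described can be sketched in a few lines, with the distance function left as a parameter so that either the Euclidean or the cosine distance can be used. The function names and the sample vectors are illustrative assumptions.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 minus the cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def brute_force_search(query, vectors, n, metric=euclidean):
    """Compute the distance to every vector and return the names of the
    n closest ones, ordered from smallest to largest distance."""
    ranked = sorted(vectors, key=lambda k: metric(query, vectors[k]))
    return ranked[:n]


# Hypothetical data.
vectors = {"a": (1.0, 0.0), "b": (3.0, 4.0), "c": (1.0, 1.0)}
query = (2.0, 0.0)

top2 = brute_force_search(query, vectors, n=2)
top1_cosine = brute_force_search(query, vectors, n=1, metric=cosine_distance)
print(top2)         # ['a', 'c']
print(top1_cosine)  # ['a']
```

Every vector is visited once per query, which is why the cost grows linearly with the size of the searched set.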
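Narrowed-range retrieval means first clustering the original data by similarity according to a preset clustering algorithm, where each set of clustered data is called a cluster and each cluster has a center point; the preset clustering algorithm includes the KMEANS algorithm. The distance between each piece of data to be retrieved and the center point of each cluster is then calculated, and the data of the m clusters whose center points are closest in similarity to the data to be retrieved is selected as the retrieval range. Within the selected retrieval range, the distance between the data to be retrieved and each piece of data is calculated, and the retrieval result is determined from the sorted distances.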
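A minimal sketch of narrowed-range retrieval follows. The function name `narrowed_search`, the input layout, and the sample data are assumptions chosen so that the example also shows how a small m can miss the true nearest neighbor when it lies in a cluster that was not selected.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def narrowed_search(query, centers, members, m, n):
    """Search only the m clusters whose centers are closest to the query.

    centers: dict cluster_index -> center vector.
    members: dict cluster_index -> {name: vector}.
    Returns the names of the n closest vectors inside the selected clusters.
    """
    nearest_clusters = sorted(
        centers, key=lambda c: euclidean(query, centers[c])
    )[:m]
    search_range = {}
    for c in nearest_clusters:
        search_range.update(members[c])
    ranked = sorted(search_range, key=lambda k: euclidean(query, search_range[k]))
    return ranked[:n]


# Hypothetical data: "c" is the true nearest neighbor of the query but
# belongs to the cluster whose center is farther away.
centers = {1: (1.0, 0.0), 2: (8.6, 0.0)}
members = {
    1: {"a": (0.0, 0.0), "b": (2.0, 0.0)},
    2: {"c": (5.2, 0.0), "d": (12.0, 0.0)},
}
query = (4.0, 0.0)

one_cluster = narrowed_search(query, centers, members, m=1, n=1)
two_clusters = narrowed_search(query, centers, members, m=2, n=1)
print(one_cluster)   # ['b']
print(two_clusters)  # ['c']
```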
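As can be seen from the above, if brute-force retrieval is used to search all data in the database for the data to be retrieved, retrieval accuracy is ensured, but retrieval takes a long time and is inefficient. If narrowed-range retrieval is used on all data in the database, retrieval time is reduced, but retrieval accuracy cannot be ensured. Therefore, this application divides the original data into first-type data and second-type data, performs a different retrieval method in each of the two types of data, and then compares the retrieval results to confirm the final retrieval result. This resolves the conflict between retrieval accuracy and retrieval time caused by using only a single retrieval method in the database, and further improves retrieval speed while ensuring retrieval accuracy.
Next, the data retrieval method provided by the embodiments of this application is described in detail with reference to FIG. 2. The data retrieval method includes a data preparation process and a retrieval process, where steps S110 to S120 are the data preparation process and steps S130 to S160 are the retrieval process. The data preparation process may be completed during system initialization, or the data in the database may be updated in real time or periodically according to service requirements. As shown in the figure, the method includes:
S110: Divide the multiple feature vectors in the database into at least two clusters according to a preset clustering algorithm, where each cluster has a center point, each feature vector belongs to only one cluster, and a cluster index uniquely identifies a cluster.
Before a data retrieval task is processed, the data preparation process must be completed; that is, the data is first divided into first-type data and second-type data. First, the feature vectors in the database are divided into two or more sets according to a preset clustering algorithm. Each set may be called a cluster, and each cluster has a center point. The feature vectors assigned to a cluster are those whose similarity to the center point of the cluster satisfies a second threshold. Each cluster has a globally unique identifier, and the cluster index of each cluster uniquely identifies the cluster. The second threshold may be set manually according to service requirements or calculated from historical data of the retrieval system, which is not limited in this application. The preset clustering algorithm includes the KMEANS clustering algorithm, whose principle includes the following steps: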
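1) Select k feature vectors as the center points of k clusters, where k is an integer greater than 2;
2) Traverse all feature vectors in the database, calculate the distance between each feature vector and each of the k center points from 1), and assign each feature vector to the cluster of the center point closest to it;
3) Calculate the mean of all feature vectors in each cluster, and use the mean as the new center point of the cluster;
4) Repeat steps 2) and 3) until the k center points are the same as the means of all feature vectors in the respective clusters calculated in 3); the k feature vectors obtained at that point are the center points of the k clusters.
After the processing in steps 1) to 4), each feature vector in the database is assigned to a cluster whose center point has a similarity to it that satisfies the second threshold. The identifier of that cluster is the cluster index of the feature vector, and the identifier of a cluster may be represented by the number of the cluster.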
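Steps 1) to 4) can be sketched as a small, dependency-free routine. This is an illustrative sketch rather than the application's implementation: it assumes the first k vectors serve as the initial center points, adds a `max_iter` cap as a safeguard, and leaves the center of an empty cluster unchanged. The example uses k = 2 only to keep the data small.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(vectors, k, max_iter=100):
    """Return (centers, assignment) for a list of equal-length tuples."""
    # Step 1: pick k vectors as initial centers (assumption: the first k).
    centers = [tuple(v) for v in vectors[:k]]
    assignment = []
    for _ in range(max_iter):
        # Step 2: assign every vector to the cluster of its nearest center.
        assignment = [
            min(range(k), key=lambda c: euclidean(v, centers[c]))
            for v in vectors
        ]
        # Step 3: recompute each center as the mean of its cluster.
        new_centers = []
        for c in range(k):
            group = [v for v, a in zip(vectors, assignment) if a == c]
            if group:
                new_centers.append(
                    tuple(sum(col) / len(group) for col in zip(*group))
                )
            else:
                new_centers.append(centers[c])  # empty cluster: keep old center
        # Step 4: stop once the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assignment


# Hypothetical data: two well-separated groups.
vectors = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (10.0, 11.0), (1.0, 0.0)]
centers, assignment = kmeans(vectors, k=2)
print(assignment)  # [0, 1, 0, 1, 0]
print(centers[1])  # (10.0, 10.5)
```

The position of each vector's cluster in `assignment` plays the role of the cluster index described above.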
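As a possible embodiment, because determining the cluster and cluster index of each feature vector requires processing the original data in the database piece by piece, which occupies considerable computing resources of the processor in the retrieval server, the processing may in practice be performed by multiple processors or by a graphics processing unit (GPU).
S120: Perform at least one round of data classification processing, with each round dividing the multiple feature vectors in the database into first-type data and second-type data, so that at least one combination of first-type data and second-type data is obtained; and, based on the at least one combination of first-type data and second-type data, divide the first-type data obtained in the last round of data classification processing into multiple layers, where each layer includes at least one piece of first-type data and a layer index uniquely identifies a layer.
The database stores hundreds of millions of pieces of data or more. The original data can be divided into different layers, and a layer index can be added to the feature vectors in each layer, according to the following steps.
S1201: Perform at least one round of data classification processing, where each round divides the multiple feature vectors in the database into first-type data and second-type data.
Specifically, to determine the layer and layer index of each feature vector in the database, at least one round of data classification processing must be performed to obtain at least one combination of first-type data and second-type data. The method is the same in every round; for ease of understanding, one round of data classification processing is described below as an example. First, one or more clusters are selected as comparison clusters, and all feature vectors in the comparison clusters serve as comparison data. Then, with each feature vector in the original database used in turn as reference data, one round of data classification processing is completed in either of the following ways: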
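Manner 1: Divide the original data into first-type data and second-type data by brute-force retrieval.
Each feature vector in the original database is taken in turn as reference data. The Euclidean distance between that feature vector and every other feature vector in the original database is calculated, and the to-be-classified data retrieved from the database for the current reference data is determined from those distances: the n feature vectors with the highest similarity to the reference vector, that is, the feature vectors corresponding to the n smallest distances. It is then determined whether the to-be-classified data belongs to the comparison data of this round, that is, whether each of the n feature vectors belongs to a cluster selected for this round. If x of the n feature vectors belong to the clusters selected for this round, those x feature vectors are regarded as first-type data, which affects retrieval speed, in this round. If y of the n feature vectors do not belong to the selected clusters, those y feature vectors are regarded as second-type data, which affects retrieval precision, in this round. This completes one round of data classification processing. Here x ≤ n, y ≤ n, x + y = n, and n, x and y are all positive integers greater than or equal to 1.
For example, as shown in FIG. 3, k_choose denotes the number of clusters selected in the current round of data classification processing. When k_choose = 1, one cluster is selected as the comparison cluster in this round, and all feature vectors of that cluster serve as comparison data. For instance, cluster 1 is selected as the comparison cluster, and all feature vectors in cluster 1 serve as the comparison data of this round. Each feature vector in the original database is then taken in turn as reference data, and brute-force retrieval is performed for it. That is, the distance between the reference vector and every other feature vector in the original database is calculated and the distances are sorted to determine the n feature vectors most similar to the reference vector, namely the feature vectors corresponding to the n smallest distances. It is then determined whether each of the n feature vectors belongs to the cluster whose cluster index is 1. If a feature vector belongs to that cluster, it is treated as first-type data 1, which affects retrieval speed, in this round; if it does not, it is treated as second-type data 1, which affects retrieval precision, in this round. After this round of data classification processing, the original data is divided into two types: first-type data 1 and second-type data 1.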
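One round of Manner 1 can be sketched as below. The function name `classify_round_brute_force` and the data layout are hypothetical. The sketch classifies only vectors that are retrieved as a neighbor of some reference vector; how vectors that are never retrieved are handled is not spelled out at this point in the text, so leaving them unclassified is an assumption of the sketch.

```python
import math

def classify_round_brute_force(vectors, cluster_of, comparison_clusters, n):
    """One round of data classification processing, Manner 1.

    vectors:             hypothetical mapping id -> feature vector
    cluster_of:          hypothetical mapping id -> cluster index
    comparison_clusters: set of cluster indexes chosen for this round
    n:                   number of nearest neighbors retrieved per reference
    Returns (first_type, second_type) as sets of ids.
    """
    first_type, second_type = set(), set()
    for ref_id, ref in vectors.items():
        # Brute-force retrieval: rank every other vector by distance.
        others = [vid for vid in vectors if vid != ref_id]
        others.sort(key=lambda vid: math.dist(ref, vectors[vid]))
        for vid in others[:n]:  # the to-be-classified data
            if cluster_of[vid] in comparison_clusters:
                first_type.add(vid)
            else:
                second_type.add(vid)
    return first_type, second_type
```

Running the same function with a growing `comparison_clusters` set yields the sequence of first-type/second-type combinations that the later layering step consumes.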
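Manner 2: Divide the original data into first-type data and second-type data by narrowed-range retrieval.
Each feature vector in the original database is taken in turn as reference data. The distance between the reference data and the center point of each cluster obtained in step S110 is calculated, and the feature vectors of the m clusters whose center points are closest to the current reference data are selected as the to-be-classified data. It is then determined whether each of these m clusters is a comparison cluster selected for this round of data classification processing. If x of the m clusters are comparison clusters selected for this round, the feature vectors in those x clusters are regarded as first-type data, which affects retrieval speed. If y of the m clusters are not comparison clusters selected for this round, the feature vectors in those y clusters are regarded as second-type data, which affects retrieval precision. Here x ≤ m, y ≤ m, x + y = m, and m, x and y are all positive integers greater than or equal to 1.
For example, as shown in FIG. 3, k_choose denotes the number of comparison clusters selected in the current round of data classification processing. When k_choose is 1, one cluster is selected as the comparison cluster in this round, and all feature vectors of that cluster serve as comparison data. For instance, cluster 1 is selected as the comparison cluster, and all feature vectors of cluster 1 serve as the comparison data of this round. Each feature vector in the original database is then taken in turn as reference data, and narrowed-range retrieval is performed for it. That is, the distance between the current reference data and the center point of each cluster obtained in step S110 is calculated, and the m clusters whose center points are closest to the current reference data are selected as the to-be-classified data. It is then determined whether each of the m clusters is a comparison cluster selected for this round. If x of the m clusters are comparison clusters selected for this round, the feature vectors in those x clusters are treated as first-type data 1, which affects retrieval speed, in this round; if y of the m clusters are not comparison clusters selected for this round, the feature vectors in those y clusters are treated as second-type data 1, which affects retrieval precision, in this round.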
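Manner 2 differs from Manner 1 only in how the to-be-classified data is found: whole clusters are ranked by center-point distance instead of individual vectors. A hedged sketch, in which the function name and the choice to return cluster indexes rather than individual vectors are assumptions:

```python
import math

def classify_round_by_centers(vectors, centers, comparison_clusters, m):
    """One round of data classification processing, Manner 2.

    vectors:             hypothetical mapping id -> feature vector
    centers:             hypothetical mapping cluster index -> center point
    comparison_clusters: set of cluster indexes chosen for this round
    m:                   number of nearest clusters retrieved per reference
    Returns (first_clusters, second_clusters): the clusters whose feature
    vectors are treated as first-type and second-type data respectively.
    """
    first_clusters, second_clusters = set(), set()
    for ref in vectors.values():
        # Narrowed-range retrieval: rank clusters by center-point distance.
        ranked = sorted(centers, key=lambda cid: math.dist(ref, centers[cid]))
        for cid in ranked[:m]:
            if cid in comparison_clusters:
                first_clusters.add(cid)
            else:
                second_clusters.add(cid)
    return first_clusters, second_clusters
```

Each reference vector now costs one distance per cluster rather than one per vector, which is the main practical reason to prefer this manner on large databases.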
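As the single round described above shows, once the feature vectors of one or more comparison clusters have been selected as comparison data, either Manner 1 or Manner 2 divides the original data into one combination of first-type data and second-type data. In the same way, 1, 2, 3, ..., or i clusters can be selected as comparison data for successive rounds of data classification processing, each round dividing the original data into one combination of first-type data and second-type data, so that i combinations are finally obtained. In each round, the comparison clusters are obtained by adding one or more new clusters to the comparison clusters selected in the previous round. In other words, the comparison data of each round is the comparison data of the previous round plus the data of the newly added clusters. The value of i satisfies 1 ≤ i ≤ n, where i is a positive integer and n is the total number of clusters into which the original database is divided in step S110.
For example, as shown in FIG. 3, the feature vectors of 1, 2, 3, ..., 10 clusters are selected in turn as the comparison data of a round, and 10 rounds of data classification processing are performed. When k_choose is 1, all feature vectors of cluster 1 are selected as the comparison data of the round, and processing by Manner 1 or Manner 2 of step S1201 yields first-type data 1 and second-type data 1. When k_choose is 2, all feature vectors of cluster 1 and cluster 2 are selected as the comparison data of the round, and processing by Manner 1 or Manner 2 of step S1201 yields first-type data 2 and second-type data 2. When k_choose is 3, all feature vectors of cluster 1, cluster 2 and cluster 3 are selected as the comparison data of the round, and processing by Manner 1 or Manner 2 of step S1201 yields first-type data 3 and second-type data 3. By analogy, when k_choose is 10, all feature vectors of cluster 1, cluster 2, cluster 3, ..., cluster 9 and cluster 10 are selected as the comparison data of the round, and processing by Manner 1 or Manner 2 of step S1201 yields first-type data 10 and second-type data 10.
Optionally, one or more clusters may be selected as comparison clusters in the first round of data classification processing.
Optionally, in each round the comparison data may be selected according to cluster identifiers: at least one cluster is added to the comparison data of the previous round to form the comparison clusters of the current round, so that the comparison data of the current round consists of the data of the previous round's comparison clusters and the data of the newly added clusters. The identifier includes a number, a name, other information that uniquely denotes a cluster, or a combination of these. Alternatively, in each round a cluster may be selected at random, and all feature vectors of that cluster are added to the comparison data of the previous round to form the comparison data of the current round.
For example, assume that the original data is divided into 6 clusters and 4 rounds of data classification processing are performed. In round 1, cluster 1 is selected as the comparison cluster; in round 2, cluster 1 and cluster 2 are selected as the comparison clusters; in round 3, cluster 1, cluster 2 and cluster 3 are selected as the comparison clusters; in round 4, cluster 1, cluster 2, cluster 3 and cluster 5 are selected as the comparison clusters. Alternatively, in round 1, cluster 1 and cluster 2 are selected as the comparison clusters; in round 2, cluster 1, cluster 2 and cluster 3; in round 3, cluster 1, cluster 2, cluster 3 and cluster 4; in round 4, cluster 1, cluster 2, cluster 3, cluster 4 and cluster 5.
Optionally, to prevent a feature vector from being classified as both first-type data and second-type data within one round, once a feature vector has been determined to be second-type data it no longer takes part in the classification processing performed for other reference data in that round. For example, when cluster 1 is the comparison cluster of the round and feature vector 1 is the current reference data, feature vector 7 is determined to be second-type data; then, when feature vector 2 is the reference data, the similarity between feature vector 2 and feature vector 7 no longer needs to be calculated.
It should be noted that, for ease of description, the remainder of this application follows the data classification processing shown in FIG. 3: the first round uses one cluster as comparison data, and every subsequent round takes the comparison clusters of the previous round and adds one cluster, in order of cluster number, to form the comparison clusters of the current round.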
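S1202. Based on the at least one combination of first-type data and second-type data, divide the first-type data obtained in the last round of data classification processing into a plurality of layers. Each layer includes at least one item of first-type data, and a layer index uniquely identifies a layer.
The operations in step S1201 perform at least one round of data classification processing on the original data. For example, as shown in FIG. 3, 1, 2, 3, ..., i clusters (for example, i = 10) are selected in turn as comparison data, yielding the i combinations of first-type data and second-type data shown in FIG. 3. The layers of the first-type data can then be obtained according to formula (1):
Layer 1 of first-type data = first-type data obtained in round 1 of data classification processing;
Layer i of first-type data = first-type data obtained in round i of data classification processing - first-type data obtained in round i-1 of data classification processing.    (1)
Here, the feature vectors of layer 1 of the first-type data are the feature vectors of the first-type data obtained in round 1. The feature vectors of layer 2 are those of the first-type data obtained in round 2 minus those of the first-type data obtained in round 1; that is, layer 2 contains the feature vectors that the first-type data of round 2 has in addition to the first-type data of round 1. In formula (1), 2 ≤ i ≤ X, where X is the maximum number of rounds of data classification processing performed. The layer index of first-type data uniquely identifies a layer of first-type data. It may be represented by i, or by other text or a combination of text and numbers.
Based on the data classification processing of step S1201 and formula (1), the first-type data obtained in the last round can be divided into the plurality of layers shown in FIG. 3. FIG. 3 uses 10 rounds of data classification processing in S1201 as an example; in that case, the set of all feature vectors with layer indexes 1 to 10 is exactly the set of feature vectors contained in first-type data 10, obtained in round 10.
It should be noted that when only one round of data classification processing is performed, one combination of first-type data and second-type data is obtained, and the first-type data has only one layer.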
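Formula (1) is a chain of set differences, which can be sketched as follows. The function name `split_into_layers` is hypothetical; the sample data mirrors the two-round example that the application later gives for FIG. 4 (first-type data 1 = V6, V9; first-type data 2 = V1 to V6 and V9).

```python
def split_into_layers(first_type_per_round):
    """Apply formula (1) to the first-type data of rounds 1..X.

    first_type_per_round: list of sets; element i-1 is the first-type
    data obtained in round i.
    Returns a dict mapping layer index (1-based) to the set of ids in it.
    """
    layers = {}
    previous = set()
    for i, current in enumerate(first_type_per_round, start=1):
        # Layer i holds what round i added on top of round i-1.
        layers[i] = set(current) - previous
        previous = set(current)
    return layers

rounds = [
    {"V6", "V9"},
    {"V1", "V2", "V3", "V4", "V5", "V6", "V9"},
]
layers = split_into_layers(rounds)
```

When each round's first-type data contains the previous round's, the union of all layers equals the first-type data of the last round, matching the statement about layer indexes 1 to 10 above.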
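Optionally, the second-type data may also be divided into a plurality of layers based on the combinations of first-type data and second-type data obtained in the multiple rounds of data classification processing, according to formula (2):
Layer 1 of second-type data = second-type data obtained in round 1 of data classification processing;
Layer j of second-type data = second-type data obtained in round j of data classification processing - second-type data obtained in round j-1 of data classification processing.    (2)
Here, the feature vectors of layer 1 of the second-type data are the feature vectors of the second-type data obtained in round 1. The feature vectors of layer 2 are those of the second-type data obtained in round 2 minus those of the second-type data obtained in round 1; that is, layer 2 contains the feature vectors that the second-type data of round 2 has in addition to the second-type data of round 1. In formula (2), 2 ≤ j ≤ X, where X is the maximum number of rounds of data classification processing performed. The layer index of second-type data uniquely identifies a layer of second-type data. To distinguish the layer indexes of the two types, the layers of second-type data may be encoded differently from those of first-type data; for example, when the layer index of first-type data is represented by i, the layer index of second-type data is represented by 2i+1.
In another possible embodiment, the second-type data, which affects retrieval precision, is normally retrieved item by item in brute-force mode in order to guarantee retrieval precision. When a high retrieval speed is required, however, part of the second-type data may be selected according to its layer index and cluster index, and brute-force retrieval performed on that part only, to improve retrieval speed.
In another possible embodiment, the rounds of data classification processing in step S1201 are independent of one another. To speed up processing, the rounds may be executed in parallel to obtain the multiple combinations of first-type data and second-type data, after which the first-type data obtained in the last round is divided into layers according to step S1202. The parallel rounds may be completed by one processor scheduling different tasks, or by different processors scheduling different tasks. For example, one central processing unit (CPU) schedules different tasks to process multiple rounds in parallel, or multiple CPUs schedule different tasks to do so, or one or more graphics processing units (GPUs) process multiple rounds in parallel. This reduces the time taken by data classification processing and improves its efficiency.
In another possible embodiment, instead of dividing layers from the first-type data obtained in the last round as described in step S120, one layer of first-type data may be determined after each round. For example, suppose that 3 rounds of data classification processing are performed in the data preparation stage. When round 1 is completed, layer 1 of the first-type data is determined to be the first-type data obtained in round 1. When round 2 is completed, layer 2 of the first-type data is determined to be the first-type data obtained in round 2 minus the first-type data obtained in round 1. When round 3 is completed, layer 3 of the first-type data is determined to be the first-type data obtained in round 3 minus the first-type data obtained in round 2. The first-type data can also be divided into a plurality of layers in this way.
In summary of steps S1201 and S1202: the feature vectors of one or more clusters are selected as comparison data; the original data is then divided into multiple combinations of first-type data and second-type data by brute-force retrieval or narrowed-range retrieval; formula (1) is then used to divide the first-type data obtained in the last round into layers, and a layer index is added to each feature vector. In subsequent retrieval, the second-type data obtained in the last round is retrieved in brute-force mode to guarantee retrieval precision, and the first-type data obtained in the last round is retrieved in narrowed-range mode to guarantee retrieval efficiency. This balances retrieval speed against retrieval precision and improves retrieval efficiency while preserving precision.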
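After the above steps, each feature vector in the database carries a layer index and a cluster index. Table 1 is an example of a data structure provided in an embodiment of this application, including an image identifier and a feature vector identifier. The image identifier uniquely identifies an image in the database. The feature vector identifier includes the feature information contained in the feature vector, together with the cluster index and layer index of the feature vector. Optionally, the database also records acquisition information for the data, including the acquisition time and acquisition location.
Table 1. Example of a data structure in the database
[Table 1 appears as image PCTCN2019085483-appb-000001 in the original publication.]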
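In a possible embodiment, the feature vector identifier shown in Table 1 may further include a long feature vector and short feature vectors, together with the cluster index and layer index of the long feature vector and the cluster index and layer index of each short feature vector. In subsequent retrieval, if the query feature vector is a long feature vector, that is, all feature information of a query image forms one long feature vector, retrieval uses the cluster index and layer index of the long feature vector in Table 1. If the query is expressed as short feature vectors, that is, all feature information of a query image is represented by two or more short feature vectors, retrieval uses the cluster index and layer index of the short feature vectors in Table 1.
It should be noted that the number of rounds of data classification processing in step S1201 is positively correlated with subsequent retrieval efficiency. The more rounds are iterated, the more combinations of first-type data and second-type data are obtained, and the less data is classified in step S1202 as second-type data, which affects retrieval precision. Accordingly, less data needs brute-force retrieval in subsequent processing, retrieval takes less time, and precision is higher. For the first-type data, which affects retrieval speed, more rounds yield more combinations of first-type data and second-type data, and therefore more layers when the first-type data is layered on the basis of those combinations. In subsequent retrieval, the retrieval range within the first-type data can be narrowed further according to this finer layering before the query data is searched for in that range, which reduces retrieval time.
In a possible implementation, besides the multi-round data classification processing described above, other preset algorithms may be used to divide the original data into a plurality of layers. For example, the center point of each cluster, together with the feature vectors in each cluster whose distance from the cluster's center point is less than or equal to a third threshold, may be assigned to layer 1; the feature vectors in each cluster whose distance from the cluster's center point is less than or equal to a fourth threshold may be assigned to layer 2; and so on, until the feature vectors in each cluster whose distance from the cluster's center point is greater than the n-th threshold are assigned to layer n, n ≥ 1.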
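The threshold-based alternative can be sketched with a sorted list of thresholds. This is one possible reading, stated as an assumption: the thresholds increase from layer to layer, a vector goes to the first layer whose threshold it does not exceed, and anything beyond the last threshold goes to one further layer. The name `layer_by_distance` is hypothetical.

```python
import bisect

def layer_by_distance(distance, thresholds):
    """Map a distance from the cluster center to a 1-based layer index.

    thresholds: ascending list, e.g. [third_threshold, fourth_threshold].
    A distance <= thresholds[0] falls in layer 1, a distance in
    (thresholds[0], thresholds[1]] in layer 2, and so on; a distance
    above the last threshold falls in layer len(thresholds) + 1.
    """
    return bisect.bisect_left(thresholds, distance) + 1
```

For instance, with thresholds `[1.0, 2.0]` a vector at distance 1.5 from its center lands in layer 2, and one at distance 2.5 lands in layer 3.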
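The data classification method provided in the embodiments of this application is only an example. Other methods may be used to divide the original data in the database into two types on which different retrieval methods are performed. For example, each feature vector may be classified as first-type data or second-type data according to its distance from the center point of its cluster.
The data preparation stage provided in the embodiments of this application has been described above. The data retrieval process is described next with reference to steps S130 to S160.
S130. Calculate the distance between the query feature vector and the center point of each cluster, and sort all clusters according to those distances.
When a query image is received, the feature vector corresponding to the query image is obtained first. The distance between this query feature vector and the center point of each cluster obtained in step S110 is then calculated, and all clusters are sorted by that distance. This gives the clusters in descending order of similarity to the query feature vector, that is, in ascending order of the distance between the query feature vector and each cluster's center point. Subsequently, one or more clusters can be selected according to their similarity to the query feature vector to determine the retrieval range of the current retrieval task, after which retrieval is performed and a retrieval result is obtained. For example, the original data in the database is divided into 3 clusters: cluster 1, cluster 2 and cluster 3. If the Euclidean distances between the query feature vector and the center points of cluster 1, cluster 2 and cluster 3 are 3, 4.5 and 1 respectively, the clusters in descending order of similarity to the query feature vector are cluster 3, cluster 1 and cluster 2.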
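Step S130 amounts to a sort keyed on center-point distance. In the sketch below, the function name `rank_clusters` and the concrete center coordinates are assumptions; the coordinates were chosen only so that the distances from the query come out as 3, 4.5 and 1, as in the example above.

```python
import math

def rank_clusters(query, centers):
    """Return cluster indexes ordered from most to least similar to `query`.

    centers: hypothetical mapping cluster index -> center point.
    """
    return sorted(centers, key=lambda cid: math.dist(query, centers[cid]))

# Distances from the origin are 3, 4.5 and 1 for clusters 1, 2 and 3.
centers = {1: [3.0, 0.0], 2: [0.0, 4.5], 3: [1.0, 0.0]}
order = rank_clusters([0.0, 0.0], centers)
```

The first A entries of `order` are the clusters that step S140 then uses to build the first retrieval range.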
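S140. In the first-type data, perform narrowed-range retrieval on the feature vectors that belong to the selected A clusters and whose layer indexes belong to the selected B layers, to obtain at least one first retrieval result, where 1 ≤ A ≤ a, 1 ≤ B ≤ b, a is the total number of clusters in the database, b is the total number of layers of the first-type data, a ≥ 1, b ≥ 1, and a, b, A and B are all positive integers.
The first-type data here is the first-type data obtained in the last of the rounds of data classification processing performed in step S120. The A clusters are the first A clusters in the descending similarity order obtained in step S130. The B layers are determined according to the order of the layer index numbers. A and B are values preset according to service requirements. For example, A may be 3 and B may be 2. In that case, within the first-type data, the feature vectors that belong to the 3 clusters closest to the query feature vector and whose layer indexes belong to the 2 selected layers form the first retrieval range. The distance between the query feature vector and each data item in the first retrieval range is calculated, the distances are sorted, and the feature vector closest to the query feature vector is the retrieval result for the first-type data. For ease of description, the result retrieved from the first-type data is called the first retrieval result.
Optionally, the B layers may be selected according to layer index identifiers or selected at random. In addition, A and B may be specified manually; this application does not limit this.
S150. Perform brute-force retrieval on the feature vectors in the second-type data to obtain a second retrieval result.
The second-type data here is the second-type data obtained in the last of the rounds of data classification processing performed in step S120. Because the second-type data is a set of feature vectors in the original database that have low similarity to other feature vectors, whether the second-type data is included in the retrieval range affects retrieval precision. To improve retrieval precision, brute-force retrieval is therefore performed on all second-type data: the distance between the query feature vector and each item of second-type data is calculated, and the feature vector closest to the query feature vector is the retrieval result for the second-type data. To distinguish it from the result obtained in step S140, the result obtained by brute-force retrieval in the second-type data is called the second retrieval result.
Optionally, based on the layering of feature vectors in step S120, and to improve retrieval efficiency, at least one item of second-type data may be selected according to cluster index and layer index as a second retrieval range, and brute-force retrieval performed within the second retrieval range to obtain at least one second distance as the retrieval result.
It should be noted that steps S140 and S150 have no required order. Step S140 may be performed before step S150, step S150 may be performed before step S140, or the two steps may be performed at the same time.
S160. Determine the final retrieval result according to the first retrieval result and the second retrieval result.
Specifically, the matching feature vector is determined by comparing the first retrieval result with the second retrieval result and selecting the one with the highest similarity to the query feature vector as the final retrieval result. That is, the first retrieval result and the second retrieval result are sorted, and the feature vector with the smallest distance to the query feature vector is selected as the feature vector that matches the query, in other words as the final retrieval result.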
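Steps S130 to S160 can be put together in one short sketch. The record layout (keys `vector`, `cluster`, `layer`, `type`), the function names, and the sample values are all hypothetical; the application stores comparable fields per feature vector but does not prescribe this structure.

```python
import math

def nearest(query, ids, records):
    """Return (distance, id) of the closest record among ids, or None."""
    return min(
        ((math.dist(query, records[i]["vector"]), i) for i in ids),
        default=None,
    )

def retrieve(query, records, centers, A, layers):
    """Return the id of the final retrieval result, or None if no data."""
    # S130: rank clusters by center-point distance and keep the first A.
    ranked = sorted(centers, key=lambda cid: math.dist(query, centers[cid]))
    chosen = set(ranked[:A])
    # S140: first retrieval range = first-type data in chosen clusters/layers.
    first_range = [
        i for i, r in records.items()
        if r["type"] == 1 and r["cluster"] in chosen and r["layer"] in layers
    ]
    # S150: brute-force range = all second-type data.
    second_range = [i for i, r in records.items() if r["type"] == 2]
    # S160: compare the two results and keep the closer one.
    results = [
        res for res in (nearest(query, first_range, records),
                        nearest(query, second_range, records))
        if res is not None
    ]
    return min(results)[1] if results else None

centers = {1: [5.0], 2: [20.0], 3: [40.0]}
records = {
    "V6": {"vector": [6.0], "cluster": 1, "layer": 1, "type": 1},
    "V9": {"vector": [9.0], "cluster": 1, "layer": 1, "type": 1},
    "V1": {"vector": [21.0], "cluster": 2, "layer": 2, "type": 1},
    "V7": {"vector": [7.0], "cluster": 1, "layer": None, "type": 2},
    "V19": {"vector": [40.0], "cluster": 3, "layer": None, "type": 2},
}
best = retrieve([0.0], records, centers, A=2, layers={1})
```

Here `V1` is skipped because its layer is not selected, while every second-type record is always compared, which is the asymmetry the method relies on.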
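In summary of steps S110 to S160: in the data preparation stage, the original data is first divided into a plurality of clusters and then into multiple combinations of first-type data and second-type data; based on those combinations, the first-type data of the last round of data classification processing is divided into a plurality of layers, finally yielding second-type data, which affects retrieval precision, and first-type data, which affects retrieval speed. In the retrieval process, the retrieval range within the first-type data is determined from the layer index and cluster index of each feature vector, and narrowed-range retrieval is performed on that part of the data. Brute-force retrieval is performed on the second-type data. Finally, the results obtained from the first-type data and the second-type data are sorted, and the feature vector with the highest similarity to the query feature vector is the final retrieval result. In this process, the feature vectors in the original database are first classified; the second-type data, which has low similarity to other feature vectors and affects retrieval precision, is identified and retrieved in brute-force mode, which guarantees precision. Meanwhile, the feature vectors in the original database that are close in similarity and affect retrieval speed are retrieved in narrowed-range mode, which guarantees speed. Together, these guarantee both the precision and the speed of the whole retrieval process, reduce its time consumption, and improve its efficiency.
In a possible embodiment of the retrieval method shown in FIG. 2, to further improve retrieval precision, narrowed-range retrieval may in practice be performed on the feature vectors that belong to C selected clusters and whose layer indexes belong to the selected B layers. Here C is A plus a first offset value, the first offset value is an integer greater than or equal to 1, 1 ≤ A ≤ a, 1 ≤ B ≤ b, a is the total number of clusters in the database, b is the total number of layers of the first-type data, a ≥ 1, b ≥ 1, and a, b, A, B and C are all positive integers. The first offset value is a preset value that can be set according to the required retrieval precision. In other words, during retrieval, the feature vectors that belong to the C clusters and whose layer indexes belong to the selected B layers form the retrieval range for narrowed-range retrieval. Narrowed-range retrieval improves retrieval speed by using only part of the data, namely the part with higher similarity, as the retrieval range, and the size of that range affects retrieval precision. By adding the first offset value to the value A selected by the user, this embodiment moderately enlarges the user-selected retrieval range and further improves retrieval precision.
In a possible embodiment of the retrieval method shown in FIG. 2, based on the layering of the first-type data and to further improve retrieval precision, narrowed-range retrieval may in practice be performed on the feature vectors that belong to the selected A clusters and whose layer indexes belong to D selected layers. Here D is B plus a second offset value, the second offset value is an integer greater than or equal to 1, 1 ≤ A ≤ a, 1 ≤ B ≤ b, a is the total number of clusters in the database, b is the total number of layers of the first-type data, a ≥ 1, b ≥ 1, and a, b, A, B and D are all positive integers. The second offset value is a preset value that can be set according to the required retrieval precision. In other words, during retrieval, the A clusters are selected, the user-selected B layers are extended by the second offset value, and the feature vectors in the resulting clusters and layers form the retrieval range for narrowed-range retrieval. By adding the second offset value to the B layers selected by the user, this embodiment moderately enlarges the retrieval range and can further improve retrieval precision.
In a possible embodiment of the retrieval method shown in FIG. 2, based on the layering of the first-type data and to further improve retrieval precision, narrowed-range retrieval may in practice be performed on the feature vectors that belong to E selected clusters and whose layer indexes belong to F selected layers. Here E is A plus the first offset value, F is B plus the second offset value, 1 ≤ A ≤ a, 1 ≤ B ≤ b, a is the total number of clusters in the database, b is the total number of layers of the first-type data, a ≥ 1, b ≥ 1, and a, b, A, B, E and F are all positive integers. The first offset value and the second offset value are both integers greater than or equal to 1, and both are preset values that can be set according to the required retrieval precision. In other words, during retrieval, the user-selected A clusters are extended by the first offset value and the user-selected B layers are extended by the second offset value, and the feature vectors in the resulting clusters and layers form the enlarged retrieval range for narrowed-range retrieval. By adding the first offset value to the selected clusters and the second offset value to the selected layers, this embodiment moderately enlarges the retrieval range and can further improve retrieval precision.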
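The three offset variants above differ only in which offsets are non-zero, so they can be sketched with one helper. The name `expanded_range` is hypothetical, and clamping the results to the totals a and b is an assumption of this sketch; the text does not say what happens when A plus the offset exceeds the number of clusters.

```python
def expanded_range(A, B, a, b, cluster_offset=0, layer_offset=0):
    """Return (clusters_to_search, layers_to_search) after applying offsets.

    A, B:            user-selected numbers of clusters and layers
    a, b:            total numbers of clusters and first-type layers
    cluster_offset:  the "first offset value" (0 disables it)
    layer_offset:    the "second offset value" (0 disables it)
    The results are clamped to a and b (an assumption of this sketch).
    """
    if not (1 <= A <= a and 1 <= B <= b):
        raise ValueError("require 1 <= A <= a and 1 <= B <= b")
    return min(A + cluster_offset, a), min(B + layer_offset, b)
```

With `cluster_offset=1, layer_offset=0` this gives C, with `cluster_offset=0, layer_offset=1` it gives D, and with both set it gives E and F.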
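In another possible embodiment, when layer indexes and cluster indexes are assigned during data preparation, the data in the original database may instead be divided into several sub-databases, and clusters and layers determined separately for the data in each sub-database. During retrieval, for each query feature vector, the processing of steps S130 to S150 is applied to each sub-database, using the feature vectors of the selected layers and clusters in that sub-database as the retrieval range. Finally, as described in step S160, the resulting distances are sorted and the matching feature vector is determined from the sorted distances. This can effectively reduce the time taken by data preparation. In a specific implementation, different hardware (for example, GPUs) can determine clusters and layers for different sub-databases and then perform retrieval within the selected retrieval range, which effectively improves the efficiency of both data preparation and retrieval. Because each sub-database contains less data than the original database, each round of data classification processing has less reference data, and calculating the distance between a reference feature vector and the other data in its sub-database requires less computation than for the same reference data in the original database. Dividing the database into sub-databases, dividing first-type data and second-type data within each sub-database, and then searching each for the query data can therefore also preserve retrieval precision while reducing retrieval time.
It should be noted that the above method embodiment is only an example and, for simplicity, is described as a series of combined actions; a person skilled in the art should understand that this application is not limited by the described order of actions. Furthermore, combinations of all or some of the operation steps of the above method, or other reasonable combinations of steps that a person skilled in the art can conceive of from the above description, also fall within the protection scope of this application.
The data retrieval method provided in the embodiments of this application is further described below with reference to the example shown in FIG. 4. In this image retrieval example, the original data in the database is images, each image is represented by one feature vector, and the first-type data and second-type data are determined using Manner 1 of step S1201. As shown in the figure, the original database contains the feature vectors V1, V2, ..., V20 corresponding to 20 images.
In the data preparation process, the original data is first divided into cluster 1, cluster 2 and cluster 3 according to a preset clustering algorithm (for example, KMEANS). The center point of cluster 1 is V6, that of cluster 2 is V5, and that of cluster 3 is V19. Two rounds of data classification processing are then performed. In round 1, cluster 1 is selected as the comparison cluster, and the data of cluster 1 is the comparison data of round 1. Each feature vector in the original database is taken in turn as reference data. For example, with V1 as the current reference data, the Euclidean distances between V1 and V2, V3, ..., V20 are calculated, and the other feature vectors in ascending order of distance from V1 are V20, V9, V3, ..., V7. Assume that the 3 closest feature vectors are taken as the to-be-classified data when V1 is the current reference data; that is, V20, V9 and V3 are selected. It is then determined whether V20, V9 and V3 each belong to cluster 1 (the comparison data selected for round 1). V20 and V3 are found not to be in cluster 1 and V9 is found to be in cluster 1, so V9 is first-type data and V20 and V3 are second-type data. Next, V2 is taken as the current reference data, and the Euclidean distances between V2 and the feature vectors in the original data other than V20 and V3 are calculated. The feature vectors are sorted in ascending order of distance from V2, and the first 3 are V6, V3 and V19. V6 is found to belong to cluster 1 and V3 and V19 are found not to belong to cluster 1, so V6 is first-type data and V3 and V19 are second-type data. By analogy, after repeated processing, the original data is divided into first-type data 1 and second-type data 1, where first-type data 1 includes V6 and V9 and second-type data 1 includes all feature vectors other than V6 and V9. In round 2, using a method similar to round 1, cluster 1 and cluster 2 are selected as the comparison clusters, and all feature vectors in cluster 1 and cluster 2 are the comparison data of round 2. Each feature vector is taken in turn as reference data to determine whether each of the other feature vectors belongs to first-type data 2. It is finally determined that first-type data 2 includes V1, V2, V3, V4, V5, V6 and V9, and that second-type data 2 includes V7, V8, V10, V11, V12, V13, V14, V15, V16, V17, V18, V19 and V20. Based on these 2 rounds of data classification processing, first-type data 2 is divided into two layers, with layer index 1 and layer index 2: layer 1 of the first-type data is V6 and V9, and layer 2 of the first-type data is V1, V2, V3, V4 and V5.
In the data retrieval stage, when query data is received (for example, if the query is an image, the feature vector of the image must first be extracted as the query data), the distance between the query data and the center point of each cluster is calculated first and the clusters are sorted by that distance. Assume that the clusters in ascending order of distance are cluster 1, cluster 2 and cluster 3. Assume also that the data of two clusters and one layer in the first-type data is selected as the first retrieval range. Based on first-type data 2 and second-type data 2 obtained in round 2, the data in first-type data 2 that belongs to cluster 1 or cluster 2 and has layer index 1 is selected as the first retrieval range, which therefore contains V6 and V9. The distances between the query data and V6 and V9 are then calculated; assume they are 6 and 9 respectively. Next, brute-force retrieval is performed in second-type data 2 by calculating the distance between the query data and each feature vector in second-type data 2; assume that the distances between the query data and V7, V8, V10, V11, V12, V13, V14, V15, V16, V17, V18, V19 and V20 are 7, 8, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19 and 20 respectively. Finally, the distances are sorted; in ascending order they are 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19 and 20. The sorted distances show that V6 is the feature vector closest to the query data, that is, the feature vector with the highest similarity to the query data, so V6 is determined to be the final retrieval result for the query data.
The example in FIG. 4 shows that, with the data retrieval method of this application, the original data is divided into first-type data and second-type data in the data preparation stage and the first-type data is divided into a plurality of layers. A retrieval range is delimited by cluster and layer within the first-type data and narrowed-range retrieval is performed in that range, while brute-force retrieval is performed in the second-type data; the final retrieval result is then determined from the results of these retrievals. This resolves the conflict between precision and speed in conventional techniques and improves retrieval efficiency while preserving retrieval precision.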
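The retrieval method provided in the embodiments of this application has been described in detail above with reference to FIG. 1 to FIG. 4. The data retrieval apparatus and the heterogeneous server system provided in the embodiments of this application are described below with reference to FIG. 5 to FIG. 7.
FIG. 5 is a schematic structural diagram of a retrieval apparatus provided in an embodiment of this application. As shown in the figure, the data retrieval apparatus 500 includes a first retrieval unit 501, a second retrieval unit 502 and a determining unit 503. A database stores N data items, and the N data items in the database are divided into first-type data and second-type data, N ≥ 2.
The first retrieval unit 501 is configured to determine a first retrieval range in the first-type data and to search for the query data in the first retrieval range to obtain a first retrieval result, where the data in the first retrieval range is a subset of the first-type data.
The second retrieval unit 502 is configured to search for the query data in the entire range of the second-type data to obtain a second retrieval result.
The determining unit 503 is configured to determine the final retrieval result for the query data from the first retrieval result and the second retrieval result.
Optionally, the apparatus 500 further includes a data classification processing unit 504. The data classification processing unit 504 is configured to divide the N data items into M clusters according to a clustering algorithm, where each data item corresponds to one cluster, each cluster has a center point, the value of each data item has a close similarity to the center point of the cluster to which it belongs, M ≥ 2, and the cluster index of each cluster uniquely identifies the cluster. The second-type data is the set of edge-point data of all clusters, and the first-type data is the set of data in the original database other than the second-type data.
Optionally, the data classification processing unit 504 is further configured to divide the first-type data into a plurality of layers according to a preset algorithm, where each layer includes at least one item of first-type data, each item of first-type data belongs to one layer, and the layer index of each layer uniquely identifies the layer.
Optionally, the data classification processing unit 504 is further configured to: select comparison clusters from the M clusters; select z reference data items from the N data items, 1 ≤ z ≤ N; and perform the following data classification processing for each reference data item: retrieve to-be-classified data from the database according to the current reference data, where the to-be-classified data is data close in similarity to the current reference data; and determine whether the to-be-classified data belongs to the comparison clusters, and if so, classify the to-be-classified data as first-type data, and if not, classify the to-be-classified data as second-type data.
Optionally, the data classification processing unit 504 is configured to calculate the similarity between the current reference data and the other N-1 data items, sort by the calculated similarity, obtain the m data items with the highest similarity to the current reference data, and use those m data items as the to-be-classified data, 1 ≤ m ≤ N-1; or to calculate the similarity between the current reference data and the center points of the M clusters, determine the m clusters with the highest similarity to the current reference data, and use the data in those m clusters as the to-be-classified data, 1 ≤ m ≤ N-1.
Optionally, the data classification processing unit 504 is further configured to, when z = N, select each of the N data items in turn as the current reference data.
Optionally, the data classification processing unit 504 is further configured to, after each of the N data items has served as the current reference data and one round of data classification processing is complete, use each of the z data items as reference data again to perform the next round of data classification processing. Within each round, the same comparison clusters are used; the next round selects more comparison clusters than the previous round, and the comparison clusters of the previous round are a subset of those of the next round. Data classification processing ends when the number of comparison clusters reaches a preset maximum.
Optionally, each round of data classification processing by the data classification processing unit 504 yields one group of classification results; the X-th group of classification results, from round X, is the final classification result, where round X is the last round. The data classification processing unit is further configured to layer the first-type data obtained in the last round, where layer 1 of the first-type data is the first-type data obtained in round 1, and layer j of the first-type data = first-type data obtained in round j - first-type data obtained in round j-1, 2 ≤ j ≤ X.
Optionally, the data classification processing unit 504 is further configured to assign a cluster index and a layer index to each of the N data items according to its clustering result and layering result.
Optionally, the first-type data has cluster indexes and layer indexes. The first retrieval unit 501 is further configured to preset indication information for the first retrieval range, where the indication information specifies the clusters and layers included in the first retrieval range, and to determine, according to the query data and the indication information, the cluster indexes of the clusters and the layer indexes of the layers included in the first retrieval range.
It should be understood that the apparatus 500 in this embodiment may be implemented by an application-specific integrated circuit (ASIC) or a programmable logic device (PLD). The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof. When the retrieval method shown in FIG. 2 is implemented in software, the apparatus 500 and its modules may also be software modules.
The apparatus 500 according to this embodiment may correspond to the entity that performs the method described in the embodiments of this application, and the above and other operations and/or functions of the units in the apparatus 500 implement the corresponding procedures of the method in FIG. 2. For brevity, details are not repeated here.
As described above, in the data preparation stage the original data is divided into first-type data, which affects retrieval speed, and second-type data, which affects retrieval precision, and the first-type data is then divided into a plurality of layers. During retrieval, the first retrieval range within the first-type data is determined from the layer index and cluster index of each feature vector, and narrowed-range retrieval is performed on that part of the data. Brute-force retrieval is performed on all of the second-type data. Finally, the results obtained from the first-type data and the second-type data are compared, and the data with the highest similarity to the query data is the final retrieval result. In this process, the second-type data, which affects precision, is identified in the data preparation stage and retrieved in brute-force mode, which guarantees retrieval precision. Meanwhile, the feature vectors that affect retrieval speed are retrieved in narrowed-range mode, which guarantees retrieval speed. Together, these guarantee both the precision and the speed of the whole retrieval process, reduce its time consumption, and improve its efficiency.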
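FIG. 6 is a schematic diagram of another data retrieval apparatus provided in an embodiment of this application. As shown in the figure, the apparatus 600 includes a first processor 601, a memory 602, a communication interface 603 and a bus 604. The first processor 601, the memory 602 and the communication interface 603 communicate through the bus 604, or may communicate by other means such as wireless transmission. The memory 602 is configured to store program code 6021, and the first processor 601 is configured to invoke the program code 6021 stored in the memory 602 to perform the following operations:
determining a first retrieval range in the first-type data, and searching for the query data in the first retrieval range to obtain a first retrieval result, where the data in the first retrieval range is a subset of the first-type data;
searching for the query data in the entire range of the second-type data to obtain a second retrieval result; and
determining the final retrieval result for the query data from the first retrieval result and the second retrieval result.
Optionally, the apparatus 600 may further include a second processor 605. The first processor 601, the memory 602, the communication interface 603 and the second processor 605 communicate through the bus 604, and the second processor 605 is configured to assist the first processor 601 in performing data retrieval tasks.
It should be understood that, in this embodiment, the first processor 601 may be a CPU, or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor.
The second processor 605 may be a dedicated retrieval processor, for example a GPU or a neural processing unit (NPU); it may also be a CPU, another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor.
Optionally, when the second processor is a GPU, the apparatus 600 further includes video memory.
It should be noted that the first processor 601 and the second processor 605 in the apparatus 600 shown in FIG. 6 are only an example; the numbers of first processors 601 and second processors 605, and the number of cores of each processor, do not limit the embodiments of this application.
The memory 602 may be a volatile memory or a non-volatile memory, or may include both. The non-volatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM) or a flash memory. The volatile memory may be a random access memory (RAM), which is used as an external cache. By way of example and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM) and direct rambus RAM (DR RAM).
In addition to a data bus, the bus 604 may include a power bus, a control bus, a status signal bus and the like. For clarity of description, however, all of these buses are labeled as the bus 604 in the figure.
It should be understood that the apparatus 600 according to this embodiment may correspond to the apparatus 500 in the embodiments of this application and to the entity that performs the method in FIG. 2, and that the above and other operations and/or functions of the modules in the apparatus 600 implement the corresponding procedures of the method in FIG. 2. For brevity, details are not repeated here.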
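FIG. 7 is a schematic diagram of a heterogeneous apparatus system provided in an embodiment of this application. As shown in the figure, the heterogeneous apparatus system includes a first apparatus 700 and a second apparatus 800, which communicate over a network; the network includes Ethernet, a fiber-optic network or an InfiniBand (IB) network. The second apparatus 800 is configured to assist the first apparatus 700 in performing data retrieval tasks. The first apparatus 700 includes a first processor 701, a memory 702, a communication interface 703 and a bus 704. The first processor 701, the memory 702 and the communication interface 703 communicate through the bus 704, or may communicate by other means such as wireless transmission. The memory 702 is configured to store program code 7021, and the first processor 701 is configured to invoke the program code 7021 stored in the memory 702 to schedule the second apparatus 800 to assist it in completing data retrieval tasks. The second apparatus 800 includes a first processor 801, a memory 802, a communication interface 803, a bus 804 and a second processor 805. The first processor 801, the memory 802, the communication interface 803 and the second processor 805 communicate through the bus 804, or may communicate by other means such as wireless transmission. The first processor 801 of the second apparatus 800 is configured to receive a data retrieval task from the first apparatus 700 and to instruct the second processor 805 to perform data retrieval. The retrieval method is the same as the method shown in FIG. 2; for brevity, details are not repeated here. The second apparatus 800 may also be called a heterogeneous apparatus.
It should be understood that, in this embodiment, the first processor 701 and the first processor 801 may each be a CPU, or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor.
The second processor 805 may be a dedicated retrieval processor, for example a GPU or a neural processing unit (NPU); it may also be a CPU, another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor.
Optionally, when the second processor 805 is a GPU, the second apparatus further includes video memory (not shown in the figure).
It should be noted that the first processor 701, the first processor 801 and the second processor 805 in the heterogeneous apparatus system shown in FIG. 7 are only an example; the numbers of these processors, and the number of cores of each processor, do not limit the embodiments of this application.
The memory 702 and the memory 802 may each be a volatile memory or a non-volatile memory, or may include both. The non-volatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM) or a flash memory. The volatile memory may be a random access memory (RAM), which is used as an external cache. By way of example and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM) and direct rambus RAM (DR RAM).
In addition to a data bus, the bus 704 and the bus 804 may include a power bus, a control bus, a status signal bus and the like. For clarity of description, however, all of these buses are labeled as the bus 704 or the bus 804 in the figure.
The above embodiments may be implemented wholly or partly by software, hardware, firmware or any combination thereof. When software is used, the embodiments may be implemented wholly or partly in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded or executed on a computer, the procedures or functions described in the embodiments of this application are produced wholly or partly. The computer may be a general-purpose computer, a special-purpose computer, a computer network or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server or data center to another website, computer, server or data center in a wired manner (for example, coaxial cable, optical fiber or digital subscriber line (DSL)) or a wireless manner (for example, infrared, radio or microwave). The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device such as a server or data center that contains one or more sets of usable media. The usable medium may be a magnetic medium (for example, a floppy disk, hard disk or magnetic tape), an optical medium (for example, a DVD) or a semiconductor medium. The semiconductor medium may be a solid state drive (SSD).
The foregoing is merely a specific implementation of the present invention, but the protection scope of the present invention is not limited thereto. Any variation or replacement that a person skilled in the art can readily conceive of within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.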

Claims (16)

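  1. A data retrieval method, wherein a database stores N data items, the N data items in the database are divided into first-type data and second-type data, N ≥ 2, and the method comprises:
    determining a first retrieval range in the first-type data, and searching for to-be-retrieved data in the first retrieval range to obtain a first retrieval result, wherein the data in the first retrieval range is a subset of the first-type data;
    searching for the to-be-retrieved data in the entire range of the second-type data to obtain a second retrieval result; and
    determining a final retrieval result for the to-be-retrieved data from the first retrieval result and the second retrieval result.
  2. The method according to claim 1, wherein the method further comprises:
    dividing the N data items into M clusters according to a clustering algorithm, wherein each data item corresponds to one cluster, each cluster has a center point, the value of each data item has a close similarity to the center point of the cluster to which it belongs, M ≥ 2, and the cluster index of each cluster uniquely identifies the cluster; wherein the second-type data is the set of edge-point data of all clusters, and the first-type data is the set of data in the original database other than the second-type data.
  3. The method according to claim 1 or 2, wherein the method further comprises:
    dividing the first-type data into a plurality of layers according to a preset algorithm, wherein each layer comprises at least one item of first-type data, each item of first-type data belongs to one layer, and the layer index of each layer uniquely identifies the layer.
  4. The method according to claim 3, wherein dividing the N data items into first-type data and second-type data comprises:
    selecting comparison clusters from the M clusters;
    selecting z reference data items from the N data items, 1 ≤ z ≤ N; and
    performing the following data classification processing for each reference data item:
    retrieving to-be-classified data from the database according to the current reference data, wherein the to-be-classified data is data close in similarity to the current reference data; and
    determining whether the to-be-classified data belongs to the comparison clusters, and if so, classifying the to-be-classified data as the first-type data, and if not, classifying the to-be-classified data as the second-type data.
  5. The method according to claim 4, wherein retrieving to-be-classified data from the database, the to-be-classified data being data close in similarity to the current reference data, comprises:
    calculating the similarity between the current reference data and the other N-1 data items, sorting by the calculated similarity, obtaining the m data items with the highest similarity to the current reference data, and using the m data items as the to-be-classified data, 1 ≤ m ≤ N-1; or
    calculating the similarity between the current reference data and the center points of the M clusters, determining the m clusters with the highest similarity to the current reference data, and using the data in the m clusters as the to-be-classified data, 1 ≤ m ≤ N-1.
  6. The method according to claim 4 or 5, further comprising:
    after each of the z data items has served as the current reference data and one round of data classification processing is complete, using each of the z data items as reference data again to perform a next round of data classification processing, wherein the comparison clusters selected within each round of data classification processing are the same, the next round selects more comparison clusters than the previous round, the comparison clusters selected in the previous round are a subset of the comparison clusters selected in the next round, and data classification processing ends when the number of comparison clusters reaches a preset maximum.
  7. The method according to any one of claims 1 to 6, wherein the method further comprises:
    presetting indication information for the first retrieval range, wherein the indication information specifies the clusters and layers comprised in the first retrieval range;
    and determining a first retrieval range in the first-type data comprises:
    determining, according to the to-be-retrieved data and the indication information, the cluster indexes of the clusters and the layer indexes of the layers comprised in the first retrieval range.
  8. A data retrieval apparatus, wherein a database stores N data items, the N data items in the database are divided into first-type data and second-type data, N ≥ 2, and the apparatus comprises a first retrieval unit, a second retrieval unit and a determining unit;
    the first retrieval unit is configured to determine a first retrieval range in the first-type data, the data in the first retrieval range being a subset of the first-type data, and to search for to-be-retrieved data in the first retrieval range to obtain a first retrieval result;
    the second retrieval unit is configured to search for the to-be-retrieved data in the entire range of the second-type data to obtain a second retrieval result; and
    the determining unit is further configured to determine a final retrieval result for the to-be-retrieved data from the first retrieval result obtained by the first retrieval unit and the second retrieval result obtained by the second retrieval unit.
  9. The apparatus according to claim 8, wherein the apparatus further comprises a data classification processing unit,
    the data classification processing unit is configured to divide the N data items into M clusters according to a clustering algorithm, wherein each data item corresponds to one cluster, each cluster has a center point, the value of each data item has a close similarity to the center point of the cluster to which it belongs, M ≥ 2, and the cluster index of each cluster uniquely identifies the cluster; the second-type data is the set of edge-point data of all clusters, and the first-type data is the set of data in the original database other than the second-type data.
  10. The apparatus according to claim 8 or 9, wherein
    the data classification processing unit is further configured to divide the first-type data into a plurality of layers according to a preset algorithm, wherein each layer comprises at least one item of first-type data, each item of first-type data belongs to one layer, and the layer index of each layer uniquely identifies the layer.
  11. The apparatus according to claim 10, wherein
    the data classification processing unit is further configured to: select comparison clusters from the M clusters; select z reference data items from the N data items, 1 ≤ z ≤ N; and perform the following data classification processing for each reference data item: retrieve to-be-classified data from the database according to the current reference data, wherein the to-be-classified data is data close in similarity to the current reference data; and determine whether the to-be-classified data belongs to the comparison clusters, and if so, classify the to-be-classified data as the first-type data, and if not, classify the to-be-classified data as the second-type data.
  12. The apparatus according to claim 11, wherein
    the data classification processing unit is further configured to calculate the similarity between the current reference data and the other N-1 data items, sort by the calculated similarity, obtain the m data items with the highest similarity to the current reference data, and use the m data items as the to-be-classified data, 1 ≤ m ≤ N-1; or to calculate the similarity between the current reference data and the center points of the M clusters, determine the m clusters with the highest similarity to the current reference data, and use the data in the m clusters as the to-be-classified data, 1 ≤ m ≤ N-1.
  13. The apparatus according to claim 11 or 12, wherein
    the data classification processing unit is further configured to, after each of the N data items has served as the current reference data and one round of data classification processing is complete, use each of the z data items as reference data again to perform a next round of data classification processing, wherein the comparison clusters selected within each round of data classification processing are the same, the next round selects more comparison clusters than the previous round, the comparison clusters selected in the previous round are a subset of the comparison clusters selected in the next round, and data classification processing ends when the number of comparison clusters reaches a preset maximum.
  14. The apparatus according to any one of claims 8 to 13, wherein
    the first retrieval unit is further configured to preset indication information for the first retrieval range, wherein the indication information specifies the clusters and layers comprised in the first retrieval range, and to determine, according to the to-be-retrieved data and the indication information, the cluster indexes of the clusters and the layer indexes of the layers comprised in the first retrieval range.
  15. A data retrieval apparatus, wherein the apparatus comprises a processor and a memory, the processor and the memory communicate through a bus, the memory is configured to store computer program instructions, and when the apparatus runs, the processor executes the computer program instructions stored in the memory to perform the operation steps of the method according to any one of claims 1 to 7.
  16. A computer-readable storage medium, wherein the computer-readable storage medium stores instructions that, when run on a computer, cause the computer to perform the operation steps of the method according to any one of claims 1 to 7.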
PCT/CN2019/085483 2018-09-04 2019-05-05 Data retrieval method and apparatus WO2020048145A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP19856658.0A EP3835976A4 (en) 2018-09-04 2019-05-05 DATA RECOVERY PROCESS AND DEVICE
US17/190,188 US11816117B2 (en) 2018-09-04 2021-03-02 Data retrieval method and apparatus

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN201811024553.5 2018-09-04
CN201811024553 2018-09-04
CN201811298840.5A CN110874417B (zh) 2018-09-04 2018-11-02 Data retrieval method and apparatus
CN201811298840.5 2018-11-02

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/190,188 Continuation US11816117B2 (en) 2018-09-04 2021-03-02 Data retrieval method and apparatus

Publications (1)

Publication Number Publication Date
WO2020048145A1 true WO2020048145A1 (zh) 2020-03-12

Family

ID=69716273

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/085483 WO2020048145A1 (zh) 2018-09-04 2019-05-05 Data retrieval method and apparatus

Country Status (4)

Country Link
US (1) US11816117B2 (zh)
EP (1) EP3835976A4 (zh)
CN (1) CN110874417B (zh)
WO (1) WO2020048145A1 (zh)

Also Published As

Publication number Publication date
EP3835976A4 (en) 2021-10-27
US20210182318A1 (en) 2021-06-17
CN110874417B (zh) 2024-04-16
US11816117B2 (en) 2023-11-14
CN110874417A (zh) 2020-03-10
EP3835976A1 (en) 2021-06-16

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19856658

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2019856658

Country of ref document: EP

Effective date: 20210310