Detailed Description of Embodiments
The features and exemplary embodiments of various aspects of this specification are described in detail below. To make the objects, technical solutions, and advantages of this specification clearer, this specification is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are intended only to explain this specification, not to limit it. To those skilled in the art, this specification can be practiced without some of these specific details. The following description of the embodiments is intended only to provide a better understanding of this specification by showing examples of it.
It should be noted that, in this document, relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise", and any other variants are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements that are not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.
In large-scale vector retrieval, the feature vector of a retrieval object can be extracted and the feature similarity between feature vectors calculated; the data objects corresponding to the one or more feature vectors having the highest feature similarity to the feature vector of the retrieval object are then taken as the search result for the retrieval object.
That is, the retrieval object is mapped by feature extraction to a point in a high-dimensional vector space, which converts the similarity search problem for the retrieval object into a nearest-neighbor search problem in the high-dimensional space.
Compared with directly searching the data files of the data objects, vector retrieval based on the feature similarity between feature vectors has lower computational complexity and a smaller computational load, and can therefore improve retrieval efficiency.
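As a non-limiting illustration, the nearest-neighbor search described above can be sketched in Python as follows. The sketch assumes cosine similarity as the similarity measure and non-zero vectors; the function names and the sample identifiers (img_a, img_b, img_c) are hypothetical and are not part of this specification.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_similar(query, library, k):
    """Return the IDs of the k library vectors most similar to the query."""
    scored = [(cosine_similarity(query, vec), obj_id)
              for obj_id, vec in library.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [obj_id for _, obj_id in scored[:k]]

# Hypothetical feature vectors of three data objects.
library = {
    "img_a": [1.0, 0.0, 0.0],
    "img_b": [0.9, 0.1, 0.0],
    "img_c": [0.0, 1.0, 0.0],
}
result = top_k_similar([1.0, 0.05, 0.0], library, k=2)
print(result)
```

This exhaustive comparison touches every stored vector, which is the cost that the clustering and indexing embodiments below are intended to reduce.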
In the embodiments of this specification, data objects may, depending on data type, include text data, image data, audio data, video data, and other types of data objects. For simplicity of description, the embodiments below take image data as an example to illustrate vector retrieval. This description should not be read as limiting the scope or implementation possibilities of the solution; retrieval objects of data types other than images are processed in the same way as image retrieval objects.
In one embodiment of this specification, a feature vector is vector data obtained by mapping a data object into a high-dimensional vector space according to the feature information of the data object. In image search, taking a search for the face of a person in a specified image as an example, the steps of retrieving a face image may include:
First, the facial features of the person in the face image are extracted, such as the length-to-width ratio of the face, the skin color of the face, the hair color, the width of local details (such as the nose and mouth among the facial features), and the distance between the eyes. Second, the extracted facial features are mapped into the high-dimensional vector space to obtain vector data of a specified dimension; the vector data of a face image characterizes the features in that face image and may be referred to as the feature vector of the face image. Then, the extracted feature vector of the face image is compared with the feature vectors in an image library, and the search result for the face image is determined according to the distance or similarity between the feature vectors.
In one or more embodiments of this specification, a central processing unit (Central Processing Unit, CPU) or a graphics processor (Graphics Processing Unit, GPU) may be used to provide computing resources for the vector retrieval process. The CPU, as a very-large-scale integrated circuit, can be used to interpret computer instructions and process data in computer software; the GPU, as the processor of a graphics card, can be used to perform complex data operations and geometric operations.
One or more embodiments of this specification provide a vector retrieval method and a vector retrieval device in which, during vector retrieval using feature vectors, the CPU and the GPU are applied cooperatively to the storage, classification, and retrieval of data objects, making full use of CPU resources and GPU resources and improving retrieval performance and retrieval efficiency.
Fig. 1 shows an architecture diagram of the retrieval system of one embodiment of this specification. As shown in Fig. 1, the retrieval system 100 of an exemplary embodiment of this specification may include: a data storage 101, a CPU 102, a GPU 103, and a proxy server 104.
In one embodiment, the data stored in the data storage 101 may include a vector data set, and the data storage 101 may be suited to the storage and computation demands of large-scale samples, for example on the scale of 10,000,000,000 GB or more, so that large-scale sample data can be analyzed and processed economically and efficiently.
As an example, the data storage 101 may provide a variety of data processing services, may support vector retrieval using Structured Query Language (SQL), and may support read and write operations on the vector data in the vector data set in the data storage 101.
As an example, the vector data set stored in the data storage 101 may include at least one of: feature vectors of image data, feature vectors of voice data, feature vectors of video data, and feature vectors of natural language processing (Natural Language Processing, NLP) data.
In one embodiment, the proxy server 104 is configured to receive a retrieval request and, according to the retrieval object in the retrieval request, to call the CPU and the GPU to retrieve the retrieval object.
With continued reference to Fig. 1, in one embodiment of this specification, the vector retrieval method may include:
Step S110: with reference to a memory parameter and a video memory parameter, split the vector data set in the data storage 101 to obtain a first-part vector subset stored in memory and a second-part vector subset stored in video memory.
Step S120: perform vector retrieval, by the CPU, in the first-part vector subset stored in memory to obtain first-part similar feature vectors of the retrieval object.
Step S130: perform vector retrieval, by the GPU, in the second-part vector subset stored in video memory to obtain second-part similar feature vectors of the retrieval object.
Step S140: determine the search result of the retrieval object according to the first-part similar feature vectors and the second-part similar feature vectors.
In the vector retrieval method of this specification, the CPU and the GPU cooperate to process the vector retrieval of data objects. Making full use of the parallel processing capability of the GPU and of CPU resources improves the computation speed and runtime performance of vector retrieval.
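The splitting of step S110 can be sketched, as an illustration only, as follows. The specification states that the split refers to a memory parameter and a video memory parameter but does not fix a policy; the proportional policy, the function name split_by_capacity, and the parameter bytes_per_vector below are assumptions made for the sketch.

```python
def split_by_capacity(vectors, memory_bytes, video_memory_bytes, bytes_per_vector):
    """Split a vector set into a memory subset and a video memory subset.

    Assumed policy: divide the vectors in proportion to the two capacities,
    then adjust so that neither subset exceeds its own capacity.
    """
    cpu_capacity = memory_bytes // bytes_per_vector
    gpu_capacity = video_memory_bytes // bytes_per_vector
    if len(vectors) > cpu_capacity + gpu_capacity:
        raise ValueError("vector set does not fit in memory plus video memory")

    share = video_memory_bytes / (memory_bytes + video_memory_bytes)
    gpu_count = min(gpu_capacity, round(len(vectors) * share))
    # If the remainder would overflow memory, move more vectors to video memory.
    gpu_count = max(gpu_count, len(vectors) - cpu_capacity)

    memory_subset = vectors[gpu_count:]        # first part, searched by the CPU
    video_memory_subset = vectors[:gpu_count]  # second part, searched by the GPU
    return memory_subset, video_memory_subset

vectors = [[float(i), float(i)] for i in range(10)]
memory_subset, video_memory_subset = split_by_capacity(
    vectors, memory_bytes=600, video_memory_bytes=400, bytes_per_vector=16)
print(len(memory_subset), len(video_memory_subset))
```

In the clustering embodiments described later, the split is made over cluster center points rather than over individual vectors, so that each cluster is kept whole on one side.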
Fig. 2 shows a flow diagram of the vector retrieval method of one embodiment of this specification. As shown in Fig. 2, in one embodiment of this specification, the vector retrieval method 200 may include:
Step S210: perform cluster calculation on the vector set to obtain a specified number of clusters in the vector set and the center point feature vector of each cluster.
In this step, cluster calculation on the vector set can be understood as a data processing method that groups vectors according to the distance or similarity between them; each group of vectors is called a cluster. The cluster calculation yields a specified number of clusters and the center vector of each of those clusters. The center vector of each cluster satisfies the following similarity condition: the similarity between vectors in the same cluster is greater than the similarity between vectors in different clusters, and the distance from a vector to the center vector of its own cluster is less than its distance to the center vector of any other cluster.
In one or more embodiments of this specification, a variety of clustering methods may be used to perform the cluster calculation on the vector set.
In one embodiment, the fast clustering k-means algorithm may be used. In this embodiment, the cluster calculation may include: randomly selecting k vectors as initial center vectors, and assigning each vector to the cluster represented by its nearest center vector; after all vectors have been assigned, recalculating the center vector of each cluster as the mean of all vector data in that cluster, thereby updating the center vector; and, according to the updated center vectors, assigning each vector to the cluster represented by its nearest new center vector. The vector assignment and center point update steps are iterated over the vector set until the change in the center vector of each cluster is less than a distance threshold, or until a maximum number of updates is reached.
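A minimal Python sketch of the k-means procedure just described is given below for illustration. The function names, the fixed random seed, the default threshold values, and the handling of an empty cluster (its previous center is kept) are assumptions of the sketch and are not prescribed by this specification.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_center(vector, centers):
    """Index of the center vector closest to the given vector."""
    return min(range(len(centers)),
               key=lambda c: squared_distance(vector, centers[c]))

def kmeans(vectors, k, max_updates=100, distance_threshold=1e-6, seed=0):
    rng = random.Random(seed)
    # Randomly select k vectors as the initial center vectors.
    centers = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(max_updates):
        # Assign each vector to the cluster of its nearest center vector.
        assignment = [nearest_center(v, centers) for v in vectors]
        # Recompute each center vector as the mean of its cluster.
        new_centers = []
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                new_centers.append(
                    [sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[c])  # assumption: keep old center
        largest_shift = max(squared_distance(old, new) ** 0.5
                            for old, new in zip(centers, new_centers))
        centers = new_centers
        # Stop once every center vector moved less than the threshold.
        if largest_shift < distance_threshold:
            break
    return centers, [nearest_center(v, centers) for v in vectors]

points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
centers, assignment = kmeans(points, k=2)
print(centers, assignment)
```

The returned assignment gives, for each input vector, the index of the cluster to which it belongs; these groups are what the index files of the later steps are built over.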
In another embodiment, a hierarchy-based clustering algorithm, such as the BIRCH algorithm or the CURE algorithm, may be used for the cluster calculation; in yet another embodiment, a density-based clustering algorithm, such as the DBSCAN algorithm or the OPTICS algorithm, may be used. Other clustering methods may also be used for the cluster calculation; the embodiments of this specification impose no specific limitation in this regard.
Step S220: obtain the center point feature vector of each cluster to obtain a center point feature vector set.
Step S230: divide the center point feature vector set to obtain center point set A and center point set B, where center point set A includes one part of the center point feature vectors in the center point feature vector set, and center point set B includes the other part of the center point feature vectors in the center point feature vector set.
In the description of the following embodiments of this specification, the GPU may be used to determine the similar feature vectors of the retrieval object in center point set A, and the CPU may be used to determine the similar feature vectors of the retrieval object in center point set B.
Step S240: for each center point feature vector in center point set A, establish an index file of the center point feature vector, as the center point index file of center point set A.
Step S250: for the cluster to which each center point feature vector in center point set A belongs, establish index data for each feature vector in the cluster other than the center point feature vector, as the vector index file of center point set A.
Step S260: for each center point feature vector in center point set B, establish index data of the center point feature vector, as the first-level index data of center point set B.
Step S270: for the cluster to which each center point feature vector in center point set B belongs, establish index data for each feature vector in the cluster other than the center point feature vector, as the second-level index data of center point set B.
In one or more embodiments of this specification, compared with searching for the similar feature vectors of the retrieval object directly by feature vector similarity, performing vector retrieval through a vector index can markedly speed up retrieval and improve retrieval efficiency.
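For illustration, the division of the center points into set A and set B and the construction of the two levels of index data (steps S230 to S270) might be sketched as follows. The dictionary layout, the function name build_partition_indexes, and the parameter gpu_center_ids are hypothetical; the specification does not prescribe a storage layout.

```python
def build_partition_indexes(centers, assignment, vectors, gpu_center_ids):
    """Build center point and vector index data for sets A and B.

    centers        : list of center point feature vectors (one per cluster)
    assignment     : cluster number of each vector in `vectors`
    gpu_center_ids : cluster numbers placed in center point set A (GPU side);
                     all remaining clusters form center point set B (CPU side)
    """
    index_a = {"centers": {}, "clusters": {}}
    index_b = {"centers": {}, "clusters": {}}

    # First level: one entry per center point feature vector.
    for center_id, center in enumerate(centers):
        target = index_a if center_id in gpu_center_ids else index_b
        target["centers"][center_id] = center
        target["clusters"][center_id] = []

    # Second level: the member vectors of each cluster, keyed by center ID.
    for vector_id, (vector, center_id) in enumerate(zip(vectors, assignment)):
        target = index_a if center_id in gpu_center_ids else index_b
        target["clusters"][center_id].append((vector_id, vector))

    return index_a, index_b

centers = [[0.0, 0.0], [10.0, 10.0], [20.0, 20.0]]
vectors = [[0.0, 1.0], [1.0, 0.0], [10.0, 11.0], [20.0, 19.0]]
assignment = [0, 0, 1, 2]
index_a, index_b = build_partition_indexes(
    centers, assignment, vectors, gpu_center_ids={0, 1})
print(index_a["clusters"], index_b["clusters"])
```

Keeping each cluster entirely within one set means a query only needs the member lists of the clusters whose centers it selects on that side.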
In one embodiment, the center point index file or vector index file of center point set A, and the center point index file or vector index file of center point set B, may each be a direct index file or an inverted index file (Inverted File Index).
In one embodiment, a direct index file may be used to record the correspondence between a feature vector and the identifier (ID) of that feature vector; a feature vector has at least one ID, and the ID of each feature vector is unique.
In one embodiment, an inverted index file may be used to record, for each feature vector group, the correspondence between the center vector of the group and the ID of that center vector, and the correspondence between each feature vector in the group other than the center vector and the ID of that feature vector.
In one or more embodiments of this specification, the inverted index is a relatively efficient vector index approach that is particularly suited to retrieval over massive data. Using an inverted index is only one implementation provided by this specification; in practical applications, any other approach capable of establishing index information may be used as an implementation of the embodiments of this specification.
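The two kinds of index file can be pictured with the following in-memory sketch. The class names DirectIndex and InvertedIndex, their fields, and the sample IDs are hypothetical and only illustrate the correspondences recorded by each file; an actual file format is not specified here.

```python
from dataclasses import dataclass, field

@dataclass
class DirectIndex:
    """Records the correspondence between each unique ID and its feature vector."""
    vectors_by_id: dict = field(default_factory=dict)

    def add(self, vector_id, vector):
        if vector_id in self.vectors_by_id:
            raise ValueError(f"duplicate vector ID: {vector_id}")
        self.vectors_by_id[vector_id] = vector

@dataclass
class InvertedIndex:
    """Records, per feature vector group, the center vector and the member vectors."""
    centers_by_id: dict = field(default_factory=dict)
    members_by_center: dict = field(default_factory=dict)

    def add_group(self, center_id, center_vector):
        self.centers_by_id[center_id] = center_vector
        self.members_by_center.setdefault(center_id, {})

    def add_member(self, center_id, vector_id, vector):
        self.members_by_center[center_id][vector_id] = vector

direct = DirectIndex()
direct.add("v1", [0.1, 0.9])
direct.add("v2", [0.8, 0.2])

inverted = InvertedIndex()
inverted.add_group("c1", [0.0, 1.0])
inverted.add_member("c1", "v1", [0.1, 0.9])
print(direct.vectors_by_id, inverted.members_by_center)
```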
Fig. 3 shows a flow diagram of the vector retrieval method of one embodiment of this specification. As shown in Fig. 3, in one embodiment of this specification, the vector retrieval method 300 may include:
Step S311: in the GPU, using the pre-established center point index file (for example, center point index A), obtain the N1 center point vectors in center point set A that are nearest to the retrieval object, where N1 is a specified integer greater than or equal to 1.
Step S313: for each of the N1 center point vectors, according to the pre-established vector index file of center point set A (for example, vector index A), determine by means of the inverted index, within the cluster to which the center point vector belongs, a specified number of vectors most similar to the retrieval object, as the search result within the cluster to which that center point vector belongs.
In this step, the search results within the clusters to which the center point vectors belong are aggregated to obtain the first-part similar vector search results produced by the GPU. These results are sorted in descending order of the similarity between the retrieval object and each similar vector in the first-part similar vector search results, and the K1 similar vectors ranked highest by similarity are obtained, where K1 is a specified integer greater than or equal to 1.
Through steps S311 to S313 above, the first-part similar feature vector search results of the retrieval object are obtained, by the vector retrieval method executed by the GPU, from the vector subset stored in video memory.
Step S321: in the CPU, using the pre-established center point index file of center point set B (for example, center point index B), determine the N2 center point vectors in center point set B that are nearest to the retrieval object, where N2 is a specified integer greater than or equal to 1.
Step S323: for each of the N2 center point vectors, according to the pre-established vector index file of center point set B (for example, vector index B), determine by means of the inverted index, within the cluster to which the center point vector belongs, a specified number of vectors most similar to the retrieval object, as the search result within the cluster to which that center point vector belongs.
In this step, the search results within the clusters to which the center point vectors belong are aggregated to obtain the second-part similar vector search results produced by the CPU. These results are sorted in descending order of the similarity between the retrieval object and each similar vector in the second-part similar vector search results, and the K2 similar vectors ranked highest by similarity are obtained, where K2 is a specified integer greater than or equal to 1.
Through steps S321 to S323 above, the second-part similar feature vector search results of the retrieval object are obtained, by the vector retrieval method executed by the CPU, from the vector subset stored in memory.
In the embodiments of this specification, N2 may be the same as or different from N1 above, and K2 may be the same as or different from K1 above.
Step S330: sort the first-part similar vector search results produced by the GPU together with the second-part similar vector search results produced by the CPU in descending order of similarity to the retrieval object, and extract a specified number of the top-ranked vectors as the search result of the retrieval object.
According to the vector retrieval method of this specification, the CPU and the GPU are applied cooperatively in the vector retrieval of data objects, so that CPU resources and GPU resources are both used and retrieval performance and retrieval efficiency are improved.
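The retrieval flow of steps S311 to S330 can be sketched as follows, under stated assumptions. The sketch uses squared Euclidean distance as the (inverse) similarity measure, runs both partitions on the CPU for simplicity, and uses a hypothetical dictionary layout with "centers" and "clusters" entries; none of these choices is mandated by this specification.

```python
import heapq

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def search_partition(query, index, n_probe, k):
    """Search one partition: pick the n_probe nearest centers, then scan their clusters.

    Returns up to k (squared_distance, vector_id) pairs, nearest first.
    """
    nearest_centers = heapq.nsmallest(
        n_probe, index["centers"],
        key=lambda cid: squared_distance(query, index["centers"][cid]))
    candidates = [(squared_distance(query, vec), vid)
                  for cid in nearest_centers
                  for vid, vec in index["clusters"][cid]]
    return heapq.nsmallest(k, candidates)

def merge_results(first_part, second_part, k):
    """Combine both partial results and keep the k most similar vectors."""
    return heapq.nsmallest(k, first_part + second_part)

# Hypothetical index data for center point set A (GPU side) and set B (CPU side).
index_a = {"centers": {0: [0.0, 0.0]},
           "clusters": {0: [(0, [0.0, 0.0]), (1, [0.0, 1.0])]}}
index_b = {"centers": {1: [10.0, 10.0]},
           "clusters": {1: [(2, [10.0, 10.0]), (3, [9.0, 9.0])]}}

query = [1.0, 1.0]
first_part = search_partition(query, index_a, n_probe=1, k=2)   # N1 = 1, K1 = 2
second_part = search_partition(query, index_b, n_probe=1, k=2)  # N2 = 1, K2 = 2
final = merge_results(first_part, second_part, k=3)
print([vector_id for _, vector_id in final])
```

Because each side already returns its own top-ranked candidates, the final merge only has to rank K1 + K2 entries rather than the full vector set.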
Fig. 4 shows a flow diagram of the vector retrieval method of another embodiment of this specification. As shown in Fig. 4, in one embodiment, the vector retrieval method 400 may include:
Step S410: extract the feature vector of a retrieval object.
Step S420: obtain, from the graphics processor GPU, vectors similar to the feature vector, as first-part similar vectors.
Step S430: obtain, from the central processing unit CPU, vectors similar to the feature vector, as second-part similar vectors.
Step S440: determine the search result of the retrieval object according to the first-part similar vectors and the second-part similar vectors.
In one embodiment, step S420 may specifically include:
Step S422: select one or more first center vectors from the GPU, where each of the one or more first center vectors and the feature vector satisfy a first preset similarity condition.
Step S424: in the vector group in which each first center vector is located, select the first-part similar vectors, where the first-part similar vectors and the feature vector satisfy a second preset similarity condition.
In one embodiment, the one or more first center vectors are vectors obtained by searching pre-established first inverted index data, and the first inverted index data includes the index address of each first center vector.
In one embodiment, the first-part similar vectors are vectors obtained by searching pre-established first index data; the first index data includes the index addresses of the vectors in the vector group in which each first center vector is located, where the index addresses of the vectors within each vector group differ from one another.
In one embodiment, step S430 may specifically include:
Step S432: select one or more second center vectors from the CPU, where each of the one or more second center vectors and the feature vector satisfy a third preset similarity condition.
Step S434: in the vector group in which each second center vector is located, select the second-part similar vectors, where the second-part similar vectors and the feature vector satisfy a fourth preset similarity condition.
In one embodiment, the one or more second center vectors are vectors obtained by searching pre-established second inverted index data, and the second inverted index data includes the index address of each second center vector.
In one embodiment, the second-part similar vectors are vectors obtained by searching pre-established second index data; the second index data includes the index addresses of the vectors in the vector group in which each second center vector is located, where the index addresses of the vectors within each vector group differ from one another.
In one embodiment, the search result of the retrieval object includes a specified number of vectors selected, in descending order of similarity, from the first-part similar vectors and the second-part similar vectors.
According to the vector retrieval method of this specification, the CPU and the GPU are applied cooperatively in the vector retrieval of data objects, so that CPU resources and GPU resources are both used and retrieval performance and retrieval efficiency are improved.
In one embodiment, the first-part similar vectors are vectors similar to the feature vector that are obtained from a first vector subset pre-loaded by the GPU; the second-part similar vectors are vectors similar to the feature vector that are obtained from a second vector subset pre-loaded by the CPU; and the first vector subset and the second vector subset are obtained by splitting a specified feature vector set.
In one embodiment, the splitting of the specified feature vector set is performed based on the size of the memory and the size of the video memory.
According to the vector retrieval method of this specification, the vector set is split in advance and stored in memory and video memory; subsequently, for a retrieval object, the CPU searches the vectors stored in memory and the GPU searches the vectors stored in video memory, so that CPU resources and GPU resources are both used and retrieval performance and retrieval efficiency are improved.
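One possible way to run the two searches side by side is sketched below for illustration. Issuing the searches concurrently with a thread pool is an assumption of the sketch, not a requirement of this specification; both workers here run ordinary Python on the CPU, and the worker labelled as the GPU side merely stands in for a call that would dispatch work to a GPU. All names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
import heapq

def search_subset(query, subset, k):
    """Exhaustively search one pre-loaded subset.

    Returns up to k (squared_distance, vector_id) pairs, nearest first.
    """
    scored = ((sum((q - x) ** 2 for q, x in zip(query, vec)), vid)
              for vid, vec in subset.items())
    return heapq.nsmallest(k, scored)

def retrieve(query, video_memory_subset, memory_subset, k):
    """Search both subsets concurrently, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Stand-in for the GPU search over the subset held in video memory.
        gpu_future = pool.submit(search_subset, query, video_memory_subset, k)
        # CPU search over the subset held in memory.
        cpu_future = pool.submit(search_subset, query, memory_subset, k)
        first_part = gpu_future.result()
        second_part = cpu_future.result()
    merged = heapq.nsmallest(k, first_part + second_part)
    return [vector_id for _, vector_id in merged]

video_memory_subset = {0: [0.0, 0.0], 1: [5.0, 5.0]}
memory_subset = {2: [1.0, 1.0], 3: [9.0, 9.0]}
top = retrieve([0.0, 0.0], video_memory_subset, memory_subset, k=2)
print(top)
```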
Fig. 5 shows a schematic diagram of the vector retrieval device according to one embodiment of this specification. As shown in Fig. 5, in one embodiment, the vector retrieval device 500 may include:
A vector extraction module 510, configured to extract the feature vector of a retrieval object.
A first retrieval module 520, configured to obtain, from the graphics processor GPU, vectors similar to the feature vector, as first-part similar vectors.
A second retrieval module 530, configured to obtain, from the central processing unit CPU, vectors similar to the feature vector, as second-part similar vectors.
A result determining module 540, configured to determine the search result of the retrieval object according to the first-part similar vectors and the second-part similar vectors.
In one embodiment, the first retrieval module 520 is specifically configured to:
select one or more first center vectors from the GPU, where each of the one or more first center vectors and the feature vector satisfy a first preset similarity condition; and
in the vector group in which each first center vector is located, select the first-part similar vectors, where the first-part similar vectors and the feature vector satisfy a second preset similarity condition.
In one embodiment, the one or more first center vectors are vectors obtained by searching pre-established first inverted index data, and the first inverted index data includes the index address of each first center vector.
In one embodiment, the first-part similar vectors are vectors obtained by searching pre-established first index data; the first index data includes the index addresses of the vectors in the vector group in which each first center vector is located, where the index addresses of the vectors within each vector group differ from one another.
In one embodiment, the second retrieval module 530 is specifically configured to:
select one or more second center vectors from the CPU, where each of the one or more second center vectors and the feature vector satisfy a third preset similarity condition; and
in the vector group in which each second center vector is located, select the second-part similar vectors, where the second-part similar vectors and the feature vector satisfy a fourth preset similarity condition.
In one embodiment, the one or more second center vectors are vectors obtained by searching pre-established second inverted index data, and the second inverted index data includes the index address of each second center vector.
In one embodiment, the second-part similar vectors are vectors obtained by searching pre-established second index data; the second index data includes the index addresses of the vectors in the vector group in which each second center vector is located, where the index addresses of the vectors within each vector group differ from one another.
In one embodiment, the search result of the retrieval object includes a specified number of vectors selected, in descending order of similarity, from the first-part similar vectors and the second-part similar vectors.
In one embodiment, the first-part similar vectors are vectors similar to the feature vector that are obtained from a first vector subset pre-loaded by the GPU; the second-part similar vectors are vectors similar to the feature vector that are obtained from a second vector subset pre-loaded by the CPU; and the first vector subset and the second vector subset are obtained by splitting a specified feature vector set.
In one embodiment, the splitting of the specified feature vector set is performed based on the size of the memory and the size of the video memory.
According to the vector retrieval method of this specification, the CPU and the GPU cooperate to process the vector retrieval of data objects, so that CPU resources and GPU resources are both used and retrieval performance and retrieval efficiency are improved. In addition, the vector set can be split in advance and stored in memory and video memory; subsequently, for a retrieval object, the CPU searches the vectors stored in memory and the GPU searches the vectors stored in video memory.
Fig. 6 is a structural diagram showing an exemplary hardware architecture of a computing device capable of implementing the embodiments of this specification. As shown in Fig. 6, the computing device 600 includes an input device 601, an input interface 602, a central processing unit 603, a memory 604, an output interface 605, and an output device 606. The input interface 602, the central processing unit 603, the memory 604, and the output interface 605 are connected to one another through a bus 610; the input device 601 and the output device 606 are connected to the bus 610 through the input interface 602 and the output interface 605, respectively, and thereby to the other components of the computing device 600. Specifically, the input device 601 receives input information from outside and transmits the input information to the central processing unit 603 through the input interface 602; the central processing unit 603 processes the input information based on computer-executable instructions stored in the memory 604 to generate output information, stores the output information temporarily or permanently in the memory 604, and then transmits the output information to the output device 606 through the output interface 605; the output device 606 outputs the output information to the outside of the computing device 600 for use by a user.
In one embodiment, the computing device 600 shown in Fig. 6 may be implemented as a vector retrieval apparatus, which may include: a memory configured to store a program; and a processor configured to run the program stored in the memory so as to execute the vector retrieval method described in the above embodiments.
The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment. Those of ordinary skill in the art can understand and implement the embodiments without creative effort.
Specific embodiments of this specification have been described above. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims can be performed in an order different from that in the embodiments and still achieve the desired results. In addition, the processes depicted in the drawings do not necessarily require the particular order shown, or a sequential order, to achieve the desired results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.