CN109783547B

CN109783547B - Similarity connection query method and device

Info

Publication number: CN109783547B
Application number: CN201910130094.7A
Authority: CN
Inventors: 马友忠; 张瑞玲; 林春杰; 李莹
Original assignee: Luoyang Normal University
Current assignee: Luoyang Normal University
Priority date: 2019-02-21
Filing date: 2019-02-21
Publication date: 2020-08-21
Anticipated expiration: 2039-02-21
Also published as: CN109783547A

Abstract

A similarity connection query method and device, relating to the field of data processing. When a similarity connection query is carried out, an original vector set for the similarity connection query, the number of vector pairs of the similarity connection query result and an initial data set of the similarity connection query result are obtained. The original vector set is then subjected to grouping processing to obtain a plurality of sub-vector grouping sets, and a similarity distribution histogram of the original vector set is constructed. A vector distance threshold is calculated according to the similarity distribution histogram and the number of result vectors. Finally, the initial result vector pair set is updated according to the plurality of sub-vector grouping sets, the vector distance threshold and the number of result vectors to obtain a result vector pair set for representing the similarity connection query result.

Description

Similarity connection query method and device

Technical Field

The present application relates to the field of data processing, and in particular, to a method and an apparatus for similarity connection query.

Background

A similarity connection query (also called a similarity join) finds, in a massive high-dimensional data set, the data pairs whose similarity is greater than or equal to a given similarity threshold or whose distance is less than or equal to a given distance threshold. It has important applications in many fields, such as image clustering, duplicate web page detection and similar user recommendation. At present, a similarity connection query can be performed by a threshold-based method, in which the threshold must be chosen manually in advance, according to the distribution of distances between the vectors in the vector set to be queried, so that a preset number of data pairs is finally obtained. In practice, however, this means presetting a threshold manually, modifying it repeatedly, and re-executing the query with each new threshold until the preset number of data pairs is obtained. The repeated executions bring a large amount of redundant and repetitive calculation, so the similarity connection query efficiency is low.

Disclosure of Invention

The embodiment of the application aims to provide a similarity connection query method and a similarity connection query device, which do not need to manually preset a threshold value, can reduce a large amount of redundant calculation, and further improve the similarity connection query efficiency.

A first aspect of the embodiments of the present application provides a method for querying similarity connection, including:

acquiring an original vector set to be queried, the quantity of result vectors and an initial result vector pair set; the original vector set is a data set for similarity connection query, the initial result vector pair set is an initial data set for similarity connection query results, and the result vector quantity represents the vector pair quantity of the similarity connection query results;

grouping the original vector set to obtain a plurality of sub-vector grouping sets;

constructing a similarity distribution histogram of the original vector set according to the number of the result vectors and the plurality of sub-vector grouping sets;

calculating a vector distance threshold according to the similarity distribution histogram and the quantity of the result vectors;

and updating the initial result vector pair set according to the plurality of sub-vector grouping sets, the vector distance threshold and the result vector quantity to obtain a result vector pair set for representing similarity connection query results.

In the implementation process, when a similarity connection query is carried out, an original vector set for the similarity connection query, the number of vector pairs of the similarity connection query result and an initial data set of the similarity connection query result are first obtained. The original vector set is then subjected to grouping processing to obtain a plurality of sub-vector grouping sets, and a similarity distribution histogram of the original vector set is constructed. Further, a vector distance threshold is calculated according to the similarity distribution histogram and the number of result vectors. Finally, the initial result vector pair set is updated according to the plurality of sub-vector grouping sets, the vector distance threshold and the number of result vectors to obtain a result vector pair set for representing the similarity connection query result. In this way, the vector distance threshold does not need to be preset manually, and the similarity connection query method does not need to be executed multiple times with manually changed vector distance thresholds, so that a large amount of redundant calculation can be reduced and the similarity connection query efficiency can be improved.

Further, constructing a similarity distribution histogram of the original vector set according to the number of the result vectors and the plurality of sub-vector grouping sets includes:

acquiring a plurality of distance vector intervals according to a preset interval division rule and the original vector set;

grouping the original vector set to obtain a plurality of sub-vector sets;

calculating an overall similarity result estimation value corresponding to each distance vector interval according to the plurality of sub-vector sets;

and constructing a similarity distribution histogram of the original vector set according to the overall similarity result estimation value corresponding to each distance vector interval.

In the implementation process, a plurality of distance vector intervals can be obtained according to a preset interval division rule and the original vector set; the original vector set is then subjected to grouping processing, an overall similarity result estimation value corresponding to each distance vector interval is calculated, and finally a similarity distribution histogram of the original vector set can be constructed. The abscissa of the similarity distribution histogram represents the distance vector intervals, and the ordinate represents the overall similarity result estimation value corresponding to each distance vector interval.

Further, the obtaining a plurality of distance vector intervals according to a preset interval division rule and the original vector set includes:

pairing and combining the original vectors in the original vector set to obtain a plurality of original vector pairs;

calculating the vector distance between two original vectors in each original vector pair to obtain a vector distance set;

selecting the maximum vector distance from the vector distance set as an interval upper limit value of the to-be-divided intervals;

and dividing the interval to be divided into a plurality of subintervals according to the interval upper limit value, the preset interval lower limit value and the preset interval division rule, wherein the plurality of subintervals are a plurality of distance vector intervals.

In the implementation process, the original vectors in the original vector set are combined pairwise to obtain a plurality of original vector pairs, and the vector distance of each original vector pair is calculated to obtain a vector distance set. The maximum vector distance in the vector distance set is used as the interval upper limit value of the interval to be divided, so the interval to be divided is determined by the preset interval lower limit value and the interval upper limit value. Finally, the interval to be divided is divided into a plurality of subintervals according to the preset interval division rule, and the obtained subintervals are the plurality of distance vector intervals. Different original vector sets thus yield different distance vector intervals, so the constructed similarity distribution histogram matches the original vector set better, which improves the accuracy of the similarity connection query method.
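The interval construction described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes Euclidean distance, an equal-width division rule, and a lower limit of 0, and the names `euclidean` and `distance_intervals` are hypothetical.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_intervals(vectors, num_intervals, lower=0.0):
    """Split [lower, max pairwise distance] into equal-width subintervals.

    Assumption: the preset interval division rule is equal-width division.
    """
    # Pair up all original vectors and compute the vector distance set.
    distances = [euclidean(a, b) for a, b in combinations(vectors, 2)]
    # The maximum distance is the upper limit of the interval to be divided.
    upper = max(distances)
    width = (upper - lower) / num_intervals
    bounds = [lower + i * width for i in range(num_intervals)] + [upper]
    return list(zip(bounds[:-1], bounds[1:]))


vectors = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
intervals = distance_intervals(vectors, num_intervals=2)
print(intervals)
```

Computing every pairwise distance is quadratic in the number of vectors, so on a large data set the upper limit would more plausibly be obtained from a sample; that variation is not shown here.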

Further, calculating an overall similarity result estimation value corresponding to each distance vector interval according to the plurality of sub-vector sets includes:

sampling each sub-vector set to obtain a sampling vector set corresponding to each sub-vector set;

determining at least one sub vector set corresponding to each distance vector interval;

calculating an intra-group result quantity estimation value and an inter-group result quantity estimation value corresponding to each distance vector interval according to at least one sub-vector set corresponding to each distance vector interval;

calculating an overall similarity result estimation value corresponding to each distance vector interval according to the intra-group result quantity estimation value corresponding to each distance vector interval and the inter-group result quantity estimation value corresponding to each distance vector interval; wherein the overall similarity result estimation value is the overall similar result quantity estimated for the at least one sub-vector set corresponding to the distance vector interval.

In the implementation process, the intra-group result quantity estimation value corresponding to each sub-vector set is calculated, then the plurality of sub-vector sets are combined pairwise to obtain a plurality of sub-vector set pairs, then the inter-group result quantity estimation value corresponding to each sub-vector set pair is calculated, finally, at least one sub-vector set corresponding to each distance vector interval is determined, and the overall similar result quantity estimation value of the at least one sub-vector set is calculated.
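A sampling-based estimate of this kind might look like the sketch below. The text does not give the estimator, so the scaling used here (sampled pair counts multiplied by the ratio of total pairs to sampled pairs) is an assumption, as are the function names and the use of Euclidean distance.

```python
import math
import random
from itertools import combinations, product


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def estimate_interval_counts(groups, intervals, sample_size, seed=0):
    """Estimate, per distance interval, how many vector pairs fall into it.

    Assumption: sampled counts are scaled up by (total pairs / sampled pairs).
    """
    rng = random.Random(seed)
    samples = [rng.sample(g, min(sample_size, len(g))) for g in groups]
    counts = [0.0] * len(intervals)

    def add(pairs, scale):
        for a, b in pairs:
            d = euclidean(a, b)
            for idx, (lo, hi) in enumerate(intervals):
                is_last = idx == len(intervals) - 1
                # Half-open intervals, except that the last one is closed.
                if lo <= d < hi or (is_last and d == hi):
                    counts[idx] += scale
                    break

    # Intra-group estimate: pairs drawn from within one sub-vector set.
    for g, s in zip(groups, samples):
        if len(s) >= 2:
            total = len(g) * (len(g) - 1) / 2
            sampled = len(s) * (len(s) - 1) / 2
            add(combinations(s, 2), total / sampled)

    # Inter-group estimate: pairs drawn across two sub-vector sets.
    for (g1, s1), (g2, s2) in combinations(list(zip(groups, samples)), 2):
        if s1 and s2:
            scale = (len(g1) * len(g2)) / (len(s1) * len(s2))
            add(product(s1, s2), scale)

    return counts


groups = [[(0.0, 0.0), (1.0, 0.0)], [(10.0, 0.0)]]
intervals = [(0.0, 5.0), (5.0, 10.0)]
counts = estimate_interval_counts(groups, intervals, sample_size=2)
print(counts)
```

When `sample_size` is at least as large as every group, as in the usage lines, the estimate reduces to an exact count, which makes the sketch easy to check by hand.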

Further, grouping the original vector set to obtain a plurality of sub-vector grouping sets, including:

acquiring original vector dimensions of the original vector set;

performing dimensionality reduction processing on the original vector set according to the original vector dimensionality and a preset dimensionality reduced dimensionality to obtain a dimensionality reduced vector set;

performing symbolization processing on the dimensionality reduction vector set to obtain a character string set;

grouping the original vector set according to the character string set to obtain a plurality of sub-vector sets; and one character string in the character string set corresponds to one sub vector set.

In the implementation process, the original vector set is subjected to dimensionality reduction processing to obtain a dimensionality reduction vector set, and the dimensionality reduction vector set is then symbolized to obtain a character string set. Because different vectors may yield the same character string after symbolization, the original vector set is grouped according to the character string set to obtain a plurality of sub-vector sets; each sub-vector set contains at least one original vector, and all original vectors in the same sub-vector set correspond to the same character string.

Further, updating the initial result vector pair set according to the plurality of sub-vector grouping sets, the vector distance threshold and the result vector number to obtain a result vector pair set for representing a similarity connection query result includes:

judging whether the number of vector pairs in the initial result vector pair set is smaller than the number of result vectors;

if the number of vector pairs in the initial result vector pair set is less than the number of result vectors, updating the initial result vector pair set according to the character string set and the plurality of sub-vector sets until the number of vector pairs in the updated initial result vector pair set is not less than the number of result vectors, and obtaining a final updated initial result vector pair set;

and extracting the vector pairs of the result vector quantity from the final updated initial result vector pair set, and combining to obtain a result vector pair set for representing similarity connection query results.

In the implementation process, after the original vector set is subjected to grouping processing to obtain a plurality of sub-vector grouping sets, continuously updating the initial result vector pair set according to the character string set and the plurality of sub-vector sets until the number of vector pairs in the updated initial result vector pair set is not less than the number of result vectors, and obtaining a final updated initial result vector pair set; and then extracting the vector pairs of the number of the result vectors from the finally updated initial result vector pair set to form a result vector pair set for representing similarity connection query results.

Further, updating the initial result vector pair set according to the character string set and the plurality of sub-vector sets until the number of vector pairs in the updated initial result vector pair set is not less than the number of result vectors, and obtaining a final updated initial result vector pair set, including:

step 1, determining a plurality of character string pairs that meet the current character string condition according to the character string set to obtain a candidate character string pair set, and determining the sub-vector set pair corresponding to each character string pair in the candidate character string pair set according to the plurality of sub-vector sets;

step 2, determining at least one vector pair that meets the current similar vector pair condition according to the sub-vector set pair corresponding to each character string pair;

step 3, updating the initial result vector pair set according to the at least one vector pair to obtain an updated initial result vector pair set, and respectively updating the character string condition and the similar vector pair condition according to the updated initial result vector pair set to obtain an updated character string condition and an updated similar vector pair condition;

step 4, repeatedly executing steps 1 to 3 until the number of vector pairs in the updated initial result vector pair set is not less than the number of result vectors, and taking the updated initial result vector pair set obtained by executing step 3 for the last time as the final updated initial result vector pair set.

In the implementation process, the steps 1 to 3 are continuously circulated, and in each circulation, the character string condition and the similar vector pair condition are respectively updated according to the updated initial result vector pair set until the number of the vector pairs in the obtained initial result vector pair set is not less than the number of the result vectors.
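The loop of steps 1 to 4 can be illustrated with the sketch below. The text does not define the character string condition or the similar vector pair condition, so both are hypothetical stand-ins here: the string condition is "the two words differ in exactly `gap` positions", relaxed by one position per round, and the vector pair condition is "Euclidean distance not greater than a fixed threshold". The sketch also does not tighten the vector pair condition between rounds.

```python
import heapq
import math
from itertools import combinations, product


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def word_gap(w1, w2):
    """Hypothetical string condition: number of positions that differ."""
    return sum(c1 != c2 for c1, c2 in zip(w1, w2))


def top_k_pairs(groups, threshold, k):
    """groups maps each character string to its sub-vector set."""
    results = []  # the initial result vector pair set is empty
    words = sorted(groups)
    max_gap = len(words[0]) if words else 0
    gap = 0
    while len(results) < k and gap <= max_gap:
        # Step 1: candidate string pairs under the current string condition.
        if gap == 0:
            candidates = [(w, w) for w in words]
        else:
            candidates = [(w1, w2) for w1, w2 in combinations(words, 2)
                          if word_gap(w1, w2) == gap]
        # Step 2: vector pairs meeting the similar vector pair condition.
        for w1, w2 in candidates:
            if w1 == w2:
                pairs = combinations(groups[w1], 2)
            else:
                pairs = product(groups[w1], groups[w2])
            for a, b in pairs:
                d = euclidean(a, b)
                if d <= threshold:
                    results.append((d, a, b))
        # Step 3: relax the string condition for the next round.
        gap += 1
    # Extract the k closest pairs from the final result set.
    return heapq.nsmallest(k, results)


groups = {
    "ab": [(-1.0, 1.0), (-1.5, 1.5), (-2.0, 2.0)],
    "ba": [(1.0, -1.0)],
    "bb": [(1.5, 2.0), (2.0, 1.0)],
}
result = top_k_pairs(groups, threshold=1.0, k=2)
print(result)
```

Starting with pairs inside the same group reflects the observation that vectors sharing a character string are the most likely to be similar, so the cheapest candidates are examined first.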

Further, calculating a vector distance threshold according to the similarity distribution histogram and the number of the result vectors, including:

determining vector intervals corresponding to the number of the result vectors from a plurality of distance vector intervals as result vector intervals;

and determining a linear equation corresponding to the result vector interval, and calculating a vector distance threshold according to the linear equation.

In the implementation process, the vector distance threshold can be obtained by calculation according to the similarity distribution histogram of the original vector set and the quantity of the result vectors without manual presetting, so that the similarity connection query method is not required to be executed for multiple times by manually and continuously changing the vector distance threshold, a large amount of redundant calculation can be reduced, and the similarity connection query efficiency is improved.

Further, the equation of the straight line is:

wherein $x_0, x_1, x_2, \ldots, x_{i-1}, x_i, \ldots, x_m$ are the interval boundaries of the distance vector intervals, $k$ is the number of result vectors, $x_k$ is the vector distance threshold, $(x_{i-1}, x_i)$ is the result vector interval, and $(y_{i-1}, y_i)$ is the similar result quantity interval corresponding to the result vector interval;

wherein $k \in (y_{i-1}, y_i)$.

In the above implementation, $x_0, x_1, x_2, \ldots, x_{i-1}, x_i, \ldots, x_m$ are the interval boundaries, so the distance vector intervals are $(x_0, x_1), (x_1, x_2), \ldots, (x_{i-1}, x_i), \ldots, (x_{m-1}, x_m)$, and in the similarity distribution histogram each distance vector interval corresponds to one similar result quantity interval.
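One plausible reading of the straight-line step, offered as an assumption, is linear interpolation inside the result vector interval: if the line passes through $(x_{i-1}, y_{i-1})$ and $(x_i, y_i)$, then $x_k = x_{i-1} + (k - y_{i-1})(x_i - x_{i-1}) / (y_i - y_{i-1})$. The sketch below applies that reading. It further assumes that each $y_i$ is the cumulative estimated number of pairs with distance at most $x_i$, and the function name `distance_threshold` is hypothetical.

```python
def distance_threshold(bounds, cumulative, k):
    """Interpolate the distance at which about k result pairs are expected.

    bounds:     interval boundaries x_0 .. x_m (increasing)
    cumulative: estimated pair counts y_0 .. y_m, where y_i counts pairs
                with distance <= x_i (assumed non-decreasing)
    """
    for i in range(1, len(bounds)):
        y_lo, y_hi = cumulative[i - 1], cumulative[i]
        if y_lo <= k <= y_hi and y_hi > y_lo:
            x_lo, x_hi = bounds[i - 1], bounds[i]
            # Straight line through (x_lo, y_lo) and (x_hi, y_hi).
            return x_lo + (k - y_lo) * (x_hi - x_lo) / (y_hi - y_lo)
    # k exceeds every estimate: fall back to the largest distance.
    return bounds[-1]


threshold = distance_threshold([0.0, 5.0, 10.0], [0.0, 10.0, 30.0], k=20)
print(threshold)
```

In the usage line, $k = 20$ falls in the quantity interval $(10, 30)$, so the threshold is taken halfway through the distance interval $(5, 10)$.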

A second aspect of the embodiments of the present application provides an apparatus for querying similarity connection, including:

the first acquisition module is used for acquiring an original vector set to be queried, the quantity of result vectors and an initial result vector pair set; the original vector set is a data set for similarity connection query, the initial result vector pair set is an initial data set for similarity connection query results, and the result vector quantity represents the vector pair quantity of the similarity connection query results;

the grouping module is used for grouping the original vector set to obtain a plurality of sub-vector grouping sets;

the construction module is used for constructing a similarity distribution histogram of the original vector set according to the number of the result vectors and the plurality of sub-vector grouping sets;

the calculation module is used for calculating a vector distance threshold according to the similarity distribution histogram and the quantity of the result vectors;

and the second acquisition module is used for updating the initial result vector pair set according to the plurality of sub-vector grouping sets, the vector distance threshold and the result vector quantity to obtain a result vector pair set used for representing the similarity connection query result.

In the implementation process, when a similarity connection query is performed, the first acquisition module first obtains an original vector set for the similarity connection query, the number of vector pairs of the similarity connection query result and an initial data set of the similarity connection query result. The grouping module then performs grouping processing on the original vector set to obtain a plurality of sub-vector grouping sets, and the construction module constructs a similarity distribution histogram of the original vector set. Further, the calculation module calculates a vector distance threshold according to the similarity distribution histogram and the number of result vectors. Finally, the second acquisition module updates the initial result vector pair set according to the plurality of sub-vector grouping sets, the vector distance threshold and the number of result vectors to obtain a result vector pair set for representing the similarity connection query result. In this way, the vector distance threshold does not need to be preset manually, and the similarity connection query method does not need to be executed multiple times with manually changed vector distance thresholds, so that a large amount of redundant calculation can be reduced and the similarity connection query efficiency can be improved.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and that those skilled in the art can also obtain other related drawings based on the drawings without inventive efforts.

Fig. 1 is a schematic block diagram of a flow of a similarity connection query method provided in embodiment 1 of the present application;

Fig. 2 is a schematic block diagram of a flow of a similarity connection query method provided in embodiment 2 of the present application;

Fig. 3 is a schematic block diagram of a structure of a similarity connection query apparatus according to embodiment 3 of the present application;

Fig. 4 is a similarity distribution histogram of an original vector set provided in embodiment 2 of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.

It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance.

Example 1

Referring to fig. 1, fig. 1 is a schematic block diagram illustrating a flow of a similarity connection query method according to an embodiment of the present application. As shown in fig. 1, the similarity connection query method includes:

S101, acquiring an original vector set to be queried, a result vector quantity and an initial result vector pair set.

In the embodiment of the application, the original vector set is a data set for similarity connection query, the initial result vector pair set is an initial data set for similarity connection query results, and the number of result vectors represents the number of vector pairs for similarity connection query results.

Similarity connection query has important application in many fields, such as image clustering, repeated web page detection, similar user recommendation and the like. Correspondingly, the original vector set to be queried may be an original image data set, an original web page data set, an original user data set, and the like, which is not limited in this embodiment.

In the embodiment of the present application, the number of result vectors indicates the number of results of the similarity connection query. In practical applications, it indicates, for example, the number of similar image pairs in the original image data set, the number of duplicate web page detection results in the original web page data set, or the number of similar user pairs in the original user data set.

In the embodiment of the present application, the set of initial result vector pairs may be an empty set, i.e., a set without any elements.

S102, grouping the original vector set to obtain a plurality of sub-vector grouping sets.

In the embodiment of the present application, the original vector set is a high-dimensional vector set, and the original vectors it contains are high-dimensional vectors. The original vector set may be subjected to dimensionality reduction processing by using Piecewise Aggregate Approximation (PAA) to obtain a dimensionality reduction vector set, and the dimensionality reduction vector set may then be symbolized by using Symbolic Aggregate approXimation (SAX) to obtain a character string set. Different original vectors may have the same character string, so the original vectors in the original vector set can be grouped by their SAX character strings to obtain a plurality of sub-vector grouping sets; each sub-vector grouping set contains at least one original vector, and all original vectors in the same sub-vector grouping set correspond to the same character string.

S103, constructing a similarity distribution histogram of the original vector set according to the number of the result vectors and the plurality of sub-vector grouping sets.

S104, calculating a vector distance threshold according to the similarity distribution histogram and the number of the result vectors.

In the prior art, the threshold-based similarity connection query method used in practice needs a manually preset threshold (namely the vector distance threshold). If the preset threshold is too large, a lot of redundant calculation is generated; if it is too small, the number of returned results does not meet the requirement and the calculation has to be performed again. In the similarity connection query method described in the embodiment of the present application, the vector distance threshold does not need to be preset manually; it is calculated from the original vector set, and the vector distance thresholds calculated for different original vector sets may differ, which reduces redundant calculation and improves the similarity connection query efficiency.

S105, updating the initial result vector pair set according to the plurality of sub-vector grouping sets, the vector distance threshold and the result vector quantity to obtain a result vector pair set for representing the similarity connection query result.

In the embodiment of the application, in practical application, constructing the similarity distribution histogram of the original vector set requires grouping the original vector set to obtain a plurality of sub-vector grouping sets; the histogram is then constructed according to the number of result vectors and the plurality of sub-vector grouping sets. Before step S105 is executed, the original vector set may be grouped again to obtain the plurality of sub-vector grouping sets. The two grouping passes produce the same sub-vector grouping sets. However, because the data volume of the original vector set is large, the data volume of the sub-vector grouping sets is also large, so the grouping sets obtained before the histogram is constructed are not necessarily kept in storage, and in that case the original vector set needs to be grouped again before step S105 is executed.

In the similarity connection query method described in Fig. 1, when a similarity connection query is performed, an original vector set for the similarity connection query, the number of vector pairs of the similarity connection query result and an initial data set of the similarity connection query result are first obtained. The original vector set is then subjected to grouping processing to obtain a plurality of sub-vector grouping sets, and a similarity distribution histogram of the original vector set is constructed. Further, a vector distance threshold is calculated according to the similarity distribution histogram and the number of result vectors, and finally the initial result vector pair set is updated according to the plurality of sub-vector grouping sets, the vector distance threshold and the number of result vectors to obtain a result vector pair set for representing the similarity connection query result. Therefore, by implementing the similarity connection query method described in Fig. 1, the vector distance threshold does not need to be preset manually, and the method does not need to be executed multiple times with manually changed vector distance thresholds, so that a large amount of redundant calculation can be reduced and the similarity connection query efficiency can be improved.

Example 2

Referring to Fig. 2, Fig. 2 is a schematic block diagram of a flow of a similarity connection query method according to an embodiment of the present application. As shown in Fig. 2, the similarity connection query method includes:

S201, acquiring an original vector set to be queried, a result vector quantity and an initial result vector pair set.

S202, obtaining original vector dimensions of the original vector set.

In the embodiment of the present application, the original vector set is a high-dimensional vector set and includes a plurality of original vectors, each of which has the original vector dimension. That is, if the original vector dimension is n, every original vector in the original vector set is n-dimensional. An n-dimensional original vector is a point in n-dimensional Euclidean space.

S203, performing dimensionality reduction processing on the original vector set according to the original vector dimensionality and the preset dimensionality reduced dimensionality to obtain a dimensionality reduced vector set.

In the embodiment of the application, the dimensionality of each vector in the dimensionality reduction vector set, obtained by performing dimensionality reduction on the original vector set, is the preset dimensionality after dimensionality reduction.

As an optional implementation, the dimensionality reduction processing on the original vector set can adopt the Piecewise Aggregate Approximation (PAA) dimensionality reduction technique. Let the original vector dimension of the original vector set be $d$ and the dimension after dimensionality reduction be $d'$. Any original vector in the original vector set is $X = \langle x_1, x_2, \ldots, x_d \rangle$, and the corresponding dimensionality reduction vector in the dimensionality reduction vector set is

$$X_p = \langle \bar{x}_1, \bar{x}_2, \ldots, \bar{x}_{d'} \rangle$$

Then the dimensionality reduction processing can be calculated according to the following formula, given here in the standard PAA form:

$$\bar{x}_i = \frac{d'}{d} \sum_{j=\frac{d}{d'}(i-1)+1}^{\frac{d}{d'} i} x_j$$

wherein $x_j$ is an element of the original vector $X$, and $\bar{x}_i$ is an element of the dimensionality reduction vector $X_p$ corresponding to the original vector.

In the above embodiment, the distance $D_p$ between two dimensionality reduction vectors in the dimensionality reduction vector set is a lower bound of the Euclidean distance $D_E$ between the two corresponding original vectors in the original vector set, i.e. $D_p \le D_E$.
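The PAA reduction and its lower-bound property can be checked with a short sketch. It assumes that the original dimension is an exact multiple of the reduced dimension and that the distance between reduced vectors is the plain Euclidean distance; the function names are illustrative only.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def paa(vector, reduced_dim):
    """Piecewise Aggregate Approximation: mean of each equal-length segment.

    Assumption: len(vector) is an exact multiple of reduced_dim.
    """
    d = len(vector)
    if d % reduced_dim != 0:
        raise ValueError("original dimension must be a multiple of reduced_dim")
    seg = d // reduced_dim
    return [sum(vector[i * seg:(i + 1) * seg]) / seg for i in range(reduced_dim)]


x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 2.0, 5.0, 1.0]
x_p, y_p = paa(x, 2), paa(y, 2)

d_p = euclidean(x_p, y_p)  # distance between the reduced vectors
d_e = euclidean(x, y)      # distance between the original vectors
print(x_p, y_p, d_p <= d_e)
```

Because the reduced distance never exceeds the original distance, a pair whose reduced distance is already above the threshold can be discarded without computing the full-dimensional distance.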

After step S203, the following steps are also included:

S204, performing symbolization processing on the dimensionality reduction vector set to obtain a character string set.

As an optional implementation, the symbolization processing on the dimensionality reduction vector set may adopt the Symbolic Aggregate approXimation (SAX) method. That is, according to a certain rule, each dimensionality reduction vector in the dimensionality reduction vector set is mapped to a character string in the character string set, so that an original vector X of dimension d can be represented by a character string in the character string set, which is recorded as:

S205, grouping the original vector set according to the character string set to obtain a plurality of sub-vector sets; and one character string in the character string set corresponds to one sub-vector set.

In the embodiment of the present application, by performing the above steps S202 to S205, the original vector set can be grouped to obtain a plurality of sub-vector grouped sets.

In the embodiment of the present application, after performing dimension reduction and symbolization on the original vectors of the original vector set, different original vectors may be mapped to the same character string, so that the original data may be grouped by using the character string, and the original vectors of the same character string may be grouped into one group.

In the embodiment of the present application, the probability of similarity between original vectors in a set of sub-vectors corresponding to one character string is high, and the probability of similarity between original vectors in sets of sub-vectors corresponding to different character strings is low.

In the embodiment of the present application, assume that the original vector set includes six original vectors $X_1, X_2, X_3, X_4, X_5, X_6$, and that after dimensionality reduction and symbolization a character string set comprising three character strings $s_1, s_2, s_3$ is obtained, where $X_1$, $X_2$ and $X_3$ correspond to $s_1$, $X_4$ corresponds to $s_2$, and $X_5$ and $X_6$ correspond to $s_3$. According to the character string set, the six original vectors can be divided into three sub-vector sets: the first sub-vector set comprises the three vectors $X_1, X_2, X_3$ and corresponds to the character string $s_1$; the second sub-vector set comprises the single vector $X_4$ and corresponds to the character string $s_2$; and the third sub-vector set comprises the two vectors $X_5, X_6$ and corresponds to the character string $s_3$.
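The six-vector example above can be reproduced with a short Python sketch. The identifiers (`group_by_string`, the labels `"X1"` and `"s1"`) are illustrative stand-ins for the vectors and character strings of the example.

```python
from collections import defaultdict

def group_by_string(strings_by_vector):
    """Group vector identifiers by the character string each one maps to."""
    groups = defaultdict(list)
    for vec_id, s in strings_by_vector.items():
        groups[s].append(vec_id)
    return dict(groups)

assignment = {"X1": "s1", "X2": "s1", "X3": "s1",
              "X4": "s2", "X5": "s3", "X6": "s3"}
groups = group_by_string(assignment)
print(groups)
# {'s1': ['X1', 'X2', 'X3'], 's2': ['X4'], 's3': ['X5', 'X6']}
```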

After step S205, the following steps are also included:

S206, obtaining a plurality of distance vector intervals according to a preset interval division rule and the original vector set.

In the embodiment of the present application, the preset interval division rule may be equal interval division, and the like, and the embodiment is not limited in any way.

As an optional implementation manner, obtaining a plurality of distance vector intervals according to a preset interval division rule and an original vector set may include the following steps:

carrying out pairing combination processing on original vectors in an original vector set to obtain a plurality of original vector pairs;

and dividing the interval to be divided into a plurality of sub-intervals according to the interval upper limit value, the preset interval lower limit value and the preset interval division rule, wherein the plurality of sub-intervals are a plurality of distance vector intervals.

In the above embodiments, the vector distance between two original vectors may be a Euclidean distance, a Manhattan distance, a Chebyshev distance, a Minkowski distance, a Mahalanobis distance, or the like, and the present embodiment is not limited thereto.

In the above embodiments, the Euclidean distance, also referred to as the Euclidean metric, is the true distance between two points in an m-dimensional space, or the natural length of a vector (i.e., the distance of the point from the origin). The Euclidean distance in two and three dimensions is the actual distance between two points.

As a further alternative embodiment, the vector distance between two original vectors is the Euclidean distance. The Euclidean distance between an original vector $X = \langle x_1, x_2, \ldots, x_d \rangle$ and an original vector $Y = \langle y_1, y_2, \ldots, y_d \rangle$ is calculated as:

$$dist(X, Y) = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2}$$

where $dist(X, Y)$ is the Euclidean distance between the original vector X and the original vector Y, and d is the original vector dimension.

In the above embodiment, the preset lower limit of the interval may be 0, and the present embodiment is not limited thereto.

In the above embodiment, if the preset interval division rule is to divide the to-be-divided interval into 4 distance vector intervals at equal intervals, the preset interval lower limit is 0, and the obtained interval upper limit is $dist_{max}$, then the to-be-divided interval is $(0, dist_{max})$. For example, when $dist_{max} = 20$, the to-be-divided interval is (0, 20), and according to the preset interval division rule it is divided into 4 distance vector intervals, namely (0, 5), (5, 10), (10, 15) and (15, 20).

In the above embodiment, the opening and closing of the two end points of the section is not limited.
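As an illustration of the equal-interval division, the following sketch pairs up the vectors, computes their Euclidean distances and splits the range into equal sub-intervals. Taking the largest pairwise distance as the interval upper limit is an assumption made for this sketch, and the function names are hypothetical.

```python
import math
from itertools import combinations

def euclidean(x, y):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def distance_intervals(vectors, num_intervals, lower=0.0):
    """Split [lower, max pairwise distance] into equal-width sub-intervals."""
    # Assumption: the interval upper limit is the maximum pairwise distance.
    upper = max(euclidean(a, b) for a, b in combinations(vectors, 2))
    width = (upper - lower) / num_intervals
    return [(lower + i * width, lower + (i + 1) * width)
            for i in range(num_intervals)]

vectors = [(0.0, 0.0), (12.0, 16.0), (3.0, 4.0)]
print(distance_intervals(vectors, 4))
# [(0.0, 5.0), (5.0, 10.0), (10.0, 15.0), (15.0, 20.0)]
```

The three toy vectors have a maximum pairwise distance of 20, so the output matches the worked example with $dist_{max} = 20$.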

After step S206, the following steps are also included:

S207, calculating an overall similarity result estimation value corresponding to each distance vector interval according to the plurality of sub-vector sets.

In this embodiment of the present application, calculating an overall similarity result estimation value corresponding to each distance vector interval according to a plurality of sub-vector sets may include the following steps:

sampling each sub vector set to obtain a sampling vector set corresponding to each sub vector set;

determining at least one sub-vector set corresponding to each distance vector interval;

calculating an intra-group result quantity estimation value and an inter-group result quantity estimation value corresponding to each distance vector interval according to the at least one sub-vector set corresponding to each distance vector interval;

calculating an overall similar result quantity estimation value corresponding to each distance vector interval according to the intra-group result quantity estimation value and the inter-group result quantity estimation value corresponding to that distance vector interval, where the overall similar result quantity estimation value is the overall similar result quantity estimation value of the at least one sub-vector set corresponding to the distance vector interval.

In the embodiment of the present application, suppose for example that grouping the original vector set yields m sub-vector sets, where the i-th sub-vector set contains $N_i$ vectors, $i = 1, \ldots, m$. The overall similarity result estimation value corresponding to each distance vector interval is then calculated from the plurality of sub-vector sets as follows:

(1) Random sampling is performed on each sub-vector set to obtain the sampling sample set corresponding to that sub-vector set. The number of sample vectors in the sampling sample set corresponding to the i-th sub-vector set is $S_i$.

(2) At least one sub-vector set corresponding to each distance vector interval is determined. Because one character string in the character string set corresponds to one sub-vector set, all the character strings in the character string set are combined pairwise to obtain a plurality of character string pairs, the character string distance between the two character strings in each pair is calculated, and the at least one sub-vector set corresponding to each distance vector interval is determined according to the character string distance. For example, assume that three distance vector intervals are obtained in total, that character string 1 corresponds to the first sub-vector set, that character string 2 corresponds to the second sub-vector set, and that the character string distance between character string 1 and character string 2 is dist1. If dist1 falls in the third distance vector interval, then the first sub-vector set and the second sub-vector set correspond to the third distance vector interval.

(3) Taking one distance vector interval as an example, let the sub-vector sets corresponding to the distance vector interval $(x_{i-1}, x_i)$ be $X_1$, $X_2$ and $X_3$, whose corresponding sampling sample sets are $T_1$, $T_2$ and $T_3$ respectively. The overall similar result quantity estimation value is calculated as follows:

First step: calculate the number of similar results within each of the samples $T_1$, $T_2$ and $T_3$. The number of similar results within the sampling sample set $T_i$ corresponding to the i-th sub-vector set $X_i$ is $R_i$.

Second step: calculate the number of similar results between the samples $T_1$ and $T_2$, $T_2$ and $T_3$, and $T_1$ and $T_3$. The number of inter-sample similar results between the sampling sample sets corresponding to the i-th sub-vector set $X_i$ and the j-th sub-vector set $X_j$ is $R_{ij}$.

Third step: calculate the intra-group result quantity estimate for each of $X_1$, $X_2$ and $X_3$, that is, for the i-th sub-vector set $X_i$, the estimated number of results within the whole group.

Fourth step: calculate the inter-group result quantity estimate for each of the pairs $X_1$ and $X_2$, $X_2$ and $X_3$, and $X_1$ and $X_3$, that is, for the i-th sub-vector set $X_i$ and the j-th sub-vector set $X_j$, the estimated number of results between the two groups.

Fifth step: calculate the overall similar result quantity estimate corresponding to the distance vector interval $(x_{i-1}, x_i)$.

In the embodiment of the present application, consider the distance vector interval $(x_{i-1}, x_i)$ and its corresponding sub-vector sets $X_1$, $X_2$ and $X_3$, whose sampling sample sets are $T_1$, $T_2$ and $T_3$ respectively. Taking the sampling sample set $T_1$ as an example, the number of similar results within the sample is calculated as follows: the sample vectors in $T_1$ are combined pairwise to obtain a plurality of sample vector pairs, the vector distance (which may be the Euclidean distance) between the two sample vectors of each pair is calculated, and the number of pairs whose vector distance lies in the distance vector interval $(x_{i-1}, x_i)$ is the number of similar results within the sample for $T_1$.

Similarly, taking the sampling sample sets $T_1$ and $T_2$ as an example, the number of similar results between the samples is calculated as follows: $dist(t_i, t_j)$ is calculated for $t_i \in G(T_1)$ and $t_j \in G(T_2)$, and the number of sample vector pairs satisfying $dist(t_i, t_j) \in (x_{i-1}, x_i)$ is the number of similar results between the samples $T_1$ and $T_2$.
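The two counting procedures can be sketched in Python as follows. The names are hypothetical, the Euclidean distance is used as the vector distance, and the half-open endpoint convention in `in_interval` is an arbitrary choice, since the text does not fix whether the interval endpoints are open or closed.

```python
import math
from itertools import combinations, product

def euclidean(x, y):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def in_interval(dist, interval):
    """Assumed convention: lower endpoint open, upper endpoint closed."""
    lo, hi = interval
    return lo < dist <= hi

def count_within(sample, interval):
    """Number of pairs inside one sample whose distance lies in the interval."""
    return sum(1 for a, b in combinations(sample, 2)
               if in_interval(euclidean(a, b), interval))

def count_between(sample_a, sample_b, interval):
    """Number of cross pairs between two samples whose distance lies in the interval."""
    return sum(1 for a, b in product(sample_a, sample_b)
               if in_interval(euclidean(a, b), interval))

t1 = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
t2 = [(0.0, 1.0), (9.0, 12.0)]
interval = (4.0, 6.0)
print(count_within(t1, interval), count_between(t1, t2, interval))  # 2 2
```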

In the embodiment of the application, the total intra-group result quantity estimate of the plurality of sub-vector sets is the sum of the intra-group result quantity estimates of the individual sub-vector sets; similarly, the total inter-group result quantity estimate is the sum of the inter-group result quantity estimates over every pair of sub-vector sets. The overall similar result quantity estimate corresponding to a distance vector interval is the sum of the total intra-group estimate and the total inter-group estimate of the sub-vector sets corresponding to that distance vector interval.
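The summation described above can be sketched as follows. The scale-up formulas for the intra-group and inter-group estimates are not reproduced in this text, so the pair-count ratios used here (sample counts scaled by the ratio of total pairs to sampled pairs) are an assumption, and all function names are hypothetical.

```python
def intra_group_estimate(r_i, n_i, s_i):
    """Assumed estimator: scale R_i by (pairs in the group) / (pairs in the sample)."""
    sample_pairs = s_i * (s_i - 1) / 2
    total_pairs = n_i * (n_i - 1) / 2
    return r_i * total_pairs / sample_pairs if sample_pairs else 0.0

def inter_group_estimate(r_ij, n_i, n_j, s_i, s_j):
    """Assumed estimator: scale R_ij by (cross pairs of groups) / (cross pairs of samples)."""
    sample_pairs = s_i * s_j
    return r_ij * (n_i * n_j) / sample_pairs if sample_pairs else 0.0

def overall_estimate(intra_estimates, inter_estimates):
    """Overall estimate for one interval: sum of intra-group and inter-group estimates."""
    return sum(intra_estimates) + sum(inter_estimates)

# Two groups with N = 100 and 50 vectors, sampled down to S = 10 and 5 vectors.
intra = [intra_group_estimate(2, 100, 10), intra_group_estimate(1, 50, 5)]
inter = [inter_group_estimate(3, 100, 50, 10, 5)]
print(intra, inter, overall_estimate(intra, inter))
# [220.0, 122.5] [300.0] 642.5
```

Only the final summation is taken from the text; a different scaling rule would change the individual estimates but not how they are combined.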

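In the embodiment of the present application, given two vectors X and Y whose corresponding character strings after the symbolization process are $X_s$ and $Y_s$ respectively, the character string distance between $X_s$ and $Y_s$ is calculated as follows (given here in the standard SAX form):

$$MINDIST(X_s, Y_s) = \sqrt{\frac{d}{d'}}\,\sqrt{\sum_{i=1}^{d'} \big(dist(x_{s,i}, y_{s,i})\big)^2}$$

where $MINDIST(X_s, Y_s)$ represents the character string distance between the character string $X_s$ and the character string $Y_s$, $dist(x_{s,i}, y_{s,i})$ is the distance between the i-th characters of the two character strings, d is the original vector dimension, and d' is the preset dimension after dimensionality reduction.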
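A character string distance of this kind, and its use for locating the distance vector interval of a pair of groups, might be sketched as below. The per-character distance follows the conventional SAX lookup rule (zero for equal or adjacent letters, otherwise the gap between breakpoints), which is an assumption here; the function names and the alphabet size of 4 are likewise hypothetical.

```python
import math
from statistics import NormalDist

def sax_breakpoints(alphabet_size):
    """Breakpoints that cut a standard normal into equal-probability regions."""
    nd = NormalDist()
    return [nd.inv_cdf(i / alphabet_size) for i in range(1, alphabet_size)]

def symbol_distance(a, b, bps):
    """Conventional SAX rule: 0 for equal or adjacent letters, else a breakpoint gap."""
    r, c = sorted((ord(a) - ord("a"), ord(b) - ord("a")))
    return 0.0 if c - r <= 1 else bps[c - 1] - bps[r]

def mindist(xs, ys, d, alphabet_size=4):
    """String distance scaled by sqrt(d / d'), where d' is the string length."""
    bps = sax_breakpoints(alphabet_size)
    total = sum(symbol_distance(a, b, bps) ** 2 for a, b in zip(xs, ys))
    return math.sqrt(d / len(xs)) * math.sqrt(total)

def interval_index(dist, intervals):
    """Index of the first interval containing dist (endpoints treated as closed)."""
    for i, (lo, hi) in enumerate(intervals):
        if lo <= dist <= hi:
            return i
    return None

print(mindist("abd", "abd", 6))  # 0.0
print(interval_index(mindist("aad", "dab", 6), [(0, 1), (1, 2), (2, 3)]))  # 2
```

In the second call the two strings are far enough apart that the pair of groups they represent is assigned to the third interval.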

After step S207, the following steps are also included:

S208, constructing a similarity distribution histogram of the original vector set according to the overall similarity result estimation value corresponding to each distance vector interval.

In the embodiment of the present application, by implementing the steps S206 to S208, a similarity distribution histogram of the original vector set can be constructed according to the number of the result vectors and the plurality of sub-vector grouping sets.

Referring to fig. 4, fig. 4 is a similarity distribution histogram of an original vector set according to the present embodiment. As shown in fig. 4, the abscissa of the similarity distribution histogram represents the distance vector intervals, and the ordinate represents the overall similarity result estimation value corresponding to each distance vector interval. Here $x_0, x_1, x_2, \ldots, x_{i-1}, x_i, \ldots, x_m$ are the interval boundaries of the distance vector intervals, k is the number of result vectors, and $(y_{i-1}, y_i)$ is the similar result quantity interval corresponding to the result vector interval. Assuming that the overall similarity result estimation value increases linearly between two adjacent interval boundaries, the straight line between $x_{i-1}$ and $x_i$ can be expressed as:

$$y = y_{i-1} + \frac{y_i - y_{i-1}}{x_i - x_{i-1}}\,(x - x_{i-1})$$

Then, substituting $y = k$ (the number of result vectors) into the corresponding linear equation yields the corresponding vector distance threshold $x_k$.

S209, determining the vector interval corresponding to the number of result vectors from the plurality of distance vector intervals as the result vector interval.

In the embodiment of this application, when $k \in (y_{i-1}, y_i)$, the result vector interval is $(x_{i-1}, x_i)$, as shown in the similarity distribution histogram of fig. 4.

S210, determining a linear equation corresponding to the result vector interval, and calculating a vector distance threshold according to the linear equation.

In the embodiment of the present application, by implementing the steps S209 to S210, the vector distance threshold can be calculated according to the similarity distribution histogram and the number of result vectors.

As an alternative embodiment, according to the similarity distribution histogram shown in fig. 4, when $k \in (y_{i-1}, y_i)$, the linear equation is:

$$x_k = x_{i-1} + \frac{(k - y_{i-1})(x_i - x_{i-1})}{y_i - y_{i-1}}$$

where $x_0, x_1, x_2, \ldots, x_{i-1}, x_i, \ldots, x_m$ are the interval boundaries of the distance vector intervals, k is the number of result vectors, $x_k$ is the vector distance threshold, $(x_{i-1}, x_i)$ is the result vector interval, and $(y_{i-1}, y_i)$ is the similar result quantity interval corresponding to the result vector interval.
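The threshold calculation can be sketched as a linear interpolation inside the interval that contains k. The sketch assumes that each $y_i$ is the cumulative estimated number of results up to the boundary $x_i$, which the text does not state explicitly; the function name and the sample numbers are hypothetical.

```python
from itertools import accumulate

def distance_threshold(boundaries, interval_estimates, k):
    """Interpolate the distance x_k at which the cumulative estimate reaches k."""
    if k <= 0:
        raise ValueError("k must be positive")
    # Assumption: y_i is the running total of the per-interval estimates.
    cumulative = [0.0] + list(accumulate(interval_estimates))
    for i in range(1, len(boundaries)):
        y_prev, y_cur = cumulative[i - 1], cumulative[i]
        if y_prev < k <= y_cur:
            x_prev, x_cur = boundaries[i - 1], boundaries[i]
            return x_prev + (k - y_prev) * (x_cur - x_prev) / (y_cur - y_prev)
    raise ValueError("k exceeds the estimated total number of pairs")

boundaries = [0.0, 5.0, 10.0, 15.0, 20.0]
estimates = [10, 30, 40, 20]  # estimated results per interval
print(distance_threshold(boundaries, estimates, 25))  # 7.5
```

With k = 25 the result vector interval is (5, 10), whose cumulative estimates run from 10 to 40, so the threshold lies halfway through it.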

S211, judging whether the number of the vector pairs in the initial result vector pair set is less than the number of the result vectors, if so, executing the step S213; if not, step S212 to step S213 are executed.

S212, updating the initial result vector pair set according to the character string set and the plurality of sub-vector sets until the number of vector pairs in the updated initial result vector pair set is not less than the number of result vectors, and obtaining a final updated initial result vector pair set.

As an optional implementation manner, updating the initial result vector pair set according to the character string set and the plurality of sub-vector sets until the number of vector pairs in the updated initial result vector pair set is not less than the number of result vectors, to obtain a final updated initial result vector pair set, may include the following steps:

Step 1, determining a plurality of character string pairs which meet the current character string condition according to the character string set to obtain a candidate character string pair set, and determining the sub-vector set pair corresponding to each character string pair in the candidate character string pair set according to the plurality of sub-vector sets;

Step 2, determining at least one vector pair which meets the current similar vector pair condition according to the sub-vector set pair corresponding to each character string pair;

Step 3, updating the initial result vector pair set according to the at least one vector pair to obtain an updated initial result vector pair set, and updating the character string condition and the similar vector pair condition respectively according to the updated initial result vector pair set to obtain an updated character string condition and an updated similar vector pair condition;

Step 4, repeatedly executing steps 1 to 3 until the number of vector pairs in the updated initial result vector pair set is not less than the number of result vectors, and taking the updated initial result vector pair set obtained by the last execution of step 3 as the final updated initial result vector pair set.

In the above embodiment, the current character string condition may be: for a character string pair $(SAX_i, SAX_j)$, the corresponding character string distance $MINDIST(SAX_i, SAX_j)$ is less than or equal to the current vector distance threshold.

In the above embodiment, the current similar vector pair condition may be: for any vector pair $(v_m, v_n)$, where $v_m \in G(SAX_i)$ and $v_n \in G(SAX_j)$, $dist(v_m, v_n)$ is less than or equal to the current vector distance threshold. Here $G(SAX_i)$ denotes the sub-vector set corresponding to the character string $SAX_i$.
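One pass of candidate filtering under these two conditions might look like the sketch below: whole pairs of groups are skipped when the character string condition fails, and the exact vector distance is computed only for the remaining pairs. The function names and the toy string distance are hypothetical, and treating a character string paired with itself as a candidate (so that pairs inside one group are also checked) is an assumption.

```python
import math
from itertools import combinations, combinations_with_replacement, product

def euclidean(x, y):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def similar_pairs(groups, threshold, string_distance):
    """Return (distance, v_m, v_n) for every verified pair within the threshold."""
    results = []
    for s_i, s_j in combinations_with_replacement(sorted(groups), 2):
        if string_distance(s_i, s_j) > threshold:
            continue  # string condition fails: skip this whole pair of groups
        if s_i == s_j:
            candidates = combinations(groups[s_i], 2)
        else:
            candidates = product(groups[s_i], groups[s_j])
        for v_m, v_n in candidates:
            dist = euclidean(v_m, v_n)
            if dist <= threshold:  # similar vector pair condition
                results.append((dist, v_m, v_n))
    return results

groups = {"aa": [(0.0, 0.0), (1.0, 0.0)], "ab": [(0.0, 2.0)], "dd": [(50.0, 50.0)]}

def toy_string_distance(a, b):
    """Hypothetical stand-in for MINDIST: 'dd' is far from the other strings."""
    return 0.0 if a == b or {a, b} == {"aa", "ab"} else 40.0

found = similar_pairs(groups, 2.0, toy_string_distance)
print(sorted(d for d, _, _ in found))  # [1.0, 2.0]
```

The group for `"dd"` is never compared vector by vector with the other groups, which is where the pruning saves work.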

In the above embodiment, updating the character string condition and the similar vector pair condition according to the updated initial result vector pair set amounts to updating the vector distance threshold on which both conditions depend. In the first cycle, the current vector distance threshold is the vector distance threshold calculated from the similarity distribution histogram.

As a further optional implementation, when the vector distance threshold is updated, the vector distances of all vector pairs in the currently updated initial result vector pair set are calculated to obtain a distance set, the maximum vector distance in the distance set is taken as the new vector distance threshold, and the character string condition and the similar vector pair condition are updated accordingly with the new vector distance threshold.

After step S212, the method further includes the following steps:

S213, extracting vector pairs equal in number to the number of result vectors from the final updated initial result vector pair set, and combining them to obtain a result vector pair set for representing the similarity connection query result.

In the embodiment of the present application, by implementing the steps S211 to S213, the initial result vector pair set can be updated according to the plurality of sub-vector grouping sets, the vector distance threshold, and the number of result vectors, so as to obtain a result vector pair set for representing the similarity connection query result.

In the above embodiment, if the number of vector pairs in the final updated initial result vector pair set is greater than or equal to the number of result vectors, vector pairs equal in number to the number of result vectors are extracted from it and combined to obtain the result vector pair set. Before extraction, the vector pairs in the final updated initial result vector pair set are sorted in ascending order of their vector distance; the first k vector pairs in that order, k being the number of result vectors, are then extracted and combined to obtain the result vector pair set.
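The final extraction is a top-k selection by ascending distance, which can be sketched as follows. The tuple layout `(distance, v_m, v_n)` and the function name are assumptions made for the example.

```python
import heapq

def top_k_pairs(result_pairs, k):
    """Return the k pairs with the smallest distance, in ascending order."""
    return heapq.nsmallest(k, result_pairs, key=lambda p: p[0])

pairs = [(2.0, "X1", "X4"), (0.5, "X2", "X3"), (1.2, "X1", "X2"), (3.1, "X5", "X6")]
print(top_k_pairs(pairs, 2))
# [(0.5, 'X2', 'X3'), (1.2, 'X1', 'X2')]
```

A full sort followed by slicing gives the same result; `heapq.nsmallest` simply avoids sorting the whole set when k is small.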

It can be seen that, by implementing the similarity connection query method described in fig. 2, there is no need to manually preset a vector distance threshold, nor to manually and repeatedly change the vector distance threshold and execute the similarity connection query method multiple times, so that a large amount of redundant calculation can be reduced and the similarity connection query efficiency can be improved.

Example 3

Referring to fig. 3, fig. 3 is a schematic block diagram of a similarity connection query apparatus according to an embodiment of the present application. As shown in fig. 3, the similarity connection query apparatus includes:

a first obtaining module 310, configured to obtain an original vector set, a result vector quantity, and an initial result vector pair set to be queried; the initial result vector pair set is an initial data set of similarity connection query results, and the number of result vectors represents the number of vector pairs of the similarity connection query results.

The grouping module 320 is configured to perform grouping processing on the original vector set to obtain a plurality of sub-vector grouping sets.

The constructing module 330 is configured to construct a similarity distribution histogram of the original vector set according to the number of the result vectors and the plurality of sub-vector grouping sets.

The calculating module 340 is configured to calculate a vector distance threshold according to the similarity distribution histogram and the number of the result vectors.

A second obtaining module 350, configured to update the initial result vector pair set according to the multiple sub-vector grouping sets, the vector distance threshold, and the number of result vectors, to obtain a result vector pair set used for representing a similarity connection query result.

It can be seen that, by implementing the similarity connection query apparatus described in fig. 3, there is no need to manually preset a vector distance threshold, nor to manually and repeatedly change the vector distance threshold and execute the similarity connection query method multiple times, so that a large amount of redundant computation can be reduced and the similarity connection query efficiency is improved.

In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.

The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application in essence, or the part of it that contributes over the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application. It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.

The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

Claims

1. A similarity connection query method, comprising:

acquiring an original vector set to be queried, the quantity of result vectors and an initial result vector pair set; the original vector set is an original image data set for similarity connection query, the initial result vector pair set is an initial image data set for similarity connection query results, and the result vector quantity represents the number of vector pairs of similarity connection query results and the quantity of similar images in the original image data set;

updating the initial result vector pair set according to the plurality of sub-vector grouping sets, the vector distance threshold and the result vector quantity to obtain a result vector pair set for representing similarity connection query results;

constructing a similarity distribution histogram of the original vector set according to the number of the result vectors and the plurality of sub-vector grouping sets, wherein the similarity distribution histogram comprises the following steps;

2. The similarity connection query method according to claim 1, wherein the obtaining a plurality of distance vector intervals according to a preset interval partition rule and the original vector set comprises:

3. The method according to claim 1, wherein the calculating an overall similarity result estimation value corresponding to each distance vector interval according to the plurality of subvector grouping sets comprises:

sampling each sub-vector grouping set to obtain a sampling vector set corresponding to each sub-vector grouping set;

determining at least one sub-vector grouping set corresponding to each distance vector interval;

calculating an intra-group result quantity estimation value and an inter-group result quantity estimation value corresponding to each distance vector interval according to the at least one sub-vector grouping set corresponding to each distance vector interval;

calculating a total similar result quantity estimated value corresponding to each distance vector interval according to the intra-group result quantity estimated value corresponding to each distance vector interval and the inter-group result quantity estimated value corresponding to each distance vector interval; wherein the overall similar result quantity estimation value is an overall similar result quantity estimation value of the at least one sub-vector grouping set corresponding to the distance vector interval.

4. The similarity connection query method according to claim 1, wherein the grouping processing is performed on the original vector sets to obtain a plurality of sub-vector grouping sets, and the method comprises:

acquiring original vector dimensions of the original vector set;

grouping the original vector set according to the character string set to obtain a plurality of sub-vector grouping sets; and one character string in the character string set corresponds to one sub-vector grouping set.

5. The similarity join query method of claim 4, wherein the updating the initial set of result vector pairs according to the plurality of sets of sub-vector groups, the vector distance threshold, and the number of result vectors to obtain a set of result vector pairs representing the result of the similarity join query comprises:

if the number of vector pairs in the initial result vector pair set is not less than the number of result vectors, updating the initial result vector pair set according to the character string set and the plurality of sub-vector grouping sets until the number of vector pairs in the updated initial result vector pair set is not less than the number of result vectors, and obtaining a final updated initial result vector pair set;

6. The similarity connection query method according to claim 5, wherein updating the initial result vector pair set according to the character string set and the plurality of sub-vector grouping sets until the number of vector pairs in the updated initial result vector pair set is not less than the number of result vectors, to obtain a final updated initial result vector pair set, includes:

step 1, determining a plurality of character string pairs which accord with the current character string condition according to the character string set to obtain a candidate character string pair set, and determining a sub-vector grouping set pair corresponding to each pair of character string pairs in the candidate character string pair set according to a plurality of sub-vector grouping sets;

step 2, determining at least one vector pair meeting the current similar vector pair condition according to the sub-vector grouping set pair corresponding to each pair of character strings;

7. The similarity connection query method according to claim 1, wherein calculating a vector distance threshold according to the similarity distribution histogram and the number of result vectors comprises:

8. The similarity connection query method according to claim 7, wherein the linear equation is:

$$x_k = x_{i-1} + \frac{(k - y_{i-1})(x_i - x_{i-1})}{y_i - y_{i-1}}$$；

wherein $x_0, x_1, x_2, \ldots, x_{i-1}, x_i, \ldots, x_m$ are the interval boundaries of each of said distance vector intervals, k is the number of result vectors, $x_k$ is the vector distance threshold, $(x_{i-1}, x_i)$ is the result vector interval, and $(y_{i-1}, y_i)$ is the similar result quantity interval corresponding to the result vector interval;

wherein $k \in (y_{i-1}, y_i)$.

9. A similarity connection query apparatus, comprising:

the first acquisition module is used for acquiring an original vector set to be queried, the quantity of result vectors and an initial result vector pair set; the original vector set is an original image data set for similarity connection query, the initial result vector pair set is an initial image data set for similarity connection query results, and the result vector quantity represents the number of vector pairs of similarity connection query results and the quantity of similar images in the original image data set;

a second obtaining module, configured to update the initial result vector pair set according to the multiple sub-vector grouping sets, the vector distance threshold, and the result vector quantity, so as to obtain a result vector pair set used for representing a similarity connection query result;

wherein the constructing module comprises:

the first submodule is used for acquiring a plurality of distance vector intervals according to a preset interval division rule and the original vector set;

the second sub-module is used for calculating a total similar result estimation value corresponding to each distance vector interval according to the plurality of sub-vector grouping sets;

and the third sub-module is used for constructing a similarity distribution histogram of the original vector set according to the overall similarity result estimation value corresponding to each distance vector interval.