CN109783547B - Similarity connection query method and device - Google Patents

Publication number
CN109783547B
CN109783547B CN201910130094.7A CN201910130094A CN109783547B CN 109783547 B CN109783547 B CN 109783547B CN 201910130094 A CN201910130094 A CN 201910130094A CN 109783547 B CN109783547 B CN 109783547B
Authority
CN
China
Prior art keywords
vector
result
original
similarity
sub
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910130094.7A
Other languages
Chinese (zh)
Other versions
CN109783547A (en
Inventor
马友忠
张瑞玲
林春杰
李莹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Luoyang Normal University
Original Assignee
Luoyang Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Luoyang Normal University filed Critical Luoyang Normal University
Priority to CN201910130094.7A priority Critical patent/CN109783547B/en
Publication of CN109783547A publication Critical patent/CN109783547A/en
Application granted granted Critical
Publication of CN109783547B publication Critical patent/CN109783547B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A similarity connection query method and device relate to the field of data processing. When similarity connection query is carried out, an original vector set for similarity connection query, the number of vector pairs of similarity connection query results and an initial data set of the similarity connection query results are obtained, then the original vector set is subjected to grouping processing to obtain a plurality of sub-vector grouping sets, a similarity distribution histogram of the original vector set is constructed, a vector distance threshold value is calculated according to the similarity distribution histogram and the number of result vectors, and finally the initial result vector pair set is updated according to the plurality of sub-vector grouping sets, the vector distance threshold value and the number of result vectors to obtain a result vector pair set for representing the similarity connection query results.

Description

Similarity connection query method and device
Technical Field
The present application relates to the field of data processing, and in particular, to a method and an apparatus for similarity connection query.
Background
A similarity connection query finds, in a massive high-dimensional data set, the data pairs whose similarity is greater than or equal to a given similarity threshold or whose distance is less than or equal to a given distance threshold. It has important applications in many fields, such as image clustering, duplicate web page detection and similar user recommendation. At present, a similarity connection query can be performed by a threshold-based method, in which the threshold must be chosen manually in advance according to the distance distribution between the vectors in the vector set to be queried, so that a preset number of data pairs is finally obtained. In practice, however, the threshold-based method requires the threshold to be preset manually and then modified repeatedly, with the query re-executed under each new threshold until the preset number of data pairs is obtained. This repeated execution brings a large amount of redundant, repetitive calculation and makes the similarity connection query inefficient.
Disclosure of Invention
The embodiment of the application aims to provide a similarity connection query method and a similarity connection query device, which do not need to manually preset a threshold value, can reduce a large amount of redundant calculation, and further improve the similarity connection query efficiency.
A first aspect of the embodiments of the present application provides a method for querying similarity connection, including:
acquiring an original vector set to be queried, the quantity of result vectors and an initial result vector pair set; the original vector set is a data set for similarity connection query, the initial result vector pair set is an initial data set for similarity connection query results, and the result vector quantity represents the vector pair quantity of the similarity connection query results;
grouping the original vector set to obtain a plurality of sub-vector grouping sets;
constructing a similarity distribution histogram of the original vector set according to the number of the result vectors and the plurality of sub-vector grouping sets;
calculating a vector distance threshold according to the similarity distribution histogram and the quantity of the result vectors;
and updating the initial result vector pair set according to the plurality of sub-vector grouping sets, the vector distance threshold and the result vector quantity to obtain a result vector pair set for representing similarity connection query results.
In the implementation process, when a similarity connection query is carried out, the original vector set for the query, the number of vector pairs in the query result and the initial data set of the query result are obtained first. The original vector set is then grouped to obtain a plurality of sub-vector grouping sets, and a similarity distribution histogram of the original vector set is constructed. A vector distance threshold is calculated according to the similarity distribution histogram and the number of result vectors, and finally the initial result vector pair set is updated according to the plurality of sub-vector grouping sets, the vector distance threshold and the number of result vectors to obtain a result vector pair set for representing the similarity connection query results. The vector distance threshold therefore does not need to be preset manually, and the similarity connection query method does not need to be executed repeatedly under manually changed thresholds, which reduces a large amount of redundant calculation and improves the similarity connection query efficiency.
Further, constructing a similarity distribution histogram of the original vector set according to the number of the result vectors and the plurality of sub-vector grouping sets includes:
acquiring a plurality of distance vector intervals according to a preset interval division rule and the original vector set;
grouping the original vector set to obtain a plurality of sub-vector sets;
calculating a total similar result estimation value corresponding to each distance vector interval according to the plurality of subvector sets;
and constructing a similarity distribution histogram of the original vector set according to the overall similarity result estimation value corresponding to each distance vector interval.
In the implementation process, a plurality of distance vector intervals can be obtained according to a preset interval division rule and the original vector set; the original vector set is then grouped, the overall similarity result estimation value corresponding to each distance vector interval is calculated, and finally the similarity distribution histogram of the original vector set can be constructed. The abscissa of the similarity distribution histogram represents the distance vector intervals, and the ordinate represents the overall similarity result estimation value corresponding to each distance vector interval.
Further, the obtaining a plurality of distance vector intervals according to a preset interval division rule and the original vector set includes:
pairing and combining the original vectors in the original vector set to obtain a plurality of original vector pairs;
calculating the vector distance between two original vectors in each original vector pair to obtain a vector distance set;
selecting the maximum vector distance from the vector distance set as an interval upper limit value of the to-be-divided intervals;
and dividing the interval to be divided into a plurality of subintervals according to the interval upper limit value, the preset interval lower limit value and the preset interval division rule, wherein the plurality of subintervals are a plurality of distance vector intervals.
In the implementation process, the original vectors in the original vector set are combined pairwise to obtain a plurality of original vector pairs, and the vector distance of each original vector pair is calculated to obtain a vector distance set. The maximum vector distance in the vector distance set is used as the interval upper limit value of the interval to be divided, so the interval to be divided is determined by the preset interval lower limit value and this interval upper limit value. Finally, the interval to be divided is divided into a plurality of subintervals according to the preset interval division rule, and these subintervals are the plurality of distance vector intervals. Different original vector sets yield different distance vector intervals, so the constructed similarity distribution histogram matches the original vector set better, which improves the accuracy of the similarity connection query method.
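The interval construction described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the preset interval division rule is assumed here to be equal-width division, the distance is assumed to be Euclidean, and the names `euclidean`, `distance_intervals` and `num_intervals` are hypothetical.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_intervals(vectors, num_intervals, lower=0.0):
    """Return the boundaries x_0 .. x_m of the distance vector intervals.

    Assumption: the "preset interval division rule" is equal-width division
    of [lower, upper], where upper is the maximum pairwise distance.
    """
    if len(vectors) < 2:
        raise ValueError("at least two vectors are needed to form a pair")
    # Pair the original vectors and take the largest distance as the upper limit.
    upper = max(euclidean(a, b) for a, b in combinations(vectors, 2))
    width = (upper - lower) / num_intervals
    bounds = [lower + i * width for i in range(num_intervals)]
    bounds.append(upper)  # close the last interval exactly at the upper limit
    return bounds


vectors = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
bounds = distance_intervals(vectors, num_intervals=5)
```

Computing every pairwise distance is quadratic in the number of vectors, so on a large data set the upper limit would more plausibly be taken from a sample; the source does not say which is used.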
Further, calculating an overall similarity result estimation value corresponding to each distance vector interval according to the set of the plurality of subvectors includes:
sampling each sub-vector set to obtain a sampling vector set corresponding to each sub-vector set;
determining at least one sub vector set corresponding to each distance vector interval;
calculating an intra-packet result quantity estimation value and an inter-packet result quantity estimation value corresponding to each distance vector interval according to at least one sub-vector set corresponding to each distance vector interval;
calculating a total similar result quantity estimated value corresponding to each distance vector interval according to the intra-group result quantity estimated value corresponding to each distance vector interval and the inter-group result quantity estimated value corresponding to each distance vector interval; wherein the overall similar result quantity estimation value is an overall similar result quantity estimation value of the at least one sub-vector set corresponding to the distance vector interval.
In the implementation process, the intra-group result quantity estimation value corresponding to each sub-vector set is calculated, then the plurality of sub-vector sets are combined pairwise to obtain a plurality of sub-vector set pairs, then the inter-group result quantity estimation value corresponding to each sub-vector set pair is calculated, finally, at least one sub-vector set corresponding to each distance vector interval is determined, and the overall similar result quantity estimation value of the at least one sub-vector set is calculated.
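One way to picture the sampling-based estimate is sketched below. The source does not give the estimator itself in this passage, so the scaling used here (sample pair counts multiplied by the ratio of full-population pairs to sampled pairs) is an assumption, as are the names `estimate_histogram`, `interval_index` and `sample_size`.

```python
import math
import random
from itertools import combinations, product


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def interval_index(dist, bounds):
    """Index of the distance interval containing dist (clamped to the last one)."""
    for i in range(1, len(bounds) - 1):
        if dist < bounds[i]:
            return i - 1
    return len(bounds) - 2


def estimate_histogram(groups, bounds, sample_size, seed=0):
    """Estimate the number of vector pairs falling in each distance interval.

    Assumed estimator: count pairs among sampled vectors, then scale by
    (pairs in the full data) / (pairs in the sample).
    """
    if sample_size < 2:
        raise ValueError("sample_size must be at least 2")
    rng = random.Random(seed)
    samples = [rng.sample(g, min(sample_size, len(g))) for g in groups]
    hist = [0.0] * (len(bounds) - 1)

    # Intra-group estimate: pairs drawn from within one sub-vector set.
    for group, sample in zip(groups, samples):
        if len(sample) < 2:
            continue
        scale = math.comb(len(group), 2) / math.comb(len(sample), 2)
        for a, b in combinations(sample, 2):
            hist[interval_index(euclidean(a, b), bounds)] += scale

    # Inter-group estimate: pairs drawn across two different sub-vector sets.
    for (g1, s1), (g2, s2) in combinations(list(zip(groups, samples)), 2):
        if not s1 or not s2:
            continue
        scale = (len(g1) * len(g2)) / (len(s1) * len(s2))
        for a, b in product(s1, s2):
            hist[interval_index(euclidean(a, b), bounds)] += scale

    return hist


groups = [[(0.0, 0.0), (1.0, 0.0)], [(10.0, 0.0), (11.0, 0.0)]]
hist = estimate_histogram(groups, bounds=[0.0, 4.0, 8.0, 12.0], sample_size=2)
```

In this toy call the sample covers every vector, so the estimate coincides with the exact counts; with smaller samples the per-interval values become approximations while their total still equals the total number of pairs.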
Further, grouping the original vector set to obtain a plurality of sub-vector grouping sets, including:
acquiring original vector dimensions of the original vector set;
performing dimensionality reduction processing on the original vector set according to the original vector dimensionality and a preset dimensionality reduced dimensionality to obtain a dimensionality reduced vector set;
performing symbolization processing on the dimensionality reduction vector set to obtain a character string set;
grouping the original vector set according to the character string set to obtain a plurality of sub-vector sets; and one character string in the character string set corresponds to one sub vector set.
In the implementation process, the original vector set is subjected to dimensionality reduction to obtain a dimensionality reduction vector set, and the dimensionality reduction vector set is then symbolized to obtain a character string set. Because different vectors may yield the same character string after symbolization, the original vector set is grouped according to the character string set to obtain a plurality of sub-vector sets, and the original vectors in each sub-vector set all correspond to the same character string.
Further, the updating the initial result vector pair set according to the plurality of sub-vector grouping sets, the vector distance threshold, and the result vector number to obtain a result vector pair set for representing a similarity join query result includes:
judging whether the number of vector pairs in the initial result vector pair set is smaller than the number of result vectors;
if the number of vector pairs in the initial result vector pair set is less than the number of result vectors, updating the initial result vector pair set according to the character string set and the plurality of sub-vector sets until the number of vector pairs in the updated initial result vector pair set is not less than the number of result vectors, and obtaining a final updated initial result vector pair set;
and extracting the vector pairs of the result vector quantity from the final updated initial result vector pair set, and combining to obtain a result vector pair set for representing similarity connection query results.
In the implementation process, after the original vector set is subjected to grouping processing to obtain a plurality of sub-vector grouping sets, continuously updating the initial result vector pair set according to the character string set and the plurality of sub-vector sets until the number of vector pairs in the updated initial result vector pair set is not less than the number of result vectors, and obtaining a final updated initial result vector pair set; and then extracting the vector pairs of the number of the result vectors from the finally updated initial result vector pair set to form a result vector pair set for representing similarity connection query results.
Further, updating the initial result vector pair set according to the character string set and the plurality of sub-vector sets until the number of vector pairs in the updated initial result vector pair set is not less than the number of result vectors, and obtaining a final updated initial result vector pair set, including:
step 1, determining a plurality of character string pairs which accord with the current character string condition according to the character string set to obtain a candidate character string pair set, and determining a sub-vector set pair corresponding to each pair of character string pairs in the candidate character string pair set according to a plurality of sub-vector sets;
step 2, determining at least one vector pair meeting the conditions of the current similar vector pair according to the corresponding sub-vector set pair of each pair of character string pairs;
step 3, updating the initial result vector pair set according to the at least one vector pair to obtain an updated initial result vector pair set, and respectively updating the character string condition and the similar vector pair condition according to the updated initial result vector pair set to obtain an updated character string condition and an updated similar vector pair condition;
and 4, repeatedly executing the steps 1 to 3 until the number of vector pairs in the updated initial result vector pair set is not less than the number of result vectors, and taking the updated initial result vector pair set obtained by executing the step 3 for the last time as a final updated initial result vector pair set.
In the implementation process, the steps 1 to 3 are continuously circulated, and in each circulation, the character string condition and the similar vector pair condition are respectively updated according to the updated initial result vector pair set until the number of the vector pairs in the obtained initial result vector pair set is not less than the number of the result vectors.
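The loop over steps 1 to 3 might look roughly like the sketch below. The source does not define the character string condition or the similar vector pair condition in this passage, so the sketch makes loud assumptions: the string condition is taken to be "a lower bound on the distance between two groups does not exceed the current threshold", the vector pair condition is "the actual distance does not exceed the current threshold", and the update simply enlarges the threshold by a fixed factor. The names `top_k_pairs`, `lower_bound` and `growth` are hypothetical.

```python
import math
from itertools import combinations, combinations_with_replacement, product


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k_pairs(groups, k, threshold, lower_bound, growth=2.0):
    """groups maps a character string to its sub-vector set (a list of vectors).

    lower_bound(word_a, word_b) must return a finite value that never exceeds
    the true distance between any vector of one group and any vector of the other.
    """
    if threshold <= 0 or growth <= 1:
        raise ValueError("threshold must be positive and growth must exceed 1")
    total_pairs = math.comb(sum(len(g) for g in groups.values()), 2)
    k = min(k, total_pairs)
    words = sorted(groups)
    results = {}  # (word_a, index_a, word_b, index_b) -> distance

    while True:
        # Step 1: character string pairs that satisfy the (assumed) string condition.
        candidates = [
            (wa, wb)
            for wa, wb in combinations_with_replacement(words, 2)
            if lower_bound(wa, wb) <= threshold
        ]
        # Steps 2 and 3: collect vector pairs that satisfy the (assumed) pair condition.
        for wa, wb in candidates:
            ga, gb = groups[wa], groups[wb]
            if wa == wb:
                index_pairs = combinations(range(len(ga)), 2)
            else:
                index_pairs = product(range(len(ga)), range(len(gb)))
            for i, j in index_pairs:
                dist = euclidean(ga[i], gb[j])
                if dist <= threshold:
                    results[(wa, i, wb, j)] = dist
        # Step 4: stop once enough pairs are found, otherwise relax the conditions.
        if len(results) >= k:
            break
        threshold *= growth

    ranked = sorted(results.items(), key=lambda item: item[1])[:k]
    return [(groups[wa][i], groups[wb][j], dist) for (wa, i, wb, j), dist in ranked]


groups = {"aa": [(0.0, 0.0), (1.0, 0.0)], "bb": [(10.0, 0.0), (10.5, 0.0)]}
closest = top_k_pairs(groups, k=2, threshold=0.4, lower_bound=lambda wa, wb: 0.0)
```

For brevity the sketch re-examines every candidate group pair on each pass; an implementation aiming to avoid redundant work would remember which group pairs were already processed. The trivial lower bound of 0 in the usage line prunes nothing and is there only to keep the example self-contained.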
Further, calculating a vector distance threshold according to the similarity distribution histogram and the number of the result vectors, including:
determining vector intervals corresponding to the number of the result vectors from a plurality of distance vector intervals as result vector intervals;
and determining a linear equation corresponding to the result vector interval, and calculating a vector distance threshold according to the linear equation.
In the implementation process, the vector distance threshold can be obtained by calculation according to the similarity distribution histogram of the original vector set and the quantity of the result vectors without manual presetting, so that the similarity connection query method is not required to be executed for multiple times by manually and continuously changing the vector distance threshold, a large amount of redundant calculation can be reduced, and the similarity connection query efficiency is improved.
Further, the equation of the straight line is:
Figure BDA0001975002870000071
where $x_0, x_1, x_2, \ldots, x_{i-1}, x_i, \ldots, x_m$ are the interval boundaries of the distance vector intervals, $k$ is the number of result vectors, $x_k$ is the vector distance threshold, $(x_{i-1}, x_i)$ is the result vector interval, and $(y_{i-1}, y_i)$ is the similar result quantity interval corresponding to the result vector interval;
where $k \in (y_{i-1}, y_i)$.
In the above implementation, $x_0, x_1, x_2, \ldots, x_{i-1}, x_i, \ldots, x_m$ are the interval boundaries, so the distance vector intervals are $(x_0, x_1), (x_1, x_2), \ldots, (x_{i-1}, x_i), \ldots, (x_{m-1}, x_m)$, and in the similarity distribution histogram each distance vector interval corresponds to a similar result quantity interval.
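The equation image itself is not reproduced in this text. Given the definitions above, a natural reading is the straight line through $(x_{i-1}, y_{i-1})$ and $(x_i, y_i)$, which gives $x_k = x_{i-1} + (k - y_{i-1})(x_i - x_{i-1})/(y_i - y_{i-1})$. The sketch below implements that reading; it is an assumption rather than a transcription of the patent formula, and it further assumes that each $y_i$ is the cumulative estimated number of pairs with distance up to $x_i$. The name `distance_threshold` is hypothetical.

```python
from itertools import accumulate


def distance_threshold(bounds, hist, k):
    """Interpolate the distance x_k at which about k pairs are expected.

    bounds: interval boundaries x_0 .. x_m
    hist:   estimated number of pairs in each interval (length m)
    Assumption: y_i is the running total of hist up to boundary x_i.
    """
    cumulative = [0.0] + list(accumulate(hist))
    for i in range(1, len(bounds)):
        y_prev, y_curr = cumulative[i - 1], cumulative[i]
        if y_prev < k <= y_curr:
            x_prev, x_curr = bounds[i - 1], bounds[i]
            # Straight line through (x_prev, y_prev) and (x_curr, y_curr), solved for x.
            return x_prev + (k - y_prev) * (x_curr - x_prev) / (y_curr - y_prev)
    raise ValueError("k must be positive and no larger than the estimated pair total")


threshold = distance_threshold(bounds=[0.0, 2.0, 4.0, 6.0], hist=[10, 30, 60], k=25)
```

Here the cumulative counts are 10, 40 and 100, so k = 25 falls in the second interval and the interpolated threshold lies halfway across it.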
A second aspect of the embodiments of the present application provides an apparatus for querying similarity connection, including:
the first acquisition module is used for acquiring an original vector set to be queried, the quantity of result vectors and an initial result vector pair set; the original vector set is a data set for similarity connection query, the initial result vector pair set is an initial data set for similarity connection query results, and the result vector quantity represents the vector pair quantity of the similarity connection query results;
the grouping module is used for grouping the original vector set to obtain a plurality of sub-vector grouping sets;
the construction module is used for constructing a similarity distribution histogram of the original vector set according to the number of the result vectors and the plurality of sub-vector grouping sets;
the calculation module is used for calculating a vector distance threshold according to the similarity distribution histogram and the quantity of the result vectors;
and the second acquisition module is used for updating the initial result vector pair set according to the plurality of sub-vector grouping sets, the vector distance threshold and the result vector quantity to obtain a result vector pair set used for representing the similarity connection query result.
In the implementation process, when a similarity connection query is performed, the first acquisition module first obtains the original vector set for the query, the number of vector pairs in the query result and the initial data set of the query result. The grouping module then groups the original vector set to obtain a plurality of sub-vector grouping sets, and the construction module constructs a similarity distribution histogram of the original vector set. The calculation module calculates a vector distance threshold according to the similarity distribution histogram and the number of result vectors, and finally the second acquisition module updates the initial result vector pair set according to the plurality of sub-vector grouping sets, the vector distance threshold and the number of result vectors to obtain a result vector pair set for representing the similarity connection query results. The vector distance threshold therefore does not need to be preset manually, and the similarity connection query method does not need to be executed repeatedly under manually changed thresholds, which reduces a large amount of redundant calculation and improves the similarity connection query efficiency.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and that those skilled in the art can also obtain other related drawings based on the drawings without inventive efforts.
Fig. 1 is a schematic flow block diagram of the similarity connection query method provided in embodiment 1 of the present application;
Fig. 2 is a schematic flow block diagram of the similarity connection query method provided in embodiment 2 of the present application;
Fig. 3 is a schematic structural block diagram of the similarity connection query apparatus provided in embodiment 3 of the present application;
Fig. 4 is a similarity distribution histogram of an original vector set provided in embodiment 2 of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance.
Example 1
Referring to fig. 1, fig. 1 is a schematic block diagram illustrating a flow of a similarity connection query method according to an embodiment of the present application. As shown in fig. 1, the similarity connection query method includes:
S101, acquiring an original vector set to be queried, a result vector quantity and an initial result vector pair set.
In the embodiment of the application, the original vector set is a data set for similarity connection query, the initial result vector pair set is an initial data set for similarity connection query results, and the number of result vectors represents the number of vector pairs for similarity connection query results.
Similarity connection query has important application in many fields, such as image clustering, repeated web page detection, similar user recommendation and the like. Correspondingly, the original vector set to be queried may be an original image data set, an original web page data set, an original user data set, and the like, which is not limited in this embodiment.
In the embodiment of the present application, the number of result vectors indicates the number of results of the similarity join query. In practical applications, the number of result vectors indicates the number of similar images in the original image data set, the number of repeated web page detection results in the original web page data set, the number of similar users in the original user data set, and the like.
In the embodiment of the present application, the set of initial result vector pairs may be an empty set, i.e., a set without any elements.
S102, grouping the original vector set to obtain a plurality of sub-vector grouping sets.
In the embodiment of the present application, the original vector set is a high-dimensional vector set, and the original vectors it contains are high-dimensional vectors. The original vector set may be subjected to dimensionality reduction by Piecewise Aggregate Approximation (PAA) to obtain a dimensionality reduction vector set, and the dimensionality reduction vector set may then be symbolized by Symbolic Aggregate approXimation (SAX) to obtain a character string set. Different original vectors may have the same character string, so the original vectors in the original vector set may be grouped by their SAX character strings to obtain a plurality of sub-vector grouping sets, and the original vectors in each sub-vector grouping set all correspond to the same character string.
S103, constructing a similarity distribution histogram of the original vector set according to the number of the result vectors and the plurality of sub-vector grouping sets.
S104, calculating a vector distance threshold according to the similarity distribution histogram and the number of the result vectors.
In the prior art, the similarity connection query method based on a threshold k that is used in practice requires the threshold (namely the vector distance threshold) to be preset manually. If the preset threshold is too large, a lot of redundant calculation is generated; if it is too small, the number of returned results does not meet the requirement and the calculation must be run again. In the similarity connection query method described in the embodiment of the present application, the vector distance threshold does not need to be preset manually; it is calculated from the original vector set, so different original vector sets may yield different vector distance thresholds, which reduces redundant calculation and improves the similarity connection query efficiency.
S105, updating the initial result vector pair set according to the plurality of sub-vector grouping sets, the vector distance threshold and the result vector quantity to obtain a result vector pair set for representing the similarity connection query result.
In the embodiment of the application, in practical application, when the similarity distribution histogram of the original vector set is constructed, the original vector set needs to be grouped to obtain a plurality of sub-vector grouping sets, and then the similarity distribution histogram of the original vector set is constructed according to the number of the result vectors and the plurality of sub-vector grouping sets. Before step S105 is executed, the original vector set may be subjected to grouping again, so as to obtain a plurality of sub-vector grouping sets. Although the plurality of sub-vector grouping sets obtained by the grouping processing twice are the same, because the data volume of the original vector set is large and the data volume of the plurality of sub-vector grouping sets obtained by grouping is also large, the plurality of sub-vector grouping sets obtained before the similarity distribution histogram of the original vector set is constructed are not always stored, and the original vector set needs to be grouped again before the step S105 is executed.
In the similarity connection query method described in Fig. 1, when a similarity connection query is performed, the original vector set for the query, the number of vector pairs in the query result and the initial data set of the query result are obtained first. The original vector set is then grouped to obtain a plurality of sub-vector grouping sets, and a similarity distribution histogram of the original vector set is constructed. A vector distance threshold is calculated according to the similarity distribution histogram and the number of result vectors, and finally the initial result vector pair set is updated according to the plurality of sub-vector grouping sets, the vector distance threshold and the number of result vectors to obtain a result vector pair set for representing the similarity connection query results. Therefore, with the similarity connection query method described in Fig. 1, the vector distance threshold does not need to be preset manually and the query does not need to be executed repeatedly under manually changed thresholds, so a large amount of redundant calculation can be avoided and the similarity connection query efficiency can be improved.
Example 2
Referring to Fig. 2, Fig. 2 is a schematic flow block diagram of a similarity connection query method according to an embodiment of the present application. As shown in Fig. 2, the similarity connection query method includes:
S201, acquiring an original vector set to be queried, a result vector quantity and an initial result vector pair set.
In the embodiment of the application, the original vector set is a data set for similarity connection query, the initial result vector pair set is an initial data set for similarity connection query results, and the number of result vectors represents the number of vector pairs for similarity connection query results.
S202, obtaining original vector dimensions of the original vector set.
In the embodiment of the present application, the original vector set is a high-dimensional vector set that includes a plurality of original vectors, and the dimension of each original vector is the original vector dimension. If the original vector dimension is n, every original vector in the original vector set has dimension n. An n-dimensional original vector is a point in n-dimensional Euclidean space (whose elements are n-dimensional vectors).
S203, performing dimensionality reduction processing on the original vector set according to the original vector dimensionality and the preset dimensionality reduced dimensionality to obtain a dimensionality reduced vector set.
In the embodiment of the application, the dimensionality of the dimensionality reduction vector set obtained by performing dimensionality reduction on the original vector dimensionality is the dimensionality after the preset dimensionality reduction.
As an optional implementation, the dimensionality reduction of the original vectors may use the Piecewise Aggregate Approximation (PAA) dimensionality reduction technique. Let the original vector dimension of the original vector set be $d$, the dimension after dimensionality reduction be $d'$, and any original vector in the original vector set be $X = \langle x_1, x_2, \ldots, x_d \rangle$. The dimensionality reduction vector corresponding to that original vector in the dimensionality reduction vector set is
Figure BDA0001975002870000121
Figure BDA0001975002870000122
Then the original vector dimension is subjected to dimension reduction processing, which can be calculated according to the following formula:
Figure BDA0001975002870000123
wherein x isjFor the elements in which the original vector is X,
Figure BDA0001975002870000124
for the reduced-dimension vector X corresponding to the original vectorpOf (1).
In the above embodiment, the distance $D_p$ between two dimensionality reduction vectors in the dimensionality reduction vector set is a lower bound of the Euclidean distance $D_E$ between the two corresponding original vectors in the original vector set, i.e. $D_p \le D_E$.
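The PAA step described above can be illustrated with a minimal sketch. It assumes that the original dimension d is an exact multiple of the reduced dimension d' (the source does not say how other cases are handled), and the function name `paa` is hypothetical.

```python
def paa(vector, reduced_dim):
    """Piecewise aggregate approximation: the mean of each equal-length segment."""
    d = len(vector)
    if d % reduced_dim != 0:
        # Assumption of this sketch only: d must be a multiple of d'.
        raise ValueError("this sketch assumes d is a multiple of d'")
    seg = d // reduced_dim  # number of original elements per reduced element
    return [sum(vector[i * seg:(i + 1) * seg]) / seg for i in range(reduced_dim)]


x = [1.0, 3.0, 2.0, 6.0, 5.0, 7.0]
print(paa(x, 3))  # prints [2.0, 4.0, 6.0]
```

Each reduced element is the average of d/d' consecutive original elements, which is why distances between reduced vectors cannot exceed the Euclidean distances between the originals.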
After step S203, the following steps are also included:
S204, performing symbolization processing on the dimensionality reduction vector set to obtain a character string set.
As an alternative, the symbolization of the dimensionality reduction vector set may adopt the Symbolic Aggregate approXimation (SAX) method. That is, according to a certain rule, each dimensionality reduction vector in the dimensionality reduction vector set is mapped to a character string in the character string set, so that an original vector X of dimension d can be represented by a character string in the character string set, recorded as:
Figure BDA0001975002870000125
Figure BDA0001975002870000126
S205, grouping the original vector set according to the character string set to obtain a plurality of sub-vector sets, where one character string in the character string set corresponds to one sub-vector set.
In the embodiment of the present application, by performing the above steps S202 to S205, the original vector set can be grouped to obtain a plurality of sub-vector grouped sets.
In the embodiment of the present application, after performing dimension reduction and symbolization on the original vectors of the original vector set, different original vectors may be mapped to the same character string, so that the original data may be grouped by using the character string, and the original vectors of the same character string may be grouped into one group.
In the embodiment of the present application, the probability of similarity between original vectors in a set of sub-vectors corresponding to one character string is high, and the probability of similarity between original vectors in sets of sub-vectors corresponding to different character strings is low.
In the embodiment of the present application, it is assumed that the original vector set includes six original vectors $X_1$, $X_2$, $X_3$, $X_4$, $X_5$, $X_6$, and that after dimensionality reduction and symbolization of the original vector set, a character string set including three character strings $s_1$, $s_2$, $s_3$ is obtained, where the character string corresponding to $X_1$, $X_2$ and $X_3$ is $s_1$, the character string corresponding to $X_4$ is $s_2$, and the character string corresponding to $X_5$ and $X_6$ is $s_3$. The six original vectors of the original vector set can then be divided into three sub-vector sets according to the character string set: the first sub-vector set includes the three vectors $X_1$, $X_2$, $X_3$ and corresponds to the character string $s_1$; the second sub-vector set includes the single vector $X_4$ and corresponds to $s_2$; the third sub-vector set includes the two vectors $X_5$, $X_6$ and corresponds to $s_3$.
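The grouping in steps S203 to S205 can be sketched as follows. This is an illustration under stated assumptions rather than the method as claimed: the vectors are assumed to be roughly z-normalized, the breakpoints are the usual SAX choice of equiprobable regions under a standard normal distribution, and all function names (`paa`, `breakpoints`, `to_sax`, `group_by_string`) are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict
from statistics import NormalDist


def paa(vector, reduced_dim):
    seg = len(vector) // reduced_dim  # assumes d is a multiple of d'
    return [sum(vector[i * seg:(i + 1) * seg]) / seg for i in range(reduced_dim)]


def breakpoints(alphabet_size):
    """Cut points splitting a standard normal distribution into equiprobable regions."""
    return [NormalDist().inv_cdf(i / alphabet_size) for i in range(1, alphabet_size)]


def to_sax(reduced, cuts):
    """Map each reduced value to a letter: 'a' for the lowest region, 'b' for the next, ..."""
    return "".join(chr(ord("a") + bisect_right(cuts, value)) for value in reduced)


def group_by_string(vectors, reduced_dim, alphabet_size):
    """Group original vectors by the character string they are mapped to."""
    cuts = breakpoints(alphabet_size)
    groups = defaultdict(list)
    for v in vectors:
        groups[to_sax(paa(v, reduced_dim), cuts)].append(v)
    return dict(groups)


vectors = [(-1.0, -1.0, 1.0, 1.0), (-0.9, -1.1, 0.8, 1.2), (1.0, 1.0, -1.0, -1.0)]
groups = group_by_string(vectors, reduced_dim=2, alphabet_size=4)
print(groups)
```

In this example the first two vectors share the string "ad" and form one sub-vector set, while the third maps to "da" and forms its own set.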
After step S205, the following steps are also included:
S206, obtaining a plurality of distance vector intervals according to a preset interval division rule and the original vector set.
In the embodiment of the present application, the preset interval division rule may be equal interval division, and the like, and the embodiment is not limited in any way.
As an optional implementation manner, obtaining a plurality of distance vector intervals according to a preset interval division rule and an original vector set may include the following steps:
carrying out pairing combination processing on original vectors in an original vector set to obtain a plurality of original vector pairs;
calculating the vector distance between two original vectors in each original vector pair to obtain a vector distance set;
selecting the maximum vector distance from the vector distance set as an interval upper limit value of the to-be-divided intervals;
and dividing the interval to be divided into a plurality of sub-intervals according to the interval upper limit value, the preset interval lower limit value and the preset interval division rule, wherein the plurality of sub-intervals are a plurality of distance vector intervals.
In the above embodiments, the vector distance between two original vectors may be a euclidean distance, a manhattan distance, a chebyshev distance, a minkowski distance, a Mahalanobis distance (Mahalanobis distance), or the like, but the present embodiment is not limited thereto.
In the above embodiments, the Euclidean distance, also referred to as the Euclidean metric, is the true distance between two points in an m-dimensional space, or the natural length of a vector (i.e., the distance of the point from the origin). The Euclidean distance in two and three dimensions is the actual distance between two points.
As a further alternative embodiment, the vector distance between two original vectors is the Euclidean distance. The Euclidean distance between the original vector $X = \langle x_1, x_2, \ldots, x_d \rangle$ and the original vector $Y = \langle y_1, y_2, \ldots, y_d \rangle$ is calculated as:
$$dist(X, Y) = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2}$$
where dist(X, Y) is the Euclidean distance between the original vector X and the original vector Y, and d is the original vector dimension.
In the above embodiment, the preset lower limit of the interval may be 0, and the present embodiment is not limited thereto.
In the above embodiment, if the preset interval division rule is to divide the to-be-divided interval into 4 distance vector intervals at equal intervals, the preset interval lower limit is 0, and the obtained interval upper limit is $dist_{max}$, then the to-be-divided interval is $(0, dist_{max})$. For example, when $dist_{max} = 20$, the to-be-divided interval is (0, 20), and it is divided according to the preset interval division rule into 4 distance vector intervals: (0, 5), (5, 10), (10, 15) and (15, 20).
In the above embodiment, the opening and closing of the two end points of the section is not limited.
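The interval construction in step S206 can be sketched as below, assuming the Euclidean distance and the equal-interval rule. The brute-force pairwise maximum follows the steps as written and is quadratic in the number of vectors; the function name `distance_intervals` is hypothetical.

```python
import math
from itertools import combinations


def distance_intervals(vectors, n_intervals, lower=0.0):
    """Split [lower, max pairwise distance] into n_intervals equal-width intervals."""
    # Upper limit: the largest Euclidean distance over all pairs of original vectors.
    upper = max(math.dist(u, v) for u, v in combinations(vectors, 2))
    width = (upper - lower) / n_intervals
    edges = [lower + i * width for i in range(n_intervals)] + [upper]
    return list(zip(edges[:-1], edges[1:]))


points = [(0.0, 0.0), (12.0, 16.0), (3.0, 4.0)]
intervals = distance_intervals(points, 4)
print(intervals)
```

For these three points the largest pairwise distance is 20, so the result reproduces the four intervals of the example above, up to floating-point rounding.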
After step S206, the following steps are also included:
S207, calculating an overall similarity result estimation value corresponding to each distance vector interval according to the plurality of sub-vector sets.
In this embodiment of the present application, calculating an overall similarity result estimation value corresponding to each distance vector interval according to a plurality of sub-vector sets may include the following steps:
sampling each sub vector set to obtain a sampling vector set corresponding to each sub vector set;
determining at least one sub-vector set corresponding to each distance vector interval;
calculating an intra-group result quantity estimation value and an inter-group result quantity estimation value corresponding to each distance vector interval according to at least one sub-vector set corresponding to each distance vector interval;
calculating an overall similar result quantity estimation value corresponding to each distance vector interval according to the intra-group result quantity estimation value corresponding to each distance vector interval and the inter-group result quantity estimation value corresponding to each distance vector interval; and the total similar result quantity estimation value is the total similar result quantity estimation value of at least one sub vector set corresponding to the distance vector interval.
In the embodiment of the present application, for example, grouping the original vector set yields m sub-vector sets, where the number of vectors in the i-th sub-vector set is $N_i$, i = 1, …, m. The specific steps of calculating the overall similarity result estimation value corresponding to each distance vector interval according to the plurality of sub-vector sets are as follows:
(1) Random sampling is performed on each sub-vector set to obtain the sampling sample set corresponding to each sub-vector set. The number of sample vectors in the sampling sample set corresponding to the i-th sub-vector set is $S_i$.
(2) At least one sub-vector set corresponding to each distance vector interval is determined. Because one character string in the character string set corresponds to one sub-vector set, all character strings in the character string set are combined pairwise to obtain a plurality of character string pairs, the character string distance between the two character strings in each pair is calculated, and the sub-vector sets corresponding to each distance vector interval are then determined according to the character string distance. For example, assume that three distance vector intervals are obtained in total, that character string 1 corresponds to the first sub-vector set, that character string 2 corresponds to the second sub-vector set, and that the character string distance between character string 1 and character string 2 is dist1. If dist1 falls in the third distance vector interval, the first sub-vector set and the second sub-vector set correspond to the third distance vector interval.
(3) Taking the calculation of the overall similar result quantity estimation value of the sub-vector sets corresponding to one distance vector interval as an example, let the sub-vector sets corresponding to the distance vector interval $(x_{i-1}, x_i)$ be $X_1$, $X_2$, $X_3$, where the sampling sample set corresponding to $X_1$ is $T_1$, the sampling sample set corresponding to $X_2$ is $T_2$, and the sampling sample set corresponding to $X_3$ is $T_3$. The specific calculation steps are as follows:
First, the number of within-sample similar results of $T_1$, $T_2$ and $T_3$ is calculated separately. The number of within-sample similar results of the sampling sample set $T_i$ corresponding to the i-th sub-vector set $X_i$ is $R_i$.
Second, the number of inter-sample similar results between $T_1$ and $T_2$, between $T_2$ and $T_3$, and between $T_1$ and $T_3$ is calculated separately. The number of inter-sample similar results between the sampling sample sets corresponding to the i-th sub-vector set $X_i$ and the j-th sub-vector set $X_j$ is $R_{ij}$.
Third, the intra-group result quantity estimates of $X_1$, $X_2$ and $X_3$ are calculated separately. The intra-group result quantity estimate corresponding to the i-th sub-vector set $X_i$ is
Figure BDA0001975002870000161
Fourth, the inter-group result quantity estimates corresponding to $X_1$ and $X_2$, $X_2$ and $X_3$, and $X_1$ and $X_3$ are calculated separately. The inter-group result quantity estimate corresponding to the i-th sub-vector set $X_i$ and the j-th sub-vector set $X_j$ is
Figure BDA0001975002870000162
Fifth, the overall similar result quantity estimate corresponding to the distance vector interval $(x_{i-1}, x_i)$ is calculated as
Figure BDA0001975002870000163
In the embodiment of the present application, let the sub-vector sets corresponding to the distance vector interval $(x_{i-1}, x_i)$ be $X_1$, $X_2$, $X_3$, where the sampling sample set corresponding to $X_1$ is $T_1$, the sampling sample set corresponding to $X_2$ is $T_2$, and the sampling sample set corresponding to $X_3$ is $T_3$. Taking the sampling sample set $T_1$ as an example, the number of within-sample similar results is calculated as follows: the sample vectors in $T_1$ are combined pairwise to obtain a plurality of sample vector pairs, the vector distance (which may be the Euclidean distance) between the two sample vectors of each pair is calculated, and the number of vector distances that fall within the distance vector interval $(x_{i-1}, x_i)$ is the number of within-sample similar results of $T_1$.
In addition, taking the number of inter-sample similar results between the sampling sample sets $T_1$ and $T_2$ as an example, $dist(t_i, t_j)$ is calculated for $t_i \in G(T_1)$ and $t_j \in G(T_2)$, and the number of sample vector pairs with $dist(t_i, t_j) \in (x_{i-1}, x_i)$ is the number of inter-sample similar results between $T_1$ and $T_2$.
In the embodiment of the application, calculating the total intra-group result quantity estimated values of the plurality of sub-vector sets is to sum up the intra-group result quantity estimated values corresponding to the plurality of sub-vector sets, and similarly, calculating the total inter-group result quantity estimated values of the plurality of sub-vector sets is to sum up the inter-group result quantity estimated values between every two sub-vector sets; calculating the total similar result quantity estimated value corresponding to the distance vector interval is to sum the total intra-group result quantity estimated value and the inter-group result quantity estimated value of a plurality of sub-vector sets corresponding to the distance vector interval.
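The sampling-based estimate for one distance vector interval can be sketched as follows. The estimate formulas themselves appear in the source only as figures, so the scaling factors used here are an assumption: sample counts are scaled by the ratio of possible pairs in the full group to possible pairs in the sample. The half-open interval convention and the function names are likewise hypothetical.

```python
import math
import random
from itertools import combinations


def count_in_interval(pairs, low, high):
    """Count pairs whose Euclidean distance lies in (low, high] (assumed convention)."""
    return sum(1 for u, v in pairs if low < math.dist(u, v) <= high)


def estimate_interval_total(groups, sample_sizes, low, high, seed=0):
    """Estimate the number of similar results in (low, high] for the given groups."""
    rng = random.Random(seed)
    samples = [rng.sample(g, min(s, len(g))) for g, s in zip(groups, sample_sizes)]
    total = 0.0
    # Intra-group part: scale R_i by (pairs in group) / (pairs in sample). Assumed scaling.
    for g, t in zip(groups, samples):
        n, s = len(g), len(t)
        if s >= 2:
            r_i = count_in_interval(combinations(t, 2), low, high)
            total += r_i * (n * (n - 1)) / (s * (s - 1))
    # Inter-group part: scale R_ij by (N_i * N_j) / (S_i * S_j). Assumed scaling.
    for (gi, ti), (gj, tj) in combinations(list(zip(groups, samples)), 2):
        if ti and tj:
            r_ij = count_in_interval(((u, v) for u in ti for v in tj), low, high)
            total += r_ij * (len(gi) * len(gj)) / (len(ti) * len(tj))
    return total


groups = [[(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)], [(10.0, 0.0), (11.0, 0.0)]]
print(estimate_interval_total(groups, [3, 2], 0.0, 2.0))
print(estimate_interval_total(groups, [3, 2], 8.0, 12.0))
```

When every group is sampled in full, as in this usage example, the scaling factors equal 1 and the estimate coincides with the exact count, which gives a simple sanity check for the sketch.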
In the embodiment of the present application, given two vectors X and Y whose corresponding character strings after symbolization are $X_s$ and $Y_s$, the distance between the character strings $X_s$ and $Y_s$ is calculated as:
$$MINDIST(X_s, Y_s) = \sqrt{\frac{d}{d'}} \sqrt{\sum_{i=1}^{d'} \big( dist(x_{s,i}, y_{s,i}) \big)^2}$$
where $MINDIST(X_s, Y_s)$ represents the distance between the character string $X_s$ and the character string $Y_s$, $x_{s,i}$ and $y_{s,i}$ are the i-th characters of $X_s$ and $Y_s$, $dist(\cdot, \cdot)$ here denotes the distance between two characters, d is the original vector dimension, and d' is the preset dimension after reduction.
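A minimal sketch of a character string distance of this kind is given below. It uses the standard SAX lower-bounding distance, in which adjacent letters have distance 0 and other letter pairs are separated by the gap between their breakpoints; whether the source uses exactly this per-character distance is an assumption, and the function names are hypothetical.

```python
import math
from statistics import NormalDist


def breakpoints(alphabet_size):
    """Cut points splitting a standard normal distribution into equiprobable regions."""
    return [NormalDist().inv_cdf(i / alphabet_size) for i in range(1, alphabet_size)]


def symbol_dist(a, b, cuts):
    """Standard SAX per-character distance: 0 for equal or adjacent letters."""
    i, j = sorted((ord(a) - ord("a"), ord(b) - ord("a")))
    return 0.0 if j - i <= 1 else cuts[j - 1] - cuts[i]


def mindist(s1, s2, original_dim, cuts):
    """Distance between two equal-length SAX strings, scaled by sqrt(d / d')."""
    reduced_dim = len(s1)
    total = sum(symbol_dist(a, b, cuts) ** 2 for a, b in zip(s1, s2))
    return math.sqrt(original_dim / reduced_dim) * math.sqrt(total)


cuts = breakpoints(4)
print(mindist("ab", "ba", 4, cuts))  # adjacent letters only, so the distance is 0.0
print(mindist("ad", "da", 4, cuts))
```

Because this distance never exceeds the Euclidean distance between the underlying vectors, a pair of strings whose distance is above the threshold can be skipped without comparing any of the vectors in the two groups.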
After step S207, the following steps are also included:
S208, constructing a similarity distribution histogram of the original vector set according to the overall similarity result estimation value corresponding to each distance vector interval.
In the embodiment of the present application, by implementing the steps S206 to S208, a similarity distribution histogram of the original vector set can be constructed according to the number of the result vectors and the plurality of sub-vector grouping sets.
Referring to fig. 4, fig. 4 is a similarity distribution histogram of an original vector set according to the present embodiment. As shown in fig. 4, the abscissa of the similarity distribution histogram represents the distance vector intervals, and the ordinate represents the overall similarity result estimation value corresponding to each distance vector interval. Here $x_0, x_1, x_2, \ldots, x_{i-1}, x_i, \ldots, x_m$ are the interval boundaries of the distance vector intervals, k is the number of result vectors, and $(y_{i-1}, y_i)$ is the similar result quantity interval corresponding to the result vector interval. Assuming that the overall similarity result estimation value increases linearly between two adjacent interval boundaries, the equation of the straight line between $x_{i-1}$ and $x_i$ can be expressed as:
$$y = y_{i-1} + \frac{y_i - y_{i-1}}{x_i - x_{i-1}} (x - x_{i-1})$$
Then, when the number of result vectors is k, substituting y = k into the corresponding straight-line equation gives the corresponding vector distance threshold $x_k$.
S209, determining the vector interval corresponding to the number of result vectors from the plurality of distance vector intervals as the result vector interval.
In the embodiment of this application, when $k \in (y_{i-1}, y_i)$, the result vector interval is $(x_{i-1}, x_i)$, as shown in the similarity distribution histogram of FIG. 4.
S210, determining a linear equation corresponding to the result vector interval, and calculating a vector distance threshold according to the linear equation.
In the embodiment of the present application, by implementing the steps S209 to S210, the vector distance threshold can be calculated according to the similarity distribution histogram and the number of result vectors.
As an alternative embodiment, according to the similarity distribution histogram shown in FIG. 4, when $k \in (y_{i-1}, y_i)$, the straight-line equation gives:
$$x_k = x_{i-1} + \frac{(k - y_{i-1})(x_i - x_{i-1})}{y_i - y_{i-1}}$$
where $x_0, x_1, x_2, \ldots, x_{i-1}, x_i, \ldots, x_m$ are the interval boundaries of the distance vector intervals, k is the number of result vectors, $x_k$ is the vector distance threshold, $(x_{i-1}, x_i)$ is the result vector interval, and $(y_{i-1}, y_i)$ is the similar result quantity interval corresponding to the result vector interval.
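Steps S209 and S210 amount to a linear interpolation on the histogram, which can be sketched as below. The sketch assumes that the values $y_0, \ldots, y_m$ are cumulative estimated result counts at the interval boundaries, so that they are non-decreasing; the function name `distance_threshold` is hypothetical.

```python
def distance_threshold(boundaries, cumulative, k):
    """Interpolate the distance x_k at which the estimated result count reaches k.

    boundaries: x_0 .. x_m, the interval boundaries.
    cumulative: y_0 .. y_m, assumed non-decreasing estimated result counts.
    """
    for i in range(1, len(boundaries)):
        y_lo, y_hi = cumulative[i - 1], cumulative[i]
        if y_lo <= k <= y_hi and y_hi > y_lo:
            # (x_lo, x_hi) is the result vector interval containing k.
            x_lo, x_hi = boundaries[i - 1], boundaries[i]
            return x_lo + (k - y_lo) * (x_hi - x_lo) / (y_hi - y_lo)
    raise ValueError("k lies outside the range covered by the histogram")


x_k = distance_threshold([0, 5, 10, 15, 20], [0, 10, 40, 100, 200], k=25)
print(x_k)  # prints 7.5
```

Here k = 25 falls between the counts 10 and 40, so the result vector interval is (5, 10) and the interpolated threshold is 7.5.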
S211, judging whether the number of the vector pairs in the initial result vector pair set is less than the number of the result vectors, if so, executing the step S213; if not, step S212 to step S213 are executed.
S212, updating the initial result vector pair set according to the character string set and the plurality of sub-vector sets until the number of vector pairs in the updated initial result vector pair set is not less than the number of result vectors, and obtaining a final updated initial result vector pair set.
As an optional implementation manner, updating the initial result vector pair set according to the character string set and the plurality of sub-vector sets until the number of vector pairs in the updated initial result vector pair set is not less than the number of result vectors, to obtain a final updated initial result vector pair set, may include the following steps:
step 1, determining a plurality of character string pairs which accord with the current character string condition according to a character string set to obtain a candidate character string pair set, and determining a subvector set pair corresponding to each character string pair in the candidate character string pair set according to a plurality of subvector sets;
step 2, determining at least one vector pair meeting the conditions of the current similar vector pair according to the corresponding sub-vector set pair of each pair of character string pairs;
step 3, updating the initial result vector pair set according to at least one vector pair to obtain an updated initial result vector pair set, and respectively updating the character string condition and the similar vector pair condition according to the updated initial result vector pair set to obtain an updated character string condition and an updated similar vector pair condition;
and 4, repeatedly executing the steps 1 to 3 until the number of the vector pairs in the updated initial result vector pair set is not less than the number of the result vectors, and taking the updated initial result vector pair set obtained by executing the step 3 for the last time as a final updated initial result vector pair set.
In the above embodiment, the current character string condition may be: for a character string pair $(SAX_i, SAX_j)$, the corresponding character string distance $MINDIST(SAX_i, SAX_j)$ is less than or equal to the current vector distance threshold.
In the above embodiment, the current similar vector pair condition may be: for any vector pair $(v_m, v_n)$, where $v_m \in G(SAX_i)$ and $v_n \in G(SAX_j)$, $dist(v_m, v_n)$ is less than or equal to the current vector distance threshold.
In the above embodiment, updating the character string condition and the similar vector pair condition according to the updated initial result vector pair set in effect means updating the vector distance threshold used by both conditions. In the first cycle, the current vector distance threshold is the vector distance threshold calculated from the similarity distribution histogram.
As a further optional implementation manner, when the vector distance threshold is updated, the vector distances of all vector pairs in the currently updated initial result vector pair set are calculated to obtain a distance set, the maximum vector distance in the distance set is taken as the new vector distance threshold, and the character string condition and the similar vector pair condition are then updated according to the new vector distance threshold.
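A single pass of steps 1 and 2 (filtering character string pairs, then vector pairs, against the current threshold) can be sketched as follows. The character string distance is passed in as a callable named `lower_bound`, which is hypothetical, and treating a string paired with itself as a candidate is an assumption of this sketch; the threshold update and the repetition in steps 3 and 4 are not shown.

```python
import math
from itertools import combinations


def one_pass(groups, lower_bound, threshold):
    """Return (distance, u, v) for vector pairs that satisfy both conditions."""
    keys = sorted(groups)
    found = []
    for i, a in enumerate(keys):
        for b in keys[i:]:
            # Character string condition: skip group pairs that are too far apart.
            if lower_bound(a, b) > threshold:
                continue
            if a == b:
                candidates = combinations(groups[a], 2)
            else:
                candidates = ((u, v) for u in groups[a] for v in groups[b])
            # Similar vector pair condition: keep pairs within the threshold.
            for u, v in candidates:
                d = math.dist(u, v)
                if d <= threshold:
                    found.append((d, u, v))
    return found


groups = {"aa": [(0.0, 0.0), (1.0, 0.0)], "cc": [(10.0, 10.0)]}
pairs = one_pass(groups, lambda a, b: 0.0 if a == b else 5.0, threshold=2.0)
print(pairs)
```

In this example the pair of groups "aa" and "cc" is pruned by the character string condition alone, so none of the cross-group vector distances are computed.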
After step S212, the method further includes the following steps:
S213, extracting a number of vector pairs equal to the number of result vectors from the final updated initial result vector pair set, and combining them to obtain a result vector pair set for representing the similarity connection query result.
In the embodiment of the present application, by implementing the steps S211 to S213, the initial result vector pair set can be updated according to the plurality of sub-vector grouping sets, the vector distance threshold, and the number of result vectors, so as to obtain a result vector pair set for representing the similarity connection query result.
In the above embodiment, if the number of vector pairs in the final updated initial result vector pair set is greater than or equal to the number of result vectors, a number of vector pairs equal to the number of result vectors is extracted from that set and combined to obtain the result vector pair set. Before extraction, the vector pairs in the final updated initial result vector pair set are sorted in ascending order of their vector distance; the vector pairs are then extracted in that order, smallest distance first, until the number of result vectors is reached.
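The extraction in step S213 can be sketched as follows, assuming the Euclidean distance. `heapq.nsmallest` returns the k pairs with the smallest distances in ascending order, which is equivalent to sorting and taking the first k; the function name `top_k_pairs` is hypothetical.

```python
import heapq
import math


def top_k_pairs(pairs, k):
    """Return the k vector pairs with the smallest Euclidean distance, closest first."""
    return heapq.nsmallest(k, pairs, key=lambda p: math.dist(p[0], p[1]))


candidates = [((0, 0), (3, 4)), ((0, 0), (1, 0)), ((0, 0), (0, 2))]
print(top_k_pairs(candidates, 2))
```

With k = 2 the pair at distance 5 is dropped and the pairs at distances 1 and 2 are returned in that order.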
It can be seen that, by implementing the similarity connection query method described in fig. 2, there is no need to manually preset a vector distance threshold, nor to manually change the vector distance threshold repeatedly and execute the similarity connection query method multiple times, so that a large amount of redundant calculation can be avoided and the similarity connection query efficiency can be improved.
Example 3
Referring to fig. 3, fig. 3 is a schematic block diagram of a similarity connection query apparatus according to an embodiment of the present application. As shown in fig. 3, the similarity connection query apparatus includes:
a first obtaining module 310, configured to obtain an original vector set, a result vector quantity, and an initial result vector pair set to be queried; the initial result vector pair set is an initial data set of similarity connection query results, and the number of result vectors represents the number of vector pairs of the similarity connection query results.
a grouping module 320, configured to perform grouping processing on the original vector set to obtain a plurality of sub-vector grouping sets.
a constructing module 330, configured to construct a similarity distribution histogram of the original vector set according to the number of result vectors and the plurality of sub-vector grouping sets.
a calculating module 340, configured to calculate a vector distance threshold according to the similarity distribution histogram and the number of result vectors.
a second obtaining module 350, configured to update the initial result vector pair set according to the plurality of sub-vector grouping sets, the vector distance threshold, and the number of result vectors, to obtain a result vector pair set used for representing a similarity connection query result.
It can be seen that, by implementing the similarity connection query apparatus described in fig. 3, there is no need to manually preset a vector distance threshold, nor to manually change the vector distance threshold repeatedly and execute the similarity connection query method multiple times, so that a large amount of redundant computation can be avoided and the similarity connection query efficiency is improved.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application. It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

Claims (9)

1. A similarity connection query method, comprising:
acquiring an original vector set to be queried, the quantity of result vectors and an initial result vector pair set; the original vector set is an original image data set for similarity connection query, the initial result vector pair set is an initial image data set for similarity connection query results, and the result vector quantity represents the number of vector pairs of similarity connection query results and the quantity of similar images in the original image data set;
grouping the original vector set to obtain a plurality of sub-vector grouping sets;
constructing a similarity distribution histogram of the original vector set according to the number of the result vectors and the plurality of sub-vector grouping sets;
calculating a vector distance threshold according to the similarity distribution histogram and the quantity of the result vectors;
updating the initial result vector pair set according to the plurality of sub-vector grouping sets, the vector distance threshold and the result vector quantity to obtain a result vector pair set for representing similarity connection query results;
wherein the constructing a similarity distribution histogram of the original vector set according to the number of the result vectors and the plurality of sub-vector grouping sets comprises:
acquiring a plurality of distance vector intervals according to a preset interval division rule and the original vector set;
calculating an overall similarity result estimation value corresponding to each distance vector interval according to the plurality of sub-vector grouping sets;
and constructing a similarity distribution histogram of the original vector set according to the overall similarity result estimation value corresponding to each distance vector interval.
2. The similarity connection query method according to claim 1, wherein the obtaining a plurality of distance vector intervals according to a preset interval partition rule and the original vector set comprises:
pairing and combining the original vectors in the original vector set to obtain a plurality of original vector pairs;
calculating the vector distance between two original vectors in each original vector pair to obtain a vector distance set;
selecting the maximum vector distance from the vector distance set as an interval upper limit value of the to-be-divided intervals;
and dividing the interval to be divided into a plurality of subintervals according to the interval upper limit value, the preset interval lower limit value and the preset interval division rule, wherein the plurality of subintervals are a plurality of distance vector intervals.
3. The method according to claim 1, wherein the calculating an overall similarity result estimation value corresponding to each distance vector interval according to the plurality of subvector grouping sets comprises:
sampling each sub-vector grouping set to obtain a sampling vector set corresponding to each sub-vector grouping set;
determining at least one sub-vector grouping set corresponding to each distance vector interval;
calculating an intra-group result quantity estimation value and an inter-group result quantity estimation value corresponding to each distance vector interval according to at least one sub-vector grouping set corresponding to each distance vector interval;
calculating a total similar result quantity estimated value corresponding to each distance vector interval according to the intra-group result quantity estimated value corresponding to each distance vector interval and the inter-group result quantity estimated value corresponding to each distance vector interval; wherein the overall similar result quantity estimation value is an overall similar result quantity estimation value of the at least one sub-vector grouping set corresponding to the distance vector interval.
4. The similarity connection query method according to claim 1, wherein the grouping processing performed on the original vector set to obtain a plurality of sub-vector grouping sets comprises:
acquiring the original vector dimension of the original vector set;
performing dimensionality reduction processing on the original vector set according to the original vector dimension and a preset reduced dimension to obtain a dimensionality-reduced vector set;
performing symbolization processing on the dimensionality-reduced vector set to obtain a character string set;
grouping the original vector set according to the character string set to obtain a plurality of sub-vector grouping sets, wherein one character string in the character string set corresponds to one sub-vector grouping set.
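The reduce, symbolize and group steps above can be illustrated as follows. The claim does not name the reduction or symbolization technique; this sketch assumes segment averaging (in the style of piecewise aggregate approximation) and fixed breakpoints mapped to letters (in the style of SAX). The breakpoints, the alphabet and all function names are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict

def reduce_dimensions(vector, target_dim):
    """Average consecutive segments; assumes len(vector) % target_dim == 0."""
    seg = len(vector) // target_dim
    return [sum(vector[i * seg:(i + 1) * seg]) / seg for i in range(target_dim)]

def symbolize(reduced, breakpoints, alphabet="abcd"):
    """Map each reduced value to a letter; needs len(breakpoints) + 1 letters."""
    return "".join(alphabet[bisect_right(breakpoints, v)] for v in reduced)

def group_vectors(vectors, target_dim, breakpoints):
    """Group original vectors by the string of their reduced form."""
    groups = defaultdict(list)
    for v in vectors:
        key = symbolize(reduce_dimensions(v, target_dim), breakpoints)
        groups[key].append(v)  # one string corresponds to one group
    return dict(groups)

groups = group_vectors(
    [(0.1, 0.2, 0.8, 0.9), (0.2, 0.1, 0.9, 0.8), (0.9, 0.9, 0.1, 0.1)],
    target_dim=2,
    breakpoints=[0.25, 0.5, 0.75],
)
```

In the example, the first two vectors share the string "ad" and fall into one group, while the third maps to "da".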
5. The similarity join query method of claim 4, wherein the updating the initial set of result vector pairs according to the plurality of sets of sub-vector groups, the vector distance threshold, and the number of result vectors to obtain a set of result vector pairs representing the result of the similarity join query comprises:
judging whether the number of vector pairs in the initial result vector pair set is smaller than the number of result vectors;
if the number of vector pairs in the initial result vector pair set is less than the number of result vectors, updating the initial result vector pair set according to the character string set and the plurality of sub-vector grouping sets until the number of vector pairs in the updated initial result vector pair set is not less than the number of result vectors, and obtaining a final updated initial result vector pair set;
and extracting, from the final updated initial result vector pair set, a number of vector pairs equal to the number of result vectors, and combining them to obtain a result vector pair set for representing similarity connection query results.
6. The similarity connection query method according to claim 5, wherein updating the initial result vector pair set according to the character string set and the plurality of sub-vector grouping sets until the number of vector pairs in the updated initial result vector pair set is not less than the number of result vectors, to obtain a final updated initial result vector pair set, includes:
step 1, determining a plurality of character string pairs which meet the current character string condition according to the character string set to obtain a candidate character string pair set, and determining the sub-vector grouping set pair corresponding to each character string pair in the candidate character string pair set according to the plurality of sub-vector grouping sets;
step 2, determining at least one vector pair meeting the current similar vector pair condition according to the sub-vector grouping set pair corresponding to each character string pair;
step 3, updating the initial result vector pair set according to the at least one vector pair to obtain an updated initial result vector pair set, and respectively updating the character string condition and the similar vector pair condition according to the updated initial result vector pair set to obtain an updated character string condition and an updated similar vector pair condition;
step 4, repeatedly executing steps 1 to 3 until the number of vector pairs in the updated initial result vector pair set is not less than the number of result vectors, and taking the updated initial result vector pair set obtained by executing step 3 for the last time as the final updated initial result vector pair set.
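Steps 1 to 4 can be sketched as the loop below. The claims do not define the character string condition, the similar vector pair condition, or how they are updated. This sketch assumes a Hamming-distance radius on the strings, a Euclidean distance threshold on the vectors, and a simple relaxation of both after each round. The name `top_k_pairs` and the `growth` parameter are hypothetical, and the result is not guaranteed to be the exact k closest pairs.

```python
import heapq
import math
from itertools import combinations, combinations_with_replacement, product

def hamming(s, t):
    return sum(a != b for a, b in zip(s, t))

def top_k_pairs(groups, k, threshold, growth=2.0):
    """groups maps a string to its vectors; returns k (distance, a, b) tuples."""
    if threshold <= 0 or growth <= 1:
        raise ValueError("threshold must be > 0 and growth must be > 1")
    total = math.comb(sum(len(g) for g in groups.values()), 2)
    k = min(k, total)  # guard so the loop always terminates
    results = {}       # (s, i, t, j) -> (distance, vector, vector)
    radius = 0         # assumed string condition: Hamming distance <= radius
    while len(results) < k:
        # Step 1: candidate string pairs under the current string condition.
        candidates = [
            (s, t)
            for s, t in combinations_with_replacement(sorted(groups), 2)
            if hamming(s, t) <= radius
        ]
        # Step 2: vector pairs under the current similar vector pair condition.
        for s, t in candidates:
            if s == t:
                index_pairs = combinations(range(len(groups[s])), 2)
            else:
                index_pairs = product(range(len(groups[s])), range(len(groups[t])))
            for i, j in index_pairs:
                a, b = groups[s][i], groups[t][j]
                d = math.dist(a, b)
                if d <= threshold:
                    # Step 3: update the result vector pair set.
                    results[(s, i, t, j)] = (d, a, b)
        # Step 3 (continued): relax both conditions for the next round.
        radius += 1
        threshold *= growth
    # Extract k pairs from the final set; here the k with smallest distance.
    return heapq.nsmallest(k, results.values(), key=lambda r: r[0])

example = {"ad": [(0, 0), (1, 0)], "da": [(10, 0), (10, 3)]}
out = top_k_pairs(example, k=2, threshold=1.0)
```

Starting from a tight threshold keeps early rounds cheap, since only pairs inside groups with identical strings are compared.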
7. The similarity connection query method according to claim 1, wherein calculating a vector distance threshold according to the similarity distribution histogram and the number of result vectors comprises:
determining, from the plurality of distance vector intervals, the distance vector interval corresponding to the number of result vectors as a result vector interval;
and determining a linear equation corresponding to the result vector interval, and calculating a vector distance threshold according to the linear equation.
8. The similarity connection query method according to claim 7, wherein the linear equation is:
[Equation image not reproduced in the source text]
wherein x_0, x_1, x_2, …, x_{i-1}, x_i, …, x_m are the interval boundaries of the distance vector intervals, k is the number of result vectors, x_k is the vector distance threshold, (x_{i-1}, x_i) is the result vector interval, and (y_{i-1}, y_i) is the similar result quantity interval corresponding to the result vector interval;
wherein k ∈ (y_{i-1}, y_i).
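The equation itself is not available in this text. One plausible reading, offered only as an assumption, is linear interpolation along the line through (x_{i-1}, y_{i-1}) and (x_i, y_i), which gives x_k = x_{i-1} + (k - y_{i-1}) (x_i - x_{i-1}) / (y_i - y_{i-1}). The sketch below further assumes that y_0, …, y_m are non-decreasing cumulative result counts; the function name `distance_threshold` is hypothetical.

```python
from bisect import bisect_left

def distance_threshold(boundaries, cumulative_counts, k):
    """Interpolate the distance x_k at which the cumulative count reaches k.

    boundaries:        x_0 .. x_m, interval boundaries
    cumulative_counts: y_0 .. y_m, assumed non-decreasing estimated counts
    """
    # Find i such that y_{i-1} < k <= y_i (the result vector interval).
    i = bisect_left(cumulative_counts, k)
    if i == len(cumulative_counts):
        raise ValueError("k exceeds the estimated total number of pairs")
    if i == 0:
        return boundaries[0]
    x0, x1 = boundaries[i - 1], boundaries[i]
    y0, y1 = cumulative_counts[i - 1], cumulative_counts[i]
    # Linear interpolation between (x0, y0) and (x1, y1); y1 > y0 holds here.
    return x0 + (k - y0) * (x1 - x0) / (y1 - y0)

x_k = distance_threshold([0, 5, 10], [0, 20, 100], k=60)
```

In the example, k = 60 lies in the count interval (20, 100), so the threshold falls inside the distance interval (5, 10).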
9. A similarity connection query device, comprising:
the first acquisition module is used for acquiring an original vector set to be queried, the quantity of result vectors and an initial result vector pair set; the original vector set is an original image data set for similarity connection query, the initial result vector pair set is an initial image data set for similarity connection query results, and the result vector quantity represents the number of vector pairs of similarity connection query results and the quantity of similar images in the original image data set;
the grouping module is used for grouping the original vector set to obtain a plurality of sub-vector grouping sets;
the construction module is used for constructing a similarity distribution histogram of the original vector set according to the number of the result vectors and the plurality of sub-vector grouping sets;
the calculation module is used for calculating a vector distance threshold according to the similarity distribution histogram and the quantity of the result vectors;
the second acquisition module is used for updating the initial result vector pair set according to the plurality of sub-vector grouping sets, the vector distance threshold and the quantity of the result vectors to obtain a result vector pair set for representing similarity connection query results;
wherein the construction module comprises:
the first submodule is used for acquiring a plurality of distance vector intervals according to a preset interval division rule and the original vector set;
the second sub-module is used for calculating an overall similarity result estimation value corresponding to each distance vector interval according to the plurality of sub-vector grouping sets;
and the third sub-module is used for constructing a similarity distribution histogram of the original vector set according to the overall similarity result estimation value corresponding to each distance vector interval.
CN201910130094.7A 2019-02-21 2019-02-21 Similarity connection query method and device Active CN109783547B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910130094.7A CN109783547B (en) 2019-02-21 2019-02-21 Similarity connection query method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910130094.7A CN109783547B (en) 2019-02-21 2019-02-21 Similarity connection query method and device

Publications (2)

Publication Number Publication Date
CN109783547A CN109783547A (en) 2019-05-21
CN109783547B true CN109783547B (en) 2020-08-21

Family

ID=66485901

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910130094.7A Active CN109783547B (en) 2019-02-21 2019-02-21 Similarity connection query method and device

Country Status (1)

Country Link
CN (1) CN109783547B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110276050B (en) * 2019-06-25 2023-09-15 洛阳师范学院 Method and device for comparing high-dimensional vector similarity
CN110502629B (en) * 2019-08-27 2020-09-11 桂林电子科技大学 LSH-based connection method for filtering and verifying similarity of character strings
CN111814990B (en) * 2020-06-23 2023-10-10 汇纳科技股份有限公司 Threshold determining method, system, storage medium and terminal
CN113779197B (en) * 2021-09-09 2023-07-04 中国电子科技集团公司信息科学研究院 Data set searching method and device, storage medium and terminal

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101524375B1 (en) * 2013-12-16 2015-07-01 성신여자대학교 산학협력단 Method for similarity joins by adaptive prefix filtering
CN105843907A (en) * 2016-03-24 2016-08-10 复旦大学 Method for establishing memory index structure-distance tree and similarity connection algorithm based on distance tree
CN107368830A (en) * 2016-05-13 2017-11-21 佳能株式会社 Method for text detection and device and text recognition system
CN107623639A (en) * 2017-09-08 2018-01-23 广西大学 Data flow distribution similarity join method based on EMD distances
CN108829804A (en) * 2018-06-05 2018-11-16 洛阳师范学院 High-dimensional data similarity join query method and device based on distance partition tree
CN108846067A (en) * 2018-06-05 2018-11-20 洛阳师范学院 High-dimensional data similarity join query method and device based on mapping space partitioning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10496638B2 (en) * 2016-12-07 2019-12-03 City University Of Hong Kong Systems and methods for privacy-assured similarity joins over encrypted datasets

Also Published As

Publication number Publication date
CN109783547A (en) 2019-05-21

Similar Documents

Publication Publication Date Title
CN109783547B (en) Similarity connection query method and device
CN110188223B (en) Image processing method and device and computer equipment
CN105912611B (en) Fast image retrieval method based on CNN
CN106570141B (en) Approximate repeated image detection method
CN107622072B (en) Identification method for webpage operation behavior, server and terminal
EP3499384A1 (en) Word and sentence embeddings for sentence classification
CN111461637A (en) Resume screening method and device, computer equipment and storage medium
WO2014068990A1 (en) Relatedness determination device, permanent physical computer-readable medium for same, and relatedness determination method
CN108595688A (en) Latent semantic cross-media hash retrieval method based on online learning
CN110825894A (en) Data index establishing method, data index retrieving method, data index establishing device, data index retrieving device, data index establishing equipment and storage medium
CN113011194B (en) Text similarity calculation method fusing keyword features and multi-granularity semantic features
CN110858217A (en) Method and device for detecting microblog sensitive topics and readable storage medium
CN111859004A (en) Retrieval image acquisition method, device, equipment and readable storage medium
CN111325033B (en) Entity identification method, entity identification device, electronic equipment and computer readable storage medium
CN106909575B (en) Text clustering method and device
CN113536020B (en) Method, storage medium and computer program product for data query
Wang et al. File fragment type identification with convolutional neural networks
CN116795947A (en) Document recommendation method, device, electronic equipment and computer readable storage medium
Xu et al. DHA: Supervised deep learning to hash with an adaptive loss function
CN110019400B (en) Data storage method, electronic device and storage medium
CN113111178A (en) Method and device for disambiguating homonymous authors based on expression learning without supervision
CN112711648A (en) Database character string ciphertext storage method, electronic device and medium
CN114691868A (en) Text clustering method and device and electronic equipment
CN114282119A (en) Scientific and technological information resource retrieval method and system based on heterogeneous information network
CN112766288A (en) Image processing model construction method and device, electronic equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant