CN110276050B - Method and device for comparing high-dimensional vector similarity - Google Patents

Method and device for comparing high-dimensional vector similarity

Info

Publication number
CN110276050B
Authority
CN
China
Prior art keywords
preset
value
similarity
dimensional vectors
dimensional
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910553042.0A
Other languages
Chinese (zh)
Other versions
CN110276050A (en)
Inventor
马友忠
张瑞玲
林春杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Luoyang Normal University
Original Assignee
Luoyang Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Luoyang Normal University filed Critical Luoyang Normal University
Priority to CN201910553042.0A priority Critical patent/CN110276050B/en
Publication of CN110276050A publication Critical patent/CN110276050A/en
Application granted granted Critical
Publication of CN110276050B publication Critical patent/CN110276050B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/16Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Algebra (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to the technical field of data processing, and in particular to a method and a device for comparing high-dimensional vector similarity. The method comprises: obtaining a vector set comprising a plurality of high-dimensional vectors; reducing the dimension of each high-dimensional vector multiple times to obtain a plurality of predetermined dimension maps; and sequentially calculating, for every two high-dimensional vectors, the distance values between their predetermined dimension maps under the same dimension reduction, wherein if the obtained distance values are smaller than a predetermined distance value, the similarity between the high-dimensional vectors meets a predetermined similarity value. By reducing the dimension of each high-dimensional vector multiple times and comparing the resulting predetermined dimension maps pairwise, the scheme ensures both a high recall rate and a good filtering effect and improves query accuracy.

Description

Method and device for comparing high-dimensional vector similarity
Technical Field
The application relates to the technical field of data processing, in particular to a method and a device for comparing high-dimensional vector similarity.
Background
At present, when pictures need to be compared for similarity, or in other scenarios such as checking papers for duplication, the compared objects are converted into high-dimensional vectors, and whether the objects are similar is concluded by comparing the vectors. For example, to compare two pictures, each picture is converted into a high-dimensional vector, the two vectors are compared for similarity, and a conclusion on whether the pictures are similar is obtained. However, existing similarity comparison of two high-dimensional vectors mostly uses a single-mapping filtering method, which cannot achieve a high recall rate and a good filtering effect at the same time, so the final comparison result is inaccurate.
Disclosure of Invention
The application aims to provide a method for comparing high-dimensional vector similarity, so as to ensure higher recall rate and filtering effect and improve query accuracy.
The application further aims to provide a device for comparing high-dimensional vector similarity, so that higher recall rate and filtering effect are ensured, and query accuracy is improved.
In order to achieve the above object, the technical scheme adopted by the embodiment of the application is as follows:
In a first aspect, an embodiment of the present application provides a method for comparing high-dimensional vector similarity, the method comprising: obtaining a vector set, the vector set comprising a plurality of high-dimensional vectors; reducing the dimension of each high-dimensional vector multiple times to obtain a plurality of predetermined dimension maps; and sequentially calculating, for every two high-dimensional vectors, the distance values between their predetermined dimension maps under the same dimension reduction, wherein if the obtained distance values are smaller than a predetermined distance value, the similarity between the high-dimensional vectors meets a predetermined similarity value.
In a second aspect, an embodiment of the present application further provides an apparatus for comparing high-dimensional vector similarity, the apparatus comprising: a transceiver module configured to obtain a vector set, the vector set comprising a plurality of high-dimensional vectors; and a processing module configured to reduce the dimension of each high-dimensional vector multiple times to obtain a plurality of predetermined dimension maps, and to sequentially calculate, for every two high-dimensional vectors, the distance values between their predetermined dimension maps under the same dimension reduction, wherein if the obtained distance values are smaller than a predetermined distance value, the similarity between the high-dimensional vectors meets a predetermined similarity value.
The embodiments of the application provide a method and a device for comparing high-dimensional vector similarity. The method comprises: obtaining a vector set comprising a plurality of high-dimensional vectors; reducing the dimension of each high-dimensional vector multiple times to obtain a plurality of predetermined dimension maps; and sequentially calculating, for every two high-dimensional vectors, the distance values between their predetermined dimension maps under the same dimension reduction, wherein if the obtained distance values are smaller than a predetermined distance value, the similarity between the high-dimensional vectors meets a predetermined similarity value. By reducing the dimension of each high-dimensional vector multiple times and comparing the resulting predetermined dimension maps pairwise, the scheme ensures both a high recall rate and a good filtering effect and improves query accuracy.
In order to make the above objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flow chart illustrating a method for comparing high-dimensional vector similarity according to an embodiment of the present application.
Fig. 2 is a block diagram of a method for comparing similarity of vectors in high dimensions according to an embodiment of the present application.
Fig. 3 is a schematic functional block diagram of an apparatus for comparing high-dimensional vector similarity according to an embodiment of the present application.
Reference numerals: 200 - apparatus for comparing high-dimensional vector similarity; 210 - transceiver module; 220 - processing module.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. The components of the embodiments of the present application generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the application, as presented in the figures, is not intended to limit the scope of the application, as claimed, but is merely representative of selected embodiments of the application. All other embodiments, which can be made by a person skilled in the art without making any inventive effort, are intended to be within the scope of the present application.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only to distinguish the description, and are not to be construed as indicating or implying relative importance.
At present, high-dimensional vectors are used in many application scenarios in which the degree of similarity between compared objects is determined by pairwise comparison of the corresponding high-dimensional vectors; these scenarios include, but are not limited to, comparing a plurality of pictures and comparing a plurality of sounds. For example, to compare two pictures for similarity, the pictures are not compared directly. Instead, a high-dimensional vector is extracted from each picture; because many factors characterize the content of a picture, such as texture features and color features, the vector corresponding to a picture has a large dimension. The high-dimensional vectors corresponding to the two pictures are then compared: the greater the similarity of the two vectors, the higher the similarity of the corresponding pictures, and vice versa.
Therefore, the embodiment of the application provides a similarity comparison method for high-dimensional vectors, which applies multiple mappings to each high-dimensional vector and sequentially compares the distances between the resulting maps to determine the degree of similarity between two high-dimensional vectors. This better ensures both the recall rate and the filtering effect (here the recall rate refers to the number of retained features, and the filtering effect to the number of filtered-out features), and the method can be widely applied in scenarios that process high-dimensional vectors.
Referring to fig. 1, a flow chart of a method for comparing high-dimensional vector similarity according to an embodiment of the application is shown, and the method includes:
S110, acquiring a vector set, wherein the vector set comprises a plurality of high-dimensional vectors.
Specifically, the vector set includes a plurality of high-dimensional vectors. For example, the vector set is $V = [v_1, v_2, v_3]$, where $v_1 = [v_{11}, v_{12}, v_{13}, \ldots, v_{1n}]$, $v_2 = [v_{21}, v_{22}, v_{23}, \ldots, v_{2n}]$, and $v_3 = [v_{31}, v_{32}, v_{33}, \ldots, v_{3n}]$. Each vector in the set is a high-dimensional vector, and the vectors have the same dimension but different values. Each high-dimensional vector represents some object; for example, one high-dimensional vector represents one picture, so three high-dimensional vectors represent three pictures. The three high-dimensional vectors are placed in the same vector set so that they can be compared for similarity, which yields a conclusion on whether the pictures they represent are similar.
Further, since each high-dimensional vector has a large dimension, processing a plurality of high-dimensional vectors simultaneously on one terminal device (a device capable of computation, such as a desktop or notebook computer) puts considerable computational pressure on that device and reduces the computation speed. The high-dimensional vectors can therefore be processed on different terminal devices in a distributed manner, which reduces the load on each device and increases the computation speed. The specific implementation is as follows:
Each high-dimensional vector is randomly assigned a data block number pid, whose value is pid = Math.abs() % c + 1, where c is the total number of data blocks. High-dimensional vectors with the same data block number are then placed in one group, and the vectors of one group are processed on the same terminal device. In other words, the data block numbers are determined first and allocated to the high-dimensional vectors, and the vectors sharing a data block number are then assigned to the same terminal device for computation.
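The grouping step can be sketched in a few lines of Python. This is only an illustration: the source does not show what is passed to Math.abs, so the sketch assumes a randomly drawn 32-bit integer, and the function name `assign_blocks` and the `seed` parameter are hypothetical.

```python
import random
from collections import defaultdict

def assign_blocks(vectors, c, seed=0):
    """Group vectors by a randomly assigned data block number pid in 1..c."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for v in vectors:
        # Assumption: a random integer stands in for the missing argument
        # of Math.abs in the formula pid = Math.abs() % c + 1.
        pid = abs(rng.randint(-2**31, 2**31 - 1)) % c + 1
        groups[pid].append(v)
    return dict(groups)
```

Each resulting group would then be handed to one terminal device; how the groups are shipped to devices is outside the scope of this sketch.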
S120, reducing the dimension of each high-dimensional vector for a plurality of times to obtain a plurality of preset dimension maps.
Specifically, taking the dimension reduction of one high-dimensional vector $v_1$ in the vector set as an example, suppose $v_1$ originally has 1000 dimensions, i.e. $v_1 = [v_{11}, v_{12}, v_{13}, \ldots, v_{1,1000}]$. The vector can be reduced in dimension multiple times to obtain multiple 100-dimensional maps. Every map has 100 dimensions, but the specific values differ from map to map; that is, the maps have the same dimension and different values. The dimension of the predetermined dimension maps obtained from a high-dimensional vector can be chosen according to actual needs, as can the number of predetermined dimension maps.
Further, each map is calculated as the vector of its mapping values, $[g_1(v), g_2(v), \ldots, g_m(v)]$ for an m-dimensional map, where $g_n(v)$ is the n-th mapping value of $v$, calculated as $g(v) = v \cdot a$. Each element of the vector $a$ is an independent, identically distributed random variable obeying the standard normal distribution $N(0, 1)$, so $a$ takes a different value for each mapping value.
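A minimal sketch of this repeated dimension reduction follows, assuming plain Python lists as vectors. The names `make_projections` and `reduce_dims` and the parameters `l` (number of maps) and `seed` are illustrative and do not come from the source. The sketch also assumes that the same random vectors are reused for every high-dimensional vector, since maps "under the same dimension reduction" are only comparable if they share their projection.

```python
import random

def make_projections(n, m, l, seed=0):
    """Draw l projections, each holding m random vectors a of length n.

    Every element of every a is an independent N(0, 1) sample.
    """
    rng = random.Random(seed)
    return [[[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]
            for _ in range(l)]

def reduce_dims(v, projections):
    """Return l maps of v; entry n of each map is g_n(v) = v . a_n."""
    return [[sum(x * w for x, w in zip(v, a_n)) for a_n in proj]
            for proj in projections]
```

For the example in the text, `make_projections(1000, 100, l)` would produce `l` maps of 100 dimensions for each 1000-dimensional vector.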
According to the method, a plurality of preset dimension mappings are obtained through reducing dimensions of one high-dimensional vector for a plurality of times, and compared with a mode of carrying out single mapping processing on the high-dimensional vector, the method can ensure that a higher recall rate and a better filtering effect can be obtained, and the final calculation result is more accurate.
S130, sequentially calculating the distance values between the predetermined dimension mappings under the same dimension reduction corresponding to each two high-dimensional vectors, and if the distance values are smaller than the predetermined distance values, the similarity between the high-dimensional vectors accords with the predetermined similarity value.
Specifically, since a plurality of predetermined dimension maps have been obtained in advance by reducing the dimension of each high-dimensional vector multiple times, the distance values between the predetermined dimension maps of two high-dimensional vectors are calculated one by one, each time using maps from the same dimension reduction (for example, both maps obtained in the first dimension reduction), to ensure accurate results. The similarity between the two high-dimensional vectors is then determined from these distance values. If more than two high-dimensional vectors are involved, the vectors on the same terminal device may be compared pairwise first, and then the vectors on the other terminal devices; the operation is complete once all high-dimensional vectors have been compared pairwise.
For example, for each pair of high-dimensional vectors $\langle v_i, v_j \rangle$, the distance value between their 1st m-dimensional maps is calculated first. If this distance value is greater than the predetermined distance value $k\varepsilon$, the difference between the two high-dimensional vectors is too large and does not meet the predetermined similarity value, so the pair is filtered out in advance. (The predetermined similarity value is configurable; in practice, high-dimensional vectors that meet it are retained.)
In another case, for each pair of high-dimensional vectors $\langle v_i, v_j \rangle$, the distance value between their 1st m-dimensional maps is calculated first. If this distance value is smaller than the predetermined distance value, the two high-dimensional vectors may be similar, and the distance value between their 2nd m-dimensional maps is calculated next. If that distance value is greater than the predetermined distance value, the pair is filtered out in advance; otherwise the distance value between the 3rd m-dimensional maps is calculated, and so on, up to the distance value between the l-th m-dimensional maps. When all l distance values have been calculated and all are smaller than the predetermined distance value, the similarity between the two high-dimensional vectors is very high and can be judged to meet the predetermined similarity value.
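The sequential check with early rejection can be sketched as below. The sketch assumes that the distance between two maps is the Euclidean distance, which the surviving text does not state explicitly; `map_distance` and `passes_filter` are hypothetical names, and `threshold` stands for the predetermined distance value kε.

```python
import math

def map_distance(p, q):
    """Euclidean distance between two maps of equal dimension (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

def passes_filter(maps_i, maps_j, threshold):
    """Compare the 1st, 2nd, ..., l-th maps in order.

    Returns False as soon as one distance exceeds the threshold, so the
    remaining maps of a rejected pair are never evaluated.
    """
    for p, q in zip(maps_i, maps_j):
        if map_distance(p, q) > threshold:
            return False
    return True
```

Stopping at the first failing map is what keeps the filter cheap: most dissimilar pairs are expected to be rejected after one or two distance computations in the reduced space.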
Further, to confirm that the two high-dimensional vectors are actually similar, after the predetermined dimension maps under the same dimension reduction have preliminarily indicated that they are similar, the Euclidean distance between the two high-dimensional vectors must also be calculated in the original space. When the calculated Euclidean distance is less than or equal to a threshold, the similarity between the two high-dimensional vectors is conclusively determined to meet the predetermined similarity value.
The Euclidean distance between the two high-dimensional vectors is calculated as $dist(v_i, v_j) = \sqrt{\sum_{t=1}^{n} (v_{it} - v_{jt})^2}$. When the Euclidean distance is less than or equal to the threshold, i.e. $dist(v_i, v_j) \le \varepsilon$, the similarity of the two high-dimensional vectors is determined to be high, and the pair $\langle v_i, v_j \rangle$ is output as a result. Therefore, by combining distance calculation on the multiple maps of the high-dimensional vectors with Euclidean distance calculation between the high-dimensional vectors themselves, the similarity between the two high-dimensional vectors can be determined more accurately, and with it the similarity between the objects (such as pictures) that they represent.
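The verification in the original space is a direct translation of the Euclidean distance and the condition dist(v_i, v_j) ≤ ε. In the sketch below, `dist` mirrors the notation of the text, while `verify` is a hypothetical helper name.

```python
import math

def dist(v_i, v_j):
    """Euclidean distance between two vectors in the original space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v_i, v_j)))

def verify(v_i, v_j, eps):
    """True if the pair should be output, i.e. dist(v_i, v_j) <= eps."""
    return dist(v_i, v_j) <= eps
```

Because this step touches all n original dimensions, it is only worth running on pairs that have already survived the filter on the reduced maps.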
The predetermined distance value is determined as follows. A preset chi-square distribution table is searched according to a preset recall rate, denoted P, to obtain a coefficient value; the table stores in advance the correspondence between the preset recall rate and the coefficient value, and the coefficient value is $k^2$. Since the predetermined distance value is $k\varepsilon$ and $\varepsilon$ is a known value, once $k^2$ has been found from the preset recall rate, $k$ can be calculated, and the predetermined distance value follows. The predetermined distance value is the distance threshold applied to the high-dimensional vectors after dimension reduction.
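As an illustration of how k could be obtained without a printed table, the sketch below computes the chi-square quantile numerically using only the Python standard library. Two assumptions are not stated in the source: that the degrees of freedom equal the map dimension m (which is what the standard normal projection suggests, since the squared map distance divided by the squared original distance is then chi-square distributed with m degrees of freedom), and that the coefficient value is the quantile at probability P. The function names are hypothetical.

```python
import math

def chi2_cdf(x, df):
    """Chi-square CDF via the series for the regularized lower incomplete gamma."""
    if x <= 0:
        return 0.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while term > total * 1e-15:
        term *= z / (a + n)
        total += term
        n += 1
    return total * math.exp(-z + a * math.log(z) - math.lgamma(a))

def chi2_quantile(p, df):
    """Smallest x with chi2_cdf(x, df) >= p, found by bisection."""
    lo, hi = 0.0, float(df)
    while chi2_cdf(hi, df) < p:
        hi *= 2.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if chi2_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return hi

def predetermined_distance(p, m, eps):
    """Return k * eps, where k**2 is the coefficient value for recall rate p."""
    k = math.sqrt(chi2_quantile(p, m))  # assumption: df equals map dimension m
    return k * eps
```

A lookup table, as described in the text, would simply store precomputed values of `chi2_quantile` for the recall rates of interest.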
Referring to fig. 2, a block diagram of a method for comparing high-dimensional vector similarity according to an embodiment of the present application is provided, and a scheme flow provided by the embodiment of the present application will be further described below according to the block diagram.
The vector set includes a plurality of high-dimensional vectors, which are divided into data block 1, data block 2, and so on. The comparison method provided by the embodiment of the application is divided into two stages, a Map stage and a Reduce stage: the Map stage groups the high-dimensional vectors and reduces their dimension, and the Reduce stage filters the high-dimensional vectors after distance calculation. The Map stage includes: randomly grouping the high-dimensional vectors (e.g. into c blocks); reducing the dimension of each high-dimensional vector multiple times to obtain multiple m-dimensional maps; and copying each m-dimensional map c times and outputting c key-value pairs to facilitate pairwise distance calculation. After these steps, the output of the Map stage comprises the data block number (p), the multiple m-dimensional maps, and the original high-dimensional vector v.
Furthermore, based on the output of the Map stage, the Reduce stage sequentially calculates the distance values between the predetermined dimension maps obtained by the same dimension reduction of two high-dimensional vectors. If a distance value is greater than the predetermined distance value $k\varepsilon$, the two high-dimensional vectors are filtered out in advance; otherwise the distance values for the maps obtained by each further dimension reduction are calculated in turn. Finally, if the Euclidean distance between the two high-dimensional vectors is also within the threshold, i.e. $dist(v_i, v_j) \le \varepsilon$, the two high-dimensional vectors are output.
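The two stages can be simulated in a single process as sketched below. This is not a Hadoop or Spark job, and several details are assumptions rather than statements from the source: the key of each emitted pair is taken to be the unordered pair of block numbers (pid, q) for q = 1..c, so that every pair of vectors meets under exactly one key; the map distance is taken to be Euclidean; and all function and parameter names are hypothetical.

```python
import math
import random
from collections import defaultdict
from itertools import combinations

def euclid(p, q):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

def map_stage(vectors, projections, c, rng):
    """Group, reduce, and emit c key-value pairs per vector."""
    emitted = defaultdict(list)
    for idx, v in enumerate(vectors):
        pid = rng.randrange(c) + 1  # random data block number in 1..c
        maps = [[sum(x * w for x, w in zip(v, a)) for a in proj]
                for proj in projections]
        # Assumed key scheme: one key per unordered block pair (pid, q).
        for q in range(1, c + 1):
            key = (min(pid, q), max(pid, q))
            emitted[key].append((pid, idx, maps, v))
    return emitted

def reduce_stage(emitted, k_eps, eps):
    """Filter pairs on their maps, then verify survivors in the original space."""
    results = []
    for (a, b), records in emitted.items():
        for r1, r2 in combinations(records, 2):
            if a != b and r1[0] == r2[0]:
                continue  # same-block pairs are handled under key (a, a)
            if any(euclid(p, q) > k_eps for p, q in zip(r1[2], r2[2])):
                continue  # filtered out early by one of the maps
            if euclid(r1[3], r2[3]) <= eps:
                results.append(tuple(sorted((r1[1], r2[1]))))
    return sorted(results)

def similarity_join(vectors, eps, k_eps, m=4, l=3, c=2, seed=0):
    """Run both stages and return index pairs of similar vectors."""
    rng = random.Random(seed)
    n = len(vectors[0])
    projections = [[[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]
                   for _ in range(l)]
    return reduce_stage(map_stage(vectors, projections, c, rng), k_eps, eps)
```

Called as `similarity_join(vectors, eps, k_eps)`, the function returns the index pairs whose maps all pass the kε filter and whose original vectors lie within ε of each other. Because the filter on random maps is probabilistic, a truly similar pair can occasionally be rejected; the preset recall rate used to choose k controls how often that happens.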
Therefore, the scheme provided by the embodiment of the application can reduce the dimension of high-dimensional vectors representing pictures, sounds, and other objects multiple times to obtain a plurality of m-dimensional maps, and determines whether the high-dimensional vectors are similar by calculating the distances between the m-dimensional maps, from which it follows whether the objects represented by the vectors are similar. This ensures a high recall rate and a good filtering effect and improves query accuracy.
Referring to fig. 3, a functional block diagram of an apparatus 200 for comparing high-dimensional vector similarity according to an embodiment of the application is shown, which includes a transceiver module 210 and a processing module 220.
The transceiver module 210 is configured to obtain a vector set, where the vector set includes a plurality of high-dimensional vectors.
In an embodiment of the present application, S110 may be performed by the transceiver module 210.
The processing module 220 is configured to reduce the dimension of each high-dimensional vector multiple times to obtain a plurality of predetermined dimension maps, and to sequentially calculate, for every two high-dimensional vectors, the distance values between their predetermined dimension maps under the same dimension reduction; if the obtained distance values are smaller than the predetermined distance value, the similarity between the high-dimensional vectors meets the predetermined similarity value.
In an embodiment of the present application, S120 and S130 may be performed by the processing module 220.
Since these steps have already been described in detail in the part on the method for comparing high-dimensional vector similarity, they are not repeated here.
In summary, the embodiments of the present application provide a method and an apparatus for comparing high-dimensional vector similarity. The method comprises: obtaining a vector set comprising a plurality of high-dimensional vectors; reducing the dimension of each high-dimensional vector multiple times to obtain a plurality of predetermined dimension maps; and sequentially calculating, for every two high-dimensional vectors, the distance values between their predetermined dimension maps under the same dimension reduction, wherein if the obtained distance values are smaller than a predetermined distance value, the similarity between the high-dimensional vectors meets a predetermined similarity value. By reducing the dimension of each high-dimensional vector multiple times and comparing the resulting predetermined dimension maps pairwise, the scheme ensures both a high recall rate and a good filtering effect and improves query accuracy.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. The apparatus embodiments described above are merely illustrative, for example, of the flowcharts and block diagrams in the figures that illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form a single part, or each module may exist alone, or two or more modules may be integrated to form a single part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application. It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The above description is only of the preferred embodiments of the present application and is not intended to limit the present application, but various modifications and variations can be made to the present application by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the protection scope of the present application.
The foregoing is merely illustrative of the present application, and the present application is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (4)

1. A method for high-dimensional vector similarity comparison, the method being used for similar picture querying, the method comprising:
obtaining a vector set, the vector set comprising a plurality of high-dimensional vectors, each high-dimensional vector representing a picture;
reducing the dimension of each high-dimensional vector multiple times to obtain a plurality of predetermined dimension maps;
calculating distance values between the predetermined dimension mappings under the same dimension reduction corresponding to every two high-dimensional vectors in sequence, if a plurality of distance values are smaller than the predetermined distance values, the similarity between the high-dimensional vectors accords with the predetermined similarity value, the pictures represented by the high-dimensional vectors are similar, and the predetermined distance values are determined in the following manner: searching a preset chi-square distribution table according to a preset recall rate to obtain a coefficient value, wherein the preset recall rate is represented by P, the preset chi-square distribution table prestores the corresponding relation between the preset recall rate and the coefficient value, the coefficient value is the square of k, the preset distance value is k epsilon, epsilon is a known value, after the square of k is obtained according to the preset recall rate, the value of k is obtained through calculation, and then the value of the preset distance value is obtained through calculation, wherein the preset distance value is a distance threshold value determined after dimension reduction according to a high-dimensional vector;
reserving pictures corresponding to the high-dimensional vectors with the similarity meeting the preset similarity value, filtering out pictures corresponding to the high-dimensional vectors with the similarity not meeting the preset similarity value, and determining a picture recall rate according to the reserved pictures;
the step of sequentially calculating the distance values between the same dimension-reduction preset dimension maps corresponding to every two high-dimension vectors, and the step of further comprises the following steps if the obtained distance values are smaller than the preset distance values: calculating Euclidean distances of the two high-dimensional vectors; if the Euclidean distance is smaller than or equal to a threshold value, the similarity between the high-dimensional vectors accords with a preset similarity value;
the step of obtaining the vector set further comprises the following steps: distributing a data block number for each high-dimensional vector, dividing the high-dimensional vectors with the same data block number into a group, and operating the high-dimensional vectors with the same group on the same terminal equipment for calculation; dividing high-dimensional vectors having the same data block number into a group; running the same group of high-dimensional vectors on the same terminal equipment to perform distributed computation;
the calculation mode of the preset distance value is as follows: searching in a preset chi-square distribution table according to a preset recall ratio to obtain a coefficient value; and calculating according to the coefficient value to obtain a preset distance value.
2. The method of claim 1, wherein the method further comprises:
and if one of the obtained distance values is larger than a preset distance value, the similarity between the high-dimensional vectors does not accord with the preset similarity value.
3. An apparatus for comparing the similarity of high-dimensional vectors, the apparatus being used for similar-picture querying and comprising:
a transceiver module configured to obtain a vector set comprising a plurality of high-dimensional vectors, each high-dimensional vector representing a picture;
a processing module configured to perform dimensionality reduction multiple times on each high-dimensional vector to obtain a plurality of preset-dimension mappings; and to sequentially calculate, for every two high-dimensional vectors, the distance values between their preset-dimension mappings under the same dimensionality reduction; if the plurality of distance values are smaller than a preset distance value, the similarity between the high-dimensional vectors meets a preset similarity value and the pictures represented by the high-dimensional vectors are similar; wherein the preset distance value is determined as follows: a preset chi-square distribution table is searched according to a preset recall rate, denoted P, to obtain a coefficient value, the preset chi-square distribution table pre-storing the correspondence between the preset recall rate and the coefficient value; the coefficient value is the square of k; the preset distance value is kε, where ε is a known value; after the square of k is obtained according to the preset recall rate, the value of k is calculated, and the preset distance value is then calculated; the preset distance value is a distance threshold determined after dimensionality reduction of the high-dimensional vectors;
the processing module is further configured to retain the pictures corresponding to the high-dimensional vectors whose similarity meets the preset similarity value, filter out the pictures corresponding to the high-dimensional vectors whose similarity does not meet the preset similarity value, and determine a picture recall rate according to the retained pictures;
the processing module is further configured to: calculate the Euclidean distance between the two high-dimensional vectors; and if the Euclidean distance is smaller than or equal to a threshold value, determine that the similarity between the high-dimensional vectors meets the preset similarity value;
the processing module is further configured to: assign a data block number to each high-dimensional vector; divide the high-dimensional vectors having the same data block number into one group; and run the high-dimensional vectors of the same group on the same terminal device to perform distributed computation;
the processing module is further configured to: search the preset chi-square distribution table according to the preset recall rate to obtain the coefficient value; and calculate the preset distance value from the coefficient value.
4. The apparatus of claim 3, wherein the processing module is further configured to: if one of the obtained distance values is larger than the preset distance value, determine that the similarity between the high-dimensional vectors does not meet the preset similarity value.
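
The filter-and-verify procedure recited in claims 1 and 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the Gaussian random projection used as the dimensionality reduction, the small excerpt of chi-square quantiles, the use of the projected dimension as the table's degrees of freedom, and the reuse of ε as the final Euclidean threshold are all choices made here for illustration only.

```python
import math
import random

# Excerpt of standard chi-square quantiles: (degrees of freedom, recall P) -> k^2.
# Assumption: the "preset chi-square distribution table" is indexed this way.
CHI2_TABLE = {
    (1, 0.90): 2.706, (1, 0.95): 3.841, (1, 0.99): 6.635,
    (2, 0.90): 4.605, (2, 0.95): 5.991, (2, 0.99): 9.210,
}

def preset_distance(recall, epsilon, dof):
    """Look up k^2 for the preset recall rate, then return the threshold k * epsilon."""
    k_squared = CHI2_TABLE[(dof, recall)]
    return math.sqrt(k_squared) * epsilon

def make_projection(dim, target_dim, rng):
    """Hypothetical dimensionality reduction: a target_dim x dim Gaussian matrix."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(target_dim)]

def project(matrix, vec):
    """Map a high-dimensional vector to its low-dimensional mapping."""
    return [sum(a * x for a, x in zip(row, vec)) for row in matrix]

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def is_similar(u, v, projections, recall, epsilon):
    """Filter on every reduced mapping, then verify with the exact distance."""
    limit = preset_distance(recall, epsilon, dof=len(projections[0]))
    for matrix in projections:
        # Claim 2: a single projected distance above the threshold rejects the pair.
        if euclidean(project(matrix, u), project(matrix, v)) > limit:
            return False
    # Claim 1: surviving pairs are verified with the Euclidean distance.
    return euclidean(u, v) <= epsilon

# Usage: three independent reductions of 8-dimensional vectors to 2 dimensions.
rng = random.Random(0)
projections = [make_projection(dim=8, target_dim=2, rng=rng) for _ in range(3)]
u = [0.1 * i for i in range(8)]
v = [x + 0.01 for x in u]
print(is_similar(u, v, projections, recall=0.95, epsilon=1.0))
```

In this sketch the projected-distance test only discards candidate pairs cheaply; the exact Euclidean check is what finally decides similarity, so pairs that pass the filter but lie farther apart than ε are still rejected.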
CN201910553042.0A 2019-06-25 2019-06-25 Method and device for comparing high-dimensional vector similarity Active CN110276050B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910553042.0A CN110276050B (en) 2019-06-25 2019-06-25 Method and device for comparing high-dimensional vector similarity

Publications (2)

Publication Number Publication Date
CN110276050A (en) 2019-09-24
CN110276050B (en) 2023-09-15

Family

ID=67961785

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910553042.0A Active CN110276050B (en) 2019-06-25 2019-06-25 Method and device for comparing high-dimensional vector similarity

Country Status (1)

Country Link
CN (1) CN110276050B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111723105B (en) * 2020-06-15 2024-09-06 Tencent Technology (Shenzhen) Co., Ltd. Method and device for calculating data similarity
CN112364009A (en) * 2020-12-03 2021-02-12 Sichuan Changhong Electric Co., Ltd. Method for retrieving similar data of target object

Citations (4)

Publication number Priority date Publication date Assignee Title
CN104182411A (en) * 2013-05-24 2014-12-03 NEC (China) Co., Ltd. Map-Reduce-based high-dimensional data similarity join method and device
CN108829804A (en) * 2018-06-05 2018-11-16 Luoyang Normal University High-dimensional data similarity join query method and device based on distance partition tree
CN108846067A (en) * 2018-06-05 2018-11-20 Luoyang Normal University High-dimensional data similarity join query method and device based on mapping space partitioning
CN109783547A (en) * 2019-02-21 2019-05-21 Luoyang Normal University Similarity join query method and device

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US8190663B2 (en) * 2009-07-06 2012-05-29 Osterreichisches Forschungsinstitut Fur Artificial Intelligence Der Osterreichischen Studiengesellschaft Fur Kybernetik Of Freyung Method and a system for identifying similar audio tracks

Non-Patent Citations (2)

Title
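Similarity join query algorithm for high-dimensional data based on chi-square distribution; Ma Youzhong; Jia Shijie; Zhang Yongxin; Journal of Computer Applications (No. 07); full text *
Research progress in similarity join query techniques for big data; Ma Youzhong; Zhang Zhihui; Lin Chunjie; Journal of Computer Applications (No. 04); full text *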

Also Published As

Publication number Publication date
CN110276050A (en) 2019-09-24

Similar Documents

Publication Publication Date Title
CN110400575B (en) Inter-channel feature extraction method, audio separation method and device and computing equipment
CN109597822B (en) User data storage and query method and user data processing device
CN110796154A (en) Method, device and equipment for training object detection model
WO2006094002A1 (en) Hierarchical determination of feature relevancy for mixed data types
CN113032580B (en) Associated file recommendation method and system and electronic equipment
EP2633397A1 (en) Software application recognition
CN110276050B (en) Method and device for comparing high-dimensional vector similarity
CN110162637B (en) Information map construction method, device and equipment
CN109726195B (en) Data enhancement method and device
CN111243601A (en) Voiceprint clustering method and device, electronic equipment and computer-readable storage medium
CN113515656B (en) Multi-view target identification and retrieval method and device based on incremental learning
CN110825894A (en) Data index establishing method, data index retrieving method, data index establishing device, data index retrieving device, data index establishing equipment and storage medium
CN112529767B (en) Image data processing method, device, computer equipment and storage medium
WO2017201605A1 (en) Large scale social graph segmentation
US20110142346A1 (en) Apparatus and method for blocking objectionable multimedia based on skin color and face information
Niu et al. Machine learning-based framework for saliency detection in distorted images
CN107944931A (en) Seed user expanding method, electronic equipment and computer-readable recording medium
CN114780606A (en) Big data mining method and system
CN110619349A (en) Plant image classification method and device
CN104050291A (en) Parallel processing method and system for account balance data
CN110427496B (en) Knowledge graph expansion method and device for text processing
CN112381147A (en) Dynamic picture similarity model establishing method and device and similarity calculating method and device
CN113111687A (en) Data processing method and system and electronic equipment
CN115495778A (en) Differential privacy histogram publishing method and device based on grouping combination
CN112528068B (en) Voiceprint feature storage method, voiceprint feature matching method, voiceprint feature storage device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant