CN104881395A - Method and system for obtaining similarity of vectors in matrix - Google Patents

Publication number
CN104881395A
Authority
CN
China
Prior art keywords
vector
matrix
similarity
reduce
map
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510359140.2A
Other languages
Chinese (zh)
Inventor
王巍
许子立
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
TCL Corp
Original Assignee
TCL Corp
Application filed by TCL Corp
Priority to CN201510359140.2A
Publication of CN104881395A
Legal status: Pending


Abstract

The invention belongs to the technical field of information processing and provides a method and system for obtaining the similarity of vectors in a matrix. The method comprises: preprocessing the matrix to be processed, i.e., removing the zero values and the values whose occurrence rate is lower than a preset occurrence-rate threshold; storing the values of the preprocessed matrix in a distributed system as row vectors; and calculating, by sampling, the similarity of any two row vectors in the distributed system according to the Map-Reduce model. The method and system greatly reduce the amount of similarity computation while maintaining the accuracy of the similarity calculation.

Description

A method and system for obtaining the similarity of vectors in a matrix
Technical field
The invention belongs to the field of algorithms, and in particular relates to a method and system for obtaining the similarity of vectors in a matrix.
Background
In everyday life, people often need to classify items; for example, computers, mobile phones, and MP3 players are classified as electronic products. For a computer to classify items, it must do so according to the similarity between them: two items whose similarity exceeds a preset threshold are assigned to the same category. The classification problem is thereby converted into a problem of computing similarity.
Suppose there are M different articles, each made up of L distinct words (for English articles; for Chinese articles, L distinct characters). To find the similarity of any two words $w_i$ and $w_j$ in these articles, the prior art forms the M articles into a matrix of M rows and N columns, where N is the number of words appearing in the M articles and N >= L. Large-scale data is generally processed with the Map-Reduce model. In the Map phase, if $w_i$ and $w_j$ both appear in an article, the pair is mapped as $((w_i, w_j) \to 1)\langle i \rangle$, where i numbers the mappings of $w_i$ and $w_j$ and takes values in (1, M); if $w_i$ and $w_j$ do not both appear in an article, the pair is mapped as $((w_i, w_j) \to 0)$. In the Reduce phase, the mappings are summed and the sum is divided by the square root of the product of the numbers of times $w_i$ and $w_j$ each occur in the M articles, which yields the similarity of $w_i$ and $w_j$. The complexity of the whole prior-art algorithm is $O(M \cdot N^2)$; when M is extremely large (for example $10^{10}$), the computational cost is considerable.
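The prior-art computation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation referred to in the text: the function name `cooccurrence_similarity` is hypothetical, each article is taken as a list of words, and a word's occurrence count is assumed to be the number of articles that contain it.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

def cooccurrence_similarity(articles):
    """Return {(w_i, w_j): similarity} for every pair of co-occurring words."""
    word_count = Counter()  # assumed meaning of #w: articles containing the word
    pair_count = Counter()  # number of articles containing both words
    for article in articles:
        words = sorted(set(article))
        word_count.update(words)
        # Map phase: emit ((w_i, w_j) -> 1) for each pair found in this article
        pair_count.update(combinations(words, 2))
    # Reduce phase: summed mappings divided by sqrt(#w_i * #w_j)
    return {
        pair: total / sqrt(word_count[pair[0]] * word_count[pair[1]])
        for pair, total in pair_count.items()
    }

sims = cooccurrence_similarity([["a", "b"], ["a", "b"], ["a", "c"]])
```

Every pair in every article is enumerated here, which is the source of the quadratic factor in N that the sampling approach is meant to avoid.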
Summary of the invention
In view of this, embodiments of the present invention provide a method and system for obtaining the similarity of vectors in a matrix, to solve the problem that similarity calculation in the prior art requires a huge amount of computation.
An embodiment of the present invention is implemented as a method for obtaining the similarity of vectors in a matrix, the method comprising the following steps:
Preprocessing the matrix to be processed, the preprocessing comprising: removing the zero values in the matrix and the values whose occurrence rate is lower than a preset occurrence-rate threshold;
Storing the values of the preprocessed matrix in a distributed system as row vectors;
Calculating, by sampling, the similarity of any two row vectors in the distributed system according to the Map-Reduce model.
Further, calculating, by sampling, the similarity of any two row vectors in the distributed system according to the Map-Reduce model comprises:
In the Map phase, extracting a vector pair $w_i$, $w_j$ by row from the distributed system;
Obtaining the Reduce probability of the vector pair $w_i$, $w_j$;
Mapping the vector pair $w_i$, $w_j$ as $((w_i, w_j) \to 1)$, and submitting the mapping to the Reduce phase of the Map-Reduce model according to the Reduce probability;
In the Reduce phase, obtaining the sum $\sum_{i=1}^{R} r_i$ for the vector pair $w_i$, $w_j$ and obtaining the similarity of the vector pair, where $r_i$ denotes the mapping count corresponding to $(w_i, w_j)$, R denotes the number of article records, and $\sum_{i=1}^{R} r_i$ denotes the sum of the mapping counts corresponding to $(w_i, w_j)$ over the R articles.
Further, obtaining the Reduce probability of the vector pair $w_i$, $w_j$ is specifically:
Obtaining the Reduce probability of the vector pair $w_i$, $w_j$ according to the formula [formula omitted in source], where p is the Reduce probability, e is a preset vector similarity threshold in the range (0, 1), and $\#w_i$ is the number of times vector $w_i$ occurs in the distributed system;
Obtaining, in the Reduce phase, the sum for the vector pair $w_i$, $w_j$ and the similarity of the vector pair is specifically: in the Reduce phase, obtaining the sum for the vector pair $w_i$, $w_j$ and obtaining the similarity of the vector pair according to the formula [formula omitted in source], where the pair is given as $((w_i, w_j), \langle r_1, \ldots, r_R \rangle)$.
Further, removing the values whose occurrence rate is lower than the preset occurrence-rate threshold comprises:
Sorting the remaining values in the matrix in descending order of occurrence frequency and removing the values in the last X positions, where X is an integer greater than zero.
Further, X is [number of remaining values * preset ratio] or int(number of remaining values * preset ratio).
Another object of the embodiments of the present invention is to provide a system for obtaining the similarity of vectors in a matrix, the system comprising:
A matrix preprocessing unit, configured to preprocess the matrix to be processed, the preprocessing comprising: removing the zero values in the matrix and the values whose occurrence rate is lower than a preset occurrence-rate threshold;
A matrix storage unit, configured to store the values of the matrix preprocessed by the matrix preprocessing unit in a distributed system as row vectors;
A similarity obtaining unit, configured to calculate, by sampling, the similarity of any two row vectors stored by the matrix storage unit according to the Map-Reduce model.
Further, the similarity obtaining unit comprises:
A vector pair extraction subunit, configured to extract, in the Map phase, a vector pair $w_i$, $w_j$ by row from the distributed system;
A Reduce probability obtaining subunit, configured to obtain the Reduce probability of the vector pair $w_i$, $w_j$ extracted by the vector pair extraction subunit;
A mapping submission subunit, configured to map the vector pair $w_i$, $w_j$ as $((w_i, w_j) \to 1)$ and to submit the mapping to the Reduce phase of the Map-Reduce model according to the Reduce probability obtained by the Reduce probability obtaining subunit;
A similarity obtaining subunit, configured to obtain, in the Reduce phase to which the mapping submission subunit submits the mapping, the sum $\sum_{i=1}^{R} r_i$ for the vector pair $w_i$, $w_j$ and the similarity of the vector pair, where $r_i$ denotes the mapping count corresponding to $(w_i, w_j)$, R denotes the number of article records, and $\sum_{i=1}^{R} r_i$ denotes the sum of the mapping counts corresponding to $(w_i, w_j)$ over the R articles.
Further, the Reduce probability obtaining subunit comprises:
A Reduce probability calculation subunit, configured to obtain the Reduce probability of the vector pair $w_i$, $w_j$ according to the formula [formula omitted in source], where p is the Reduce probability, e is a preset vector similarity threshold in the range (0, 1), and $\#w_i$ is the number of times vector $w_i$ occurs in the distributed system;
The similarity obtaining subunit comprises:
A similarity calculation subunit, configured to obtain, in the Reduce phase, the sum for the vector pair $w_i$, $w_j$ and to obtain the similarity of the vector pair according to the formula [formula omitted in source], where the pair is given as $((w_i, w_j), \langle r_1, \ldots, r_R \rangle)$.
Further, the matrix preprocessing unit is specifically configured to sort the remaining values in the matrix in descending order of occurrence frequency and remove the values in the last N positions, where N is an integer greater than zero.
Further, N is [number of remaining values * preset ratio] or int(number of remaining values * preset ratio).
Compared with the prior art, the embodiments of the present invention have the following beneficial effect: the matrix to be processed is preprocessed, the values of the preprocessed matrix are stored in a distributed system as row vectors, and the similarity of any two row vectors in the distributed system is calculated by sampling according to the Map-Reduce model. Preprocessing reduces the number of values in the matrix, and sampling reduces the complexity of the Map-Reduce computation, so the amount of computation in the similarity calculation is greatly reduced while its accuracy is maintained.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of the method for obtaining the similarity of vectors in a matrix provided by an embodiment of the present invention;
Fig. 2a is a schematic diagram of the matrix to be processed provided by an embodiment of the present invention;
Fig. 2b is a schematic diagram of the matrix after the zero values are removed, provided by an embodiment of the present invention;
Fig. 2c is a schematic diagram of the matrix after the values below the preset occurrence-rate threshold are removed, provided by an embodiment of the present invention;
Fig. 3 is a structural diagram of the system for obtaining the similarity of vectors in a matrix provided by an embodiment of the present invention.
Detailed description
To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain the invention, not to limit it.
The technical solutions of the invention are illustrated below through specific embodiments.
Embodiment one
Fig. 1 shows the flowchart of the method for obtaining the similarity of vectors in a matrix provided by an embodiment of the present invention. The method comprises the following steps:
Step S101: preprocess the matrix to be processed, the preprocessing comprising: removing the zero values in the matrix and the values whose occurrence rate is lower than a preset occurrence-rate threshold.
In embodiments of the present invention, the matrix to be processed is generally a sparse matrix, so it can be preprocessed. Specifically, the preprocessing may be: removing the zero values in the matrix and the values whose occurrence rate is lower than the preset occurrence-rate threshold.
Fig. 2a shows the matrix to be processed, Fig. 2b shows the matrix after the zero values are removed, and Fig. 2c shows the matrix after the values occurring fewer than 3 times are removed. As Fig. 2c shows, the preprocessing effectively reduces the number of values in the matrix.
Preferably, after the zero values in the matrix to be processed are removed, removing the values whose occurrence rate is lower than the preset occurrence-rate threshold may comprise:
Sorting the remaining values in the matrix in descending order of occurrence frequency and removing the values in the last X positions, where X is an integer greater than zero.
Preferably, X is [number of remaining values * preset ratio] or int(number of remaining values * preset ratio), i.e., the result is rounded to an integer. Note that the "ratio" in the preset ratio differs from a "number": a number is a static value, whereas a ratio is a dynamic value. For example, if a number is set and it is 5, and only 5 values remain in the matrix, then all values in the matrix would be removed. If instead a ratio is set and it is 50%, and only 5 values remain in the matrix, then 5 * 50% = 2.5, which is 2 after rounding down, so only the values in the last 2 positions are removed.
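The ratio-based trimming can be sketched as follows. This is a minimal illustration, assuming that "remaining values" means the distinct non-zero values ranked by how often they occur; the function name `trim_low_frequency` and the flat list input are hypothetical and do not come from the text.

```python
from collections import Counter

def trim_low_frequency(values, ratio):
    """Drop zeros, then drop the X least frequent distinct values, X = int(n * ratio)."""
    nonzero = [v for v in values if v != 0]
    # Rank distinct values by occurrence frequency, most frequent first
    ranked = [v for v, _ in Counter(nonzero).most_common()]
    x = int(len(ranked) * ratio)  # e.g. int(5 * 0.5) == 2
    dropped = set(ranked[len(ranked) - x:])  # the last X positions
    return [v for v in nonzero if v not in dropped]

kept = trim_low_frequency([0, 7, 7, 7, 8, 8, 8, 9, 9, 4, 5, 0], 0.5)
```

With five distinct non-zero values and a ratio of 50%, X is 2, so the two least frequent values (4 and 5) are removed and the rest are kept, matching the worked example above.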
Step S102: store the values of the preprocessed matrix in a distributed system as row vectors.
In embodiments of the present invention, in order to use the Map-Reduce model to obtain the similarity of vectors in the matrix to be processed, the preprocessed values need to be stored in the distributed system as row vectors. For example, the distributed system is HDFS (Hadoop Distributed File System).
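One way to picture row-vector storage is to serialize each row as a single text record, so that a Map task can later read the matrix one row at a time. The sketch below is an assumption for illustration only: the record layout (`row_id<TAB>col:value,...`) and the function name `to_row_records` are hypothetical, and writing the records to HDFS is left out.

```python
def to_row_records(matrix):
    """Serialize each row as 'row_id<TAB>col:value,...', keeping non-zero entries only."""
    records = []
    for row_id, row in enumerate(matrix):
        cells = [f"{col}:{val}" for col, val in enumerate(row) if val != 0]
        if cells:  # rows that are entirely zero produce no record
            records.append(f"{row_id}\t" + ",".join(cells))
    return records

records = to_row_records([[0, 2, 0], [0, 0, 0], [1, 0, 3]])
```

Keeping only the non-zero entries fits the observation that the matrix is generally sparse.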
Step S103: obtain, by calculation according to the Map-Reduce model, the similarity of any two row vectors in the distributed system.
In embodiments of the present invention, obtaining, by calculation according to the Map-Reduce model, the similarity of any two row vectors in the distributed system may comprise:
1. In the Map phase, extract a vector pair $w_i$, $w_j$ by row from the distributed system.
In embodiments of the present invention, two vectors $w_i$, $w_j$ are first extracted arbitrarily, by row, from the distributed system. These vectors $w_i$, $w_j$ form the extracted vector pair, and the process of extracting the vector pair is the sampling.
2. Obtain the Reduce probability of the vector pair $w_i$, $w_j$.
In embodiments of the present invention, the Reduce probability of the vectors is calculated by the formula [formula omitted in source]; that is, after the two vectors are mapped, the mapping is submitted to the Reduce phase with probability p. In this formula, p is the Reduce probability, e is a preset vector similarity threshold in the range (0, 1), and #w is the number of times vector w occurs in the distributed system. For example, if $w_i$ occurs 5 times in the second matrix, then $\#w_i$ is 5; if $w_j$ occurs 6 times in the second matrix, then $\#w_j$ is 6. That is, obtaining the Reduce probability of the vector pair $w_i$, $w_j$ is specifically:
Obtaining the Reduce probability of the vector pair $w_i$, $w_j$ according to the formula [formula omitted in source], where p is the Reduce probability, e is a preset vector similarity threshold in the range (0, 1), and $\#w_i$ is the number of times vector $w_i$ occurs in the distributed system.
3. Map the vector pair $w_i$, $w_j$ as $((w_i, w_j) \to 1)$, and submit the mapping to the Reduce phase of the Map-Reduce model according to the Reduce probability.
In embodiments of the present invention, the vector pair $w_i$, $w_j$ is mapped as $((w_i, w_j) \to 1)$, and the mapping $((w_i, w_j) \to 1)$ is submitted to the Reduce phase according to the Reduce probability p above. That is, if the calculated p is 0.6, the mapping $((w_i, w_j) \to 1)$ is submitted to the Reduce phase of the Map-Reduce model with a probability of 60%.
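The probabilistic submission in this step can be sketched as a small generator. This is an illustration under stated assumptions: the name `map_with_sampling` is hypothetical, and because the formula for p is not reproduced in this text, the Reduce probability is passed in as a caller-supplied function `reduce_probability` instead of being computed from e and the occurrence counts.

```python
import random
from itertools import combinations

def map_with_sampling(row, reduce_probability, rng=None):
    """Yield ((w_i, w_j), 1) for each pair in the row, kept with probability p."""
    if rng is None:
        rng = random.Random()
    for w_i, w_j in combinations(sorted(set(row)), 2):
        p = reduce_probability(w_i, w_j)  # assumed to lie in [0, 1]
        # rng.random() is uniform on [0, 1), so the mapping is kept with probability p
        if rng.random() < p:
            yield (w_i, w_j), 1

emitted = list(map_with_sampling(["a", "b", "c"], lambda a, b: 0.6))
```

With p fixed at 0.6, roughly 60% of the candidate mappings reach the Reduce phase over many rows; the rest are discarded in the Map phase, which is where the reduction in computation comes from.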
4. In the Reduce phase, obtain the sum $\sum_{i=1}^{R} r_i$ for the vector pair $w_i$, $w_j$ and obtain the similarity of the vector pair, where $r_i$ denotes the mapping count corresponding to $(w_i, w_j)$, R denotes the number of article records, and $\sum_{i=1}^{R} r_i$ denotes the sum of the mapping counts corresponding to $(w_i, w_j)$ over the R articles.
In embodiments of the present invention, the similarity of the vector pair is obtained according to the formula [formula omitted in source], where the pair is given as $((w_i, w_j), \langle r_1, \ldots, r_R \rangle)$, which indicates that $(w_i, w_j)$ comes from the R articles $\langle r_1, \ldots, r_R \rangle$.
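The summation part of the Reduce phase can be sketched as below. The function name `reduce_sum` is hypothetical, and the sketch deliberately stops at the per-pair sum of $r_1, \ldots, r_R$: the formula that turns this sum into a similarity is not reproduced in this text, so no final scaling is applied.

```python
from collections import defaultdict

def reduce_sum(mapped):
    """Group submitted mappings by vector pair and sum r_1..r_R for each pair."""
    totals = defaultdict(int)
    for pair, r in mapped:
        totals[pair] += r
    return dict(totals)

totals = reduce_sum([(("a", "b"), 1), (("a", "c"), 1), (("a", "b"), 1)])
```

Each key corresponds to one record of the form $((w_i, w_j), \langle r_1, \ldots, r_R \rangle)$ described above, with the list already collapsed into its sum.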
Note that, before the step of obtaining the Reduce probability of the vector pair according to the formula, the method further comprises the following step:
Presetting the vector similarity threshold e.
The embodiment of the present invention first reduces the number of values in the matrix through preprocessing and then reduces the complexity of the Map-Reduce computation through sampling: the computational complexity of the Map phase is reduced to O(DN log(D)/e) and that of the Reduce phase to O(log(D)/e). The amount of computation in the similarity calculation is thus greatly reduced while its accuracy is maintained.
Embodiment two
Fig. 3 shows the structural diagram of the system for obtaining the similarity of vectors in a matrix provided by an embodiment of the present invention. For ease of description, only the parts relevant to the embodiment are shown.
The system for obtaining the similarity of vectors in a matrix may be a software unit, a hardware unit, or a combined software and hardware unit built into an intelligent terminal (for example a mobile phone, a tablet computer, or a smart TV). The system comprises:
A matrix preprocessing unit 301, configured to preprocess the matrix to be processed, the preprocessing comprising: removing the zero values in the matrix and the values whose occurrence rate is lower than a preset occurrence-rate threshold.
In embodiments of the present invention, the threshold can be set according to actual needs and is not limited here. For example, the threshold is int(number of remaining values * 50%).
A matrix storage unit 302, configured to store the values of the matrix preprocessed by the matrix preprocessing unit 301 in a distributed system as row vectors.
For example, the distributed system is HDFS (Hadoop Distributed File System).
A similarity obtaining unit 303, configured to calculate, by sampling, the similarity of any two row vectors stored by the matrix storage unit 302 according to the Map-Reduce model.
Further, the similarity obtaining unit 303 comprises:
A vector pair extraction subunit 3031, configured to extract, in the Map phase, a vector pair $w_i$, $w_j$ by row from the distributed system;
A Reduce probability obtaining subunit 3032, configured to obtain the Reduce probability of the vector pair $w_i$, $w_j$ extracted by the vector pair extraction subunit 3031.
In embodiments of the present invention, the Reduce probability of the vectors is calculated by the formula [formula omitted in source]; that is, after the two vectors are mapped, the mapping is submitted to the Reduce phase with probability p. In this formula, p is the Reduce probability, e is a preset vector similarity threshold in the range (0, 1), and #w is the number of times vector w occurs in the distributed system. For example, if $w_i$ occurs 5 times in the second matrix, then $\#w_i$ is 5; if $w_j$ occurs 6 times in the second matrix, then $\#w_j$ is 6. The Reduce probability obtaining subunit 3032 comprises:
A Reduce probability calculation subunit 30321, configured to obtain the Reduce probability of the vector pair $w_i$, $w_j$ according to the formula [formula omitted in source], where p is the Reduce probability, e is a preset vector similarity threshold in the range (0, 1), and $\#w_i$ is the number of times vector $w_i$ occurs in the distributed system.
A mapping submission subunit 3033, configured to map the vector pair $w_i$, $w_j$ as $((w_i, w_j) \to 1)$ and to submit the mapping to the Reduce phase of the Map-Reduce model according to the Reduce probability obtained by the Reduce probability obtaining subunit 3032.
In embodiments of the present invention, the vector pair $w_i$, $w_j$ is mapped as $((w_i, w_j) \to 1)$, and the mapping $((w_i, w_j) \to 1)$ is submitted to the Reduce phase according to the Reduce probability p above. That is, if the calculated p is 0.6, the mapping $((w_i, w_j) \to 1)$ is submitted to the Reduce phase of the Map-Reduce model with a probability of 60%.
A similarity obtaining subunit 3034, configured to obtain, in the Reduce phase to which the mapping submission subunit 3033 submits the mapping, the sum $\sum_{i=1}^{R} r_i$ for the vector pair $w_i$, $w_j$ and the similarity of the vector pair, where $r_i$ denotes the mapping count corresponding to $(w_i, w_j)$, R denotes the number of article records, and $\sum_{i=1}^{R} r_i$ denotes the sum of the mapping counts corresponding to $(w_i, w_j)$ over the R articles.
In embodiments of the present invention, the similarity obtaining subunit 3034 comprises:
A similarity calculation subunit 30341, configured to obtain, in the Reduce phase, the sum for the vector pair $w_i$, $w_j$ and to obtain the similarity of the vector pair according to the formula [formula omitted in source], where the pair is given as $((w_i, w_j), \langle r_1, \ldots, r_R \rangle)$, which indicates that $(w_i, w_j)$ comes from the R articles $\langle r_1, \ldots, r_R \rangle$.
Further, the matrix preprocessing unit 301 may specifically be configured to sort the remaining values in the matrix in descending order of occurrence frequency and remove the values in the last N positions, where N is an integer greater than zero.
Preferably, N is [number of remaining values * preset ratio] or int(number of remaining values * preset ratio).
In summary, the embodiment of the present invention preprocesses the matrix to be processed, stores the values of the preprocessed matrix in a distributed system as row vectors, and calculates, by sampling, the similarity of any two row vectors in the distributed system according to the Map-Reduce model. Compared with the prior art, the embodiment reduces the number of values in the matrix through preprocessing and reduces the complexity of the Map-Reduce computation through sampling, so the amount of computation in the similarity calculation is greatly reduced while its accuracy is maintained, which makes the approach easy to use and practical.
Those skilled in the art will clearly understand that, for convenience and brevity of description, only the above division into functional units and subunits is given as an example. In practice, the above functions may be allocated to different functional units or modules as needed; that is, the internal structure of the system may be divided into different functional units or modules to perform all or part of the functions described above. The functional units and subunits in the embodiments may be integrated into one processing unit, may each exist physically on their own, or two or more units may be integrated into one unit; the integrated unit may be implemented in hardware or as a software functional unit. In addition, the specific names of the functional units and subunits are only for distinguishing them from one another and do not limit the scope of protection of this application. For the specific working process of the units and subunits in the above system, refer to the corresponding process in the foregoing method embodiment; it is not repeated here.
A person of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in the embodiments disclosed herein can be implemented in electronic hardware or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or in software depends on the specific application and the design constraints of the technical solution. Skilled practitioners may implement the described functions differently for each specific application, but such implementations should not be regarded as going beyond the scope of the present invention.
It should be understood that the system and method disclosed in the embodiments provided by the present invention may be implemented in other ways. For example, the system embodiment described above is merely illustrative: the division into modules or units is only a division by logical function, and other divisions are possible in an actual implementation; for instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Furthermore, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or of another form.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, may each exist physically on their own, or two or more units may be integrated into one unit. The integrated unit may be implemented in hardware or as a software functional unit.
If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the embodiments of the present invention, in essence or in the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform all or part of the steps of the methods described in the embodiments of the present invention. The storage medium includes any medium that can store program code, such as a USB flash drive, a portable hard drive, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.
The above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for obtaining the similarity of vectors in a matrix, characterized in that the method comprises:
Preprocessing the matrix to be processed, the preprocessing comprising: removing the zero values in the matrix and the values whose occurrence rate is lower than a preset occurrence-rate threshold;
Storing the values of the preprocessed matrix in a distributed system as row vectors;
Calculating, by sampling, the similarity of any two row vectors in the distributed system according to the Map-Reduce model.
2. The method of claim 1, characterized in that calculating, by sampling, the similarity of any two row vectors in the distributed system according to the Map-Reduce model comprises:
In the Map phase, extracting a vector pair $w_i$, $w_j$ by row from the distributed system;
Obtaining the Reduce probability of the vector pair $w_i$, $w_j$;
Mapping the vector pair $w_i$, $w_j$ as $((w_i, w_j) \to 1)$, and submitting the mapping to the Reduce phase of the Map-Reduce model according to the Reduce probability;
In the Reduce phase, obtaining the sum $\sum_{i=1}^{R} r_i$ for the vector pair $w_i$, $w_j$ and obtaining the similarity of the vector pair, where $r_i$ denotes the mapping count corresponding to $(w_i, w_j)$, R denotes the number of article records, and $\sum_{i=1}^{R} r_i$ denotes the sum of the mapping counts corresponding to $(w_i, w_j)$ over the R articles.
3. The method of claim 2, characterized in that obtaining the Reduce probability of the vector pair $w_i$, $w_j$ is specifically:
Obtaining the Reduce probability of the vector pair $w_i$, $w_j$ according to the formula [formula omitted in source], where p is the Reduce probability, e is a preset vector similarity threshold in the range (0, 1), and $\#w_i$ is the number of times vector $w_i$ occurs in the distributed system;
Obtaining, in the Reduce phase, the sum for the vector pair $w_i$, $w_j$ and the similarity of the vector pair is specifically: in the Reduce phase, obtaining the sum for the vector pair $w_i$, $w_j$ and obtaining the similarity of the vector pair according to the formula [formula omitted in source], where the pair is given as $((w_i, w_j), \langle r_1, \ldots, r_R \rangle)$.
4. The method of claim 1, characterized in that removing the values whose occurrence rate is lower than the preset occurrence-rate threshold comprises:
Sorting the remaining values in the matrix in descending order of occurrence frequency and removing the values in the last X positions, where X is an integer greater than zero.
5. The method of claim 4, characterized in that X is [number of remaining values * preset ratio] or int(number of remaining values * preset ratio).
6. A system for obtaining the similarity of vectors in a matrix, characterized in that the system comprises:
A matrix preprocessing unit, configured to preprocess the matrix to be obtained, the preprocessing comprising: removing the zero values in the matrix to be obtained and the values whose occurrence rate is lower than a preset occurrence-rate threshold;
A matrix storage unit, configured to store the values of the matrix preprocessed by the matrix preprocessing unit in a distributed system as row vectors;
A similarity obtaining unit, configured to calculate, by sampling according to a Map-Reduce model, the similarity of any two row vectors stored by the matrix storage unit.
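As a rough illustration of what the matrix preprocessing and storage units hand to the similarity obtaining unit, the sketch below drops zero values and keeps each row as a sparse row vector. The function name `to_row_vectors` is hypothetical, and a plain dictionary stands in for the distributed system, whose storage format the claim does not specify.

```python
def to_row_vectors(matrix):
    """Keep only the non-zero entries of each row as (column, value) pairs."""
    return {
        row_index: [(col, value) for col, value in enumerate(row) if value != 0]
        for row_index, row in enumerate(matrix)
    }


matrix = [[0, 2, 0], [1, 0, 3], [0, 0, 0]]
# An in-memory dict is an assumed stand-in for the distributed store.
row_store = to_row_vectors(matrix)
```

Storing one row per record lets a Map task process each row independently, which fits the row-wise pair extraction described in the following claim.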
7. The system as claimed in claim 6, characterized in that the similarity obtaining unit comprises:
A vector pair extraction subunit, configured to extract, in the Map stage, a vector pair $(w_i, w_j)$ row by row from the distributed system;
A Reduce probability obtaining subunit, configured to obtain the Reduce probability of the vector pair $(w_i, w_j)$ extracted by the vector pair extraction subunit;
A mapping and submitting subunit, configured to map the vector pair $(w_i, w_j)$ as $((w_i, w_j) \to 1)$, and to submit the mapping to the Reduce stage of the Map-Reduce model according to the Reduce probability obtained by the Reduce probability obtaining subunit;
A similarity obtaining subunit, configured to, in the Reduce stage to which the mapping and submitting subunit submits the mapping, obtain the vector pair $(w_i, w_j)$ and $\sum_{i=1}^{R} r_i$, and obtain the similarity of the vector pair, where $r_i$ denotes the mapping count corresponding to $(w_i, w_j)$, $R$ denotes the number of article records, and $\sum_{i=1}^{R} r_i$ denotes the sum of the mapping counts corresponding to $(w_i, w_j)$ over the $R$ articles.
8. The system as claimed in claim 7, characterized in that the Reduce probability obtaining subunit comprises:
A Reduce probability calculation subunit, configured to obtain the Reduce probability of the vector pair $(w_i, w_j)$ according to the formula (formula not reproduced in this text), where p is the Reduce probability, e is a preset vector similarity threshold with a value range of (0, 1), and $\#w_i$ is the number of times the vector $w_i$ occurs in the distributed system;
The similarity obtaining subunit comprises:
A similarity calculation subunit, configured to, in the Reduce stage, obtain the vector pair $(w_i, w_j)$ and $\sum_{i=1}^{R} r_i$, and obtain the similarity of the vector pair according to the formula (formula not reproduced in this text), for the input $((w_i, w_j), \langle r_1, \ldots, r_R \rangle)$.
9. The system as claimed in claim 6, characterized in that the matrix preprocessing unit is specifically configured to sort the remaining values in the matrix in descending order of occurrence frequency and to remove the values ranked in the last N positions, where N is an integer greater than zero.
10. The system as claimed in claim 9, characterized in that N is [number of remaining values × preset ratio] or int(number of remaining values × preset ratio).
CN201510359140.2A 2015-06-25 2015-06-25 Method and system for obtaining similarity of vectors in matrix Pending CN104881395A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510359140.2A CN104881395A (en) 2015-06-25 2015-06-25 Method and system for obtaining similarity of vectors in matrix

Publications (1)

Publication Number Publication Date
CN104881395A true CN104881395A (en) 2015-09-02

Family

ID=53948890

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510359140.2A Pending CN104881395A (en) 2015-06-25 2015-06-25 Method and system for obtaining similarity of vectors in matrix

Country Status (1)

Country Link
CN (1) CN104881395A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108241868A (en) * 2016-12-26 2018-07-03 浙江宇视科技有限公司 The objective similarity of image is to the mapping method and device of subjective similarity
CN108241868B (en) * 2016-12-26 2021-02-02 浙江宇视科技有限公司 Method and device for mapping objective similarity to subjective similarity of image
CN108052485A (en) * 2017-12-15 2018-05-18 东软集团股份有限公司 the distributed computing method and device of vector similarity, storage medium and node
CN108052485B (en) * 2017-12-15 2021-05-07 东软集团股份有限公司 Distributed computing method and device for vector similarity, storage medium and node

Similar Documents

Publication Publication Date Title
CN102799647B (en) Method and device for webpage reduplication deletion
WO2020147488A1 (en) Method and device for identifying irregular group
CN108710613A (en) Acquisition methods, terminal device and the medium of text similarity
US11907659B2 (en) Item recall method and system, electronic device and readable storage medium
CN102866954B (en) The method of Memory Allocation and device
CN110851598A (en) Text classification method and device, terminal equipment and storage medium
CN109960612B (en) Method, device and server for determining data storage ratio
CN110717040A (en) Dictionary expansion method and device, electronic equipment and storage medium
CN104317850A (en) Data processing method and device
CN111415196A (en) Advertisement recall method, device, server and storage medium
CN104881395A (en) Method and system for obtaining similarity of vectors in matrix
CN105159927A (en) Method and device for selecting subject term of target text and terminal
CN109657060B (en) Safety production accident case pushing method and system
CN114970553B (en) Information analysis method and device based on large-scale unmarked corpus and electronic equipment
US11709798B2 (en) Hash suppression
CN104991920A (en) Label generation method and apparatus
CN105512270A (en) Method and device for determining related objects
CN104463627A (en) Data processing method and device
CN114444514A (en) Semantic matching model training method, semantic matching method and related device
CN114461837A (en) Image processing method and device and electronic equipment
CN112667770A (en) Method and device for classifying articles
CN114417808B (en) Article generation method and device, electronic equipment and storage medium
CN105488022A (en) Text characteristic extraction system and method
CN113360602A (en) Method, apparatus, device and storage medium for outputting information
CN113535968A (en) Method and device for extracting key attributes of data

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20150902