Summary of the invention
In view of the above problems, the present invention is proposed to provide a method and a device for determining associated news that overcome the above problems, or at least partially solve or mitigate them.
According to one aspect of the invention, a method for determining associated news is provided, comprising: selecting a news item as the benchmark news of a news category; calculating the distance between other news items and the benchmark news; and, when the distance between another news item and the benchmark news is not greater than a set threshold, determining that the other news item is associated news of that news category.
Optionally, in the method for determining associated news according to an embodiment of the invention, the news comprises a news headline, a news summary, or the full news text.
Optionally, in the method for determining associated news according to an embodiment of the invention, the distance is determined by the intersection of the feature vector of the other news item and the feature vector of the benchmark news.
Optionally, in the method for determining associated news according to an embodiment of the invention, the distance is determined by the inner product of, or the cosine of the angle between, the feature vector of the other news item and the feature vector of the benchmark news.
Optionally, in the method for determining associated news according to an embodiment of the invention, the distance is determined by the minimum hash value of the feature vector of the other news item and the minimum hash value of the feature vector of the benchmark news.
Optionally, in the method for determining associated news according to an embodiment of the invention, the feature vector is formed as follows: word segmentation is performed on the news to form a word sequence; the word sequence is rearranged in descending order of the frequency with which each word occurs in it; and a preset number of words, taken from the front of the rearranged sequence, is used as the feature vector of the news.
Optionally, in the method for determining associated news according to an embodiment of the invention, after word segmentation is performed on the news, useless-information removal is further performed to re-form the word sequence before it is sorted.
Optionally, in the method for determining associated news according to an embodiment of the invention, the news items already determined to belong to the news category are ranked by at least one of the following factors: click rate, news reprint rate, and number of comments; the news item ranked first is taken as the benchmark news.
According to another aspect of the invention, a device for determining associated news is provided, comprising: a selecting device, configured to select a news item as the benchmark news of a news category; a distance calculating device, configured to calculate the distance between other news items and the benchmark news; and an associated news determining device, configured to determine, when the distance between another news item and the benchmark news is not greater than a set threshold, that the other news item is associated news of the news category.
Optionally, in the device for determining associated news according to an embodiment of the invention, the news comprises a news headline, a news summary, or the full news text.
Optionally, in the device for determining associated news according to an embodiment of the invention, the distance is determined by the intersection of the feature vector of the other news item and the feature vector of the benchmark news.
Optionally, in the device for determining associated news according to an embodiment of the invention, the distance is determined by the inner product of, or the cosine of the angle between, the feature vector of the other news item and the feature vector of the benchmark news.
Optionally, in the device for determining associated news according to an embodiment of the invention, the distance is determined by the minimum hash value of the feature vector of the other news item and the minimum hash value of the feature vector of the benchmark news.
Optionally, in the device for determining associated news according to an embodiment of the invention, the distance calculating device further comprises a feature vector forming device, configured to form a word sequence after word segmentation is performed on the news, rearrange the word sequence in descending order of the frequency with which each word occurs in it, and take a preset number of words from the front of the rearranged sequence as the feature vector of the news.
Optionally, in the device for determining associated news according to an embodiment of the invention, the distance calculating device further comprises a useless-information processing device, configured to perform useless-information removal on the word sequence obtained by word segmentation, so as to re-form the word sequence before it is sorted.
Optionally, in the device for determining associated news according to an embodiment of the invention, the selecting device is configured to rank the news items already determined to belong to the news category by at least one of the following factors: click rate, news reprint rate, and number of comments; and to take the news item ranked first as the benchmark news.
The beneficial effects of the invention are as follows: the method and device for determining associated news can effectively reduce the amount of computation spent on relatedness calculation when clustering news articles, and can improve the speed and efficiency with which associated news is determined.
The above is only an overview of the technical solution of the invention. So that the technical means of the invention can be understood more clearly and implemented in accordance with the contents of the specification, and so that the above and other objects, features, and advantages of the invention are easier to understand, specific embodiments of the invention are set out below.
Specific embodiment
The invention is further described below with reference to the accompanying drawings and specific embodiments.
Those skilled in the art will understand that, unless expressly stated otherwise, the singular forms "a", "an", "said", and "the" used herein may also include the plural. It should be further understood that the word "comprising" used in this specification refers to the presence of the stated features, integers, steps, operations, elements, and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It should be understood that when an element is referred to as being "connected" or "coupled" to another element, it may be directly connected or coupled to the other element, or intervening elements may be present. In addition, "connected" or "coupled" as used herein may include wireless connection or wireless coupling. The term "and/or" as used herein includes any one of, and all combinations of, one or more of the associated listed items.
Those skilled in the art will understand that, unless otherwise defined, all terms used herein (including technical and scientific terms) have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It should also be understood that terms such as those defined in general dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the prior art and, unless specifically defined as they are here, should not be interpreted in an idealized or overly formal sense.
In the present invention, clustering refers to the process of dividing a set of physical or abstract objects into multiple classes, each composed of similar objects. A cluster produced by clustering is a set of data objects that are similar to one another within the same cluster and different from the objects in other clusters.
Referring to Fig. 1, which shows a method for determining associated news provided by a specific embodiment of the invention, the method comprises: step 110, selecting a news item as the benchmark news of a news category; step 120, calculating the distance between other news items and the benchmark news; and step 130, when the distance between another news item and the benchmark news is not greater than a set threshold, determining that the other news item is associated news of the news category.
Step 110: select a news item as the benchmark news of a news category.
In one embodiment of the invention, the benchmark news may be selected by ranking the news items that have already been clustered. The clustered news items are ranked by one of news click rate, news reprint rate, and number of news comments, and the news item ranked first is selected as the benchmark news.
In another embodiment of the invention, the benchmark news may likewise be selected by ranking the clustered news items, but the ranking uses several of the factors news click rate, news reprint rate, and number of news comments together. The news item ranked first is selected as the benchmark news.
In another embodiment of the invention, the news items already determined to belong to the news category are ranked by at least one of the following factors: click rate, news reprint rate, and number of comments; the news item ranked first is taken as the benchmark news. Optionally, the ranked news items are further screened: a news item whose time since publication exceeds a certain time threshold is not selected as the benchmark news.
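The ranking-based selection described in the embodiments above can be sketched as follows. This is a minimal illustration, not the implementation of the invention: the `NewsItem` fields, the order in which the factors are compared, and the `max_age_hours` parameter are all assumptions introduced for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NewsItem:
    # Hypothetical record; field names are assumptions for illustration.
    title: str
    click_rate: float
    reprint_rate: float
    comment_count: int
    age_hours: float  # time since publication


def select_benchmark(items: List[NewsItem],
                     max_age_hours: Optional[float] = None) -> Optional[NewsItem]:
    """Rank clustered news and return the top item that passes the age screen."""
    # Rank by several factors at once; comparing click rate first is an assumption.
    ranked = sorted(
        items,
        key=lambda n: (n.click_rate, n.reprint_rate, n.comment_count),
        reverse=True,
    )
    for item in ranked:
        # Optional screening: skip items published too long ago.
        if max_age_hours is None or item.age_hours <= max_age_hours:
            return item
    return None  # no eligible item
```

Ranking by a single factor corresponds to a key with only one element; the age screen is skipped when `max_age_hours` is left as `None`.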
In another embodiment of the invention, M clustered news items are selected at random and the distances among the M items are calculated; the item whose sum of distances to the other M-1 items is smallest is selected as the benchmark news. Because M is small relative to the whole news category, this step has no appreciable effect on the computational efficiency of the method and device of the invention. Optionally, the news items, sorted in ascending order of their sum of distances, are further screened: a news item whose time since publication exceeds a certain time threshold is not selected as the benchmark news.
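The sampling-based selection above amounts to picking the medoid of a random sample. A minimal sketch follows, assuming the caller supplies the distance function and that the distance from an item to itself is zero; the function name `select_medoid` and its parameters are hypothetical.

```python
import random
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def select_medoid(items: Sequence[T],
                  distance: Callable[[T, T], float],
                  m: int,
                  rng: Optional[random.Random] = None) -> T:
    """Sample up to m items and return the one closest, in total, to the others."""
    if not items or m < 1:
        raise ValueError("need at least one item and m >= 1")
    rng = rng or random.Random()
    sample = rng.sample(list(items), min(m, len(items)))

    def total_distance(candidate: T) -> float:
        # Includes the candidate itself, which contributes zero by assumption.
        return sum(distance(candidate, other) for other in sample)

    return min(sample, key=total_distance)
```

The cost is on the order of M squared distance computations, which stays small as long as M is small compared with the size of the category.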
Step 120: calculate the distance between the other news items and the benchmark news.
Specifically, step 120 optionally includes the following steps (see Fig. 2):
Step 1201: perform word segmentation on the news.
In this embodiment, word segmentation may be performed first to obtain a set of words. The words obtained by segmentation include keywords such as "Ma Yili", "new film", and "scale", and also include useless information.
Step 1202: perform useless-information removal on the words obtained by word segmentation.
Useless information can be divided into punctuation marks and words that carry no meaning of their own in Chinese, such as structural particles and function words. In this embodiment of the invention, useless-information removal may further be performed on the segmented words after word segmentation.
Step 1203: select representative words to form the feature vector of the news.
Optionally, the words obtained after useless-information removal may be used directly as the feature vector of the news. Alternatively, representative words may be extracted from the words obtained after useless-information removal to form the feature vector of the news.
For example, for a news report web page, word segmentation and useless-information removal yield a word sequence S = (s1, s2, s3, ..., sN), where s1, s2, s3, and so on denote the words remaining after segmentation and useless-information removal.
The same word may occur more than once in the word sequence S. Word-frequency statistics can therefore be computed over the words in the sequence, the words arranged in descending order of their number of occurrences, and a preset number of words taken from the front as the feature vector of the news text.
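Steps 1201 to 1203 can be sketched as below. This is an illustration under stated assumptions: a simple regular-expression tokenizer stands in for a real Chinese word segmenter, the `STOP_WORDS` set is a hypothetical list of useless words, and all function names are invented for the example.

```python
import re
from collections import Counter
from typing import Iterable, List

# Hypothetical list of "useless" words; a real system would use a curated list.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is"}


def tokenize(text: str) -> List[str]:
    # Stand-in for word segmentation; \w+ also discards punctuation marks.
    return re.findall(r"\w+", text.lower())


def remove_useless(tokens: Iterable[str]) -> List[str]:
    # Useless-information removal: drop function words.
    return [t for t in tokens if t not in STOP_WORDS]


def feature_vector(text: str, k: int) -> List[str]:
    """Return the k most frequent remaining words, most frequent first."""
    counts = Counter(remove_useless(tokenize(text)))
    # most_common sorts by descending count; ties keep first-seen order.
    return [word for word, _ in counts.most_common(k)]
```

Capping the vector at a preset number of words `k` bounds the cost of every later distance computation, whatever the length of the article.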
Step 1204: calculate the distance between the other news items and the benchmark news from the feature vectors of the news.
Optionally, let Si be the feature vector of another news item and Sp the feature vector of the benchmark news. The distance D between the other news item and the benchmark news is then given by:
D = 1 - |Si ∩ Sp| / |Si ∪ Sp|    (1)
That is, the distance is 1 minus the ratio of the size of the intersection of Si and Sp to the size of the union of Si and Sp.
For example, suppose the feature vector Sp of the benchmark news is (Ma Yili's new film is large in scale; the workplace elder-sister look must be worn this way) and the feature vector S1 of a first other news item is (Ma Yili's new film is large in scale; several passionate scenes). The intersection of Sp and S1 has 4 elements and their union has 17 (the element counts refer to the original-language feature vectors), so the distance is 1 - 4/17 ≈ 0.76.
The feature vector S2 of a second other news item is (Ma Yili's newest film stills; classy). The intersection of Sp and S2 has 3 elements and their union has 16, so the distance is 1 - 3/16 ≈ 0.81.
It can be seen that the larger the distance between feature vectors, the lower the relatedness, and the smaller the distance, the higher the relatedness. Those skilled in the art will appreciate that formula (1) is only one example of determining the distance between feature vectors; other functions built on the intersection of the feature vector of the other news item and the feature vector of the benchmark news can also characterize the distance between the feature vectors.
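Formula (1) can be sketched as a short function. The integer sets below are synthetic, chosen only so that their intersection and union sizes match the 4/17 and 3/16 figures in the example above; they are not the original feature vectors. Returning 0.0 for two empty vectors is an assumption.

```python
from typing import Hashable, Iterable


def jaccard_distance(si: Iterable[Hashable], sp: Iterable[Hashable]) -> float:
    """D = 1 - |Si ∩ Sp| / |Si ∪ Sp|, treating feature vectors as sets."""
    a, b = set(si), set(sp)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty vectors are treated as identical
    return 1.0 - len(a & b) / len(union)


# Synthetic stand-ins with the same intersection and union sizes as the example.
sp = set(range(0, 11))   # benchmark vector, 11 elements
s1 = set(range(7, 17))   # shares 4 elements with sp, union of 17
s2 = set(range(8, 16))   # shares 3 elements with sp, union of 16

print(round(jaccard_distance(s1, sp), 2))  # 0.76
print(round(jaccard_distance(s2, sp), 2))  # 0.81
```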
Optionally, the distance may be determined by the inner product of, or the cosine of the angle between, the feature vector of the other news item and the feature vector of the benchmark news.
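One way to turn the cosine into a distance is sketched below. The text does not fix the exact mapping, so the use of term-frequency vectors and the choice of 1 minus the cosine as the distance are both assumptions made for this illustration.

```python
import math
from collections import Counter
from typing import Iterable


def cosine_distance(tokens_a: Iterable[str], tokens_b: Iterable[str]) -> float:
    """1 - cos(angle) between the term-frequency vectors of two word lists."""
    fa, fb = Counter(tokens_a), Counter(tokens_b)
    # Inner product over the words the two vectors share.
    dot = sum(fa[t] * fb[t] for t in fa.keys() & fb.keys())
    norm_a = math.sqrt(sum(v * v for v in fa.values()))
    norm_b = math.sqrt(sum(v * v for v in fb.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # assumption: an empty vector is maximally distant
    return 1.0 - dot / (norm_a * norm_b)
```

As with formula (1), a smaller value indicates higher relatedness, so the same threshold test applies.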
Optionally, the distance may be determined by the minimum hash value of the feature vector of the other news item and the minimum hash value of the feature vector of the benchmark news.
In the min-hash algorithm, let A = (a1, a2, ..., ai, ..., aN) be an N-dimensional vector. For each element ai of the vector, H(ai) is a hash function that maps ai to an integer, and hmin(A) is the smallest hash value obtained after the elements of A are processed by the hash function. For vectors A and B, hmin(A) = hmin(B) holds when the element of A ∪ B with the smallest hash value also lies in A ∩ B. This presupposes that H is a good hash function with good uniformity, one that maps different elements to different integers.
It follows that Pr(hmin(A) = hmin(B)) = J(A, B), where Pr denotes probability. That is, the probability that the minimum hash value of A equals the minimum hash value of B is equal to the Jaccard coefficient of A and B. Vectors whose minimum hash values are identical or close can therefore be clustered into one class.
Given vectors A and B, the Jaccard coefficient J of the two vectors is defined as:
J(A, B) = |A ∩ B| / |A ∪ B|
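The relation between minimum hash values and the Jaccard coefficient can be sketched as follows. The text describes a single hash function H; using a family of seeded hash functions and averaging the matches is the usual way to estimate the probability in practice and is an assumption here, as are the helper names and the choice of SHA-1 as the underlying hash.

```python
import hashlib
from typing import Hashable, Iterable, List


def _hash(element: Hashable, seed: int) -> int:
    # Seeded, process-independent hash built on SHA-1 (an illustrative choice).
    data = f"{seed}:{element}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")


def minhash_signature(elements: Iterable[Hashable], num_hashes: int = 128) -> List[int]:
    """One minimum hash value per seeded hash function."""
    items = set(elements)
    if not items:
        raise ValueError("cannot compute a signature of an empty vector")
    return [min(_hash(e, seed) for e in items) for seed in range(num_hashes)]


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of positions where the two signatures agree."""
    if len(sig_a) != len(sig_b) or not sig_a:
        raise ValueError("signatures must be non-empty and of equal length")
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)
```

A distance can then be taken as 1 minus the estimate. The benefit is that signatures have a fixed length, so comparing two news items costs the same regardless of how large their feature vectors are.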
Structurally, news generally comprises a headline, body text, a summary, and so on. To calculate the distance between news items, feature vectors may be built from the body text of the news and the distance between those vectors calculated; alternatively, feature vectors may be built from the headline or the summary and the distance between those vectors calculated.
Step 130: when the distance between another news item and the benchmark news is not greater than the set threshold, determine that the other news item is associated news of the news category.
The threshold can be set and adjusted according to actual needs.
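Step 130 reduces to a single comparison per candidate, sketched below. The function name `find_associated`, the dictionary of named candidates, and the default use of the set-based distance of formula (1) are assumptions for the example.

```python
from typing import Callable, Dict, Hashable, List, Sequence

Vector = Sequence[Hashable]


def jaccard_distance(si: Vector, sp: Vector) -> float:
    a, b = set(si), set(sp)
    union = a | b
    return 0.0 if not union else 1.0 - len(a & b) / len(union)


def find_associated(benchmark: Vector,
                    candidates: Dict[str, Vector],
                    threshold: float,
                    distance: Callable[[Vector, Vector], float] = jaccard_distance,
                    ) -> List[str]:
    """Return the names of candidates whose distance to the benchmark is <= threshold."""
    return [name for name, vec in candidates.items()
            if distance(vec, benchmark) <= threshold]
```

Each candidate is compared with the benchmark only, so N candidates need N distance computations rather than the roughly N squared needed to compare every pair, which is consistent with the reduced computation the invention aims for.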
Referring to Fig. 3, which shows a device for determining associated news provided by a specific embodiment of the invention, the device comprises: a selecting device 210, a distance calculating device 220, and an associated news determining device 230.
The selecting device 210 is configured to select a news item as the benchmark news of a news category.
In one embodiment of the invention, the benchmark news may be selected by ranking the news items that have already been clustered. The clustered news items are ranked by one of news click rate, news reprint rate, and number of news comments, and the news item ranked first is selected as the benchmark news.
In another embodiment of the invention, the benchmark news may likewise be selected by ranking the clustered news items, but the ranking uses several of the factors news click rate, news reprint rate, and number of news comments together. The news item ranked first is selected as the benchmark news.
In another embodiment of the invention, the news items already determined to belong to the news category are ranked by at least one of the following factors: click rate, news reprint rate, and number of comments; the news item ranked first is taken as the benchmark news. Optionally, the ranked news items are further screened: a news item whose time since publication exceeds a certain time threshold is not selected as the benchmark news.
In another embodiment of the invention, M clustered news items are selected at random and the distances among the M items are calculated; the item whose sum of distances to the other M-1 items is smallest is selected as the benchmark news. Because M is small relative to the whole news category, this step has no appreciable effect on the computational efficiency of the method and device of the invention. Optionally, the news items, sorted in ascending order of their sum of distances, are further screened: a news item whose time since publication exceeds a certain time threshold is not selected as the benchmark news.
The distance calculating device 220 is configured to calculate the distance between the other news items and the benchmark news.
Specifically, the distance calculating device 220 optionally comprises the following components (see Fig. 4):
A word segmentation device 2201, configured to perform word segmentation on the news.
In this embodiment, word segmentation may be performed first to obtain a set of words. The words obtained by segmentation include keywords such as "Ma Yili", "new film", and "scale", and also include useless information.
A useless-information processing device 2202, configured to perform useless-information removal on the words obtained by word segmentation.
Useless information can be divided into punctuation marks and words that carry no meaning of their own in Chinese, such as structural particles and function words. In this embodiment of the invention, useless-information removal may further be performed on the segmented words after word segmentation.
A feature vector forming device 2203, configured to select representative words to form the feature vector of the news.
Optionally, the words obtained after useless-information removal may be used directly as the feature vector of the news. Alternatively, representative words may be extracted from the words obtained after useless-information removal to form the feature vector of the news.
For example, for a news report web page, word segmentation and useless-information removal yield a word sequence S = (s1, s2, s3, ..., sN), where s1, s2, s3, and so on denote the words remaining after segmentation and useless-information removal.
The same word may occur more than once in the word sequence S. Word-frequency statistics can therefore be computed over the words in the sequence, the words arranged in descending order of their number of occurrences, and a preset number of words taken from the front as the feature vector of the news text.
A distance determining device 2204, configured to calculate the distance between the other news items and the benchmark news from the feature vectors of the news.
Optionally, let Si be the feature vector of another news item and Sp the feature vector of the benchmark news. The distance D between the other news item and the benchmark news is then given by:
D = 1 - |Si ∩ Sp| / |Si ∪ Sp|    (1)
That is, the distance is 1 minus the ratio of the size of the intersection of Si and Sp to the size of the union of Si and Sp.
For example, suppose the feature vector Sp of the benchmark news is (Ma Yili's new film is large in scale; the workplace elder-sister look must be worn this way) and the feature vector S1 of a first other news item is (Ma Yili's new film is large in scale; several passionate scenes). The intersection of Sp and S1 has 4 elements and their union has 17 (the element counts refer to the original-language feature vectors), so the distance is 1 - 4/17 ≈ 0.76.
The feature vector S2 of a second other news item is (Ma Yili's newest film stills; classy). The intersection of Sp and S2 has 3 elements and their union has 16, so the distance is 1 - 3/16 ≈ 0.81.
It can be seen that the larger the distance between feature vectors, the lower the relatedness, and the smaller the distance, the higher the relatedness. Those skilled in the art will appreciate that formula (1) is only one example of determining the distance between feature vectors; other functions built on the intersection of the feature vector of the other news item and the feature vector of the benchmark news can also characterize the distance between the feature vectors.
Optionally, the distance may be determined by the inner product of, or the cosine of the angle between, the feature vector of the other news item and the feature vector of the benchmark news.
Optionally, the distance may be determined by the minimum hash value of the feature vector of the other news item and the minimum hash value of the feature vector of the benchmark news.
In the min-hash algorithm, let A = (a1, a2, ..., ai, ..., aN) be an N-dimensional vector. For each element ai of the vector, H(ai) is a hash function that maps ai to an integer, and hmin(A) is the smallest hash value obtained after the elements of A are processed by the hash function. For vectors A and B, hmin(A) = hmin(B) holds when the element of A ∪ B with the smallest hash value also lies in A ∩ B. This presupposes that H is a good hash function with good uniformity, one that maps different elements to different integers.
It follows that Pr(hmin(A) = hmin(B)) = J(A, B), where Pr denotes probability. That is, the probability that the minimum hash value of A equals the minimum hash value of B is equal to the Jaccard coefficient of A and B. Vectors whose minimum hash values are identical or close can therefore be clustered into one class.
Given vectors A and B, the Jaccard coefficient J of the two vectors is defined as:
J(A, B) = |A ∩ B| / |A ∪ B|
Structurally, news generally comprises a headline, body text, a summary, and so on. To calculate the distance between news items, feature vectors may be built from the body text of the news and the distance between those vectors calculated; alternatively, feature vectors may be built from the headline or the summary and the distance between those vectors calculated.
The associated news determining device 230 is configured to determine, when the distance between another news item and the benchmark news is not greater than the set threshold, that the other news item is associated news of the news category.
The threshold can be set and adjusted according to actual needs.
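The division of the device into parts 210, 220, and 230 can be sketched as a small composition of interchangeable components. The class name `AssociatedNewsDevice`, its fields, and the `run` method are hypothetical; the sketch shows only how the three parts hand results to one another, not how any of them must be built.

```python
from dataclasses import dataclass
from typing import Callable, Hashable, List, Sequence

Vector = Sequence[Hashable]


def jaccard_distance(si: Vector, sp: Vector) -> float:
    a, b = set(si), set(sp)
    union = a | b
    return 0.0 if not union else 1.0 - len(a & b) / len(union)


@dataclass
class AssociatedNewsDevice:
    select_benchmark: Callable[[List[Vector]], Vector]  # role of selecting device 210
    distance: Callable[[Vector, Vector], float]         # role of distance calculating device 220
    threshold: float                                    # used by determining device 230

    def run(self, category_news: List[Vector], other_news: List[Vector]) -> List[Vector]:
        benchmark = self.select_benchmark(category_news)
        # Role of determining device 230: keep items within the threshold.
        return [v for v in other_news if self.distance(v, benchmark) <= self.threshold]
```

Because each part is passed in as a callable, a different selection rule or distance measure can be substituted without changing the rest of the device.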
The various component embodiments of the invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the device for determining associated news according to an embodiment of the invention. The invention may also be implemented as an apparatus or device program (for example, a computer program and a computer program product) for performing part or all of the method described herein. Such a program implementing the invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
"One embodiment", "an embodiment", or "one or more embodiments" as referred to herein means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the invention. Note also that instances of the phrase "in one embodiment" herein do not necessarily all refer to the same embodiment.
In the description provided herein, numerous specific details are set forth. It is to be understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques have not been shown in detail so as not to obscure the understanding of this description.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices may be embodied by one and the same item of hardware. The use of the words first, second, and third does not indicate any order; these words may be interpreted as names.
Furthermore, it should be noted that the language used in this specification has been selected primarily for readability and instructional purposes, and not to delineate or circumscribe the inventive subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. With respect to the scope of the invention, the disclosure made herein is illustrative and not restrictive, and the scope of the invention is defined by the appended claims.