CN105574184B - Method and device for determining associated news - Google Patents

Info

Publication number
CN105574184B
Authority
CN
China
Prior art keywords
news
feature vector
benchmark
distance
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201510974718.5A
Other languages
Chinese (zh)
Other versions
CN105574184A (en
Inventor
张伸正
魏少俊
陈培军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qihoo Technology Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd, Qizhi Software Beijing Co Ltd
Priority to CN201510974718.5A
Publication of CN105574184A
Application granted
Publication of CN105574184B
Expired - Fee Related
Anticipated expiration

Links

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and device for determining associated news. The method includes: selecting one news item as the benchmark news of a news category; calculating the distance between each other news item and the benchmark news; and, when the distance between another news item and the benchmark news is not greater than a set threshold, determining that the other news item is associated news of the news category. The method and device can effectively reduce the amount of relevance computation in the clustering of news articles and can improve the speed and efficiency with which associated news is determined.

Description

Method and device for determining associated news
Technical field
The present invention relates to the field of Internet technology, and in particular to a method and system for determining associated information.
Background
With the continuous development and growing popularity of Internet technology, the amount of information facing news users is increasing at an astonishing rate, and the need to easily obtain the news they are interested in is becoming more and more urgent.
Because the volume of news is growing rapidly, news categories are becoming more fine-grained, and news is highly real-time, updated quickly, and short-lived, classifying news effectively is very important for serving different users or different applications.
In the prior art, there is a method of classifying news that computes the relevance between news articles and thereby determines clusters of news with a certain degree of relevance.
Although this prior-art method can cluster news with a certain degree of relevance, the relevance between news articles must all be computed before the clustering result is obtained. The amount of computation is large, the computational efficiency is insufficient, and associated news is difficult to determine quickly.
Summary of the invention
In view of the above problems, the present invention is proposed to provide a method and device for determining associated news that overcome the above problems, or at least partially solve or mitigate them.
According to one aspect of the invention, a method for determining associated news is provided, comprising: selecting one news item as the benchmark news of a news category; calculating the distance between each other news item and the benchmark news; and, when the distance between another news item and the benchmark news is not greater than a set threshold, determining that the other news item is associated news of the news category.
Optionally, in the method for determining associated news according to an embodiment of the invention, the news includes: a news title, a news summary, or the full news text.
Optionally, in the method for determining associated news according to an embodiment of the invention, the distance is determined by the intersection of the feature vector of the other news item and the feature vector of the benchmark news.
Optionally, in the method for determining associated news according to an embodiment of the invention, the distance is determined by the inner product of, or the cosine of the angle between, the feature vector of the other news item and the feature vector of the benchmark news.
Optionally, in the method for determining associated news according to an embodiment of the invention, the distance is determined by the min-hash value of the feature vector of the other news item and the min-hash value of the feature vector of the benchmark news.
Optionally, in the method for determining associated news according to an embodiment of the invention, the feature vector is constructed as follows: the news is segmented into a word sequence; the word sequence is re-sorted in descending order of word frequency; and a preset number of words, taken from the front of the sorted sequence, form the feature vector of the news.
Optionally, in the method for determining associated news according to an embodiment of the invention, after the news is segmented, useless information is further removed to form the word sequence that is then sorted.
Optionally, in the method for determining associated news according to an embodiment of the invention, the news items already determined to belong to the news category are ranked by at least one of the following factors: click rate, news reprint rate, and number of comments; the first-ranked news item is used as the benchmark news.
According to another aspect of the invention, a device for determining associated news is provided, comprising: a selecting device for selecting one news item as the benchmark news of a news category; a distance computing device for calculating the distance between each other news item and the benchmark news; and an associated-news determining device for determining that another news item is associated news of the news category when the distance between that news item and the benchmark news is not greater than a set threshold.
Optionally, in the device for determining associated news according to an embodiment of the invention, the news includes: a news title, a news summary, or the full news text.
Optionally, in the device for determining associated news according to an embodiment of the invention, the distance is determined by the intersection of the feature vector of the other news item and the feature vector of the benchmark news.
Optionally, in the device for determining associated news according to an embodiment of the invention, the distance is determined by the inner product of, or the cosine of the angle between, the feature vector of the other news item and the feature vector of the benchmark news.
Optionally, in the device for determining associated news according to an embodiment of the invention, the distance is determined by the min-hash value of the feature vector of the other news item and the min-hash value of the feature vector of the benchmark news.
Optionally, in the device for determining associated news according to an embodiment of the invention, the distance computing device further includes a feature vector constructing device for segmenting the news into a word sequence, re-sorting the word sequence in descending order of word frequency, and taking a preset number of words from the front of the sorted sequence as the feature vector of the news.
Optionally, in the device for determining associated news according to an embodiment of the invention, the distance computing device further includes a useless-information processing device for removing useless information from the segmented word sequence to form the word sequence that is then sorted.
Optionally, in the device for determining associated news according to an embodiment of the invention, the selecting device is used for ranking the news items already determined to belong to the news category by at least one of the following factors: click rate, news reprint rate, and number of comments; the first-ranked news item is used as the benchmark news.
The beneficial effects of the invention are as follows: the method and device for determining associated news can effectively reduce the amount of relevance computation in the clustering of news articles and can improve the speed and efficiency with which associated news is determined.
The above is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the specification, and so that the above and other objects, features, and advantages of the invention are more apparent, specific embodiments of the invention are set out below.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered limiting of the invention. Throughout the drawings, the same reference numerals denote the same parts. In the drawings:
Fig. 1 schematically shows a flowchart of the method for determining associated news according to an embodiment of the invention;
Fig. 2 schematically shows a flowchart of the method for calculating the distance between news items according to an embodiment of the invention;
Fig. 3 schematically shows a block diagram of the device for determining associated news according to an embodiment of the invention;
Fig. 4 schematically shows a block diagram of the distance computing device in the device for determining associated news according to an embodiment of the invention.
Detailed description of the embodiments
The invention is further described below with reference to the accompanying drawings and specific embodiments.
Those skilled in the art will understand that, unless expressly stated otherwise, the singular forms "a", "an", "said", and "the" used herein may also include the plural. It should be further understood that the word "comprising" used in this specification refers to the presence of the stated features, integers, steps, operations, elements, and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It should be understood that when an element is said to be "connected" or "coupled" to another element, it may be directly connected or coupled to the other element, or intervening elements may be present. In addition, "connected" or "coupled" as used herein may include wireless connection or wireless coupling. The term "and/or" as used herein includes all or any unit and all combinations of one or more of the associated listed items.
Those skilled in the art will understand that, unless otherwise defined, all terms used herein (including technical and scientific terms) have the same meaning as commonly understood by those of ordinary skill in the art to which the invention belongs. It should also be understood that terms such as those defined in general dictionaries should be understood to have a meaning consistent with their meaning in the context of the prior art and, unless specifically defined as here, will not be interpreted in an idealized or overly formal sense.
In the present invention, clustering refers to the process of dividing a set of physical or abstract objects into multiple classes composed of similar objects. A cluster generated by clustering is a set of data objects that are similar to the other objects in the same cluster and different from the objects in other clusters.
Referring to Fig. 1, which shows a method for determining associated news provided by a specific embodiment of the invention, the method includes: step 110, selecting one news item as the benchmark news of a news category; step 120, calculating the distance between each other news item and the benchmark news; step 130, when the distance between another news item and the benchmark news is not greater than a set threshold, determining that the other news item is associated news of the news category.
Step 110: select one news item as the benchmark news of a news category.
In one embodiment of the invention, the benchmark news may be selected by ranking the clustered news. The clustered news may be ranked by one of news click rate, news reprint rate, and number of news comments, and the first-ranked news item is selected as the benchmark news.
In another embodiment of the invention, the benchmark news may be selected by ranking the clustered news by several of the factors news click rate, news reprint rate, and number of news comments, and the first-ranked news item is selected as the benchmark news.
In another embodiment of the invention, the news items already determined to belong to the news category are ranked by at least one of the following factors: click rate, news reprint rate, and number of comments; the first-ranked news item is used as the benchmark news. Optionally, the ranked news is further screened: a news item whose time since publication exceeds a certain time threshold is not selected as the benchmark news.
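The ranking-based selection of the benchmark ("mark post") news can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the `NewsItem` fields, the function name `pick_benchmark`, and the default ranking key are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class NewsItem:
    # Hypothetical record; the patent names only the ranking factors.
    title: str
    click_rate: float
    reprint_rate: float
    comment_count: int
    published: datetime


def pick_benchmark(items, now, max_age, key=lambda n: n.click_rate):
    """Return the first-ranked item that is not older than max_age, or None."""
    # Screening step: drop items whose time since publication exceeds the threshold.
    fresh = [n for n in items if now - n.published <= max_age]
    # Ranking step: the item with the largest key is the one ranked first.
    return max(fresh, key=key, default=None)
```

Ranking by several factors at once can be expressed by passing a tuple-valued `key`, for example `key=lambda n: (n.comment_count, n.click_rate)`; how the factors are combined is not specified in the text, so that ordering is only one possible choice.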
In another embodiment of the invention, M clustered news items are selected at random, the distances between the M news items are calculated, and the news item whose sum of distances to the other M-1 news items is the smallest is selected as the benchmark news. Because M is small relative to the whole news category, this does not noticeably affect the computational efficiency of the method and device of the invention. Optionally, the news items sorted in ascending order of distance sum are further screened: a news item whose time since publication exceeds a certain time threshold is not selected as the benchmark news.
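The sampling-based selection above amounts to picking the medoid of M randomly sampled items. A minimal sketch, assuming feature vectors are Python sets and using an intersection-over-union distance as a stand-in; the names `set_distance` and `pick_medoid` are hypothetical, and the optional publication-time screening is omitted.

```python
import random


def set_distance(a, b):
    """1 minus intersection size over union size; 0.0 for two empty sets."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def pick_medoid(vectors, m, rng=None):
    """Sample up to m items and return the index of the sampled item whose
    summed distance to the other sampled items is smallest."""
    rng = rng or random.Random()
    sample = rng.sample(range(len(vectors)), min(m, len(vectors)))

    def total_distance(i):
        return sum(set_distance(vectors[i], vectors[j]) for j in sample if j != i)

    return min(sample, key=total_distance)
```

The cost of this step is on the order of M squared distance evaluations, which stays small as long as M is small compared with the size of the category.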
Step 120: calculate the distance between each other news item and the benchmark news.
Specifically, step 120 optionally includes the following steps (see Fig. 2):
Step 1201: perform word segmentation on the news.
In this embodiment, word segmentation is performed first to obtain a set of words. The words obtained after segmentation include keywords such as "Ma Yili", "new film", and "scale", and also include useless information.
Step 1202: remove useless information from the segmented words.
Useless information can be divided into punctuation marks and words that carry no meaning in Chinese, such as structural auxiliary words and function words. In a specific embodiment of the invention, word segmentation may be followed by removal of useless information from the segmented words.
Step 1203: select representative words to form the feature vector of the news.
Optionally, the words remaining after useless information is removed may be used as the feature vector of the news. Alternatively, representative words may be extracted from the remaining words to form the feature vector of the news.
For example, for a news report web page, after word segmentation and removal of useless information, a word sequence $S = (s_1, s_2, s_3, \ldots, s_N)$ is obtained, where $s_1$, $s_2$, $s_3$, and so on denote the words remaining after segmentation and removal of useless information.
The same word may appear more than once in the word sequence S. Word-frequency statistics can therefore be computed for the words in the sequence, the words arranged in descending order of frequency of occurrence, and a preset number of words taken from the front as the feature vector of the news text.
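Steps 1201 to 1203 can be sketched as below. This is only an illustration under assumptions: whitespace splitting stands in for a real Chinese word segmenter, the stop-word set is a hypothetical placeholder for the "useless information" list, and `feature_vector` is not a name from the patent.

```python
from collections import Counter
import string

# Hypothetical stop-word list standing in for function words and auxiliary words.
STOP_WORDS = {"the", "of", "a", "an", "and"}


def feature_vector(text, k, stop_words=STOP_WORDS):
    """Return the k most frequent words after segmentation and filtering."""
    # Step 1201: segmentation (whitespace split as a stand-in).
    tokens = [t.strip(string.punctuation) for t in text.lower().split()]
    # Step 1202: remove punctuation-only tokens and stop words.
    kept = [t for t in tokens if t and t not in stop_words]
    # Step 1203: sort by descending frequency and keep the first k words.
    return [word for word, _ in Counter(kept).most_common(k)]
```

For instance, `feature_vector("The film, the star; film film of star.", 2)` keeps `film` (three occurrences) and `star` (two occurrences), in that order.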
Step 1204: calculate the distance between each other news item and the benchmark news from the feature vectors of the news.
Optionally, let the feature vector of another news item be Si and the feature vector of the benchmark news be Sp. The distance between the other news item and the benchmark news is then given by:
$$D = 1 - \frac{|S_i \cap S_p|}{|S_i \cup S_p|} \qquad (1)$$
That is, the distance is 1 minus the ratio of the size of the intersection of Si and Sp to the size of their union.
For example, suppose the feature vector Sp of the benchmark news is (Ma Yili new film large scale, workplace "big sister" style must be worn like this) and the feature vector S1 of the first other news item is (Ma Yili new film large scale, several intimate scenes). The intersection of Sp and S1 has size 4 and their union has size 17, so the distance is calculated as 0.76.
Suppose the feature vector S2 of the second other news item is (Ma Yili latest new film stills, classy). The intersection of Sp and S2 has size 3 and their union has size 16, so the distance is calculated as 0.81.
It can be seen that the larger the distance between feature vectors, the lower the relevance, and the smaller the distance, the higher the relevance. Those skilled in the art will appreciate that formula (1) is only one example of determining the distance between feature vectors; the intersection of the feature vector of the other news item and the feature vector of the benchmark news, or other functions of the two, can also characterize the distance between the feature vectors.
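The arithmetic of formula (1) can be checked with a short sketch. The sets below are synthetic placeholders chosen only to reproduce the intersection and union sizes quoted in the example (4 and 17, then 3 and 16); they are not the actual words of the example headlines, and `jaccard_distance` is a name assumed here.

```python
def jaccard_distance(s_i, s_p):
    """Formula (1): 1 - |Si ∩ Sp| / |Si ∪ Sp| for two collections of words."""
    a, b = set(s_i), set(s_p)
    union = a | b
    if not union:
        return 0.0  # convention assumed here for two empty vectors
    return 1.0 - len(a & b) / len(union)


# Placeholder benchmark vector with 13 elements.
sp = set(range(13))
# 4 shared elements plus 4 new ones: intersection 4, union 17.
s1 = {0, 1, 2, 3, 13, 14, 15, 16}
# 3 shared elements plus 3 new ones: intersection 3, union 16.
s2 = {0, 1, 2, 20, 21, 22}

d1 = jaccard_distance(s1, sp)  # 1 - 4/17
d2 = jaccard_distance(s2, sp)  # 1 - 3/16
```

Rounded to two decimals, `d1` is 0.76 and `d2` is 0.81, matching the values in the text, and the item with the larger intersection is the closer one.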
Optionally, the distance can be determined by the inner product of, or the cosine of the angle between, the feature vector of the other news item and the feature vector of the benchmark news.
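A cosine-based variant might look like the sketch below. The text does not say how the cosine is turned into a distance or how words are weighted, so two assumptions are made here: vectors are raw term counts, and the distance is taken as 1 minus the cosine. The function names are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(words_a, words_b):
    """Cosine of the angle between two term-count vectors."""
    ca, cb = Counter(words_a), Counter(words_b)
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())  # inner product
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def cosine_distance(words_a, words_b):
    """Assumed mapping from similarity to distance: 1 - cosine."""
    return 1.0 - cosine_similarity(words_a, words_b)
```

With non-negative counts the cosine lies between 0 and 1, so this distance can be compared against a threshold in the same way as formula (1).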
Optionally, the distance can be determined by the min-hash value of the feature vector of the other news item and the min-hash value of the feature vector of the benchmark news.
In the min-hash algorithm, let $A = (a_1, a_2, \ldots, a_i, \ldots, a_N)$ be an N-dimensional vector. For each element $a_i$ of the vector, $H(a_i)$ is a hash function that maps $a_i$ to an integer, and $h_{\min}(A)$ is the smallest hash value obtained after the elements of A are processed by the hash function. For vectors A and B, $h_{\min}(A) = h_{\min}(B)$ holds when the element with the smallest hash value in $A \cup B$ also lies in $A \cap B$. This presupposes that H is a good hash function with good uniformity that maps different elements to different integers.
It follows that $\Pr(h_{\min}(A) = h_{\min}(B)) = J(A, B)$, where Pr denotes probability. That is, the probability that the min-hash value of A equals the min-hash value of B is equal to the Jaccard similarity coefficient of A and B. Vectors with identical or similar min-hash values can therefore be grouped into one class.
For vectors A and B, the Jaccard similarity coefficient J is defined as:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
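The min-hash property can be illustrated with a small sketch. The text describes a single hash function; using several seeded hash functions and counting how often the minima agree is the usual way to turn the probability statement into an estimate of J, and is an assumption added here, as are the SHA-256-based hash and all function names.

```python
import hashlib


def _hash(seed, element):
    """Map an element to a 64-bit integer; each seed acts as a separate hash function."""
    data = f"{seed}:{element}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def minhash_signature(elements, num_hashes=128):
    """For each hash function, keep the minimum hash value over all elements."""
    elements = list(elements)
    if not elements:
        raise ValueError("feature vector must not be empty")
    return [min(_hash(seed, e) for e in elements) for seed in range(num_hashes)]


def estimated_similarity(sig_a, sig_b):
    """Fraction of positions where the minima agree; this estimates J(A, B)."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
```

A distance comparable to formula (1) can then be taken as `1 - estimated_similarity(sig_a, sig_b)`. The appeal of this design is that signatures have a fixed length and can be computed once per news item, so comparing two items does not require their full word sets.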
Structurally, a news item generally comprises a title, a body, a summary, and so on. To calculate the distance between news items, vectors may be constructed from the news body and the distance between those vectors calculated; alternatively, vectors may be constructed from the news title or the news summary and the distance between those vectors calculated.
Step 130: when the distance between another news item and the benchmark news is not greater than the set threshold, determine that the other news item is associated news of the news category.
The threshold can be set and adjusted according to actual needs.
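Step 130, together with the saving in computation that motivates the method, can be sketched as follows. The function names and the evaluation counter are illustrative assumptions; the distance function is passed in so that any of the distances described above can be used.

```python
def find_associated(benchmark_vec, candidates, threshold, distance):
    """Return (ids of associated news, number of distance evaluations)."""
    associated = []
    evaluations = 0
    for news_id, vec in candidates.items():
        evaluations += 1  # one comparison per candidate, against the benchmark only
        if distance(benchmark_vec, vec) <= threshold:
            associated.append(news_id)
    return associated, evaluations


def pairwise_evaluations(n):
    """Distance evaluations needed to compare every pair among n news items."""
    return n * (n - 1) // 2
```

For N candidate items this costs N distance evaluations, whereas comparing all items with one another, benchmark included, would cost `pairwise_evaluations(N + 1)`.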
Referring to Fig. 3, which shows a device for determining associated news provided by a specific embodiment of the invention, the device includes: a selecting device 210, a distance computing device 220, and an associated-news determining device 230.
The selecting device 210 is used for selecting one news item as the benchmark news of a news category.
In one embodiment of the invention, the benchmark news may be selected by ranking the clustered news. The clustered news may be ranked by one of news click rate, news reprint rate, and number of news comments, and the first-ranked news item is selected as the benchmark news.
In another embodiment of the invention, the benchmark news may be selected by ranking the clustered news by several of the factors news click rate, news reprint rate, and number of news comments, and the first-ranked news item is selected as the benchmark news.
In another embodiment of the invention, the news items already determined to belong to the news category are ranked by at least one of the following factors: click rate, news reprint rate, and number of comments; the first-ranked news item is used as the benchmark news. Optionally, the ranked news is further screened: a news item whose time since publication exceeds a certain time threshold is not selected as the benchmark news.
In another embodiment of the invention, M clustered news items are selected at random, the distances between the M news items are calculated, and the news item whose sum of distances to the other M-1 news items is the smallest is selected as the benchmark news. Because M is small relative to the whole news category, this does not noticeably affect the computational efficiency of the method and device of the invention. Optionally, the news items sorted in ascending order of distance sum are further screened: a news item whose time since publication exceeds a certain time threshold is not selected as the benchmark news.
The distance computing device 220 is used for calculating the distance between each other news item and the benchmark news.
Specifically, the distance computing device 220 optionally includes the following components (see Fig. 4):
A word segmentation device 2201, used for performing word segmentation on the news.
In this embodiment, word segmentation is performed first to obtain a set of words. The words obtained after segmentation include keywords such as "Ma Yili", "new film", and "scale", and also include useless information.
A useless-information processing device 2202, used for removing useless information from the segmented words.
Useless information can be divided into punctuation marks and words that carry no meaning in Chinese, such as structural auxiliary words and function words. In a specific embodiment of the invention, word segmentation may be followed by removal of useless information from the segmented words.
A feature vector constructing device 2203, used for selecting representative words to form the feature vector of the news.
Optionally, the words remaining after useless information is removed may be used as the feature vector of the news. Alternatively, representative words may be extracted from the remaining words to form the feature vector of the news.
For example, for a news report web page, after word segmentation and removal of useless information, a word sequence $S = (s_1, s_2, s_3, \ldots, s_N)$ is obtained, where $s_1$, $s_2$, $s_3$, and so on denote the words remaining after segmentation and removal of useless information.
The same word may appear more than once in the word sequence S. Word-frequency statistics can therefore be computed for the words in the sequence, the words arranged in descending order of frequency of occurrence, and a preset number of words taken from the front as the feature vector of the news text.
A distance determining device 2204, used for calculating the distance between each other news item and the benchmark news from the feature vectors of the news.
Optionally, let the feature vector of another news item be Si and the feature vector of the benchmark news be Sp. The distance between the other news item and the benchmark news is then given by:
$$D = 1 - \frac{|S_i \cap S_p|}{|S_i \cup S_p|} \qquad (1)$$
That is, the distance is 1 minus the ratio of the size of the intersection of Si and Sp to the size of their union.
For example, suppose the feature vector Sp of the benchmark news is (Ma Yili new film large scale, workplace "big sister" style must be worn like this) and the feature vector S1 of the first other news item is (Ma Yili new film large scale, several intimate scenes). The intersection of Sp and S1 has size 4 and their union has size 17, so the distance is calculated as 0.76.
Suppose the feature vector S2 of the second other news item is (Ma Yili latest new film stills, classy). The intersection of Sp and S2 has size 3 and their union has size 16, so the distance is calculated as 0.81.
It can be seen that the larger the distance between feature vectors, the lower the relevance, and the smaller the distance, the higher the relevance. Those skilled in the art will appreciate that formula (1) is only one example of determining the distance between feature vectors; the intersection of the feature vector of the other news item and the feature vector of the benchmark news, or other functions of the two, can also characterize the distance between the feature vectors.
Optionally, the distance can be determined by the inner product of, or the cosine of the angle between, the feature vector of the other news item and the feature vector of the benchmark news.
Optionally, the distance can be determined by the min-hash value of the feature vector of the other news item and the min-hash value of the feature vector of the benchmark news.
In the min-hash algorithm, let $A = (a_1, a_2, \ldots, a_i, \ldots, a_N)$ be an N-dimensional vector. For each element $a_i$ of the vector, $H(a_i)$ is a hash function that maps $a_i$ to an integer, and $h_{\min}(A)$ is the smallest hash value obtained after the elements of A are processed by the hash function. For vectors A and B, $h_{\min}(A) = h_{\min}(B)$ holds when the element with the smallest hash value in $A \cup B$ also lies in $A \cap B$. This presupposes that H is a good hash function with good uniformity that maps different elements to different integers.
It follows that $\Pr(h_{\min}(A) = h_{\min}(B)) = J(A, B)$, where Pr denotes probability. That is, the probability that the min-hash value of A equals the min-hash value of B is equal to the Jaccard similarity coefficient of A and B. Vectors with identical or similar min-hash values can therefore be grouped into one class.
For vectors A and B, the Jaccard similarity coefficient J is defined as:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
Structurally, a news item generally comprises a title, a body, a summary, and so on. To calculate the distance between news items, vectors may be constructed from the news body and the distance between those vectors calculated; alternatively, vectors may be constructed from the news title or the news summary and the distance between those vectors calculated.
The associated-news determining device 230 is used for determining that another news item is associated news of the news category when the distance between that news item and the benchmark news is not greater than the set threshold.
The threshold can be set and adjusted according to actual needs.
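One way to picture how the three devices fit together in software is the sketch below. It is only an assumed composition: the class name, the use of callables for devices 210 and 220, and the dictionary-based data layout are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set

Vector = Set[str]


@dataclass
class AssociatedNewsDevice:
    select: Callable[[Dict[str, Vector]], str]   # role of selecting device 210
    distance: Callable[[Vector, Vector], float]  # role of distance computing device 220
    threshold: float                             # used by determining device 230

    def run(self, clustered: Dict[str, Vector], others: Dict[str, Vector]) -> List[str]:
        """Pick the benchmark from clustered news, then test each other item."""
        benchmark_vec = clustered[self.select(clustered)]
        # Role of determining device 230: keep items within the threshold.
        return [
            news_id
            for news_id, vec in others.items()
            if self.distance(benchmark_vec, vec) <= self.threshold
        ]
```

Keeping the selection rule and the distance as interchangeable parts mirrors the optional embodiments, where either may be swapped without changing the threshold test.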
The various component embodiments of the invention may be implemented in hardware, as software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the device for determining associated news according to embodiments of the invention. The invention may also be implemented as a device or device program (for example, a computer program or a computer program product) for performing part or all of the method described herein. Such a program implementing the invention may be stored on a computer-readable medium or may take the form of one or more signals. Such a signal may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
References herein to "one embodiment", "an embodiment", or "one or more embodiments" mean that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the invention. Note also that instances of the phrase "in one embodiment" do not necessarily all refer to the same embodiment.
Numerous specific details are set forth in the specification provided here. It should be understood, however, that embodiments of the invention can be practiced without these specific details. In some instances, well-known methods, structures, and techniques are not shown in detail so as not to obscure the understanding of this specification.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices may be embodied by one and the same item of hardware. The use of the words first, second, and third does not indicate any order; these words may be interpreted as names.
Furthermore, it should be noted that the language used in this specification has been chosen primarily for readability and instructional purposes, and not to delineate or limit the subject matter of the invention. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. With regard to the scope of the invention, the disclosure made herein is illustrative and not restrictive, and the scope of the invention is defined by the appended claims.

Claims (8)

1. A method for determining associated news, comprising:
Choosing one news item as the benchmark news of a given news category;
Calculating the distance between other news and the benchmark news;
When the distance between the other news and the benchmark news is not greater than a set threshold, determining that the other news is associated news of the news category;
Ranking the news already determined to belong to the news category by at least one of the following factors: click-through rate, news reprint rate, and number of comments, and taking the top-ranked news as the benchmark news; or randomly selecting M clustered news items, calculating the distances between the M news items, and selecting as the benchmark news the news item whose sum of distances to the other M-1 news items is smallest; and news whose publication time exceeds a certain time threshold is not selected as the benchmark news;
Said calculating the distance between other news and the benchmark news comprises:
Step 1201: performing word segmentation on the news;
Step 1202: removing useless information from the words obtained by word segmentation;
Step 1203: selecting representative words to form the feature vector of the news;
Specifically, for a news report web page, after word segmentation and removal of useless information, a word sequence S = (s1, s2, s3, ..., sN) is obtained, where s1, s2, s3 to sN denote the words remaining after word segmentation and removal of useless information;
Word frequency statistics are computed for identical words in the word sequence S, the words are then sorted by number of occurrences from high to low, and a preset number of words are taken from the front of the sorted list as the feature vector of the news text;
Step 1204: calculating the distance between the other news and the benchmark news according to the feature vectors of the news;
The feature vector of the other news is Si and the feature vector of the benchmark news is Sp, and the distance between the other news and the benchmark news is given by: D = 1 - |Si ∩ Sp| / |Si ∪ Sp|;
That is, the distance is 1 minus the ratio of the intersection of the feature vector Si of the other news and the feature vector Sp of the benchmark news to the union of Si and Sp.
2. The method for determining associated news according to claim 1, characterized in that the news comprises: a news headline, a news summary, or the full text of the news.
3. The method for determining associated news according to claim 1 or 2, characterized in that the distance is determined by the inner product, or by the cosine of the angle, of the feature vector of the other news and the feature vector of the benchmark news.
4. The method for determining associated news according to claim 1 or 2, characterized in that the distance is determined by the min-hash value of the feature vector of the other news and the min-hash value of the feature vector of the benchmark news.
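For illustration only, the distance calculation recited in claim 1 (steps 1201 to 1204) and the threshold test might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: whitespace splitting stands in for a real word segmenter, and the `STOP_WORDS` list, the function names, and the default `size` and `threshold` values are hypothetical choices introduced here.

```python
from collections import Counter

# Hypothetical stop-word list standing in for the "useless information" filter.
STOP_WORDS = {"the", "a", "an", "of", "in", "to", "and", "is"}


def feature_vector(text, size=5):
    """Steps 1201-1203: segment, filter, and keep the `size` most frequent words."""
    # Assumption: whitespace splitting replaces a real word segmenter.
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    # Counter.most_common sorts by occurrence count, highest first.
    return {word for word, _ in Counter(words).most_common(size)}


def jaccard_distance(si, sp):
    """Step 1204: D = 1 - |Si ∩ Sp| / |Si ∪ Sp|."""
    union = si | sp
    if not union:
        # Assumption: two empty feature vectors are treated as identical.
        return 0.0
    return 1.0 - len(si & sp) / len(union)


def associated_news(benchmark_text, other_texts, threshold=0.5, size=5):
    """Return the texts whose distance to the benchmark is not greater than `threshold`."""
    sp = feature_vector(benchmark_text, size)
    return [
        text
        for text in other_texts
        if jaccard_distance(feature_vector(text, size), sp) <= threshold
    ]
```

Treating the feature vector as a set of words is what makes the intersection and union in the formula well defined; the inner-product, cosine, and min-hash variants of claims 3 and 4 would need a different representation and are not covered by this sketch.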
5. A device for determining associated news, comprising:
A selecting device, for choosing one news item as the benchmark news of a given news category;
A distance calculating device, for calculating the distance between other news and the benchmark news;
An associated news determining device, for determining that the other news is associated news of the news category when the distance between the other news and the benchmark news is not greater than a set threshold;
The selecting device is configured to rank the news already determined to belong to the news category by at least one of the following factors: click-through rate, news reprint rate, and number of comments, and to take the top-ranked news as the benchmark news; or to randomly select M clustered news items, calculate the distances between the M news items, and select as the benchmark news the news item whose sum of distances to the other M-1 news items is smallest; and news whose publication time exceeds a certain time threshold is not selected as the benchmark news;
The distance calculating device comprises the following components:
A word segmentation device, for performing word segmentation on the news;
A useless-information processing device, for removing useless information from the words obtained by word segmentation;
A feature vector forming device, for selecting representative words to form the feature vector of the news;
Specifically, for a news report web page, after word segmentation and removal of useless information, a word sequence S = (s1, s2, s3, ..., sN) is obtained, where s1, s2, s3 to sN denote the words remaining after word segmentation and removal of useless information;
Word frequency statistics are computed for identical words in the word sequence S, the words are then sorted by number of occurrences from high to low, and a preset number of words are taken from the front of the sorted list as the feature vector of the news text;
A distance determining device, for calculating the distance between the other news and the benchmark news according to the feature vectors of the news;
Assuming that the feature vector of the other news is Si and the feature vector of the benchmark news is Sp, the distance between the other news and the benchmark news is given by: D = 1 - |Si ∩ Sp| / |Si ∪ Sp|;
That is, the distance is 1 minus the ratio of the intersection of the feature vector Si of the other news and the feature vector Sp of the benchmark news to the union of Si and Sp.
6. The device for determining associated news according to claim 5, characterized in that the news comprises: a news headline, a news summary, or the full text of the news.
7. The device for determining associated news according to claim 5 or 6, characterized in that the distance is determined by the inner product, or by the cosine of the angle, of the feature vector of the other news and the feature vector of the benchmark news.
8. The device for determining associated news according to claim 5 or 6, characterized in that the distance is determined by the min-hash value of the feature vector of the other news and the min-hash value of the feature vector of the benchmark news.
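For illustration only, the two ways of choosing the benchmark news recited in claims 1 and 5 (ranking by a popularity factor, or taking the item with the smallest sum of distances to the other sampled items) might be sketched in Python as follows. This is a hedged sketch, not the patented implementation: the `News` record, its field names, the use of click-through rate as the single ranking factor, and the decision to sum distances over all sampled items while letting only sufficiently recent items win are assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class News:
    """Hypothetical record; the field names are assumptions for this sketch."""
    title: str
    features: frozenset   # feature vector, represented as a set of words
    click_rate: float     # one of the ranking factors named in the claims
    age_hours: float      # time elapsed since publication


def jaccard_distance(si, sp):
    """D = 1 - |Si ∩ Sp| / |Si ∪ Sp|; two empty sets are treated as distance 0."""
    union = si | sp
    return 1.0 - len(si & sp) / len(union) if union else 0.0


def pick_by_ranking(news, max_age_hours):
    """Option 1: the top-ranked item (here by click rate) that is recent enough."""
    fresh = [n for n in news if n.age_hours <= max_age_hours]
    return max(fresh, key=lambda n: n.click_rate) if fresh else None


def pick_by_distance_sum(sample, max_age_hours):
    """Option 2: the recent item with the smallest sum of distances to the sample.

    `sample` is the list of M already-clustered items; the caller could draw
    it with random.sample. The distance of an item to itself is 0, so summing
    over the whole sample equals summing over the other M-1 items.
    """
    fresh = [n for n in sample if n.age_hours <= max_age_hours]
    if not fresh:
        return None
    return min(
        fresh,
        key=lambda n: sum(jaccard_distance(n.features, o.features) for o in sample),
    )
```

Both functions return `None` when every candidate is older than the time threshold, which leaves it to the caller to decide how to proceed in that case.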
CN201510974718.5A 2015-12-22 2015-12-22 A kind of determination method and device being associated with news Expired - Fee Related CN105574184B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510974718.5A CN105574184B (en) 2015-12-22 2015-12-22 A kind of determination method and device being associated with news

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510974718.5A CN105574184B (en) 2015-12-22 2015-12-22 A kind of determination method and device being associated with news

Publications (2)

Publication Number Publication Date
CN105574184A CN105574184A (en) 2016-05-11
CN105574184B true CN105574184B (en) 2019-09-24

Family

ID=55884315

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510974718.5A Expired - Fee Related CN105574184B (en) 2015-12-22 2015-12-22 A kind of determination method and device being associated with news

Country Status (1)

Country Link
CN (1) CN105574184B (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20220726

Address after: Room 801, 8th floor, No. 104, floors 1-19, building 2, yard 6, Jiuxianqiao Road, Chaoyang District, Beijing 100015

Patentee after: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Address before: Room 112, block D, No. 28, Xinjiekou outer street, Xicheng District, Beijing 100088 (Desheng Park)

Patentee before: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Patentee before: Qizhi software (Beijing) Co.,Ltd.

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20190924