CN104239285A

CN104239285A - New article chapter detecting method and device

Info

Publication number: CN104239285A
Application number: CN201310223253.0A
Authority: CN
Inventors: 蔡兵
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2013-06-06
Filing date: 2013-06-06
Publication date: 2014-12-24
Also published as: CN110347931A

Abstract

The invention discloses a method and device for detecting new chapters of an article, and belongs to the field of Internet technology. The method includes: determining a first topic-word vector of the detected chapters of an article, where the first topic-word vector identifies the content of the detected chapters; determining a second topic-word vector of a new chapter of the article, where the second topic-word vector identifies the content of the new chapter; calculating the similarity between the first topic-word vector and the second topic-word vector; and determining, according to the magnitude relationship between the similarity and a preset similarity threshold, whether the new chapter is a false chapter of the article. With this technical solution, online identification takes only milliseconds and does not affect chapter pushing speed, so that when the new chapter is a valid chapter it can be pushed promptly, which effectively safeguards the pushing efficiency of new chapters of the article.

Description

Method and device for detecting new chapters of an article

Technical field

The present invention relates to the field of Internet technology, and in particular to a method and device for detecting new chapters of an article.

Background technology

With the development of Internet technology, more and more people carry out various activities over the Internet; for example, people can read serialized articles online.

In the prior art, the growing popularity of online literature has given rise to an increasing number of article websites. According to incomplete statistics, the number of small and medium-sized article websites has reached several hundred thousand, and their quality varies widely. Some of them frequently steal content or even fabricate false new chapters to cheat users into clicking, which harms the user experience. An aggregation platform, after crawling the new-chapter data of articles from these websites, manually reviews the new chapters so that false new chapters are identified and filtered out in time and higher-quality articles are provided to users. This review is an important step in improving the quality of the aggregation platform and optimizing the user's reading experience.

In the course of realizing the present invention, the inventor found that the prior art has at least the following problem: reviewing the new chapters of an article manually takes a long time, so the new chapters cannot be pushed in time.

Summary of the invention

To solve the problem of the prior art, embodiments of the present invention provide a method and device for detecting new chapters of an article. The technical solution is as follows:

In one aspect, a method for detecting new chapters of an article is provided, the method comprising:

determining a first topic-word vector of the detected chapters of an article, where the first topic-word vector identifies the content of the detected chapters of the article;

determining a second topic-word vector of a new chapter of the article, where the second topic-word vector identifies the content of the new chapter of the article;

calculating the similarity between the first topic-word vector and the second topic-word vector; and

judging, according to the magnitude relationship between the similarity and a preset similarity threshold, whether the new chapter is a false chapter of the article.

In another aspect, a device for detecting new chapters of an article is provided, the device comprising:

a first determination module, configured to determine a first topic-word vector of the detected chapters of an article, where the first topic-word vector identifies the content of the detected chapters of the article;

a second determination module, configured to determine a second topic-word vector of a new chapter of the article, where the second topic-word vector identifies the content of the new chapter of the article;

a calculation module, configured to calculate the similarity between the first topic-word vector and the second topic-word vector; and

a judgment module, configured to judge, according to the magnitude relationship between the similarity and a preset similarity threshold, whether the new chapter is a false chapter of the article.

The method and device of the embodiments of the present invention determine a first topic-word vector of the detected chapters of an article, where the first topic-word vector identifies the content of the detected chapters; determine a second topic-word vector of a new chapter of the article, where the second topic-word vector identifies the content of the new chapter; calculate the similarity between the first topic-word vector and the second topic-word vector; and judge, according to the magnitude relationship between the similarity and a preset similarity threshold, whether the new chapter is a false chapter of the article. With the technical solution of the embodiments of the present invention, the whole detection process for a new chapter requires no manual intervention and has very low cost; manual review of new chapters can be avoided, which effectively saves labor costs. Moreover, by analyzing the detected chapters and the new chapter of the article in depth, the technical solution can accurately determine whether the new chapter is a false chapter. Online identification takes only milliseconds and does not affect chapter pushing speed, so when the new chapter is a valid chapter it can be pushed promptly, which effectively safeguards the pushing efficiency of new chapters of the article.

Brief description of the drawings

To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flowchart of the method for detecting new chapters of an article provided by one embodiment of the present invention;

Fig. 2 is a flowchart of the method for detecting new chapters of an article provided by another embodiment of the present invention;

Fig. 3 is a schematic structural diagram of the device for detecting new chapters of an article provided by one embodiment of the present invention;

Fig. 4 is a schematic structural diagram of the device for detecting new chapters of an article provided by another embodiment of the present invention.

Detailed description of the embodiments

To make the objects, technical solutions and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the drawings.

Fig. 1 is a flowchart of the method for detecting new chapters of an article provided by one embodiment of the present invention. As shown in Fig. 1, the method of this embodiment may specifically include the following steps:

100. Determine the first topic-word vector of the detected chapters of the article;

The first topic-word vector identifies the content of the detected chapters of the article. In this embodiment, the detected chapters are chapters of the article that have already been determined to be valid, which can be understood as chapters determined to be valid using the method of the embodiments of the present invention. Note that when the first chapter of the article is examined, no detected chapters exist yet, so the method of the embodiments of the present invention cannot be applied; manual review can be used to determine whether the first chapter is a valid chapter.

For example, determining the first topic-word vector of the detected chapters of the article can be understood as a process of training on the detected chapters to extract the first topic-word vector.

101. Determine the second topic-word vector of the new chapter of the article;

The second topic-word vector identifies the content of the new chapter of the article.

In this embodiment, step 101 "determine the second topic-word vector of the new chapter of the article" can be implemented in the same way as step 100 "determine the first topic-word vector of the detected chapters of the article". For example, determining the second topic-word vector of the new chapter can be understood as a process of training on the new chapter to extract the second topic-word vector. Preferably, in this embodiment the second topic-word vector and the first topic-word vector contain the same number of topic words.

102. Calculate the similarity between the first topic-word vector and the second topic-word vector;

103. Judge, according to the magnitude relationship between the similarity and a preset similarity threshold, whether the new chapter is a false chapter of the article.

The method of this embodiment can be executed by a device for detecting new chapters of an article. For example, the device can be deployed in an aggregation platform.

The method of this embodiment determines the first topic-word vector of the detected chapters of the article, where the first topic-word vector identifies the content of the detected chapters; determines the second topic-word vector of the new chapter of the article, where the second topic-word vector identifies the content of the new chapter; calculates the similarity between the first topic-word vector and the second topic-word vector; and judges, according to the magnitude relationship between the similarity and the preset similarity threshold, whether the new chapter is a false chapter of the article. With the technical solution of this embodiment, the whole detection process for a new chapter requires no manual intervention and has very low cost; manual review of new chapters can be avoided, which effectively saves labor costs. Moreover, by analyzing the detected chapters and the new chapter of the article in depth, the technical solution can accurately determine whether the new chapter is a false chapter. Online identification takes only milliseconds and does not affect chapter pushing speed, so when the new chapter is a valid chapter it can be pushed promptly, which effectively safeguards the pushing efficiency of new chapters of the article.

Optionally, on the basis of the technical solution of the embodiment shown in Fig. 1, step 100 "determine the first topic-word vector of the detected chapters of the article" may specifically include the following steps:

(1) split the detected chapters of the article into words to obtain multiple candidate words;

(2) calculate the weight of each of the multiple candidate words;

(3) generate the first topic-word vector according to the multiple candidate words and the weight of each candidate word.
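Step (1) can be sketched in Python as follows. This is a minimal sketch, assuming that candidate words are all contiguous substrings of 2 to 5 characters, which matches the "abcd" example given later in this description; the function name `split_candidates` and its parameters are hypothetical and do not come from the patent.

```python
def split_candidates(text, min_len=2, max_len=5):
    """Return every contiguous substring of `text` with length in [min_len, max_len].

    Assumption: candidates are plain character n-grams; the patent does not
    specify how punctuation or chapter boundaries are handled.
    """
    candidates = []
    for length in range(min_len, max_len + 1):
        # Slide a window of the current length across the text.
        for start in range(len(text) - length + 1):
            candidates.append(text[start:start + length])
    return candidates


print(split_candidates("abcd"))
```

For "abcd" this yields the six candidates "ab", "bc", "cd", "abc", "bcd" and "abcd"; duplicates are kept so that occurrence frequency can be counted afterwards.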

For example, step (2) "calculate the weight of each of the multiple candidate words" may specifically include: calculating, for each candidate word, its length, its frequency of occurrence in the article, the entropy of its left adjacent character set and the entropy of its right adjacent character set; and calculating the weight of each candidate word according to these four quantities. The left adjacent character set is the set of characters that appear immediately to the left of a given word in a passage of text, and the right adjacent character set is the set of characters that appear immediately to its right. For example, in the sentence "seeing their appearance, thinking that they feel especially bad, is also their blessing", the left adjacent character set of the candidate word "they" is {see, for}, and its right adjacent character set is {special, wish}. For details of determining the left and right adjacent character sets, reference may be made to the related art, which is not repeated here.
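Collecting the left and right adjacent characters of a candidate word can be sketched as below. This is an illustrative sketch only: the function name `adjacent_chars` is hypothetical, and it is an assumption that repeated neighbors are kept (as a list rather than a set) so that their probabilities can later be used for the entropy.

```python
def adjacent_chars(text, word):
    """Collect the characters immediately left and right of each occurrence of `word`."""
    left, right = [], []
    start = text.find(word)
    while start != -1:
        end = start + len(word)
        if start > 0:              # there is a character to the left
            left.append(text[start - 1])
        if end < len(text):        # there is a character to the right
            right.append(text[end])
        # Continue searching one position later so overlapping matches are found.
        start = text.find(word, start + 1)
    return left, right


left, right = adjacent_chars("xabyabzab", "ab")
print(left, right)
```

In the toy string "xabyabzab", the word "ab" occurs three times; the last occurrence ends the string and therefore contributes no right neighbor.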

Further optionally, in "calculating the weight of each candidate word according to its length, its frequency of occurrence in the article, the entropy of its left adjacent character set and the entropy of its right adjacent character set", the weight of each candidate word is specifically calculated using the following formula:

where W is the weight of the candidate word, TF is the frequency with which the candidate word occurs in the article, Ha is the entropy of the left adjacent character set, Hb is the entropy of the right adjacent character set, and L is the length of the candidate word.

Further optionally, step (3) "generate the first topic-word vector according to the multiple candidate words and the weight of each candidate word" may specifically include: taking M candidate words from the multiple candidate words in descending order of weight to generate the first topic-word vector. Specifically, the number M of topic words contained in the second topic-word vector and in the first topic-word vector can be chosen according to actual conditions; for example, the 10 highest-weighted words (Top 10), the Top 100 or the Top 200 can be taken.
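The Top-M selection in step (3) can be sketched as follows, assuming the weights have already been computed and are held in a dictionary mapping each candidate word to its weight. The function name `top_m_words` and the sample weights are hypothetical.

```python
def top_m_words(weights, m):
    """Return the m highest-weighted (word, weight) pairs in descending order of weight."""
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    return ranked[:m]


# Hypothetical weights, for illustration only.
weights = {"ab": 0.2, "bc": 0.9, "cd": 0.5}
print(top_m_words(weights, 2))
```

Keeping the weight alongside each word lets the resulting vector be used directly in a later similarity calculation.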

For example, the length of a candidate word is between 2 and 5 Chinese characters. For example, "abcd" can be split into the candidate words "ab", "bc", "cd", "abc", "bcd" and "abcd". For each candidate word, the frequency with which it occurs in the article, its length, the entropy of its left adjacent character set and the entropy of its right adjacent character set are then counted; the larger the entropy, the more important the candidate word. Finally, the weight of each candidate word is calculated using the weight formula, and the candidate words are sorted by weight from high to low; for example, the 500 highest-weighted words (Top 500) can form the first topic-word vector of the article. The entropy is given by H = -\sum p \log p, where p is the probability of each character within the character set. For example, if the left adjacent character set of a candidate word is {a, a, b, c}, then the entropy of its left adjacent character set is

Ha = - \frac{2}{4} \log (\frac{2}{4}) - \frac{1}{4} \log (\frac{1}{4}) - \frac{1}{4} \log (\frac{1}{4}) .
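The entropy calculation for the example set {a, a, b, c} can be checked with a short Python sketch. The base of the logarithm is not stated in the text, so the natural logarithm is assumed here; the function name `char_set_entropy` is hypothetical.

```python
import math
from collections import Counter


def char_set_entropy(chars):
    """Entropy H = -sum(p * log p) over the characters in `chars`.

    Assumption: natural logarithm; the source does not state the base.
    """
    counts = Counter(chars)
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total) for n in counts.values())


ha = char_set_entropy(["a", "a", "b", "c"])
print(ha)
```

With probabilities 2/4, 1/4 and 1/4 the result equals 1.5 * ln 2 under the natural-log assumption; a different base only rescales all entropies by a constant factor.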

Clearly, a larger entropy indicates that the candidate string is more independent and more likely to be a topic word of the article. For example, Table 1 lists the 10 highest-weighted candidate words computed for a certain article; they are mainly names of characters, organizations and the like in the article, and are clearly distinctive. In practice, these 10 highest-weighted candidate words can be used as the first topic-word vector of the article.

Table 1

Further optionally, on the basis of the technical solution of the above embodiment, after step (2) "calculate the weight of each of the multiple candidate words" and before step (3) "generate the first topic-word vector according to the multiple candidate words and the weight of each candidate word", the method may further include the following steps:

(a) count the document frequency of each of the multiple candidate words;

The document frequency in this embodiment is the number of articles, among the N articles in the article pool, in which the candidate word appears. For example, if the article pool contains 100 articles and word x appears in the candidate topic-word vectors of 20 of them, then its document frequency is DF = 20. The larger the document frequency DF of a topic word, the less distinctive the word is, and therefore the less important it is to any particular article. Conversely, if the document frequency of a topic word is DF = 1, that is, it appears in the topic-word vector of only one article, then the word is very likely exclusive to that article and is highly distinctive.

(c) update the weight of each of the multiple candidate words according to the document frequency of each candidate word and the number N of articles in the article pool.

For example, the weight of each candidate word can be calculated using the following formula:

W = W * log(N / DF), where W is the weight of the candidate word and DF is the document frequency of the candidate word.

After step (c), the candidate words are re-sorted according to the updated weights, and the top M candidate words, for example the Top 200, can be selected as the final first topic-word vector of the article.
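The document-frequency update can be sketched as follows. This is a minimal sketch, assuming the article pool is represented as a list of sets, one set of candidate topic words per article; the names `document_frequency` and `update_weights` are hypothetical, the natural logarithm is an assumption, and flooring DF at 1 for words absent from the pool is a choice made here to avoid division by zero, not something stated in the text.

```python
import math


def document_frequency(word, pool_vectors):
    """Number of articles in the pool whose candidate topic words contain `word`."""
    return sum(1 for vector in pool_vectors if word in vector)


def update_weights(weights, pool_vectors):
    """Apply W = W * log(N / DF) to every candidate word."""
    n = len(pool_vectors)
    updated = {}
    for word, w in weights.items():
        # Assumption: treat words missing from the pool as DF = 1.
        df = max(1, document_frequency(word, pool_vectors))
        updated[word] = w * math.log(n / df)
    return updated


# Hypothetical pool of four articles and two candidate words.
pool = [{"x", "y"}, {"x"}, {"z"}, {"x"}]
print(update_weights({"x": 1.0, "y": 2.0}, pool))
```

In this toy pool, "x" appears in three of four articles and is scaled by log(4/3), while "y" appears in only one and is scaled by log(4), so the rarer word gains relative weight.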

It should be noted that the above embodiments all explain how the first topic-word vector is determined. The second topic-word vector is determined in the same way as the first topic-word vector; for details, refer to the description of the above embodiments, which is not repeated here.

Further optionally, on the basis of the technical solution of the embodiment shown in Fig. 1, step 102 "calculate the similarity between the first topic-word vector and the second topic-word vector" may specifically include calculating the similarity between the first topic-word vector and the second topic-word vector using the following formula:

where D denotes the first topic-word vector and D_i denotes the i-th topic word in the first topic-word vector; Q denotes the second topic-word vector and Q_i denotes the i-th topic word in the second topic-word vector; M denotes the number of topic words contained in each of the first and second topic-word vectors; and sim(D, Q) denotes the similarity between the first topic-word vector and the second topic-word vector. The value of sim(D, Q) lies between 0 and 1, and a larger value indicates that the two vectors are more similar.
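The similarity formula itself is not reproduced in this text. Purely as an illustration, the sketch below uses cosine similarity over topic-word weights, which is one common measure that stays between 0 and 1 when all weights are non-negative; it is an assumption and should not be read as the patent's formula. The function name `cosine_similarity` and the sample vectors are hypothetical.

```python
import math


def cosine_similarity(d, q):
    """Cosine similarity between two topic-word vectors given as {word: weight} dicts.

    Assumption: this stands in for the unreproduced sim(D, Q) formula.
    """
    shared = set(d) & set(q)
    dot = sum(d[w] * q[w] for w in shared)
    norm_d = math.sqrt(sum(v * v for v in d.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # an empty or all-zero vector is treated as dissimilar
    return dot / (norm_d * norm_q)


# Hypothetical vectors sharing one of two topic words.
print(cosine_similarity({"a": 1.0, "b": 1.0}, {"a": 1.0, "c": 1.0}))
```

Two vectors with no topic word in common score 0 under this measure, which is consistent with the stated behavior that larger values indicate more similar vectors.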

Further optionally, on the basis of the technical solution of the embodiment shown in Fig. 1, step 103 "judge, according to the magnitude relationship between the similarity and a preset similarity threshold, whether the new chapter is a false chapter of the article" may specifically include: when the similarity is greater than or equal to the preset similarity threshold, determining that the new chapter is a valid chapter of the article; and when the similarity is less than the preset similarity threshold, determining that the new chapter is a false chapter of the article.
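The decision rule of step 103 can be sketched as below. The function name `is_false_chapter` is hypothetical, and the threshold value used in the example is illustrative only, since the text does not state a concrete threshold.

```python
def is_false_chapter(similarity, threshold):
    """Return True when the new chapter is judged false, i.e. similarity < threshold.

    A similarity greater than or equal to the threshold means a valid chapter.
    """
    return similarity < threshold


# Illustrative threshold; the patent does not give a concrete value.
THRESHOLD = 0.3
print(is_false_chapter(0.05, THRESHOLD), is_false_chapter(0.3, THRESHOLD))
```

Note that equality with the threshold counts as valid, matching the "greater than or equal to" wording of the step.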

Further optionally, on the basis of the technical solution of the above embodiment, after it is determined that the new chapter is a false chapter of the article, the method may further include: filtering out the new chapter of the article. That is, the false new chapter is not shown to users of the aggregation platform, which improves the article quality of the aggregation platform and the user experience.

All of the optional technical solutions of the above embodiments can be combined in any feasible way to form optional embodiments of the present invention, which are not described one by one here.

In the method of the above embodiments, the whole detection process for a new chapter requires no manual intervention and has very low cost; manual review of new chapters can be avoided, which effectively saves labor costs. Moreover, by analyzing the detected chapters and the new chapter of the article in depth, the technical solution of the embodiments of the present invention can accurately determine whether the new chapter is a false chapter. Online identification takes only milliseconds and does not affect chapter pushing speed, so when the new chapter is a valid chapter it can be pushed promptly, which effectively safeguards the pushing efficiency of new chapters of the article.

Fig. 2 is a flowchart of the method for detecting new chapters of an article provided by another embodiment of the present invention. On the basis of Fig. 1 and its embodiment, this embodiment describes the technical solution of the present invention in further detail. As shown in Fig. 2, the method of this embodiment may specifically include the following steps:

200. Split the detected chapters of the article into words to obtain multiple candidate words;

201. Calculate, for each of the multiple candidate words, its length, its frequency of occurrence in the article, the entropy of its left adjacent character set and the entropy of its right adjacent character set;

202. Calculate the weight of each candidate word according to its length, its frequency of occurrence in the article, the entropy of its left adjacent character set and the entropy of its right adjacent character set;

The corresponding techniques of the above embodiment can be used, and are not repeated here.

203. Count the document frequency of each of the multiple candidate words;

204. Update the weight of each of the multiple candidate words according to the document frequency of each candidate word and the number N of articles in the article pool, using the following formula;

The weight of each candidate word is updated according to the formula W = W * log(N / DF), where W is the weight of the candidate word and DF is the document frequency of the candidate word. The W on the left of the equals sign is the updated weight of the candidate word, and the W on the right of the equals sign is the weight calculated in step 202, that is, the weight before the update.

205. Take the Top 200 candidate words from the multiple candidate words in descending order of weight to generate the first topic-word vector;

206. Determine the second topic-word vector of the new chapter of the article;

The specific process of determining the second topic-word vector of the new chapter is the same as the process of determining the first topic-word vector in steps 200-205 above; for details, refer to the description of steps 200-205, which is not repeated here. It should be noted that the first topic-word vector and the second topic-word vector contain the same number of topic words.

207. Calculate the similarity between the first topic-word vector and the second topic-word vector using the following formula:

208. Judge whether the similarity is greater than or equal to the preset similarity threshold T; if it is, perform step 209; otherwise, perform step 210;

209. Determine that the new chapter is a valid chapter of the article;

210. Determine that the new chapter is a false chapter of the article, and perform step 211;

211. Filter out this new chapter of the article.

For example, Table 2 below shows some information for the article named novel_tiancaixiangshi. The 2nd column is the article name, the 3rd column lists chapters from different articles, and the 1st column is the similarity between columns 2 and 3. The first row represents a detected chapter from this article, and the eighth row represents a false chapter. It can be seen that only the chapter in the first row, which belongs to this article, has a similarity with the article vector in column 2 greater than 0.3, while the chapters from other articles and the false chapter all have similarities below 0.05. Valid chapters and false chapters can therefore be distinguished very accurately.

Table 2

In the method of this embodiment, the whole detection process for a new chapter requires no manual intervention and has very low cost; manual review of new chapters can be avoided, which effectively saves labor costs. Moreover, by analyzing the detected chapters and the new chapter of the article in depth, the technical solution of the embodiments of the present invention can accurately determine whether the new chapter is a false chapter. Online identification takes only milliseconds and does not affect chapter pushing speed, so when the new chapter is a valid chapter it can be pushed promptly, which effectively safeguards the pushing efficiency of new chapters of the article.

Fig. 3 is a schematic structural diagram of the device for detecting new chapters of an article provided by one embodiment of the present invention. As shown in Fig. 3, the device of this embodiment includes a first determination module 10, a second determination module 11, a calculation module 12 and a judgment module 13.

The first determination module 10 is configured to determine the first topic-word vector of the detected chapters of the article, where the first topic-word vector identifies the content of the detected chapters. The second determination module 11 is configured to determine the second topic-word vector of the new chapter of the article, where the second topic-word vector identifies the content of the new chapter. The calculation module 12 is connected to the first determination module 10 and the second determination module 11, and is configured to calculate the similarity between the first topic-word vector determined by the first determination module 10 and the second topic-word vector determined by the second determination module 11. The judgment module 13 is connected to the calculation module 12, and is configured to judge, according to the magnitude relationship between the similarity calculated by the calculation module 12 and a preset similarity threshold, whether the new chapter is a false chapter of the article.
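The way the four modules connect can be sketched in Python as below. This is an illustrative sketch under stated assumptions, not the patent's implementation: the class name `ChapterDetector` and its field names are hypothetical, and the two toy functions (`toy_vector`, which gives every whitespace-separated word a weight of 1, and `toy_overlap`, which is plain Jaccard overlap) are placeholders swapped in only to make the example runnable.

```python
from dataclasses import dataclass
from typing import Callable, Dict

Vector = Dict[str, float]


@dataclass
class ChapterDetector:
    first_determination: Callable[[str], Vector]    # plays the role of module 10
    second_determination: Callable[[str], Vector]   # plays the role of module 11
    calculation: Callable[[Vector, Vector], float]  # plays the role of module 12
    threshold: float                                # preset similarity threshold

    def judge(self, detected_text: str, new_text: str) -> bool:
        """Plays the role of module 13: True means the new chapter is judged false."""
        d = self.first_determination(detected_text)
        q = self.second_determination(new_text)
        return self.calculation(d, q) < self.threshold


def toy_vector(text: str) -> Vector:
    # Placeholder: every whitespace-separated word gets weight 1.0.
    return {word: 1.0 for word in text.split()}


def toy_overlap(d: Vector, q: Vector) -> float:
    # Placeholder similarity: Jaccard overlap of the two word sets.
    union = set(d) | set(q)
    return len(set(d) & set(q)) / len(union) if union else 0.0


detector = ChapterDetector(toy_vector, toy_vector, toy_overlap, threshold=0.3)
print(detector.judge("lin feng sword sect", "lin feng sword master"))
print(detector.judge("lin feng sword sect", "buy cheap watches now"))
```

Passing the determination and calculation steps in as callables mirrors the description that the calculation module consumes the outputs of both determination modules and the judgment module consumes the output of the calculation module.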

The device of this embodiment detects new chapters of an article using the above modules in the same way as the related method embodiments above; for details, refer to the description of those embodiments, which is not repeated here.

The device of this embodiment uses the above modules to determine the first topic-word vector of the detected chapters of the article, where the first topic-word vector identifies the content of the detected chapters; determine the second topic-word vector of the new chapter of the article, where the second topic-word vector identifies the content of the new chapter; calculate the similarity between the first topic-word vector and the second topic-word vector; and judge, according to the magnitude relationship between the similarity and the preset similarity threshold, whether the new chapter is a false chapter of the article. With the technical solution of this embodiment, the whole detection process for a new chapter requires no manual intervention and has very low cost; manual review of new chapters can be avoided, which effectively saves labor costs. Moreover, by analyzing the detected chapters and the new chapter of the article in depth, the technical solution can accurately determine whether the new chapter is a false chapter. Online identification takes only milliseconds and does not affect chapter pushing speed, so when the new chapter is a valid chapter it can be pushed promptly, which effectively safeguards the pushing efficiency of new chapters of the article.

The structural representation of the pick-up unit of the new chapters and sections of article that Fig. 4 provides for another embodiment of the present invention.As shown in Figure 4, the pick-up unit of the new chapters and sections of article of the present embodiment, on above-mentioned basis embodiment illustrated in fig. 3, comprises following technical scheme further.

As shown in Fig. 4, the first determination module 10 in the device of this embodiment comprises a splitting unit 101, a computing unit 102 and a generation unit 103.

The splitting unit 101 is configured to split the detected chapters of the article into words to obtain multiple candidate words. The computing unit 102 is connected to the splitting unit 101 and is configured to calculate the weight of each of the candidate words obtained by the splitting unit 101. The generation unit 103 is connected to the splitting unit 101 and the computing unit 102 respectively, and is configured to generate the first subject term vector according to the candidate words obtained by the splitting unit 101 and the weight of each candidate word calculated by the computing unit 102.

Further optionally, in the device of this embodiment, the computing unit 102 is specifically configured to calculate, for each of the candidate words obtained by the splitting unit 101, its length, the frequency with which it occurs in the article, the entropy of its left adjacent character set and the entropy of its right adjacent character set; and to calculate the weight of each candidate word from its length, its frequency in the article, the entropy of the left adjacent character set and the entropy of the right adjacent character set.
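The four quantities named above can be illustrated with a minimal sketch. It computes only the inputs to the weight, not the weight itself, because the weighting formula is not reproduced in this text. The function names, the use of a raw occurrence count as the "frequency", and the natural-log Shannon entropy are assumptions made for illustration.

```python
import math
from collections import Counter


def neighbor_entropy(neighbors):
    """Shannon entropy (natural log) of a multiset of adjacent characters."""
    total = sum(neighbors.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in neighbors.values())


def candidate_features(text, word):
    """Return (length, frequency, left entropy, right entropy) of `word` in `text`.

    Assumption: "frequency" is the raw number of (possibly overlapping)
    occurrences; the source does not say whether it is normalized.
    """
    left, right = Counter(), Counter()
    count = 0
    start = text.find(word)
    while start != -1:
        count += 1
        if start > 0:
            left[text[start - 1]] += 1      # character just before the word
        end = start + len(word)
        if end < len(text):
            right[text[end]] += 1           # character just after the word
        start = text.find(word, start + 1)
    return len(word), count, neighbor_entropy(left), neighbor_entropy(right)
```

A candidate that appears next to many different characters has high left and right entropy, which suggests it behaves as an independent word rather than as a fragment of a longer one.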

Further optionally, in the device of this embodiment, the computing unit 102 specifically uses the following formula to calculate the weight of each of the candidate words obtained by the splitting unit 101:

Further optionally, in the device of this embodiment, the generation unit 103 is specifically configured to take, from the candidate words obtained by the splitting unit 101, M candidate words in descending order of the weight calculated by the computing unit 102, and to generate the first subject term vector from them.
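Selecting the M highest-weighted candidate words can be sketched as follows. The representation of a subject term vector as a mapping from word to weight, and the function names, are assumptions made for illustration; the source does not fix a data structure.

```python
import heapq


def top_m_terms(weights, m):
    """Return the m highest-weighted (word, weight) pairs, highest first."""
    return heapq.nlargest(m, weights.items(), key=lambda kv: kv[1])


def subject_term_vector(weights, m):
    """Build a subject term vector as {word: weight} from the top m candidates.

    Assumption: the vector is stored as a dict keyed by word.
    """
    return dict(top_m_terms(weights, m))
```

For example, with weights `{"sword": 3.0, "the": 0.5, "dragon": 2.0, "castle": 1.0}` and M = 2, the sketch keeps "sword" and "dragon".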

Further optionally, in the device of this embodiment, the first determination module 10 further comprises a statistics unit 104 and an updating unit 105.

The statistics unit 104 is connected to the splitting unit 101. After the computing unit 102 calculates the weight of each candidate word, and before the generation unit 103 generates the first subject term vector according to the candidate words and their weights, the statistics unit 104 counts the document frequency of each of the candidate words obtained by the splitting unit 101. The document frequency is the number of articles, among the N articles in the article pool, in which the candidate word occurs. The updating unit 105 is connected to the statistics unit 104 and the computing unit 102 respectively, and is configured to update the weight of each candidate word calculated by the computing unit 102 according to the document frequency counted by the statistics unit 104, the number N of articles in the article pool, and the weight calculated by the computing unit 102.

In this case the generation unit 103 is connected to the updating unit 105, and is configured to generate the first subject term vector according to the candidate words obtained by the splitting unit 101 and the weight of each candidate word as updated by the updating unit 105.

For example, the updating unit 105 specifically uses the following formula to calculate the weight of each candidate word:

W = W * log(N/DF), where W is the weight of the candidate word, N is the number of articles in the article pool, and DF is the document frequency of the candidate word. The W on the left of the equals sign is the updated weight of the candidate word; the W on the right is the weight calculated in step 202, that is, the weight before the update.
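The update W = W * log(N/DF) can be sketched as below. The natural logarithm and the handling of a candidate word whose document frequency is zero are assumptions, since the source specifies neither; the function and parameter names are likewise illustrative.

```python
import math


def update_weights(weights, doc_freq, n_articles):
    """Apply W = W * log(N / DF) to every candidate word.

    weights:    {word: weight before the update}
    doc_freq:   {word: number of pool articles containing the word}
    n_articles: N, the number of articles in the article pool
    """
    updated = {}
    for word, w in weights.items():
        df = doc_freq.get(word, 0)
        if df <= 0:
            # Assumption: the source does not cover DF = 0, so this sketch
            # leaves the weight unchanged in that case.
            updated[word] = w
        else:
            updated[word] = w * math.log(n_articles / df)
    return updated
```

A word that occurs in every article of the pool gets log(N/N) = 0 and therefore drops out of the ranking, while a word specific to a few articles keeps a high weight.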

Further optionally, in the device of this embodiment, the computing module 12 may be connected to the generation unit 103, and specifically uses the following formula to calculate the similarity between the first subject term vector generated by the generation unit 103 and the second subject term vector:

where D denotes the first subject term vector and D_i denotes the i-th subject term in the first subject term vector; Q denotes the second subject term vector and Q_i denotes the i-th subject term in the second subject term vector; m denotes the number of subject terms included in each of the first subject term vector and the second subject term vector; and sim(D, Q) denotes the similarity between the first subject term vector and the second subject term vector.
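The similarity formula itself is not reproduced in this text. Purely as an illustration, the sketch below assumes cosine similarity over term weights, which is one common choice for comparing two weighted term vectors and may differ from the formula actually used. The dict representation and the function name are also assumptions.

```python
import math


def cosine_similarity(d, q):
    """Cosine similarity between two {word: weight} vectors.

    Assumption: this stands in for sim(D, Q); the source's own formula
    is not available here.
    """
    terms = set(d) | set(q)
    dot = sum(d.get(t, 0.0) * q.get(t, 0.0) for t in terms)
    norm_d = math.sqrt(sum(v * v for v in d.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_d == 0.0 or norm_q == 0.0:
        return 0.0  # an empty or all-zero vector is treated as dissimilar
    return dot / (norm_d * norm_q)
```

Under this assumption, two chapters that share no subject terms score 0, and two chapters with identical term weights score 1.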

Specifically, like the first determination module 10, the second determination module 11 also comprises the splitting unit 101, the computing unit 102 and the generation unit 103 described above, as well as the statistics unit 104 and the updating unit 105, so as to determine the second subject term vector in the same way as the first subject term vector is determined. For details, refer to the description of the above embodiments, which is not repeated here.

Further optionally, in the device of this embodiment, the judging module 13 is specifically configured to compare the similarity calculated by the computing module 12 with the preset similarity threshold: when the similarity is greater than or equal to the preset similarity threshold, it determines that the new chapter is a valid chapter of the article; when the similarity is less than the preset similarity threshold, it determines that the new chapter is a false chapter of the article.

Further optionally, the device of this embodiment further comprises a filtering module 14. The filtering module 14 is connected to the judging module 13 and is configured to filter out the new chapter of the article after the judging module 13 determines that the new chapter is a false chapter of the article.
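The threshold comparison and the subsequent filtering can be sketched together as follows. The function names and the representation of incoming chapters as (identifier, similarity) pairs are assumptions made for illustration.

```python
def is_false_chapter(similarity, threshold):
    """A new chapter is false when its similarity is below the preset threshold."""
    return similarity < threshold


def filter_new_chapters(chapters, threshold):
    """Keep only the valid chapters.

    chapters: iterable of (chapter_id, similarity) pairs, where similarity
              is the value already computed against the detected chapters.
    """
    return [cid for cid, sim in chapters if not is_false_chapter(sim, threshold)]
```

A similarity exactly equal to the threshold counts as valid, matching the "greater than or equal to" condition described above.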

All of the above optional solutions in the device of this embodiment may be combined in any feasible manner to form optional embodiments of the present invention, which are not described one by one here.

With the device for detecting a new chapter of an article according to this embodiment, the above modules allow the entire detection process for a new chapter to run without manual intervention and at very low cost; manual review of new chapters can be avoided, which effectively saves labor cost. Moreover, by automatically analyzing the detected chapters and the new chapter in depth, the technical solution of the embodiment of the present invention can accurately determine whether the new chapter is a false chapter. Online identification takes only milliseconds and does not affect the chapter pushing speed, so that when the new chapter is a valid chapter it can be pushed in time, which effectively guarantees the pushing efficiency of new chapters of the article.

An embodiment of the present invention may further provide an article aggregation platform on which the device for detecting a new chapter of an article of the embodiment shown in Fig. 3 or Fig. 4 is deployed. The device may use the method for detecting a new chapter of an article of the embodiment shown in Fig. 1 or Fig. 2 to detect new chapters. For details, refer to the description of the related embodiments above, which is not repeated here.

It should be noted that, when the device provided by the above embodiments detects a new chapter of an article, the division into the above functional modules is used only as an example. In practical applications, the above functions may be assigned to different functional modules as required; that is, the internal structure of the device may be divided into different functional modules to perform all or part of the functions described above. In addition, the device embodiments and the method embodiments provided above belong to the same concept; for the specific implementation process, refer to the method embodiments, which is not repeated here.

The serial numbers of the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.

A person of ordinary skill in the art will understand that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, which may be a read-only memory (ROM), a magnetic disk, an optical disc or the like.

The foregoing describes only preferred embodiments of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims

1. A method for detecting a new chapter of an article, characterized in that the method comprises:

2. The method according to claim 1, characterized in that determining the first subject term vector of the detected chapters of the article comprises:

splitting the detected chapters of the article into words to obtain multiple candidate words;

calculating the weight of each of the candidate words;

generating the first subject term vector according to the candidate words and the weight of each candidate word.

3. The method according to claim 2, characterized in that calculating the weight of each of the candidate words comprises:

calculating, for each candidate word, its length, the frequency with which it occurs in the article, the entropy of its left adjacent character set and the entropy of its right adjacent character set;

calculating the weight of each candidate word according to its length, the frequency with which it occurs in the article, the entropy of the left adjacent character set and the entropy of the right adjacent character set.

4. The method according to claim 3, characterized in that the weight of each candidate word is calculated according to its length, the frequency with which it occurs in the article, the entropy of the left adjacent character set and the entropy of the right adjacent character set, specifically using the following formula:

where W is the weight of the candidate word, TF is the frequency with which the candidate word occurs in the article, Ha is the entropy of the left adjacent character set, Hb is the entropy of the right adjacent character set, and L is the length of the candidate word.

5. The method according to claim 2, characterized in that generating the first subject term vector according to the candidate words and the weight of each candidate word comprises:

taking M candidate words from the candidate words in descending order of weight, and generating the first subject term vector from them.

6. The method according to any one of claims 2 to 5, characterized in that, after calculating the weight of each of the candidate words and before generating the first subject term vector according to the candidate words and the weight of each candidate word, the method further comprises:

counting the document frequency of each of the candidate words, the document frequency being the number of articles, among the N articles in an article pool, in which the candidate word occurs;

updating the weight of each of the candidate words according to the document frequency of each candidate word and the number N of articles in the article pool.

7. The method according to claim 6, characterized in that the weight of each of the candidate words is updated according to the document frequency of each candidate word and the number N of articles in the article pool, specifically using the following formula:

W = W * log(N/DF), where W is the weight of the candidate word and DF is the document frequency of the candidate word.

8. The method according to claim 1, characterized in that calculating the similarity between the first subject term vector and the second subject term vector comprises calculating the similarity using the following formula:

where D denotes the first subject term vector and D_i denotes the i-th subject term in the first subject term vector; Q denotes the second subject term vector and Q_i denotes the i-th subject term in the second subject term vector; m denotes the number of subject terms included in each of the first subject term vector and the second subject term vector; and sim(D, Q) denotes the similarity between the first subject term vector and the second subject term vector.

9. The method according to claim 7 or 8, characterized in that judging, according to the magnitude relationship between the similarity and the preset similarity threshold, whether the new chapter is a false chapter of the article comprises:

when the similarity is greater than or equal to the preset similarity threshold, determining that the new chapter is a valid chapter of the article;

when the similarity is less than the preset similarity threshold, determining that the new chapter is a false chapter of the article.

10. The method according to claim 9, characterized in that, after determining that the new chapter is a false chapter of the article, the method further comprises:

filtering out the new chapter of the article.

11. A device for detecting a new chapter of an article, characterized in that the device comprises:

12. The device according to claim 11, characterized in that the first determination module comprises:

a splitting unit, configured to split the detected chapters of the article into words to obtain multiple candidate words;

a computing unit, configured to calculate the weight of each of the candidate words;

a generation unit, configured to generate the first subject term vector according to the candidate words and the weight of each candidate word.

13. The device according to claim 12, characterized in that the computing unit is specifically configured to calculate, for each candidate word, its length, the frequency with which it occurs in the article, the entropy of its left adjacent character set and the entropy of its right adjacent character set; and to calculate the weight of each candidate word according to its length, the frequency with which it occurs in the article, the entropy of the left adjacent character set and the entropy of the right adjacent character set.

14. The device according to claim 13, characterized in that the computing unit specifically uses the following formula to calculate the weight of each candidate word:

15. The device according to claim 12, characterized in that the generation unit is specifically configured to take M candidate words from the candidate words in descending order of weight, and to generate the first subject term vector from them.

16. The device according to any one of claims 12 to 15, characterized in that the first determination module further comprises:

a statistics unit, configured to count the document frequency of each of the candidate words after the computing unit calculates the weight of each candidate word and before the generation unit generates the first subject term vector according to the candidate words and the weight of each candidate word, the document frequency being the number of articles, among the N articles in an article pool, in which the candidate word occurs;

an updating unit, configured to update the weight of each of the candidate words according to the document frequency of each candidate word and the number N of articles in the article pool.

17. The device according to claim 16, characterized in that the updating unit specifically uses the following formula to calculate the weight of each candidate word:

18. The device according to claim 11, characterized in that the computing module specifically uses the following formula to calculate the similarity between the first subject term vector and the second subject term vector:

19. The device according to claim 17 or 18, characterized in that the judging module is specifically configured to determine that the new chapter is a valid chapter of the article when the similarity is greater than or equal to the preset similarity threshold, and to determine that the new chapter is a false chapter of the article when the similarity is less than the preset similarity threshold.

20. The device according to claim 19, characterized in that the device further comprises:

a filtering module, configured to filter out the new chapter of the article after the judging module determines that the new chapter is a false chapter of the article.