CN104239285A - New article chapter detecting method and device - Google Patents

New article chapter detecting method and device

Publication number
CN104239285A
Authority
CN
China
Prior art keywords
candidate word
article
sections
chapters
weight
Legal status
Pending
Application number
CN201310223253.0A
Other languages
Chinese (zh)
Inventor
蔡兵
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201910649833.3A (CN110347931A)
Priority to CN201310223253.0A (CN104239285A)
Publication of CN104239285A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and device for detecting new chapters of an article, and relates to the field of Internet technology. The method includes: determining a first subject term vector of the detected chapters of an article, the first subject term vector being used to identify the content of the detected chapters; determining a second subject term vector of a new chapter of the article, the second subject term vector being used to identify the content of the new chapter; calculating the similarity between the first subject term vector and the second subject term vector; and determining, according to the magnitude relationship between the similarity and a preset similarity threshold, whether the new chapter is a false chapter of the article. With this technical solution, online identification takes only milliseconds and does not affect chapter push speed, so when the new chapter is a valid chapter it can be pushed promptly, which ensures the push efficiency of new chapters of the article.

Description

Method and device for detecting new chapters of an article
Technical field
The present invention relates to the field of Internet technology, and in particular to a method and device for detecting new chapters of an article.
Background
With the development of Internet technology, more and more people carry out activities over the Internet; for example, people read serialized articles online.
In the prior art, the growing popularity of online literature has given rise to more and more article websites. According to incomplete statistics, the number of small and medium-sized article websites has reached hundreds of thousands, and their quality varies widely. Some of them steal content or even fabricate false new chapters to trick users into clicking, which harms the user experience. An aggregation platform, after crawling the new chapter data from these websites, reviews the new chapters manually so that false new chapters are identified and filtered out in time and higher-quality articles are provided to users. This review is an important step in improving the quality of the aggregation platform and the reading experience of its users.
In the course of implementing the present invention, the inventor found that the prior art has at least the following problem: reviewing new chapters manually takes a long time, so new chapters of an article cannot be pushed in time.
Summary of the invention
To solve the problem of the prior art, embodiments of the present invention provide a method and device for detecting new chapters of an article. The technical solution is as follows:
In one aspect, a method for detecting a new chapter of an article is provided, the method comprising:
determining a first subject term vector of the detected chapters of an article, the first subject term vector being used to identify the content of the detected chapters of the article;
determining a second subject term vector of a new chapter of the article, the second subject term vector being used to identify the content of the new chapter of the article;
calculating the similarity between the first subject term vector and the second subject term vector; and
judging, according to the magnitude relationship between the similarity and a preset similarity threshold, whether the new chapter is a false chapter of the article.
In another aspect, a device for detecting a new chapter of an article is provided, the device comprising:
a first determination module, configured to determine a first subject term vector of the detected chapters of an article, the first subject term vector being used to identify the content of the detected chapters of the article;
a second determination module, configured to determine a second subject term vector of a new chapter of the article, the second subject term vector being used to identify the content of the new chapter of the article;
a computing module, configured to calculate the similarity between the first subject term vector and the second subject term vector; and
a judging module, configured to judge, according to the magnitude relationship between the similarity and a preset similarity threshold, whether the new chapter is a false chapter of the article.
The method and device of the embodiments of the present invention determine a first subject term vector that identifies the content of the detected chapters of an article, determine a second subject term vector that identifies the content of a new chapter of the article, calculate the similarity between the two vectors, and judge, according to the magnitude relationship between the similarity and a preset similarity threshold, whether the new chapter is a false chapter of the article. With this technical solution, the entire detection process requires no manual intervention and costs very little: manual review of new chapters is avoided and labor costs are saved. Moreover, by analyzing the detected chapters and the new chapter automatically, the solution can accurately determine whether the new chapter is a false chapter. Online identification takes only milliseconds and does not affect chapter push speed, so when the new chapter is a valid chapter it can be pushed promptly, which ensures the push efficiency of new chapters of the article.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in describing the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of the method for detecting a new chapter of an article provided by one embodiment of the present invention;
Fig. 2 is a flowchart of the method for detecting a new chapter of an article provided by another embodiment of the present invention;
Fig. 3 is a structural diagram of the device for detecting a new chapter of an article provided by one embodiment of the present invention;
Fig. 4 is a structural diagram of the device for detecting a new chapter of an article provided by another embodiment of the present invention.
Detailed description
To make the objects, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the drawings.
Fig. 1 is a flowchart of the method for detecting a new chapter of an article provided by one embodiment of the present invention. As shown in Fig. 1, the method of this embodiment may include the following steps:
100. Determine the first subject term vector of the detected chapters of the article;
The first subject term vector identifies the content of the detected chapters of the article. In this embodiment, the detected chapters are chapters of the article that have already been determined to be valid, that is, chapters that the method of the embodiments of the present invention has classified as valid chapters. Note that when the first chapter of an article is being evaluated, no detected chapters exist yet, so the method of the embodiments cannot be applied; manual review may be used to check whether the first chapter is a valid chapter.
For example, determining the first subject term vector of the detected chapters can be understood as training on the detected chapters to extract the first subject term vector.
101. Determine the second subject term vector of the new chapter of the article;
The second subject term vector identifies the content of the new chapter of the article.
In this embodiment, step 101 may be implemented in the same way as step 100. For example, determining the second subject term vector of the new chapter can be understood as training on the new chapter to extract the second subject term vector. Preferably, the second subject term vector and the first subject term vector contain the same number of subject terms.
102. Calculate the similarity between the first subject term vector and the second subject term vector;
103. Judge, according to the magnitude relationship between the similarity and a preset similarity threshold, whether the new chapter is a false chapter of the article.
The method of this embodiment may be executed by a device for detecting new chapters of an article. For example, the device may be deployed in an aggregation platform.
The method of this embodiment determines a first subject term vector that identifies the content of the detected chapters of the article, determines a second subject term vector that identifies the content of the new chapter, calculates the similarity between the two vectors, and judges, according to the magnitude relationship between the similarity and a preset similarity threshold, whether the new chapter is a false chapter of the article. With the technical solution of this embodiment, the entire detection process requires no manual intervention and costs very little: manual review of new chapters is avoided and labor costs are saved. Moreover, by analyzing the detected chapters and the new chapter automatically, the solution can accurately determine whether the new chapter is a false chapter. Online identification takes only milliseconds and does not affect chapter push speed, so when the new chapter is a valid chapter it can be pushed promptly, which ensures the push efficiency of new chapters of the article.
Optionally, on the basis of the embodiment shown in Fig. 1, step 100 ("determine the first subject term vector of the detected chapters of the article") may include the following steps:
(1) Split the detected chapters of the article into words to obtain multiple candidate words;
(2) Calculate the weight of each of the candidate words;
(3) Generate the first subject term vector from the candidate words and their weights.
For example, step (2) may include: calculating the length of each candidate word, its frequency of occurrence in the article, the entropy of its left adjacent character set, and the entropy of its right adjacent character set; and calculating the weight of each candidate word from these four quantities. The left adjacent character set of a word is the set of characters that appear immediately to its left in a passage of text, and the right adjacent character set is the set of characters that appear immediately to its right. For example, in the sentence "seeing their appearance, think that they feel bad especially, is also their blessing.", the left adjacent character set of the candidate word "they" = {see, for} and its right adjacent character set = {, special, wish}. The left and right adjacent character sets may be determined with reference to the related art, which is not described further here.
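As an illustration only, collecting the left and right adjacent characters of a candidate word can be sketched in Python as follows. The function name adjacent_chars and the sample string are assumptions made for this sketch and do not come from the original disclosure; occurrences are counted (rather than stored as a plain set) so that character probabilities can be derived from them later.

```python
from collections import Counter


def adjacent_chars(text, word):
    """Count the characters immediately left and right of each occurrence of `word`.

    Hypothetical helper: returns two Counters (left neighbours, right neighbours).
    """
    left, right = Counter(), Counter()
    start = text.find(word)
    while start != -1:
        if start > 0:                      # a left neighbour exists
            left[text[start - 1]] += 1
        end = start + len(word)
        if end < len(text):                # a right neighbour exists
            right[text[end]] += 1
        start = text.find(word, start + 1)
    return left, right


# Illustrative input: "ab" occurs three times, the last one at the end of the text.
left, right = adjacent_chars("xabyabzab", "ab")
```

In this sample the last occurrence has no right neighbour, so it contributes only to the left counts.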
Further optionally, "calculating the weight of each candidate word from its length, its frequency of occurrence in the article, the entropy of its left adjacent character set, and the entropy of its right adjacent character set" uses the following formula to calculate the weight of each candidate word:
where W is the weight of the candidate word, TF is the frequency with which the candidate word occurs in the article, Ha is the entropy of the left adjacent character set, Hb is the entropy of the right adjacent character set, and L is the length of the candidate word.
Further optionally, step (3) ("generate the first subject term vector from the candidate words and their weights") may include: taking the M candidate words with the highest weights, in descending order of weight, to generate the first subject term vector. The number M of subject terms contained in the second subject term vector and in the first subject term vector can be chosen according to the actual situation; for example, the Top 10, the Top 100, or the Top 200 candidate words by weight may be taken.
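The selection of the M highest-weight candidate words can be sketched as follows. This is a minimal Python sketch under stated assumptions: the name top_m and the representation of the vector as a list of (word, weight) pairs are choices made for illustration, not part of the original disclosure.

```python
import heapq


def top_m(weights, m):
    """Return the m highest-weight (word, weight) pairs, highest weight first.

    `weights` is a hypothetical {candidate_word: weight} mapping.
    """
    return heapq.nlargest(m, weights.items(), key=lambda item: item[1])


# Illustrative weights only.
vector = top_m({"ab": 1.0, "bc": 3.0, "cd": 2.0}, 2)
```

If fewer than M candidate words exist, the sketch simply returns all of them in descending order of weight.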
For example, the length of a candidate word is between 2 and 5 Chinese characters. The string "abcd", for instance, can be split into the candidate words "ab", "bc", "cd", "abc", "bcd", and "abcd". For each candidate word, its frequency of occurrence in the article, its length, the entropy of its left adjacent character set, and the entropy of its right adjacent character set are then counted; the larger the entropy, the more important the candidate word. Finally, the weight of each candidate word is calculated with the weight formula, and the candidate words are sorted by weight from high to low; for example, the TOP 500 words by weight can form the first subject term vector of the article. The entropy is calculated as $H = -\sum p \log p$, where $p$ is the probability of each character within the character set. For example, if the left adjacent character set of a candidate word is {a, a, b, c}, the entropy of its left adjacent character set is $H_a = -\frac{2}{4}\log\frac{2}{4} - \frac{1}{4}\log\frac{1}{4} - \frac{1}{4}\log\frac{1}{4}$. A larger entropy indicates that the candidate string is more independent and therefore more likely to be a subject term of the article. For example, Table 1 lists the 10 highest-weight candidate words calculated for a certain article; they are mainly names of characters, organizations, and the like that appear in the article, and are clearly distinctive. In practice, these 10 highest-weight candidate words can be used as the first subject term vector of the article.
Table 1
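The splitting of "abcd" into candidate words and the entropy of the character set {a, a, b, c} described above can be reproduced with the following Python sketch. The function names candidates and entropy are assumptions made for illustration, and the natural logarithm is also an assumption, since the text does not state the base of the logarithm.

```python
import math
from collections import Counter


def candidates(text, min_len=2, max_len=5):
    """List every substring of length min_len..max_len, shortest first."""
    return [text[i:i + n]
            for n in range(min_len, max_len + 1)
            for i in range(len(text) - n + 1)]


def entropy(chars):
    """H = -sum(p * log p) over the characters in `chars` (natural log assumed)."""
    counts = Counter(chars)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


words = candidates("abcd")   # the six candidate words listed in the text
h_left = entropy("aabc")     # left adjacent character set {a, a, b, c}
```

With the natural logarithm, the entropy of {a, a, b, c} works out to 1.5 · ln 2; a different base only rescales all entropies by a constant factor.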
Further optionally, on the basis of the technical solution of the above embodiment, after step (2) ("calculate the weight of each of the candidate words") and before step (3) ("generate the first subject term vector from the candidate words and their weights"), the method may also include the following steps:
(a) Count the document frequency of each of the candidate words;
In this embodiment, the document frequency of a candidate word is the number of articles, among the N articles in the article pool, in which the candidate word occurs. For example, if the article pool contains 100 articles and word x occurs in the candidate subject term vectors of 20 of them, its document frequency is DF = 20. The larger the document frequency DF of a subject term, the less distinctive the word is, and therefore the less important it is to any particular article. Conversely, if the document frequency of a subject term is DF = 1, that is, it occurs in the subject term vector of only one article, the word is probably exclusive to that article and is highly distinctive.
(c) Update the weight of each of the candidate words according to its document frequency and the number N of articles in the article pool.
For example, the weight of each candidate word may be calculated with the following formula:
$W = W \cdot \log(N / DF)$, where W is the weight of the candidate word, N is the number of articles in the article pool, and DF is the document frequency of the candidate word.
After step (c), the candidate words are re-sorted by their updated weights, and the top M (for example, the TOP 200) are selected as the final first subject term vector of the article.
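Steps (a) and (c) can be sketched in Python as follows. The function names, the representation of the article pool as a list of term collections, and the natural logarithm are assumptions made for this sketch; only the update rule W · log(N / DF) is taken from the text.

```python
import math
from collections import Counter


def document_frequency(pool):
    """Step (a): count in how many articles of the pool each term occurs.

    `pool` is a hypothetical list with one collection of candidate terms per article.
    """
    df = Counter()
    for terms in pool:
        df.update(set(terms))   # each article counts at most once per term
    return df


def update_weights(weights, df, n_articles):
    """Step (c): apply W <- W * log(N / DF) to every candidate word."""
    return {word: w * math.log(n_articles / df[word])
            for word, w in weights.items()}


# Illustrative pool of three articles.
pool = [{"x", "y"}, {"x"}, {"x", "z"}]
df = document_frequency(pool)
updated = update_weights({"x": 2.0, "y": 2.0}, df, len(pool))
```

A word that occurs in every article of the pool (here "x") ends up with weight 0, which matches the observation that a high document frequency means low distinctiveness.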
Note that the above embodiments all describe how the first subject term vector is determined. The second subject term vector is determined in the same way; see the description above for details, which are not repeated here.
Further optionally, on the basis of the embodiment shown in Fig. 1, step 102 ("calculate the similarity between the first subject term vector and the second subject term vector") may include calculating the similarity with the following formula:
where $D$ denotes the first subject term vector and $D_i$ its i-th subject term; $Q$ denotes the second subject term vector and $Q_i$ its i-th subject term; $M$ is the number of subject terms contained in each of the two vectors; and $sim(D, Q)$ is the similarity between the first subject term vector and the second subject term vector. The value of $sim(D, Q)$ lies between 0 and 1, and a larger value indicates that the two vectors are more similar.
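The similarity formula itself is not reproduced in this text. As one plausible reading only, the Python sketch below uses cosine similarity over term weights, which yields a value between 0 and 1 for non-negative weights, as the description requires. Cosine similarity, the function name, and the {term: weight} representation are all assumptions of this sketch and should not be taken as the formula of the original disclosure.

```python
import math


def cosine_similarity(d, q):
    """Cosine similarity of two sparse {term: weight} vectors (assumed measure)."""
    dot = sum(w * q[term] for term, w in d.items() if term in q)
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    if norm_d == 0 or norm_q == 0:
        return 0.0              # an empty vector is treated as dissimilar
    return dot / (norm_d * norm_q)


# Illustrative vectors: identical terms and weights, then no shared terms.
same = cosine_similarity({"a": 3.0, "b": 4.0}, {"a": 3.0, "b": 4.0})
disjoint = cosine_similarity({"a": 1.0}, {"b": 1.0})
```

Under this assumption, two vectors with no subject terms in common score 0, and identical vectors score 1.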
Further optionally, on the basis of the embodiment shown in Fig. 1, step 103 ("judge, according to the magnitude relationship between the similarity and a preset similarity threshold, whether the new chapter is a false chapter of the article") may include: when the similarity is greater than or equal to the preset similarity threshold, determining that the new chapter is a valid chapter of the article; and when the similarity is less than the preset similarity threshold, determining that the new chapter is a false chapter of the article.
Further optionally, on the basis of the technical solution of the above embodiment, after the new chapter is determined to be a false chapter of the article, the method may also include: filtering out the new chapter of the article. That is, the false new chapter is not shown to users of the aggregation platform, which improves the quality of the articles on the platform and the user experience.
All of the optional technical solutions above may be combined in any feasible way to form optional embodiments of the present invention, which are not described one by one here.
With the method of the above embodiments, the entire detection process for a new chapter requires no manual intervention and costs very little: manual review of new chapters is avoided and labor costs are saved. Moreover, by analyzing the detected chapters and the new chapter automatically, the solution can accurately determine whether the new chapter is a false chapter. Online identification takes only milliseconds and does not affect chapter push speed, so when the new chapter is a valid chapter it can be pushed promptly, which ensures the push efficiency of new chapters of the article.
Fig. 2 is a flowchart of the method for detecting a new chapter of an article provided by another embodiment of the present invention. On the basis of Fig. 1 and its embodiments, this embodiment describes the technical solution of the present invention in further detail. As shown in Fig. 2, the method of this embodiment may include the following steps:
200. Split the detected chapters of the article into words to obtain multiple candidate words;
201. Calculate the length of each candidate word, its frequency of occurrence in the article, the entropy of its left adjacent character set, and the entropy of its right adjacent character set;
202. Calculate the weight of each candidate word from its length, its frequency of occurrence in the article, the entropy of its left adjacent character set, and the entropy of its right adjacent character set;
The relevant techniques of the above embodiments may be used; they are not repeated here.
203. Count the document frequency of each of the candidate words;
204. Update the weight of each candidate word according to its document frequency and the number N of articles in the article pool, using the following formula;
The weight of each candidate word is updated according to the formula $W = W \cdot \log(N / DF)$, where W is the weight of the candidate word, N is the number of articles in the article pool, and DF is the document frequency of the candidate word. The W on the left of the equals sign is the updated weight; the W on the right is the weight calculated in step 202, that is, the weight before the update.
205. Take the Top 200 candidate words in descending order of weight to generate the first subject term vector;
206. Determine the second subject term vector of the new chapter of the article;
The second subject term vector of the new chapter is determined in the same way as the first subject term vector in steps 200-205; see those steps for details, which are not repeated here. Note that the first subject term vector and the second subject term vector contain the same number of subject terms.
207. Calculate the similarity between the first subject term vector and the second subject term vector with the following formula:
where $D$ denotes the first subject term vector and $D_i$ its i-th subject term; $Q$ denotes the second subject term vector and $Q_i$ its i-th subject term; $M$ is the number of subject terms contained in each of the two vectors; and $sim(D, Q)$ is the similarity between the first subject term vector and the second subject term vector. The value of $sim(D, Q)$ lies between 0 and 1, and a larger value indicates that the two vectors are more similar.
208. Judge whether the similarity is greater than or equal to the preset similarity threshold T; if so, perform step 209; otherwise, perform step 210;
209. Determine that the new chapter is a valid chapter of the article;
210. Determine that the new chapter is a false chapter of the article, and perform step 211;
211. Filter out the new chapter of the article.
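Steps 208 to 211 can be sketched in Python as follows. The function names, the "valid"/"false" labels, and the sample scores and threshold are hypothetical values chosen for illustration; the text does not specify a value for the threshold T.

```python
def classify_chapter(similarity, threshold):
    """Steps 208-210: 'valid' if similarity >= threshold, otherwise 'false'."""
    return "valid" if similarity >= threshold else "false"


def filter_new_chapters(scored_chapters, threshold):
    """Step 211: drop chapters classified as false, keep the valid ones.

    `scored_chapters` is a hypothetical list of (chapter_id, similarity) pairs.
    """
    return [chapter_id for chapter_id, sim in scored_chapters
            if classify_chapter(sim, threshold) == "valid"]


# Illustrative scores and threshold only.
kept = filter_new_chapters([("chapter_12", 0.35), ("chapter_13", 0.04)], 0.1)
```

Because the comparison is "greater than or equal to", a chapter whose similarity exactly equals T is treated as valid.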
For example, Table 2 below gives some information about the article named novel_tiancaixiangshi. The 2nd column contains the article name, the 3rd column contains chapters from different articles, and the 1st column contains the similarity between columns 2 and 3. The first row represents detected chapters from this article, and the eighth row represents a false chapter. It can be seen that only the chapters in the first row, which belong to this article, have a similarity to the article vector in column 2 that is greater than 0.3, while the chapters of the other articles and the false chapter all have a similarity of less than 0.05. Valid chapters and false chapters can therefore be distinguished very accurately.
Table 2
With the method of this embodiment, the entire detection process for a new chapter requires no manual intervention and costs very little: manual review of new chapters is avoided and labor costs are saved. Moreover, by analyzing the detected chapters and the new chapter automatically, the solution can accurately determine whether the new chapter is a false chapter. Online identification takes only milliseconds and does not affect chapter push speed, so when the new chapter is a valid chapter it can be pushed promptly, which ensures the push efficiency of new chapters of the article.
Fig. 3 is a structural diagram of the device for detecting a new chapter of an article provided by one embodiment of the present invention. As shown in Fig. 3, the device of this embodiment comprises a first determination module 10, a second determination module 11, a computing module 12, and a judging module 13.
The first determination module 10 determines the first subject term vector of the detected chapters of the article; this vector identifies the content of the detected chapters. The second determination module 11 determines the second subject term vector of the new chapter of the article; this vector identifies the content of the new chapter. The computing module 12 is connected to the first determination module 10 and the second determination module 11, and calculates the similarity between the first subject term vector determined by the first determination module 10 and the second subject term vector determined by the second determination module 11. The judging module 13 is connected to the computing module 12, and judges, according to the magnitude relationship between the similarity calculated by the computing module 12 and a preset similarity threshold, whether the new chapter is a false chapter of the article.
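The way the four modules are connected can be sketched in Python as follows. This is a structural illustration under stated assumptions: the class name ChapterDetector, its field names, and the use of plain callables for the modules are hypothetical and do not describe an actual implementation of the device.

```python
from dataclasses import dataclass
from typing import Callable, Dict

Vector = Dict[str, float]   # assumed representation: {subject term: weight}


@dataclass
class ChapterDetector:
    first_determiner: Callable[[str], Vector]       # first determination module 10
    second_determiner: Callable[[str], Vector]      # second determination module 11
    calculator: Callable[[Vector, Vector], float]   # computing module 12
    threshold: float                                # preset similarity threshold

    def is_false_chapter(self, detected_text: str, new_text: str) -> bool:
        """Judging module 13: True when the similarity is below the threshold."""
        d = self.first_determiner(detected_text)
        q = self.second_determiner(new_text)
        return self.calculator(d, q) < self.threshold
```

Any vector extractor and similarity function with matching signatures can be plugged in, which mirrors how the computing module consumes the outputs of both determination modules and the judging module consumes the output of the computing module.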
The device of this embodiment uses the above modules to detect new chapters of an article by the same mechanism as the related method embodiments above; see those embodiments for details, which are not repeated here.
Using the above modules, the device of this embodiment determines a first subject term vector that identifies the content of the detected chapters of the article, determines a second subject term vector that identifies the content of the new chapter, calculates the similarity between the two vectors, and judges, according to the magnitude relationship between the similarity and a preset similarity threshold, whether the new chapter is a false chapter of the article. With the technical solution of this embodiment, the entire detection process requires no manual intervention and costs very little: manual review of new chapters is avoided and labor costs are saved. Moreover, by analyzing the detected chapters and the new chapter automatically, the solution can accurately determine whether the new chapter is a false chapter. Online identification takes only milliseconds and does not affect chapter push speed, so when the new chapter is a valid chapter it can be pushed promptly, which ensures the push efficiency of new chapters of the article.
The structural representation of the pick-up unit of the new chapters and sections of article that Fig. 4 provides for another embodiment of the present invention.As shown in Figure 4, the pick-up unit of the new chapters and sections of article of the present embodiment, on above-mentioned basis embodiment illustrated in fig. 3, comprises following technical scheme further.
As shown in Figure 4, the first determination module 10 in the pick-up unit of new chapters and sections of the article of the present embodiment comprises split cells 101, computing unit 102 and generation unit 103.
The splitting unit 101 is configured to split the detected chapters of the article into words, obtaining multiple candidate words. The computing unit 102 is connected to the splitting unit 101 and is configured to calculate the weight of each of the candidate words obtained by the splitting unit 101. The generation unit 103 is connected to the splitting unit 101 and to the computing unit 102, and is configured to generate the first subject term vector according to the candidate words obtained by the splitting unit 101 and the weight of each candidate word calculated by the computing unit 102.
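The splitting step can be illustrated with a minimal sketch. The text here does not say how the splitting unit produces candidate words, so the sketch assumes the simplest possible approach: every substring of 2 to max_len characters is a candidate. The function name split_candidates and the max_len parameter are hypothetical and are not taken from the patent.

```python
def split_candidates(text: str, max_len: int = 4) -> list[str]:
    """Return every substring of 2..max_len characters as a candidate word.

    Assumption: plain character n-grams stand in for the word splitting
    performed by the splitting unit; the patent text does not fix a method.
    """
    candidates = []
    for n in range(2, max_len + 1):
        # Slide a window of n characters across the text.
        for i in range(len(text) - n + 1):
            candidates.append(text[i:i + n])
    return candidates
```

A real system would probably also strip punctuation and whitespace before generating candidates; that is left out to keep the sketch short.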
Further optionally, in the device of this embodiment, the computing unit 102 is specifically configured to calculate, for each of the candidate words obtained by the splitting unit 101, its length, the frequency with which it occurs in the article, the entropy of its left-adjacent character set and the entropy of its right-adjacent character set; and to calculate the weight of each candidate word from these four quantities.
Further optionally, in the device of this embodiment, the computing unit 102 calculates the weight of each of the candidate words obtained by the splitting unit 101 using the following formula:
[formula not reproduced in this text]
where W is the weight of the candidate word, TF is the frequency with which the candidate word occurs in the article, Ha is the entropy of the left-adjacent character set, Hb is the entropy of the right-adjacent character set, and L is the length of the candidate word.
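The four inputs to the weight (L, TF, Ha, Hb) can be computed as in the sketch below. It only gathers the inputs; it does not combine them into W, because the weight formula itself is not available in this text. The function names, the use of a raw occurrence count for TF, and the natural logarithm in the entropy are assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(counter: Counter) -> float:
    """Shannon entropy (natural log) of the distribution given by the counts."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counter.values())


def candidate_features(text: str, word: str) -> dict:
    """Collect L, TF, Ha and Hb for one candidate word in one article text."""
    left, right = Counter(), Counter()
    tf = 0
    start = text.find(word)
    while start != -1:
        tf += 1
        if start > 0:
            left[text[start - 1]] += 1   # character just before the word
        end = start + len(word)
        if end < len(text):
            right[text[end]] += 1        # character just after the word
        start = text.find(word, start + 1)
    return {"L": len(word), "TF": tf, "Ha": entropy(left), "Hb": entropy(right)}
```

A candidate whose neighbours vary a lot on both sides (high Ha and Hb) behaves like a free-standing word, which is presumably why the two entropies enter the weight.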
Further optionally, in the device of this embodiment, the generation unit 103 is specifically configured to take, from the candidate words obtained by the splitting unit 101, the top M candidate words in descending order of the weights calculated by the computing unit 102, and to generate the first subject term vector from them.
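Selecting the top M candidate words by weight can be sketched as follows. The representation of the subject term vector as a list of (word, weight) pairs and the name build_subject_term_vector are assumptions; the patent text only states that M candidate words are taken in descending order of weight.

```python
def build_subject_term_vector(weights: dict[str, float], m: int) -> list[tuple[str, float]]:
    """Return the m highest-weighted candidate words with their weights."""
    # sorted() is stable, so ties keep the order in which words were inserted.
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:m]
```

For example, with weights {"a": 0.2, "b": 0.9, "c": 0.5} and M = 2, the sketch keeps "b" and "c".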
Further optionally, in the device of this embodiment, the first determination module 10 also includes a statistics unit 104 and an updating unit 105.
The statistics unit 104 is connected to the splitting unit 101. After the computing unit 102 has calculated the weight of each candidate word, and before the generation unit 103 generates the first subject term vector from the candidate words and their weights, the statistics unit 104 counts the document frequency of each of the candidate words obtained by the splitting unit 101; the document frequency is the number of articles, among the N articles included in the article pool, in which the candidate word occurs. The updating unit 105 is connected to the statistics unit 104 and to the computing unit 102. The updating unit 105 is configured to update the weight of each candidate word calculated by the computing unit 102, according to the document frequency counted by the statistics unit 104, the number N of articles in the article pool, and the weight calculated by the computing unit 102.
In this case the generation unit 103 is connected to the updating unit 105, and generates the first subject term vector according to the candidate words obtained by the splitting unit 101 and the weight of each candidate word as updated by the updating unit 105.
For example, the updating unit 105 calculates the weight of each candidate word using the following formula:
W = W × log(N/DF), where W is the weight of the candidate word, DF is the document frequency of the candidate word, and N is the number of articles in the article pool. The W on the left of the equals sign is the weight of the candidate word after the update; the W on the right is the weight calculated in step 202, that is, the weight before the update.
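The update W = W × log(N/DF) can be written directly in code. The text does not state the base of the logarithm, so the natural logarithm is assumed here; the function name update_weight and the input validation are additions for illustration.

```python
import math


def update_weight(w: float, n_articles: int, doc_freq: int) -> float:
    """Scale a candidate word's weight by log(N / DF).

    w          -- weight before the update
    n_articles -- N, the number of articles in the article pool
    doc_freq   -- DF, the number of those articles containing the word
    """
    if n_articles <= 0 or doc_freq <= 0:
        raise ValueError("N and DF must both be positive")
    return w * math.log(n_articles / doc_freq)
```

A word that occurs in every article of the pool (DF = N) ends up with weight 0, so words common to all articles drop out of the subject term vector, while words specific to a few articles keep most of their weight.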
Further optionally, in the device of this embodiment, the computing module 12 may be connected to the generation unit 103, and calculates the similarity between the first subject term vector generated by the generation unit 103 and the second subject term vector using the following formula:
[formula not reproduced in this text]
where D denotes the first subject term vector and D_i denotes the i-th subject term in the first subject term vector; Q denotes the second subject term vector and Q_i denotes the i-th subject term in the second subject term vector; m denotes the number of subject terms included in each of the first and second subject term vectors; and sim(D, Q) denotes the similarity between the first subject term vector and the second subject term vector.
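Because the similarity formula is not available in this text, the sketch below uses cosine similarity, a common measure for comparing weighted term vectors. This choice is an assumption and should not be read as the patent's formula. Each vector is represented as a hypothetical mapping from subject term to weight, and terms missing from one vector are treated as having weight 0.

```python
import math


def cosine_similarity(d: dict[str, float], q: dict[str, float]) -> float:
    """Cosine of the angle between two term-weight vectors."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(w * q.get(term, 0.0) for term, w in d.items())
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    if norm_d == 0.0 or norm_q == 0.0:
        return 0.0  # an empty or all-zero vector is treated as dissimilar
    return dot / (norm_d * norm_q)
```

With non-negative weights the result lies between 0 and 1, which makes it straightforward to compare against a preset threshold.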
Specifically, like the first determination module 10, the second determination module 11 also includes the splitting unit 101, the computing unit 102 and the generation unit 103, as well as the statistics unit 104 and the updating unit 105, and uses them to determine the second subject term vector; for details, refer to the description of the above embodiments, which is not repeated here.
Further optionally, in the device of this embodiment, the judging module 13 is specifically configured to compare the similarity calculated by the computing module 12 with the preset similarity threshold: when the similarity is greater than or equal to the preset similarity threshold, the new chapter is determined to be a valid chapter of the article; when the similarity is less than the preset similarity threshold, the new chapter is determined to be a false chapter of the article.
Further optionally, the device of this embodiment also includes a filtering module 14. The filtering module 14 is connected to the judging module 13 and is configured to filter out the new chapter of the article after the judging module 13 determines that the new chapter is a false chapter.
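The judging and filtering steps amount to a threshold comparison followed by dropping the chapters judged false. The sketch below assumes each new chapter arrives as a hypothetical (chapter_id, similarity) pair; the function names are illustrative only.

```python
def is_false_chapter(similarity: float, threshold: float) -> bool:
    """A new chapter is false when its similarity is below the threshold."""
    return similarity < threshold


def filter_new_chapters(chapters: list[tuple[str, float]], threshold: float) -> list[str]:
    """Return the ids of the new chapters that are judged valid."""
    return [cid for cid, sim in chapters if not is_false_chapter(sim, threshold)]
```

A similarity exactly equal to the threshold counts as valid, matching the "greater than or equal to" condition in the description.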
All of the optional technical solutions of the device of this embodiment may be combined in any combinable manner to form optional embodiments of the present invention, which are not described one by one here.
The device for detecting new chapters of an article of this embodiment uses the above modules to detect new chapters by the same mechanism as the related method embodiments above; for details, refer to the description of those embodiments, which is not repeated here.
In the device for detecting new chapters of an article of this embodiment, the above modules make the entire detection process free of manual intervention and very low in cost; manual review of new chapters is avoided and labor cost is effectively saved. Moreover, by automatically analyzing the detected chapters and the new chapter, the technical solution of the embodiment of the present invention can accurately determine whether the new chapter is a false chapter. The online identification process takes only milliseconds and does not affect chapter push speed, so that when the new chapter is a valid chapter it can be pushed promptly, which ensures the push efficiency of new chapters of the article.
An embodiment of the present invention may also provide an article aggregation platform provided with the device for detecting new chapters of an article of the embodiment shown in Fig. 3 or Fig. 4. The device may use the method for detecting new chapters of an article of the embodiment shown in Fig. 1 or Fig. 2 to detect new chapters; for details, refer to the description of the above related embodiments, which is not repeated here.
It should be noted that when the device provided by the above embodiments detects new chapters of an article, the division into the above functional modules is used only as an example. In practice, the above functions may be assigned to different functional modules as required; that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the device and the method for detecting new chapters of an article provided by the above embodiments belong to the same concept; for the specific implementation process, refer to the method embodiments, which is not repeated here.
The serial numbers of the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.
A person of ordinary skill in the art will understand that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk, an optical disc, or the like.
The foregoing is merely a preferred embodiment of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (20)

1. A method for detecting a new chapter of an article, characterized in that the method comprises:
Determining a first subject term vector of the detected chapters of an article, the first subject term vector being used to identify the content of the detected chapters of the article;
Determining a second subject term vector of a new chapter of the article, the second subject term vector being used to identify the content of the new chapter of the article;
Calculating the similarity between the first subject term vector and the second subject term vector;
Judging, according to the magnitude relationship between the similarity and a preset similarity threshold, whether the new chapter is a false chapter of the article.
2. The method according to claim 1, characterized in that determining the first subject term vector of the detected chapters of the article comprises:
Splitting the detected chapters of the article into words to obtain multiple candidate words;
Calculating the weight of each candidate word among the multiple candidate words;
Generating the first subject term vector according to the multiple candidate words and the weight of each candidate word.
3. The method according to claim 2, characterized in that calculating the weight of each candidate word among the multiple candidate words comprises:
Calculating the length of each candidate word, the frequency with which it occurs in the article, the entropy of its left-adjacent character set and the entropy of its right-adjacent character set;
Calculating the weight of each candidate word according to the length of the candidate word, the frequency with which it occurs in the article, the entropy of the left-adjacent character set and the entropy of the right-adjacent character set.
4. The method according to claim 3, characterized in that the weight of each candidate word is calculated, according to the length of the candidate word, the frequency with which it occurs in the article, the entropy of the left-adjacent character set and the entropy of the right-adjacent character set, using the following formula:
[formula not reproduced in this text]
where W is the weight of the candidate word, TF is the frequency with which the candidate word occurs in the article, Ha is the entropy of the left-adjacent character set, Hb is the entropy of the right-adjacent character set, and L is the length of the candidate word.
5. The method according to claim 2, characterized in that generating the first subject term vector according to the multiple candidate words and the weight of each candidate word comprises:
Taking, from the multiple candidate words, the top M candidate words in descending order of weight, and generating the first subject term vector.
6. The method according to any one of claims 2 to 5, characterized in that after calculating the weight of each candidate word among the multiple candidate words, and before generating the first subject term vector according to the multiple candidate words and the weight of each candidate word, the method further comprises:
Counting the document frequency of each candidate word among the multiple candidate words, the document frequency being the number of articles, among the N articles included in an article pool, in which the candidate word occurs;
Updating the weight of each candidate word among the multiple candidate words according to the document frequency of the candidate word and the number N of articles included in the article pool.
7. The method according to claim 6, characterized in that the weight of each candidate word is updated, according to the document frequency of the candidate word and the number N of articles included in the article pool, using the following formula:
W = W × log(N/DF), where W is the weight of the candidate word and DF is the document frequency of the candidate word.
8. The method according to claim 1, characterized in that calculating the similarity between the first subject term vector and the second subject term vector comprises calculating the similarity using the following formula:
[formula not reproduced in this text]
where D denotes the first subject term vector and D_i denotes the i-th subject term in the first subject term vector; Q denotes the second subject term vector and Q_i denotes the i-th subject term in the second subject term vector; m denotes the number of subject terms included in each of the first and second subject term vectors; and sim(D, Q) denotes the similarity between the first subject term vector and the second subject term vector.
9. The method according to claim 7 or 8, characterized in that judging, according to the magnitude relationship between the similarity and the preset similarity threshold, whether the new chapter is a false chapter of the article comprises:
When the similarity is greater than or equal to the preset similarity threshold, determining that the new chapter is a valid chapter of the article;
When the similarity is less than the preset similarity threshold, determining that the new chapter is a false chapter of the article.
10. The method according to claim 9, characterized in that after determining that the new chapter is a false chapter of the article, the method further comprises:
Filtering out the new chapter of the article.
11. A device for detecting a new chapter of an article, characterized in that the device comprises:
A first determination module, configured to determine a first subject term vector of the detected chapters of an article, the first subject term vector being used to identify the content of the detected chapters of the article;
A second determination module, configured to determine a second subject term vector of a new chapter of the article, the second subject term vector being used to identify the content of the new chapter of the article;
A computing module, configured to calculate the similarity between the first subject term vector and the second subject term vector;
A judging module, configured to judge, according to the magnitude relationship between the similarity and a preset similarity threshold, whether the new chapter is a false chapter of the article.
12. The device according to claim 11, characterized in that the first determination module comprises:
A splitting unit, configured to split the detected chapters of the article into words to obtain multiple candidate words;
A computing unit, configured to calculate the weight of each candidate word among the multiple candidate words;
A generation unit, configured to generate the first subject term vector according to the multiple candidate words and the weight of each candidate word.
13. The device according to claim 12, characterized in that the computing unit is specifically configured to calculate the length of each candidate word, the frequency with which it occurs in the article, the entropy of its left-adjacent character set and the entropy of its right-adjacent character set; and to calculate the weight of each candidate word according to the length of the candidate word, the frequency with which it occurs in the article, the entropy of the left-adjacent character set and the entropy of the right-adjacent character set.
14. The device according to claim 13, characterized in that the computing unit calculates the weight of each candidate word using the following formula:
[formula not reproduced in this text]
where W is the weight of the candidate word, TF is the frequency with which the candidate word occurs in the article, Ha is the entropy of the left-adjacent character set, Hb is the entropy of the right-adjacent character set, and L is the length of the candidate word.
15. The device according to claim 12, characterized in that the generation unit is specifically configured to take, from the multiple candidate words, the top M candidate words in descending order of weight, and to generate the first subject term vector.
16. The device according to any one of claims 12 to 15, characterized in that the first determination module further comprises:
A statistics unit, configured to count the document frequency of each candidate word among the multiple candidate words after the computing unit calculates the weight of each candidate word and before the generation unit generates the first subject term vector according to the multiple candidate words and the weight of each candidate word, the document frequency being the number of articles, among the N articles included in an article pool, in which the candidate word occurs;
An updating unit, configured to update the weight of each candidate word among the multiple candidate words according to the document frequency of the candidate word and the number N of articles included in the article pool.
17. The device according to claim 16, characterized in that the updating unit calculates the weight of each candidate word using the following formula:
W = W × log(N/DF), where W is the weight of the candidate word and DF is the document frequency of the candidate word.
18. The device according to claim 11, characterized in that the computing module calculates the similarity between the first subject term vector and the second subject term vector using the following formula:
[formula not reproduced in this text]
where D denotes the first subject term vector and D_i denotes the i-th subject term in the first subject term vector; Q denotes the second subject term vector and Q_i denotes the i-th subject term in the second subject term vector; m denotes the number of subject terms included in each of the first and second subject term vectors; and sim(D, Q) denotes the similarity between the first subject term vector and the second subject term vector.
19. The device according to claim 17 or 18, characterized in that the judging module is specifically configured to determine that the new chapter is a valid chapter of the article when the similarity is greater than or equal to the preset similarity threshold, and to determine that the new chapter is a false chapter of the article when the similarity is less than the preset similarity threshold.
20. The device according to claim 19, characterized in that the device further comprises:
A filtering module, configured to filter out the new chapter of the article after the judging module determines that the new chapter is a false chapter of the article.
CN201310223253.0A 2013-06-06 2013-06-06 New article chapter detecting method and device Pending CN104239285A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910649833.3A CN110347931A (en) 2013-06-06 2013-06-06 The detection method and device of the new chapters and sections of article
CN201310223253.0A CN104239285A (en) 2013-06-06 2013-06-06 New article chapter detecting method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310223253.0A CN104239285A (en) 2013-06-06 2013-06-06 New article chapter detecting method and device

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN201910649833.3A Division CN110347931A (en) 2013-06-06 2013-06-06 The detection method and device of the new chapters and sections of article

Publications (1)

Publication Number Publication Date
CN104239285A true CN104239285A (en) 2014-12-24

Family

ID=52227382

Family Applications (2)

Application Number Title Priority Date Filing Date
CN201310223253.0A Pending CN104239285A (en) 2013-06-06 2013-06-06 New article chapter detecting method and device
CN201910649833.3A Pending CN110347931A (en) 2013-06-06 2013-06-06 The detection method and device of the new chapters and sections of article

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN201910649833.3A Pending CN110347931A (en) 2013-06-06 2013-06-06 The detection method and device of the new chapters and sections of article

Country Status (1)

Country Link
CN (2) CN104239285A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101122909A (en) * 2006-08-10 2008-02-13 株式会社日立制作所 Text message indexing unit and text message indexing method
US20110055332A1 (en) * 2009-08-28 2011-03-03 Stein Christopher A Comparing similarity between documents for filtering unwanted documents
CN102081598A (en) * 2011-01-27 2011-06-01 北京邮电大学 Method for detecting duplicated texts
CN102411583A (en) * 2010-09-20 2012-04-11 阿里巴巴集团控股有限公司 Method and device for matching texts
CN103077157A (en) * 2013-01-22 2013-05-01 清华大学 Method and device for visualizing text set similarity

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7711312B2 (en) * 2005-02-03 2010-05-04 Educational Testing Service Method and system for detecting off-topic essays without topic-specific training
JP5379138B2 (en) * 2007-08-23 2013-12-25 グーグル・インコーポレーテッド Creating an area dictionary
CN103020022B (en) * 2012-11-20 2016-01-27 北京航空航天大学 A kind of Chinese unknown word identification system and method based on improving Information Entropy Features

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
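丁溪源: "Design and Implementation of a Chinese New-Word Extraction Algorithm Based on a Large-Scale Corpus", China Master's Theses Full-text Database, Information Science and Technology *
李文翔 et al.: "Design and Implementation of a Corpus System Based on Content Topics", Application Research of Computers *
聂金慧 et al.: "A Survey of Research on Chinese New-Word Extraction and Filtering", Sciencepaper Online *
阮一峰: "Applications of TF-IDF and Cosine Similarity (II): Finding Similar Articles", HTTP://WWW.RUANYIFENG.COM/BLOG/2013/03/COSINE_SIMILARITY.HTML *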

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017080183A1 (en) * 2015-11-12 2017-05-18 北京奇虎科技有限公司 Network novel chapter list evaluation method and device
CN105677641A (en) * 2016-01-13 2016-06-15 夏峰 Paper self-inspection method and system
CN105701085A (en) * 2016-01-13 2016-06-22 湖南通远网络科技有限公司 Network duplicate checking method and system
CN105701076A (en) * 2016-01-13 2016-06-22 湖南通远网络科技有限公司 Thesis plagiarism detection method and system
CN105677641B (en) * 2016-01-13 2018-03-16 夏峰 A kind of paper self checking method and system
CN105701076B (en) * 2016-01-13 2018-05-22 湖南通远网络科技有限公司 A kind of paper plagiarizes detection method and system
CN105701085B (en) * 2016-01-13 2018-05-22 湖南通远网络科技有限公司 A kind of network duplicate checking method and system
CN106294292A (en) * 2016-07-20 2017-01-04 腾讯科技(深圳)有限公司 Chapters and sections catalogue screening technique and device
CN106294292B (en) * 2016-07-20 2020-12-25 腾讯科技(深圳)有限公司 Chapter catalog screening method and device
CN107085568A (en) * 2017-03-29 2017-08-22 腾讯科技(深圳)有限公司 A kind of text similarity method of discrimination and device
CN107085568B (en) * 2017-03-29 2022-11-22 腾讯科技(深圳)有限公司 Text similarity distinguishing method and device
WO2021159760A1 (en) * 2020-09-09 2021-08-19 平安科技(深圳)有限公司 Article truncation point setting method and apparatus, and computer device

Also Published As

Publication number Publication date
CN110347931A (en) 2019-10-18

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20141224