CN106909669B

CN106909669B - Method and device for detecting promotion information

Info

Publication number: CN106909669B
Application number: CN201710113764.5A
Authority: CN
Inventors: 张德斌
Original assignee: Beijing Time Ltd By Share Ltd
Current assignee: Beijing time Ltd.
Priority date: 2017-02-28
Filing date: 2017-02-28
Publication date: 2020-02-11
Anticipated expiration: 2037-02-28
Also published as: CN106909669A

Abstract

The invention discloses a method and a device for detecting promotion information, relating to the technical field of text filtering. The method comprises: acquiring a preset sample set, and extracting the information units contained in each sample in the sample set; counting the number of occurrences of each information unit in the sample set, and determining the information units whose number of occurrences is greater than a preset first threshold as candidate feature units; for each candidate feature unit, counting the distribution of the candidate feature unit over the document positions, and determining from the statistical result whether the candidate feature unit is a promotion feature unit; and detecting promotion information contained in a document according to the determined promotion feature units. The method and the device can thereby filter advertisement information or spam promotion information effectively and accurately, so that clean news content can be extracted by machine crawling and the efficiency of aggregating news from self-media platforms is greatly improved.

Description

Method and device for detecting promotion information

Technical Field

The invention relates to the technical field of text filtering, and in particular to a method and a device for detecting promotion information.

Background

With the development of internet technology, the self-media era has arrived. Compared with traditional news media, self-media news is more timely and draws on a wider range of sources, and the openness of self-media platforms allows every platform user to be both a reader of news and a producer and publisher of it. At present, more and more breaking news is released promptly on self-media platforms such as WeChat and microblogs, and people are increasingly accustomed to obtaining news of interest from these platforms. Meanwhile, self-media news spreads effectively as users forward it to one another.

However, in the course of implementing the present invention, the inventor found at least the following problem in the prior art: to aggregate self-media news for the convenience of readers, a machine-crawling method can be used to collect news content from self-media platforms. Because self-media news content is often mixed with advertisement information or spam promotion information, however, the prior art cannot accurately filter out this information when the content is crawled, so clean news content cannot be obtained.

Disclosure of Invention

In view of the above, the present invention has been made to provide a method and a device for detecting promotion information that overcome or at least partially solve the above problems.

According to one aspect of the present invention, there is provided a method for detecting promotion information, including: acquiring a preset sample set, and extracting the information units contained in each sample in the sample set; counting the number of occurrences of each information unit in the sample set, and determining the information units whose number of occurrences is greater than a preset first threshold as candidate feature units; for each candidate feature unit, counting the distribution of the candidate feature unit over the document positions, and determining from the statistical result whether the candidate feature unit is a promotion feature unit; and detecting promotion information contained in a document according to the determined promotion feature units.

According to another aspect of the present invention, there is provided a device for detecting promotion information, including: an information unit extraction module, configured to acquire a preset sample set and extract the information units contained in each sample in the sample set; a candidate unit determining module, configured to count the number of occurrences of each information unit in the sample set and determine the information units whose number of occurrences is greater than a preset first threshold as candidate feature units; a promotion unit determining module, configured to count, for each candidate feature unit, the distribution of the candidate feature unit over the document positions and determine from the statistical result whether the candidate feature unit is a promotion feature unit; and a detection module, configured to detect promotion information contained in a document according to the determined promotion feature units.

The invention thus provides a method and a device for detecting promotion information. Information units are extracted from a preset sample set; candidate feature units are determined among them according to the number of occurrences of each information unit in the sample set; promotion feature units are determined among the candidate feature units according to the distribution of their positions in the documents; and the promotion information contained in a target document is finally detected according to the promotion feature units so selected. Advertisement information or spam promotion information can therefore be filtered effectively and accurately when self-media news is extracted by machine crawling, so that clean news content is obtained and the efficiency of aggregating self-media news is greatly improved.

The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.

Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:

Fig. 1 is a flowchart of a method for detecting promotion information according to a first embodiment of the present invention;

Fig. 2 is a flowchart of a method for detecting promotion information according to a second embodiment of the present invention;

Fig. 3 is a schematic structural diagram of a device for detecting promotion information according to a third embodiment of the present invention;

Fig. 4 is a schematic structural diagram of a device for detecting promotion information according to a fourth embodiment of the present invention;

Fig. 5 is a histogram of the distribution over document positions of time-related candidate feature units according to an embodiment of the present invention;

Fig. 6 is a histogram of the distribution over document positions of candidate feature units associated with advertisement information or spam promotion information according to an embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

Example one

Fig. 1 shows a flowchart of a method for detecting promotion information provided by the present invention. The method includes:

Step S110: acquiring a preset sample set, and extracting the information units contained in each sample in the sample set.

So that a computer can process the sample news content, the preset sample news content containing advertisement information or spam promotion information is first segmented according to a certain rule, and the information units contained in each sample are extracted. The preset sample set consists of representative self-media news content that contains advertisement information or spam promotion information; it is generally selected by a person skilled in the art on the basis of experience. An information unit is a basic unit of the sample news content, and may take the form of a feature phrase produced by segmenting the sample news content, or of a word with certain features. The present invention does not limit the specific rule for composing the preset sample set or the specific form of the information unit; those skilled in the art can set them flexibly according to the actual situation.

Step S120: counting the number of occurrences of each information unit in the sample set, and determining the information units whose number of occurrences is greater than a preset first threshold as candidate feature units.

Because advertisement information and spam promotion information are deliberately repeated by each self-media news publisher, different news items from the same publisher generally contain the same advertisement information or spam promotion information. The number of occurrences in the sample set of each information unit extracted in step S110 is therefore counted. When the number of occurrences of an information unit exceeds the preset first threshold, the information unit is strongly suspected of belonging to advertisement information or spam promotion information and is determined to be a candidate feature unit.

The preset first threshold is determined from how often advertisement information or spam promotion information is typically repeated in sample news content from the same news publisher; an information unit that occurs more often than this is determined to be a candidate feature unit suspected of carrying advertisement information or spam promotion information. The present invention does not limit the specific rule for determining the first threshold; those skilled in the art can determine it flexibly from experimental data and experience.
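The frequency screening of step S120 can be illustrated with a minimal sketch. The function name `candidate_feature_units`, the sample phrases and the threshold value used here are assumptions chosen for illustration; the patent does not prescribe an implementation.

```python
from collections import Counter


def candidate_feature_units(samples, first_threshold):
    """Return the information units that occur more than `first_threshold` times.

    `samples` is a list of samples, each already split into information units.
    Every occurrence across the whole sample set is counted.
    """
    counts = Counter(unit for units in samples for unit in units)
    return {unit for unit, count in counts.items() if count > first_threshold}


# Hypothetical sample set: three articles sharing one repeated phrase.
samples = [
    ["goal in extra time", "follow our official account"],
    ["markets close higher", "follow our official account"],
    ["storm warning issued", "follow our official account"],
]
print(candidate_feature_units(samples, first_threshold=2))
```

With a threshold of 2, only the phrase repeated in all three samples survives as a candidate feature unit.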

Step S130: for each candidate feature unit, counting the distribution of the candidate feature unit over the document positions, and determining from the statistical result whether the candidate feature unit is a promotion feature unit.

After the preliminary screening in step S120, most information units containing advertisement information or spam promotion information have been determined to be candidate feature units, but so have some information units that belong to normal news content and happen to be repeated more often than the first threshold.

Through a large number of experiments and repeated comparison, the inventor found that candidate feature units belonging to normal news content are not deliberately repeated by the news publisher, so their positions in the samples are generally evenly distributed; candidate feature units belonging to advertisement information or spam promotion information are deliberately repeated by the news publisher, so their positions in the samples are concentrated. On the basis of this finding, the invention further screens the candidate feature units by their position distribution in the samples, and determines the candidate feature units with a more concentrated position distribution to be promotion feature units.

Step S140: detecting promotion information contained in a document according to the determined promotion feature units.

The preceding steps yield the promotion feature units extracted from the preset sample set. Documents to be detected, obtained by machine crawling, are then checked against these promotion feature units, so that the promotion information they contain is effectively identified. Finally, the identified promotion information is removed from the documents to be detected, leaving relatively clean news content.
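As a rough illustration of step S140, the sketch below splits a document into information units and separates those that match a known promotion feature unit from the rest. The delimiter set, the exact-match rule and the names `split_units` and `detect_promotion` are assumptions made for this example; the patent leaves the detection model open.

```python
import re

# Assumed delimiter set: common Western and Chinese punctuation plus line breaks.
_DELIMITERS = re.compile(r"[,.;:!?，。；：！？\n]+")


def split_units(text):
    """Split text into non-empty, whitespace-trimmed information units."""
    return [part.strip() for part in _DELIMITERS.split(text) if part.strip()]


def detect_promotion(document, promotion_units):
    """Return (flagged, kept): units matching a promotion feature unit, and the rest."""
    units = split_units(document)
    flagged = [u for u in units if u in promotion_units]
    kept = [u for u in units if u not in promotion_units]
    return flagged, kept


# Hypothetical promotion feature units and document.
promotion_units = {"Follow our official account", "more surprises await you"}
document = "The match ended 2-1. Follow our official account, more surprises await you"
flagged, kept = detect_promotion(document, promotion_units)
print(flagged)
print(kept)
```

The `kept` list corresponds to the relatively clean news content that remains after the promotion information has been removed.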

In summary, in the method for detecting promotion information provided by this embodiment, information units are extracted from a preset sample set; candidate feature units are determined among them according to the number of occurrences of each information unit in the sample set; promotion feature units are determined among the candidate feature units according to the distribution of their positions in the documents; and the promotion information contained in a target document is detected according to the promotion feature units so selected. Advertisement information or spam promotion information can therefore be filtered effectively and accurately when self-media news is extracted by machine crawling, so that clean news content is obtained and the efficiency of aggregating self-media news is greatly improved.

Example two

Fig. 2 shows a flowchart of a method for detecting promotion information provided by the present invention. The method includes:

Step S210: acquiring a preset sample set, and extracting the information units contained in each sample in the sample set.

So that a computer can process the sample news content, the preset sample news content containing advertisement information or spam promotion information is first segmented according to a certain rule, and the information units contained in each sample are extracted. Because the same news item is often reposted many times, deduplication is performed before the preset sample set is obtained, which effectively reduces the amount of computation on the sample set and improves efficiency.

Specifically, the deduplication comprises: calculating the similarity between the titles of the candidate samples, and deduplicating candidate samples whose title similarity is greater than a preset similarity threshold; and, for candidate samples whose title similarity is not greater than the preset similarity threshold, looking up the keyword set corresponding to each candidate sample, and deduplicating two candidate samples if the number of identical keywords in their keyword sets is greater than a preset number threshold. Preferably, the title similarity is calculated with the longest common subsequence (LCS) algorithm, the keyword set of each candidate sample is determined from the inverse document frequency (IDF) of each word obtained by segmenting the candidate sample, and the number threshold is determined with the Jaccard similarity algorithm.
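A title similarity based on the longest common subsequence might be sketched as follows. The text does not say how the LCS length is turned into a similarity score, so dividing by the length of the longer title is an assumption made here, as are the function names.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for item_a in a:
        cur = [0]
        for j, item_b in enumerate(b, 1):
            if item_a == item_b:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def title_similarity(title_a, title_b):
    """LCS length divided by the longer title length (assumed normalisation)."""
    if not title_a or not title_b:
        return 0.0
    return lcs_length(title_a, title_b) / max(len(title_a), len(title_b))


print(title_similarity("abcd", "abxd"))  # LCS is "abd", so 3 / 4
```

The function works on any sequences, so titles can be compared character by character (as in the example) or as lists of segmented words.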

To facilitate understanding, the deduplication process is described in detail below by way of a specific example:

1. Perform Chinese word segmentation and stop-word removal on the titles and body text of all sample articles.

2. Using distributed computation, count the term frequency (TF) of each word in each sample article, calculate the corresponding inverse document frequency (IDF), and then calculate the TF-IDF score of each word.

3. Take the first 20 words of the title segmentation result to form a keyword set (20 is only the value used in this specific example; in other embodiments, a person skilled in the art can set the number of keywords according to actual conditions). When the title segmentation result contains fewer than 20 words, the remaining keywords are filled in with words from the body text in descending order of TF-IDF score.

4. Build buckets from the keyword sets of all articles. (Bucketing is a finer-grained way of partitioning a data range: it adds extra structure to a table that query operations can exploit, giving higher query efficiency.)

5. To calculate the similarity of an article, first find the 20 buckets corresponding to it (each bucket corresponds to one keyword, and each article has 20 keywords, so each article corresponds to 20 buckets). Then calculate the similarity between the title of the article and the titles of all articles in those buckets with the longest common subsequence algorithm. When the title similarity exceeds 0.75 (the preset similarity threshold in this specific example, which a person skilled in the art can set according to the actual situation), the two articles are judged to be sample articles with the same content and are deduplicated.

6. When the title similarity is not greater than 0.75, compare the 20 keywords of the two articles pairwise. When the number of identical keywords exceeds 16, the two articles are judged to be sample articles with the same content and are deduplicated. Here 16 is the preset number threshold determined with the Jaccard similarity algorithm in this specific example: each article has 20 keywords, the number of identical keywords is x, and setting the Jaccard similarity $x / ((20 - x) + (20 - x) + x) = x / (40 - x)$ equal to 0.66 gives $x \approx 16$.
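The relationship between the Jaccard similarity of 0.66 and the keyword count of 16 can be checked numerically. The sketch below assumes two keyword sets of equal size; the helper names `jaccard` and `min_shared_keywords` are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def min_shared_keywords(set_size, target):
    """Smallest number x of shared keywords between two sets of `set_size`
    keywords each for which x / (2 * set_size - x) reaches `target`."""
    if set_size <= 0:
        raise ValueError("set_size must be positive")
    for x in range(set_size + 1):
        if x / (2 * set_size - x) >= target:
            return x
    return None


print(min_shared_keywords(20, 0.66))  # 15 shared gives 15/25 = 0.60, 16 gives 16/24 ≈ 0.667
```

With 20 keywords per article, 16 shared keywords is the first count at which the Jaccard similarity reaches 0.66, which agrees with the number threshold used in the example.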

In the above example, when the similarity of two articles is compared, the titles are compared first and the keywords second. On the one hand, titles are short, so comparing them is cheap and fast, and in general most articles with similar content also have similar titles. On the other hand, comparing by keywords alone or by titles alone hits a bottleneck, and the result is not accurate enough.
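The two-stage comparison can be written as a small decision function. This sketch takes an already computed title similarity as input and uses the example values 0.75 and 16 as constants; the function name and the use of strict "greater than" comparisons follow the wording of the example and are otherwise assumptions.

```python
TITLE_SIMILARITY_THRESHOLD = 0.75  # example value from the text
SHARED_KEYWORD_THRESHOLD = 16      # example value from the text


def is_duplicate(title_similarity, keywords_a, keywords_b):
    """Cheap title check first; fall back to counting shared keywords."""
    if title_similarity > TITLE_SIMILARITY_THRESHOLD:
        return True
    shared = len(set(keywords_a) & set(keywords_b))
    return shared > SHARED_KEYWORD_THRESHOLD


# Hypothetical keyword sets of 20 words each.
base = {f"w{i}" for i in range(20)}
overlap_17 = {f"w{i}" for i in range(3, 23)}
overlap_16 = {f"w{i}" for i in range(4, 24)}
print(is_duplicate(0.5, base, overlap_17))  # 17 shared keywords
print(is_duplicate(0.5, base, overlap_16))  # 16 shared keywords
```

Because the title test short-circuits, the keyword comparison only runs for pairs whose titles are not already similar enough.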

In the course of implementing the invention, the inventor found that building buckets from the keyword sets effectively reduces the amount of computation. Without buckets, the algorithm complexity is $O(n^2)$, where n is the total number of sample articles. With buckets built from the keywords, the complexity is $O(k \cdot m^2)$, where k is the total number of sample keywords and m is the average number of articles in each keyword bucket, with $k \ll n$ and $m \ll n$. When n is one hundred million (i.e. there are one hundred million sample articles), the corresponding k is only in the tens of thousands, so the complexity with buckets is lower. Moreover, because the primary key of each bucket is a unique keyword in the keyword set, articles in the same bucket may be similar, whereas articles that share no bucket have no keyword in common and can be excluded directly, which further reduces the amount of computation. In addition, as soon as the number of keywords shared by two articles exceeds the preset number threshold, the articles can be judged similar, the calculation stopped and deduplication performed. Bucketing finds similar articles sooner and lets the calculation stop early; that is, an algorithm that uses buckets tends towards its best-case complexity rather than the worst-case complexity $O(k \cdot m^2)$.
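The effect of bucketing on the number of comparisons can be illustrated with a simple in-memory inverted index. This is a sketch only: the original system is described as distributed, and the names `build_buckets` and `candidate_pairs` as well as the toy data are assumptions.

```python
from collections import defaultdict
from itertools import combinations


def build_buckets(keyword_sets):
    """Map each keyword to the ids of the articles whose keyword set contains it."""
    buckets = defaultdict(list)
    for article_id, keywords in keyword_sets.items():
        for keyword in keywords:
            buckets[keyword].append(article_id)
    return buckets


def candidate_pairs(buckets):
    """Only articles that share at least one bucket are ever compared."""
    pairs = set()
    for article_ids in buckets.values():
        for a, b in combinations(sorted(article_ids), 2):
            pairs.add((a, b))
    return pairs


# Hypothetical keyword sets: article 3 shares no keyword with the others.
keyword_sets = {1: {"election", "vote"}, 2: {"vote", "turnout"}, 3: {"football"}}
print(candidate_pairs(build_buckets(keyword_sets)))
```

Of the three possible article pairs, only the pair (1, 2) is compared, because article 3 falls in no bucket shared with another article.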

In the course of implementing the invention, the inventor also found that, when the Jaccard similarity of the keywords is calculated, a data structure that trades space for time can speed up the calculation. First, an index with 65536 positions is constructed (under the Chinese character encoding rules, 65536 positions are enough to represent every Chinese character). For each word in the keyword set of each article, the first character of the word is used as the number of the index position, and the remaining characters are used as an attribute value at that position. Each parent index can have several child indexes, and each child index carries an attribute value indicating which articles the word belongs to. (This attribute value is a binary number: if there are M articles, the binary number has M bits, and the bit corresponding to an article is 1 when the word belongs to that article.) When the keyword similarity of two articles is needed, the articles do not have to be compared pair by pair; it suffices to look for repeated words in the keyword data structure of the same bucket. An array with M elements is used (each element corresponds to one article, so M articles correspond to an M-element array), with every element initialised to 0. All child indexes under the same parent index are compared; if a matching child index is found, the binary number recording the articles it belongs to is read, and 1 is added to the array element of each corresponding article. Each element of the array is then checked against 16 (the preset number threshold mentioned above). When an element exceeds 16, the article corresponding to that element is a similar article, so the calculation can stop and deduplication can be performed. Previously, two groups of 20 words had to be compared one by one, so for M articles in a bucket the worst case was M × 400 comparisons. With the improvement, only 20 fast lookups are needed to compare the child indexes under 20 parent indexes, so the worst case is only 20 × M lookups; in practice the number of child indexes under each parent index is far smaller than M, so the amount of computation falls severalfold and the algorithm finds duplicate articles sooner.
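One way to picture this index is the simplified sketch below. It uses a Python dictionary keyed by the first character instead of a fixed 65536-slot table, and an integer as the per-word article bitmask; these choices, the function names and the toy threshold are assumptions made for illustration and not the inventor's exact structure.

```python
from collections import defaultdict


def build_index(keyword_sets):
    """Parent key: first character of a word; child key: rest of the word;
    value: bitmask whose bit i is set when article i contains the word."""
    index = defaultdict(dict)
    for position, keywords in enumerate(keyword_sets):
        for word in keywords:  # words are assumed to be non-empty
            children = index[word[0]]
            children[word[1:]] = children.get(word[1:], 0) | (1 << position)
    return index


def find_similar(index, keyword_sets, query, threshold):
    """Return the position of the first article sharing more than `threshold`
    keywords with article `query`, or None. Stops as soon as one is found."""
    counts = [0] * len(keyword_sets)  # one counter per article
    for word in keyword_sets[query]:
        mask = index.get(word[0], {}).get(word[1:], 0)
        position = 0
        while mask:
            if mask & 1 and position != query:
                counts[position] += 1
                if counts[position] > threshold:
                    return position
            mask >>= 1
            position += 1
    return None


# Hypothetical bucket of three articles with tiny keyword sets.
keyword_sets = [{"apple", "bank", "cat"}, {"apple", "bank", "dog"}, {"zebra"}]
index = build_index(keyword_sets)
print(find_similar(index, keyword_sets, query=0, threshold=1))
```

Each keyword of the query article costs one lookup, and the counter array lets the search stop at the first article whose shared-keyword count passes the threshold.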

After deduplication is complete, the information units contained in each sample in the sample set are extracted. Specifically, in this embodiment the article content can be segmented at punctuation marks and line breaks to obtain the information units in the sample. For example, the text "long-press the QR code to 'identify' and follow, more surprises await you" is split into two information units: "long-press the QR code to 'identify' and follow" and "more surprises await you". In other embodiments, other rules may be used to segment the article content into information units; the present invention does not specifically limit this, and those skilled in the art can set the rule flexibly.
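Segmentation at punctuation marks and line breaks can be sketched with a regular expression. The exact delimiter set below is an assumption; quotation marks are deliberately not treated as delimiters so that a quoted word stays inside its information unit.

```python
import re

# Assumed delimiters: Western and Chinese sentence punctuation plus line breaks.
_DELIMITERS = re.compile(r"[,.;:!?，。；：！？、\r\n]+")


def extract_information_units(text):
    """Split article text into non-empty, whitespace-trimmed information units."""
    return [part.strip() for part in _DELIMITERS.split(text) if part.strip()]


text = "long-press the QR code to 'identify' and follow, more surprises await you"
print(extract_information_units(text))
```

The same function handles Chinese text, for instance splitting at the full-width comma in "长按二维码识别关注，更多惊喜等着你".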

Step S220: counting the number of occurrences of each information unit in the sample set, and determining the information units whose number of occurrences is greater than a preset first threshold as candidate feature units.

In the course of implementing the present invention, the inventor found from an analysis of historical data that the advertisement information or spam promotion information contained in the articles a given news publisher publishes over a period of time is essentially the same, so the information units associated with it are necessarily repeated with high frequency. The number of repetitions that separates information units associated with advertisement information or spam promotion information from ordinary information units can be obtained through extensive statistical analysis; this critical value is the preset first threshold described above. All information units are screened against the preset first threshold, and those whose number of occurrences is greater than it are determined to be candidate feature units.

However, because the preset first threshold is an empirical threshold, it cannot filter out all normally repeated content. The position distributions of normally repeated news phrases and of advertisement phrases are therefore considered next: an L0 norm constraint is applied to filter out normal content more accurately and to obtain accurate weights for the advertisement phrases, their position distributions, their repetition counts and so on, and these data are finally used to construct a news promotion information identification model.

Step S230: for each candidate feature unit, counting the distribution of the candidate feature unit over the document positions, and determining from the statistical result whether the candidate feature unit is a promotion feature unit.

The screening in step S220 roughly filters out most information units belonging to normal content. Among the remaining information units (i.e. the candidate feature units), however, there may be time-related information units from normal news in addition to the information units associated with advertisement information or spam promotion information. Through statistical analysis, the inventor found that time-related candidate feature units are not deliberately repeated, so their positions in the documents are relatively evenly distributed (as shown in fig. 5), whereas candidate feature units associated with advertisement information or spam promotion information are deliberately repeated, so their positions in the documents are more concentrated (as shown in fig. 6). The promotion feature units can therefore be screened out effectively by counting the distribution of each candidate feature unit over the document positions.

Specifically, further screening can be performed with an L0 norm constraint on the distribution. First, the document content is divided into several document positions according to a preset position division rule; the preset position division rules include rules based on paragraph granularity and rules based on sentence granularity. Next, a vector is set up to represent the distribution of the candidate feature unit over the document positions, each element of the vector corresponding to one document position. If the distribution quantity of the candidate feature unit at a given document position is greater than a preset distribution threshold, the element corresponding to that position is non-zero; otherwise the element is zero. The distribution quantity of the candidate feature unit at a document position comprises the number of occurrences and/or the probability of occurrence of the candidate feature unit at that position. Finally, when the number of non-zero elements in the vector is greater than a preset element threshold, the candidate feature unit is determined to be a promotion feature unit.
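The distribution vector and its L0 norm might be computed as in the sketch below. It uses occurrence counts as the distribution quantity; the function names, the position labels and the toy threshold are assumptions made for illustration.

```python
from collections import Counter


def distribution_vector(observed_positions, document_positions, distribution_threshold):
    """One element per document position: the occurrence count if it exceeds
    the distribution threshold, otherwise 0."""
    counts = Counter(observed_positions)
    return [
        counts[position] if counts[position] > distribution_threshold else 0
        for position in document_positions
    ]


def l0_norm(vector):
    """Number of non-zero elements in the vector."""
    return sum(1 for value in vector if value != 0)


# Hypothetical data: positions at which one candidate feature unit was observed.
observed = [1, 1, 1, 2, -1, -1, -1, -1]
vector = distribution_vector(observed, document_positions=[1, 2, -1], distribution_threshold=2)
print(vector, l0_norm(vector))
```

Positions whose count does not exceed the distribution threshold contribute nothing to the L0 norm, so sparse, evenly spread occurrences produce a small norm.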

In the course of implementing the invention, the inventor considered four position division rules: paragraph-granularity distribution percentage, sentence-granularity distribution percentage, paragraph-granularity forward and reverse ordering, and sentence-granularity forward and reverse ordering. Through a large number of experiments, the inventor found that: in articles such as official-account articles, the promotion information is mainly concentrated at the beginning or the end of the article; the total number of paragraphs or sentences varies widely between articles, so content that appears in the first paragraph or the last few paragraphs can still differ greatly in percentage position; identical tail promotion information with similar content occupies almost the same number of paragraphs; and tail promotion information often uses very short paragraphs. Under these conditions the paragraph-granularity forward and reverse ordering rule works best.

In practice, information publishers usually place promotion information either prominently at the beginning of the article (i.e. the first few sentences of the first paragraph) or grouped together at the end. If two articles edited and typeset in the same way both carry their promotion information at the head of the article (for example in the first paragraph), the position of the candidate feature units can be counted in forward order: a unit in the first paragraph is marked +1. When the editing habit is to place the promotion information at the tail of the article (for example in the last paragraph), forward counting produces large statistical differences because articles have different numbers of paragraphs: in an article with 20 paragraphs the last paragraph would be marked +20, and in an article with 30 paragraphs it would be marked +30. Reverse counting is needed in this case: however many paragraphs the article has, the last paragraph is marked -1, so the distribution statistics show no large deviation. Forward and reverse ordering together therefore reflect the position distribution more accurately. The inventor also found that forward and reverse ordering at paragraph granularity is the most accurate, because typesetting is done mainly by paragraph, so paragraphs better reflect the editor's layout intent, whereas sentence boundaries reflect only the author's writing habits or writing level. In this embodiment, accuracy can therefore be further improved by adopting the paragraph-granularity forward and reverse ordering rule.
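Forward and reverse ordering can be sketched as a function that labels each paragraph with a signed position. The text does not state where forward counting gives way to reverse counting, so labelling a paragraph from whichever end is nearer is an assumption made here, as is the function name.

```python
def signed_position(index, total):
    """Signed paragraph label for a 0-based paragraph `index` in an article of
    `total` paragraphs: +1, +2, ... from the head, -1, -2, ... from the tail.
    Assumption: the label is taken from whichever end is nearer."""
    if not 0 <= index < total:
        raise ValueError("index out of range")
    forward = index + 1       # +1 for the first paragraph
    backward = index - total  # -1 for the last paragraph
    return forward if forward <= -backward else backward


print(signed_position(0, 20))   # first paragraph of a 20-paragraph article
print(signed_position(19, 20))  # last paragraph of a 20-paragraph article
print(signed_position(29, 30))  # last paragraph of a 30-paragraph article
```

The last paragraph receives the same label in the 20-paragraph and the 30-paragraph article, which is the property the reverse ordering is meant to provide.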

The preset distribution threshold and the preset element threshold are determined through a large number of tests. Specifically, different values of the distribution threshold and the element threshold are tried, the separation between the candidate feature units corresponding to normal content and those corresponding to advertisement information is compared for each value, and the values giving the best separation are taken as the preset distribution threshold and the preset element threshold. In implementing the present invention, the inventor found through a large number of experiments that the separation is best when the preset first threshold in step S220 is 20, the preset distribution threshold is 10, and the preset element threshold is 3. In that case, when the number of occurrences of a candidate feature unit at a position in the article exceeds 10, the value of the vector element corresponding to that position is non-zero; otherwise it is 0. Denoting the number of occurrences at a position by x and the corresponding element value by y, the mapping is y = 0 for x ≤ 10 and y = x for x > 10. The L0 norm n of each candidate feature unit is then the number of elements among y0, y1, …, yi that are not 0. When n ≥ 3 (that is, with the element threshold set to 3), the candidate feature unit is judged to be a promotion feature unit.
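The thresholding and L0 norm count described above can be sketched as follows. The function names and the representation of positions as signed paragraph labels are assumptions of this sketch. The values 10 and 3 are the ones reported in the text, and treating "at least 3 non-zero elements" as the decision boundary is likewise an assumption about how the element threshold is applied.

```python
from collections import Counter

DISTRIBUTION_THRESHOLD = 10  # value reported in the text
ELEMENT_THRESHOLD = 3        # value reported in the text


def position_vector(positions, distribution_threshold=DISTRIBUTION_THRESHOLD):
    """Build the position vector for one candidate feature unit.

    `positions` holds one position label per occurrence of the unit in the
    sample set. An element keeps its count x when x exceeds the distribution
    threshold and is set to 0 otherwise.
    """
    counts = Counter(positions)
    return {pos: (x if x > distribution_threshold else 0)
            for pos, x in counts.items()}


def l0_norm(vector):
    """Number of non-zero elements in the position vector."""
    return sum(1 for y in vector.values() if y != 0)


def is_promotion_feature_unit(positions,
                              distribution_threshold=DISTRIBUTION_THRESHOLD,
                              element_threshold=ELEMENT_THRESHOLD):
    """Apply the L0 norm test (boundary handling is an assumption)."""
    vector = position_vector(positions, distribution_threshold)
    return l0_norm(vector) >= element_threshold
```

A unit that occurs more than 10 times at each of three or more positions passes the test, while a unit whose occurrences are spread thinly over many positions yields an all-zero vector and is rejected.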

Step S240: detecting the promotion information contained in the document according to the determined promotion feature units.

Specifically, a corresponding document detection model is set according to the determined promotion feature units and their distribution over the document positions, and the promotion information contained in the document is detected according to the document detection model.

The step of setting the corresponding document detection model according to the determined promotion feature units and their distribution at each document position specifically includes: setting the model parameters contained in the document detection model, and the weight value corresponding to each model parameter, according to the determined promotion feature units, the occurrence probability of each promotion feature unit at each document position, and preset position weights. The occurrence probability is calculated as p = k/n, where n is the total number of occurrences of the promotion feature unit in the document and k is the number of its occurrences at the given position. Because advertisement information or spam promotion information often appears at specific positions in an article, different position weights need to be assigned to the different positions at which a promotion feature unit appears in the document. The specific position weights need to be determined through a large number of experiments, and the weight of a position where advertisement information or spam promotion information often appears should be higher than the weights of other positions in the document, which reduces the probability of mistakenly deleting normal content.
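A minimal sketch of how the occurrence probability p = k/n and the position weights might be combined into model parameters is given below. The dictionary layout, the function names and the default weight of 1.0 are assumptions made for illustration; the text only states that the weights are determined experimentally.

```python
from collections import Counter


def occurrence_probabilities(positions):
    """Return p = k / n for every position at which a unit occurs.

    `positions` holds one position label per occurrence of the unit,
    n is the total number of occurrences and k the count at one position.
    """
    counts = Counter(positions)
    n = sum(counts.values())
    return {pos: k / n for pos, k in counts.items()}


def build_model(unit_positions, position_weights, default_weight=1.0):
    """Map each promotion feature unit to a weight value per position.

    `position_weights` and `default_weight` are hypothetical inputs that
    stand in for the experimentally determined position weights.
    """
    model = {}
    for unit, positions in unit_positions.items():
        probabilities = occurrence_probabilities(positions)
        model[unit] = {pos: p * position_weights.get(pos, default_weight)
                       for pos, p in probabilities.items()}
    return model
```

For example, a unit seen three times in the last paragraph (label -1) and once in the first paragraph (label 1) gets probabilities 0.75 and 0.25, and a higher weight on position -1 raises its weight value there.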

The step of detecting the promotion information contained in the document according to the document detection model specifically includes: searching, among all the information units contained in the document to be detected, for information units that match the model parameters contained in the document detection model; determining a score for each information unit found, according to its document position in the document to be detected and/or the weight value of the model parameter it matches; and determining from the score whether the information unit is promotion information. The score is calculated by multiplying the occurrence probability of the information unit at the corresponding document position by the preset position weight. Because the position weight of a position where advertisement information or spam promotion information often appears is higher, an information unit with a higher score is more likely to be promotion information.
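The matching and scoring step can be sketched as follows. The score is the occurrence probability multiplied by the position weight, as stated above; the `score_threshold` cut-off, the default weight and the data layout are assumptions of this sketch, because the text does not say how the final decision is taken from the score.

```python
def detect_promotion_units(document_units, probabilities, position_weights,
                           default_weight=1.0, score_threshold=0.5):
    """Return the (text, position, score) triples judged to be promotion.

    document_units: list of (text, position) pairs from the document.
    probabilities:  promotion feature unit -> {position: occurrence probability}.
    score_threshold is a hypothetical cut-off, not a value from the text.
    """
    flagged = []
    for text, position in document_units:
        unit_probabilities = probabilities.get(text)
        if unit_probabilities is None:
            continue  # the unit matches no model parameter
        probability = unit_probabilities.get(position, 0.0)
        weight = position_weights.get(position, default_weight)
        score = probability * weight
        if score >= score_threshold:
            flagged.append((text, position, score))
    return flagged
```

A unit that matches a model parameter but sits at a position where it was never observed scores 0 and is kept, which is one way of reducing the risk of deleting normal content.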

Step S250: pruning the document according to the detected document position of the promotion information.

When the detected document position of the promotion information belongs to the head of the document, the promotion information and the paragraph contents before it are deleted; when the position belongs to the tail of the document, the promotion information and the paragraph contents after it are deleted; and when the position belongs to the middle of the document, the sentence containing the promotion information is deleted. Through these deletions, the advertisement information or spam promotion information contained in machine-captured news content can be effectively removed, so that pure news content is obtained and the assembly of news from self-media platforms is facilitated.
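The three deletion rules can be sketched as follows. How many paragraphs count as the head or the tail is not stated in the text, so `head_size` and `tail_size` are hypothetical parameters, and for very short documents the sketch simply checks the head rule first.

```python
def prune_document(paragraphs, hit_index, hit_sentence,
                   head_size=2, tail_size=2):
    """Remove detected promotion information from a list of paragraphs.

    hit_index:    0-based index of the paragraph containing the promotion.
    hit_sentence: the promotional sentence inside that paragraph.
    head_size and tail_size are assumptions of this sketch.
    """
    total = len(paragraphs)
    if hit_index < head_size:
        # Head: drop the promotion paragraph and everything before it.
        return paragraphs[hit_index + 1:]
    if hit_index >= total - tail_size:
        # Tail: drop the promotion paragraph and everything after it.
        return paragraphs[:hit_index]
    # Middle: drop only the sentence that carries the promotion.
    cleaned = paragraphs[hit_index].replace(hit_sentence, "").strip()
    middle = [cleaned] if cleaned else []
    return paragraphs[:hit_index] + middle + paragraphs[hit_index + 1:]
```

The function returns a new list and leaves the input untouched, so the original document remains available for comparison.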

Step S260: updating the document detection model according to the promotion information detected in the document.

The document detection model may include a deep learning model; in particular, a convolutional neural network model may be adopted. In a specific application, each actual detection result of the promotion information can be fed back to the convolutional neural network model, so that the document detection model is continuously updated and both the accuracy and the efficiency of identifying promotion information keep improving.

Therefore, in the method for detecting promotion information provided by this embodiment, deduplicating the sample data reduces the amount of computation. The information units in the preset sample set are then extracted, and the candidate feature units among them are determined according to their occurrence frequency in the sample set. Next, the promotion feature units among the candidate feature units are determined with an L0 norm constraint algorithm according to the distribution of each candidate feature unit over the document positions. Finally, a document detection model is established from the selected promotion feature units and used to detect the promotion information contained in the target document. With the detected promotion information, the target document captured by the machine can be pruned to obtain pure news content, which facilitates the assembly of news from self-media platforms. When the document detection model is a deep learning model, each actual detection result can be fed back to the model, so that the model keeps learning and updating, adapts to changes in promotion information, and detects it more accurately.

Example three

Fig. 3 shows a device for detecting promotional information provided by the present invention, the device comprising: an information unit extraction module 310, a candidate unit determination module 320, a promotional unit determination module 330, and a detection module 340.

The information unit extracting module 310 is configured to obtain a preset sample set, and extract information units included in each sample in the sample set.

In order to facilitate the detection apparatus to identify the sample news content, the information unit extraction module 310 first needs to segment the preset sample news content containing the advertisement information or the spam information according to a certain rule, and extract the information units contained in each sample. The preset sample set refers to self-media news content which contains advertisement information or spam information and has certain representativeness, and the sample set is generally selected and set by a person skilled in the art according to experience. The information units are basic units forming the sample news content, and the form of the information units can be feature phrases generated after the sample news content is segmented, or words with certain features. The present invention is not limited to the specific setting rule of the preset sample set and the specific form of the information unit, and those skilled in the art can flexibly set the setting rule according to the actual situation.

The candidate unit determining module 320 is configured to count the occurrence frequency of each information unit in the sample set and to determine the information units whose occurrence frequency is greater than a preset first threshold as candidate feature units.

Because advertisement information and spam information are deliberately repeated by each news publisher on a self-media platform, different news items from the same publisher generally contain the same advertisement information or spam information. The candidate unit determining module 320 counts the occurrence frequency, in the sample set, of the information units extracted by the information unit extracting module 310. When the occurrence frequency of an information unit exceeds a preset first threshold, the information unit is strongly suspected of belonging to advertisement information or spam information, and it is therefore determined as a candidate feature unit.
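The frequency screening performed by the candidate unit determining module can be sketched as follows. The function name and the input layout (one list of information units per sample) are assumptions of this sketch, as is counting every occurrence rather than one per sample; the default of 20 is the first threshold reported in the method embodiment.

```python
from collections import Counter

FIRST_THRESHOLD = 20  # value reported in the method embodiment


def candidate_feature_units(samples, first_threshold=FIRST_THRESHOLD):
    """Return the information units that occur more often than the threshold.

    `samples` is an iterable of lists, one list of information units per
    sample in the sample set.
    """
    frequency = Counter(unit for units in samples for unit in units)
    return {unit for unit, count in frequency.items()
            if count > first_threshold}
```

Units that appear once per article, such as the body sentences of a story, stay far below the threshold, while a repeated call to action such as "follow us" crosses it once enough samples from the same publisher are collected.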

The promotion unit determining module 330 is configured to count, for each candidate feature unit, the distribution of that candidate feature unit over the document positions, and to determine from the statistical result whether the candidate feature unit is a promotion feature unit.

After the preliminary screening by the candidate unit determining module 320, most of the information units containing the advertisement information or the spam information are determined as candidate feature units, but some of the information units containing normal news content with repetition times exceeding the first threshold are also determined as candidate feature units.

Through a large number of experiments and repeated comparison, the inventor of the invention finds that the candidate feature units containing normal news content are the contents which are not repeated by the news publisher intentionally, so that the position distribution condition in the sample is generally uniform; and the candidate feature units containing the advertisement information or the spam information belong to the content which is intentionally repeated by the news publisher, so the position distribution condition in the sample is concentrated. According to the finding, the popularization unit determining module 330 further screens the candidate feature units by using the position distribution of the candidate feature units in the sample, and determines the candidate feature units with more concentrated position distribution as popularization feature units.

The detecting module 340 is configured to detect the promotion information contained in the document according to the determined promotion feature units.

Processing by the promotion unit determining module 330 yields the promotion feature units extracted from the preset sample set. The detection module 340 then uses these promotion feature units to identify the documents to be detected that were obtained by the machine capture method, so that the promotion information contained in those documents is effectively screened out. Finally, the screened promotion information is removed from the documents to be detected, and relatively pure news content is obtained.

The specific structure and operation principle of each module described above may refer to the description of the corresponding part in the method embodiment, and are not described herein again.

Therefore, the device for detecting promotion information provided by the invention extracts the information units in the preset sample set, determines the candidate feature units among them according to their occurrence frequency in the sample set, determines the promotion feature units among the candidate feature units according to their distribution over the document positions, and finally detects the promotion information contained in the target document according to the selected promotion feature units. Advertisement information or spam promotion information can thus be filtered effectively and accurately when news is extracted from a self-media platform by the machine capture method, so that pure news content can be extracted and the efficiency of assembling news from the self-media platform is greatly improved.

Example four

Fig. 4 shows a device for detecting promotional information provided by the present invention, the device comprising: the information unit extraction module 410, the candidate unit determination module 420, the promotion unit determination module 430, the detection module 440, the update module 450, and the pruning module 460, wherein the promotion unit determination module 430 further includes a vector sub-module 431, a determination sub-module 432, and a document division sub-module 433.

The information unit extracting module 410 is configured to obtain a preset sample set, and extract information units included in each sample in the sample set.

To make it easier for the detection device to identify the sample news content, the preset sample news content containing advertisement information or spam information first needs to be segmented according to a certain rule, and the information units contained in each sample are extracted from it. Because the same news item is often repeated many times, performing deduplication before the preset sample set is obtained effectively reduces the amount of computation and improves efficiency. The information unit extraction module 410 therefore deduplicates a plurality of candidate samples and obtains the sample set from the deduplicated candidate samples.

Specifically, the information unit extracting module 410 calculates the similarity between the titles of the candidate samples and deduplicates the candidate samples whose title similarity is greater than a preset similarity threshold. For candidate samples whose title similarity is not greater than the preset similarity threshold, it queries the keyword set corresponding to each candidate sample; if the number of identical keywords in the keyword sets of two candidate samples is greater than a preset quantity threshold, the two candidate samples are deduplicated. Preferably, the similarity between the titles is calculated with a longest common subsequence algorithm, the keyword set of each candidate sample is determined according to the inverse document frequency (IDF) of each word obtained after word segmentation of the candidate sample, and the quantity threshold is determined according to a Jaccard similarity algorithm.
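The two-stage duplicate check can be sketched as follows. Several details are assumptions of this sketch rather than statements of the text: the title similarity is normalised by the longer title, the keyword sets are taken as given (the IDF-based selection is omitted), and the quantity threshold is derived from a hypothetical Jaccard threshold t as t times the size of the union of the two keyword sets.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of two strings."""
    previous = [0] * (len(b) + 1)
    for char_a in a:
        current = [0]
        for j, char_b in enumerate(b, start=1):
            if char_a == char_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def title_similarity(a: str, b: str) -> float:
    """LCS length divided by the longer title length (an assumption)."""
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 1.0


def is_duplicate(title_a, keywords_a, title_b, keywords_b,
                 similarity_threshold=0.8, jaccard_threshold=0.6):
    """Stage 1 compares titles; stage 2 compares keyword sets.

    Both thresholds are hypothetical values chosen for illustration.
    """
    if title_similarity(title_a, title_b) > similarity_threshold:
        return True
    shared = len(keywords_a & keywords_b)
    quantity_threshold = jaccard_threshold * len(keywords_a | keywords_b)
    return shared > quantity_threshold
```

Checking titles first keeps the cheaper comparison in front, and the keyword comparison only runs for pairs whose titles differ enough to pass the first stage.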

After the deduplication processing is completed, the information unit extraction module 410 extracts the information units included in each sample in the sample set. Specifically, in this embodiment, the article content may be segmented at punctuation marks and line breaks to obtain the information units in the sample. For example, "press two-dimensional code 'identify' focus, more surprises, etc" can be split into two information units, "press two-dimensional code 'identify' focus" and "more surprises, etc". In other embodiments, other rules may also be used to segment the article content and extract the information units; the present invention does not specifically limit this, and those skilled in the art can set the rule flexibly.
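Segmentation at punctuation marks and line breaks can be sketched as follows. The exact delimiter set is an assumption of this sketch; quotation marks are deliberately not treated as delimiters, since the example above keeps a quoted word inside one information unit. The sample sentence in the usage line is hypothetical.

```python
import re

# Western and Chinese sentence punctuation plus line breaks (assumed set).
_DELIMITERS = re.compile(r"[,.;:!?，。；：！？\r\n]+")


def extract_information_units(text: str) -> list:
    """Split article text into information units and drop empty pieces."""
    pieces = _DELIMITERS.split(text)
    return [piece.strip() for piece in pieces if piece.strip()]


units = extract_information_units(
    "Long press the QR code to follow, more surprises await you")
print(units)
```

Consecutive delimiters, such as a full stop followed by a blank line, are consumed together, so they do not produce empty information units.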

The candidate unit determining module 420 is configured to count the occurrence frequency of each information unit in the sample set and to determine the information units whose occurrence frequency is greater than a preset first threshold as candidate feature units.

In the process of implementing the present invention, the inventor found by analyzing historical data that the advertisement information or spam information contained in the articles published by a given news publisher over a period of time is basically the same, so the information units associated with that advertisement information or spam information are necessarily repeated at high frequency. The critical number of repetitions that distinguishes information units associated with advertisement information or spam information from general information units, which is the preset first threshold mentioned above, can be obtained through a large amount of statistical analysis. The candidate unit determining module 420 filters all information units according to the preset first threshold and determines the information units whose occurrence frequency is greater than the preset first threshold as candidate feature units.

The promotion unit determining module 430 is configured to count, for each candidate feature unit, the distribution of that candidate feature unit over the document positions, and to determine from the statistical result whether the candidate feature unit is a promotion feature unit.

Most of the information units of the normal content may be roughly filtered out through the filtering of the candidate unit determination module 420, but among the remaining information units (i.e., candidate feature units), there may be information units associated with time included in the normal news in addition to the information units associated with the advertisement information or the spam information. The inventor finds out through statistical analysis that, in the candidate feature units, the candidate feature units associated with time are relatively uniform in position distribution in the document because the candidate feature units are not content which is intentionally repeated by people (as shown in fig. 5); and the candidate feature units associated with the advertisement information or the spam information are artificially and deliberately repeated contents, so that the position distribution in the document is more concentrated (as shown in fig. 6). Therefore, the popularization feature unit can be further effectively screened out by counting the distribution situation of the candidate feature unit at each document position.

Specifically, the promotion unit determination module 430 includes a vector submodule 431, a determination submodule 432 and a document division submodule 433. The vector submodule 431 is configured to set a vector representing the distribution of the candidate feature unit over the document positions, where each element of the vector corresponds to one document position: if the distribution quantity of the candidate feature unit at a specified document position is greater than a preset distribution threshold, the element corresponding to that position is non-zero; otherwise, the element is zero. The determination submodule 432 is configured to determine that the candidate feature unit is a promotion feature unit when the number of non-zero elements in the vector is greater than a preset element threshold. The document division submodule 433 is configured to divide the document content into a plurality of document positions according to a preset position division rule, where the preset position division rule includes a paragraph-granularity division rule and a sentence-granularity division rule. The distribution quantity of the candidate feature unit at the specified document position includes the number of occurrences and/or the occurrence probability of the candidate feature unit at that position.

The detecting module 440 is configured to detect the promotion information contained in the document according to the determined promotion feature units.

Specifically, the detection module 440 needs to set a corresponding document detection model according to the determined popularization feature units and the distribution conditions of the popularization feature units at each document position, and detect the popularization information included in the document according to the document detection model. Further, the detection module 440 needs to set model parameters included in the document detection model and weight values corresponding to the model parameters according to the determined popularization feature units, the occurrence probabilities of the popularization feature units at the document positions, and preset position weights; then searching information units matched with model parameters contained in the document detection model from all information units contained in the document to be detected; and determining the score of each searched information unit according to the document position of the information unit in the document to be detected and/or the weight value of the model parameter matched with the information unit, and determining whether the information unit is the promotion information or not according to the score.

The present invention may include an update module 450 for updating the document detection model based on the promotion information contained in the detected document. The document detection model comprises a deep learning model, especially a convolutional neural network model in the deep learning model can be adopted, and in specific application, the updating module 450 can also feed back the convolutional neural network model according to the actual detection result of the popularization information every time, so that the document detection model is continuously updated, the identification accuracy can be continuously improved, and the identification efficiency of the popularization information is improved.

The present invention may further include a pruning module 460 for pruning the document according to the detected document position of the promotion information. When the detected document position of the promotion information belongs to the head of the document, the promotion information and the paragraph contents before it are deleted; when the position belongs to the tail of the document, the promotion information and the paragraph contents after it are deleted; and when the position belongs to the middle of the document, the sentence containing the promotion information is deleted. With the pruning module 460, the advertisement information or spam information contained in machine-captured news content can be effectively removed, so that pure news content is obtained and the assembly of news from self-media platforms is facilitated.

Therefore, in the device for detecting promotion information provided by this embodiment, deduplicating the sample data reduces the amount of computation. The information units in the preset sample set are then extracted, and the candidate feature units among them are determined according to their occurrence frequency in the sample set. Next, the promotion feature units among the candidate feature units are determined with an L0 norm constraint algorithm according to the distribution of each candidate feature unit over the document positions. Finally, a document detection model is established from the selected promotion feature units and used to detect the promotion information contained in the target document. With the detected promotion information, the target document captured by the machine can be pruned to obtain pure news content, which facilitates the assembly of news from self-media platforms. When the document detection model is a deep learning model, each actual detection result can be fed back to the model, so that the model keeps learning and updating, adapts to changes in promotion information, and detects it more accurately.

The algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose systems may also be used with the teachings herein. The required structure for constructing such a system will be apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.

In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.

Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that: that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.

Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.

Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.

The various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. It will be appreciated by those skilled in the art that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the apparatus for detecting promotional information according to embodiments of the present invention. The present invention may also be embodied as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present invention may be stored on computer-readable media or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.

It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any ordering. These words may be interpreted as names.

Claims

1. A method for detecting promotion information comprises the following steps:

acquiring a preset sample set, and extracting information units contained in each sample in the sample set;

counting the occurrence frequency of each information unit in the sample set, and determining the information units with the occurrence frequency larger than a preset first threshold value as candidate feature units;

respectively counting the distribution condition of the candidate feature unit at each document position aiming at each candidate feature unit, and determining whether the candidate feature unit is a popularization feature unit according to the counting result;

detecting promotion information contained in the document according to the determined promotion feature unit;

wherein the step of, for each candidate feature unit, counting the distribution of the candidate feature unit at each document position and determining whether the candidate feature unit is a promotion feature unit according to the counting result specifically comprises:

setting a vector for representing the distribution of the candidate feature unit at each document position, wherein each element in the vector corresponds to a respective document position;

if the distribution quantity of the candidate feature unit at a specified document position is greater than a preset distribution threshold, the element value of the element corresponding to the specified document position is non-zero; if the distribution quantity of the candidate feature unit at the specified document position is not greater than the preset distribution threshold, the element value of the element corresponding to the specified document position is zero;

and when the number of non-zero elements in the vector is greater than a preset element threshold, determining the candidate feature unit to be a promotion feature unit.
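
By way of non-limiting illustration, the steps of claim 1 might be sketched as follows. The function name `select_promotion_units`, the representation of a sample as a list of document positions (each a list of information units), and the choice of 1 as the non-zero element value are assumptions made for this sketch and are not specified by the claim.

```python
from collections import Counter


def select_promotion_units(samples, first_threshold,
                           distribution_threshold, element_threshold):
    """Return the information units judged to be promotion feature units.

    samples: list of samples; each sample is a list of document positions;
             each position is a list of information units (assumed layout).
    """
    # Count the occurrence frequency of each information unit in the sample set.
    frequency = Counter(
        unit for sample in samples for position in sample for unit in position
    )
    # Units whose frequency exceeds the first threshold become candidates.
    candidates = [u for u, n in frequency.items() if n > first_threshold]

    num_positions = max((len(sample) for sample in samples), default=0)
    promotion_units = []
    for unit in candidates:
        # Distribution quantity: occurrences of the unit at each position.
        counts = [0] * num_positions
        for sample in samples:
            for index, position in enumerate(sample):
                counts[index] += position.count(unit)
        # One vector element per position: non-zero only above the threshold.
        vector = [1 if c > distribution_threshold else 0 for c in counts]
        non_zero = sum(1 for v in vector if v != 0)
        if non_zero > element_threshold:
            promotion_units.append(unit)
    return promotion_units


samples = [
    [["follow", "us"], ["news", "follow"], ["follow"]],
    [["follow"], ["story", "follow"], ["follow", "us"]],
    [["follow"], ["news"], ["us"]],
]
print(select_promotion_units(samples, 2, 1, 1))  # ['follow']
```

In this toy sample set, "us" passes the frequency threshold but exceeds the distribution threshold at only one position, so it is not selected.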

2. The method according to claim 1, wherein the step of setting a vector for representing the distribution of the candidate feature unit at each document position further comprises: dividing the document content into a plurality of document positions according to a preset position division rule; wherein the preset position division rule comprises: a paragraph-granularity division rule and a sentence-granularity division rule;

and the distribution quantity of the candidate feature unit at the specified document position comprises: the number of occurrences, and/or the probability of occurrence, of the candidate feature unit at the specified document position.
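
A minimal sketch of the position division and distribution quantity of claim 2 is given below for illustration only. The helper names `split_positions` and `position_distribution`, the use of blank lines as paragraph boundaries, and the naive punctuation-based sentence splitter are assumptions of this sketch; the claim does not prescribe how paragraphs or sentences are recognized.

```python
import re


def split_positions(text, granularity="paragraph"):
    """Divide document content into document positions (assumed rules)."""
    if granularity == "paragraph":
        # Assumption: paragraphs are separated by blank lines.
        parts = re.split(r"\n\s*\n", text)
    elif granularity == "sentence":
        # Naive split after Western or Chinese end-of-sentence punctuation.
        parts = re.split(r"(?<=[.!?。！？])\s*", text)
    else:
        raise ValueError("granularity must be 'paragraph' or 'sentence'")
    return [p.strip() for p in parts if p.strip()]


def position_distribution(unit, documents, granularity="paragraph"):
    """Return (occurrence counts, occurrence probabilities) per position."""
    split_docs = [split_positions(d, granularity) for d in documents]
    width = max((len(d) for d in split_docs), default=0)
    counts = [0] * width
    for positions in split_docs:
        for index, segment in enumerate(positions):
            counts[index] += segment.count(unit)
    total = sum(counts)
    probabilities = [c / total if total else 0.0 for c in counts]
    return counts, probabilities


documents = [
    "Follow us now.\n\nBody text.\n\nFollow us again.",
    "Follow us.\n\nMore body.",
]
counts, probabilities = position_distribution("Follow us", documents)
print(counts)  # [2, 0, 1]
```

Either the counts or the probabilities could then be compared against the distribution threshold of claim 1.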

3. The method according to claim 2, wherein the step of acquiring a preset sample set specifically comprises:

performing deduplication on a plurality of candidate samples, and obtaining the sample set from the deduplicated candidate samples.

4. The method according to claim 3, wherein the step of de-duplicating the plurality of candidate samples specifically comprises:

calculating the similarity between the titles of the candidate samples, and deduplicating candidate samples whose title similarity is greater than a preset similarity threshold;

and, for candidate samples whose title similarity is not greater than the preset similarity threshold, looking up the keyword set corresponding to each candidate sample, and if the number of identical keywords contained in the keyword sets corresponding to two candidate samples is greater than a preset number threshold, deduplicating the two candidate samples.

5. The method according to claim 4, wherein the step of calculating the similarity between the titles of the respective candidate samples specifically comprises: calculating the similarity between the titles of the candidate samples through a longest common subsequence algorithm;

the keyword set corresponding to each candidate sample is determined according to the inverse document frequency of each word obtained after the candidate sample is subjected to word segmentation; the number threshold is determined according to a Jaccard similarity algorithm.
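
The two-stage deduplication of claims 4 and 5 might be sketched as follows, purely as an illustration. The normalization of the longest-common-subsequence length by the longer title, the derivation of the number threshold as `jaccard_threshold * |union|`, the default threshold values, and the dictionary layout of a sample are all assumptions of this sketch. The keyword sets are taken as given; their selection by inverse document frequency is outside the sketch.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two strings."""
    previous = [0] * (len(b) + 1)
    for ch_a in a:
        current = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def title_similarity(a, b):
    # Assumption: normalize the LCS length by the longer title.
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 1.0


def is_duplicate(sample_a, sample_b,
                 similarity_threshold=0.8, jaccard_threshold=0.5):
    # Stage 1: titles that are similar enough mark the pair as duplicates.
    if title_similarity(sample_a["title"], sample_b["title"]) > similarity_threshold:
        return True
    # Stage 2: otherwise compare the number of shared keywords with a
    # number threshold derived from a Jaccard ratio (assumed derivation).
    keywords_a = set(sample_a["keywords"])
    keywords_b = set(sample_b["keywords"])
    shared = len(keywords_a & keywords_b)
    number_threshold = jaccard_threshold * len(keywords_a | keywords_b)
    return shared > number_threshold


def deduplicate(candidates, **thresholds):
    """Keep the first sample of every group of duplicates."""
    kept = []
    for sample in candidates:
        if not any(is_duplicate(sample, k, **thresholds) for k in kept):
            kept.append(sample)
    return kept
```

With this derivation, requiring the shared keyword count to exceed the number threshold is equivalent to requiring the Jaccard similarity of the two keyword sets to exceed `jaccard_threshold`.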

6. The method according to claim 5, wherein the step of detecting the promotion information contained in the document according to the determined promotion feature unit specifically comprises:

setting a corresponding document detection model according to the determined promotion feature units and the distribution of the promotion feature units at each document position, and detecting promotion information contained in the document according to the document detection model.

7. The method according to claim 6, wherein the step of setting a corresponding document detection model according to the determined promotion feature units and their distribution at each document position specifically comprises:

setting the model parameters contained in the document detection model, and the weight values corresponding to the model parameters, according to the determined promotion feature units, the occurrence probability of the promotion feature units at each document position, and preset position weights.

8. The method according to claim 7, wherein the step of detecting the promotion information contained in the document according to the document detection model specifically includes:

searching, among all information units contained in the document to be detected, for information units matching the model parameters contained in the document detection model;

and determining a score for each information unit found according to the document position of the information unit in the document to be detected and/or the weight value of the model parameter matched with the information unit, and determining whether the information unit is promotion information according to the score.
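
For illustration only, the model setup of claim 7 and the scoring of claim 8 might be sketched as below. The claims do not state how occurrence probabilities and position weights are combined or how a score is turned into a decision; multiplying the two and comparing the result against a hypothetical `score_threshold` are assumptions of this sketch, as are the function names.

```python
def build_model(position_probabilities, position_weights):
    """Map each promotion feature unit to one weight value per position.

    position_probabilities: {unit: [probability at position 0, 1, ...]}
    position_weights: preset weight for each document position.
    """
    model = {}
    for unit, probabilities in position_probabilities.items():
        # Assumed combination: occurrence probability times position weight.
        model[unit] = [p * w for p, w in zip(probabilities, position_weights)]
    return model


def detect(document_positions, model, score_threshold):
    """Return (position index, unit, score) for units judged promotional.

    document_positions: list of positions, each a list of information units.
    """
    hits = []
    for index, units in enumerate(document_positions):
        for unit in units:
            weights = model.get(unit)
            # Skip units that match no model parameter at this position.
            if weights is None or index >= len(weights):
                continue
            score = weights[index]
            if score > score_threshold:
                hits.append((index, unit, score))
    return hits


model = build_model({"follow us": [0.7, 0.1, 0.2]}, [1.0, 0.5, 1.0])
document = [["follow us", "today"], ["follow us"], ["report"]]
hits = detect(document, model, score_threshold=0.3)
```

Here the same unit is flagged at the head of the document but not in the middle, because its weight at the middle position is low.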

9. The method of claim 8, wherein the method further comprises the step of: updating the document detection model according to promotion information contained in the detected document; wherein the document detection model comprises: a deep learning model.

10. The method according to any one of claims 1-9, wherein after the step of detecting promotional information contained in the document according to the determined promotional feature unit, further comprising the steps of:

deleting content from the document according to the document position of the detected promotion information;

wherein, when the document position of the detected promotion information belongs to the head of the document, deleting the promotion information and the paragraph contents before the promotion information; when the document position of the detected promotion information belongs to the tail of the document, deleting the promotion information and the paragraph contents after the promotion information; and when the document position of the detected promotion information belongs to the middle of the document, deleting the sentence in which the promotion information is located.
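
The position-dependent deletion of claim 10 might be sketched as follows, as a non-limiting illustration. Treating the first `head_size` paragraphs as the head and the last `tail_size` paragraphs as the tail, and representing the document as a list of paragraphs that are each a list of sentences, are assumptions of this sketch; the claim does not define where the head and tail end.

```python
def prune_document(paragraphs, hit_paragraph, hit_sentence,
                   head_size=1, tail_size=1):
    """Return a copy of the document with promotion content removed.

    paragraphs: list of paragraphs, each a list of sentences.
    hit_paragraph, hit_sentence: indices of the detected promotion sentence.
    """
    paragraph = paragraphs[hit_paragraph]
    if hit_paragraph < head_size:
        # Head: drop the promotion sentence and everything before it.
        remainder = paragraph[hit_sentence + 1:]
        kept = ([remainder] if remainder else []) + paragraphs[hit_paragraph + 1:]
    elif hit_paragraph >= len(paragraphs) - tail_size:
        # Tail: drop the promotion sentence and everything after it.
        remainder = paragraph[:hit_sentence]
        kept = paragraphs[:hit_paragraph] + ([remainder] if remainder else [])
    else:
        # Middle: drop only the sentence containing the promotion.
        remainder = paragraph[:hit_sentence] + paragraph[hit_sentence + 1:]
        kept = (paragraphs[:hit_paragraph]
                + ([remainder] if remainder else [])
                + paragraphs[hit_paragraph + 1:])
    # Copy inner lists so the caller's document is left untouched.
    return [list(p) for p in kept]


doc = [
    ["Ad: follow us.", "Intro."],
    ["Body one.", "Buy now!", "Body two."],
    ["Outro.", "Scan the code."],
]
print(prune_document(doc, 1, 1)[1])  # ['Body one.', 'Body two.']
```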

11. A promotional information detection apparatus, comprising:

the information unit extraction module is used for acquiring a preset sample set and extracting information units contained in each sample in the sample set;

the candidate unit determining module is used for counting the occurrence frequency of each information unit in the sample set and determining the information units with the occurrence frequency larger than a preset first threshold value as candidate feature units;

the promotion unit determination module is used for, for each candidate feature unit, counting the distribution of the candidate feature unit at each document position, and determining whether the candidate feature unit is a promotion feature unit according to the counting result;

the detection module is used for detecting promotion information contained in the document according to the determined promotion feature unit;

wherein the promotion unit determination module further comprises:

the vector submodule is used for setting a vector for representing the distribution of the candidate feature unit at each document position, wherein each element in the vector corresponds to a respective document position; if the distribution quantity of the candidate feature unit at a specified document position is greater than a preset distribution threshold, the element value of the element corresponding to the specified document position is non-zero; if the distribution quantity of the candidate feature unit at the specified document position is not greater than the preset distribution threshold, the element value of the element corresponding to the specified document position is zero;

and the determining submodule is used for determining the candidate feature unit to be a promotion feature unit when the number of non-zero elements in the vector is greater than a preset element threshold.

12. The apparatus of claim 11, wherein the promotion unit determination module further comprises:

the document dividing submodule is used for dividing the document content into a plurality of document positions according to a preset position dividing rule;

wherein the preset position division rule comprises: a paragraph-granularity division rule and a sentence-granularity division rule; and the distribution quantity of the candidate feature unit at the specified document position comprises: the number of occurrences, and/or the probability of occurrence, of the candidate feature unit at the specified document position.

13. The apparatus of claim 12, wherein the information unit extraction module is further configured to:

14. The apparatus of claim 13, wherein the information unit extraction module is specifically configured to:

15. The apparatus of claim 14, wherein the information unit extraction module is specifically configured to: calculating the similarity between the titles of the candidate samples through a longest common subsequence algorithm;

16. The apparatus of claim 15, wherein the detection module is specifically configured to:

17. The apparatus of claim 16, wherein the detection module is specifically configured to:

18. The apparatus of claim 17, wherein the detection module is specifically configured to:

19. The apparatus of claim 18, wherein the apparatus further comprises:

the updating module is used for updating the document detection model according to the promotion information contained in the detected document; wherein the document detection model comprises: a deep learning model.

20. The apparatus of any of claims 11-19, wherein the apparatus further comprises:

the deleting module is used for deleting content from the document according to the document position of the detected promotion information;