CN106909669A - Method and device for detecting promotional information - Google Patents
- Publication number
- CN106909669A (application CN201710113764.5A)
- Authority
- CN
- China
- Prior art keywords
- unit
- document
- candidate feature
- candidate
- default
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/335—Filtering based on additional data, e.g. user or group profiles
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method and device for detecting promotional information, relating to the technical field of text filtering. The method includes: obtaining a preset sample set and extracting the information units contained in each sample in the sample set; counting the number of occurrences of each information unit in the sample set, and determining information units whose occurrence count exceeds a preset first threshold as candidate feature units; for each candidate feature unit, counting the distribution of the candidate feature unit across document positions, and determining from the statistics whether the candidate feature unit is a promotional feature unit; and detecting the promotional information contained in a document according to the determined promotional feature units. The invention can thus filter advertising information or spam promotional information efficiently and accurately, so that clean news content can be extracted even by machine crawling, which greatly improves the efficiency of compiling news from self-media platforms.
Description
Technical field
The present invention relates to the technical field of text filtering, and in particular to a method and device for detecting promotional information.
Background
With the development of Internet technology, the self-media era has arrived. Unlike traditional news media, news on self-media platforms is more timely and draws on a wider range of sources, and the openness of self-media platforms means that every platform user can be a reader of news as well as a producer and publisher of it. At present, more and more breaking news is published promptly through self-media platforms such as WeChat and Weibo, and people are increasingly accustomed to obtaining the news they are interested in from these platforms. At the same time, forwarding between users spreads news from self-media platforms effectively.
However, in the course of implementing the present invention, the inventor found at least the following problem in the prior art. To collect news from self-media platforms and make it convenient for users to read, news content from these platforms can be gathered by machine crawling. But because news content from self-media platforms is often mixed with advertising information or spam promotional information, crawling news content with the prior art cannot filter out that information accurately, so clean news content cannot be obtained.
Summary of the invention
In view of the above problem, the present invention is proposed to provide a method and device for detecting promotional information that overcome the above problem or at least partially solve it.
According to one aspect of the present invention, a method for detecting promotional information is provided, including: obtaining a preset sample set and extracting the information units contained in each sample in the sample set; counting the number of occurrences of each information unit in the sample set, and determining information units whose occurrence count exceeds a preset first threshold as candidate feature units; for each candidate feature unit, counting the distribution of the candidate feature unit across document positions, and determining from the statistics whether the candidate feature unit is a promotional feature unit; and detecting the promotional information contained in a document according to the determined promotional feature units.
According to another aspect of the present invention, a device for detecting promotional information is provided, including: an information unit extraction module, configured to obtain a preset sample set and extract the information units contained in each sample in the sample set; a candidate unit determination module, configured to count the number of occurrences of each information unit in the sample set and determine information units whose occurrence count exceeds a preset first threshold as candidate feature units; a promotional unit determination module, configured to count, for each candidate feature unit, the distribution of the candidate feature unit across document positions and determine from the statistics whether the candidate feature unit is a promotional feature unit; and a detection module, configured to detect the promotional information contained in a document according to the determined promotional feature units.
The present invention thus provides a method and device for detecting promotional information. Information units are extracted from a preset sample set; the candidate feature units among them are determined from the number of times each information unit occurs in the sample set; the promotional feature units among the candidates are then determined from the distribution of each candidate feature unit across positions in the documents; and finally the promotional information contained in a target document is detected using the selected promotional feature units. Advertising information or spam promotional information is thereby filtered effectively and accurately while news is extracted from self-media platforms by machine crawling, so that clean news content can be extracted even by machine crawling, which greatly improves the efficiency of compiling news from self-media platforms.
The above is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the content of the specification, and so that the above and other objects, features, and advantages of the invention are more apparent, specific embodiments of the invention are set out below.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art on reading the detailed description of the preferred embodiments below. The drawings serve only to illustrate the preferred embodiments and are not to be regarded as limiting the present invention. Throughout the drawings, the same reference numerals denote the same parts. In the drawings:
Fig. 1 is a flow chart of a method for detecting promotional information provided by Embodiment one of the present invention;
Fig. 2 is a flow chart of a method for detecting promotional information provided by Embodiment two of the present invention;
Fig. 3 is a schematic structural diagram of a device for detecting promotional information provided by Embodiment three of the present invention;
Fig. 4 is a schematic structural diagram of a device for detecting promotional information provided by Embodiment four of the present invention;
Fig. 5 is a histogram of the position distribution within documents of time-related candidate feature units in an embodiment of the present invention;
Fig. 6 is a histogram of the position distribution within documents of candidate feature units associated with advertising information or spam promotional information in an embodiment of the present invention.
Detailed description
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure can be understood more thoroughly and its scope conveyed completely to those skilled in the art.
Embodiment one
Fig. 1 shows a method for detecting promotional information provided by the present invention. The method includes:
Step S110: Obtain a preset sample set and extract the information units contained in each sample in the sample set.
So that a computer can recognize sample news content, the preset sample news content containing advertising information or spam promotional information must first be split according to some rule, and the information units contained in each sample extracted from it. Here the preset sample set refers to reasonably representative self-media news content that contains advertising information or spam promotional information; it is generally selected and set by those skilled in the art on the basis of experience. An information unit is a basic unit making up the sample news content. It is typically a feature phrase produced by splitting the sample news content, or a word or expression with certain characteristics. The present invention does not specifically limit the rules for setting the preset sample set or the concrete form of the information units; those skilled in the art can set them flexibly according to actual conditions.
Step S120: Count the number of occurrences of each information unit in the sample set, and determine information units whose occurrence count exceeds a preset first threshold as candidate feature units.
Advertising information and spam promotional information are information that each news publisher on a self-media platform repeats deliberately, so different news items from the same publisher generally contain the same advertising information or spam promotional information. The occurrences in the sample set of the information units extracted in step S110 are counted. When the occurrence count of an information unit exceeds the preset first threshold, the information unit is strongly suspected of belonging to advertising information or spam promotional information, and it is therefore determined to be a candidate feature unit.
The preset first threshold is determined from how often advertising information or spam promotional information typically repeats in sample news content from the same news publisher. When an information unit repeats more often than this, it is determined to be a candidate feature unit suspected of being advertising information or spam promotional information. The present invention does not specifically limit the rule for determining the first threshold; those skilled in the art can determine it flexibly from test data and experience.
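Steps S110 to S120 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `candidate_feature_units` and the representation of each sample as a list of already-extracted information units are assumptions made for the example, and the threshold value is arbitrary.

```python
from collections import Counter

def candidate_feature_units(samples, first_threshold):
    """Return information units whose total occurrence count in the
    sample set exceeds the preset first threshold.

    samples: iterable of lists, one list of information units per sample
             (hypothetical representation chosen for this sketch).
    """
    counts = Counter()
    for units in samples:
        counts.update(units)  # count every occurrence across the sample set
    return {unit for unit, n in counts.items() if n > first_threshold}

# Toy sample set: "follow us" repeats across samples, the news text does not.
samples = [
    ["storm hits coast", "follow us"],
    ["election results", "follow us"],
    ["follow us", "market update"],
]
candidates = candidate_feature_units(samples, first_threshold=2)
```

With a first threshold of 2, only the unit that occurs three times is kept as a candidate.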
Step S130: For each candidate feature unit, count the distribution of the candidate feature unit across document positions, and determine from the statistics whether the candidate feature unit is a promotional feature unit.
After the preliminary screening of step S120, most information units containing advertising information or spam promotional information will have been determined to be candidate feature units, but some information units that contain normal news content and repeat more often than the first threshold will also have been determined to be candidate feature units.
Through extensive experiments and repeated comparison, the inventor found that candidate feature units containing normal news content are not deliberately repeated by the news publisher, so their positions within the samples are generally fairly evenly distributed, whereas candidate feature units containing advertising information or spam promotional information are deliberately repeated by the news publisher, so their positions within the samples are fairly concentrated. On the basis of this finding, the present invention uses the position distribution of candidate feature units within the samples to screen the candidates further, and determines candidate feature units whose position distribution is fairly concentrated to be promotional feature units.
Step S140: Detect the promotional information contained in a document according to the determined promotional feature units.
The above steps yield the promotional feature units derived from the preset sample set. These promotional feature units are then used to examine the documents to be detected that were obtained by machine crawling, so that the corresponding promotional information contained in those documents is picked out effectively. Finally, the promotional information picked out is removed from the documents to be detected, leaving relatively clean news content.
In the method for detecting promotional information provided by the present invention, information units are thus extracted from a preset sample set; the candidate feature units among them are determined from the number of times each information unit occurs in the sample set; the promotional feature units among the candidates are then determined from the distribution of each candidate feature unit across positions in the documents; and finally the promotional information contained in a target document is detected using the selected promotional feature units. Advertising information or spam promotional information is thereby filtered effectively and accurately while news is extracted from self-media platforms by machine crawling, so that clean news content can be extracted even by machine crawling, which greatly improves the efficiency of compiling news from self-media platforms.
Embodiment two
Fig. 2 shows a method for detecting promotional information provided by the present invention. The method includes:
Step S210: Obtain a preset sample set and extract the information units contained in each sample in the sample set.
So that a computer can recognize sample news content, the preset sample news content containing advertising information or spam promotional information must first be split according to some rule, and the information units contained in each sample extracted from it. Because the same news item is often reposted many times, deduplicating before the preset sample set is obtained effectively reduces the computation needed to obtain the sample set and improves efficiency. The step of obtaining the preset sample set therefore specifically includes deduplicating multiple candidate samples and obtaining the sample set from the deduplicated candidate samples.
Specifically, deduplication includes computing the similarity between the titles of the candidate samples and deduplicating candidate samples whose title similarity is greater than a preset similarity threshold. For candidate samples whose title similarity is not greater than the preset similarity threshold, the keyword set of each candidate sample is looked up, and if the number of identical keywords in the keyword sets of two candidate samples is greater than a preset quantity threshold, the two candidate samples are deduplicated. Preferably, the similarity between titles is computed with the longest common subsequence algorithm, the keyword set of each candidate sample is determined from the inverse document frequency (IDF) of each word obtained by segmenting the candidate sample, and the quantity threshold is determined with the Jaccard similarity algorithm.
To make the above easier to understand, the deduplication process is described below with a specific example:
1. Perform Chinese word segmentation and stop-word removal on the titles and body text of all sample articles.
2. Compute, in a distributed manner, the term frequency (TF) of each word in each sample article and the corresponding inverse document frequency (IDF), then compute the TF*IDF score of each word.
3. Take the first 20 words of the title segmentation result to form the keyword set (20 is only the value used in this example; in other embodiments those skilled in the art can set the number of keywords according to actual conditions). When the title yields fewer than 20 words, the remaining keywords are filled in order from the body words ranked from highest to lowest TF*IDF score.
4. Build buckets from the keyword sets of all articles. (A bucket table is a finer-grained way of partitioning data; bucketing adds extra structure to a table that can be exploited when processing queries, giving higher query efficiency.) The primary key of each bucket is one unique keyword from the keyword sets, so the articles in the same bucket may be similar.
5. When computing the similarity of an article, first find the 20 buckets corresponding to that article (each bucket corresponds to one keyword and each article has 20 keywords, so each article corresponds to 20 buckets), then use the longest common subsequence algorithm to compute the similarity between the title of this article and the titles of all articles in those buckets. When the title similarity is greater than 0.75 (the preset similarity threshold in this example, set by those skilled in the art according to actual conditions), the two articles are judged to be sample articles with identical content and are deduplicated.
6. When the title similarity is not greater than 0.75, compare the 20 keywords of the two articles pairwise. When the number of identical keywords is greater than 16, the two articles are judged to be sample articles with identical content and are deduplicated. Here 16 is the preset quantity threshold of this example, determined by the Jaccard similarity: each article has 20 keywords, and if x of them are identical the Jaccard similarity is x/(20-x+20-x+x); setting this to 0.66 gives x ≈ 16.
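The two-stage comparison in steps 5 and 6 can be sketched as below. This is a minimal illustration under stated assumptions: the text does not say how the longest-common-subsequence length is normalized into a similarity, so dividing by the longer title's length is an assumption made here, and the function names (`lcs_length`, `title_similarity`, `jaccard`, `is_duplicate`) are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two strings (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0]
        for j, bj in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ch == bj else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def title_similarity(a, b):
    """LCS length divided by the longer title's length (assumed normalization)."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))

def jaccard(k1, k2):
    """Jaccard similarity of two keyword collections."""
    s1, s2 = set(k1), set(k2)
    union = s1 | s2
    return len(s1 & s2) / len(union) if union else 0.0

def is_duplicate(title1, kw1, title2, kw2,
                 title_threshold=0.75, count_threshold=16):
    """Compare titles first; fall back to the shared-keyword count."""
    if title_similarity(title1, title2) > title_threshold:
        return True
    return len(set(kw1) & set(kw2)) > count_threshold

# Two 20-keyword sets sharing 16 keywords: Jaccard = 16 / (4 + 4 + 16).
kw_a = [f"w{i}" for i in range(20)]
kw_b = [f"w{i}" for i in range(4, 24)]
shared_16_jaccard = jaccard(kw_a, kw_b)
```

Checking the cheap title comparison first means the keyword comparison only runs for pairs whose titles differ, which matches the ordering argued for in the example.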
In the above example, when the similarity of two articles is compared, the titles are compared first and the keywords second. On the one hand, title comparison involves little computation and runs quickly, and in general most articles with similar content also have similar titles. On the other hand, comparing only keywords or only titles runs into a bottleneck and the result is not accurate enough. The present invention therefore computes similarity by comparing titles first and keywords afterwards, so that the two approaches complement each other.
In the course of implementing the present invention, the inventor found that building buckets from the keyword sets effectively reduces the amount of computation. When keyword similarity is computed directly without buckets, the algorithmic complexity is O(n^2), where n is the total number of sample articles. When buckets are built by keyword, the complexity is O(k*m^2), where k is the total number of sample keywords and m is the average number of articles per keyword bucket, with k << n and m << n. When n is 100 million (that is, there are 100 million sample articles), the corresponding k is only in the tens of thousands, so the complexity after bucketing is lower. Moreover, because the primary key of each bucket is one unique keyword from the keyword sets, only articles in the same bucket can be similar; articles that share no bucket necessarily have no keyword in common and can be excluded directly, which reduces the computation further. In addition, two articles are judged similar, and computation stops so that deduplication can proceed, as soon as their keyword similarity exceeds the preset quantity threshold. Bucketing therefore finds similar articles sooner and allows computation to stop early; that is, the bucket-based algorithm tends toward its best-case complexity rather than the worst case O(k*m^2).
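The bucketing idea can be sketched as an inverted index from keyword to article ids, so that only articles sharing at least one keyword are ever compared. This is an illustrative sketch, not the distributed implementation the text refers to; `build_buckets` and `candidate_pairs` are hypothetical names, and article ids are assumed to be sortable values.

```python
from collections import defaultdict
from itertools import combinations

def build_buckets(article_keywords):
    """Map each keyword (the bucket key) to the set of article ids containing it."""
    buckets = defaultdict(set)
    for article_id, keywords in article_keywords.items():
        for kw in set(keywords):
            buckets[kw].add(article_id)
    return buckets

def candidate_pairs(buckets):
    """Pairs of articles that share at least one bucket.

    Articles that never share a bucket have no keyword in common and are
    never paired, which is where the saving over all-pairs comparison comes from.
    """
    pairs = set()
    for ids in buckets.values():
        for a, b in combinations(sorted(ids), 2):
            pairs.add((a, b))
    return pairs

articles = {1: ["storm", "coast"], 2: ["coast", "flood"], 3: ["election"]}
pairs = candidate_pairs(build_buckets(articles))
```

In this toy input, article 3 shares no keyword with the others and is excluded without any similarity computation.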
In the course of implementing the present invention, the inventor also found that, when computing the Jaccard similarity of keywords, a data structure that trades space for time can be used to speed up the computation. First, an index of size 65536 is built (according to Chinese character encoding rules, 65536 positions can represent all Chinese characters). The first character of each word in each article's keyword set is used as the index position, and the remaining characters are stored as an attribute value at that index position. Each parent index can have multiple child indexes, and each child index has an attribute value indicating which articles it belongs to. This attribute value is a binary number: if there are M articles, the binary number has M bits, and the bit for each article to which the entry belongs is set to 1. When the keyword similarity of two articles is needed, pairwise computation is no longer required; it is enough to look up repeated words in the keyword data structure of the same bucket. An array of M elements is used (one element per article, so M articles correspond to M elements), with every element initialized to 0. All child indexes under the same parent index are compared; if identical child indexes exist, the binary number recording article membership is read from the child index and 1 is added at the array position of each corresponding article. Each value in the array is then checked against 16 (the preset quantity threshold mentioned above). When a value in the array exceeds 16, the corresponding article is a similar article, and computation can stop and deduplication proceed. Previously, two groups of 20 words had to be compared one by one, so in the worst case a bucket of M articles required M*400 comparisons. After the improvement, only 20 quick lookups are needed, comparing the child indexes under 20 parent indexes, so the worst case is 20*M lookups. In practice the number of child indexes under each parent index is far smaller than M, so the computation shrinks by a further multiple, and the algorithm finds duplicate articles sooner.
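A simplified sketch of this lookup structure follows. It deliberately departs from the description in two ways, both assumptions made for readability: a nested dictionary keyed by first character stands in for the fixed 65536-slot index, and a Python set of article ids stands in for the M-bit membership number. The names `build_index` and `find_similar` are hypothetical.

```python
from collections import defaultdict

def build_index(article_keywords):
    """Parent index: first character; child index: remaining characters;
    value: ids of the articles whose keyword set contains that word."""
    index = defaultdict(lambda: defaultdict(set))
    for article_id, keywords in article_keywords.items():
        for kw in set(keywords):
            if kw:
                index[kw[0]][kw[1:]].add(article_id)
    return index

def find_similar(query_id, article_keywords, index, count_threshold=16):
    """Count shared keywords per article with one lookup per query keyword,
    stopping as soon as some article exceeds the threshold."""
    shared = defaultdict(int)  # plays the role of the M-element counter array
    for kw in set(article_keywords[query_id]):
        if not kw:
            continue
        for other in index[kw[0]][kw[1:]]:
            if other == query_id:
                continue
            shared[other] += 1
            if shared[other] > count_threshold:
                return other  # early exit: a similar article was found
    return None

articles = {0: ["ab", "cd", "ef"], 1: ["ab", "cd", "zz"], 2: ["ef", "yy"]}
index = build_index(articles)
# A small threshold is used here only so the toy data can trigger a match.
match = find_similar(0, articles, index, count_threshold=1)
```

Each query keyword costs one lookup instead of a comparison against every keyword of every other article, which is the saving the paragraph describes.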
After the above deduplication is complete, the information units contained in each sample in the sample set are extracted. Specifically, in this embodiment the article content can be split at punctuation marks and line breaks or blank space to obtain the information units in a sample. For example, "Press and hold the QR code to 'identify' and follow, more surprises await you" can be split into two information units: "Press and hold the QR code to 'identify' and follow" and "more surprises await you". In other embodiments, other rules may be used to split the article content and extract information units; the present invention does not specifically limit this, and those skilled in the art can set it flexibly.
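Splitting at punctuation and line breaks can be sketched with a regular expression. The exact set of delimiters is not given in the text, so the character class below (common Chinese and Western sentence punctuation plus line breaks) is an assumption, as is the function name `extract_information_units`.

```python
import re

# Assumed delimiter set: full-width and half-width sentence punctuation,
# plus carriage returns and newlines.
_DELIMITERS = re.compile(r"[，。！？；,.!?;\r\n]+")

def extract_information_units(text):
    """Split article text into information units and drop empty pieces."""
    return [piece.strip() for piece in _DELIMITERS.split(text) if piece.strip()]

units = extract_information_units(
    "Press and hold the QR code to follow, more surprises await you"
)
```

Quotation marks are intentionally left out of the delimiter set so that a quoted word inside a phrase, as in the example above, does not break the unit apart.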
Step S220: Count the number of occurrences of each information unit in the sample set, and determine information units whose occurrence count exceeds a preset first threshold as candidate feature units.
In the course of implementing the present invention, the inventor found from analysis of historical data that the articles a news publisher issues within a given period contain essentially the same advertising information or spam promotional information, so the information units associated with that information necessarily repeat at high frequency. Extensive statistical analysis yields a critical repetition count that separates information units associated with advertising information or spam promotional information from ordinary information units; this critical value is the preset first threshold. All information units are screened against the preset first threshold, and those whose occurrence count exceeds it are determined to be candidate feature units.
However, because the preset first threshold is an empirical value, it cannot filter out all normal repeated content. The position distribution of normally repeated news phrases and that of advertising phrases are therefore considered next: an L0-norm constraint is used to filter out normal content more accurately and to obtain accurate news advertising phrases together with weights such as their position distribution and repetition counts, and these data are finally used to build a model for identifying promotional information in news.
Step S230: For each candidate feature unit, count the distribution of the candidate feature unit across document positions, and determine from the statistics whether the candidate feature unit is a promotional feature unit.
The screening of step S220 roughly removes the information units of most normal content, but the remaining information units (the candidate feature units) may include, besides information units associated with advertising information or spam promotional information, time-related information units that appear in normal news. Through statistical analysis the inventor found that, among the candidate feature units, time-related candidate feature units are not deliberately repeated by a person, so their positions within documents are fairly evenly distributed (as shown in Fig. 5), whereas candidate feature units associated with advertising information or spam promotional information are deliberately repeated by a person, so their positions within documents are fairly concentrated (as shown in Fig. 6). Counting the distribution of candidate feature units across document positions therefore makes it possible to screen out the promotional feature units effectively.
Specifically, further screening can be performed with an L0-norm constraint on the distribution. First, document content is divided into multiple document positions according to a preset position division rule, where the preset position division rules include division rules based on paragraph granularity and division rules based on sentence granularity. Then a vector is set up to represent the distribution of the candidate feature unit across the document positions, with each element of the vector corresponding to one document position. If the distribution quantity of the candidate feature unit at a given document position is greater than a preset distribution threshold, the element corresponding to that document position is non-zero; if the distribution quantity at that document position is not greater than the preset distribution threshold, the element corresponding to that document position is zero. The distribution quantity of a candidate feature unit at a given document position includes its number of occurrences and/or its probability of occurrence at that position. Finally, when the number of non-zero elements in the vector is greater than a preset element threshold, the candidate feature unit is determined to be a promotional feature unit.
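The vector construction and L0-norm test described above can be sketched as follows. It is a minimal illustration: the per-position occurrence counts are assumed to have been tallied already, the threshold values are arbitrary, the function names are hypothetical, and the final comparison follows the wording of the paragraph above.

```python
def distribution_vector(position_counts, distribution_threshold):
    """One element per document position: 1 if the unit's count at that
    position exceeds the distribution threshold, otherwise 0."""
    return [1 if c > distribution_threshold else 0 for c in position_counts]

def l0_norm(vector):
    """Number of non-zero elements."""
    return sum(1 for x in vector if x != 0)

def is_promotional(position_counts, distribution_threshold, element_threshold):
    """Apply the element-count test as stated in the text."""
    vector = distribution_vector(position_counts, distribution_threshold)
    return l0_norm(vector) > element_threshold

# Occurrence counts at five document positions (illustrative numbers only).
concentrated = [50, 1, 0, 0, 2]   # piles up at one position
spread_out = [3, 4, 2, 3, 3]      # evenly spread, never passes the threshold
```

With a distribution threshold of 10, an evenly spread unit never exceeds the threshold at any position and its vector is all zeros, while a unit that piles up at one position produces a non-zero element.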
In the course of realizing the present invention, the inventors considered four position division rules: percentage distribution at paragraph granularity, percentage distribution at sentence granularity, forward and reverse ordering at paragraph granularity, and forward and reverse ordering at sentence granularity. Through extensive experiments, the inventors found the following. For articles from similar public accounts, promotion messages are concentrated mainly at the head or tail of the article. Articles with different content vary widely in their total number of paragraphs or sentences, so promotion messages that likewise sit in the first paragraph or the last few paragraphs fall at very different percentages. For the same tail promotion message, if the content is similar, the number of paragraphs it occupies is almost the same. Tail promotion messages also tend to use very short paragraphs, and in that case the forward and reverse ordering rule at paragraph granularity works best. In practical applications, information publishers usually place the promotion message at the very beginning of the article (i.e., the first few sentences of the first paragraph) or typeset it together at the tail of the article. Thus, for two articles with the same editing and typesetting, if the promotion message is at the head of the article (e.g., the first paragraph), the position distribution of the candidate feature unit can be counted in forward order, so that an occurrence in the first paragraph is recorded as +1. Similarly, when the editor habitually typesets the promotion message at the tail of the article (e.g., the last paragraph), forward counting would produce widely differing distribution statistics because the number of paragraphs varies greatly between articles: in an article of 20 paragraphs the last paragraph would be recorded as +20, whereas in an article of 30 paragraphs it would be recorded as +30. Reverse counting is needed here, so that the last paragraph is recorded as -1 regardless of how many paragraphs the article has and the distribution statistics show no large deviation. Using forward and reverse ordering together therefore reflects the position distribution more accurately. In addition, the inventors found that forward and reverse ordering at paragraph granularity is the most accurate (typesetting is mostly paragraph-based, so paragraphs better reflect the editor's typesetting intent, whereas sentences reflect the author's writing habits or writing level). Accordingly, in this embodiment, the forward and reverse ordering rule at paragraph granularity can be used to further improve accuracy.
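The forward and reverse ordering at paragraph granularity described above can be sketched as follows. This is only an illustration: the function name `signed_position`, the `window` size of 3, and the rule that the head label wins in very short articles are assumptions, not values given in this description.

```python
def signed_position(index, total, window=3):
    """Map a 0-based paragraph index to a signed position label.

    The first `window` paragraphs are labelled +1, +2, ... (forward order),
    the last `window` paragraphs are labelled -1, -2, ... (reverse order),
    and every other paragraph gets None. `window` is a hypothetical setting.
    """
    if not 0 <= index < total:
        raise ValueError("paragraph index out of range")
    if index < window:
        # Head of the article: count from the start.
        return index + 1
    from_end = total - index  # 1 for the last paragraph
    if from_end <= window:
        # Tail of the article: count from the end, independent of length.
        return -from_end
    return None
```

With this labelling the last paragraph of a 20-paragraph article and of a 30-paragraph article both map to -1, which is the property the reverse ordering is meant to provide.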
The preset distribution threshold and the preset element threshold both need to be determined through extensive experiments. Specifically, different distribution thresholds and element thresholds are tried, the separation between the candidate feature units corresponding to normal content and the candidate feature units corresponding to advertising information is compared for each choice, and the values giving the best separation are taken as the preset distribution threshold and the preset element threshold. In implementing the present invention, the inventors found through extensive experiments that the separation is best when the preset first threshold in step S220 is 20, the preset distribution threshold is 10, and the preset element threshold is 3. In that case, when a candidate feature unit occurs more than 10 times at a certain position in the articles, the vector element corresponding to that position is non-zero; otherwise it is 0. This gives, for each candidate feature unit, the mapping (x <= 10, y = 0; x > 10, y = x) and the L0 norm value n, where n is the number of non-zero elements in the vector y0, y1, ..., yi. When n >= 3 (i.e., the element threshold is 3), the candidate feature unit is judged to be a promotional feature unit.
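A minimal sketch of this thresholding and L0 norm test is given below. The function name and the dictionary input format are assumptions made for illustration; the defaults 10 and 3 are the values reported above, and the comparison n >= 3 follows the worked example rather than the stricter "greater than" wording used elsewhere.

```python
def is_promotional_feature(position_counts, distribution_threshold=10,
                           element_threshold=3):
    """Decide whether a candidate feature unit is a promotional feature unit.

    position_counts maps a document position label (e.g. +1, -1) to the
    number of times the unit occurred at that position across the samples.
    """
    # Mapping from the text: x <= threshold -> y = 0, x > threshold -> y = x.
    vector = [x if x > distribution_threshold else 0
              for x in position_counts.values()]
    # L0 "norm": the number of non-zero elements in the vector.
    n = sum(1 for y in vector if y != 0)
    return n >= element_threshold
```

For example, a unit counted 15, 12 and 11 times at three positions and 3 times at a fourth has three non-zero elements and passes the test.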
Step S240: Detecting the promotion message included in a document according to the determined promotional feature units.
Specifically, a corresponding document detection model is set according to the determined promotional feature units and their distribution across the document positions, and the promotion message included in the document is detected with this document detection model.
The step of setting the corresponding document detection model according to the determined promotional feature units and their distribution across the document positions specifically includes: setting the model parameters included in the document detection model, and the weight value corresponding to each model parameter, according to the determined promotional feature units, their probability of occurrence at each document position, and preset position weights. The probability of occurrence is calculated as p = k/n, where n is the total number of times the promotional feature unit occurs in the documents and k is the number of times it occurs at the position in question. Because advertising information or spam promotion messages often appear at particular positions in an article, different position weights need to be assigned to the different positions at which a promotional feature unit occurs. Note that the specific position weights need to be determined through extensive experiments, and that the weights of the positions where advertising information or spam promotion messages commonly appear should be higher than those of other positions in the text; only then is the probability of deleting normal content by mistake reduced.
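The occurrence probability p = k/n can be computed per position as in the following sketch. The function name and the list-of-positions input are assumptions for illustration only.

```python
from collections import Counter


def occurrence_probabilities(positions):
    """Return p = k / n for every position at which a unit occurred.

    positions lists one position label per occurrence of a promotional
    feature unit, so n is len(positions) and k is the count per label.
    """
    n = len(positions)
    counts = Counter(positions)
    return {position: k / n for position, k in counts.items()}
```

A unit seen three times in the last paragraph and once in the first paragraph gets p = 0.75 at position -1 and p = 0.25 at position +1.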
The step of detecting the promotion message included in the document according to the document detection model specifically includes: searching the information units included in the document to be detected for information units that match the model parameters included in the document detection model; and, for each information unit found, determining the score of the information unit according to its document position in the document to be detected and/or the weight value of the model parameter that it matches, and determining according to the score whether the information unit is a promotion message. The score is calculated by multiplying the probability of occurrence of the information unit at each document position by the preset position weight. Because the positions where advertising information or spam promotion messages commonly appear have higher position weights, an information unit with a higher final score is very likely to be a promotion message.
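The scoring rule above (occurrence probability multiplied by position weight) might look like the sketch below. The names `score_unit` and `default_weight` are hypothetical, and the weights used in the example are made up; the description states that real weights are found experimentally.

```python
def score_unit(position, probabilities, position_weights, default_weight=1.0):
    """Score a matched information unit found at `position`.

    probabilities: occurrence probability of the matching promotional
        feature unit per position (p = k / n).
    position_weights: preset weight per position; positions without an
        entry fall back to `default_weight` (an assumption of this sketch).
    """
    p = probabilities.get(position, 0.0)
    weight = position_weights.get(position, default_weight)
    return p * weight
```

A unit found in the last paragraph with p = 0.75 and a tail weight of 2.0 scores 1.5, while the same unit found at a position where it was never observed scores 0.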
Step S250: Trimming the document according to the document position of the detected promotion message.
When the document position of the detected promotion message belongs to the head of the document, the promotion message and the paragraphs before it are deleted; when it belongs to the tail of the document, the promotion message and the paragraphs after it are deleted; when it belongs to the middle of the document, the sentence containing the promotion message is deleted. This trimming operation effectively removes the advertising information or spam promotion messages included in machine-crawled news content, yielding clean news content and facilitating the compilation of news from self-media platforms.
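The three trimming cases of step S250 can be sketched as below. How a hit is classified as head, middle, or tail is not shown; the `region` argument, the function name, and the plain string removal of the sentence are assumptions of this sketch.

```python
def trim_document(paragraphs, hit_index, region, sentence=""):
    """Remove a detected promotion message from a list of paragraphs.

    hit_index: index of the paragraph containing the promotion message.
    region: "head", "tail" or "middle" (classification assumed given).
    sentence: the promotional sentence, used only for the "middle" case.
    """
    if region == "head":
        # Drop the promotion message and every paragraph before it.
        return paragraphs[hit_index + 1:]
    if region == "tail":
        # Drop the promotion message and every paragraph after it.
        return paragraphs[:hit_index]
    if region == "middle":
        # Drop only the sentence that carries the promotion message.
        cleaned = paragraphs[hit_index].replace(sentence, "", 1).strip()
        middle = [cleaned] if cleaned else []
        return paragraphs[:hit_index] + middle + paragraphs[hit_index + 1:]
    raise ValueError("region must be 'head', 'tail' or 'middle'")
```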
Step S260: Updating the document detection model according to the promotion message detected in the document.
The document detection model includes a deep learning model; in particular, a convolutional neural network model may be used. In practical applications, the actual result of each promotion message detection can also be fed back to the convolutional neural network model, so that the document detection model is continually updated, improving both the recognition accuracy and the recognition efficiency for promotion messages.
It can be seen that the method for detecting a promotion message provided by the present invention first deduplicates the sample data, which reduces part of the computation of the method; then extracts the information units from the preset sample set and determines the candidate feature units among them according to the number of occurrences of each information unit in the sample set; then determines the promotional feature units among the candidate feature units according to their position distribution in the documents, using an L0 norm constraint algorithm; and finally builds a document detection model from the selected promotional feature units and uses it to detect the promotion message included in the target document to be detected, thereby obtaining the promotion message in the target document. Using the obtained promotion message, the machine-crawled target document can be trimmed to obtain clean news content, which facilitates news compilation for self-media platforms. Moreover, when the document detection model uses a deep learning model, the actual result of each promotion message detection can be fed back to the document detection model, so that the model keeps learning and updating to adapt to changes and improves the accuracy of promotion message detection.
Embodiment three
Fig. 3 shows a device for detecting a promotion message provided by the present invention. The device includes: an information unit extraction module 310, a candidate unit determination module 320, a promotion unit determination module 330, and a detection module 340.
The information unit extraction module 310 is configured to obtain a preset sample set and extract the information units included in each sample in the sample set.
To make it easier for the detection device to identify sample news content, the information unit extraction module 310 first needs to split the preset sample news content that contains advertising information or spam promotion messages according to certain rules and extract from it the information units included in each sample. Here, the preset sample set refers to reasonably representative self-media news content that contains advertising information or spam promotion messages; the sample set is generally selected and set by those skilled in the art based on experience. An information unit is a basic unit making up the sample news content; it may typically be a feature phrase, or characteristic words and sentences produced by splitting the sample news content. The present invention does not specifically limit the rules for setting the preset sample set or the concrete form of the information units, and those skilled in the art can set them flexibly according to the actual situation.
The candidate unit determination module 320 is configured to count the number of occurrences of each information unit in the sample set and to determine the information units whose number of occurrences is greater than a preset first threshold as candidate feature units.
Because advertising information and spam promotion messages are information that each news publisher on a self-media platform deliberately repeats, different news items from the same publisher usually contain the same advertising information or spam promotion messages. The candidate unit determination module 320 counts the number of occurrences in the sample set of the information units extracted by the information unit extraction module 310. When the number of occurrences of an information unit exceeds the preset first threshold, that information unit is strongly suspected of belonging to advertising information or a spam promotion message, and it is therefore determined to be a candidate feature unit.
The preset first threshold is determined from the typical number of times advertising information or spam promotion messages are repeated in sample news content from the same news publisher. When an information unit is repeated more than this number of times, it is determined to be a candidate feature unit suspected of being advertising information or a spam promotion message. The present invention does not specifically limit the rule for determining the first threshold, and those skilled in the art can determine it flexibly based on test data and experience.
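The frequency screening performed by the candidate unit determination module can be sketched as follows. The function name and the representation of each sample as a list of information units are assumptions; the default of 20 is the first threshold reported in the method embodiment.

```python
from collections import Counter


def candidate_feature_units(samples, first_threshold=20):
    """Return the information units that occur more than `first_threshold`
    times across the whole sample set.

    samples: iterable of samples, each a list of information units (strings).
    """
    counts = Counter(unit for sample in samples for unit in sample)
    return {unit for unit, count in counts.items() if count > first_threshold}
```

Note that the comparison is strict, matching the wording "greater than a preset first threshold".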
The promotion unit determination module 330 is configured to, for each candidate feature unit, count the distribution of the candidate feature unit across the document positions and determine according to the statistics whether the candidate feature unit is a promotional feature unit.
After the preliminary screening by the candidate unit determination module 320, most information units containing advertising information or spam promotion messages are determined to be candidate feature units, but some information units that contain normal news content and are repeated more than the first threshold are also determined to be candidate feature units.
Through extensive experiments and repeated comparison, the inventors found that candidate feature units containing normal news content are not deliberately repeated by the news publisher, so their position distribution in the samples is generally fairly uniform, whereas candidate feature units containing advertising information or spam promotion messages are content that the news publisher deliberately repeats, so their position distribution in the samples is relatively concentrated. Based on this finding, the promotion unit determination module 330 uses the position distribution of the candidate feature units in the samples to screen them further, and determines the candidate feature units whose position distribution is relatively concentrated to be promotional feature units.
The detection module 340 is configured to detect the promotion message included in a document according to the determined promotional feature units.
The processing by the promotion unit determination module 330 yields the promotional feature units extracted from the preset sample set. The detection module 340 then uses these promotional feature units to identify documents to be detected that were obtained by machine crawling, effectively screening out the promotion messages included in the document to be detected; finally, removing the screened-out promotion messages from the document to be detected yields relatively clean news content.
For the specific structure and working principle of each of the above modules, refer to the description of the corresponding parts of the method embodiments, which is not repeated here.
It can be seen that the device for detecting a promotion message provided by the present invention extracts the information units from the preset sample set, determines the candidate feature units among them according to the number of occurrences of each information unit in the sample set, then determines the promotional feature units among the candidate feature units according to their position distribution in the documents, and finally detects the promotion message included in the target document according to the selected promotional feature units. Advertising information or spam promotion messages are thereby filtered effectively and accurately when news is extracted from self-media platforms by machine crawling, so that clean news content can be extracted by machine crawling as well, which greatly improves the efficiency of compiling news from self-media platforms.
Embodiment four
Fig. 4 shows a device for detecting a promotion message provided by the present invention. The device includes: an information unit extraction module 410, a candidate unit determination module 420, a promotion unit determination module 430, a detection module 440, an update module 450, and a deletion module 460, wherein the promotion unit determination module 430 further includes a vector submodule 431, a determination submodule 432, and a document division submodule 433.
The information unit extraction module 410 is configured to obtain a preset sample set and extract the information units included in each sample in the sample set.
To make it easier for the detection device to identify sample news content, the preset sample news content that contains advertising information or spam promotion messages first needs to be split according to certain rules, and the information units included in each sample extracted from it. Because the same news item may be reposted many times, deduplicating before the preset sample set is obtained effectively reduces the computation needed to obtain the sample set and improves efficiency. The information unit extraction module 410 therefore needs to deduplicate multiple candidate samples and obtain the sample set from the deduplicated candidate samples.
Specifically, the information unit extraction module 410 needs to calculate the similarity between the titles of the candidate samples and deduplicate candidate samples whose title similarity is greater than a preset similarity threshold. For candidate samples whose title similarity is not greater than the preset similarity threshold, it queries the keyword set corresponding to each candidate sample; if the number of identical keywords in the keyword sets of two candidate samples is greater than a preset quantity threshold, the two candidate samples are deduplicated. Preferably, the similarity between the titles of the candidate samples is calculated by a longest common subsequence algorithm, the keyword set corresponding to each candidate sample is determined from the inverse document frequency (IDF) of each word obtained after word segmentation of the candidate sample, and the quantity threshold is determined according to the Jaccard similarity algorithm.
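The two-stage duplicate test can be sketched as below. Several details are assumptions of this sketch rather than statements of the description: titles are compared by longest common subsequence length normalized by the longer title, each sample is a dictionary with hypothetical `title` and `keywords` fields, and the threshold defaults are placeholders.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two strings."""
    previous = [0] * (len(b) + 1)
    for ch in a:
        current = [0]
        for j, other in enumerate(b, 1):
            if ch == other:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def title_similarity(a, b):
    """LCS length divided by the longer title length (normalization assumed)."""
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 1.0


def is_duplicate(sample_a, sample_b, similarity_threshold=0.8,
                 quantity_threshold=5):
    """Stage 1: similar titles. Stage 2: enough shared keywords."""
    if title_similarity(sample_a["title"], sample_b["title"]) > similarity_threshold:
        return True
    shared = set(sample_a["keywords"]) & set(sample_b["keywords"])
    return len(shared) > quantity_threshold
```

The keyword sets are taken as given here; selecting them by IDF after word segmentation is outside the scope of the sketch.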
After the deduplication is complete, the information unit extraction module 410 extracts the information units included in each sample in the sample set. Specifically, in this embodiment, the article content can be split at punctuation marks and line breaks to obtain the information units in the sample. For example, "Long-press the QR code to 'identify' and follow, more surprises await you" can be split into two information units: "Long-press the QR code to 'identify' and follow" and "more surprises await you". In other embodiments, other rules may be used to split the article content and extract the information units; the present invention does not specifically limit this, and those skilled in the art can set it flexibly.
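Splitting at punctuation marks and line breaks can be sketched with a regular expression. The exact set of delimiters below (common ASCII and full-width Chinese punctuation) is an assumption; the description does not list which marks are used.

```python
import re

# Delimiters assumed for this sketch: ASCII , . ; ! ? plus the full-width
# Chinese period, comma, semicolon, exclamation and question marks,
# and line breaks.
_DELIMITERS = re.compile(r"[,.;!?\u3002\uff0c\uff1b\uff01\uff1f\r\n]+")


def extract_information_units(text):
    """Split article text into information units, dropping empty pieces."""
    pieces = _DELIMITERS.split(text)
    return [piece.strip() for piece in pieces if piece.strip()]
```

Applied to the example sentence above, this yields the two information units given in the text. A production splitter would need to handle decimals and abbreviations, which this sketch would split at the period.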
The candidate unit determination module 420 is configured to count the number of occurrences of each information unit in the sample set and to determine the information units whose number of occurrences is greater than a preset first threshold as candidate feature units.
In the course of realizing the present invention, the inventors found by analyzing historical data that the advertising information or spam promotion messages included in the articles a news publisher issues within a given period are essentially the same, so the information units associated with advertising information or spam promotion messages necessarily repeat at high frequency as well. Extensive statistical analysis yields a critical number of repetitions that distinguishes information units associated with advertising information or spam promotion messages from ordinary information units; this critical value is the preset first threshold. The candidate unit determination module 420 screens all information units against the preset first threshold and determines the information units whose number of occurrences is greater than it to be candidate feature units.
The promotion unit determination module 430 is configured to, for each candidate feature unit, count the distribution of the candidate feature unit across the document positions and determine according to the statistics whether the candidate feature unit is a promotional feature unit.
The screening by the candidate unit determination module 420 roughly filters out the information units of most normal content. Among the remaining information units (i.e., the candidate feature units), however, besides the information units associated with advertising information or spam promotion messages, there may also be time-related information units included in normal news. The inventors found by statistical analysis that, among the candidate feature units, time-related candidate feature units are not deliberately repeated, so their position distribution in the documents is fairly uniform (as shown in Fig. 5), whereas candidate feature units associated with advertising information or spam promotion messages are deliberately repeated, so their position distribution in the documents is relatively concentrated (as shown in Fig. 6). Counting the distribution of the candidate feature units across the document positions therefore allows the promotional feature units to be screened out effectively.
Specifically, the promotion unit determination module 430 includes a vector submodule 431, a determination submodule 432, and a document division submodule 433. The vector submodule 431 is configured to set a vector representing the distribution of the candidate feature unit across the document positions, wherein each element in the vector corresponds to one document position; if the distribution quantity of the candidate feature unit at a specified document position is greater than the preset distribution threshold, the element corresponding to that document position is non-zero; if it is not greater than the preset distribution threshold, the element corresponding to that document position is zero. The determination submodule 432 is configured to determine the candidate feature unit to be a promotional feature unit when the number of non-zero elements in the vector is greater than the preset element threshold. The document division submodule 433 is configured to divide the document content into multiple document positions according to a preset position division rule, wherein the preset position division rule includes a division rule based on paragraph granularity and a division rule based on sentence granularity, and the distribution quantity of the candidate feature unit at a specified document position includes the number of occurrences and/or the probability of occurrence of the candidate feature unit at that position.
The detection module 440 is configured to detect the promotion message included in a document according to the determined promotional feature units.
Specifically, the detection module 440 needs to set a corresponding document detection model according to the determined promotional feature units and their distribution across the document positions, and to detect the promotion message included in the document with this document detection model. Further, the detection module 440 needs to set the model parameters included in the document detection model, and the weight value corresponding to each model parameter, according to the determined promotional feature units, their probability of occurrence at each document position, and preset position weights; it then searches the information units included in the document to be detected for information units that match the model parameters included in the document detection model; and, for each information unit found, it determines the score of the information unit according to its document position in the document to be detected and/or the weight value of the model parameter that it matches, and determines according to the score whether the information unit is a promotion message.
The present invention may include an update module 450, configured to update the document detection model according to the promotion message detected in the document. The document detection model includes a deep learning model; in particular, a convolutional neural network model may be used. In practical applications, the update module 450 can also feed the actual result of each promotion message detection back to the convolutional neural network model, so that the document detection model is continually updated, improving both the recognition accuracy and the recognition efficiency for promotion messages.
The present invention may also include a deletion module 460, configured to trim the document according to the document position of the detected promotion message. When the document position of the detected promotion message belongs to the head of the document, the promotion message and the paragraphs before it are deleted; when it belongs to the tail of the document, the promotion message and the paragraphs after it are deleted; when it belongs to the middle of the document, the sentence containing the promotion message is deleted. The deletion module 460 effectively removes the advertising information or spam promotion messages included in machine-crawled news content, yielding clean news content and facilitating the compilation of news from self-media platforms.
For the specific structure and working principle of each of the above modules, refer to the description of the corresponding parts of the method embodiments, which is not repeated here.
It can be seen that the device for detecting a promotion message provided by the present invention first deduplicates the sample data, which reduces part of the computation; then extracts the information units from the preset sample set and determines the candidate feature units among them according to the number of occurrences of each information unit in the sample set; then determines the promotional feature units among the candidate feature units according to their position distribution in the documents, using an L0 norm constraint algorithm; and finally builds a document detection model from the selected promotional feature units and uses it to detect the promotion message included in the target document to be detected, thereby obtaining the promotion message in the target document. Using the obtained promotion message, the machine-crawled target document can be trimmed to obtain clean news content, which facilitates news compilation for self-media platforms. Moreover, when the document detection model uses a deep learning model, the actual result of each promotion message detection can be fed back to the document detection model, so that the model keeps learning and updating to adapt to changes and improves the accuracy of promotion message detection.
The algorithms and displays provided herein are not inherently related to any particular computer, virtual system, or other device. Various general-purpose systems may also be used with the teachings herein. The structure required to construct such systems is apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It should be understood that the content of the invention described herein can be implemented in various programming languages, and that the description of a specific language above is given to disclose the best mode of the invention.
Numerous specific details are set forth in the specification provided herein. It will be understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques are not shown in detail so as not to obscure the understanding of this specification.
Similarly, it should be understood that, in order to streamline the disclosure and aid understanding of one or more of the various inventive aspects, the features of the invention are sometimes grouped together into a single embodiment, figure, or description thereof in the above description of exemplary embodiments. This method of disclosure, however, should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single embodiment disclosed above. The claims following the detailed description are therefore expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will appreciate that the modules in the device of an embodiment can be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units, or components of the embodiments can be combined into one module, unit, or component, and can furthermore be divided into multiple submodules, subunits, or subcomponents. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings), and all processes or units of any method or device so disclosed, may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, an equivalent, or a similar purpose.
Furthermore, those skilled in the art will appreciate that although some embodiments described herein include certain features included in other embodiments but not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.
The various component embodiments of the invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the device for detecting a promotion message according to embodiments of the invention. The invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the method described herein. Such a program implementing the invention may be stored on a computer-readable medium or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means can be embodied by one and the same item of hardware. The use of the words first, second, and third does not indicate any order; these words may be interpreted as names.
The invention discloses: A1. A method for detecting promotion information, comprising:
obtaining a preset sample set, and extracting the information units contained in each sample in the sample set;
counting the number of occurrences of each information unit in the sample set, and determining the information units whose number of occurrences is greater than a preset first threshold as candidate feature units;
for each candidate feature unit, counting the distribution of the candidate feature unit over the document positions, and determining from the statistical result whether the candidate feature unit is a promotion feature unit;
detecting promotion information contained in a document according to the determined promotion feature units.
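As an illustration of the counting step in A1, the following is a minimal Python sketch rather than the patented implementation. It assumes that an information unit is a whitespace-separated token; the function name `candidate_feature_units`, the sample texts and the threshold value are hypothetical.

```python
from collections import Counter


def candidate_feature_units(samples, first_threshold):
    """Return the information units that occur more than first_threshold times."""
    counts = Counter()
    for sample in samples:
        # Assumption: an information unit is a whitespace-separated token.
        counts.update(sample.split())
    return {unit for unit, n in counts.items() if n > first_threshold}


# Hypothetical sample set.
samples = [
    "follow our official account for more news",
    "market update follow our official account",
    "weather report for the weekend",
]
print(sorted(candidate_feature_units(samples, first_threshold=1)))
```

With a first threshold of 1, only the tokens that appear at least twice across the sample set are kept as candidates. Frequent but harmless tokens such as "for" also pass this filter, which is why the later position-distribution step is needed.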
A2. The method according to A1, wherein the step of counting the distribution of the candidate feature unit over the document positions and determining from the statistical result whether the candidate feature unit is a promotion feature unit specifically comprises:
setting a vector for representing the distribution of the candidate feature unit over the document positions, wherein each element of the vector corresponds to one document position;
if the distribution quantity of the candidate feature unit at a specified document position is greater than a preset distribution threshold, the element corresponding to the specified document position is non-zero; if the distribution quantity of the candidate feature unit at the specified document position is not greater than the preset distribution threshold, the element corresponding to the specified document position is zero;
when the number of non-zero elements in the vector is greater than a preset element threshold, determining the candidate feature unit to be a promotion feature unit.
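The thresholding rule of A2 can be sketched as follows. This is a minimal illustration under stated assumptions: a non-zero element is encoded as 1, and the function names, the per-position counts and both threshold values are hypothetical.

```python
def position_vector(position_counts, distribution_threshold):
    """Encode each document position as 1 if its count exceeds the threshold, else 0."""
    return [1 if count > distribution_threshold else 0 for count in position_counts]


def is_promotion_feature_unit(position_counts, distribution_threshold, element_threshold):
    """Rule of A2: more non-zero elements than element_threshold marks a promotion feature unit."""
    vector = position_vector(position_counts, distribution_threshold)
    nonzero = sum(1 for value in vector if value != 0)
    return nonzero > element_threshold


# Hypothetical counts of one candidate feature unit at four document positions.
counts = [12, 0, 1, 9]
print(position_vector(counts, distribution_threshold=5))
print(is_promotion_feature_unit(counts, distribution_threshold=5, element_threshold=1))
```

Note that both comparisons are strict, matching the "greater than" and "not greater than" wording of A2: a count equal to the distribution threshold yields a zero element.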
A3. The method according to A2, wherein before the step of setting a vector for representing the distribution of the candidate feature unit over the document positions, the method further comprises: dividing document content into a plurality of document positions according to a preset position division rule; wherein the preset position division rule includes: a division rule based on paragraph granularity and a division rule based on sentence granularity;
and the distribution quantity of the candidate feature unit at a specified document position includes: the number of occurrences and/or the probability of occurrence of the candidate feature unit at the specified document position.
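The position division and the distribution quantity described in A3 might look like the sketch below. Several points are assumptions not stated in the source: paragraphs are separated by blank lines, sentences end with English punctuation followed by whitespace, a document position is identified by its index from the start of the document, and the probability of occurrence is taken to be the share of documents that contain the unit at that position. The function names are hypothetical.

```python
import re


def split_positions(text, granularity="paragraph"):
    """Divide a document into positions at paragraph or sentence granularity."""
    if granularity == "paragraph":
        parts = re.split(r"\n\s*\n", text)  # assumption: blank lines separate paragraphs
    elif granularity == "sentence":
        parts = re.split(r"(?<=[.!?])\s+", text)  # assumption: English punctuation
    else:
        raise ValueError("granularity must be 'paragraph' or 'sentence'")
    return [part.strip() for part in parts if part.strip()]


def distribution_quantity(unit, documents, num_positions, granularity="paragraph"):
    """Return (occurrence counts, occurrence probabilities) per position index."""
    if not documents:
        raise ValueError("documents must not be empty")
    counts = [0] * num_positions
    docs_with_unit = [0] * num_positions
    for doc in documents:
        for i, part in enumerate(split_positions(doc, granularity)[:num_positions]):
            n = part.count(unit)
            counts[i] += n
            if n:
                docs_with_unit[i] += 1
    # Assumption: probability = share of documents containing the unit at that position.
    probabilities = [d / len(documents) for d in docs_with_unit]
    return counts, probabilities


docs = [
    "Follow us now.\n\nReal news here.",
    "Follow us today.\n\nMore text. Follow us again.",
]
print(distribution_quantity("Follow us", docs, num_positions=2))
```

Either returned list can serve as the distribution quantity that is compared against the distribution threshold, since A3 allows the occurrence count, the occurrence probability, or both.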
A4. The method according to any one of A1-A3, wherein the step of obtaining a preset sample set specifically comprises:
deduplicating a plurality of candidate samples, and obtaining the sample set from the deduplicated candidate samples.
A5. The method according to A4, wherein the step of deduplicating a plurality of candidate samples specifically comprises:
calculating the similarity between the titles of the candidate samples, and deduplicating the candidate samples whose title similarity is greater than a preset similarity threshold;
for the candidate samples whose title similarity is not greater than the preset similarity threshold, querying the keyword set corresponding to each candidate sample, and, if the number of identical keywords contained in the keyword sets of two candidate samples is greater than a preset quantity threshold, deduplicating the two candidate samples.
A6. The method according to A5, wherein the step of calculating the similarity between the titles of the candidate samples specifically comprises: calculating the similarity between the titles of the candidate samples by a longest common subsequence algorithm;
and the keyword set corresponding to each candidate sample is determined according to the inverse document frequency of each word obtained by performing word segmentation on the candidate sample; the quantity threshold is determined according to the Jaccard similarity algorithm.
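The two-stage deduplication of A4 to A6 can be sketched as follows. This is an illustration under several assumptions, not the method fixed by the source: the title similarity is the longest-common-subsequence length divided by the length of the longer title, word segmentation is approximated by whitespace splitting, each keyword set holds the `top_k` tokens with the highest inverse document frequency, and the quantity threshold is expressed directly as a Jaccard similarity threshold. All function names and threshold values are hypothetical.

```python
import math
from collections import Counter


def lcs_length(a, b):
    """Length of the longest common subsequence of two strings (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        curr = [0]
        for j, other in enumerate(b, start=1):
            if ch == other:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def title_similarity(a, b):
    """Assumption: similarity = LCS length divided by the longer title's length."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))


def keyword_sets(token_lists, top_k):
    """Pick each sample's top_k distinct tokens by inverse document frequency."""
    n = len(token_lists)
    df = Counter(tok for toks in token_lists for tok in set(toks))
    idf = {tok: math.log(n / count) for tok, count in df.items()}
    return [set(sorted(set(toks), key=lambda t: (-idf[t], t))[:top_k]) for toks in token_lists]


def jaccard(s, t):
    """Jaccard similarity: size of the intersection divided by size of the union."""
    union = s | t
    return len(s & t) / len(union) if union else 0.0


def deduplicate(samples, title_threshold, jaccard_threshold, top_k=5):
    """samples: list of (title, body). Keep the first sample of each duplicate group."""
    keywords = keyword_sets([body.split() for _, body in samples], top_k)
    kept = []
    for i, (title, _) in enumerate(samples):
        # Titles are compared first; keyword sets are only consulted when titles differ.
        duplicate = any(
            title_similarity(title, samples[j][0]) > title_threshold
            or jaccard(keywords[i], keywords[j]) > jaccard_threshold
            for j in kept
        )
        if not duplicate:
            kept.append(i)
    return [samples[i] for i in kept]
```

For example, two samples titled "Big sale today" and "Big sale today!" share a common subsequence of 14 characters out of 15 and would be merged at a title threshold of 0.8, while samples with unrelated titles but identical keyword sets are caught by the second stage.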
A7. The method according to any one of A1-A6, wherein the step of detecting promotion information contained in a document according to the determined promotion feature units specifically comprises:
setting a corresponding document detection model according to the determined promotion feature units and their distribution over the document positions, and detecting the promotion information contained in the document according to the document detection model.
A8. The method according to A7, wherein the step of setting a corresponding document detection model according to the determined promotion feature units and their distribution over the document positions specifically comprises:
setting the model parameters contained in the document detection model and the weight value corresponding to each model parameter according to the determined promotion feature units, their probability of occurrence at each document position, and preset position weights.
A9. The method according to A8, wherein the step of detecting the promotion information contained in the document according to the document detection model specifically comprises:
searching the information units contained in a document to be detected for information units that match the model parameters contained in the document detection model;
for each information unit found, determining the score of the information unit according to the document position of the information unit in the document to be detected and/or the weight value of the model parameter matching the information unit, and determining from the score whether the information unit is promotion information.
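A rule-based reading of A8 and A9 can be sketched as follows. The source does not fix how weights and scores are computed, so everything here is an assumption: document positions are reduced to the hypothetical labels "head", "middle" and "tail", a weight is the occurrence probability multiplied by the preset position weight, matching is done by substring search at paragraph granularity, and an information unit is reported when its score exceeds a hypothetical score threshold.

```python
def build_model(position_probs, position_weights):
    """Weight of (unit, position) = occurrence probability * preset position weight."""
    return {
        unit: {pos: prob * position_weights[pos] for pos, prob in probs.items()}
        for unit, probs in position_probs.items()
    }


def position_label(index, total):
    """Assumption: first paragraph is the head, last is the tail, the rest is the middle."""
    if index == 0:
        return "head"
    if index == total - 1:
        return "tail"
    return "middle"


def detect(paragraphs, model, score_threshold):
    """Return (paragraph index, unit, score) for every match scoring above the threshold."""
    hits = []
    total = len(paragraphs)
    for i, paragraph in enumerate(paragraphs):
        pos = position_label(i, total)
        for unit, weights in model.items():
            if unit in paragraph:
                score = weights.get(pos, 0.0)
                if score > score_threshold:
                    hits.append((i, unit, score))
    return hits


# Hypothetical statistics for one promotion feature unit.
model = build_model(
    {"follow us": {"head": 0.1, "middle": 0.05, "tail": 0.8}},
    {"head": 1.0, "middle": 0.5, "tail": 1.0},
)
doc = ["Analysts follow us closely.", "Prices rose.", "Please follow us for updates."]
print(detect(doc, model, score_threshold=0.5))
```

In this example the phrase occurs in both the first and the last paragraph, but only the occurrence at the tail is flagged, because that is where the hypothetical statistics say the unit usually appears as promotion.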
A10. The method according to A8 or A9, wherein the method further comprises: updating the document detection model according to the promotion information detected in documents; wherein the document detection model includes a deep learning model.
A11. The method according to any one of A1-A10, wherein after the step of detecting promotion information contained in a document according to the determined promotion feature units, the method further comprises:
pruning the document according to the document position at which the detected promotion information is located;
wherein, when the document position of the detected promotion information belongs to the head of the document, the promotion information and the paragraph content before it are deleted; when the document position of the detected promotion information belongs to the tail of the document, the promotion information and the paragraph content after it are deleted; when the document position of the detected promotion information belongs to the middle of the document, the sentence containing the promotion information is deleted.
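The position-dependent deletion of A11 can be sketched as follows. The source does not define how large the head and tail regions are, so `head_size` and `tail_size` are hypothetical parameters; the sketch also assumes that deletion at the head or tail removes whole paragraphs, and that sentences end with English punctuation followed by whitespace.

```python
import re


def prune_document(paragraphs, hit_index, unit, head_size=1, tail_size=1):
    """Remove promotion content depending on where the hit paragraph sits."""
    total = len(paragraphs)
    if hit_index < head_size:
        # Head: drop the hit paragraph and everything before it.
        return paragraphs[hit_index + 1:]
    if hit_index >= total - tail_size:
        # Tail: drop the hit paragraph and everything after it.
        return paragraphs[:hit_index]
    # Middle: drop only the sentences that contain the promotion unit.
    sentences = re.split(r"(?<=[.!?])\s+", paragraphs[hit_index])
    kept = " ".join(s for s in sentences if unit not in s)
    middle = [kept] if kept else []
    return paragraphs[:hit_index] + middle + paragraphs[hit_index + 1:]


paragraphs = [
    "Follow us!",
    "Intro.",
    "Body one. Follow us here. Body two.",
    "End.",
    "Follow us again.",
]
print(prune_document(paragraphs, 2, "Follow us"))
```

The function returns a new list and leaves its input unchanged, so several detected hits can be handled by recomputing indices against the pruned result.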
The invention also discloses: B12. A device for detecting promotion information, comprising:
an information unit extraction module, configured to obtain a preset sample set and extract the information units contained in each sample in the sample set;
a candidate unit determining module, configured to count the number of occurrences of each information unit in the sample set and determine the information units whose number of occurrences is greater than a preset first threshold as candidate feature units;
a promotion unit determining module, configured to, for each candidate feature unit, count the distribution of the candidate feature unit over the document positions and determine from the statistical result whether the candidate feature unit is a promotion feature unit;
a detection module, configured to detect promotion information contained in a document according to the determined promotion feature units.
B13. The device according to B12, wherein the promotion unit determining module specifically comprises:
a vector submodule, configured to set a vector for representing the distribution of the candidate feature unit over the document positions, wherein each element of the vector corresponds to one document position; if the distribution quantity of the candidate feature unit at a specified document position is greater than a preset distribution threshold, the element corresponding to the specified document position is non-zero; if the distribution quantity of the candidate feature unit at the specified document position is not greater than the preset distribution threshold, the element corresponding to the specified document position is zero;
a determination submodule, configured to determine the candidate feature unit to be a promotion feature unit when the number of non-zero elements in the vector is greater than a preset element threshold.
B14. The device according to B13, wherein the promotion unit determining module further comprises:
a document division submodule, configured to divide document content into a plurality of document positions according to a preset position division rule;
wherein the preset position division rule includes: a division rule based on paragraph granularity and a division rule based on sentence granularity; and the distribution quantity of the candidate feature unit at a specified document position includes: the number of occurrences and/or the probability of occurrence of the candidate feature unit at the specified document position.
B15. The device according to any one of B12-B14, wherein the information unit extraction module is further configured to:
deduplicate a plurality of candidate samples, and obtain the sample set from the deduplicated candidate samples.
B16. The device according to B15, wherein the information unit extraction module is specifically configured to:
calculate the similarity between the titles of the candidate samples, and deduplicate the candidate samples whose title similarity is greater than a preset similarity threshold;
for the candidate samples whose title similarity is not greater than the preset similarity threshold, query the keyword set corresponding to each candidate sample, and, if the number of identical keywords contained in the keyword sets of two candidate samples is greater than a preset quantity threshold, deduplicate the two candidate samples.
B17. The device according to B16, wherein the information unit extraction module is specifically configured to: calculate the similarity between the titles of the candidate samples by a longest common subsequence algorithm;
and the keyword set corresponding to each candidate sample is determined according to the inverse document frequency of each word obtained by performing word segmentation on the candidate sample; the quantity threshold is determined according to the Jaccard similarity algorithm.
B18. The device according to any one of B12-B17, wherein the detection module is specifically configured to:
set a corresponding document detection model according to the determined promotion feature units and their distribution over the document positions, and detect the promotion information contained in the document according to the document detection model.
B19. The device according to B18, wherein the detection module is specifically configured to:
set the model parameters contained in the document detection model and the weight value corresponding to each model parameter according to the determined promotion feature units, their probability of occurrence at each document position, and preset position weights.
B20. The device according to B19, wherein the detection module is specifically configured to:
search the information units contained in a document to be detected for information units that match the model parameters contained in the document detection model;
for each information unit found, determine the score of the information unit according to the document position of the information unit in the document to be detected and/or the weight value of the model parameter matching the information unit, and determine from the score whether the information unit is promotion information.
B21. The device according to B19 or B20, wherein the device further comprises:
an update module, configured to update the document detection model according to the promotion information detected in documents; wherein the document detection model includes a deep learning model.
B22. The device according to any one of B12-B21, wherein the device further comprises:
a deletion module, configured to prune the document according to the document position at which the detected promotion information is located;
wherein, when the document position of the detected promotion information belongs to the head of the document, the promotion information and the paragraph content before it are deleted; when the document position of the detected promotion information belongs to the tail of the document, the promotion information and the paragraph content after it are deleted; when the document position of the detected promotion information belongs to the middle of the document, the sentence containing the promotion information is deleted.
Claims (10)
1. A method for detecting promotion information, comprising:
obtaining a preset sample set, and extracting the information units contained in each sample in the sample set;
counting the number of occurrences of each information unit in the sample set, and determining the information units whose number of occurrences is greater than a preset first threshold as candidate feature units;
for each candidate feature unit, counting the distribution of the candidate feature unit over the document positions, and determining from the statistical result whether the candidate feature unit is a promotion feature unit;
detecting promotion information contained in a document according to the determined promotion feature units.
2. The method according to claim 1, wherein the step of counting the distribution of the candidate feature unit over the document positions and determining from the statistical result whether the candidate feature unit is a promotion feature unit specifically comprises:
setting a vector for representing the distribution of the candidate feature unit over the document positions, wherein each element of the vector corresponds to one document position;
if the distribution quantity of the candidate feature unit at a specified document position is greater than a preset distribution threshold, the element corresponding to the specified document position is non-zero; if the distribution quantity of the candidate feature unit at the specified document position is not greater than the preset distribution threshold, the element corresponding to the specified document position is zero;
when the number of non-zero elements in the vector is greater than a preset element threshold, determining the candidate feature unit to be a promotion feature unit.
3. The method according to claim 2, wherein before the step of setting a vector for representing the distribution of the candidate feature unit over the document positions, the method further comprises: dividing document content into a plurality of document positions according to a preset position division rule; wherein the preset position division rule includes: a division rule based on paragraph granularity and a division rule based on sentence granularity;
and the distribution quantity of the candidate feature unit at a specified document position includes: the number of occurrences and/or the probability of occurrence of the candidate feature unit at the specified document position.
4. The method according to any one of claims 1-3, wherein the step of obtaining a preset sample set specifically comprises:
deduplicating a plurality of candidate samples, and obtaining the sample set from the deduplicated candidate samples.
5. The method according to claim 4, wherein the step of deduplicating a plurality of candidate samples specifically comprises:
calculating the similarity between the titles of the candidate samples, and deduplicating the candidate samples whose title similarity is greater than a preset similarity threshold;
for the candidate samples whose title similarity is not greater than the preset similarity threshold, querying the keyword set corresponding to each candidate sample, and, if the number of identical keywords contained in the keyword sets of two candidate samples is greater than a preset quantity threshold, deduplicating the two candidate samples.
6. The method according to claim 5, wherein the step of calculating the similarity between the titles of the candidate samples specifically comprises: calculating the similarity between the titles of the candidate samples by a longest common subsequence algorithm;
and the keyword set corresponding to each candidate sample is determined according to the inverse document frequency of each word obtained by performing word segmentation on the candidate sample; the quantity threshold is determined according to the Jaccard similarity algorithm.
7. The method according to any one of claims 1-6, wherein the step of detecting promotion information contained in a document according to the determined promotion feature units specifically comprises:
setting a corresponding document detection model according to the determined promotion feature units and their distribution over the document positions, and detecting the promotion information contained in the document according to the document detection model.
8. The method according to claim 7, wherein the step of setting a corresponding document detection model according to the determined promotion feature units and their distribution over the document positions specifically comprises:
setting the model parameters contained in the document detection model and the weight value corresponding to each model parameter according to the determined promotion feature units, their probability of occurrence at each document position, and preset position weights.
9. The method according to claim 8, wherein the step of detecting the promotion information contained in the document according to the document detection model specifically comprises:
searching the information units contained in a document to be detected for information units that match the model parameters contained in the document detection model;
for each information unit found, determining the score of the information unit according to the document position of the information unit in the document to be detected and/or the weight value of the model parameter matching the information unit, and determining from the score whether the information unit is promotion information.
10. A device for detecting promotion information, comprising:
an information unit extraction module, configured to obtain a preset sample set and extract the information units contained in each sample in the sample set;
a candidate unit determining module, configured to count the number of occurrences of each information unit in the sample set and determine the information units whose number of occurrences is greater than a preset first threshold as candidate feature units;
a promotion unit determining module, configured to, for each candidate feature unit, count the distribution of the candidate feature unit over the document positions and determine from the statistical result whether the candidate feature unit is a promotion feature unit;
a detection module, configured to detect promotion information contained in a document according to the determined promotion feature units.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710113764.5A CN106909669B (en) | 2017-02-28 | 2017-02-28 | Method and device for detecting promotion information |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710113764.5A CN106909669B (en) | 2017-02-28 | 2017-02-28 | Method and device for detecting promotion information |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106909669A (en) | 2017-06-30 |
CN106909669B (en) | 2020-02-11 |
Family
ID=59209410
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710113764.5A Active CN106909669B (en) | 2017-02-28 | 2017-02-28 | Method and device for detecting promotion information |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106909669B (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108319582A (en) * | 2017-12-29 | 2018-07-24 | 北京城市网邻信息技术有限公司 | Processing method, device and the server of text message |
CN108805132A (en) * | 2018-06-01 | 2018-11-13 | 华中科技大学 | A kind of rubbish text filter method based on deep learning |
CN109815395A (en) * | 2018-12-26 | 2019-05-28 | 北京中科闻歌科技股份有限公司 | Webpage garbage information filtering method, device and storage medium |
CN110275993A (en) * | 2019-06-25 | 2019-09-24 | 苏州梦嘉信息技术有限公司 | The management system and method for product placement in wechat public platform article |
CN110362680A (en) * | 2019-06-14 | 2019-10-22 | 西安交通大学 | A kind of soft wide detection and advertisement abstracting method based on figure Crosslinking Structural |
CN111026850A (en) * | 2019-12-23 | 2020-04-17 | 园宝科技(武汉)有限公司 | Intellectual property matching technology of bidirectional coding representation of self-attention mechanism |
CN112905743A (en) * | 2021-02-20 | 2021-06-04 | 北京百度网讯科技有限公司 | Text object detection method and device, electronic equipment and storage medium |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101251841A (en) * | 2007-05-17 | 2008-08-27 | 华东师范大学 | Method for establishing and searching feature matrix of Web document based on semantics |
CN101655866A (en) * | 2009-08-14 | 2010-02-24 | 北京中献电子技术开发中心 | Automatic decimation method of scientific and technical terminology |
CN102591854A (en) * | 2012-01-10 | 2012-07-18 | 凤凰在线(北京)信息技术有限公司 | Advertisement filtering system and advertisement filtering method specific to text characteristics |
CN102918532A (en) * | 2010-06-01 | 2013-02-06 | 微软公司 | Detection of junk in search result ranking |
CN103970801A (en) * | 2013-02-05 | 2014-08-06 | 腾讯科技(深圳)有限公司 | Method and device for recognizing microblog advertisement blog articles |
CN104239539A (en) * | 2013-09-22 | 2014-12-24 | 中科嘉速(北京)并行软件有限公司 | Microblog information filtering method based on multi-information fusion |
CN104679730A (en) * | 2015-02-13 | 2015-06-03 | 刘秀磊 | Webpage summarization extraction method and device thereof |
-
2017
- 2017-02-28 CN CN201710113764.5A patent/CN106909669B/en active Active
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101251841A (en) * | 2007-05-17 | 2008-08-27 | 华东师范大学 | Method for establishing and searching feature matrix of Web document based on semantics |
CN101655866A (en) * | 2009-08-14 | 2010-02-24 | 北京中献电子技术开发中心 | Automatic decimation method of scientific and technical terminology |
CN102918532A (en) * | 2010-06-01 | 2013-02-06 | 微软公司 | Detection of junk in search result ranking |
CN102591854A (en) * | 2012-01-10 | 2012-07-18 | 凤凰在线(北京)信息技术有限公司 | Advertisement filtering system and advertisement filtering method specific to text characteristics |
CN103970801A (en) * | 2013-02-05 | 2014-08-06 | 腾讯科技(深圳)有限公司 | Method and device for recognizing microblog advertisement blog articles |
CN104239539A (en) * | 2013-09-22 | 2014-12-24 | 中科嘉速(北京)并行软件有限公司 | Microblog information filtering method based on multi-information fusion |
CN104679730A (en) * | 2015-02-13 | 2015-06-03 | 刘秀磊 | Webpage summarization extraction method and device thereof |
Non-Patent Citations (1)
Title |
---|
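Geng Chong et al.: "Research on Chinese automatic summarization based on word position and co-occurrence features", Proceedings of the 5th National Conference on Information Retrieval * |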
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108319582A (en) * | 2017-12-29 | 2018-07-24 | 北京城市网邻信息技术有限公司 | Processing method, device and the server of text message |
CN108805132A (en) * | 2018-06-01 | 2018-11-13 | 华中科技大学 | A kind of rubbish text filter method based on deep learning |
CN108805132B (en) * | 2018-06-01 | 2021-08-20 | 华中科技大学 | Rubbish text filtering method based on deep learning |
CN109815395A (en) * | 2018-12-26 | 2019-05-28 | 北京中科闻歌科技股份有限公司 | Webpage garbage information filtering method, device and storage medium |
CN109815395B (en) * | 2018-12-26 | 2021-06-08 | 北京中科闻歌科技股份有限公司 | Webpage spam filtering method and device and storage medium |
CN110362680A (en) * | 2019-06-14 | 2019-10-22 | 西安交通大学 | A kind of soft wide detection and advertisement abstracting method based on figure Crosslinking Structural |
CN110362680B (en) * | 2019-06-14 | 2021-07-13 | 西安交通大学 | Soft-wide detection and advertisement extraction method based on graph network structure analysis |
CN110275993A (en) * | 2019-06-25 | 2019-09-24 | 苏州梦嘉信息技术有限公司 | The management system and method for product placement in wechat public platform article |
CN111026850A (en) * | 2019-12-23 | 2020-04-17 | 园宝科技(武汉)有限公司 | Intellectual property matching technology of bidirectional coding representation of self-attention mechanism |
CN112905743A (en) * | 2021-02-20 | 2021-06-04 | 北京百度网讯科技有限公司 | Text object detection method and device, electronic equipment and storage medium |
CN112905743B (en) * | 2021-02-20 | 2023-08-01 | 北京百度网讯科技有限公司 | Text object detection method, device, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN106909669B (en) | 2020-02-11 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN106909669A (en) | The detection method and device of a kind of promotion message | |
CN110705294B (en) | Named entity recognition model training method, named entity recognition method and named entity recognition device | |
CN104239539B (en) | A kind of micro-blog information filter method merged based on much information | |
CN106776574B (en) | User comment text mining method and device | |
CN110968684B (en) | Information processing method, device, equipment and storage medium | |
CN104008186B (en) | The method and apparatus that keyword is determined from target text | |
CN108073568A (en) | keyword extracting method and device | |
CN111967262A (en) | Method and device for determining entity tag | |
Handani et al. | Sentiment analysis for go-jek on google play store | |
CN102411563A (en) | Method, device and system for identifying target words | |
CN104462301B (en) | A kind for the treatment of method and apparatus of network data | |
CN111950254A (en) | Method, device and equipment for extracting word features of search sample and storage medium | |
CA3059929C (en) | Text searching method, apparatus, and non-transitory computer-readable storage medium | |
CN111860981B (en) | Enterprise national industry category prediction method and system based on LSTM deep learning | |
CN104503597B (en) | stroke input method, device and system | |
CN111860575B (en) | Method and device for processing object attribute information, electronic equipment and storage medium | |
CN108763496A (en) | A kind of sound state data fusion client segmentation algorithm based on grid and density | |
CN103324742B (en) | The method and apparatus of recommended keywords | |
CN112597283A (en) | Notification text information entity attribute extraction method, computer equipment and storage medium | |
CN104951435A (en) | Method and device for displaying keywords intelligently during chatting process | |
CN110019653B (en) | Social content representation method and system fusing text and tag network | |
Mohammed et al. | Classifying unsolicited bulk email (UBE) using python machine learning techniques | |
CN113268615A (en) | Resource label generation method and device, electronic equipment and storage medium | |
CN112380847A (en) | Interest point processing method and device, electronic equipment and storage medium | |
CN113642320A (en) | Method, device, equipment and medium for extracting document directory structure |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
CP01 | Change in the name or title of a patent holder | ||
CP01 | Change in the name or title of a patent holder |
Address after: 100089 710, 7 / F, building 1, zone 1, No.3, Xisanhuan North Road, Haidian District, Beijing Patentee after: Beijing time Ltd. Address before: 100089 710, 7 / F, building 1, zone 1, No.3, Xisanhuan North Road, Haidian District, Beijing Patentee before: BEIJING TIME Co.,Ltd. |