CN106909669A - Method and device for detecting promotional information - Google Patents

Method and device for detecting promotional information

Info

Publication number
CN106909669A
Authority
CN
China
Prior art keywords
unit
document
candidate feature
candidate
default
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710113764.5A
Other languages
Chinese (zh)
Other versions
CN106909669B (en)
Inventor
张德斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing time Ltd.
Original Assignee
Beijing Time Ltd By Share Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Time Ltd By Share Ltd
Priority to CN201710113764.5A
Publication of CN106909669A
Application granted
Publication of CN106909669B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and device for detecting promotional information, and relates to the technical field of text filtering. The method includes: obtaining a preset sample set and extracting the information units contained in each sample in the set; counting the number of occurrences of each information unit in the sample set, and taking the information units whose occurrence count exceeds a preset first threshold as candidate feature units; for each candidate feature unit, counting the distribution of that unit across document positions and determining from the statistics whether it is a promotional feature unit; and detecting the promotional information contained in a document according to the promotional feature units so determined. The invention can thus filter out advertising or spam promotional information efficiently and accurately, so that clean news content can be extracted by machine crawling, which greatly improves the efficiency of compiling news from self-media platforms.

Description

Method and device for detecting promotional information
Technical field
The present invention relates to the technical field of text filtering, and in particular to a method and device for detecting promotional information.
Background
With the development of Internet technology, the self-media era has arrived. Compared with traditional news media, news on self-media platforms is more timely and comes from a wider range of sources, and the openness of these platforms means that every user can be both a reader of news and a producer and publisher of it. At present, more and more breaking news is published promptly through self-media platforms such as WeChat and Weibo, and people are increasingly accustomed to getting the news that interests them from these platforms. At the same time, forwarding between users spreads self-media news effectively.
However, in the course of realizing the present invention, the inventor found at least the following problem in the prior art. To compile news from self-media platforms and make it convenient for users to read, the news content can be collected by machine crawling. But because news content on self-media platforms is often mixed with advertising or spam promotional information, crawling it with the prior art cannot filter that information out accurately, so clean news content cannot be obtained.
Summary of the invention
In view of the above problem, the present invention is proposed to provide a method and device for detecting promotional information that overcome the problem or at least partly solve it.
According to one aspect of the invention, a method for detecting promotional information is provided, including: obtaining a preset sample set and extracting the information units contained in each sample in the set; counting the number of occurrences of each information unit in the sample set, and taking the information units whose occurrence count exceeds a preset first threshold as candidate feature units; for each candidate feature unit, counting the distribution of that unit across document positions and determining from the statistics whether it is a promotional feature unit; and detecting the promotional information contained in a document according to the promotional feature units so determined.
According to another aspect of the invention, a device for detecting promotional information is provided, including: an information unit extraction module, for obtaining a preset sample set and extracting the information units contained in each sample in the set; a candidate unit determination module, for counting the number of occurrences of each information unit in the sample set and taking the information units whose occurrence count exceeds a preset first threshold as candidate feature units; a promotional unit determination module, for counting, for each candidate feature unit, the distribution of that unit across document positions and determining from the statistics whether it is a promotional feature unit; and a detection module, for detecting the promotional information contained in a document according to the promotional feature units so determined.
The invention thus provides a method and device for detecting promotional information. Information units are extracted from a preset sample set; the candidate feature units among them are determined from their occurrence counts in the sample set; the promotional feature units among the candidates are determined from the distribution of each candidate across document positions; and the promotional information contained in a target document is detected according to the promotional feature units so selected. Advertising or spam promotional information is thereby filtered out effectively and accurately when news is extracted from self-media platforms by machine crawling, so that clean news content can be extracted, which greatly improves the efficiency of compiling news from self-media platforms.
The above is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the specification, and so that the above and other objects, features and advantages of the invention are more apparent, specific embodiments of the invention are set out below.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art from the detailed description of the preferred embodiments below. The drawings serve only to illustrate the preferred embodiments and are not to be taken as limiting the invention. Throughout the drawings, the same reference symbols denote the same parts. In the drawings:
Fig. 1 is a flowchart of the method for detecting promotional information provided by Embodiment One of the present invention;
Fig. 2 is a flowchart of the method for detecting promotional information provided by Embodiment Two of the present invention;
Fig. 3 is a schematic structural diagram of the device for detecting promotional information provided by Embodiment Three of the present invention;
Fig. 4 is a schematic structural diagram of the device for detecting promotional information provided by Embodiment Four of the present invention;
Fig. 5 is a histogram of the position distribution within documents of candidate feature units associated with time in an embodiment of the present invention;
Fig. 6 is a histogram of the position distribution within documents of candidate feature units associated with advertising or spam promotional information in an embodiment of the present invention.
Detailed description of the embodiments
Exemplary embodiments of the present disclosure are described in more detail below with reference to the drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be understood more thoroughly and its scope conveyed fully to those skilled in the art.
Embodiment one
Fig. 1 shows a method for detecting promotional information provided by the present invention. The method includes:
Step S110: Obtain a preset sample set, and extract the information units contained in each sample in the set.
So that a computer can recognize sample news content, the preset sample news content containing advertising or spam promotional information must first be split according to some rule, and the information units contained in each sample extracted from it. Here, the preset sample set refers to reasonably representative self-media news content that contains advertising or spam promotional information; it is generally selected and configured by those skilled in the art on the basis of experience. An information unit is a basic unit of the sample news content; it is typically a characteristic phrase produced by splitting the sample news content, or a word or expression with certain characteristics. The invention places no specific limit on the rules for setting the preset sample set or on the concrete form of the information units; those skilled in the art can set them flexibly according to the actual situation.
Step S120: Count the number of occurrences of each information unit in the sample set, and take the information units whose occurrence count exceeds a preset first threshold as candidate feature units.
Advertising and spam promotional information is information that each news publisher on a self-media platform deliberately repeats, so different news items from the same publisher usually contain the same advertising or spam promotional information. The occurrences in the sample set of the information units extracted in step S110 are counted; when the occurrence count of an information unit exceeds the preset first threshold, that unit is strongly suspected of being advertising or spam promotional information and is therefore taken as a candidate feature unit.
The preset first threshold is determined from how often advertising or spam promotional information typically repeats in sample news content from the same publisher; an information unit that repeats more often than this is taken as a candidate feature unit suspected of being advertising or spam promotional information. The invention places no specific limit on the rule for determining the first threshold; those skilled in the art can determine it flexibly from test data and experience.
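The counting and thresholding in step S120 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `select_candidate_units`, the representation of each sample as a list of already extracted information units, and the toy threshold are all assumptions.

```python
from collections import Counter

def select_candidate_units(samples, first_threshold):
    """Return the information units that occur more than first_threshold times.

    samples: list of samples, each a list of information units (assumed
    to have been extracted already, as in step S110).
    """
    counts = Counter(unit for units in samples for unit in units)
    return {unit for unit, n in counts.items() if n > first_threshold}

# Toy sample set: one phrase is repeated by the publisher in every article.
samples = [
    ["Breaking story today", "Follow our account"],
    ["Another headline", "Follow our account"],
    ["Third report", "Follow our account"],
]
candidates = select_candidate_units(samples, first_threshold=2)
```

With this toy data only the repeated phrase passes the threshold; in practice the threshold would be an empirical value chosen from test data, as the text states.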
Step S130: For each candidate feature unit, count the distribution of that unit across document positions, and determine from the statistics whether it is a promotional feature unit.
After the preliminary screening of step S120, most information units containing advertising or spam promotional information will have been identified as candidate feature units, but some information units that contain normal news content and repeat more often than the first threshold will also have been identified as candidate feature units.
Through extensive experiments and repeated comparison, the inventor found that candidate feature units containing normal news content are not deliberately repeated by the publisher, so their positions within samples tend to be fairly evenly distributed; candidate feature units containing advertising or spam promotional information are deliberately repeated by the publisher, so their positions within samples tend to be concentrated. On the basis of this finding, the invention uses the position distribution of candidate feature units within samples to screen them further, and takes candidate feature units whose position distribution is relatively concentrated as promotional feature units.
Step S140: Detect the promotional information contained in a document according to the promotional feature units so determined.
The steps above yield promotional feature units extracted from the preset sample set. These units are then used to examine a document to be checked that was obtained by machine crawling, so that the corresponding promotional information in that document is identified effectively. Finally, the identified promotional information is removed from the document, leaving relatively clean news content.
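A minimal sketch of this detection and removal step is given here. It assumes that the document to be checked has already been split into information units and that a unit is flagged by exact match against the set of promotional feature units; the text does not specify the matching rule, and the name `strip_promotional_units` is hypothetical.

```python
def strip_promotional_units(units, promotional_units):
    """Split a document's units into (kept, detected).

    Exact string matching is an assumption of this sketch.
    """
    kept = [u for u in units if u not in promotional_units]
    detected = [u for u in units if u in promotional_units]
    return kept, detected

document_units = ["Headline", "Body text", "Follow our account"]
promotional_units = {"Follow our account"}
kept, detected = strip_promotional_units(document_units, promotional_units)
```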
The method for detecting promotional information provided by the invention thus extracts the information units in a preset sample set; determines the candidate feature units among them from their occurrence counts in the sample set; determines the promotional feature units among the candidates from the distribution of each candidate across document positions; and detects the promotional information contained in a target document according to the promotional feature units so selected. Advertising or spam promotional information is thereby filtered out effectively and accurately when news is extracted from self-media platforms by machine crawling, so that clean news content can be extracted, which greatly improves the efficiency of compiling news from self-media platforms.
Embodiment two
Fig. 2 shows a method for detecting promotional information provided by the present invention. The method includes:
Step S210: Obtain a preset sample set, and extract the information units contained in each sample in the set.
So that a computer can recognize sample news content, the preset sample news content containing advertising or spam promotional information must first be split according to some rule, and the information units contained in each sample extracted from it. Because the same news item may be reposted many times, deduplicating before the preset sample set is obtained effectively reduces the computation needed to obtain the sample set and improves efficiency. The step of obtaining the preset sample set therefore specifically includes deduplicating multiple candidate samples and obtaining the sample set from the candidate samples that remain after deduplication.
Specifically, deduplication includes computing the similarity between the titles of the candidate samples and deduplicating candidate samples whose title similarity exceeds a preset similarity threshold. For candidate samples whose title similarity does not exceed the preset similarity threshold, the keyword set of each candidate sample is looked up; if the number of identical keywords in the keyword sets of two candidate samples exceeds a preset quantity threshold, the two candidate samples are deduplicated. Preferably, the title similarity is computed with a longest common subsequence (LCS) algorithm, the keyword set of each candidate sample is determined from the inverse document frequency (IDF) of each word obtained by segmenting the candidate sample, and the quantity threshold is determined by the Jaccard similarity.
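A title similarity based on the longest common subsequence can be sketched as follows. The text does not say how the LCS length is turned into a similarity score, so dividing by the length of the longer title is an assumption of this sketch, as are the function names.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def title_similarity(a, b):
    """LCS length normalised by the longer title (normalisation is assumed)."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))
```

The dynamic program keeps only two rows, so memory grows with the length of one title rather than with the product of both lengths.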
To make the above easier to understand, the deduplication process is described below with a specific example:
1. Perform Chinese word segmentation and stop-word removal on the titles and body text of all sample articles.
2. In a distributed manner, count the term frequency (TF) of each word in every sample article and compute the corresponding inverse document frequency (IDF), then compute the TF*IDF score of each word.
3. Take the first 20 words of the title segmentation result to form the keyword set (20 is only the value used in this example; in other embodiments those skilled in the art can set the number of keywords according to the actual situation). When the title yields fewer than 20 words, the remaining keywords are filled in, in order, from the highest-scoring words of the body text ranked by TF*IDF score from high to low.
4. Build buckets from the keyword sets of all articles. (A bucket table is a finer-grained way of partitioning data; bucketing adds extra structure to a table that query processing can exploit for higher efficiency.) The primary key of each bucket is one unique keyword from the keyword sets, so the articles in a bucket may be similar.
5. To compute the similarity of an article, first find the 20 buckets that correspond to it (each bucket corresponds to one keyword and each article has 20 keywords, so each article corresponds to 20 buckets), then compute the similarity between its title and the titles of all articles in those buckets using the longest common subsequence algorithm. When the title similarity exceeds 0.75 (the preset similarity threshold in this example, set by those skilled in the art according to the actual situation), the two articles are judged to be sample articles with the same content and are deduplicated.
6. When the title similarity does not exceed 0.75, compare the 20 keywords of the two articles pairwise. When the number of identical keywords exceeds 16, the two articles are judged to be sample articles with the same content and are deduplicated. Here 16 is the preset quantity threshold of this example, determined by the Jaccard similarity: with 20 words each and x identical words, the Jaccard similarity is x/(20-x+20-x+x), which for x = 16 is 16/24, approximately 0.67.
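The two-stage decision in steps 5 and 6, and the Jaccard arithmetic behind the threshold of 16, can be sketched as follows. The function names are hypothetical, the title similarity is assumed to have been computed elsewhere, and "exceeds 16" is read literally as a strict comparison.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two keyword collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def is_duplicate(title_sim, kw_a, kw_b, sim_threshold=0.75, count_threshold=16):
    """Titles first; fall back to the count of shared keywords."""
    if title_sim > sim_threshold:
        return True
    return len(set(kw_a) & set(kw_b)) > count_threshold

# Two keyword sets of 20 words each that share exactly 16 words.
base = {f"w{i}" for i in range(20)}
other_16 = {f"w{i}" for i in range(16)} | {"x0", "x1", "x2", "x3"}
other_17 = {f"w{i}" for i in range(17)} | {"x0", "x1", "x2"}
```

For two 20-word sets sharing 16 words, `jaccard` gives 16/(20 + 20 - 16) = 16/24, which matches the expression in step 6.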
In this example, when two articles are compared, title similarity is checked first and keyword similarity second. On the one hand, title comparison involves little computation and runs quickly, and in most cases articles with similar content also have similar titles. On the other hand, comparing by keywords alone or by titles alone runs into a bottleneck and gives results that are not accurate enough. The invention therefore computes similarity by comparing titles first and keywords afterwards, so that the two methods complement each other.
In the course of realizing the invention, the inventor found that building buckets from the keyword sets effectively reduces the amount of computation. If keyword similarity is computed directly without buckets, the complexity is O(n^2), where n is the total number of sample articles. With buckets built per keyword, the complexity is O(k*m^2), where k is the total number of sample keywords and m is the average number of articles per keyword bucket, with k << n and m << n. When n is 100 million (that is, there are 100 million sample articles), the corresponding k is only in the tens of thousands, so the complexity with buckets is lower. Moreover, because the primary key of each bucket is one unique keyword from the keyword sets, only articles in the same bucket can be similar; articles that share no bucket necessarily have no keyword in common and can be excluded directly, which reduces the computation further. In addition, two articles are judged similar, and the calculation stops and deduplication proceeds, as soon as their keyword similarity exceeds the preset quantity threshold. Bucketing finds similar articles sooner and lets the calculation stop earlier, so the bucketed algorithm tends toward its best-case complexity rather than its worst case of O(k*m^2).
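The bucketing idea can be sketched with an in-memory dictionary that maps each keyword to the articles containing it. This is an illustration only: the text describes a distributed bucket table, and the names `build_buckets` and `candidate_pairs` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def build_buckets(article_keywords):
    """Map each keyword (the bucket key) to the ids of articles containing it."""
    buckets = defaultdict(set)
    for article_id, keywords in article_keywords.items():
        for keyword in keywords:
            buckets[keyword].add(article_id)
    return buckets

def candidate_pairs(buckets):
    """Only articles that share at least one bucket need to be compared."""
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

article_keywords = {"a": {"x", "y"}, "b": {"y", "z"}, "c": {"q"}}
pairs = candidate_pairs(build_buckets(article_keywords))
```

Article "c" shares no keyword with the others, so it never enters a comparison, which is the exclusion effect the paragraph describes.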
In the course of realizing the invention, the inventor also found that the computation of keyword Jaccard similarity can be sped up with a data structure that trades space for time. First build an index of size 65536 (because, under the Chinese character encoding rules, 65536 positions can represent all Chinese characters). For every word in the keyword set of every article, the first character of the word serves as the index position and the remaining characters as an attribute value at that position. Each parent index can have multiple sub-indexes, and each sub-index carries an attribute value recording which articles it belongs to. This attribute value is a binary number: with M articles it has M bits, and the bit for each article that contains the keyword is 1. When keyword similarity between articles is needed, pairwise calculation is no longer required; it is enough to look up repeated words in the keyword data structure of the same bucket. An array of M elements is also used (one element per article, so M articles correspond to an M-element array), with every element initialized to 0. All sub-indexes under the same parent index are compared; when an identical sub-index exists, the binary number recording its article membership is read and 1 is added to the array element of each corresponding article. Each element is then checked against 16 (the preset quantity threshold mentioned above); when an element exceeds 16, the corresponding article is a similar article, and the calculation can stop and deduplication proceed. Previously, two sets of 20 words had to be compared one by one, so M articles in a bucket required M*400 comparisons in the worst case. After the improvement, only 20 fast lookups are needed to check whether identical sub-indexes exist under 20 parent indexes, which is 20*M queries in the worst case. In practice the number of sub-indexes under each parent index is far smaller than M, so the computation is reduced by a further multiple, and the algorithm finds duplicate articles sooner.
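One possible reading of this index is sketched below: a parent index keyed by the first character of each keyword, sub-indexes keyed by the remaining characters, and an integer bitmask per sub-index recording which articles contain the keyword. A Python dictionary stands in for the 65536-slot table, keywords are assumed to be non-empty strings, and the function names are hypothetical.

```python
def build_index(article_keywords):
    """article_keywords: list of keyword sets; list position = article number."""
    index = {}
    for article_no, keywords in enumerate(article_keywords):
        for word in keywords:
            sub_indexes = index.setdefault(word[0], {})
            # Set the bit of this article in the sub-index bitmask.
            sub_indexes[word[1:]] = sub_indexes.get(word[1:], 0) | (1 << article_no)
    return index

def shared_keyword_counts(index, keywords, n_articles):
    """Per-article count of keywords shared with the given keyword set."""
    counts = [0] * n_articles
    for word in keywords:
        mask = index.get(word[0], {}).get(word[1:], 0)
        for article_no in range(n_articles):
            if (mask >> article_no) & 1:
                counts[article_no] += 1
    return counts

articles = [{"apple", "banana", "cherry"}, {"apple", "banana", "date"}, {"egg"}]
index = build_index(articles)
counts = shared_keyword_counts(index, articles[0], len(articles))
```

Each keyword of the query article costs one lookup, and any article whose counter exceeds the quantity threshold (16 in the example above) would be treated as a duplicate.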
After deduplication is complete, the information units contained in each sample in the sample set are extracted. Specifically, in this embodiment the article content can be split at punctuation marks and line breaks to obtain the information units in a sample. For example, "Press and hold the QR code to 'identify' and follow, more surprises await you" can be split into two information units: "Press and hold the QR code to 'identify' and follow" and "more surprises await you". In other embodiments, other rules can be used to split the article content and extract information units; the invention places no specific limit on this, and those skilled in the art can set the rule flexibly.
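Splitting at punctuation marks and line breaks can be sketched with a regular expression. The exact set of separator characters is an assumption of this sketch (common ASCII and full-width Chinese punctuation plus line breaks), as is the function name.

```python
import re

# Separator set is assumed: ASCII and full-width punctuation plus line breaks.
_SEPARATORS = re.compile(r"[\r\n,.;:!?，。；：！？、]+")

def extract_information_units(text):
    """Split text at separators and drop empty fragments."""
    parts = _SEPARATORS.split(text)
    return [p.strip() for p in parts if p.strip()]

units = extract_information_units(
    "Press and hold the QR code to 'identify' and follow, more surprises await you"
)
```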
Step S220: Count the number of occurrences of each information unit in the sample set, and take the information units whose occurrence count exceeds a preset first threshold as candidate feature units.
In the course of realizing the invention, the inventor found from analysis of historical data that the articles a given news publisher issues within a period of time contain essentially the same advertising or spam promotional information, so the information units associated with that information necessarily repeat at high frequency. Extensive statistical analysis yields a critical value of the repetition count that separates information units associated with advertising or spam promotional information from ordinary information units; this critical value is the preset first threshold. All information units are screened with the preset first threshold, and those whose occurrence count exceeds it are taken as candidate feature units.
However, because the preset first threshold is an empirical value, it cannot filter out all normal repeated content. The position distribution of normally repeated news phrases and that of advertising phrases are therefore considered next: an L0 norm constraint is used to filter out normal content more accurately and to obtain accurate news advertising phrases together with data such as their position distribution and repetition counts, and these data are finally used to build a model for identifying promotional information in news.
Step S230: For each candidate feature unit, count the distribution of that unit across document positions, and determine from the statistics whether it is a promotional feature unit.
The screening of step S220 roughly removes most information units of normal content, but the remaining information units (the candidate feature units) may include, besides units associated with advertising or spam promotional information, units associated with time that appear in normal news. Through statistical analysis the inventor found that, among candidate feature units, those associated with time are not deliberately repeated by a person, so their positions within documents are fairly evenly distributed (as shown in Fig. 5), whereas those associated with advertising or spam promotional information are deliberately repeated by a person, so their positions within documents are relatively concentrated (as shown in Fig. 6). Counting the distribution of candidate feature units across document positions therefore makes it possible to screen out the promotional feature units effectively.
Specifically, further screening can be done with an L0 norm constraint on the distribution. First, the document content is divided into multiple document positions according to a preset position division rule, where the preset rules include division rules based on paragraph granularity and division rules based on sentence granularity. Next, a vector is set up to represent the distribution of the candidate feature unit across document positions, with each element corresponding to one document position. If the distribution quantity of the candidate feature unit at a given document position exceeds a preset distribution threshold, the element for that position is non-zero; otherwise the element is zero. The distribution quantity of a candidate feature unit at a given document position includes its occurrence count and/or its occurrence probability at that position. Finally, when the number of non-zero elements in the vector exceeds a preset element threshold, the candidate feature unit is determined to be a promotional feature unit.
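The distribution vector and its L0 norm can be sketched as follows. The sketch uses occurrence counts as the distribution quantity (the text also allows probabilities), and the function names and the toy position labels are assumptions.

```python
from collections import Counter

def distribution_vector(positions, all_positions, distribution_threshold):
    """One element per document position; counts at or below the threshold become 0.

    positions: position labels at which the candidate unit was observed,
    one entry per occurrence across the sample documents.
    """
    counts = Counter(positions)
    return [
        counts[p] if counts[p] > distribution_threshold else 0
        for p in all_positions
    ]

def l0_norm(vector):
    """Number of non-zero elements."""
    return sum(1 for v in vector if v != 0)

# Toy data: 12 occurrences at position -1, 3 at position 1, 11 at position 2.
observed = [-1] * 12 + [1] * 3 + [2] * 11
vector = distribution_vector(observed, all_positions=[1, 2, 3, -1],
                             distribution_threshold=10)
```

The resulting L0 norm would then be compared with the preset element threshold to decide whether the unit is promotional.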
In the course of implementing the invention, the inventor considered four position division rules: paragraph-granularity distribution percentage, sentence-granularity distribution percentage, paragraph-granularity forward/reverse ordering, and sentence-granularity forward/reverse ordering. Through extensive experiments, the inventor found the following. For articles from similar public accounts, promotional information is concentrated mainly at the head or the tail of the article. Articles with different content vary widely in their total number of paragraphs or sentences, so even when the promotional information always sits in the first paragraph or the last few paragraphs, its position expressed as a percentage differs greatly. For the same tail promotional information, if the content is similar, the number of paragraphs it occupies is almost the same. Tail promotional information also tends to use very short paragraphs, and in this case the paragraph-granularity forward/reverse ordering rule works best. In practice, information publishers usually place promotional information at the very beginning of the article (i.e. the first few sentences of the first paragraph) or concentrate it at the tail of the article.
Thus, for two articles laid out by the same editor, if the promotional information is at the head of the article (for example, the first paragraph), the position distribution of the candidate feature unit can be counted in forward order, i.e. the first paragraph is recorded as +1. Similarly, when the editor habitually places promotional information at the tail of the article (for example, the last paragraph), forward counting would make the distribution statistics diverge, because the number of paragraphs differs greatly between articles: in an article with 20 paragraphs the last paragraph is recorded as +20, while in an article with 30 paragraphs it is recorded as +30. Reverse counting is needed here, so that the last paragraph is recorded as -1 regardless of how many paragraphs the article has and the distribution statistics show no large deviation. Forward/reverse ordering therefore reflects the position distribution more accurately. In addition, the inventor found that paragraph-granularity forward/reverse ordering is the most accurate, because layout is mostly paragraph-based and paragraphs therefore better reflect the editor's layout intent, whereas sentences reflect the author's writing habits or writing level. Hence, in this embodiment, the paragraph-granularity forward/reverse ordering rule can be used to further improve accuracy.
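The forward/reverse counting described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names are hypothetical, and the rule for choosing between the forward and reverse index (use whichever end of the article is closer) is an assumption, since the text does not state how the choice is made.

```python
def forward_index(i: int) -> int:
    """Position counted from the start: the first paragraph (i == 0) is +1."""
    return i + 1


def reverse_index(i: int, total: int) -> int:
    """Position counted from the end: the last paragraph is -1."""
    return i - total


def signed_position(i: int, total: int) -> int:
    """Assumed rule: label a paragraph by whichever end of the article is closer."""
    fwd, rev = forward_index(i), reverse_index(i, total)
    return fwd if fwd <= -rev else rev


# The last paragraph gets the same label whether the article has 20 or 30 paragraphs.
print(signed_position(19, 20), signed_position(29, 30))
```

With forward counting alone the two last paragraphs would be labelled +20 and +30, which is the divergence the text describes.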
The preset distribution threshold and the preset element threshold both need to be determined through extensive experiments. Specifically, different values are tried for the distribution threshold and the element threshold, and for each combination the separation between the candidate feature units corresponding to normal content and the candidate feature units corresponding to advertising information is compared; the values that give the best separation are taken as the preset distribution threshold and the preset element threshold. In implementing the invention, the inventor found through extensive experiments that when the preset first threshold in step S220 is 20, the preset distribution threshold is 10 and the preset element threshold is 3, the separation between the candidate feature units corresponding to normal content and those corresponding to advertising information is best. In this case, when the number of occurrences of a candidate feature unit at a given position in the articles is greater than 10, the vector element corresponding to that position is non-zero; otherwise it is 0. This gives, for each candidate feature unit, the mapping (x ≤ 10, y = 0; x > 10, y = x) and its L0 norm value n, where n is the number of non-zero elements in the vector y0, y1, ..., yi. When n >= 3 (i.e. the element threshold is 3), the candidate feature unit is judged to be a promotional feature unit.
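The thresholding and L0 norm test above can be sketched as follows, using the values quoted in the text (distribution threshold 10, element threshold 3, and the n >= 3 decision). The function names and the dictionary representation of per-position counts are assumptions made for illustration.

```python
DISTRIBUTION_THRESHOLD = 10  # value reported in the text
ELEMENT_THRESHOLD = 3        # value reported in the text


def distribution_vector(position_counts, threshold=DISTRIBUTION_THRESHOLD):
    """Map each count x to y = x if x > threshold, else y = 0."""
    return {pos: (x if x > threshold else 0) for pos, x in position_counts.items()}


def l0_norm(vector):
    """Number of non-zero elements in the vector."""
    return sum(1 for y in vector.values() if y != 0)


def is_promotional(position_counts,
                   threshold=DISTRIBUTION_THRESHOLD,
                   element_threshold=ELEMENT_THRESHOLD):
    """Apply the n >= element_threshold decision stated in the text."""
    return l0_norm(distribution_vector(position_counts, threshold)) >= element_threshold


# Hypothetical counts keyed by signed paragraph position.
tail_slogan = {-1: 25, -2: 14, -3: 12, 5: 1}
ordinary_phrase = {1: 2, 4: 3, 7: 1, -2: 2}
print(is_promotional(tail_slogan), is_promotional(ordinary_phrase))
```

The counts in the example are invented; in the method they would come from counting each candidate feature unit per document position across the sample set.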
Step S240: Detect the promotional information contained in a document according to the determined promotional feature units.
Specifically, a corresponding document detection model is set up according to the determined promotional feature units and their distribution over the document positions, and the promotional information contained in the document is detected with the document detection model.
The step of setting up the corresponding document detection model according to the determined promotional feature units and their distribution over the document positions specifically includes: setting the model parameters contained in the document detection model, and the weight value corresponding to each model parameter, according to the determined promotional feature units, their probability of occurrence at each document position, and preset position weights. The probability of occurrence is computed as p = k/n, where n is the total number of times the promotional feature unit occurs in documents and k is the number of times it occurs at the given position. Because advertising information or spam promotional information frequently appears at specific positions in an article, different position weights need to be assigned to the different positions at which a promotional feature unit occurs in a document. It should be noted that the specific position weights need to be determined through extensive experiments, and that the position weight of the specific positions where advertising information or spam promotional information often appears should be higher than the position weight of other positions in the text; only then can the probability of mistakenly deleting normal content be reduced.
The step of detecting the promotional information contained in the document with the document detection model specifically includes: searching the information units contained in the document to be detected for information units that match the model parameters contained in the document detection model; and, for each information unit found, determining a score for the information unit according to its document position in the document to be detected and/or the weight value of the model parameter that matches it, and determining from the score whether the information unit is promotional information. The score is computed as the information unit's probability of occurrence at the document position multiplied by the preset position weight. Because the position weight of the specific positions where advertising information or spam promotional information often appears is higher, an information unit with a higher final score is very likely to be promotional information.
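The probability p = k/n and the score (probability at a position multiplied by the position weight) can be sketched as follows. This is only an illustration under stated assumptions: the function names, the example counts, the example weights, and the default weight of 1.0 for unlisted positions are all hypothetical, since the text says the weights are determined experimentally.

```python
def occurrence_probability(position_counts):
    """p = k / n for each position, where n is the unit's total occurrence count."""
    n = sum(position_counts.values())
    if n == 0:
        return {}
    return {pos: k / n for pos, k in position_counts.items()}


def unit_score(position, probabilities, position_weights, default_weight=1.0):
    """Score = probability of occurrence at this position * weight of this position."""
    p = probabilities.get(position, 0.0)
    return p * position_weights.get(position, default_weight)


# Hypothetical promotional feature unit seen 30 times at -1 and 10 times at -2.
probs = occurrence_probability({-1: 30, -2: 10})
weights = {-1: 2.0, -2: 1.5}  # assumed weights, higher near the tail
print(unit_score(-1, probs, weights), unit_score(5, probs, weights))
```

A matched information unit found at a position where the feature unit rarely or never occurred in the samples receives a low score, which is how the weighting lowers the chance of deleting normal content.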
Step S250: Prune the document according to the document position of the detected promotional information.
When the document position of the detected promotional information belongs to the head of the document, the promotional information and the paragraph content before it are deleted. When the document position of the detected promotional information belongs to the tail of the document, the promotional information and the paragraph content after it are deleted. When the document position of the detected promotional information belongs to the middle of the document, the sentence containing the promotional information is deleted. Through these deletion operations, the advertising information or spam promotional information contained in machine-crawled news content can be removed effectively, yielding clean news content and facilitating the compilation of news from self-media platforms.
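The three deletion rules can be sketched as follows. The document representation (a list of paragraphs, each a list of sentences), the function name, and the definition of head and tail as the first and last `edge` paragraphs are assumptions for illustration; the text does not define where the head and tail end.

```python
def prune(doc, para_idx, sent_idx, edge=2):
    """Remove detected promotional content from doc.

    doc is a list of paragraphs, each paragraph a list of sentences.
    (para_idx, sent_idx) locates the detected promotional sentence.
    """
    total = len(doc)
    if para_idx < edge:
        # Head: drop the promotional paragraph and everything before it.
        return [list(p) for p in doc[para_idx + 1:]]
    if para_idx >= total - edge:
        # Tail: drop the promotional paragraph and everything after it.
        return [list(p) for p in doc[:para_idx]]
    # Middle: drop only the sentence that contains the promotional information.
    pruned = [list(p) for p in doc]
    del pruned[para_idx][sent_idx]
    return [p for p in pruned if p]


article = [["Ad banner"], ["A", "B"], ["C", "Buy now", "D"], ["E"], ["F"], ["Follow us"]]
print(prune(article, 5, 0))
print(prune(article, 2, 1))
```

The function returns a new structure and leaves the input untouched, so the original crawled document remains available for feeding results back to the model.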
Step S260: Update the document detection model according to the promotional information detected in the document.
The document detection model includes a deep learning model; in particular, a convolutional neural network model may be used. In practice, the actual result of each detection of promotional information can also be fed back to the convolutional neural network model, so that the document detection model is continuously updated, the recognition accuracy is continuously improved, and the efficiency of recognizing promotional information is increased.
As can be seen, the method for detecting promotional information provided by the invention first deduplicates the sample data, which reduces the amount of computation of the method. It then extracts the information units in the preset sample set and determines the candidate feature units among them according to the number of occurrences of each information unit in the sample set. Next, according to the distribution of the candidate feature units over the positions in each document, it applies the L0 norm constraint algorithm to determine the promotional feature units among the candidate feature units. Finally, it builds a document detection model from the promotional feature units screened out and uses the model to detect the promotional information contained in the target document, thereby obtaining the promotional information in the target document. Using the promotional information obtained, the machine-crawled target document can be pruned to obtain clean news content, which facilitates the compilation of news from self-media platforms. Moreover, when the document detection model uses a deep learning model, the actual result of each detection of promotional information can be fed back to the document detection model, so that the model keeps learning and updating to adapt to change, improving the accuracy of promotional information detection.
Embodiment three
Fig. 3 shows a device for detecting promotional information provided by the invention. The device includes: an information unit extraction module 310, a candidate unit determining module 320, a promotion unit determining module 330 and a detection module 340.
The information unit extraction module 310 is configured to obtain a preset sample set and extract the information units contained in each sample in the sample set.
To make it easier for the detection device to identify the sample news content, the information unit extraction module 310 first splits the preset sample news content containing advertising information or spam promotional information according to certain rules and extracts the information units contained in each sample. The preset sample set refers to representative self-media news content that contains advertising information or spam promotional information; it is generally selected and set by those skilled in the art based on experience. An information unit is a basic unit that makes up the sample news content; it may be a feature phrase, or words and sentences with certain features produced by splitting the sample news content. The invention does not specifically limit the rules for setting the preset sample set or the concrete form of the information units; those skilled in the art can set them flexibly according to actual circumstances.
The candidate unit determining module 320 is configured to count the number of occurrences of each information unit in the sample set and determine the information units whose number of occurrences is greater than a preset first threshold to be candidate feature units.
Because advertising information and spam promotional information are information that each news publisher on a self-media platform deliberately repeats, different news items from the same news publisher usually contain the same advertising information or spam promotional information. The candidate unit determining module 320 counts the number of occurrences in the sample set of the information units extracted by the information unit extraction module 310. When the number of occurrences of an information unit exceeds the preset first threshold, the information unit is strongly suspected of belonging to advertising information or spam promotional information and is therefore determined to be a candidate feature unit.
The preset first threshold is determined from how often advertising information or spam promotional information typically repeats in sample news content from the same news publisher; when an information unit repeats more often than this, it is determined to be a candidate feature unit suspected of being advertising information or spam promotional information. The invention does not specifically limit the rule for determining the first threshold; those skilled in the art can determine it flexibly from test data and experience.
The promotion unit determining module 330 is configured to, for each candidate feature unit, count the distribution of the candidate feature unit over the document positions and determine from the statistics whether the candidate feature unit is a promotional feature unit.
After the preliminary screening by the candidate unit determining module 320, most information units containing advertising information or spam promotional information are determined to be candidate feature units, but some information units that contain normal news content and repeat more often than the first threshold are also determined to be candidate feature units.
Through extensive experiments and repeated comparison, the inventor found that candidate feature units containing normal news content are not deliberately repeated by the news publisher, so their position distribution in the samples is generally fairly uniform, whereas candidate feature units containing advertising information or spam promotional information are deliberately repeated by the news publisher, so their position distribution in the samples is fairly concentrated. Based on this finding, the promotion unit determining module 330 uses the position distribution of the candidate feature units in the samples to screen them further, and determines the candidate feature units whose position distribution is concentrated to be promotional feature units.
The detection module 340 is configured to detect the promotional information contained in a document according to the determined promotional feature units.
After processing by the promotion unit determining module 330, the promotional feature units summarized from the preset sample set are obtained. The detection module 340 then uses these promotional feature units to identify the documents to be detected that were obtained by machine crawling, effectively screening out the promotional information they contain. Finally, the promotional information screened out is removed from the document to be detected, yielding relatively clean news content.
For the specific structure and working principle of the above modules, refer to the description of the corresponding parts in the method embodiments, which is not repeated here.
As can be seen, the device for detecting promotional information provided by the invention extracts the information units in the preset sample set, determines the candidate feature units among them according to the number of occurrences of each information unit in the sample set, then determines the promotional feature units among the candidate feature units according to the distribution of the candidate feature units over the positions in each document, and finally detects the promotional information contained in the target document according to the promotional feature units screened out. This achieves effective and accurate filtering of advertising information or spam promotional information when self-media platform news is extracted by machine crawling, so that clean news content can be extracted by machine crawling as well, greatly improving the efficiency of compiling news from self-media platforms.
Embodiment four
Fig. 4 shows a device for detecting promotional information provided by the invention. The device includes: an information unit extraction module 410, a candidate unit determining module 420, a promotion unit determining module 430, a detection module 440, an update module 450 and a deletion module 460, where the promotion unit determining module 430 further includes a vector submodule 431, a determination submodule 432 and a document division submodule 433.
The information unit extraction module 410 is configured to obtain a preset sample set and extract the information units contained in each sample in the sample set.
To make it easier for the detection device to identify the sample news content, the preset sample news content containing advertising information or spam promotional information first needs to be split according to certain rules, and the information units contained in each sample extracted. Because the same news item may be reposted many times, deduplicating before the preset sample set is obtained can effectively reduce the amount of computation and improve efficiency. The information unit extraction module 410 therefore deduplicates the multiple candidate samples and obtains the sample set from the deduplicated candidate samples.
Specifically, the information unit extraction module 410 calculates the similarity between the titles of the candidate samples and deduplicates candidate samples whose title similarity is greater than a preset similarity threshold. For candidate samples whose title similarity is not greater than the preset similarity threshold, it looks up the keyword set corresponding to each candidate sample; if the number of identical keywords contained in the keyword sets of two candidate samples is greater than a preset quantity threshold, the two candidate samples are deduplicated. Preferably, the similarity between the titles of the candidate samples is calculated with the longest common subsequence algorithm, the keyword set of each candidate sample is determined from the inverse document frequency (IDF) of each word obtained after word segmentation of the candidate sample, and the quantity threshold is determined according to the Jaccard similarity algorithm.
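The two-stage duplicate check can be sketched as follows. This is a minimal illustration under several assumptions: the function names and both threshold values are hypothetical, the title similarity is normalised by the longer title's length, keyword sets are passed in directly (the IDF-based keyword selection is omitted), and the keyword stage compares the Jaccard ratio against a threshold as one possible reading of how the quantity threshold is derived.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0]
        for j, bj in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ch == bj else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def title_similarity(a: str, b: str) -> float:
    """Assumed normalisation: LCS length divided by the longer title's length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else lcs_length(a, b) / longest


def jaccard(x: set, y: set) -> float:
    """Jaccard similarity |x & y| / |x | y|; defined as 0.0 for two empty sets."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def is_duplicate(title_a, title_b, kw_a, kw_b, sim_threshold=0.8, kw_threshold=0.5):
    """Stage 1: compare titles. Stage 2: compare keyword sets."""
    if title_similarity(title_a, title_b) > sim_threshold:
        return True
    return jaccard(kw_a, kw_b) > kw_threshold


print(is_duplicate("Breaking news today", "Breaking news today!", set(), set()))
```

The keyword stage only runs when the titles differ enough, which mirrors the order given in the text and avoids the set comparison for obvious reposts.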
After the deduplication is completed, the information unit extraction module 410 extracts the information units contained in each sample in the sample set. Specifically, in this embodiment, the article content can be split at punctuation marks and at line-break whitespace to obtain the information units in a sample. For example, "Long-press the QR code to 'identify' and follow, more surprises await you" can be split into two information units: "Long-press the QR code to 'identify' and follow" and "more surprises await you". In other embodiments, other rules may be used to split the article content and extract information units; the invention does not specifically limit this, and those skilled in the art can set it flexibly.
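Splitting at punctuation marks and line breaks can be sketched with a regular expression. The set of delimiter characters below is an assumption (the text does not list which punctuation marks are used), and the function name is hypothetical.

```python
import re

# Assumed delimiter set: common Chinese and Western sentence punctuation plus line breaks.
_DELIMITERS = re.compile(r"[，。！？；,.!?;\r\n]+")


def extract_units(text: str) -> list:
    """Split article text into information units, dropping empty fragments."""
    return [part.strip() for part in _DELIMITERS.split(text) if part.strip()]


print(extract_units("Long-press the QR code to 'identify' and follow, more surprises await you"))
```

Treating "." as a delimiter would also split decimals and URLs, so a production rule set would likely need to be more selective than this sketch.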
The candidate unit determining module 420 is configured to count the number of occurrences of each information unit in the sample set and determine the information units whose number of occurrences is greater than a preset first threshold to be candidate feature units.
In the course of implementing the invention, the inventor found by analysing historical data that the advertising information or spam promotional information contained in the articles a news publisher issues within a period of time is essentially the same, so the information units associated with advertising information or spam promotional information necessarily repeat at high frequency. Extensive statistical analysis yields a critical number of repetitions that distinguishes information units associated with advertising information or spam promotional information from ordinary information units; this critical value is the preset first threshold. The candidate unit determining module 420 screens all information units against the preset first threshold and determines the information units whose number of occurrences is greater than the preset first threshold to be candidate feature units.
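The frequency screening can be sketched with a counter over all extracted units. The function name and the input shape (one list of information units per sample) are assumptions; the default of 20 is the first threshold value the description reports for step S220.

```python
from collections import Counter


def candidate_units(samples, first_threshold=20):
    """Return the information units that occur more than first_threshold times.

    samples is an iterable of samples, each given as a list of information units.
    """
    counts = Counter(unit for units in samples for unit in units)
    return {unit for unit, count in counts.items() if count > first_threshold}


# 21 hypothetical articles that all end with the same slogan.
corpus = [["story %d" % i, "follow our account"] for i in range(21)]
print(candidate_units(corpus))
```

Units that survive this step are only suspects; the position-distribution screening is what separates repeated promotional text from repeated normal content.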
The promotion unit determining module 430 is configured to, for each candidate feature unit, count the distribution of the candidate feature unit over the document positions and determine from the statistics whether the candidate feature unit is a promotional feature unit.
The screening by the candidate unit determining module 420 roughly filters out most information units of normal content. However, besides the information units associated with advertising information or spam promotional information, the remaining information units (i.e. the candidate feature units) may also include time-related information units contained in normal news. Through statistical analysis, the inventor found that among the candidate feature units, the time-related candidate feature units are not deliberately repeated by a person, so their position distribution in documents is fairly uniform (as shown in Figure 5), whereas the candidate feature units associated with advertising information or spam promotional information are deliberately repeated by a person, so their position distribution in documents is fairly concentrated (as shown in Figure 6). Therefore, the promotional feature units can be effectively screened out by counting the distribution of the candidate feature units over the document positions.
Specifically, the promotion unit determining module 430 includes a vector submodule 431, a determination submodule 432 and a document division submodule 433. The vector submodule 431 is configured to set up a vector representing the distribution of the candidate feature unit over the document positions, where each element of the vector corresponds to one document position; if the distribution quantity of the candidate feature unit at a specified document position is greater than a preset distribution threshold, the element corresponding to that document position is non-zero; if the distribution quantity is not greater than the preset distribution threshold, the element corresponding to that document position is zero. The determination submodule 432 is configured to determine that the candidate feature unit is a promotional feature unit when the number of non-zero elements in the vector is greater than a preset element threshold. The document division submodule 433 is configured to divide the document content into multiple document positions according to a preset position division rule, where the preset position division rule includes a division rule based on paragraph granularity and a division rule based on sentence granularity. The distribution quantity of the candidate feature unit at a specified document position includes its number of occurrences and/or its probability of occurrence at that position.
The detection module 440 is configured to detect the promotional information contained in a document according to the determined promotional feature units.
Specifically, the detection module 440 sets up a corresponding document detection model according to the determined promotional feature units and their distribution over the document positions, and detects the promotional information contained in the document with the document detection model. Further, the detection module 440 sets the model parameters contained in the document detection model, and the weight value corresponding to each model parameter, according to the determined promotional feature units, their probability of occurrence at each document position, and preset position weights. It then searches the information units contained in the document to be detected for information units that match the model parameters contained in the document detection model. For each information unit found, it determines a score according to the information unit's document position in the document to be detected and/or the weight value of the model parameter that matches it, and determines from the score whether the information unit is promotional information.
The invention may include an update module 450, configured to update the document detection model according to the promotional information detected in the document. The document detection model includes a deep learning model; in particular, a convolutional neural network model may be used. In practice, the update module 450 can also feed the actual result of each detection of promotional information back to the convolutional neural network model, so that the document detection model is continuously updated, the recognition accuracy is continuously improved, and the efficiency of recognizing promotional information is increased.
The invention may also include a deletion module 460, configured to prune the document according to the document position of the detected promotional information. When the document position of the detected promotional information belongs to the head of the document, the promotional information and the paragraph content before it are deleted; when it belongs to the tail of the document, the promotional information and the paragraph content after it are deleted; when it belongs to the middle of the document, the sentence containing the promotional information is deleted. The deletion module 460 can effectively remove the advertising information or spam promotional information contained in machine-crawled news content, yielding clean news content and facilitating the compilation of news from self-media platforms.
For the specific structure and working principle of the above modules, refer to the description of the corresponding parts in the method embodiments, which is not repeated here.
As can be seen, the device for detecting promotional information provided by the invention first deduplicates the sample data, which reduces the amount of computation. It then extracts the information units in the preset sample set and determines the candidate feature units among them according to the number of occurrences of each information unit in the sample set. Next, according to the distribution of the candidate feature units over the positions in each document, it applies the L0 norm constraint algorithm to determine the promotional feature units among the candidate feature units. Finally, it builds a document detection model from the promotional feature units screened out and uses the model to detect the promotional information contained in the target document, thereby obtaining the promotional information in the target document. Using the promotional information obtained, the machine-crawled target document can be pruned to obtain clean news content, which facilitates the compilation of news from self-media platforms. Moreover, when the document detection model uses a deep learning model, the actual result of each detection of promotional information can be fed back to the document detection model, so that the model keeps learning and updating to adapt to change, improving the accuracy of promotional information detection.
The algorithms and displays provided herein are not inherently related to any particular computer, virtual system or other apparatus. Various general-purpose systems may also be used with the teachings herein. From the description above, the structure required to construct such systems is apparent. Furthermore, the invention is not directed to any particular programming language. It should be understood that the content of the invention described herein may be implemented in a variety of programming languages, and that the above description of a specific language is given to disclose the best mode of the invention.
Numerous specific details are set forth in the specification provided herein. It should be understood, however, that embodiments of the invention may be practised without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid understanding of one or more of the various inventive aspects, features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof in the above description of exemplary embodiments of the invention. However, this method of disclosure should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single embodiment disclosed above. The claims following the detailed description are therefore expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will appreciate that the modules in the device of an embodiment can be adaptively changed and arranged in one or more devices different from that embodiment. The modules or units or components of an embodiment may be combined into one module or unit or component, and may furthermore be divided into multiple submodules or subunits or subcomponents. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract and drawings), and all processes or units of any method or device so disclosed, may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract and drawings) may be replaced by an alternative feature serving the same, an equivalent or a similar purpose.
Furthermore, those skilled in the art will appreciate that although some embodiments described herein include some features that are included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.
The component embodiments of the invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the device for detecting promotional information according to embodiments of the invention. The invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the method described herein. Such a program implementing the invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices may be embodied by one and the same item of hardware. The use of the words first, second and third does not indicate any order; these words may be interpreted as names.
The invention discloses: A1. A method for detecting promotion information, comprising:
obtaining a preset sample set, and extracting the information units contained in each sample in the sample set;
counting the number of occurrences of each information unit in the sample set, and determining the information units whose number of occurrences is greater than a preset first threshold as candidate feature units;
for each candidate feature unit, separately counting the distribution of the candidate feature unit across the document positions, and determining from the statistical result whether the candidate feature unit is a promotion feature unit;
detecting the promotion information contained in a document according to the determined promotion feature units.
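The candidate selection step of A1 can be illustrated with a minimal sketch. It assumes that each sample has already been split into information units (for example, words or phrases) and that occurrences are counted in total across the sample set; the function name `select_candidate_units` and the example data are hypothetical and do not come from the patent.

```python
from collections import Counter

def select_candidate_units(samples, first_threshold):
    """Return the information units whose total count exceeds first_threshold.

    samples: list of samples, each a list of information units (assumed
    to be pre-extracted, e.g. by word segmentation).
    """
    counts = Counter()
    for units in samples:
        counts.update(units)  # accumulate occurrences over the whole sample set
    return {unit for unit, n in counts.items() if n > first_threshold}

# Hypothetical sample set: three already-segmented documents.
samples = [
    ["follow", "our", "account", "news"],
    ["follow", "our", "account", "weather"],
    ["follow", "sports", "account"],
]
candidates = select_candidate_units(samples, first_threshold=2)
```

Here only "follow" and "account" occur more than twice, so they become candidate feature units; whether they are promotion feature units is decided in the later distribution step.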
A2. The method according to A1, wherein the step of separately counting the distribution of the candidate feature unit across the document positions and determining from the statistical result whether the candidate feature unit is a promotion feature unit specifically comprises:
setting a vector that represents the distribution of the candidate feature unit across the document positions, wherein each element of the vector corresponds to one document position;
if the distribution quantity of the candidate feature unit at a specified document position is greater than a preset distribution threshold, the value of the element corresponding to that document position is non-zero; if the distribution quantity of the candidate feature unit at the specified document position is not greater than the preset distribution threshold, the value of the element corresponding to that document position is zero;
when the number of non-zero elements in the vector is greater than a preset element threshold, determining the candidate feature unit to be a promotion feature unit.
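The vector test of A2 can be sketched as follows. This is an illustration under assumptions: the distribution quantity is taken to be a per-position occurrence count, the non-zero element value is taken to be 1, and the function names and thresholds are hypothetical.

```python
def distribution_vector(position_counts, distribution_threshold):
    """One element per document position: 1 if the count there exceeds
    the distribution threshold, otherwise 0."""
    return [1 if c > distribution_threshold else 0 for c in position_counts]

def is_promotion_feature(position_counts, distribution_threshold, element_threshold):
    """Apply the A2 rule: more non-zero elements than element_threshold."""
    vector = distribution_vector(position_counts, distribution_threshold)
    nonzero = sum(1 for v in vector if v != 0)
    return nonzero > element_threshold

# Hypothetical counts of one candidate unit at four document positions.
position_counts = [12, 0, 1, 9]
vector = distribution_vector(position_counts, distribution_threshold=5)
result = is_promotion_feature(position_counts, distribution_threshold=5,
                              element_threshold=1)
```

With these numbers the vector is `[1, 0, 0, 1]`, which has two non-zero elements, so the unit passes an element threshold of 1.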
A3. The method according to A2, wherein, before the step of setting a vector that represents the distribution of the candidate feature unit across the document positions, the method further comprises the step of: dividing document content into a plurality of document positions according to a preset position division rule; wherein the preset position division rule comprises: a division rule based on paragraph granularity, and a division rule based on sentence granularity;
and the distribution quantity of the candidate feature unit at a specified document position comprises: the number of occurrences and/or the occurrence probability of the candidate feature unit at the specified document position.
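The position division and the two distribution quantities named in A3 can be sketched as below. The sketch assumes that paragraphs are separated by newlines, that sentences end with common Western or Chinese terminal punctuation, and that the occurrence probability is the share of a unit's occurrences falling at each position; all names are hypothetical.

```python
import re

def split_positions(text, granularity="paragraph"):
    """Divide document content into document positions."""
    if granularity == "paragraph":
        parts = text.split("\n")
    elif granularity == "sentence":
        # A run of non-terminal characters plus an optional terminator.
        parts = re.findall(r"[^.!?。！？]+[.!?。！？]?", text)
    else:
        raise ValueError("unknown granularity: " + granularity)
    return [p.strip() for p in parts if p.strip()]

def position_stats(unit, positions):
    """Return (occurrence counts, occurrence probabilities) per position."""
    counts = [p.count(unit) for p in positions]
    total = sum(counts)
    probs = [c / total if total else 0.0 for c in counts]
    return counts, probs

doc = "Follow us for more.\nThe weather is fine today.\nFollow us. Follow us again."
paragraphs = split_positions(doc, "paragraph")
sentences = split_positions(doc, "sentence")
counts, probs = position_stats("Follow us", paragraphs)
```

At paragraph granularity the example has three positions; at sentence granularity it has four, which gives a finer-grained distribution for the same text.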
A4. The method according to any one of A1-A3, wherein the step of obtaining a preset sample set specifically comprises:
performing deduplication on a plurality of candidate samples, and obtaining the sample set from the deduplicated candidate samples.
A5. The method according to A4, wherein the step of performing deduplication on a plurality of candidate samples specifically comprises:
calculating the similarity between the titles of the candidate samples, and deduplicating the candidate samples whose title similarity is greater than a preset similarity threshold;
for candidate samples whose title similarity is not greater than the preset similarity threshold, querying the keyword set corresponding to each candidate sample, and, if the number of identical keywords contained in the keyword sets corresponding to two candidate samples is greater than a preset quantity threshold, deduplicating the two candidate samples.
A6. The method according to A5, wherein the step of calculating the similarity between the titles of the candidate samples specifically comprises: calculating the similarity between the titles of the candidate samples by a longest common subsequence algorithm;
and the keyword set corresponding to each candidate sample is determined according to the inverse document frequency of each word obtained by performing word segmentation on the candidate sample; the quantity threshold is determined according to a Jaccard similarity algorithm.
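The two-stage duplicate check of A5 and A6 can be sketched as follows. Several points are assumptions rather than the patent's definitions: the title similarity is normalized as LCS length divided by the longer title's length, the keyword stage compares the Jaccard coefficient directly against a threshold instead of deriving a count threshold from it, the keyword sets are taken as given (the IDF-based selection is not shown), and the function names and default thresholds are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        curr = [0]
        for j, bj in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if ch == bj else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def title_similarity(t1, t2):
    """LCS length normalized by the longer title (an assumed normalization)."""
    longest = max(len(t1), len(t2))
    return lcs_length(t1, t2) / longest if longest else 1.0

def jaccard(s1, s2):
    """Jaccard coefficient of two keyword sets."""
    union = s1 | s2
    return len(s1 & s2) / len(union) if union else 1.0

def is_duplicate(s1, s2, similarity_threshold=0.8, jaccard_threshold=0.5):
    """Stage 1: title similarity. Stage 2: keyword overlap."""
    if title_similarity(s1["title"], s2["title"]) > similarity_threshold:
        return True
    return jaccard(set(s1["keywords"]), set(s2["keywords"])) > jaccard_threshold

a = {"title": "Top 10 phones of 2017", "keywords": ["phone", "sale", "discount"]}
b = {"title": "Top 10 phones of 2017!", "keywords": ["camera"]}
c = {"title": "Spring shopping guide", "keywords": ["phone", "sale", "discount", "coupon"]}
```

In this example `a` and `b` are caught by the title stage, while `a` and `c` have dissimilar titles and are caught by the keyword stage because three of their four distinct keywords are shared.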
A7. The method according to any one of A1-A6, wherein the step of detecting the promotion information contained in a document according to the determined promotion feature units specifically comprises:
setting a corresponding document detection model according to the determined promotion feature units and their distribution across the document positions, and detecting the promotion information contained in the document according to the document detection model.
A8. The method according to A7, wherein the step of setting a corresponding document detection model according to the determined promotion feature units and their distribution across the document positions specifically comprises:
setting the model parameters contained in the document detection model, and the weight value corresponding to each model parameter, according to the determined promotion feature units, their occurrence probabilities at the document positions, and preset position weights.
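One simple way to read A8 is sketched below. It assumes that each promotion feature unit is a model parameter and that its weight at a position is the product of its occurrence probability there and a preset position weight; the patent does not state this combination rule, and the names `build_detection_model`, "head", "middle" and "tail" are used here only for illustration.

```python
def build_detection_model(position_probabilities, position_weights):
    """Map each promotion feature unit to a per-position weight value.

    position_probabilities: {unit: {position: occurrence probability}}
    position_weights: {position: preset weight}
    The product rule is an assumption made for this sketch.
    """
    model = {}
    for unit, probs in position_probabilities.items():
        model[unit] = {pos: p * position_weights[pos] for pos, p in probs.items()}
    return model

# Hypothetical statistics and preset weights favouring head and tail.
position_probabilities = {"follow us": {"head": 0.25, "middle": 0.25, "tail": 0.5}}
position_weights = {"head": 2.0, "middle": 1.0, "tail": 2.0}
model = build_detection_model(position_probabilities, position_weights)
```

A plain dictionary keeps the sketch easy to inspect; the same statistics could equally serve as input features for a learned model.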
A9. The method according to A8, wherein the step of detecting the promotion information contained in the document according to the document detection model specifically comprises:
searching the information units contained in a document to be detected for information units that match the model parameters contained in the document detection model;
for each information unit found, determining a score for the information unit according to the document position of the information unit in the document to be detected and/or the weight value of the model parameter matching the information unit, and determining from the score whether the information unit is promotion information.
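The matching and scoring steps of A9 can be sketched as a lookup followed by a threshold test. The sketch assumes a dictionary-style model that maps each promotion feature unit to per-position weight values, takes the score to be that weight, and uses a hypothetical score threshold; none of these specifics are given in the patent.

```python
def detect_promotion(units_with_positions, model, score_threshold):
    """Return (unit, position, score) for units judged to be promotion information.

    units_with_positions: list of (information unit, document position)
    pairs from the document to be detected.
    model: {unit: {position: weight value}} (assumed structure).
    """
    hits = []
    for unit, position in units_with_positions:
        weights = model.get(unit)
        if weights is None:
            continue  # the unit matches no model parameter
        score = weights.get(position, 0.0)
        if score > score_threshold:
            hits.append((unit, position, score))
    return hits

model = {"follow us": {"head": 0.5, "middle": 0.25, "tail": 1.0}}
document = [("breaking news", "head"), ("follow us", "middle"), ("follow us", "tail")]
hits = detect_promotion(document, model, score_threshold=0.4)
```

The same phrase is flagged at the tail but not in the middle of the document, which shows how the position contributes to the decision.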
A10. The method according to A8 or A9, wherein the method further comprises the step of: updating the document detection model according to the promotion information detected in documents; wherein the document detection model comprises: a deep learning model.
A11. The method according to any one of A1-A10, wherein, after the step of detecting the promotion information contained in a document according to the determined promotion feature units, the method further comprises the step of:
pruning the document according to the document position of the detected promotion information;
wherein, when the document position of the detected promotion information belongs to the head of the document, the promotion information and the paragraph content before it are deleted; when the document position of the detected promotion information belongs to the tail of the document, the promotion information and the paragraph content after it are deleted; when the document position of the detected promotion information belongs to the middle of the document, the sentence containing the promotion information is deleted.
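The position-dependent pruning of A11 can be sketched as follows. The sketch assumes that a document is a list of paragraphs, each a list of sentences, that "head" and "tail" mean the first and last `head_size` and `tail_size` paragraphs, and that the paragraph containing the promotion information is removed together with the content before or after it; these choices and the function name are hypothetical.

```python
def prune_document(paragraphs, para_idx, sent_idx, head_size=1, tail_size=1):
    """Remove promotion information found at (para_idx, sent_idx).

    paragraphs: list of paragraphs, each a list of sentences.
    """
    n = len(paragraphs)
    if para_idx < head_size:
        # Head: drop the promotion paragraph and everything before it.
        return paragraphs[para_idx + 1:]
    if para_idx >= n - tail_size:
        # Tail: drop the promotion paragraph and everything after it.
        return paragraphs[:para_idx]
    # Middle: drop only the sentence containing the promotion information.
    result = [list(p) for p in paragraphs]
    del result[para_idx][sent_idx]
    return [p for p in result if p]

doc = [
    ["Follow us."],
    ["News one.", "Buy now.", "News two."],
    ["Scan the code."],
]
```

For instance, `prune_document(doc, 1, 1)` removes only "Buy now." and leaves the neighbouring sentences in place, whereas a hit in the first or last paragraph removes that whole end of the document.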
The invention also discloses: B12. A device for detecting promotion information, comprising:
an information unit extraction module, configured to obtain a preset sample set and extract the information units contained in each sample in the sample set;
a candidate unit determining module, configured to count the number of occurrences of each information unit in the sample set and determine the information units whose number of occurrences is greater than a preset first threshold as candidate feature units;
a promotion unit determining module, configured to, for each candidate feature unit, separately count the distribution of the candidate feature unit across the document positions and determine from the statistical result whether the candidate feature unit is a promotion feature unit;
a detection module, configured to detect the promotion information contained in a document according to the determined promotion feature units.
B13. The device according to B12, wherein the promotion unit determining module specifically comprises:
a vector submodule, configured to set a vector that represents the distribution of the candidate feature unit across the document positions; wherein each element of the vector corresponds to one document position; if the distribution quantity of the candidate feature unit at a specified document position is greater than a preset distribution threshold, the value of the element corresponding to that document position is non-zero; if the distribution quantity of the candidate feature unit at the specified document position is not greater than the preset distribution threshold, the value of the element corresponding to that document position is zero;
a determination submodule, configured to determine the candidate feature unit to be a promotion feature unit when the number of non-zero elements in the vector is greater than a preset element threshold.
B14. The device according to B13, wherein the promotion unit determining module further comprises:
a document division submodule, configured to divide document content into a plurality of document positions according to a preset position division rule;
wherein the preset position division rule comprises: a division rule based on paragraph granularity, and a division rule based on sentence granularity; and the distribution quantity of the candidate feature unit at a specified document position comprises: the number of occurrences and/or the occurrence probability of the candidate feature unit at the specified document position.
B15. The device according to any one of B12-B14, wherein the information unit extraction module is further configured to:
perform deduplication on a plurality of candidate samples, and obtain the sample set from the deduplicated candidate samples.
B16. The device according to B15, wherein the information unit extraction module is specifically configured to:
calculate the similarity between the titles of the candidate samples, and deduplicate the candidate samples whose title similarity is greater than a preset similarity threshold;
for candidate samples whose title similarity is not greater than the preset similarity threshold, query the keyword set corresponding to each candidate sample, and, if the number of identical keywords contained in the keyword sets corresponding to two candidate samples is greater than a preset quantity threshold, deduplicate the two candidate samples.
B17. The device according to B16, wherein the information unit extraction module is specifically configured to: calculate the similarity between the titles of the candidate samples by a longest common subsequence algorithm;
and the keyword set corresponding to each candidate sample is determined according to the inverse document frequency of each word obtained by performing word segmentation on the candidate sample; the quantity threshold is determined according to a Jaccard similarity algorithm.
B18. The device according to any one of B12-B17, wherein the detection module is specifically configured to:
set a corresponding document detection model according to the determined promotion feature units and their distribution across the document positions, and detect the promotion information contained in the document according to the document detection model.
B19. The device according to B18, wherein the detection module is specifically configured to:
set the model parameters contained in the document detection model, and the weight value corresponding to each model parameter, according to the determined promotion feature units, their occurrence probabilities at the document positions, and preset position weights.
B20. The device according to B19, wherein the detection module is specifically configured to:
search the information units contained in a document to be detected for information units that match the model parameters contained in the document detection model;
for each information unit found, determine a score for the information unit according to the document position of the information unit in the document to be detected and/or the weight value of the model parameter matching the information unit, and determine from the score whether the information unit is promotion information.
B21. The device according to B19 or B20, wherein the device further comprises:
an update module, configured to update the document detection model according to the promotion information detected in documents; wherein the document detection model comprises: a deep learning model.
B22. The device according to any one of B12-B21, wherein the device further comprises:
a pruning module, configured to prune the document according to the document position of the detected promotion information;
wherein, when the document position of the detected promotion information belongs to the head of the document, the promotion information and the paragraph content before it are deleted; when the document position of the detected promotion information belongs to the tail of the document, the promotion information and the paragraph content after it are deleted; when the document position of the detected promotion information belongs to the middle of the document, the sentence containing the promotion information is deleted.

Claims (10)

1. A method for detecting promotion information, comprising:
obtaining a preset sample set, and extracting the information units contained in each sample in the sample set;
counting the number of occurrences of each information unit in the sample set, and determining the information units whose number of occurrences is greater than a preset first threshold as candidate feature units;
for each candidate feature unit, separately counting the distribution of the candidate feature unit across the document positions, and determining from the statistical result whether the candidate feature unit is a promotion feature unit;
detecting the promotion information contained in a document according to the determined promotion feature units.
2. The method according to claim 1, wherein the step of separately counting the distribution of the candidate feature unit across the document positions and determining from the statistical result whether the candidate feature unit is a promotion feature unit specifically comprises:
setting a vector that represents the distribution of the candidate feature unit across the document positions, wherein each element of the vector corresponds to one document position;
if the distribution quantity of the candidate feature unit at a specified document position is greater than a preset distribution threshold, the value of the element corresponding to that document position is non-zero; if the distribution quantity of the candidate feature unit at the specified document position is not greater than the preset distribution threshold, the value of the element corresponding to that document position is zero;
when the number of non-zero elements in the vector is greater than a preset element threshold, determining the candidate feature unit to be a promotion feature unit.
3. The method according to claim 2, wherein, before the step of setting a vector that represents the distribution of the candidate feature unit across the document positions, the method further comprises the step of: dividing document content into a plurality of document positions according to a preset position division rule; wherein the preset position division rule comprises: a division rule based on paragraph granularity, and a division rule based on sentence granularity;
and the distribution quantity of the candidate feature unit at a specified document position comprises: the number of occurrences and/or the occurrence probability of the candidate feature unit at the specified document position.
4. The method according to any one of claims 1-3, wherein the step of obtaining a preset sample set specifically comprises:
performing deduplication on a plurality of candidate samples, and obtaining the sample set from the deduplicated candidate samples.
5. The method according to claim 4, wherein the step of performing deduplication on a plurality of candidate samples specifically comprises:
calculating the similarity between the titles of the candidate samples, and deduplicating the candidate samples whose title similarity is greater than a preset similarity threshold;
for candidate samples whose title similarity is not greater than the preset similarity threshold, querying the keyword set corresponding to each candidate sample, and, if the number of identical keywords contained in the keyword sets corresponding to two candidate samples is greater than a preset quantity threshold, deduplicating the two candidate samples.
6. The method according to claim 5, wherein the step of calculating the similarity between the titles of the candidate samples specifically comprises: calculating the similarity between the titles of the candidate samples by a longest common subsequence algorithm;
and the keyword set corresponding to each candidate sample is determined according to the inverse document frequency of each word obtained by performing word segmentation on the candidate sample; the quantity threshold is determined according to a Jaccard similarity algorithm.
7. The method according to any one of claims 1-6, wherein the step of detecting the promotion information contained in a document according to the determined promotion feature units specifically comprises:
setting a corresponding document detection model according to the determined promotion feature units and their distribution across the document positions, and detecting the promotion information contained in the document according to the document detection model.
8. The method according to claim 7, wherein the step of setting a corresponding document detection model according to the determined promotion feature units and their distribution across the document positions specifically comprises:
setting the model parameters contained in the document detection model, and the weight value corresponding to each model parameter, according to the determined promotion feature units, their occurrence probabilities at the document positions, and preset position weights.
9. The method according to claim 8, wherein the step of detecting the promotion information contained in the document according to the document detection model specifically comprises:
searching the information units contained in a document to be detected for information units that match the model parameters contained in the document detection model;
for each information unit found, determining a score for the information unit according to the document position of the information unit in the document to be detected and/or the weight value of the model parameter matching the information unit, and determining from the score whether the information unit is promotion information.
10. A device for detecting promotion information, comprising:
an information unit extraction module, configured to obtain a preset sample set and extract the information units contained in each sample in the sample set;
a candidate unit determining module, configured to count the number of occurrences of each information unit in the sample set and determine the information units whose number of occurrences is greater than a preset first threshold as candidate feature units;
a promotion unit determining module, configured to, for each candidate feature unit, separately count the distribution of the candidate feature unit across the document positions and determine from the statistical result whether the candidate feature unit is a promotion feature unit;
a detection module, configured to detect the promotion information contained in a document according to the determined promotion feature units.
CN201710113764.5A 2017-02-28 2017-02-28 Method and device for detecting promotion information Active CN106909669B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710113764.5A CN106909669B (en) 2017-02-28 2017-02-28 Method and device for detecting promotion information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710113764.5A CN106909669B (en) 2017-02-28 2017-02-28 Method and device for detecting promotion information

Publications (2)

Publication Number Publication Date
CN106909669A (en) 2017-06-30
CN106909669B (en) 2020-02-11

Family

ID=59209410

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710113764.5A Active CN106909669B (en) 2017-02-28 2017-02-28 Method and device for detecting promotion information

Country Status (1)

Country Link
CN (1) CN106909669B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101251841A (en) * 2007-05-17 2008-08-27 华东师范大学 Method for establishing and searching feature matrix of Web document based on semantics
CN101655866A (en) * 2009-08-14 2010-02-24 北京中献电子技术开发中心 Automatic decimation method of scientific and technical terminology
CN102591854A (en) * 2012-01-10 2012-07-18 凤凰在线(北京)信息技术有限公司 Advertisement filtering system and advertisement filtering method specific to text characteristics
CN102918532A (en) * 2010-06-01 2013-02-06 微软公司 Detection of junk in search result ranking
CN103970801A (en) * 2013-02-05 2014-08-06 腾讯科技(深圳)有限公司 Method and device for recognizing microblog advertisement blog articles
CN104239539A (en) * 2013-09-22 2014-12-24 中科嘉速(北京)并行软件有限公司 Microblog information filtering method based on multi-information fusion
CN104679730A (en) * 2015-02-13 2015-06-03 刘秀磊 Webpage summarization extraction method and device thereof

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
耿崇等: "基于词位置与同现特征的中文自动文摘研究", 《第五届全国信息检索学术会议论文集》 *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108319582A (en) * 2017-12-29 2018-07-24 北京城市网邻信息技术有限公司 Processing method, device and the server of text message
CN108805132A (en) * 2018-06-01 2018-11-13 华中科技大学 A kind of rubbish text filter method based on deep learning
CN108805132B (en) * 2018-06-01 2021-08-20 华中科技大学 Rubbish text filtering method based on deep learning
CN109815395A (en) * 2018-12-26 2019-05-28 北京中科闻歌科技股份有限公司 Webpage garbage information filtering method, device and storage medium
CN109815395B (en) * 2018-12-26 2021-06-08 北京中科闻歌科技股份有限公司 Webpage spam filtering method and device and storage medium
CN110362680A (en) * 2019-06-14 2019-10-22 西安交通大学 A kind of soft wide detection and advertisement abstracting method based on figure Crosslinking Structural
CN110362680B (en) * 2019-06-14 2021-07-13 西安交通大学 Soft-wide detection and advertisement extraction method based on graph network structure analysis
CN110275993A (en) * 2019-06-25 2019-09-24 苏州梦嘉信息技术有限公司 The management system and method for product placement in wechat public platform article
CN111026850A (en) * 2019-12-23 2020-04-17 园宝科技(武汉)有限公司 Intellectual property matching technology of bidirectional coding representation of self-attention mechanism
CN112905743A (en) * 2021-02-20 2021-06-04 北京百度网讯科技有限公司 Text object detection method and device, electronic equipment and storage medium
CN112905743B (en) * 2021-02-20 2023-08-01 北京百度网讯科技有限公司 Text object detection method, device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN106909669B (en) 2020-02-11

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: 100089 710, 7 / F, building 1, zone 1, No.3, Xisanhuan North Road, Haidian District, Beijing

Patentee after: Beijing time Ltd.

Address before: 100089 710, 7 / F, building 1, zone 1, No.3, Xisanhuan North Road, Haidian District, Beijing

Patentee before: BEIJING TIME Co.,Ltd.