CN103324617A

CN103324617A - Identification method and system for history waste information

Info

Publication number: CN103324617A
Application number: CN2012100744065A
Authority: CN
Inventors: 周斌; 刘婷婷
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2012-03-20
Filing date: 2012-03-20
Publication date: 2013-09-25

Abstract

The invention relates to the technical field of the internet and provides a method and system for identifying historical spam messages. The method comprises the following steps: when a request to browse a specified web page is received, extracting content feature information of the specified web page; according to a feature recognition algorithm stored in a preset knowledge base, matching the content feature information of the specified web page against the feature information stored in the preset knowledge base to obtain a recognition result; and, according to the recognition result, identifying whether the information in the specified web page is a historical spam message. Because historical spam in a web page is identified by a read audit, the identification cost is reduced and the recognition rate, real-time performance and adaptability are improved.

Description

Method and system for identifying historical spam messages

Technical field

The invention belongs to the field of internet technology, and in particular relates to a method, apparatus and system for identifying historical spam messages.

Background art

To make the technical solution of the invention easier to understand, the following terms are defined:

PV (page views): PV is short for Page View. It denotes the number of pages of a website that a visitor accesses within 24 hours (from 0:00 to 24:00). Repeated views of the same page by the same visitor are not counted in the PV value.

Write operation: in network applications where users contribute content, such as blogs, forums, message boards and comments, an operation in which a user publishes or updates content such as text, links, videos and pictures.

Read operation: in network applications where users contribute content, such as blogs, forums, message boards and comments, an operation in which a user browses a web page and thereby produces a PV (page view).

Write audit: in network applications where users contribute content, such as blogs, forums, message boards and comments, the review and filtering of the content that a user writes. A write audit is triggered when the user updates content.

Knowledge base: the set of rules obtained through system training, using machine learning algorithms and the like, for filtering spam out of content such as text and links in network applications such as blogs, forums, message boards and comments.

Historical spam message: a spam message that, when content such as text and links in network applications such as blogs, forums, message boards and comments is filtered for spam, is not identified in time after the user posts it because knowledge base updates lag behind.

As networks become increasingly widespread, network applications in which users contribute content, such as blog posts, comments and messages, receive more and more attention from internet users and product developers. Against this background, some malicious actors use these applications to publish spam messages such as politically reactionary, pornographic and commercial content.

The prior art mainly identifies spam messages by means of a write audit. This approach uses an automatic recognition algorithm to review and filter a message at the moment the user updates content; the recognition algorithms include keyword recognition, probability statistics, machine learning and the like. However, the form of spam in network applications changes frequently. Whatever automatic recognition algorithm is used, a knowledge base that is updated in real time must be maintained so that new forms of spam do not slip past the recognition logic and normal messages are not misidentified. Because spam on the network varies with time and with the intensity of enforcement, the learning process often lags behind. To handle the historical spam messages that result from this lag, the prior art usually scans the data of all web pages (the so-called historical data) in a manual or semi-automatic manner. This approach suffers from high cost, slow response and poor adaptability.

Summary of the invention

The purpose of the embodiments of the invention is to provide a method and system for identifying historical spam messages, intended to solve the problem that the prior art cannot automatically identify the historical spam messages left behind after a write audit, which results in high identification cost, a low recognition rate, and poor real-time performance and adaptability.

The embodiments of the invention are realized as follows: a method for identifying historical spam messages, the method comprising the following steps:

when a request to browse a specified web page is received, extracting the content feature information of the specified web page;

according to a feature recognition algorithm stored in a preset knowledge base, matching the content feature information of the specified web page against the feature information stored in the preset knowledge base to obtain a recognition result;

according to the recognition result, identifying whether the information in the specified web page is a historical spam message.

Another purpose of the embodiments of the invention is to provide a system for identifying historical spam messages, the system comprising:

a feature extraction unit, configured to extract the content feature information of a specified web page when a request to browse the specified web page is received;

a matching recognition unit, configured to match, according to a feature recognition algorithm stored in a preset knowledge base, the content feature information of the specified web page against the feature information stored in the preset knowledge base to obtain a recognition result; and

a recognition unit, configured to identify, according to the recognition result, whether the information in the specified web page is a historical spam message.

In the embodiments of the invention, when a request to browse a specified web page is received, the content feature information of the page is extracted in real time and, according to the feature recognition algorithm stored in the preset knowledge base, matched against the feature information stored in the preset knowledge base. The recognition result thus obtained is used to identify whether the information in the specified web page is a historical spam message. This solves the problem that the prior art removes only part of the spam at the content contribution stage and, for historical spam messages, must scan all historical data and remove them manually or semi-automatically, which leads to high identification cost, a low recognition rate, and poor real-time performance and adaptability. The embodiments reduce the identification cost, improve the recognition rate, and achieve better real-time performance and adaptability.

Description of drawings

Fig. 1 is a flowchart of the method for identifying historical spam messages provided by the first embodiment of the invention;

Fig. 2 is a structural diagram of the system for identifying historical spam messages provided by the second embodiment of the invention.

Detailed description of the embodiments

To make the purpose, technical solution and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the invention and are not intended to limit it.

In the embodiments of the invention, every time a web page is to be browsed, the historical spam identification method audits that page. That is, after the page and the information in it have been created and write-audited, one or more read audits are additionally performed, so that historical spam messages are filtered more effectively. Compared with existing methods such as manual review, the identification cost is lower and the real-time performance and adaptability are higher.

The specific implementation of the invention is described in detail below with reference to specific embodiments:

Read audit: the counterpart of the write audit. It refers to automatically reviewing and filtering content contributed by users, such as text and links, in network applications such as blogs, forums, message boards and comments. A read audit is triggered automatically when a web page produces a PV (page view), including for new content loaded by operations such as clicks while the current page is being browsed.
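The read-audit trigger described above can be sketched as a hook that runs on every browse request. This is a minimal illustration only: `handle_browse_request`, `load_page`, `read_audit` and `on_spam` are hypothetical names that do not appear in the patent, and returning `None` for a flagged page is just one possible handling.

```python
def handle_browse_request(page_id, load_page, read_audit, on_spam):
    """Run a read audit each time a page view (PV) is about to be produced.

    All four parameters are hypothetical stand-ins:
    load_page(page_id)  -> page content
    read_audit(content) -> True if the content is judged to be historical spam
    on_spam(page_id)    -> follow-up handling, e.g. cleaning or blocking the page
    """
    content = load_page(page_id)
    if read_audit(content):
        on_spam(page_id)
        return None  # assumption: a flagged page is not served
    return content
```

Pages that are never requested never reach this hook, which matches the idea that only browsed pages incur audit cost.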

Embodiment one:

Fig. 1 shows the flow of the method for identifying historical spam messages provided by the first embodiment of the invention, detailed as follows:

In step S101, when a request to browse a specified web page is received, the content feature information of the specified web page is extracted.

Before step S101 is executed, the specified web page must be created and write-audited using the prior art; the specified web page is thus a page that has already passed a write audit.

Specifically, in network applications where users contribute content, such as blogs, forums, message boards and comments, a user can publish and update content. When the user triggers or begins a publish or update operation, the content that the user writes must be reviewed and filtered, that is, write-audited, to prevent the user from publishing malicious, harmful or so-called spam messages. The review and filtering uses the prior art: based on the knowledge base and related algorithms, feature information such as the text, links, pictures and videos published by the write operation is recognized. At the same time, depending on how well recognition performs, an algorithm with a higher recognition rate for this feature information, or the latest recognition algorithm, can be selected and stored in the knowledge base, which drives knowledge base updates. In addition, because some spam messages cannot be distinguished by automatic recognition algorithms, the knowledge base can also be updated with features and rules extracted after manual review. In practice, however, malicious users tend to alter the spam they publish, and knowledge base updates tend to lag behind these changes. Spam posted before the knowledge base learns to recognize a new variant therefore cannot be handled automatically; in other words, the original knowledge base often cannot review and filter newly emerging spam. The spam that was not filtered out or not recognized becomes historical spam.

Spam messages on the network generally have, but are not limited to, the following characteristics:

1) They contain illegal links. An illegal link is a link to a website containing advertising, pornographic or politically reactionary content, and the like;

2) They contain obvious spam keywords, for example pornographic keywords, politically reactionary keywords, fraud-related keywords, and the like;

3) They contain characters that clearly do not belong in normal messages, for example special characters such as ←, ↑ and ↓.
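The three characteristics listed above can be illustrated with a small feature detector. This is a hedged sketch, not the patent's algorithm: the function name `detect_spam_features`, the URL pattern and the special-character set are assumptions chosen for illustration, and a real system would take its keyword and link lists from the knowledge base.

```python
import re

# Assumed, simplified URL pattern; real link extraction would parse the page markup.
LINK_RE = re.compile(r"https?://[^\s\"'<>]+")
# Assumed sample of characters that rarely appear in normal messages.
SPECIAL_CHARS = set("←↑↓→")

def detect_spam_features(text, illegal_links, spam_keywords):
    """Return the illegal links, spam keywords and special characters found in text."""
    links = [url for url in LINK_RE.findall(text) if url in illegal_links]
    keywords = [kw for kw in spam_keywords if kw in text]
    special = [ch for ch in text if ch in SPECIAL_CHARS]
    return {"link": links, "keyword": keywords, "special_char": special}
```

For example, a message containing a known bad URL, one listed keyword and two arrow characters yields one link hit, one keyword hit and two special-character hits.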

In a specific implementation, a web page generally contains spam content from which feature information is not easily extracted. To avoid missing spam that should be identified, the content of the specified web page may be preprocessed after the request to browse the page is received and before its content feature information is extracted. For example, a text preprocessing method may be used that includes: removing spaces and line breaks, converting English letters uniformly to lower case, converting Chinese character encodings, traditional-to-simplified conversion (FJZ), converting Japanese characters to spaces, converting special symbols to spaces, converting full-width digits to ASCII digits, converting full-width characters to ASCII characters, converting Chinese numerals to ASCII digits, and so on. This enables comprehensive extraction of the content feature information of the specified web page, where the content feature information includes one or more of web page links, keywords, pictures, videos and similar feature information. Further, step S101 extracts features from content such as text and links that users contribute in network applications such as blogs, forums, message boards and comments. The extraction can be triggered automatically when the web page produces a PV (page view), including for new content loaded by operations such as clicking items in the page while it is being browsed. Identification is thus performed as soon as a browse request is detected, that is, before the user views the page, which improves the real-time performance of identification.
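Part of the text preprocessing described above can be sketched in Python. This is an illustration under stated assumptions, not the patented procedure: `preprocess` is a hypothetical name, full-width to half-width conversion is approximated with Unicode NFKC normalization, Chinese numerals are mapped digit by digit only (compound forms such as 十 are not handled), Japanese is approximated by the kana range, and character-encoding and traditional-to-simplified conversion are omitted because they need external mapping tables.

```python
import re
import unicodedata

# Digit-by-digit mapping only; compound numerals are out of scope for this sketch.
CN_DIGITS = str.maketrans("零一二三四五六七八九", "0123456789")
# Hiragana and katakana blocks, used here as a rough stand-in for "Japanese".
KANA_RE = re.compile(r"[\u3040-\u30ff]")
# Anything that is neither a word character nor whitespace counts as a special symbol.
SYMBOL_RE = re.compile(r"[^\w\s]")

def preprocess(text):
    text = re.sub(r"\s+", "", text)             # remove spaces and line breaks
    text = unicodedata.normalize("NFKC", text)  # full-width digits/letters -> ASCII
    text = text.lower()                         # unify English to lower case
    text = text.translate(CN_DIGITS)            # Chinese numerals -> ASCII digits
    text = KANA_RE.sub(" ", text)               # Japanese kana -> space
    text = SYMBOL_RE.sub(" ", text)             # special symbols -> space
    return text
```

Whitespace is removed first, following the order in which the steps are listed, so the spaces introduced by the later substitutions remain as separators.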

In step S102, according to the feature recognition algorithm stored in the preset knowledge base, the content feature information of the specified web page is matched against the feature information stored in the preset knowledge base to obtain a recognition result.

The preset knowledge base is an updated version of the knowledge base that was used when the specified web page was write-audited. The update may include feature information of new spam messages, new feature recognition algorithms, and the like. The feature information of new spam may be new keywords or key characters, special characters, a library of illegal network links, and the like, and the feature recognition algorithm may be a recognition algorithm such as machine learning, Bayes or a support vector machine. A concrete update method may be to extract feature information, and rules usable by automatic recognition algorithms, from spam messages found by manual review, and to store them at a designated location in the preset knowledge base.
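The knowledge base update described above can be sketched as merging newly extracted features into per-type sets. The layout is an assumption made for illustration: the patent does not specify a storage format, and `update_knowledge_base` and the type names are hypothetical.

```python
def update_knowledge_base(knowledge_base, reviewed_spam_features):
    """Merge features extracted from manually reviewed spam into the knowledge base.

    knowledge_base:          dict mapping a feature type to a set of known spam features
    reviewed_spam_features:  dict mapping a feature type to an iterable of new features
    """
    for feature_type, values in reviewed_spam_features.items():
        knowledge_base.setdefault(feature_type, set()).update(values)
    return knowledge_base
```

Because a read audit consults the knowledge base at browse time, features added here apply to pages that were written before the update.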

In a specific implementation, based on the spam feature recognition algorithm or recognition rules stored in advance in the preset knowledge base, the content feature information of the specified web page is matched against the feature information stored in the preset knowledge base. For example, the keywords in the content feature information of the page are matched against all keywords or words stored in the preset knowledge base, or the web page links in the content feature information are matched against all web page links stored in the preset knowledge base, to judge whether the preset knowledge base contains keywords, links and so on that are identical or that satisfy a certain matching condition. This yields the recognition result, which includes the total number of successfully matched items of content feature information of the specified web page and the number of successfully matched items of each type, for example the number of successful matches for each of the types keyword, web page link and picture.
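The matching step can be sketched with exact set membership as the matching condition. This is a simplification: the text also allows other matching conditions and learned classifiers, and the names `match_features`, `per_type` and `total` are assumptions introduced for this example.

```python
def match_features(page_features, knowledge_base):
    """Count, per feature type and in total, the page features found in the knowledge base.

    page_features:  dict mapping a feature type to a list of features extracted from the page
    knowledge_base: dict mapping a feature type to a set of known spam features
    """
    per_type = {}
    for feature_type, values in page_features.items():
        known = knowledge_base.get(feature_type, set())
        per_type[feature_type] = sum(1 for value in values if value in known)
    return {"per_type": per_type, "total": sum(per_type.values())}
```

The returned dictionary carries both quantities named in the text: the per-type match counts and the overall match count.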

In step S103, according to the recognition result, it is identified whether the information in the specified web page is a historical spam message.

In a specific implementation, step S103 is as follows: judge whether the number of successfully matched items of content feature information of a specified type exceeds a first preset threshold, and/or whether the total number of successfully matched items of content feature information of the specified web page exceeds a second preset threshold. If so, the information in the specified web page is judged to be a historical spam message; otherwise, it is judged not to be a historical spam message. The first preset threshold and the second preset threshold may be equal or different, and each is a value set in advance by the user according to actual conditions. The specified type may be one or more designated types of content feature information. For example, when the number of successfully matched pictures of a certain class exceeds the first preset threshold, or when the numbers of successfully matched pictures and videos of a certain class each exceed the first preset threshold, the information in the specified web page is considered to be a historical spam message. Further, the historical spam messages identified by this method, or the specified web page itself, can then be processed, for example by reducing the spam in the page or by completely prohibiting the page from being opened.
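The threshold judgment of step S103 can be sketched as follows. The shape of `result` (a `per_type` dictionary of match counts plus a `total`) and the `require_both` flag, which selects between the "and" and "or" readings of "and/or", are assumptions made for this illustration.

```python
def is_historical_spam(result, specified_type, first_threshold, second_threshold,
                       require_both=False):
    """Apply the first threshold to one feature type and the second to the total count."""
    type_over = result["per_type"].get(specified_type, 0) > first_threshold
    total_over = result["total"] > second_threshold
    if require_both:
        return type_over and total_over
    return type_over or total_over
```

With four matched pictures and five matches in total, a first threshold of 3 flags the page under the "or" reading even when the second threshold of 10 is not exceeded.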

In the embodiment of the invention, when a request to browse a specified web page is received, the historical spam identification method extracts the content feature information of the page, matches it, according to the feature recognition algorithm stored in the preset knowledge base, against the feature information stored in the preset knowledge base, and then identifies from the recognition result whether the information in the page is a historical spam message. This solves the problem that the prior art removes only part of the spam through the write audit and, for the historical spam left behind after the write audit, must scan all historical data for manual or semi-automatic identification and removal, cannot identify it automatically, and therefore has high identification cost, a low recognition rate, and poor real-time performance and adaptability. Only web pages that users actually browse are audited; pages that are not viewed generally attract no attention and need not be audited. Spam is thereby identified at lower cost and with a higher recognition rate, better real-time performance and better adaptability.

A person of ordinary skill in the art will understand that all or part of the steps of the method in the above embodiment may be completed by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium such as ROM/RAM, a magnetic disk or an optical disc.

Embodiment two:

Fig. 2 shows the structure of the system for identifying historical spam messages provided by the second embodiment of the invention. For ease of description, only the parts relevant to the embodiment of the invention are shown.

The system for identifying historical spam messages comprises a feature extraction unit 21, a matching recognition unit 22 and a recognition unit 23, wherein:

The feature extraction unit 21 is configured to extract the content feature information of a specified web page when a request to browse the specified web page is received.

In the embodiment of the invention, before the feature extraction unit 21 is triggered, the specified web page must be created and write-audited using the prior art. The specified web page is thus a page that has passed a write audit, and the preset knowledge base is an updated version of the knowledge base that was used when the page was write-audited. The content feature information of the specified web page includes one or more of web page links, keywords, pictures and video information. As soon as a request to browse the page is detected, that is, before the user views the page, the content feature information of the page is extracted and spam is then identified, which improves the real-time performance of spam identification.

In addition, the system for identifying historical spam messages further comprises a preprocessing unit, configured to preprocess the content of the specified web page before its content feature information is extracted. For example, a text preprocessing method may be used that includes: removing spaces and line breaks, converting English letters uniformly to lower case, converting Chinese character encodings, traditional-to-simplified conversion (FJZ), converting Japanese characters to spaces, converting special symbols to spaces, converting full-width digits to ASCII digits, converting full-width characters to ASCII characters, converting Chinese numerals to ASCII digits, and so on. This prevents hard-to-recognize spam from being missed and enables comprehensive extraction of the content feature information of the specified web page.

The matching recognition unit 22 is configured to match, according to the feature recognition algorithm stored in the preset knowledge base, the content feature information of the specified web page against the feature information stored in the preset knowledge base to obtain a recognition result.

The preset knowledge base is an updated version of the knowledge base that was used when the specified web page was write-audited. The update may include feature information of new spam messages, new feature recognition algorithms, and the like. The feature information of new spam may be new keywords or key characters, special characters, a library of illegal network links, and the like, and the feature recognition algorithm may be a recognition algorithm such as machine learning, Bayes or a support vector machine. A concrete update method may be to extract feature information, and rules usable by automatic recognition algorithms, from spam messages found by manual review, and to store them at a designated location in the preset knowledge base. The recognition result includes the total number of successfully matched items of content feature information of the specified web page and/or the number of successfully matched items of each type.

The recognition unit 23 is configured to identify, according to the recognition result, whether the information in the specified web page is a historical spam message.

The recognition unit 23 specifically comprises a matching recognition unit 231 and a determination unit 232, wherein:

The matching recognition unit 231 is configured to judge whether the number of successfully matched items of content feature information of a specified type exceeds a first preset threshold, and/or whether the total number of successfully matched items of content feature information of the specified web page exceeds a second preset threshold; and

The determination unit 232 is configured to judge, when the output of the matching recognition unit is yes, that the information in the specified web page is a historical spam message.

In the embodiment of the invention, an existing feature recognition algorithm, such as a machine learning algorithm, can be used together with the spam information stored in advance in the knowledge base to recognize the content feature information in the specified web page and obtain a recognition result, for example the total number of recognized items of content feature information that belong to spam and the number of such items for each type of spam. The spam identified in the specified web page at this point is the historical spam left behind by the write audit. The matching recognition unit 231 then evaluates the recognition result, that is, it judges whether the number of successfully matched items of content feature information of a specified type exceeds the first preset threshold, and/or whether the total number of successfully matched items exceeds the second preset threshold. When the output is yes, the determination unit 232 judges that the information in the specified web page is a historical spam message; otherwise, the information in the specified web page is not a historical spam message.

In the embodiment of the invention, the feature extraction unit 21 of the system extracts the content feature information of the specified web page, the matching recognition unit 22 matches it, according to the feature recognition algorithm stored in the preset knowledge base, against the feature information stored in the preset knowledge base to obtain a recognition result, and the recognition unit 23 identifies from the recognition result whether the information in the page is a historical spam message. That is, after the page and the information in it have been created and write-audited, a read audit is additionally performed, so that historical spam messages are filtered more effectively. Compared with existing methods such as manual review, the identification cost is lower and the real-time performance and adaptability are higher.
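The cooperation of the three units can be sketched as a single class. This is a hedged illustration of the structure only: the class and attribute names are hypothetical, matching is reduced to exact set membership, and the "and/or" of the two thresholds is read as "or".

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set

@dataclass
class SpamRecognitionSystem:
    extract: Callable[[str], Dict[str, List[str]]]  # role of feature extraction unit 21
    knowledge_base: Dict[str, Set[str]]             # preset knowledge base (assumed layout)
    specified_type: str
    first_threshold: int
    second_threshold: int

    def match(self, features):
        """Role of matching recognition unit 22: per-type and total match counts."""
        per_type = {
            ftype: sum(1 for v in values if v in self.knowledge_base.get(ftype, set()))
            for ftype, values in features.items()
        }
        return per_type, sum(per_type.values())

    def recognize(self, page_text):
        """Role of recognition unit 23: compare the counts with the two thresholds."""
        per_type, total = self.match(self.extract(page_text))
        type_over = per_type.get(self.specified_type, 0) > self.first_threshold
        return type_over or total > self.second_threshold
```

Passing the extractor in as a callable keeps the sketch independent of how a page is parsed, which the embodiment leaves open.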

The embodiments of the invention extract the content feature information of a specified web page that is about to be browsed, perform historical spam identification on that content feature information, and judge whether the messages in the page are historical spam, so that the page can be processed accordingly. This solves the problem that the prior art usually scans all historical data and identifies historical spam manually or semi-automatically, cannot identify it automatically, and therefore suffers from high identification cost, slow response and poor adaptability. Adaptability, real-time performance and the recognition rate are improved without increasing the identification cost.

The above are only preferred embodiments of the invention and are not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within the protection scope of the invention.

Claims

1. A method for identifying historical spam messages, characterized in that the method comprises the following steps:

2. The method of claim 1, characterized in that the specified web page is a web page that has passed a write audit, and the preset knowledge base is an updated version of the knowledge base used when the specified web page was write-audited.

3. The method of claim 2, characterized in that the content feature information of the specified web page includes one or more of web page links, keywords, pictures and video information.

4. The method of claim 3, characterized in that the recognition result includes the total number of successfully matched items of content feature information of the specified web page and/or the number of successfully matched items of each type of content feature information of the specified web page.

5. The method of claim 4, characterized in that the step of identifying, according to the recognition result, whether the information in the specified web page is a historical spam message is specifically:

judging whether the number of successfully matched items of content feature information of a specified type of the specified web page exceeds a first preset threshold, and/or whether the total number of successfully matched items of content feature information of the specified web page exceeds a second preset threshold;

if so, judging that the information in the specified web page is a historical spam message; if not, judging that the information in the specified web page is not a historical spam message.

6. A system for identifying historical spam messages, characterized in that the system comprises:

7. The system of claim 6, characterized in that the specified web page is a web page that has passed a write audit, and the preset knowledge base is an updated version of the knowledge base used when the specified web page was write-audited.

8. The system of claim 7, characterized in that the content feature information of the specified web page includes one or more of web page links, keywords, pictures and video information.

9. The system of claim 8, characterized in that the recognition result includes the total number of successfully matched items of content feature information of the specified web page and/or the number of successfully matched items of each type of content feature information of the specified web page.

10. The system of claim 9, characterized in that the recognition unit specifically comprises:

a matching recognition unit, configured to judge whether the number of successfully matched items of content feature information of a specified type of the specified web page exceeds a first preset threshold, and/or whether the total number of successfully matched items of content feature information of the specified web page exceeds a second preset threshold; and

a determination unit, configured to judge, when the output of the matching recognition unit is yes, that the information in the specified web page is a historical spam message.