CN102982048A - Method and device for assessing junk information mining rule - Google Patents

Method and device for assessing junk information mining rule

Publication number
CN102982048A
Authority
CN
China
Prior art keywords
information
mining rule
evaluating
rule
junk
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2011102642216A
Other languages
Chinese (zh)
Other versions
CN102982048B (en)
Inventor
李彦宏
舒迅
帅帅
尹佳
罗亮
王波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201110264221.6A
Publication of CN102982048A
Application granted
Publication of CN102982048B
Legal status: Active

Abstract

The invention provides a method and a device for assessing a junk information mining rule. The assessing device obtains the mining rule to be assessed and an information publish sample for assessing it, then uses the mining rule to mine junk information from the information publish sample and obtains at least one evaluation parameter corresponding to the mining rule. Compared with the prior art, obtaining at least one evaluation parameter for the mining rule under assessment gives the interactive platform manager an index for assessing that rule, so that the rule can be further optimized and updated to improve the evaluation parameters. The interactive platform can then judge and process junk information more accurately, which safeguards its normal operation.

Description

A method and device for assessing a junk information mining rule
Technical field
The present invention relates to the field of network technology, and in particular to a technique for assessing a junk information mining rule.
Background
With the development and application of Internet technology, more and more users publish and receive large amounts of information through open interactive platforms, making full use of the Internet for information exchange and resource sharing. This information, however, contains a large amount of junk information. Junk information may be information that is published in batches or that serves an illegal purpose; it occupies substantial network resources and creates serious network security risks. Existing open interactive platforms have all taken certain measures, mining junk information in order to detect and process it. However, the interactive platform manager cannot know whether the junk information on the platform is being mined effectively, and therefore cannot optimize the mining and detection methods accordingly, so the goals of saving network resources and keeping the open interactive platform clean cannot be guaranteed.
Therefore, how to assess a junk information mining rule effectively has become one of the problems that urgently need to be solved.
Summary of the invention
The purpose of the present invention is to provide a method and device for assessing a junk information mining rule.
According to one aspect of the present invention, a method for assessing a junk information mining rule is provided, the method comprising the following steps:
A. obtaining a mining rule to be assessed;
B. obtaining an information issue sample for assessing the mining rule;
C. performing junk information mining on the information issue sample based on the mining rule, to obtain the junk information corresponding to the information issue sample;
D. obtaining, according to the junk information and in combination with the information issue sample, at least one evaluation parameter corresponding to the mining rule.
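Steps A to D can be sketched end to end as follows. This is a minimal illustration, not the patented implementation: the `Message` type, the `assess` function, the toy rule and the four sample messages are all assumptions introduced here, and the two evaluation parameters are the accuracy rate and recall rate defined later in the description.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Message:
    publisher_id: str
    content: str
    is_junk: bool  # junk information flag labeled in advance (hypothetical field)


Rule = Callable[[Message], bool]


def assess(rule: Rule, sample: list[Message]) -> dict[str, float]:
    # Step C: mine junk information from the sample with the rule.
    mined = [m for m in sample if rule(m)]
    # Step D: compare the mined messages with the labels in the sample.
    true_hits = sum(m.is_junk for m in mined)
    real_junk = sum(m.is_junk for m in sample)
    return {
        "accuracy_rate": true_hits / len(mined) if mined else 0.0,
        "recall_rate": true_hits / real_junk if real_junk else 0.0,
    }


# Steps A and B: a toy rule to be assessed and a toy labeled sample.
rule = lambda m: "slimming drugs" in m.content
sample = [
    Message("u1", "cheap slimming drugs here", True),
    Message("u2", "are slimming drugs safe?", False),
    Message("u3", "certificate handling, call now", True),
    Message("u4", "happy day, everyone", False),
]
result = assess(rule, sample)
```

In this toy run the rule mines two messages, one of which is real junk, and the sample holds two real junk messages, so both parameters come out at one half.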
According to another aspect of the present invention, a device for assessing a junk information mining rule is also provided, the device comprising:
a rule device, for obtaining a mining rule to be assessed;
a sample acquisition device, for obtaining an information issue sample for assessing the mining rule;
a junk information deriving means, for performing junk information mining on the information issue sample based on the mining rule, to obtain the junk information corresponding to the information issue sample;
a parameter obtaining device, for obtaining, according to the junk information and in combination with the information issue sample, at least one evaluation parameter corresponding to the mining rule.
Compared with the prior art, the present invention obtains at least one evaluation parameter corresponding to the mining rule to be assessed and thereby gives the interactive platform manager an index for assessing that rule. The rule can then be optimized and updated to improve each evaluation parameter, so that the interactive platform can judge and process junk information more accurately. This safeguards the normal operation of the interactive platform and further serves the goals of saving network resources and keeping the open interactive platform clean.
Description of drawings
Other features, objects and advantages of the present invention will become more apparent from the following detailed description of non-limiting embodiments, read with reference to the accompanying drawings:
Fig. 1 is a schematic diagram of a device for assessing a junk information mining rule according to one aspect of the present invention;
Fig. 2 is a schematic diagram of a device for assessing a junk information mining rule according to a preferred embodiment of the present invention;
Fig. 3 is a flow chart of a method for assessing a junk information mining rule according to another aspect of the present invention;
Fig. 4 is a flow chart of a method for assessing a junk information mining rule according to a preferred embodiment of the present invention.
The same or similar reference numerals in the drawings denote the same or similar components.
Embodiments
The present invention is described in further detail below with reference to the accompanying drawings.
Fig. 1 is a schematic diagram of a device for assessing a junk information mining rule according to one aspect of the present invention. Assessment apparatus 1 comprises rule device 11, sample acquisition device 12, junk information deriving means 13 and parameter obtaining device 14. Here, assessment apparatus 1 includes, but is not limited to, a computer, a network host, a single network server, a set of multiple network servers, or a cloud made up of multiple servers. The cloud consists of a large number of computers or network servers based on cloud computing, where cloud computing is a form of distributed computing: a virtual supercomputer composed of a group of loosely coupled computers.
Specifically, rule device 11 obtains the mining rule to be assessed. More specifically, rule device 11 obtains the mining rule either in real time in response to an event trigger, for example by listening in real time for requests carrying a mining rule to be assessed that are sent by network devices such as network servers, or periodically, by reading the mining rule directly from other components of assessment apparatus 1 or from a third-party device over an agreed communication method, such as the http or https communication protocols. For example, suppose assessment apparatus 1 is a network server. Rule device 11 of this network server listens in real time to another network server that performs junk information mining and receives an http request, sent by that other server over the http communication protocol, in which the mining rule to be assessed is encapsulated; rule device 11 parses this http request and extracts the mining rule from it. As another example, rule device 11 periodically calls a predetermined application programming interface (API) to send a third-party device a request for the mining rule to be assessed, and receives the mining rule that the third-party device returns. Those skilled in the art will understand that the above ways of obtaining the mining rule to be assessed are only examples; other existing or future ways of obtaining it, where applicable to the present invention, also fall within the scope of protection of the present invention and are incorporated herein by reference.
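The request-parsing step can be illustrated with a small sketch. The patent does not specify how a mining rule is encoded inside the http request, so the JSON layout, the field names (`mining_rule`, `rule_id`, `conditions`, `combine`) and the function `parse_rule_request` are all hypothetical choices made only for this example.

```python
import json


def parse_rule_request(body: bytes) -> dict:
    """Extract a mining rule from a (hypothetical) JSON request body."""
    payload = json.loads(body.decode("utf-8"))
    rule = payload["mining_rule"]
    # Reject rules that lack the fields this sketch relies on.
    for key in ("rule_id", "conditions"):
        if key not in rule:
            raise ValueError(f"mining rule is missing {key!r}")
    return rule


# A request body as another server might send it (assumed format).
body = json.dumps({
    "mining_rule": {
        "rule_id": "r-001",
        "conditions": [{"type": "blacklist"}, {"type": "junk_vocabulary"}],
        "combine": "any",
    }
}).encode("utf-8")
rule = parse_rule_request(body)
```

Validating the rule on receipt keeps malformed rules out of the later mining and evaluation steps.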
Sample acquisition device 12 obtains the information issue sample for assessing the mining rule. Specifically, sample acquisition device 12 obtains multiple posted messages, for example by randomly extracting them from a network interactive platform over a pre-agreed communication protocol, or by retrieving them from an information issue sample library. These posted messages are labeled in advance with a junk information flag that distinguishes junk information from normal information, and together they serve as the information issue sample for assessing the mining rule obtained by rule device 11. The junk information flag identifies whether each posted message is real junk information. Here, the information issue sample includes, but is not limited to: 1) multiple posted messages and their content, such as posts in an online community and their content; 2) the junk information flags. The information issue sample library stores posted messages and their junk information flags, and includes, but is not limited to, a relational database, memory, hard disk storage, and the like. For example, suppose the posted messages of a network interactive platform are stored on a network server. Sample acquisition device 12 sends this server, over a pre-agreed communication protocol such as http or https, a request for an information issue sample for assessing the mining rule, and receives multiple posted messages, labeled with junk information flags, that the server has drawn at random from the network interactive platform; these serve as the information issue sample for assessing the mining rule obtained by rule device 11. The network interactive platform includes, but is not limited to, online communities, post bars (forum boards), blogs, microblogs, news comments, interactive message boards, and the like. As another example, sample acquisition device 12 obtains real junk information and non-junk information from the information issue sample library in a certain proportion, and uses them as the information issue sample for assessing the mining rule obtained by rule device 11. Those skilled in the art will understand that the above ways of obtaining the information issue sample are only examples; other existing or future ways of obtaining it, where applicable to the present invention, also fall within the scope of protection of the present invention and are incorporated herein by reference.
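Drawing real junk information and non-junk information in a fixed proportion might look like the sketch below. The in-memory `library` list, its `is_junk` field, the function `draw_sample` and the 20% ratio are assumptions for illustration; an actual sample library could equally be a relational database.

```python
import random


def draw_sample(library, n_total, junk_ratio, seed=0):
    """Draw n_total labeled messages, junk_ratio of which are real junk."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    junk = [m for m in library if m["is_junk"]]
    normal = [m for m in library if not m["is_junk"]]
    n_junk = round(n_total * junk_ratio)
    # random.sample raises ValueError if the library holds too few messages.
    sample = rng.sample(junk, n_junk) + rng.sample(normal, n_total - n_junk)
    rng.shuffle(sample)
    return sample


# Hypothetical library: 200 labeled messages, every fourth one is junk.
library = [
    {"id": i, "content": f"message {i}", "is_junk": i % 4 == 0}
    for i in range(200)
]
sample = draw_sample(library, n_total=50, junk_ratio=0.2)
```

Fixing the proportion makes evaluation parameters comparable across runs, since the share of real junk in the sample does not drift.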
Those skilled in the art will understand that the execution order of rule device 11 and sample acquisition device 12 described above is only an example; in practice they may be executed in any order, for example in parallel or serially. Those skilled in the art will also understand that, for simplicity, Fig. 1 shows only the order in which rule device 11 executes before sample acquisition device 12, and that this omission does not affect the clear and complete disclosure of the present invention.
Next, junk information deriving means 13 performs junk information mining on the information issue sample based on the mining rule, and obtains the junk information corresponding to the information issue sample. Specifically, junk information deriving means 13 applies the mining rules obtained by rule device 11 (for example, whether the posting frequency of an information publisher ID exceeds a predetermined frequency threshold, whether the information publisher is on a blacklist, or whether the content of a posted message contains junk vocabulary) to judge each posted message in the information issue sample obtained by sample acquisition device 12. When one or more posted messages satisfy any one of the mining rules, or all of the mining rules, those messages are judged to be junk information, and in this way all the junk information in the information issue sample is obtained.
For example, suppose the mining rule obtained by rule device 11 is that a posted message is junk information if its information publisher ID is on the blacklist or its content contains junk vocabulary. The information issue sample obtained by sample acquisition device 12 contains three posted messages, whose contents are:
a "Certificate handling, call 13811112222",
b "Happy day, everyone",
c "I'd like to make friends";
Then, based on these two mining conditions, junk information deriving means 13 judges the three posted messages: it matches the content of message a against the junk lexicon by string matching and finds the junk term "certificate handling", and it finds that the information publisher ID of message c is on the blacklist. It therefore judges messages a and c in this information issue sample to be junk information.
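The three-message example can be reproduced with a short sketch. The publisher IDs (`user_a`, `user_b`, `user_c`), the one-entry blacklist and the one-term junk lexicon are assumptions made for the example, and the message texts are English paraphrases of the originals.

```python
BLACKLIST = {"user_c"}                     # hypothetical blacklist
JUNK_LEXICON = {"certificate handling"}    # hypothetical junk lexicon


def is_junk(publisher_id: str, content: str) -> bool:
    """Rule: publisher is blacklisted OR content contains a junk term."""
    on_blacklist = publisher_id in BLACKLIST
    has_junk_term = any(term in content.lower() for term in JUNK_LEXICON)
    return on_blacklist or has_junk_term


messages = {
    "a": ("user_a", "Certificate handling, call 13811112222"),
    "b": ("user_b", "Happy day, everyone"),
    "c": ("user_c", "I'd like to make friends"),
}
mined = [key for key, (pid, text) in messages.items() if is_junk(pid, text)]
```

Message a is caught by the lexicon match and message c by the blacklist, which matches the outcome described in the text.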
As another example, suppose the mining rule obtained by rule device 11 is that a posted message is junk information if the frequency with which one information publisher ID posts the same content exceeds a predetermined frequency threshold and the message contains junk vocabulary. The information issue sample obtained by sample acquisition device 12 contains 20 posted messages, 10 of which have the content "This store sells all kinds of slimming drugs at favorable prices", share the same information publisher ID, and were sent within 1 minute. Based on these two mining conditions, junk information deriving means 13 analyzes these 10 posted messages and determines that their content is identical and that they were posted by the same information publisher ID. It can therefore judge that the 10 messages are consecutive postings of the same information, and that the posting frequency of 10 per minute exceeds the predetermined frequency threshold of 5 per minute. At the same time, junk information deriving means 13 matches the content against the junk lexicon by string matching and finds the junk terms "sells" and "slimming drugs". Junk information deriving means 13 therefore judges these 10 posted messages in the information issue sample to be junk information. Here, the junk vocabulary in the illustrated embodiments includes, but is not limited to, banned words, infringing words, indecent words, political words, inflammatory words, advertising words, and the like; the junk lexicon in the illustrated embodiments stores the junk vocabulary and includes, but is not limited to, a relational database, memory, hard disk storage, and the like. Those skilled in the art will understand that the above ways of obtaining junk information are only examples; other existing or future ways of obtaining junk information, where applicable to the present invention, also fall within the scope of protection of the present invention and are incorporated herein by reference.
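A rule that requires both conditions (same content posted too frequently by one publisher ID, and junk vocabulary in the content) could be sketched as follows. The tuple layout `(publisher_id, timestamp_seconds, content)`, the sliding 60-second window and the function `mine` are assumptions; the description only states the threshold of 5 per minute.

```python
from collections import defaultdict

FREQ_THRESHOLD = 5                           # posts per minute, from the example
JUNK_LEXICON = {"sells", "slimming drugs"}   # junk terms from the example


def mine(posts, window=60):
    """Return indices of posts judged junk by the frequency AND vocabulary rule."""
    groups = defaultdict(list)
    for idx, (pid, ts, text) in enumerate(posts):
        groups[(pid, text)].append((ts, idx))  # same publisher, same content
    junk = []
    for (pid, text), items in groups.items():
        times = sorted(ts for ts, _ in items)
        # Largest number of identical posts inside any window-second span.
        peak = max(
            sum(1 for t in times if start <= t < start + window)
            for start in times
        )
        has_term = any(term in text.lower() for term in JUNK_LEXICON)
        if peak > FREQ_THRESHOLD and has_term:
            junk.extend(idx for _, idx in items)
    return sorted(junk)


spam_text = "This store sells all kinds of slimming drugs at favorable prices"
posts = [("shop_1", 5 * i, spam_text) for i in range(10)]   # 10 posts in 45 s
posts += [(f"user_{i}", 7 * i, f"ordinary post {i}") for i in range(10)]
junk_indices = mine(posts)
```

Only the ten identical posts from `shop_1` are mined; the ten ordinary posts fail both conditions.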
Next, parameter obtaining device 14 obtains, according to the junk information and in combination with the information issue sample, at least one evaluation parameter corresponding to the mining rule. Specifically, parameter obtaining device 14 compares the junk information that junk information deriving means 13 obtained by junk information mining with the posted messages and junk information flags contained in the information issue sample obtained by sample acquisition device 12, and thereby determines how many of the mined messages are real junk information and how many are not. Parameter obtaining device 14 then uses these counts, together with the number of posted messages in the information issue sample, to obtain at least one evaluation parameter, such as the recall rate of the mining rule. The evaluation parameters include, but are not limited to: 1) the recall rate corresponding to the mining rule, computed as "recall rate = number of real junk messages obtained by junk information mining / number of real junk messages in the information issue sample"; 2) the accuracy rate (that is, the precision) corresponding to the mining rule, computed as "accuracy rate = number of real junk messages obtained by junk information mining / number of junk messages obtained by junk information mining". For example, suppose the information issue sample contains 500 posted messages, of which 100 are marked by the junk information flag as real junk information, and junk information deriving means 13 mines 80 junk messages from this sample. Parameter obtaining device 14 compares the mined junk information with the real junk information in the sample and finds that 40 of the mined messages are real junk information. By the formula "accuracy rate = number of real junk messages obtained by junk information mining / number of junk messages obtained by junk information mining", it computes an accuracy rate of 50% (= 40/80); by the formula "recall rate = number of real junk messages obtained by junk information mining / number of real junk messages in the information issue sample", it computes a recall rate of 40% (= 40/100). Those skilled in the art will understand that the above ways of obtaining evaluation parameters are only examples; other existing or future ways of obtaining evaluation parameters, where applicable to the present invention, also fall within the scope of protection of the present invention and are incorporated herein by reference.
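The two formulas and the worked figures above (80 mined, 40 of them real, 100 real junk messages in the sample) can be checked with a few lines. The function name `evaluation_parameters` and the zero-division guards are additions for this sketch.

```python
def evaluation_parameters(mined_total: int, mined_real: int, sample_real: int):
    """Return (accuracy_rate, recall_rate) as defined by the two formulas."""
    # accuracy rate = real junk among mined / all mined
    accuracy_rate = mined_real / mined_total if mined_total else 0.0
    # recall rate = real junk among mined / real junk in the sample
    recall_rate = mined_real / sample_real if sample_real else 0.0
    return accuracy_rate, recall_rate


accuracy_rate, recall_rate = evaluation_parameters(
    mined_total=80, mined_real=40, sample_real=100
)
```

Note that the total of 500 posted messages does not enter either formula; only the three counts passed to the function matter.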
Preferably, rule device 11, sample acquisition device 12, junk information deriving means 13 and parameter obtaining device 14 operate continuously with one another. Specifically, rule device 11 obtains the mining rule to be assessed; sample acquisition device 12 obtains the information issue sample for assessing the mining rule; junk information deriving means 13 then performs junk information mining on the information issue sample based on the mining rule and obtains the junk information corresponding to the sample; and parameter obtaining device 14 then obtains, according to the junk information and in combination with the information issue sample, at least one evaluation parameter corresponding to the mining rule. Here, those skilled in the art will understand that "continuously" means that each device carries out, respectively, the obtaining of the mining rule to be assessed, the obtaining of the information issue sample, the obtaining of the junk information and the obtaining of the evaluation parameters according to an operating mode that is either preset or adjusted in real time, until rule device 11 stops obtaining mining rules to be assessed for an extended period.
It should be noted that the numerical values in the examples are illustrative only; they are intended to aid understanding of the present invention and are not real data from practical application. Unless otherwise stated, numerical values appearing elsewhere in this document serve the same function, which for brevity is not repeated.
Preferably, sample acquisition device 12 obtains from the information issue sample library, according to the mining rule, an information issue sample that corresponds to the mining rule. Specifically, based on the mining rule obtained by rule device 11, sample acquisition device 12 may run a matching query in the information issue sample library: when a mining rule matches the mining rule recorded for a posted message in the library, that message is retrieved, and all messages retrieved by the matching query are used as the information issue sample. Alternatively, it may query the information issue sample library to obtain a certain number of junk messages, or junk messages that those mining rules previously failed to mine, as the information issue sample. For example, suppose the mining rule obtained by rule device 11 is that a posted message is junk information if its information publisher ID is on the blacklist. According to this rule, sample acquisition device 12 either randomly selects several information publisher IDs from the blacklist and runs a matching query in the information issue sample library with these IDs to obtain a number of posted messages, or matches the information publisher IDs of all posted messages in the library against the blacklist, finds 200 information publisher IDs that are on the blacklist, and obtains the posted messages corresponding to these 200 IDs, to serve as the information issue sample. As another example, rule device 11 obtains a mining rule; sample acquisition device 12 then runs a matching query in the information issue sample library with the mining rule ID that identifies the rule, obtains the junk information corresponding to this mining rule ID together with a record of whether the corresponding mining rule successfully mined it, extracts all the junk information that the corresponding mining rule failed to mine, and uses a certain proportion (such as 50%) of it as the information issue sample. Those skilled in the art will understand that the above ways of obtaining the information issue sample are only examples; other existing or future ways of obtaining it, where applicable to the present invention, also fall within the scope of protection of the present invention and are incorporated herein by reference.
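The last variant, selecting junk information that a rule previously failed to mine and keeping a proportion of it, might be sketched as below. The record fields (`rule_id`, `is_junk`, `mined`), the in-memory library and the function `missed_junk_sample` are hypothetical; the 50% proportion is taken from the text.

```python
import random


def missed_junk_sample(library, rule_id, proportion=0.5, seed=0):
    """Pick a proportion of the junk messages that rule_id failed to mine."""
    missed = [
        r for r in library
        if r["rule_id"] == rule_id and r["is_junk"] and not r["mined"]
    ]
    k = int(len(missed) * proportion)
    return random.Random(seed).sample(missed, k)


# Hypothetical library: 12 junk records for rule r-001, 8 for rule r-002;
# every third record was mined successfully in the past.
library = [
    {
        "msg_id": i,
        "rule_id": "r-001" if i < 12 else "r-002",
        "is_junk": True,
        "mined": i % 3 == 0,
    }
    for i in range(20)
]
sample = missed_junk_sample(library, "r-001")
```

Sampling the previously missed junk focuses the assessment on the cases where the rule is known to be weak.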
Fig. 2 is a schematic diagram of a device for assessing a junk information mining rule according to a preferred embodiment of the present invention, in which parameter obtaining device 14' further comprises result acquiring unit 141' and parameter acquiring unit 142'. Specifically, result acquiring unit 141' compares the actual junk information preset in the information issue sample with the mined junk information, and obtains a comparative analysis result corresponding to the junk information; parameter acquiring unit 142' then obtains the at least one evaluation parameter according to the comparative analysis result. Devices 11'-13' shown in Fig. 2 are identical to devices 11-13 described above with reference to Fig. 1; for brevity, they are incorporated herein by reference and are not described again.
More specifically, result acquiring unit 141' compares, one by one, the actual junk information preset in the information issue sample obtained by sample acquisition device 12' with the junk information that junk information deriving means 13' mined based on the mining rule, to obtain the comparative analysis result corresponding to that junk information. The comparative analysis result includes, but is not limited to: 1) the number of real junk messages among the mined junk information; 2) the number of non-junk messages among the mined junk information; 3) the keywords in the content of the non-junk messages among the mined junk information; 4) the credit rating of the information publishers of those non-junk messages; 5) the posting frequency of the information publishers of the real junk messages; and so on. For example, suppose the information issue sample obtained by sample acquisition device 12' contains 20 posted messages, 10 of which are real junk information. Junk information deriving means 13' mines 6 junk messages from this sample based on the mining rule. Result acquiring unit 141' then compares the mined junk information with the real junk information in the sample and finds that 4 of the mined messages are real junk information, that these 4 messages were posted by the same information publisher ID, and that this publisher's posting frequency is 4 per minute.
Parameter acquiring unit 142' then computes at least one evaluation parameter by formula from the comparative analysis result obtained by result acquiring unit 141', such as the accuracy rate corresponding to the mining rule obtained by rule device 11'. Continuing the example above: the information issue sample obtained by sample acquisition device 12' contains 20 posted messages, 10 of which are real junk information; junk information deriving means 13' mines 6 junk messages based on the mining rule; and result acquiring unit 141' determines that 4 of them are real junk information. Parameter acquiring unit 142' then computes by formula an accuracy rate of 67% (= 4/6) and a recall rate of 40% (= 4/10).
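The split between the result acquiring unit and the parameter acquiring unit can be mirrored with two small functions over sets of message IDs. The function names `compare` and `parameters`, the dictionary keys and the concrete IDs are assumptions; the counts (20 messages, 10 real junk, 6 mined, 4 real among the mined) follow the example.

```python
def compare(real_junk_ids: set, mined_ids: set) -> dict:
    """Comparative analysis: count real and non-junk messages among the mined."""
    return {
        "real_in_mined": len(mined_ids & real_junk_ids),
        "non_junk_in_mined": len(mined_ids - real_junk_ids),
        "real_total": len(real_junk_ids),
        "mined_total": len(mined_ids),
    }


def parameters(result: dict):
    """Derive (accuracy_rate, recall_rate) from the comparison result."""
    mined, real = result["mined_total"], result["real_total"]
    accuracy_rate = result["real_in_mined"] / mined if mined else 0.0
    recall_rate = result["real_in_mined"] / real if real else 0.0
    return round(accuracy_rate, 2), round(recall_rate, 2)


real_junk_ids = set(range(10))        # messages 0-9 are flagged as real junk
mined_ids = {0, 1, 2, 3, 10, 11}      # 6 mined messages, 4 of them real junk
result = compare(real_junk_ids, mined_ids)
accuracy_rate, recall_rate = parameters(result)
```

Keeping the comparison result as its own structure leaves room for the other items the description lists, such as keywords or publisher credit ratings of the wrongly mined messages.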
Those skilled in the art will understand that the above ways of obtaining the comparative analysis result and the evaluation parameters are only examples; other existing or future ways of obtaining the comparative analysis result or the evaluation parameters, where applicable to the present invention, also fall within the scope of protection of the present invention and are incorporated herein by reference.
Preferably, the mining rule performs junk information mining on the information issue sample based on at least one of the following:
- the information posting frequency;
- the information posting content;
- the information publisher's historical behavior record;
- the information publisher's attributes.
1) Specifically, the information posting frequency includes, but is not limited to: the posting frequency of one information publisher, the posting frequency of posted messages with identical content, the posting frequency of messages from the same IP address, and the like. For example, the information issue sample contains 10 posted messages. Junk information deriving means 13' analyzes them and determines that 6 of the 10 were posted by the same information publisher ID within 1 minute; this publisher's posting frequency of 6 per minute exceeds the predetermined frequency threshold of 5 per minute, so these 6 posted messages are judged to be junk information.
2) The information posting content includes, but is not limited to: junk vocabulary contained in the posted content, multiple posted messages having identical content, and the like. For example, the information issue sample contains 3 posted messages, whose contents are:
a "Certificate handling, call 13811112222",
b "Happy day, everyone",
c "I'd like to make friends";
Junk information deriving means 13' matches the contents of these 3 posted messages against the junk lexicon by string matching, finds the junk term "certificate handling" in message a, and accordingly judges message a to be junk information.
3) The information publisher's historical behavior record includes, but is not limited to: the content of the publisher's past posted messages, the times at which the publisher posted messages in the past, the publisher's historical online duration, and the like. For example, junk information deriving means 13' takes the information publisher ID of one posted message in the information issue sample and runs a matching query in the historical behavior database. If it finds that all of this publisher's past messages were posted between 1:00 AM and 6:00 AM and that the content of those past messages contains junk vocabulary, it judges the posted message to be junk information. Here, the historical behavior database in the illustrated embodiment stores the historical behavior records of information publishers and includes, but is not limited to, a relational database, memory, hard disk storage, and the like.
4) The information publisher's attributes include, but are not limited to: whether the information publisher is on the blacklist, and the personal background information that the publisher entered in advance. For example, junk information deriving means 13' matches the information publisher IDs of all posted messages in the information issue sample against the blacklist, finds that the publishers of two posted messages are on the blacklist, and judges these two posted messages to be junk information.
Those skilled in the art will understand that the above four bases can be used not only individually but also in combination to perform junk information mining on the information issue sample. Those skilled in the art will also understand that the above junk information mining rules are only examples; other existing or future junk information mining rules, where applicable to the present invention, also fall within the scope of protection of the present invention and are incorporated herein by reference.
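Using the four bases individually or in combination can be expressed as composable predicates. The sketch below is only one possible arrangement: the post fields (`posts_per_minute`, `past_junk_posts`, and so on), the blacklist and lexicon contents, and the `combine` helper are all assumptions introduced for illustration.

```python
BLACKLIST = {"user_9"}                     # hypothetical blacklist
JUNK_LEXICON = {"certificate handling"}    # hypothetical junk lexicon


def by_frequency(post):
    return post["posts_per_minute"] > 5


def by_content(post):
    return any(term in post["content"].lower() for term in JUNK_LEXICON)


def by_history(post):
    return post["past_junk_posts"] > 0


def by_attribute(post):
    return post["publisher_id"] in BLACKLIST


def combine(conditions, mode="any"):
    """Build one rule from several conditions: 'any' (OR) or 'all' (AND)."""
    op = any if mode == "any" else all
    return lambda post: op(cond(post) for cond in conditions)


post = {
    "publisher_id": "user_1",
    "content": "Certificate handling, call now",
    "posts_per_minute": 2,
    "past_junk_posts": 0,
}
either = combine([by_frequency, by_content, by_history, by_attribute], "any")(post)
both = combine([by_frequency, by_content], "all")(post)
```

Here the post is caught when any single basis suffices (the content matches the lexicon) but not when frequency and content must both be satisfied.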
In a preferred embodiment (with reference to Fig. 2), the assessment apparatus 1 further comprises a rule optimization device (not shown), which optimizes the mining rule according to the evaluation parameter. The preferred embodiment is described in detail below with reference to Fig. 2. The rule acquisition device 11' obtains the mining rule to be assessed; the sample acquisition device 12' obtains the information issue sample used to assess the mining rule; the junk information acquisition device 13' performs junk information mining on the information issue sample based on the mining rule and obtains the junk information corresponding to the information issue sample; the result acquisition unit 141' in the parameter acquisition device 14' compares the actual junk information preset in the information issue sample with the mined junk information and obtains a comparative analysis result corresponding to the junk information; and the parameter acquisition unit 142' in the parameter acquisition device 14' obtains the at least one evaluation parameter according to the comparative analysis result. The detailed processes are the same as those performed by the rule acquisition device 11', the sample acquisition device 12', the junk information acquisition device 13' and the parameter acquisition device 14' in the embodiment described above with reference to Fig. 2; for brevity, they are incorporated herein by reference and are not repeated.
Specifically, the rule optimization device optimizes the mining rule according to the evaluation parameter obtained by the parameter acquisition unit 142', such as the accuracy rate corresponding to the mining rule. For example, when the accuracy rate is below a preset accuracy rate threshold, the mining rule is adjusted so that posted messages from information publishers with a high credit rating are not subjected to junk information mining, which improves the accuracy rate. For instance, suppose the accuracy rate calculated by formula by the parameter acquisition unit 142' is 50%. The rule optimization device determines that 50% is below the preset accuracy rate threshold of 60% and therefore adjusts the mining rule so that posted messages from information publishers with a high credit rating are excluded from junk information mining, thereby improving the accuracy rate. Those skilled in the art will understand that the above way of optimizing the mining rule is merely an example; other existing or future ways of optimizing the mining rule, if applicable to the present invention, shall also fall within the scope of protection of the present invention and are incorporated herein by reference.
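The accuracy-driven adjustment described above can be sketched as follows. This is only an illustrative sketch and not the implementation of the invention: the `MiningRule` fields and the function name `optimize_for_accuracy` are hypothetical, and only the 50% and 60% figures come from the example in the text.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class MiningRule:
    """Hypothetical container for the tunable settings of a mining rule."""
    frequency_threshold: int = 5             # posts per minute
    rubbish_word_threshold: int = 3          # rubbish words per message
    skip_high_credit_publishers: bool = False


def optimize_for_accuracy(rule: MiningRule, accuracy: float,
                          accuracy_threshold: float = 0.60) -> MiningRule:
    """Exempt high-credit publishers from mining when accuracy is too low."""
    if accuracy < accuracy_threshold:
        return replace(rule, skip_high_credit_publishers=True)
    return rule


adjusted = optimize_for_accuracy(MiningRule(), accuracy=0.50)
print(adjusted.skip_high_credit_publishers)  # True
```

Returning a new rule object rather than mutating the old one keeps the rule that produced each evaluation parameter available for comparison.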
Preferably, the rule optimization device may also optimize the mining rule according to the evaluation parameter in combination with the comparative analysis result. Specifically, the rule optimization device optimizes the mining rule according to the evaluation parameter obtained by the parameter acquisition unit 142', such as the recall rate corresponding to the mining rule, and the comparative analysis result obtained by the result acquisition unit 141'. For example, when the recall rate is below a preset recall rate threshold, the ways of optimizing include, but are not limited to: lowering the information issue frequency threshold used by the mining rule, or lowering the cumulative rubbish vocabulary count threshold, based on what the comparative analysis result shows, so as to improve the recall rate. For example, suppose the recall rate obtained by the parameter acquisition unit 142' is 40%, which is below the preset recall rate threshold of 50%. The rule optimization device finds from the comparative analysis result obtained by the result acquisition unit 141' that the information publishers of the junk information post at an average frequency of 4 times per minute, and accordingly lowers the information issue frequency threshold from 5 times per minute to 4 times per minute to improve the recall rate. As another example, suppose the recall rate obtained by the parameter acquisition unit 142' is below the preset recall rate threshold. The rule optimization device finds from the comparative analysis result obtained by the result acquisition unit 141' that the junk information contains on average 2 rubbish words per message, and accordingly lowers the cumulative rubbish vocabulary count threshold for junk information content from 3 per message to 2 per message to improve the recall rate. Those skilled in the art will understand that the above way of optimizing the mining rule is merely an example; other existing or future ways of optimizing the mining rule, if applicable to the present invention, shall also fall within the scope of protection of the present invention and are incorporated herein by reference.
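The two recall-driven adjustments above follow the same pattern, which can be sketched as a single helper. The function name and signature are hypothetical. The check that the observed average is lower than the current threshold is an added assumption, as are the 40% and 50% recall figures used for the second call, for which the text gives no numbers.

```python
def lower_threshold_for_recall(current_threshold: float, recall: float,
                               recall_threshold: float,
                               observed_average: float) -> float:
    """Lower a rule threshold to the average observed in real junk information.

    The threshold changes only when the recall rate is below its preset
    threshold and the observed average is lower than the current threshold.
    """
    if recall < recall_threshold and observed_average < current_threshold:
        return observed_average
    return current_threshold


# Information issue frequency threshold: 5 per minute lowered to the observed 4.
new_frequency_threshold = lower_threshold_for_recall(5, 0.40, 0.50, 4)
# Rubbish vocabulary count threshold: 3 per message lowered to the observed 2.
new_rubbish_word_threshold = lower_threshold_for_recall(3, 0.40, 0.50, 2)
print(new_frequency_threshold, new_rubbish_word_threshold)  # 4 2
```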
More preferably, the rule optimization device may also optimize the mining rule according to the evaluation parameters in combination with preset parameter priority information for those evaluation parameters. Specifically, based on the evaluation parameters, such as the recall rate and the accuracy rate, and on the preset parameter priority information, for example that the accuracy rate has higher priority than the recall rate, the rule optimization device selects a suitable way of optimizing the mining rule so as to improve the corresponding evaluation parameter. For example, suppose the accuracy rate obtained by the parameter acquisition unit 142' is 50%, below the preset accuracy rate threshold of 60%, and the recall rate is 40%, below the preset recall rate threshold of 50%. Because the preset parameter priority information ranks the accuracy rate above the recall rate, the rule optimization device adjusts the mining rule so that posted messages from high-quality users are not mined, thereby improving the accuracy rate. Those skilled in the art will understand that the above way of optimizing the mining rule is merely an example; other existing or future ways of optimizing the mining rule, if applicable to the present invention, shall also fall within the scope of protection of the present invention and are incorporated herein by reference.
More preferably, the assessment apparatus 1 further comprises a priority update device (not shown), which may update the parameter priority information according to the evaluation parameters. Specifically, the priority update device acts on the evaluation parameters obtained by the parameter acquisition unit 142': for example, when the recall rate is below the preset recall rate threshold and the accuracy rate is above the preset accuracy rate threshold, it updates the parameter priority so that the recall rate ranks above the accuracy rate. For example, if the recall rate obtained by the parameter acquisition unit 142' is below the preset recall rate threshold and the accuracy rate is above the preset accuracy rate threshold, the priority update device changes the preset parameter priority information from "accuracy rate above recall rate" to "recall rate above accuracy rate". Those skilled in the art will understand that the above way of updating the parameter priority information is merely an example; other existing or future ways of updating the parameter priority information, if applicable to the present invention, shall also fall within the scope of protection of the present invention and are incorporated herein by reference.
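A minimal sketch of how parameter priority information might be consulted and then updated is given below. The representation of the priority as an ordered tuple of names and both function names are assumptions made for illustration. The 70% accuracy rate and 40% recall rate in the update call are also assumed, since the text gives no figures for that case.

```python
from typing import Optional, Tuple

Priority = Tuple[str, ...]  # hypothetical: highest-priority parameter first


def choose_parameter_to_improve(accuracy: float, recall: float,
                                accuracy_threshold: float,
                                recall_threshold: float,
                                priority: Priority) -> Optional[str]:
    """Return the highest-priority parameter that is below its threshold."""
    below = {"accuracy": accuracy < accuracy_threshold,
             "recall": recall < recall_threshold}
    for name in priority:
        if below[name]:
            return name
    return None


def update_priority(priority: Priority, accuracy: float, recall: float,
                    accuracy_threshold: float,
                    recall_threshold: float) -> Priority:
    """Rank recall first once accuracy is satisfied but recall is not."""
    if recall < recall_threshold and accuracy > accuracy_threshold:
        return ("recall", "accuracy")
    return priority


priority = ("accuracy", "recall")
print(choose_parameter_to_improve(0.50, 0.40, 0.60, 0.50, priority))  # accuracy
priority = update_priority(priority, accuracy=0.70, recall=0.40,
                           accuracy_threshold=0.60, recall_threshold=0.50)
print(priority)  # ('recall', 'accuracy')
```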
Preferably, the apparatus further comprises an optimization control device (not shown), which ends the optimization of the mining rule when the evaluation parameter reaches an evaluation parameter threshold. Specifically, the junk information acquisition device 13' performs junk information mining on the information issue sample based on the mining rule and obtains the junk information corresponding to this information issue sample; then, the result acquisition unit 141' in the parameter acquisition device 14' compares the actual junk information preset in this information issue sample with the mined junk information and obtains the comparative analysis result corresponding to the junk information; subsequently, the parameter acquisition unit 142' obtains at least one evaluation parameter according to this comparative analysis result. The junk information acquisition device 13' and the parameter acquisition device 14' repeat this cycle, each time using the mining rule as updated by the rule optimization device. The optimization control device checks the evaluation parameter obtained in each cycle and ends the optimization of the mining rule when the evaluation parameter reaches the evaluation parameter threshold. Here, the evaluation parameter threshold is a preset expected value of the evaluation parameter. For example, when the optimization control device detects that the accuracy rate is greater than the predetermined accuracy rate threshold and the recall rate is greater than the predetermined recall rate threshold, it stops optimizing the mining rule. Those skilled in the art will understand that the above way of ending the optimization of the mining rule is merely an example; other existing or future ways of ending the optimization of the mining rule, if applicable to the present invention, shall also fall within the scope of protection of the present invention and are incorporated herein by reference.
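The mine, evaluate and optimize cycle with its stopping condition can be sketched as a loop. Everything in the sketch is hypothetical: the rule is reduced to a single integer threshold, the `evaluate` and `optimize` callables stand in for the devices described above, the toy scores are made up, and the `max_rounds` cap is an added safeguard that the text does not mention.

```python
from typing import Callable, Tuple

Rule = int  # hypothetical: the rule is reduced to one frequency threshold


def optimize_until_thresholds(
    rule: Rule,
    evaluate: Callable[[Rule], Tuple[float, float]],
    optimize: Callable[[Rule, float, float], Rule],
    accuracy_threshold: float,
    recall_threshold: float,
    max_rounds: int = 100,
) -> Rule:
    """Repeat mine -> evaluate -> optimize until both thresholds are exceeded."""
    for _ in range(max_rounds):
        accuracy, recall = evaluate(rule)
        if accuracy > accuracy_threshold and recall > recall_threshold:
            break  # the optimization control step ends the cycle
        rule = optimize(rule, accuracy, recall)
    return rule


# Toy stand-ins (made-up scores) for the mining and evaluation steps.
toy_scores = {5: (0.70, 0.40), 4: (0.68, 0.55)}
final_rule = optimize_until_thresholds(
    rule=5,
    evaluate=lambda r: toy_scores[r],
    optimize=lambda r, accuracy, recall: r - 1,
    accuracy_threshold=0.60,
    recall_threshold=0.50,
)
print(final_rule)  # 4
```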
Preferably, the evaluation parameter comprises at least one of the following:
- the recall rate corresponding to the mining rule;
- the accuracy rate corresponding to the mining rule.
Specifically, the evaluation parameters obtained by the parameter acquisition unit 142' include, but are not limited to, the recall rate corresponding to the mining rule and the accuracy rate corresponding to the mining rule. The recall rate is the ratio of the number of real junk information items that the junk information acquisition device 13' obtains through junk information mining to the number of actual junk information items in the information issue sample. The accuracy rate is the ratio of the number of real junk information items that the junk information acquisition device 13' obtains through junk information mining to the total number of junk information items it obtains through junk information mining. The accuracy rate and the recall rate may constrain each other: a high accuracy rate may lead to a low recall rate, and a high recall rate may lead to a low accuracy rate. A balance therefore has to be found between the recall rate and the accuracy rate so that junk information is mined in the best possible way. Those skilled in the art will understand that the above evaluation parameters are merely examples; other existing or future evaluation parameters, if applicable to the present invention, shall also fall within the scope of protection of the present invention and are incorporated herein by reference.
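The two definitions above can be written compactly. The set notation is introduced here only for illustration and does not appear in the original text: $M$ denotes the set of posted messages that the mining rule flags as junk information, and $J$ the set of posted messages in the information issue sample that are actually junk information.

```latex
\[
\text{recall rate} = \frac{|M \cap J|}{|J|},
\qquad
\text{accuracy rate} = \frac{|M \cap J|}{|M|}
\]
```

Both ratios share the numerator $|M \cap J|$. Loosening a rule enlarges $M$, which can raise the recall rate while lowering the accuracy rate, and tightening it can do the reverse.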
Fig. 3 shows a flowchart of a method for assessing a junk information mining rule according to one aspect of the present invention. The assessment apparatus 1 includes, but is not limited to, a computer, a network host, a single network server, a set of multiple network servers, or a cloud formed by multiple servers. Here, the cloud consists of a large number of computers or network servers based on cloud computing, where cloud computing is a form of distributed computing: a super virtual computer made up of a group of loosely coupled computers.
Specifically, in step S1, the assessment apparatus 1 obtains the mining rule to be assessed. More specifically, in step S1 the assessment apparatus 1 obtains the mining rule to be assessed either periodically or in real time in response to an event trigger. For example, it may listen in real time for requests carrying a mining rule to be assessed that are sent by network devices such as network servers, or it may periodically read the mining rule to be assessed directly from other components of the assessment apparatus 1 or from a third-party device using an agreed communication method, such as the http or https protocol. For example, suppose the assessment apparatus 1 is a network server. In step S1, this network server listens in real time to another network server used for junk information mining and receives an http request, sent by that other network server over the http protocol, in which the mining rule to be assessed is encapsulated; this network server parses the http request and extracts the mining rule to be assessed. As another example, in step S1 the assessment apparatus 1 periodically calls a predetermined application programming interface (API) to send a third-party device a request for the mining rule to be assessed, and receives the mining rule to be assessed that the third-party device returns. Those skilled in the art will understand that the above way of obtaining the mining rule to be assessed is merely an example; other existing or future ways of obtaining the mining rule to be assessed, if applicable to the present invention, shall also fall within the scope of protection of the present invention and are incorporated herein by reference.
In step S2, the assessment apparatus 1 obtains the information issue sample used to assess the mining rule. Specifically, in step S2 the assessment apparatus 1 obtains multiple posted messages, for example by randomly extracting them from a network interactive platform according to a pre-agreed communication protocol, or by retrieving them from an information issue sample library. These posted messages are marked in advance with a junk information flag that distinguishes junk information from normal information, and together they serve as the information issue sample for assessing the mining rule obtained by the assessment apparatus 1 in step S1. The junk information flag identifies whether each posted message is real junk information. Here, the information issue sample includes, but is not limited to: 1) multiple posted messages and their content, such as multiple posts and their content in an online community; 2) the junk information flags. The information issue sample library stores the posted messages and their junk information flags and includes, but is not limited to, a relational database, memory, hard disk storage, and the like. For example, suppose the posted messages of a network interactive platform are stored on a network server. In step S2, the assessment apparatus 1 sends this network server, over a pre-agreed communication protocol such as http or https, a request for an information issue sample for assessing the mining rule, and receives multiple posted messages marked with junk information flags that the network server has selected at random from the network interactive platform; these serve as the information issue sample for assessing the mining rule obtained in step S1. The network interactive platform includes, but is not limited to: online communities, post bars (tieba), blogs, microblogs, news comments, message boards, and the like. As another example, in step S2 the assessment apparatus 1 obtains real junk information and non-junk information in a certain ratio from the information issue sample library and uses them as the information issue sample for assessing the mining rule obtained in step S1. Those skilled in the art will understand that the above way of obtaining the information issue sample is merely an example; other existing or future ways of obtaining the information issue sample, if applicable to the present invention, shall also fall within the scope of protection of the present invention and are incorporated herein by reference.
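Drawing real junk information and non-junk information "in a certain ratio" from a labelled sample library can be sketched as follows. This is one possible reading, not the method prescribed by the text: the record layout with an `is_junk` flag, the function name `draw_sample`, and the library size are all assumptions.

```python
import random


def draw_sample(library, sample_size, junk_ratio, seed=None):
    """Draw a labelled sample with a fixed share of real junk information."""
    rng = random.Random(seed)
    junk = [m for m in library if m["is_junk"]]
    normal = [m for m in library if not m["is_junk"]]
    junk_count = round(sample_size * junk_ratio)
    # Random.sample raises ValueError if either pool is too small.
    return (rng.sample(junk, junk_count)
            + rng.sample(normal, sample_size - junk_count))


# Hypothetical library: 1000 posted messages, the first 300 flagged as junk.
library = [{"id": i, "is_junk": i < 300} for i in range(1000)]
sample = draw_sample(library, sample_size=500, junk_ratio=0.2, seed=0)
print(len(sample), sum(m["is_junk"] for m in sample))  # 500 100
```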
Those skilled in the art will understand that the order in which the assessment apparatus 1 performs step S1 and step S2 is merely an example; in practice the two steps may be performed in any order, for example in parallel or serially. Those skilled in the art will also understand that Fig. 3 shows only one execution order of the assessment apparatus 1 for brevity, and that this omission is made on the premise that it does not affect the clear and sufficient disclosure of the present invention.
Then, in step S3, the assessment apparatus 1 performs junk information mining on the information issue sample based on the mining rule and obtains the junk information corresponding to the information issue sample. Specifically, in step S3 the assessment apparatus 1 applies the mining rules it obtained in step S1, such as whether the information issue frequency of an information publisher ID exceeds a predetermined frequency threshold, whether the information publisher is on a blacklist, or whether the content of a posted message contains rubbish vocabulary, to analyze and judge the posted messages in the information issue sample obtained in step S2. For example, when one or more posted messages satisfy any one of the mining rules, or all of the mining rules, those posted messages are judged to be junk information, and in this way all junk information in the information issue sample is obtained.
For example, suppose that in step S1 the mining rule obtained by the assessment apparatus 1 is: if the information publisher ID is on the blacklist, or the posted message contains rubbish vocabulary, then the posted message is junk information. Subsequently, in step S2, the information issue sample obtained by the assessment apparatus 1 contains three posted messages, whose contents are:
a "certificates handling, call 13811112222",
b "everybody is happy",
c "I wish to make friends";
Then, in step S3, the assessment apparatus 1 analyzes and judges these three posted messages based on these two mining rules. It performs string matching of the content of posted message a against the rubbish dictionary and finds that "certificates handling" is rubbish vocabulary, and it finds that the information publisher ID of posted message c is on the blacklist. It therefore judges that posted message a and posted message c in this information issue sample are junk information.
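The example above, in which a posted message is junk information if its publisher is blacklisted or its content contains rubbish vocabulary, can be sketched in a few lines. The publisher IDs, the dictionary layout and the function name `is_junk` are hypothetical; the message contents and the rubbish word come from the example.

```python
RUBBISH_DICTIONARY = {"certificates handling"}
BLACKLIST = {"user_c"}  # hypothetical publisher ID for message c

messages = [
    {"id": "a", "publisher_id": "user_a",
     "content": "certificates handling, call 13811112222"},
    {"id": "b", "publisher_id": "user_b", "content": "everybody is happy"},
    {"id": "c", "publisher_id": "user_c", "content": "I wish to make friends"},
]


def is_junk(message):
    """Junk if the publisher is blacklisted OR the content has a rubbish word."""
    in_blacklist = message["publisher_id"] in BLACKLIST
    has_rubbish_word = any(word in message["content"]
                           for word in RUBBISH_DICTIONARY)
    return in_blacklist or has_rubbish_word


mined = [m["id"] for m in messages if is_junk(m)]
print(mined)  # ['a', 'c']
```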
As another example, suppose that in step S1 the mining rule obtained by the assessment apparatus 1 is: if the frequency with which one information publisher ID posts messages with the same content exceeds a predetermined frequency threshold, and the posted messages contain rubbish vocabulary, then those posted messages are junk information. Subsequently, in step S2, the information issue sample obtained by the assessment apparatus 1 contains 20 posted messages, of which 10 have the content "This store has all kinds of slimming drugs for sale at favorable prices", share the same information publisher ID, and were sent within 1 minute. Then, in step S3, the assessment apparatus 1 analyzes these 10 posted messages based on these two mining rules. It determines that the 10 posted messages have identical content and were posted by the same information publisher ID, so they amount to 10 consecutive postings of the same message, and the information issue frequency of 10 times per minute exceeds the predetermined frequency threshold of 5 times per minute. At the same time, the assessment apparatus 1 performs string matching of the content against the rubbish dictionary and finds that "sale" and "slimming drugs" are rubbish vocabulary. In step S3 the assessment apparatus 1 therefore determines that these 10 posted messages in the information issue sample are junk information. Here, the rubbish vocabulary in this embodiment includes, but is not limited to, banned words, infringing words, indecent words, political words, inflammatory words, advertising words, and the like; the rubbish dictionary in this embodiment stores the rubbish vocabulary and includes, but is not limited to, a relational database, memory, hard disk storage, and the like. Those skilled in the art will understand that the above way of obtaining junk information is merely an example; other existing or future ways of obtaining junk information, if applicable to the present invention, shall also fall within the scope of protection of the present invention and are incorporated herein by reference.
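A combined rule of this kind, where both the frequency condition and the rubbish vocabulary condition must hold, can be sketched as below. The sketch makes a simplifying assumption that every message in the sample was posted within the same one-minute window, so a plain count stands in for the frequency per minute. The record layout, publisher IDs, filler messages and function name are hypothetical.

```python
from collections import Counter

FREQUENCY_THRESHOLD = 5                          # posts per minute (example)
RUBBISH_VOCABULARY = ("sale", "slimming drugs")  # from the example


def mine_junk(messages):
    """Flag messages whose (publisher, content) pair repeats too often AND
    whose content contains rubbish vocabulary.

    Assumption: all messages fall within one one-minute window, so the
    repeat count equals the information issue frequency per minute.
    """
    counts = Counter((m["publisher_id"], m["content"]) for m in messages)
    return [
        m for m in messages
        if counts[(m["publisher_id"], m["content"])] > FREQUENCY_THRESHOLD
        and any(word in m["content"] for word in RUBBISH_VOCABULARY)
    ]


spam_text = ("This store has all kinds of slimming drugs "
             "for sale at favorable prices")
sample = [{"publisher_id": "user_x", "content": spam_text} for _ in range(10)]
sample += [{"publisher_id": f"user_{i}", "content": f"hello from user {i}"}
           for i in range(10)]

print(len(mine_junk(sample)))  # 10
```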
Then, in step S4, the assessment apparatus 1 obtains at least one evaluation parameter corresponding to the mining rule according to the junk information in combination with the information issue sample. Specifically, in step S4 the assessment apparatus 1 compares the junk information it obtained through junk information mining in step S3 with the posted messages and junk information flags contained in the information issue sample it obtained in step S2, thereby obtaining the number of real junk information items and the number of non-junk information items among the mined junk information. The assessment apparatus 1 then uses these counts, together with the number of posted messages in the information issue sample, to obtain at least one evaluation parameter, such as the recall rate of the mining rule. The evaluation parameters include, but are not limited to: 1) the recall rate corresponding to the mining rule, calculated as "recall rate = number of real junk information items obtained through junk information mining / number of real junk information items in the information issue sample"; 2) the accuracy rate corresponding to the mining rule, calculated as "accuracy rate = number of real junk information items obtained through junk information mining / number of junk information items obtained through junk information mining". For example, suppose the information issue sample contains 500 posted messages, of which 100 are marked by the junk information flag as real junk information, and in step S3 the assessment apparatus 1 obtains 80 junk information items from this information issue sample through junk information mining. Then, in step S4, the assessment apparatus 1 compares the mined junk information with the real junk information in the information issue sample and finds that 40 of the mined items are real junk information. Using the accuracy rate formula above, it calculates an accuracy rate of 50% (= 40/80); using the recall rate formula above, it calculates a recall rate of 40% (= 40/100). Those skilled in the art will understand that the above way of obtaining the evaluation parameter is merely an example; other existing or future ways of obtaining the evaluation parameter, if applicable to the present invention, shall also fall within the scope of protection of the present invention and are incorporated herein by reference.
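The arithmetic of this example can be checked with a short function. The function and parameter names are hypothetical; the counts 80, 40 and 100 are those of the example.

```python
def evaluate(mined_count, real_mined_count, actual_junk_count):
    """Return (accuracy rate, recall rate) from the three counts.

    Assumes both denominators are non-zero.
    """
    accuracy = real_mined_count / mined_count
    recall = real_mined_count / actual_junk_count
    return accuracy, recall


# 80 items mined, 40 of them real junk, 100 real junk items in the sample.
accuracy, recall = evaluate(mined_count=80, real_mined_count=40,
                            actual_junk_count=100)
print(f"{accuracy:.0%} {recall:.0%}")  # 50% 40%
```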
Preferably, the assessment apparatus 1 operates continuously across step S1, step S2, step S3 and step S4. Specifically, in step S1, the assessment apparatus 1 obtains the mining rule to be assessed; in step S2, the assessment apparatus 1 obtains the information issue sample used to assess the mining rule; then, in step S3, the assessment apparatus 1 performs junk information mining on the information issue sample based on the mining rule and obtains the junk information corresponding to the information issue sample; then, in step S4, the assessment apparatus 1 obtains at least one evaluation parameter corresponding to the mining rule according to the junk information in combination with the information issue sample. Here, those skilled in the art will understand that "continuously" means that in each step the assessment apparatus 1 obtains the mining rule to be assessed, the information issue sample, the junk information and the evaluation parameter, respectively, according to a set or real-time adjusted operating mode, until the assessment apparatus 1 has stopped obtaining mining rules to be assessed for a relatively long time.
It should be noted that the numerical values in the examples herein are for illustration only, to aid understanding of the present invention, and are not real data from practical application. Unless otherwise stated, the same applies to numerical values appearing elsewhere in this document and, for brevity, is not repeated.
Preferably, in step S2, the assessment apparatus 1 obtains the information issue sample corresponding to the mining rule from the information issue sample library according to the mining rule. Specifically, in step S2 the assessment apparatus 1 uses the mining rule it obtained in step S1, for example, to run a matching query in the information issue sample library: when a mining rule matches the mining rule marked on a posted message in the library, that posted message is retrieved, and all posted messages retrieved by the matching query are used as the information issue sample. Alternatively, it queries the information issue sample library to obtain a certain number of junk information items, or junk information items that those mining rules previously failed to mine, as the information issue sample. For example, suppose that in step S1 the mining rule obtained by the assessment apparatus 1 is: if the information publisher ID is on the blacklist, then the posted message is junk information. Then, in step S2, according to this mining rule the assessment apparatus 1 either randomly selects several information publisher IDs from the blacklist and runs a matching query in the information issue sample library with those IDs to obtain a number of posted messages, or runs a matching query against the blacklist for the information publisher IDs of all posted messages in the information issue sample library, finds 200 information publisher IDs that are on the blacklist, and retrieves the posted messages corresponding to those 200 information publisher IDs, to serve as the information issue sample. As another example, in step S1 the assessment apparatus 1 obtains mining rules; then, in step S2, it runs a matching query in the information issue sample library with the mining rule ID that identifies any one mining rule, obtains the junk information corresponding to that mining rule ID together with whether the corresponding mining rule mined each item successfully, extracts all junk information that the corresponding mining rule failed to mine, and uses a certain proportion (such as 50%) of it as the information issue sample. Those skilled in the art will understand that the above way of obtaining the information issue sample is merely an example; other existing or future ways of obtaining the information issue sample, if applicable to the present invention, shall also fall within the scope of protection of the present invention and are incorporated herein by reference.
Fig. 4 shows a flowchart of a method for assessing a junk information mining rule according to a preferred embodiment of the present invention. Specifically, in step S41', the assessment apparatus 1 compares the actual junk information preset in the information issue sample with the mined junk information and obtains a comparative analysis result corresponding to the junk information; then, in step S42', the assessment apparatus 1 obtains the at least one evaluation parameter according to the comparative analysis result. Here, steps S1' to S3' shown in Fig. 4 are the same as steps S1 to S3 described above with reference to Fig. 3; for brevity, they are incorporated herein by reference and are not repeated.
More specifically, in step S41', the assessment apparatus 1 compares, item by item, the actual junk information preset in the information issue sample it obtained in step S2' with the junk information it mined based on the mining rule in step S3', to obtain the comparative analysis result corresponding to the mined junk information. The comparative analysis result includes, but is not limited to: 1) the number of real junk information items among the mined junk information; 2) the number of non-junk information items among the mined junk information; 3) keywords in the content of the non-junk information among the mined junk information; 4) the credit rating of the information publishers of the non-junk information among the mined junk information; 5) the information issue frequency of the information publishers of the real junk information; and so on. For example, suppose that in step S2' the information issue sample obtained by the assessment apparatus 1 contains 20 posted messages, 10 of which are real junk information; then, in step S3', the assessment apparatus 1 mines 6 junk information items from this information issue sample based on the mining rule; subsequently, in step S41', the assessment apparatus 1 compares the mined junk information with the real junk information in the information issue sample, finds that 4 of the mined items are real junk information, and finds that these real junk information items were posted by the same information publisher ID, whose information issue frequency is 4 times per minute.
Then, in step S42', the assessment apparatus 1 calculates by formula at least one evaluation parameter from the comparative analysis result it obtained in step S41', such as the accuracy rate corresponding to the mining rule it obtained in step S1'. Continuing the example above: in step S2', the information issue sample obtained by the assessment apparatus 1 contains 20 posted messages, 10 of which are real junk information; in step S3', the assessment apparatus 1 mines 6 junk information items based on the mining rule; in step S41', the assessment apparatus 1 determines that 4 of them are real junk information; in step S42', the assessment apparatus 1 calculates by formula an accuracy rate of 67% (= 4/6) and a recall rate of 40% (= 4/10).
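Steps S41' and S42' can be sketched together using sets of message IDs. Representing the mined and the actual junk information as ID sets is an assumption made for illustration, as are the concrete IDs and the dictionary keys; the counts 20, 10, 6 and 4 match the example.

```python
def compare_and_evaluate(mined_ids, actual_junk_ids):
    """Compare mined junk with labelled junk, then derive both rates."""
    real_junk = mined_ids & actual_junk_ids     # correctly mined
    non_junk = mined_ids - actual_junk_ids      # mined but not junk
    return {
        "real_junk_count": len(real_junk),
        "non_junk_count": len(non_junk),
        "accuracy": len(real_junk) / len(mined_ids),
        "recall": len(real_junk) / len(actual_junk_ids),
    }


# 20 messages with IDs 0..19; IDs 0..9 are real junk; the rule mines 6.
actual_junk_ids = set(range(10))
mined_ids = {0, 1, 2, 3, 10, 11}
result = compare_and_evaluate(mined_ids, actual_junk_ids)
print(result["real_junk_count"], result["non_junk_count"])  # 4 2
print(round(result["accuracy"], 2), result["recall"])       # 0.67 0.4
```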
Those skilled in the art will understand that the above ways of obtaining the comparison result and the evaluation parameter are merely examples; other existing or future ways of obtaining the comparison result or the evaluation parameter, if applicable to the present invention, should also fall within the scope of protection of the present invention and are incorporated herein by reference.
Preferably, the mining rule performs junk information mining on the information publication sample based on at least one of the following:
- information publication frequency;
- information publication content;
- the information publisher's historical behavior record;
- the information publisher's attributes.
1) Specifically, the information publication frequency includes, but is not limited to: the frequency with which a single information publisher publishes messages, the publication frequency of messages with identical content, the publication frequency of messages from the same IP address, and so on. For example, suppose the information publication sample contains 10 published messages. In step S3', assessment apparatus 1 analyzes these 10 messages and determines that 6 of them were published by the same information publisher ID within 1 minute, so that this publisher's publication frequency is 6 times per minute, which exceeds the predetermined frequency threshold of 5 times per minute; these 6 messages can therefore be judged to be junk information.
2) The information publication content includes, but is not limited to: junk words contained in the published content, multiple published messages having identical content, and so on. For example, suppose the information publication sample contains 3 published messages whose contents are, respectively:
a "Certificate handling, call 13811112222",
b "Wishing everyone happiness",
c "I hope we can be friends";
In step S3', assessment apparatus 1 performs string matching of the contents of these 3 messages against a junk dictionary, finds the junk word "certificate handling" in message a, and accordingly judges message a to be junk information.
3) The information publisher's historical behavior record includes, but is not limited to: the content of the publisher's past messages, the times at which the publisher's past messages were published, the publisher's historical online duration, and so on. For example, in step S3', assessment apparatus 1 looks up the information publisher ID of a message in the information publication sample in a historical behavior database and finds that this publisher's past messages were all published between 1:00 a.m. and 6:00 a.m. and that their content contains junk words; it then judges this message to be junk information. The historical behavior database in this embodiment stores the historical behavior records of information publishers and includes, but is not limited to, a relational database, memory, hard disk storage, and so on.
4) The information publisher's attributes include, but are not limited to: whether the information publisher is on a blacklist, and personal background information entered in advance by the information publisher. For example, in step S3', assessment apparatus 1 looks up the information publisher IDs of all messages in the information publication sample in the blacklist, finds that the publishers of two messages are on the blacklist, and judges these two messages to be junk information.
Those skilled in the art will understand that the four bases above can be used not only separately but also in combination to perform junk information mining on the information publication sample. Those skilled in the art will also understand that the above junk information mining rules are merely examples; other existing or future junk information mining rules, if applicable to the present invention, should also fall within the scope of protection of the present invention and are incorporated herein by reference.
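The four bases above, and their combined use, can be sketched as simple predicates over a sample. Everything named here (`Message`, `JUNK_WORDS`, `BLACKLIST`, the `by_*` functions, the shape of `history`) is a hypothetical illustration and not the patent's implementation; the threshold of 5 messages per minute and the 1:00 a.m. to 6:00 a.m. window come from the examples in the text.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Message:
    msg_id: str
    publisher_id: str
    content: str

JUNK_WORDS = {"certificate handling"}  # hypothetical junk dictionary
BLACKLIST = {"u9"}                     # hypothetical blacklist of publisher ids
FREQ_THRESHOLD = 5                     # messages per minute, as in the text

def by_frequency(sample, window_minutes=1.0):
    """Flag messages whose publisher exceeds the frequency threshold."""
    counts = Counter(m.publisher_id for m in sample)
    return {m.msg_id for m in sample
            if counts[m.publisher_id] / window_minutes > FREQ_THRESHOLD}

def by_content(sample):
    """Flag messages whose content contains a word from the junk dictionary."""
    return {m.msg_id for m in sample
            if any(word in m.content.lower() for word in JUNK_WORDS)}

def by_history(sample, history):
    """Flag messages whose publisher only ever posted at night and posted junk words.

    history maps a publisher id to (past posting hours, past content had junk words).
    """
    flagged = set()
    for m in sample:
        hours, had_junk = history.get(m.publisher_id, ([], False))
        if hours and had_junk and all(1 <= h < 6 for h in hours):
            flagged.add(m.msg_id)
    return flagged

def by_attribute(sample):
    """Flag messages whose publisher is on the blacklist."""
    return {m.msg_id for m in sample if m.publisher_id in BLACKLIST}

sample = [
    Message("a", "u1", "Certificate handling, call 13811112222"),
    Message("b", "u2", "Wishing everyone happiness"),
    Message("c", "u9", "I hope we can be friends"),
]
history = {"u2": ([9, 14], False)}  # hypothetical daytime poster with clean history

# Combined use: a message is mined if any of the four bases flags it.
mined = (by_frequency(sample) | by_content(sample)
         | by_history(sample, history) | by_attribute(sample))
print(sorted(mined))
```

Taking the union is only one way to combine the bases; requiring agreement between several of them would be an equally valid reading of "in combination".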
In a preferred embodiment (with reference to Fig. 4), the process further comprises step S5' (not shown), in which assessment apparatus 1 optimizes the mining rule according to the evaluation parameter. The preferred embodiment is described in detail below with reference to Fig. 4. In step S1', assessment apparatus 1 obtains the mining rule to be assessed; in step S2', assessment apparatus 1 obtains the information publication sample for assessing the mining rule; in step S3', assessment apparatus 1 performs junk information mining on the information publication sample based on the mining rule and obtains the junk information corresponding to the sample; in step S41', assessment apparatus 1 compares the preset actual junk information in the information publication sample with the mined junk information and obtains the comparison result corresponding to the junk information; in step S42', assessment apparatus 1 obtains the at least one evaluation parameter according to the comparison result. The details of these steps are the same as those performed by assessment apparatus 1 in steps S1', S2', S3', S41' and S42' of the embodiment described above with reference to Fig. 4; for brevity, they are incorporated herein by reference and not repeated.
Specifically, in step S5', assessment apparatus 1 optimizes the mining rule according to the evaluation parameter it obtained in step S42', such as the accuracy rate corresponding to the mining rule. For example, when the accuracy rate is below a preset accuracy rate threshold, the mining rule is adjusted so that messages published by information publishers with a high credit rating are excluded from junk information mining, thereby raising the accuracy rate. For example, suppose that in step S42' the accuracy rate calculated by assessment apparatus 1 is 50%. In step S5', assessment apparatus 1 determines that the accuracy rate of 50% is below the preset accuracy rate threshold of 60% and adjusts the mining rule so that messages published by information publishers with a high credit rating are excluded from junk information mining, in order to raise the accuracy rate. Those skilled in the art will understand that the above way of optimizing the mining rule is merely an example; other existing or future ways of optimizing the mining rule, if applicable to the present invention, should also fall within the scope of protection of the present invention and are incorporated herein by reference.
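As a minimal sketch of this adjustment, assume the mining rule is represented as a small immutable record with a flag that excludes high-credit publishers. The class `MiningRule`, its fields and the function name are hypothetical; only the 50% and 60% figures come from the text.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class MiningRule:
    freq_threshold: float = 5.0     # messages per minute
    skip_high_credit: bool = False  # exclude high-credit publishers from mining

def optimize_for_accuracy(rule, accuracy, accuracy_threshold=0.60):
    """Return an adjusted rule when the accuracy rate is below the threshold."""
    if accuracy < accuracy_threshold:
        return replace(rule, skip_high_credit=True)
    return rule

rule = optimize_for_accuracy(MiningRule(), accuracy=0.50)
print(rule.skip_high_credit)
```

Returning a new rule rather than mutating the old one keeps each evaluated rule version available for comparison, which is a design choice of this sketch rather than something the text prescribes.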
Preferably, in step S5', assessment apparatus 1 may also optimize the mining rule according to the evaluation parameter in combination with the comparison result. Specifically, in step S5', assessment apparatus 1 optimizes the mining rule according to the evaluation parameter it obtained in step S42', such as the recall rate corresponding to the mining rule, and according to the comparison result it obtained in step S41'. For example, when the recall rate is below a preset recall rate threshold, the ways of optimizing include, but are not limited to: lowering, as indicated by the comparison result, the information publication frequency threshold that the mining rule uses to mine junk information, or lowering the threshold on the cumulative number of junk words, in order to raise the recall rate. For example, suppose that in step S42' the recall rate obtained by assessment apparatus 1 is 40%, which is below the preset recall rate threshold of 50%. Then, in step S5', assessment apparatus 1 takes from the comparison result obtained in step S41' the average information publication frequency of the publishers of the junk information, which is 4 times per minute, and accordingly lowers the information publication frequency threshold from 5 times per minute to 4 times per minute, in order to raise the recall rate. As another example, suppose that in step S42' the recall rate obtained by assessment apparatus 1 is below the preset recall rate threshold. Then, in step S5', assessment apparatus 1 takes from the comparison result obtained in step S41' the average number of junk words contained in the junk information content, which is 2 per message, and accordingly lowers the threshold on the cumulative number of junk words from 3 per message to 2 per message, in order to raise the recall rate. Those skilled in the art will understand that the above ways of optimizing the mining rule are merely examples; other existing or future ways of optimizing the mining rule, if applicable to the present invention, should also fall within the scope of protection of the present invention and are incorporated herein by reference.
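Both recall-driven adjustments above follow the same pattern: when recall is too low, a rule threshold is lowered to the average observed among the real junk information in the comparison result. A hedged sketch of that pattern follows; the function and parameter names are hypothetical, and the recall values used for the junk-word case are assumed, since the text gives none.

```python
def lower_threshold_for_recall(current_threshold, recall, recall_threshold, observed_mean):
    """Lower a rule threshold to the mean observed among real junk information.

    The threshold is changed only when recall is below its preset threshold
    and the observed mean is actually lower than the current threshold.
    """
    if recall < recall_threshold and observed_mean < current_threshold:
        return observed_mean
    return current_threshold

# Frequency threshold: 5 -> 4 times per minute (recall 40% < 50%, as in the text).
freq_threshold = lower_threshold_for_recall(5, 0.40, 0.50, observed_mean=4)
# Junk word count threshold: 3 -> 2 per message (recall figures assumed here).
word_threshold = lower_threshold_for_recall(3, 0.40, 0.50, observed_mean=2)
print(freq_threshold, word_threshold)
```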
More preferably, in step S5', assessment apparatus 1 may also optimize the mining rule according to the evaluation parameters in combination with preset parameter priority information for the evaluation parameters. Specifically, in step S5', assessment apparatus 1 selects a suitable way of optimizing the mining rule according to the evaluation parameters, such as the recall rate and the accuracy rate, and according to the preset parameter priority information, for example that the accuracy rate has higher priority than the recall rate, so as to improve the higher-priority evaluation parameter. For example, suppose that in step S42' the accuracy rate obtained by assessment apparatus 1 is 50%, which is below the preset accuracy rate threshold of 60%, and the recall rate is 40%, which is below the preset recall rate threshold of 50%. Then, in step S5', assessment apparatus 1, following the preset parameter priority information that the accuracy rate has higher priority than the recall rate, adjusts the mining rule so that messages published by high-quality users are excluded from mining, thereby raising the accuracy rate. Those skilled in the art will understand that the above way of optimizing the mining rule is merely an example; other existing or future ways of optimizing the mining rule, if applicable to the present invention, should also fall within the scope of protection of the present invention and are incorporated herein by reference.
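One way to read the priority mechanism is as an ordered list of evaluation parameters, where the first parameter that misses its threshold is the one the optimization targets. The sketch below assumes that reading; `choose_target` and the dictionary layout are hypothetical.

```python
def choose_target(params, thresholds, priority):
    """Return the highest-priority evaluation parameter that is below its threshold."""
    for name in priority:
        if params[name] < thresholds[name]:
            return name
    return None  # every parameter already meets its threshold

# Figures from the example: both parameters miss their thresholds,
# and the accuracy rate has priority over the recall rate.
target = choose_target(
    params={"accuracy": 0.50, "recall": 0.40},
    thresholds={"accuracy": 0.60, "recall": 0.50},
    priority=["accuracy", "recall"],
)
print(target)
```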
More preferably, the process further comprises step S6' (not shown), in which assessment apparatus 1 may update the parameter priority information according to the evaluation parameters. Specifically, in step S6', assessment apparatus 1 acts on the evaluation parameters it obtained in step S42': for example, when the recall rate is below the preset recall rate threshold and the accuracy rate is above the preset accuracy rate threshold, it updates the parameter priority so that the recall rate has higher priority than the accuracy rate. For example, if in step S42' the recall rate obtained by assessment apparatus 1 is below the preset recall rate threshold and the accuracy rate is above the preset accuracy rate threshold, then in step S6' assessment apparatus 1 updates the preset parameter priority information, in which the accuracy rate has higher priority than the recall rate, so that the recall rate has higher priority than the accuracy rate. Those skilled in the art will understand that the above way of updating the parameter priority information is merely an example; other existing or future ways of updating the parameter priority information, if applicable to the present invention, should also fall within the scope of protection of the present invention and are incorporated herein by reference.
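The priority update in step S6' can be sketched as follows, assuming the priority is an ordered list with the highest-priority parameter first. The function name and data layout are hypothetical, and the numeric values in the usage lines are assumed for illustration only.

```python
def update_priority(priority, params, thresholds):
    """Put recall first when recall misses its threshold while accuracy exceeds its own."""
    if (params["recall"] < thresholds["recall"]
            and params["accuracy"] > thresholds["accuracy"]):
        return ["recall", "accuracy"]
    return list(priority)  # otherwise keep the existing order

new_priority = update_priority(
    priority=["accuracy", "recall"],
    params={"accuracy": 0.70, "recall": 0.40},     # assumed values
    thresholds={"accuracy": 0.60, "recall": 0.50},
)
print(new_priority)
```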
Preferably, the process further comprises step S7' (not shown), in which assessment apparatus 1 stops optimizing the mining rule when the evaluation parameter reaches an evaluation parameter threshold. Specifically, in step S3', assessment apparatus 1 performs junk information mining on the information publication sample based on the mining rule and obtains the junk information corresponding to the sample; then, in step S41', assessment apparatus 1 compares the preset actual junk information in the sample with this junk information and obtains the comparison result corresponding to the junk information; subsequently, in step S42', assessment apparatus 1 obtains at least one evaluation parameter according to the comparison result. Assessment apparatus 1 executes step S3' and step S4' cyclically, each time based on the mining rule as most recently updated in step S5'. In step S7', assessment apparatus 1 checks the evaluation parameter obtained in each cycle and, when the evaluation parameter reaches the evaluation parameter threshold, stops optimizing the mining rule. Here, the evaluation parameter threshold means a preset expected value of the evaluation parameter. For example, when assessment apparatus 1 detects in step S7' that the accuracy rate is above the predetermined accuracy rate threshold and the recall rate is above the predetermined recall rate threshold, assessment apparatus 1 stops optimizing the mining rule. Those skilled in the art will understand that the above way of ending the optimization of the mining rule is merely an example; other existing or future ways of ending the optimization of the mining rule, if applicable to the present invention, should also fall within the scope of protection of the present invention and are incorporated herein by reference.
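The cycle of mining, evaluating, optimizing and stopping can be sketched as a loop. This is an illustration under several assumptions not found in the patent: the rule is reduced to a single frequency threshold, the update in step S5' simply lowers that threshold by one, "reaches" is read as greater than or equal to, and a `max_rounds` guard is added so the loop always terminates. The sample data and all names are hypothetical.

```python
def mine(sample, freq_threshold):
    """Step S3': flag messages whose publisher frequency exceeds the threshold."""
    return {mid for mid, (freq, _is_junk) in sample.items() if freq > freq_threshold}

def evaluate(sample, mined):
    """Steps S41'/S42': compare with the actual junk and compute both rates."""
    actual = {mid for mid, (_freq, is_junk) in sample.items() if is_junk}
    hits = len(mined & actual)
    accuracy = hits / len(mined) if mined else 0.0
    recall = hits / len(actual) if actual else 0.0
    return accuracy, recall

def optimize_until_done(sample, freq_threshold, acc_target, rec_target, max_rounds=10):
    accuracy, recall = evaluate(sample, mine(sample, freq_threshold))
    rounds = 0
    # Step S7': stop once both evaluation parameters reach their thresholds.
    while (accuracy < acc_target or recall < rec_target) and rounds < max_rounds:
        freq_threshold -= 1  # step S5': deliberately simple rule update
        accuracy, recall = evaluate(sample, mine(sample, freq_threshold))
        rounds += 1
    return freq_threshold, accuracy, recall

# Hypothetical sample: message id -> (publisher frequency per minute, is real junk).
sample = {
    "m1": (8, True), "m2": (6, True), "m3": (5, True),
    "m4": (4, True), "m5": (3, False), "m6": (1, False),
}
final = optimize_until_done(sample, freq_threshold=5, acc_target=0.6, rec_target=0.75)
print(final)
```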
Preferably, the evaluation parameter comprises at least one of the following:
- the recall rate corresponding to the mining rule;
- the accuracy rate corresponding to the mining rule.
Specifically, in step S42', the evaluation parameters obtained by assessment apparatus 1 include, but are not limited to: the recall rate corresponding to the mining rule and the accuracy rate corresponding to the mining rule. The recall rate is the ratio of the number of pieces of real junk information that assessment apparatus 1 obtains through junk information mining in step S3' to the number of pieces of actual junk information in the information publication sample. The accuracy rate is the ratio of the number of pieces of real junk information that assessment apparatus 1 obtains through junk information mining in step S3' to the total number of pieces of junk information obtained through that mining. The accuracy rate and the recall rate are two evaluation parameters that may constrain each other: a high accuracy rate may come with a low recall rate, and a high recall rate may come with a low accuracy rate. A balance therefore needs to be struck between the recall rate and the accuracy rate so that junk information is mined in the best possible way. Those skilled in the art will understand that the above evaluation parameters are merely examples; other existing or future evaluation parameters, if applicable to the present invention, should also fall within the scope of protection of the present invention and are incorporated herein by reference.
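The two definitions can be written compactly. The symbols are introduced here for illustration and do not appear in the patent: M denotes the set of messages mined as junk information in step S3', and A denotes the set of preset actual junk information in the information publication sample.

```latex
\[
\text{recall rate} = \frac{|M \cap A|}{|A|},
\qquad
\text{accuracy rate} = \frac{|M \cap A|}{|M|}
\]
```

Under this notation the accuracy rate as defined in the text corresponds to what is commonly called precision.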
It is evident to those skilled in the art that the invention is not limited to the details of the above exemplary embodiments, and that the invention can be implemented in other specific forms without departing from its spirit or essential characteristics. The embodiments should therefore be regarded in every respect as exemplary and non-restrictive; the scope of the invention is defined by the appended claims rather than by the above description, and all changes falling within the meaning and range of equivalents of the claims are therefore intended to be embraced by the invention. No reference sign in a claim should be construed as limiting the claim concerned. Furthermore, the word "comprising" clearly does not exclude other units or steps, and the singular does not exclude the plural. A plurality of units or devices recited in a device claim may also be implemented by a single unit or device, through software or hardware. Words such as "first" and "second" are used to denote names and do not denote any particular order.

Claims (20)

1. A computer-implemented method for assessing a junk information mining rule, wherein the method comprises the following steps:
a. obtaining a mining rule to be assessed;
wherein the method further comprises:
i. obtaining an information publication sample for assessing the mining rule;
wherein the method further comprises:
b. performing junk information mining on the information publication sample based on the mining rule, to obtain junk information corresponding to the information publication sample;
c. obtaining, according to the junk information and in combination with the information publication sample, at least one evaluation parameter corresponding to the mining rule.
2. The method according to claim 1, wherein step i further comprises:
- obtaining, according to the mining rule, the information publication sample corresponding to the mining rule from an information publication sample library.
3. The method according to claim 1 or 2, wherein the mining rule performs junk information mining on the information publication sample based on at least one of the following:
- information publication frequency;
- information publication content;
- the information publisher's historical behavior record;
- the information publisher's attributes.
4. The method according to any one of claims 1 to 3, wherein step c further comprises:
- comparing the preset actual junk information in the information publication sample with the junk information, to obtain a comparison result corresponding to the junk information;
- obtaining the at least one evaluation parameter according to the comparison result.
5. The method according to claim 3 or 4, wherein the method further comprises step x:
x. optimizing the mining rule according to the evaluation parameter.
6. The method according to claim 5, wherein step x further comprises:
- optimizing the mining rule according to the evaluation parameter in combination with the comparison result.
7. The method according to claim 5 or 6, wherein step x further comprises:
- optimizing the mining rule according to the evaluation parameter in combination with preset parameter priority information for the evaluation parameter.
8. The method according to claim 7, wherein the method further comprises:
- updating the parameter priority information according to the evaluation parameter.
9. The method according to any one of claims 5 to 8, wherein the method further comprises:
- repeating steps b and c based on the optimized mining rule until the evaluation parameter reaches an evaluation parameter threshold.
10. The method according to any one of claims 1 to 9, wherein the evaluation parameter comprises at least one of the following:
- the recall rate corresponding to the mining rule;
- the accuracy rate corresponding to the mining rule.
11. An equipment for assessing a junk information mining rule, wherein the equipment comprises:
a rule obtaining device, configured to obtain a mining rule to be assessed;
a sample obtaining device, configured to obtain an information publication sample for assessing the mining rule;
a junk information obtaining device, configured to perform junk information mining on the information publication sample based on the mining rule, to obtain junk information corresponding to the information publication sample;
a parameter obtaining device, configured to obtain, according to the junk information and in combination with the information publication sample, at least one evaluation parameter corresponding to the mining rule.
12. The equipment according to claim 11, wherein the sample obtaining device is further configured to obtain, according to the mining rule, the information publication sample corresponding to the mining rule from an information publication sample library.
13. The equipment according to claim 11 or 12, wherein the mining rule performs junk information mining on the information publication sample based on at least one of the following:
- information publication frequency;
- information publication content;
- the information publisher's historical behavior record;
- the information publisher's attributes.
14. The equipment according to any one of claims 11 to 13, wherein the parameter obtaining device further comprises:
a result obtaining unit, configured to compare the preset actual junk information in the information publication sample with the junk information, to obtain a comparison result corresponding to the junk information;
a parameter obtaining unit, configured to obtain the at least one evaluation parameter according to the comparison result.
15. The equipment according to claim 14, wherein the equipment further comprises:
a rule optimization device, configured to optimize the mining rule according to the evaluation parameter.
16. The equipment according to claim 15, wherein the rule optimization device is further configured to optimize the mining rule according to the evaluation parameter in combination with the comparison result.
17. The equipment according to claim 15 or 16, wherein the rule optimization device is further configured to optimize the mining rule according to the evaluation parameter in combination with preset parameter priority information for the evaluation parameter.
18. The equipment according to claim 17, wherein the equipment further comprises:
a priority update device, configured to update the parameter priority information according to the evaluation parameter.
19. The equipment according to any one of claims 15 to 18, wherein the equipment further comprises:
an optimization control device, configured to stop optimizing the mining rule when the evaluation parameter reaches an evaluation parameter threshold.
20. The equipment according to any one of claims 11 to 19, wherein the evaluation parameter comprises at least one of the following:
- the recall rate corresponding to the mining rule;
- the accuracy rate corresponding to the mining rule.
CN201110264221.6A 2011-09-07 2011-09-07 Method and apparatus for assessing a junk information mining rule Active CN102982048B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110264221.6A CN102982048B (en) 2011-09-07 2011-09-07 Method and apparatus for assessing a junk information mining rule

Publications (2)

Publication Number Publication Date
CN102982048A (en) 2013-03-20
CN102982048B (en) 2017-08-01

Family

ID=47856084

Country Status (1)

Country Link
CN (1) CN102982048B (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant