CN102982048B - Method and apparatus for evaluating junk information mining rules - Google Patents

Method and apparatus for evaluating junk information mining rules

Info

Publication number
CN102982048B
CN102982048B CN201110264221.6A CN201110264221A CN102982048B CN 102982048 B CN102982048 B CN 102982048B CN 201110264221 A CN201110264221 A CN 201110264221A CN 102982048 B CN102982048 B CN 102982048B
Authority
CN
China
Prior art keywords
information
mining rule
rule
sample
evaluating
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201110264221.6A
Other languages
Chinese (zh)
Other versions
CN102982048A (en
Inventor
李彦宏
舒迅
帅帅
尹佳
罗亮
王波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201110264221.6A
Publication of CN102982048A
Application granted
Publication of CN102982048B
Anticipated expiration

Abstract

An object of the invention is to provide a method and apparatus for evaluating junk information mining rules. An evaluation apparatus obtains a mining rule to be evaluated and an information publication sample for evaluating that rule; it then mines the sample for junk information based on the rule and obtains at least one evaluation parameter corresponding to the rule. Compared with the prior art, by obtaining at least one evaluation parameter corresponding to the mining rule to be evaluated, the invention gives the administrator of an interactive platform an index for evaluating the rule. The rule can then be optimized and updated to improve each evaluation parameter, so that the interactive platform can identify and handle junk information more accurately, which ensures the normal operation of the platform.

Description

Method and apparatus for evaluating junk information mining rules
Technical Field
The present invention relates to the field of network technology, and in particular to a technique for evaluating junk information mining rules.
Background
With the development and adoption of Internet technology, more and more users publish and receive large amounts of information through open interactive platforms, making full use of the Internet to exchange information and share resources. This information, however, contains a large amount of junk information. Such junk information may be published in batches or for illegal purposes; it consumes substantial network resources and can create serious network security risks. Open interactive platforms currently take certain measures, mining for junk information in order to detect and handle the junk information on the platform. However, the administrator of an interactive platform has no way of knowing whether the junk information on the platform is being mined effectively, and therefore cannot optimize the mining and detection methods accordingly. As a result, the goals of conserving network resources and keeping the open interactive platform clean cannot be guaranteed.
Therefore, how to evaluate junk information mining rules effectively has become one of the problems that urgently need to be solved.
Summary of the Invention
An object of the invention is to provide a method and apparatus for evaluating junk information mining rules.
According to one aspect of the invention, a method for evaluating junk information mining rules is provided, the method comprising the following steps:
a. obtaining a mining rule to be evaluated;
b. obtaining an information publication sample for evaluating the mining rule;
c. mining the information publication sample for junk information based on the mining rule, to obtain the junk information corresponding to the information publication sample;
d. obtaining, according to the junk information and in combination with the information publication sample, at least one evaluation parameter corresponding to the mining rule.
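Steps a to d can be sketched as a small pipeline. This is a minimal illustration only, not the patented implementation: the `Message` fields, the modeling of a mining rule as a predicate over one message, and the toy data are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Message:
    """One published message in the sample (all field names are hypothetical)."""
    msg_id: str
    publisher_id: str
    content: str
    is_junk: bool  # pre-assigned junk information flag


# Assumption: a mining rule is modeled as a predicate over a single message.
MiningRule = Callable[[Message], bool]


def evaluate_rule(rule: MiningRule, sample: List[Message]) -> Dict[str, float]:
    # Step c: apply the rule to every message in the sample.
    mined = [m for m in sample if rule(m)]
    # Step d: compare the mined messages with the pre-assigned flags.
    true_hits = sum(1 for m in mined if m.is_junk)
    real_junk = sum(1 for m in sample if m.is_junk)
    return {
        "accuracy": true_hits / len(mined) if mined else 0.0,
        "recall": true_hits / real_junk if real_junk else 0.0,
    }


# Steps a and b: a toy rule and a toy labeled sample (invented data).
rule: MiningRule = lambda m: "certificate" in m.content
sample = [
    Message("1", "u1", "certificate service, call now", True),
    Message("2", "u2", "my certificate arrived today", False),
    Message("3", "u3", "cheap pills", True),
]
result = evaluate_rule(rule, sample)
```

With this toy data the rule mines messages 1 and 2, only one of which is real junk, and it misses message 3, so both parameters come out at one half.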
According to another aspect of the invention, an apparatus for evaluating junk information mining rules is also provided, the apparatus comprising:
a rule acquisition device, for obtaining a mining rule to be evaluated;
a sample acquisition device, for obtaining an information publication sample for evaluating the mining rule;
a junk information acquisition device, for mining the information publication sample for junk information based on the mining rule, to obtain the junk information corresponding to the information publication sample;
a parameter acquisition device, for obtaining, according to the junk information and in combination with the information publication sample, at least one evaluation parameter corresponding to the mining rule.
Compared with the prior art, by obtaining at least one evaluation parameter corresponding to the mining rule to be evaluated, the invention gives the administrator of an interactive platform an index for evaluating the rule. The rule can then be optimized and updated to improve each evaluation parameter, so that the interactive platform can identify and handle junk information more accurately. This ensures the normal operation of the platform and further serves the goals of conserving network resources and keeping the open interactive platform clean.
Brief Description of the Drawings
Other features, objects and advantages of the invention will become more apparent from the following detailed description of non-limiting embodiments, read with reference to the accompanying drawings:
Fig. 1 is a schematic diagram of an apparatus for evaluating junk information mining rules according to one aspect of the invention;
Fig. 2 is a schematic diagram of an apparatus for evaluating junk information mining rules according to a preferred embodiment of the invention;
Fig. 3 is a flowchart of a method for evaluating junk information mining rules according to another aspect of the invention;
Fig. 4 is a flowchart of a method for evaluating junk information mining rules according to a preferred embodiment of the invention.
The same or similar reference numerals in the drawings denote the same or similar components.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings.
Fig. 1 is a schematic diagram of an apparatus for evaluating junk information mining rules according to one aspect of the invention. The evaluation apparatus 1 includes a rule acquisition device 11, a sample acquisition device 12, a junk information acquisition device 13 and a parameter acquisition device 14. Here, the evaluation apparatus 1 includes but is not limited to a computer, a network host, a single network server, a set of multiple network servers, or a cloud composed of multiple servers. The cloud consists of a large number of computers or network servers based on cloud computing (Cloud Computing), where cloud computing is a form of distributed computing: a super virtual computer composed of a group of loosely coupled computers.
Specifically, the rule acquisition device 11 obtains the mining rule to be evaluated. More specifically, the rule acquisition device 11 obtains the mining rule either periodically or in real time in response to a triggering event: for example, by listening in real time for requests carrying a mining rule to be evaluated that are sent by network devices such as network servers, or by periodically reading the mining rule directly from other components of the evaluation apparatus 1 or from a third-party device over an agreed communication method such as the http or https protocol. For example, suppose the evaluation apparatus 1 is a network server. Its rule acquisition device 11 listens in real time to another network server that is used for junk information mining, receives the http request that this server sends over the http protocol and that encapsulates the mining rule to be evaluated, parses the request, and extracts the mining rule. As another example, the rule acquisition device 11 periodically calls a predetermined application programming interface (API) to send a third-party device a request for the mining rule to be evaluated, and receives the mining rule that the third-party device returns. Those skilled in the art will understand that the above ways of obtaining the mining rule to be evaluated are only examples; other existing or future ways of obtaining it, if applicable to the present invention, also fall within its scope of protection and are incorporated herein by reference.
The sample acquisition device 12 obtains the information publication sample for evaluating the mining rule. Specifically, the sample acquisition device 12 obtains multiple published messages, for example by randomly extracting them from a network interactive platform according to a pre-agreed communication protocol, or by retrieving them from an information publication sample library. These messages are marked in advance with junk information flags that distinguish junk information from normal information, and they serve as the information publication sample for evaluating the mining rule obtained by the rule acquisition device 11. The junk information flag indicates whether each published message is real junk information. Here, the information publication sample includes but is not limited to: 1) multiple published messages and their content, such as multiple posts in an online community and their content; 2) junk information flags. The information publication sample library stores multiple published messages and their junk information flags, and includes but is not limited to a relational database, in-memory storage, hard disk storage, and the like. For example, suppose the published messages of a network interactive platform are stored on a network server. The sample acquisition device 12 sends the network server, over a pre-agreed communication protocol such as http or https, a request for an information publication sample for evaluating the mining rule, and receives multiple published messages, marked with junk information flags, that the network server has drawn at random from the platform; these serve as the information publication sample for evaluating the mining rule obtained by the rule acquisition device 11. The network interactive platform includes but is not limited to: online communities, mhkc, blogs, microblogs, news comments, message boards, and the like. As another example, the sample acquisition device 12 retrieves real junk information and non-junk information from the information publication sample library in a certain proportion, and uses them as the information publication sample for evaluating the mining rule obtained by the rule acquisition device 11. Those skilled in the art will understand that the above ways of obtaining the information publication sample are only examples; other existing or future ways of obtaining it, if applicable to the present invention, also fall within its scope of protection and are incorporated herein by reference.
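Drawing junk and non-junk messages from the sample library in a fixed proportion might look like the following sketch. The library layout (a list of content/flag pairs), the function name `draw_sample` and the 20% ratio are assumptions made for illustration; the text does not prescribe any of them.

```python
import random
from typing import List, Tuple

# Assumption: each library entry is (message content, junk information flag).
LabeledMessage = Tuple[str, bool]


def draw_sample(library: List[LabeledMessage], size: int, junk_ratio: float,
                seed: int = 0) -> List[LabeledMessage]:
    """Draw `size` messages, about `junk_ratio` of them flagged as junk."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    junk = [m for m in library if m[1]]
    normal = [m for m in library if not m[1]]
    n_junk = round(size * junk_ratio)
    # random.sample raises ValueError if a pool is smaller than the request.
    return rng.sample(junk, n_junk) + rng.sample(normal, size - n_junk)


# Invented library: 50 junk messages and 200 normal ones.
library = [(f"junk {i}", True) for i in range(50)] + \
          [(f"normal {i}", False) for i in range(200)]
sample = draw_sample(library, size=20, junk_ratio=0.2)
```

Fixing the proportion keeps the number of real junk messages in the sample known in advance, which is the denominator of the recall rate.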
Those skilled in the art will understand that the execution order of the rule acquisition device 11 and the sample acquisition device 12 described above is only an example; in practice they may run in any order, for example in parallel or one after the other. Those skilled in the art will also understand that, for simplicity, Fig. 1 shows only the order in which the rule acquisition device 11 runs before the sample acquisition device 12, and that this omission plainly does not affect the clear and sufficient disclosure of the invention.
Then, the junk information acquisition device 13 mines the information publication sample for junk information based on the mining rule, to obtain the junk information corresponding to the sample. Specifically, the junk information acquisition device 13 analyzes and judges the published messages in the sample obtained by the sample acquisition device 12, based on the mining rule obtained by the rule acquisition device 11: for example, whether the publication frequency of a publisher ID exceeds a predetermined frequency threshold, whether the publisher is in a blacklist, or whether the message content contains junk words. For example, when one or more published messages satisfy any one mining rule, or all mining rules, those messages are judged to be junk information, and in this way all junk information in the sample is obtained.
For example, suppose the mining rule obtained by the rule acquisition device 11 is: a published message is junk information if its publisher ID is in the blacklist or its content contains junk words. The information publication sample obtained by the sample acquisition device 12 contains three published messages, whose contents are:
a "Certificate processing, call 13811112222",
b "Everyone is happy",
c "I hope to make friends";
Then, based on these two mining rules, the junk information acquisition device 13 analyzes and judges the three messages: it string-matches the content of message a against the junk dictionary and finds that "certificate processing" is a junk word, and it finds that the publisher ID of message c is in the blacklist. It therefore judges that messages a and c in the sample are junk information.
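The three-message example can be reproduced with a short sketch. The publisher IDs, the English wording of the messages and the contents of `JUNK_WORDS` and `BLACKLIST` are invented stand-ins for the junk dictionary and blacklist mentioned in the text.

```python
from typing import Dict, List, Set

# Hypothetical stand-ins for the junk dictionary and the blacklist.
JUNK_WORDS: Set[str] = {"certificate processing"}
BLACKLIST: Set[str] = {"user_c"}

# The three messages from the example; publisher IDs are invented.
sample: List[Dict[str, str]] = [
    {"id": "a", "publisher": "user_a",
     "content": "Certificate processing, call 13811112222"},
    {"id": "b", "publisher": "user_b", "content": "Everyone is happy"},
    {"id": "c", "publisher": "user_c", "content": "I hope to make friends"},
]


def is_junk(message: Dict[str, str]) -> bool:
    """Rule: the publisher is blacklisted OR the content contains a junk word."""
    text = message["content"].lower()  # case-insensitive substring matching
    has_junk_word = any(word in text for word in JUNK_WORDS)
    return message["publisher"] in BLACKLIST or has_junk_word


mined_ids = [m["id"] for m in sample if is_junk(m)]
```

Message a is caught by the dictionary match and message c by the blacklist, while message b satisfies neither condition.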
As another example, suppose the mining rule obtained by the rule acquisition device 11 is: a published message is junk information if one publisher ID publishes the same content more often than a predetermined frequency threshold and the content contains junk words. The information publication sample obtained by the sample acquisition device 12 contains 20 published messages, 10 of which have the content "This shop sells all kinds of slimming drugs at favorable prices", share the same publisher ID, and were sent within 1 minute. Based on these two mining rules, the junk information acquisition device 13 analyzes the 10 messages and determines that their content is identical and that they were published by the same publisher ID, so the 10 messages are 10 consecutive publications of the same information; the publication frequency of 10 times/minute exceeds the predetermined frequency threshold of 5 times/minute. At the same time, the junk information acquisition device 13 string-matches the content against the junk dictionary and finds that "sells" and "slimming drugs" are junk words. The junk information acquisition device 13 therefore determines that these 10 published messages in the sample are junk information. Here, the junk words in the illustrated embodiment include but are not limited to banned words, infringing words, indecent words, political words, inflammatory words, advertising words, and the like; the junk dictionary in the illustrated embodiment stores junk words and includes but is not limited to a relational database, in-memory storage, hard disk storage, and the like. Those skilled in the art will understand that the above ways of obtaining junk information are only examples; other existing or future ways of obtaining junk information, if applicable to the present invention, also fall within its scope of protection and are incorporated herein by reference.
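A rule that combines a frequency threshold with a junk-word check can be sketched as follows. The tuple layout, the sliding-window counting in `max_in_window` and the timestamps are assumptions; the text only states the 5 times/minute threshold and the two conditions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

JUNK_WORDS = {"sells", "slimming drugs"}  # hypothetical junk dictionary
FREQ_THRESHOLD = 5                        # publications per minute, from the example

# Assumption: a message is (publisher_id, timestamp in seconds, content).
Message = Tuple[str, float, str]


def max_in_window(times: List[float], window: float) -> int:
    """Largest number of timestamps that fall inside any span of `window` seconds."""
    ordered = sorted(times)
    best = start = 0
    for end, t in enumerate(ordered):
        while t - ordered[start] > window:
            start += 1
        best = max(best, end - start + 1)
    return best


def mine_repeated_junk(sample: List[Message]) -> List[Message]:
    """Flag identical messages that one publisher sends more than FREQ_THRESHOLD
    times within a minute AND whose content contains a junk word."""
    groups: Dict[Tuple[str, str], List[Message]] = defaultdict(list)
    for msg in sample:
        groups[(msg[0], msg[2])].append(msg)

    mined: List[Message] = []
    for (_, content), msgs in groups.items():
        too_frequent = max_in_window([m[1] for m in msgs], 60.0) > FREQ_THRESHOLD
        has_junk_word = any(word in content for word in JUNK_WORDS)
        if too_frequent and has_junk_word:
            mined.extend(msgs)
    return mined


ad = "This shop sells all kinds of slimming drugs at favorable prices"
sample = [("shop_1", float(i * 5), ad) for i in range(10)]            # 10 copies in 45 s
sample += [(f"user_{i}", float(i), f"hello {i}") for i in range(10)]  # 10 normal messages
mined = mine_repeated_junk(sample)
```

Grouping by publisher ID and content before counting mirrors the example, where the frequency is measured for repeated publications of the same content by one publisher.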
Then, the parameter acquisition device 14 obtains, according to the junk information and in combination with the information publication sample, at least one evaluation parameter corresponding to the mining rule. Specifically, the parameter acquisition device 14 compares the junk information that the junk information acquisition device 13 obtained through junk information mining with the published messages and junk information flags contained in the sample obtained by the sample acquisition device 12, and thereby obtains the number of real junk messages and the number of non-junk messages among the mined junk information. Together with the number of published messages in the sample, it then obtains at least one evaluation parameter, such as the recall rate of the mining rule. The evaluation parameters include but are not limited to: 1) the recall rate corresponding to the mining rule, calculated as "recall rate = number of real junk messages obtained by junk information mining / number of real junk messages in the information publication sample"; 2) the accuracy rate corresponding to the mining rule, calculated as "accuracy rate = number of real junk messages obtained by junk information mining / number of junk messages obtained by junk information mining". For example, suppose the information publication sample contains 500 published messages, of which 100 are shown by their junk information flags to be real junk information, and the junk information acquisition device 13 mines 80 junk messages from the sample. The parameter acquisition device 14 then compares the mined junk messages with the real junk information in the sample and finds that 40 of the mined messages are real junk information. Using the formula "accuracy rate = number of real junk messages obtained by junk information mining / number of junk messages obtained by junk information mining", it calculates an accuracy rate of 50% (= 40/80); using the formula "recall rate = number of real junk messages obtained by junk information mining / number of real junk messages in the information publication sample", it calculates a recall rate of 40% (= 40/100). Those skilled in the art will understand that the above ways of obtaining evaluation parameters are only examples; other existing or future ways of obtaining evaluation parameters, if applicable to the present invention, also fall within its scope of protection and are incorporated herein by reference.
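The two formulas can be checked against the worked numbers (500 messages, 100 real junk, 80 mined, 40 of them real junk) with a few lines of code. The function names are illustrative, and the guards against division by zero are an addition not discussed in the text.

```python
def accuracy_rate(true_mined: int, total_mined: int) -> float:
    """Real junk messages mined / all messages mined."""
    return true_mined / total_mined if total_mined else 0.0


def recall_rate(true_mined: int, real_junk_in_sample: int) -> float:
    """Real junk messages mined / real junk messages in the sample."""
    return true_mined / real_junk_in_sample if real_junk_in_sample else 0.0


# Counts taken from the worked example in the text.
acc = accuracy_rate(true_mined=40, total_mined=80)
rec = recall_rate(true_mined=40, real_junk_in_sample=100)
print(f"accuracy rate = {acc:.0%}, recall rate = {rec:.0%}")
```

Note that the total of 500 messages does not enter either formula; only the labeled junk count and the mined counts do.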
Preferably, the rule acquisition device 11, the sample acquisition device 12, the junk information acquisition device 13 and the parameter acquisition device 14 work together continuously. Specifically, the rule acquisition device 11 obtains the mining rule to be evaluated; the sample acquisition device 12 obtains the information publication sample for evaluating the mining rule; then, the junk information acquisition device 13 mines the sample for junk information based on the mining rule, to obtain the junk information corresponding to the sample; then, the parameter acquisition device 14 obtains, according to the junk information and in combination with the sample, at least one evaluation parameter corresponding to the mining rule. Here, those skilled in the art will understand that "continuously" means that each device, according to a preset or dynamically adjusted operating mode, keeps performing its task (obtaining the mining rule to be evaluated, obtaining the information publication sample, obtaining the junk information, and obtaining the evaluation parameters) until the rule acquisition device 11 has stopped obtaining mining rules to be evaluated for an extended period.
Here, it should be noted that the numerical values in the examples are illustrative only; they are given to aid understanding of the invention and are not real data from practical applications. Unless otherwise stated, numerical values appearing elsewhere serve the same purpose as those here and, for brevity, are not discussed again.
Preferably, the sample acquisition device 12 obtains, according to the mining rule, the information publication sample corresponding to that rule from the information publication sample library. Specifically, based on the mining rule obtained by the rule acquisition device 11, the sample acquisition device 12 may run a matching query in the sample library: when any mining rule matches the mining rule indicated for a published message in the library, it retrieves that message, and it takes all messages retrieved in this way as the information publication sample. Alternatively, it may query the sample library for a certain number of junk messages that those mining rules previously failed to mine, and use them as the information publication sample. For example, suppose the mining rule obtained by the rule acquisition device 11 is: a published message is junk information if its publisher ID is in the blacklist. The sample acquisition device 12 then, according to this rule, randomly selects several publisher IDs from the blacklist and runs a matching query with these IDs in the sample library to obtain a number of published messages. Alternatively, it matches the publisher IDs of all published messages in the sample library against the blacklist, finds 200 publisher IDs that are in the blacklist, and retrieves the published messages corresponding to these 200 publisher IDs as the information publication sample. As another example, the rule acquisition device 11 obtains mining rules; the sample acquisition device 12 then runs a matching query in the sample library with the mining rule ID that identifies any one of the rules, and obtains the junk information corresponding to that rule ID together with whether the corresponding mining rule successfully mined that junk information. It then extracts all junk information that the corresponding mining rule failed to mine and takes a certain proportion of it (such as 50%) as the information publication sample. Those skilled in the art will understand that the above ways of obtaining the information publication sample are only examples; other existing or future ways of obtaining it, if applicable to the present invention, also fall within its scope of protection and are incorporated herein by reference.
Fig. 2 is a schematic diagram of an apparatus for evaluating junk information mining rules according to a preferred embodiment of the invention, in which the parameter acquisition device 14' further includes a result acquisition unit 141' and a parameter acquisition unit 142'. Specifically, the result acquisition unit 141' compares the actual junk information preset in the information publication sample with the mined junk information, and obtains a comparative analysis result corresponding to the junk information; the parameter acquisition unit 142' then obtains at least one of the evaluation parameters according to the comparative analysis result. Here, the devices 11'-13' shown in Fig. 2 are the same as the devices 11-13 described above with reference to Fig. 1; for brevity, that description is incorporated herein by reference and not repeated.
More specifically, the result acquisition unit 141' compares, one by one, the actual junk information preset in the information publication sample obtained by the sample acquisition device 12' with the junk information that the junk information acquisition device 13' mined based on the mining rule, to obtain the comparative analysis result corresponding to that junk information. The comparative analysis result includes but is not limited to: 1) the number of real junk messages among the mined junk information; 2) the number of non-junk messages among the mined junk information; 3) the keywords in the content of the non-junk messages among the mined junk information; 4) the credit rating of the publishers of the non-junk messages among the mined junk information; 5) the publication frequency of the publishers of the real junk messages; and so on. For example, suppose the information publication sample obtained by the sample acquisition device 12' contains 20 published messages, 10 of which are real junk information, and the junk information acquisition device 13' mines 6 junk messages from the sample based on the mining rule. The result acquisition unit 141' then compares the mined junk messages with the real junk information in the sample and finds that 4 of the mined messages are real junk information, that these real junk messages were published by the same publisher ID, and that this publisher's publication frequency is 4 times/minute.
Then, the parameter acquisition unit 142' uses the comparative analysis result obtained by the result acquisition unit 141' to calculate at least one evaluation parameter by formula, such as the accuracy rate corresponding to the mining rule obtained by the rule acquisition device 11'. Continuing the example above: the information publication sample obtained by the sample acquisition device 12' contains 20 published messages, 10 of which are real junk information; the junk information acquisition device 13' mines 6 junk messages based on the mining rule; and the result acquisition unit 141' determines that 4 of them are real junk information. The parameter acquisition unit 142' therefore calculates an accuracy rate of 67% (= 4/6) and a recall rate of 40% (= 4/10).
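The split between comparing results and computing parameters can be illustrated with set operations on message IDs. The ID layout, the function name `compare` and the keys of the returned dictionary are assumptions; only the counts 20, 10, 6 and 4 come from the example.

```python
from typing import Dict, Set


def compare(labeled_junk: Set[int], mined: Set[int]) -> Dict[str, int]:
    """Compare mined message IDs against the pre-assigned junk labels."""
    return {
        "real_junk_mined": len(mined & labeled_junk),   # correctly mined
        "non_junk_mined": len(mined - labeled_junk),    # mined but not junk
        "real_junk_missed": len(labeled_junk - mined),  # junk the rule missed
    }


# Message IDs 0..19; IDs 0..9 are labeled as real junk (invented layout).
labeled_junk = set(range(10))
mined = {0, 1, 2, 3, 10, 11}  # 6 mined messages, 4 of them real junk

# First stage: comparative analysis result.
result = compare(labeled_junk, mined)
# Second stage: evaluation parameters derived from that result.
accuracy = result["real_junk_mined"] / len(mined)
recall = result["real_junk_mined"] / len(labeled_junk)
```

Keeping the comparison result as its own structure leaves room for the other items listed in the text, such as keywords or publisher credit ratings of wrongly mined messages, which this sketch does not model.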
Those skilled in the art will understand that the above ways of obtaining the comparative analysis result and the evaluation parameters are only examples; other existing or future ways of obtaining them, if applicable to the present invention, also fall within its scope of protection and are incorporated herein by reference.
Preferably, the mining rule mines the information publication sample for junk information based on at least one of the following:
- information publication frequency;
- information publication content;
- the historical behavior record of the information publisher;
- the attributes of the information publisher.
1) Specifically, the information publication frequency includes but is not limited to: the publication frequency of one information publisher, the publication frequency of published messages with identical content, the publication frequency of information from the same IP address, and the like. For example, the information publication sample contains 10 published messages. The junk information acquisition device 13' analyzes these 10 messages and determines that 6 of them were published by the same publisher ID within 1 minute; this publisher's publication frequency of 10 times/minute exceeds the predetermined frequency threshold of 5 times/minute, so these 6 published messages are judged to be junk information.
2) The information publication content includes but is not limited to: junk words contained in the published content, multiple published messages having identical content, and the like. For example, the information publication sample contains 3 published messages, whose contents are:
a "Certificate processing, call 13811112222",
b "Everyone is happy",
c "I hope to make friends";
The junk information acquisition device 13' string-matches the content of these 3 published messages against the junk dictionary, finds the junk word "certificate processing" in message a, and accordingly judges message a to be junk information.
3) The historical behavior record of the information publisher includes but is not limited to: the content of the publisher's past messages, the time records of the publisher's past messages, the publisher's historical online time, and the like. For example, the junk information acquisition device 13' looks up the publisher ID of a published message in the sample in a historical behavior database and finds that the publisher's past messages were published between 1:00 AM and 6:00 AM and that their content contains junk words; it therefore judges the published message to be junk information. The historical behavior database in this example stores the historical behavior records of information publishers and includes but is not limited to a relational database, in-memory storage, hard disk storage, and the like.
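A check against a publisher's history might be sketched as below. The in-memory `HISTORY` dictionary stands in for the historical behavior database, and the exact condition (every past message published between 01:00 and 06:00, and at least one containing a junk word) is one possible reading of the example, not something the text fixes.

```python
from datetime import datetime
from typing import Dict, List, Tuple

JUNK_WORDS = {"certificate processing"}  # hypothetical junk dictionary

# Hypothetical history store: publisher ID -> list of (publish time, content).
HISTORY: Dict[str, List[Tuple[datetime, str]]] = {
    "user_x": [
        (datetime(2011, 9, 1, 2, 30), "certificate processing, call now"),
        (datetime(2011, 9, 2, 4, 10), "certificate processing, best price"),
    ],
    "user_y": [(datetime(2011, 9, 1, 14, 0), "nice weather today")],
}


def flagged_by_history(publisher_id: str) -> bool:
    """True if all past messages were published between 01:00 and 06:00
    and at least one past message contains a junk word."""
    records = HISTORY.get(publisher_id, [])
    if not records:
        return False  # no history: this criterion cannot decide
    night_only = all(1 <= ts.hour < 6 for ts, _ in records)
    junk_words_used = any(word in text for _, text in records
                          for word in JUNK_WORDS)
    return night_only and junk_words_used
```

Under this reading, a message published by `user_x` would be judged junk information, while messages from `user_y` or from a publisher without history would not be flagged by this criterion.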
4) attribute of described information publisher includes but is not limited to:Whether information publisher is in blacklist, information issue The personal background information that person pre-enters.For example, whole issues that junk information acquisition device 13 ' is issued information in sample are believed The information publisher ID of breath carries out matching inquiry in blacklist, obtains two information publishers released news in blacklist In, then judge that this two release news as junk information.
Those skilled in the art will understand that the four items above may be used individually, or in combination, to mine junk information from the information publication sample. Those skilled in the art will also understand that the above mining rules for junk information are merely examples; other existing or future mining rules for junk information, where applicable to the present invention, also fall within the scope of protection of the present invention and are incorporated herein by reference.
In a preferred embodiment (see Fig. 2), the assessment equipment 1 further includes a rule optimization device (not shown), which optimizes the mining rule according to the evaluation parameter. This preferred embodiment is described in detail below with reference to Fig. 2. The rule acquisition device 11' obtains the mining rule to be assessed; the sample acquisition device 12' obtains the information publication sample for assessing the mining rule; the junk information acquisition device 13' performs junk information mining on the information publication sample based on the mining rule and obtains the junk information corresponding to the information publication sample; the result acquisition unit 141' in the parameter acquisition device 14' compares the preset actual junk information in the information publication sample with the mined junk information and obtains the comparative analysis result corresponding to the junk information; and the parameter acquisition unit 142' in the parameter acquisition device 14' obtains at least one evaluation parameter according to the comparative analysis result. The detailed process is the same as that performed by the rule acquisition device 11', the sample acquisition device 12', the junk information acquisition device 13' and the parameter acquisition device 14' in the embodiment described above with reference to Fig. 2; for brevity, it is incorporated herein by reference and not repeated.
Specifically, the rule optimization device optimizes the mining rule according to the evaluation parameter obtained by the parameter acquisition unit 142', such as the accuracy rate corresponding to the mining rule. For example, when the accuracy rate among the evaluation parameters is below a preset accuracy rate threshold, the mining rule is adjusted so that messages published by information publishers with a high credit rating are not subjected to junk information mining, thereby raising the accuracy rate. For instance, assume that the parameter acquisition unit 142' calculates by formula an accuracy rate of 50%. The rule optimization device determines that 50% is below the preset accuracy rate threshold of 60%, and therefore adjusts the mining rule so that messages published by information publishers with a high credit rating are not subjected to junk information mining, in order to raise the accuracy rate among the evaluation parameters. Those skilled in the art will understand that the above manner of optimizing the mining rule is merely an example; other existing or future manners of optimizing the mining rule, where applicable to the present invention, also fall within the scope of protection of the present invention and are incorporated herein by reference.
Preferably, the rule optimization device may also optimize the mining rule according to the evaluation parameter in combination with the comparative analysis result. Specifically, the rule optimization device optimizes the mining rule according to the evaluation parameter obtained by the parameter acquisition unit 142', such as the recall rate corresponding to the mining rule, and according to the comparative analysis result obtained by the result acquisition unit 141'. For example, when the recall rate is below a preset recall rate threshold, the manner of optimization includes, but is not limited to: lowering, as indicated by the comparative analysis result, the posting frequency threshold that the mining rule uses to mine junk information, or lowering the cumulative junk word count threshold, so as to raise the recall rate. For instance, assume that the recall rate among the evaluation parameters obtained by the parameter acquisition unit 142' is 40%, which is below the preset recall rate threshold of 50%. The rule optimization device then learns from the comparative analysis result obtained by the result acquisition unit 141' that the average posting frequency of the publishers of junk information is 4 times per minute, and accordingly lowers the posting frequency threshold from 5 times per minute to 4 times per minute to raise the recall rate. As another example, assume that the recall rate among the evaluation parameters obtained by the parameter acquisition unit 142' is below the preset recall rate threshold. The rule optimization device then learns from the comparative analysis result obtained by the result acquisition unit 141' that the content of each junk message contains 2 junk words on average, and accordingly lowers the cumulative junk word count threshold for junk content from 3 per message to 2 per message to raise the recall rate. Those skilled in the art will understand that the above manner of optimizing the mining rule is merely an example; other existing or future manners of optimizing the mining rule, where applicable to the present invention, also fall within the scope of protection of the present invention and are incorporated herein by reference.
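The recall-driven adjustment of the posting frequency threshold can be sketched as follows. The rule is represented as a plain dictionary and the function and key names are hypothetical; the source does not prescribe a data structure.

```python
def lower_frequency_threshold(rule, recall, recall_threshold, avg_junk_frequency):
    """Return a copy of `rule` whose frequency threshold is lowered when recall is too low.

    `avg_junk_frequency` is the average posting frequency (per minute) of the
    publishers of junk information, taken from the comparative analysis result.
    """
    updated = dict(rule)
    if recall < recall_threshold:
        # Never raise the threshold; only lower it towards the observed average.
        updated["frequency_threshold"] = min(
            rule["frequency_threshold"], avg_junk_frequency
        )
    return updated

# Figures from the example: recall 40% < threshold 50%, average frequency 4 per minute.
rule = {"frequency_threshold": 5}
new_rule = lower_frequency_threshold(rule, 0.40, 0.50, 4)
print(new_rule)  # {'frequency_threshold': 4}
```

The same pattern would apply to the cumulative junk word count threshold (3 per message lowered to 2 per message in the second example).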
More preferably, the rule optimization device may also optimize the mining rule according to the evaluation parameters in combination with preset parameter priority information for the evaluation parameters. Specifically, the rule optimization device selects a suitable manner of optimizing the mining rule according to the evaluation parameters, such as the recall rate and the accuracy rate, and according to the preset parameter priority information, for example that the accuracy rate has a higher priority than the recall rate, so as to improve the evaluation parameters. For instance, assume that among the evaluation parameters obtained by the parameter acquisition unit 142' the accuracy rate is 50%, which is below the preset accuracy rate threshold of 60%, and the recall rate is 40%, which is below the preset recall rate threshold of 50%. According to the preset parameter priority information, in which the accuracy rate has a higher priority than the recall rate, the rule optimization device adjusts the mining rule so that messages published by high-quality users are not mined, thereby raising the accuracy rate. Those skilled in the art will understand that the above manner of optimizing the mining rule is merely an example; other existing or future manners of optimizing the mining rule, where applicable to the present invention, also fall within the scope of protection of the present invention and are incorporated herein by reference.
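One way to read the priority-based selection is: among the evaluation parameters that are below their thresholds, improve the one with the highest priority first. The sketch below assumes that reading; the function name and the list-based priority encoding are illustrative.

```python
def choose_parameter_to_improve(params, thresholds, priority):
    """Return the highest-priority parameter that is below its threshold, or None.

    `priority` lists parameter names from highest to lowest priority.
    """
    for name in priority:
        if params[name] < thresholds[name]:
            return name
    return None

# Figures from the example: both parameters are below their thresholds,
# and the accuracy rate has priority over the recall rate.
params = {"accuracy": 0.50, "recall": 0.40}
thresholds = {"accuracy": 0.60, "recall": 0.50}
target = choose_parameter_to_improve(params, thresholds, ["accuracy", "recall"])
print(target)  # accuracy
```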
More preferably, the assessment equipment 1 further includes a priority update device (not shown), which can update the parameter priority information according to the evaluation parameters. Specifically, the priority update device acts on the evaluation parameters obtained by the parameter acquisition unit 142': for example, when the recall rate is below the preset recall rate threshold and the accuracy rate is above the preset accuracy rate threshold, it updates the parameter priority so that the recall rate has a higher priority than the accuracy rate. For instance, if among the evaluation parameters obtained by the parameter acquisition unit 142' the recall rate is below the preset recall rate threshold and the accuracy rate is above the preset accuracy rate threshold, the priority update device changes the preset parameter priority information, in which the accuracy rate has a higher priority than the recall rate, so that the recall rate has a higher priority than the accuracy rate. Those skilled in the art will understand that the above manner of updating the parameter priority information is merely an example; other existing or future manners of updating the parameter priority information, where applicable to the present invention, also fall within the scope of protection of the present invention and are incorporated herein by reference.
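The priority update described above amounts to a single conditional swap. A minimal sketch, with a hypothetical function name and with numeric values chosen only for illustration (the source gives no figures for this example):

```python
def update_priority(priority, params, thresholds):
    """Return the new priority order, highest priority first."""
    recall_low = params["recall"] < thresholds["recall"]
    accuracy_ok = params["accuracy"] > thresholds["accuracy"]
    if recall_low and accuracy_ok:
        # The accuracy rate is already satisfactory, so the recall rate takes precedence.
        return ["recall", "accuracy"]
    return list(priority)

thresholds = {"accuracy": 0.60, "recall": 0.50}
# Illustrative values: accuracy above its threshold, recall below its threshold.
priority = update_priority(
    ["accuracy", "recall"], {"accuracy": 0.70, "recall": 0.40}, thresholds
)
print(priority)  # ['recall', 'accuracy']
```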
Preferably, the equipment further includes an optimization control device (not shown), which can end the optimization of the mining rule when the evaluation parameter reaches an evaluation parameter threshold. Specifically, the junk information acquisition device 13' performs junk information mining on the information publication sample based on the mining rule and obtains the junk information corresponding to the information publication sample. Next, the result acquisition unit 141' in the parameter acquisition device 14' compares the preset actual junk information in the information publication sample with the mined junk information and obtains the comparative analysis result corresponding to the junk information. Next, the parameter acquisition unit 142' obtains at least one evaluation parameter according to the comparative analysis result. The junk information acquisition device 13' and the parameter acquisition device 14' repeat this cycle using the mining rule as updated each time by the rule optimization device; the optimization control device checks the evaluation parameters obtained in each cycle and, when they reach the evaluation parameter threshold, ends the optimization of the mining rule. Here, the evaluation parameter threshold means a preset expected value of the evaluation parameter. For example, when the optimization control device detects that the accuracy rate exceeds the predetermined accuracy rate threshold and the recall rate exceeds the predetermined recall rate threshold, it stops optimizing the mining rule. Those skilled in the art will understand that the above manner of ending the optimization of the mining rule is merely an example; other existing or future manners of ending the optimization of the mining rule, where applicable to the present invention, also fall within the scope of protection of the present invention and are incorporated herein by reference.
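The evaluate-optimize cycle with its stopping condition can be sketched as a loop. The `evaluate` and `optimize` callables stand in for the mining/parameter acquisition devices and the rule optimization device; the toy versions below and the `max_rounds` safety limit are assumptions added for illustration only.

```python
def optimize_until_threshold(rule, evaluate, optimize, thresholds, max_rounds=20):
    """Repeat evaluate -> optimize until every parameter exceeds its threshold.

    `max_rounds` is a safety limit that is not part of the source description.
    """
    for _ in range(max_rounds):
        params = evaluate(rule)
        if all(params[name] > thresholds[name] for name in thresholds):
            return rule, params
        rule = optimize(rule, params)
    return rule, evaluate(rule)

# Toy stand-ins: recall rises as the frequency threshold is lowered.
def toy_evaluate(rule):
    return {"accuracy": 0.9, "recall": (10 - rule["frequency_threshold"]) / 10}

def toy_optimize(rule, params):
    return {"frequency_threshold": rule["frequency_threshold"] - 1}

final_rule, final_params = optimize_until_threshold(
    {"frequency_threshold": 7},
    toy_evaluate,
    toy_optimize,
    {"accuracy": 0.6, "recall": 0.5},
)
print(final_rule, final_params)
```

With these toy functions the loop stops once the frequency threshold has been lowered to 4, the first value at which both parameters exceed their thresholds.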
Preferably, the evaluation parameter includes at least one of the following:
- the recall rate corresponding to the mining rule;
- the accuracy rate corresponding to the mining rule.
Specifically, the evaluation parameters obtained by the parameter acquisition unit 142' include, but are not limited to: the recall rate corresponding to the mining rule and the accuracy rate corresponding to the mining rule. The recall rate is the ratio of the number of truly junk messages that the junk information acquisition device 13' obtains through junk information mining to the number of actual junk messages in the information publication sample. The accuracy rate is the ratio of the number of truly junk messages that the junk information acquisition device 13' obtains through junk information mining to the total number of junk messages that the junk information acquisition device 13' obtains through junk information mining. The accuracy rate and the recall rate may constrain each other: a high accuracy rate may lead to a low recall rate, and a high recall rate may lead to a low accuracy rate. A balance therefore needs to be found between the recall rate and the accuracy rate so that junk information is mined in an optimal manner. Those skilled in the art will understand that the above evaluation parameters are merely examples; other existing or future evaluation parameters, where applicable to the present invention, also fall within the scope of protection of the present invention and are incorporated herein by reference.
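The two definitions above can be written compactly. As a notational sketch (the symbols $M$ and $A$ are introduced here for illustration and do not appear in the source), let $M$ be the set of published messages that the mining rule flags as junk and $A$ the set of messages in the sample that are actually junk:

```latex
\[
\text{recall rate} = \frac{|M \cap A|}{|A|},
\qquad
\text{accuracy rate} = \frac{|M \cap A|}{|M|}
\]
```

Both ratios share the numerator $|M \cap A|$, the number of mined messages that are truly junk; they differ only in whether that count is compared with all actual junk or with everything the rule flagged.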
Fig. 3 shows a flow chart of a method for assessing a junk information mining rule according to one aspect of the present invention. Here, the assessment equipment 1 includes, but is not limited to, a computer, a network host, a single network server, a set of multiple network servers, or a cloud composed of multiple servers. The cloud consists of a large number of computers or network servers based on cloud computing, where cloud computing is a type of distributed computing: a super virtual computer composed of a group of loosely coupled computers.
Specifically, in step S1, the assessment equipment 1 obtains the mining rule to be assessed. More specifically, in step S1, the assessment equipment 1 obtains the mining rule to be assessed periodically or in real time in response to an event trigger, for example by monitoring in real time the requests carrying a mining rule to be assessed that are sent by network devices such as network servers, or by periodically reading the mining rule to be assessed, via an agreed communication method such as the http or https communication protocol, directly from other parts of the assessment equipment 1 or from a third-party device. For example, assume that the assessment equipment 1 is a network server. In step S1, the network server monitors in real time another network server used for junk information mining, obtains the http request that the other network server sends via the http communication protocol and that encapsulates the mining rule to be assessed, parses the http request, and extracts the mining rule to be assessed from it. As another example, in step S1, the assessment equipment 1 periodically calls a predetermined application programming interface (API) to send a request for the mining rule to be assessed to a third-party device, and receives the mining rule to be assessed that the third-party device returns. Those skilled in the art will understand that the above manner of obtaining the mining rule to be assessed is merely an example; other existing or future manners of obtaining the mining rule to be assessed, where applicable to the present invention, also fall within the scope of protection of the present invention and are incorporated herein by reference.
In step S2, the assessment equipment 1 obtains the information publication sample for assessing the mining rule. Specifically, in step S2, the assessment equipment 1, for example, randomly extracts multiple published messages from a network interactive platform according to a pre-agreed communication protocol, or obtains multiple published messages from an information publication sample library, where these published messages have been labelled in advance with junk information identifiers to distinguish junk information from normal information, and uses these published messages as the information publication sample for assessing the mining rule that the assessment equipment 1 obtained in step S1. The junk information identifier indicates whether each published message is truly junk information. Here, the information publication sample includes, but is not limited to: 1) multiple published messages and their content, such as multiple posts in a web community and their content; 2) the junk information identifiers. The information publication sample library stores multiple published messages and their junk information identifiers, and includes, but is not limited to, a relational database, memory storage, hard-disk storage, and the like.
For example, assume that the messages published on the network interactive platform are stored on a network server. In step S2, the assessment equipment 1 sends the network server, according to a pre-agreed communication protocol such as http or https, a request for an information publication sample for assessing the mining rule, and receives multiple published messages labelled with junk information identifiers that the network server has randomly selected from the network interactive platform, as the information publication sample for assessing the mining rule that the assessment equipment 1 obtained in step S1. The network interactive platform includes, but is not limited to: web communities, post bars, blogs, microblogs, news comments, SMS interaction, and the like. As another example, in step S2, the assessment equipment 1 obtains truly junk information and non-junk information from the information publication sample library in a certain proportion, and uses them as the information publication sample for assessing the mining rule that the assessment equipment 1 obtained in step S1. Those skilled in the art will understand that the above manner of obtaining the information publication sample is merely an example; other existing or future manners of obtaining the information publication sample, where applicable to the present invention, also fall within the scope of protection of the present invention and are incorporated herein by reference.
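Drawing junk and non-junk messages from a labelled sample library in a fixed proportion can be sketched as below. The library layout (a list of `(message, is_junk)` pairs), the function name, the fixed seed, and the 25% proportion are assumptions; the source only says "a certain proportion".

```python
import random

def draw_sample(library, total, junk_ratio, seed=0):
    """Draw `total` labelled messages, of which `junk_ratio` are truly junk.

    `library` is a list of (message, is_junk) pairs. `random.Random.sample`
    raises ValueError if the library holds too few messages of either kind.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    junk = [item for item in library if item[1]]
    normal = [item for item in library if not item[1]]
    n_junk = round(total * junk_ratio)
    return rng.sample(junk, n_junk) + rng.sample(normal, total - n_junk)

# Hypothetical library: 30 junk messages and 70 normal messages.
library = [(f"junk {i}", True) for i in range(30)]
library += [(f"normal {i}", False) for i in range(70)]

sample = draw_sample(library, total=20, junk_ratio=0.25)
print(len(sample), sum(is_junk for _, is_junk in sample))  # 20 5
```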
Those skilled in the art will understand that the order in which the assessment equipment 1 executes step S1 and step S2 above is merely an example; in practice, the two steps may be performed in any order, for example in parallel or sequentially. Those skilled in the art will also understand that Fig. 3 shows only one execution order of the assessment equipment 1 for brevity, and that this omission does not affect the clear and sufficient disclosure of the present invention.
Then, in step S3, the assessment equipment 1 performs junk information mining on the information publication sample based on the mining rule, and obtains the junk information corresponding to the information publication sample. Specifically, in step S3, the assessment equipment 1 applies the mining rule it obtained in step S1, such as whether the posting frequency of an information publisher ID exceeds a predetermined frequency threshold, whether the information publisher is on a blacklist, or whether the content of a published message contains junk words, to judge the published messages in the information publication sample it obtained in step S2. For example, when one or more published messages satisfy any one of these mining rules, or all of them, those published messages are judged to be junk information, so that all junk information in the information publication sample is obtained.
For example, assume that in step S1 the assessment equipment 1 obtains the following mining rule: a published message is junk information if its information publisher ID is on the blacklist or if it contains junk words. Then, in step S2, the information publication sample obtained by the assessment equipment 1 includes three published messages whose contents are, respectively:
a "certificate handling, call 13811112222",
b "everybody is happy",
c "I wish to make friends";
Then, in step S3, based on these two mining rules, the assessment equipment 1 judges the three published messages: by matching the content of published message a, as a character string, against the junk lexicon, it finds that "certificate handling" is a junk word, and it finds that the information publisher ID of published message c is on the blacklist; it therefore judges published messages a and c in the information publication sample to be junk information.
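The "blacklist OR junk word" rule in this example can be sketched as a predicate over a message record. The record fields, publisher IDs, lexicon and blacklist contents below are hypothetical placeholders; only the three message texts and the outcome (a and c are junk) come from the example.

```python
JUNK_LEXICON = {"certificate handling"}  # hypothetical lexicon entry
BLACKLIST = {"user_c"}                   # hypothetical blacklisted publisher ID

def is_junk(message):
    """A message is junk if its publisher is blacklisted OR it contains a junk word."""
    in_blacklist = message["publisher_id"] in BLACKLIST
    has_junk_word = any(w in message["content"].lower() for w in JUNK_LEXICON)
    return in_blacklist or has_junk_word

sample = [
    {"id": "a", "publisher_id": "user_a",
     "content": "certificate handling, call 13811112222"},
    {"id": "b", "publisher_id": "user_b", "content": "everybody is happy"},
    {"id": "c", "publisher_id": "user_c", "content": "I wish to make friends"},
]

mined = [m["id"] for m in sample if is_junk(m)]
print(mined)  # ['a', 'c']
```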
As another example, assume that in step S1 the assessment equipment 1 obtains the following mining rule: a published message is junk information if the frequency at which one information publisher ID publishes messages with the same content exceeds a predetermined frequency threshold and the message contains junk words. Then, in step S2, the information publication sample obtained by the assessment equipment 1 includes 20 published messages, 10 of which have the content "This store sells all kinds of slimming drugs at favorable prices", come from the same information publisher ID, and were sent within 1 minute. Then, in step S3, the assessment equipment 1 analyzes these 10 published messages based on the two mining rules and determines that their content is identical and that they were published by the same information publisher ID, so that these ten published messages are 10 consecutive postings of the same information, at a posting frequency of 10 times per minute, which exceeds the predetermined frequency threshold of 5 times per minute. At the same time, the assessment equipment 1 matches the content, as a character string, against the junk lexicon and finds that "sells" and "slimming drugs" are junk words. The assessment equipment 1 thus determines in step S3 that these 10 published messages in the information publication sample are junk information. Here, the junk words in the examples include, but are not limited to, prohibited words, infringing words, indecent words, political words, inflammatory words, advertising words, and the like; the junk lexicon in the examples stores junk words and includes, but is not limited to, a relational database, memory storage, hard-disk storage, and the like. Those skilled in the art will understand that the above manner of obtaining junk information is merely an example; other existing or future manners of obtaining junk information, where applicable to the present invention, also fall within the scope of protection of the present invention and are incorporated herein by reference.
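The combined "frequency AND junk word" rule can be sketched as follows. The record fields, the timestamps in seconds, the lexicon, and the way a burst shorter than one minute is counted as one minute are all assumptions made for this illustration.

```python
from collections import defaultdict

JUNK_LEXICON = {"sells", "slimming drugs"}  # hypothetical lexicon
FREQUENCY_THRESHOLD = 5                     # postings per minute, from the example

def mine(sample, window_seconds=60):
    """Flag messages whose (publisher, content) group is both frequent and junk-worded."""
    groups = defaultdict(list)
    for m in sample:
        groups[(m["publisher_id"], m["content"])].append(m)

    junk = []
    for (_, content), msgs in groups.items():
        span = max(m["t"] for m in msgs) - min(m["t"] for m in msgs)
        # Assumption: a burst shorter than one window counts as one full minute.
        minutes = max(span, window_seconds) / 60
        per_minute = len(msgs) / minutes
        has_junk_word = any(w in content.lower() for w in JUNK_LEXICON)
        if per_minute > FREQUENCY_THRESHOLD and has_junk_word:
            junk.extend(msgs)
    return junk

# 10 identical advertisements from one publisher within a minute, plus 10 normal messages.
ad = "This store sells all kinds of slimming drugs at favorable prices"
spam = [{"publisher_id": "shop_1", "content": ad, "t": 5 * i} for i in range(10)]
normal = [
    {"publisher_id": f"user_{i}", "content": f"hello from user {i}", "t": 6 * i}
    for i in range(10)
]

junk = mine(spam + normal)
print(len(junk))  # 10
```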
Then, in step S4, the assessment equipment 1 obtains at least one evaluation parameter corresponding to the mining rule according to the junk information in combination with the information publication sample. Specifically, in step S4, the assessment equipment 1 compares the junk information it obtained through junk information mining in step S3 with the published messages and junk information identifiers contained in the information publication sample it obtained in step S2, thereby obtaining the number of truly junk messages and the number of non-junk messages among the mined junk information; the assessment equipment 1 then uses the number of published messages in the information publication sample to obtain at least one evaluation parameter, such as the recall rate of the mining rule. The evaluation parameters include, but are not limited to: 1) the recall rate corresponding to the mining rule, calculated as "recall rate = number of truly junk messages obtained through junk information mining / number of truly junk messages in the information publication sample"; 2) the accuracy rate corresponding to the mining rule, calculated as "accuracy rate = number of truly junk messages obtained through junk information mining / number of junk messages obtained through junk information mining". For example, assume that the information publication sample contains 500 published messages, of which 100 are marked by their junk information identifiers as truly junk information, and that in step S3 the assessment equipment 1 obtains 80 junk messages from the information publication sample through junk information mining. Then, in step S4, the assessment equipment 1 compares the junk messages obtained through junk information mining with the truly junk messages in the information publication sample and finds that 40 of the mined messages are truly junk information. Using the formula "accuracy rate = number of truly junk messages obtained through junk information mining / number of junk messages obtained through junk information mining", the assessment equipment 1 calculates an accuracy rate of 50% (= 40/80); using the formula "recall rate = number of truly junk messages obtained through junk information mining / number of truly junk messages in the information publication sample", it calculates a recall rate of 40% (= 40/100). Those skilled in the art will understand that the above manner of obtaining the evaluation parameters is merely an example; other existing or future manners of obtaining the evaluation parameters, where applicable to the present invention, also fall within the scope of protection of the present invention and are incorporated herein by reference.
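The two formulas and the worked figures (500 messages, 100 actually junk, 80 mined, 40 of those truly junk) can be checked with a short sketch. Messages are represented by integer IDs purely for illustration, and the function name is hypothetical.

```python
def evaluate(mined_ids, actual_junk_ids):
    """Return (accuracy rate, recall rate) for a set of mined message IDs."""
    mined, actual = set(mined_ids), set(actual_junk_ids)
    true_hits = len(mined & actual)  # mined messages that are truly junk
    accuracy = true_hits / len(mined) if mined else 0.0
    recall = true_hits / len(actual) if actual else 0.0
    return accuracy, recall

# Messages 0..99 are labelled junk; the rule mines 80 messages, 40 of them truly junk.
actual = range(100)
mined = list(range(40)) + list(range(100, 140))

accuracy, recall = evaluate(mined, actual)
print(accuracy, recall)  # 0.5 0.4
```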
Preferably, the assessment equipment 1 operates continuously across step S1, step S2, step S3 and step S4. Specifically, in step S1, the assessment equipment 1 obtains the mining rule to be assessed; in step S2, the assessment equipment 1 obtains the information publication sample for assessing the mining rule; then, in step S3, the assessment equipment 1 performs junk information mining on the information publication sample based on the mining rule and obtains the junk information corresponding to the information publication sample; then, in step S4, the assessment equipment 1 obtains at least one evaluation parameter corresponding to the mining rule according to the junk information in combination with the information publication sample. Here, those skilled in the art will understand that "continuously" means that, in each step, the assessment equipment 1 obtains the mining rule to be assessed, the information publication sample, the junk information and the evaluation parameters according to a preset or dynamically adjusted operating mode, until the assessment equipment 1 has stopped obtaining mining rules to be assessed for an extended period.
It should be noted that the numerical values in the examples are for illustration only, to aid understanding of the present invention, and are not real data from practical application. Unless otherwise stated, numerical values appearing elsewhere serve the same purpose; for brevity, this is not repeated.
Preferably, in step S2, the assessment equipment 1 obtains, according to the mining rule, an information publication sample corresponding to the mining rule from the information publication sample library. Specifically, in step S2, the assessment equipment 1 uses the mining rule it obtained in step S1 to perform a matching query in the information publication sample library: when any mining rule matches the mining rule indicated for a published message in the library, that published message is retrieved, and all published messages retrieved by the matching query are used as the information publication sample. Alternatively, the assessment equipment 1 queries the information publication sample library to obtain a certain number of junk messages that those mining rules previously failed to mine, and uses them as the information publication sample. For example, assume that in step S1 the assessment equipment 1 obtains the mining rule that a published message is junk information if its information publisher ID is on the blacklist. Then, in step S2, according to this mining rule, the assessment equipment 1 randomly selects several information publisher IDs from the blacklist and performs a matching query in the information publication sample library with these IDs to obtain several published messages; or it looks up the information publisher IDs of all published messages in the information publication sample library in the blacklist, finds 200 information publisher IDs that are on the blacklist, and accordingly obtains the published messages corresponding to these 200 information publisher IDs, to serve as the information publication sample. As another example, in step S1 the assessment equipment 1 obtains a mining rule; then, in step S2, the assessment equipment 1 performs a matching query in the information publication sample library with the mining rule ID that identifies the mining rule, obtains the junk information corresponding to that mining rule ID together with whether the mining rule corresponding to that ID successfully mined the junk information, extracts all junk information that the corresponding mining rule failed to mine, and uses a certain proportion (such as 50%) of that junk information as the information publication sample. Those skilled in the art will understand that the above manner of obtaining the information publication sample is merely an example; other existing or future manners of obtaining the information publication sample, where applicable to the present invention, also fall within the scope of protection of the present invention and are incorporated herein by reference.
Fig. 4 shows a flow chart of a method for assessing a junk information mining rule in accordance with a preferred embodiment of the present invention. Specifically, in step S41', the assessment equipment 1 compares the preset actual junk information in the information publication sample with the mined junk information and obtains the comparative analysis result corresponding to the junk information; then, in step S42', the assessment equipment 1 obtains at least one evaluation parameter according to the comparative analysis result. Here, steps S1' to S3' shown in Fig. 4 are identical in content to steps S1 to S3 described above with reference to Fig. 3; for brevity, they are incorporated herein by reference and not repeated.
More specifically, in step S41', the assessment equipment 1 compares, one by one, the preset actual junk information in the information publication sample it obtained in step S2' with the junk information it mined based on the mining rule in step S3', to obtain the comparative analysis result corresponding to the mined junk information. The comparative analysis result includes, but is not limited to: 1) the number of truly junk messages among the mined junk information; 2) the number of non-junk messages among the mined junk information; 3) the keywords of the published content of the non-junk messages among the mined junk information; 4) the credit rating of the information publishers of the non-junk messages among the mined junk information; 5) the posting frequency of the information publishers of the truly junk messages; and the like. For example, assume that in step S2' the information publication sample obtained by the assessment equipment 1 contains 20 published messages, of which 10 are truly junk information. Then, in step S3', the assessment equipment 1 mines 6 junk messages from the information publication sample based on the mining rule. Then, in step S41', the assessment equipment 1 compares the junk messages mined based on the mining rule with the truly junk messages in the information publication sample, finds that 4 of the mined messages are truly junk information, and finds that these truly junk messages were published by the same information publisher ID, whose posting frequency is 4 times per minute.
Then, in step S42', the assessment equipment 1 calculates by formula, from the comparative analysis result it obtained in step S41', at least one evaluation parameter, such as the accuracy rate corresponding to the mining rule that the assessment equipment 1 obtained in step S1'. For example, continuing the example above: in step S2', the information publication sample obtained by the assessment equipment 1 contains 20 published messages, of which 10 are truly junk information; in step S3', the assessment equipment 1 mines 6 junk messages based on the mining rule; in step S41', the assessment equipment 1 determines that 4 of them are truly junk information; and in step S42', the assessment equipment 1 calculates by formula an accuracy rate of 67% (= 4/6) and a recall rate of 40% (= 4/10).
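Steps S41' and S42' can be sketched as two small functions: one that produces a comparative analysis result and one that derives the evaluation parameters from it. The dictionary keys and function names are hypothetical, and only the two counts needed for the formulas (plus totals) are modelled; messages are represented by integer IDs.

```python
def compare(mined_ids, actual_junk_ids):
    """Step S41' sketch: compare mined junk with the labelled actual junk."""
    mined, actual = set(mined_ids), set(actual_junk_ids)
    return {
        "true_junk_mined": len(mined & actual),  # mined and truly junk
        "non_junk_mined": len(mined - actual),   # mined but not junk
        "mined_total": len(mined),
        "actual_junk_total": len(actual),
    }

def parameters(result):
    """Step S42' sketch: derive the evaluation parameters from the comparison result."""
    return {
        "accuracy": result["true_junk_mined"] / result["mined_total"],
        "recall": result["true_junk_mined"] / result["actual_junk_total"],
    }

# Figures from the example: 10 actual junk messages (IDs 0..9), 6 mined, 4 of them truly junk.
result = compare(mined_ids=[0, 1, 2, 3, 10, 11], actual_junk_ids=range(10))
params = parameters(result)
print(result["true_junk_mined"], round(params["accuracy"], 2), params["recall"])  # 4 0.67 0.4
```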
Those skilled in the art will understand that the above ways of obtaining the comparative analysis result and the evaluation parameter are only examples; other existing or future ways of obtaining the comparative analysis result or the evaluation parameter, if applicable to the present invention, should also fall within the scope of protection of the present invention and are incorporated herein by reference.
Preferably, the mining rule performs junk information mining on the information issue sample based on at least any one of the following:
- the information issue frequency;
- the information issue content;
- the historical behavior record of the information publisher;
- the attributes of the information publisher.
1) Specifically, the information issue frequency includes, but is not limited to: the information issue frequency of a single information publisher, the issue frequency of published information having identical content, the issue frequency of information from the same IP address, and so on. For example, the information issue sample contains 10 pieces of published information; in step S3', the assessment equipment 1 analyzes these 10 pieces and determines that 6 of them were published under the same information publisher ID within 1 minute; that publisher's information issue frequency, 10 times/minute, exceeds the predetermined frequency threshold of 5 times/minute, so these 6 pieces can be judged to be junk information.
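A frequency-based rule of this kind might be sketched as follows. The sketch makes one possible interpretation explicit: a publisher's frequency is taken as the number of items published within any 60-second window, and an item is flagged when that count exceeds the threshold. The tuple layout, function name, and window logic are assumptions, not the patent's definition.

```python
from collections import defaultdict

def mine_by_frequency(items, threshold_per_minute=5):
    """items: iterable of (item_id, publisher_id, timestamp_in_seconds)."""
    by_publisher = defaultdict(list)
    for item_id, publisher_id, timestamp in items:
        by_publisher[publisher_id].append((timestamp, item_id))

    junk = set()
    for published in by_publisher.values():
        published.sort()
        start = 0
        for end in range(len(published)):
            # Shrink the window until it spans less than 60 seconds.
            while published[end][0] - published[start][0] >= 60:
                start += 1
            # Flag every item in a window holding more items than the threshold.
            if end - start + 1 > threshold_per_minute:
                junk.update(item_id for _, item_id in published[start:end + 1])
    return junk

# 10 items: publisher "u1" publishes 6 within 50 seconds, 4 others publish once each.
sample = [(i, "u1", i * 10) for i in range(6)]
sample += [(6 + i, f"u{i + 2}", i * 10) for i in range(4)]
junk = mine_by_frequency(sample)
```

With the threshold of 5 times/minute used in the text, the six items of publisher "u1" are flagged and the remaining four are not.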
2) The information issue content includes, but is not limited to: junk vocabulary contained in the published content, multiple pieces of published information having identical published content, and so on. For example, the information issue sample contains 3 pieces of published information, whose contents are respectively:
a "Certificate handling, call 13811112222",
b "Everybody is happy",
c "I wish to make friends";
In step S3', the assessment equipment 1 matches the content of these 3 pieces of published information as strings against a junk dictionary, finds the junk term "certificate handling" in published information a, and accordingly judges published information a to be junk information.
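The dictionary matching described above amounts to a substring search. The following sketch assumes a junk dictionary held as a set of lower-case terms and case-insensitive matching; the dictionary content and the sample texts paraphrase the example and are otherwise hypothetical.

```python
# Hypothetical junk dictionary; a real one would be maintained by the platform.
JUNK_DICTIONARY = {"certificate handling"}

def mine_by_content(items, dictionary=JUNK_DICTIONARY):
    """items: dict mapping item ID to published text; returns IDs judged to be junk."""
    return {
        item_id
        for item_id, text in items.items()
        if any(term in text.lower() for term in dictionary)
    }

sample = {
    "a": "Certificate handling, call 13811112222",
    "b": "Everybody is happy",
    "c": "I wish to make friends",
}
junk = mine_by_content(sample)
```

Only item a contains a dictionary term, so it is the only one returned.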
3) The historical behavior record of the information publisher includes, but is not limited to: the content of the information historically published by the publisher, the record of the times at which that information was published, the publisher's historical online duration, and so on. For example, in step S3', the assessment equipment 1 looks up, in a historical behavior database, the information publisher ID of a piece of published information in the information issue sample, finds that the publisher's historical publication times fall between 1:00 a.m. and 6:00 a.m. and that the content of the publisher's historically published information contains junk vocabulary, and therefore judges this piece of published information to be junk information. The historical behavior database in this example stores the historical behavior records of information publishers and includes, but is not limited to, a relational database, memory storage, hard disk storage, and so on.
4) attribute of described information publisher includes but is not limited to:Whether information publisher is in blacklist, information issue The personal background information that person pre-enters.For example, in step S3 ', assessment equipment 1 issues information whole issues in sample The information publisher ID of information carries out matching inquiry in blacklist, obtains two information publishers released news in blacklist In, then judge that this two release news as junk information.
Those skilled in the art will understand that the above four items can be used not only individually but also in combination to mine junk information from the information issue sample. Those skilled in the art will also understand that the above mining rules for junk information are only examples; other existing or future mining rules for junk information, if applicable to the present invention, should also fall within the scope of protection of the present invention and are incorporated herein by reference.
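One way to picture the individual and combined use of these criteria is to treat each criterion as a predicate over a published item and to combine predicates with "any" or "all". This is a sketch under assumptions: the item layout, the helper names, and the choice of combinators are illustrative and are not specified in the text.

```python
def on_blacklist(blacklist):
    """Attribute criterion: the publisher ID appears on the blacklist."""
    return lambda item: item["publisher_id"] in blacklist

def has_junk_word(dictionary):
    """Content criterion: the text contains a term from the junk dictionary."""
    return lambda item: any(term in item["text"].lower() for term in dictionary)

def any_of(*rules):
    return lambda item: any(rule(item) for rule in rules)

def all_of(*rules):
    return lambda item: all(rule(item) for rule in rules)

def mine(items, rule):
    """Return the IDs of the items that the rule judges to be junk."""
    return {item["id"] for item in items if rule(item)}

items = [
    {"id": 1, "publisher_id": "u9", "text": "hello"},
    {"id": 2, "publisher_id": "u1", "text": "Certificate handling, call now"},
    {"id": 3, "publisher_id": "u2", "text": "I wish to make friends"},
]
blacklisted = on_blacklist({"u9"})
junk_word = has_junk_word({"certificate handling"})
loose = mine(items, any_of(blacklisted, junk_word))   # either criterion suffices
strict = mine(items, all_of(blacklisted, junk_word))  # both criteria required
```

The loose combination tends to favor the recall rate and the strict one the accuracy rate, which is the trade-off the evaluation parameters are meant to expose.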
In a preferred embodiment (refer to Fig. 4), the process further includes step S5' (not shown), in which the assessment equipment 1 optimizes the mining rule according to the evaluation parameter. The preferred embodiment is described in detail below with reference to Fig. 4: in step S1', the assessment equipment 1 obtains the mining rule to be assessed; in step S2', the assessment equipment 1 obtains the information issue sample for assessing the mining rule; in step S3', the assessment equipment 1 performs junk information mining on the information issue sample based on the mining rule and obtains the junk information corresponding to the information issue sample; in step S41', the assessment equipment 1 compares the preset actual junk information in the information issue sample with the junk information and obtains the comparative analysis result corresponding to the junk information; in step S42', the assessment equipment 1 obtains at least one evaluation parameter according to the comparative analysis result. These detailed processes are the same as those performed by the assessment equipment 1 in steps S1', S2', S3', S41' and S42' in the embodiment described above with reference to Fig. 4 and, for brevity, are incorporated herein by reference and not repeated.
Specifically, in step S5', the assessment equipment 1 optimizes the mining rule according to the evaluation parameter obtained in step S42', such as the accuracy rate corresponding to the mining rule. For example, when the accuracy rate among the evaluation parameters is lower than a preset accuracy rate threshold, the mining rule is adjusted so that information published by information publishers with a high credit rating is not subjected to junk information mining, thereby improving the accuracy rate. For example, assume that in step S42' the accuracy rate calculated by formula by the assessment equipment 1 is 50%; in step S5', the assessment equipment 1 determines that the accuracy rate of 50% is lower than the preset accuracy rate threshold of 60%, and therefore adjusts the mining rule so that information published by information publishers with a high credit rating is not subjected to junk information mining, thereby improving the accuracy rate among the evaluation parameters. Those skilled in the art will understand that the above way of optimizing the mining rule is only an example; other existing or future ways of optimizing the mining rule, if applicable to the present invention, should also fall within the scope of protection of the present invention and are incorporated herein by reference.
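The accuracy-driven adjustment of step S5' can be sketched as a small function over a rule configuration. The dictionary representation of the rule and the flag name "skip_high_credit_publishers" are hypothetical; the 60% threshold is the one used in the example.

```python
ACCURACY_THRESHOLD = 0.60  # preset accuracy rate threshold from the example

def optimize_for_accuracy(rule, accuracy_rate, threshold=ACCURACY_THRESHOLD):
    """Return an adjusted copy of the rule when the accuracy rate is too low."""
    if accuracy_rate < threshold:
        # Exempt publishers with a high credit rating from junk information mining.
        return {**rule, "skip_high_credit_publishers": True}
    return rule

rule = {"frequency_threshold": 5, "skip_high_credit_publishers": False}
updated = optimize_for_accuracy(rule, accuracy_rate=0.50)
```

Returning a copy rather than mutating the rule keeps the previous version available for comparison between optimization rounds.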
Preferably, in step S5', the assessment equipment 1 may also optimize the mining rule according to the evaluation parameter in combination with the comparative analysis result. Specifically, in step S5', the assessment equipment 1 optimizes the mining rule according to the evaluation parameter obtained in step S42', such as the recall rate corresponding to the mining rule, and according to the comparative analysis result obtained in step S41'. For example, when the recall rate is lower than a preset recall rate threshold, the ways of optimizing include, but are not limited to: lowering, on the basis of the junk information shown in the comparative analysis result, the information issue frequency threshold in the mining rule, or lowering the junk vocabulary cumulative quantity threshold, and so on, so as to improve the recall rate. For example, assume that in step S42' the recall rate among the evaluation parameters obtained by the assessment equipment 1 is 40%, which is lower than the preset recall rate threshold of 50%; then, in step S5', based on the comparative analysis result obtained in step S41', in which the average information issue frequency of the publishers of the junk information is 4 times/minute, the assessment equipment 1 lowers the information issue frequency threshold from 5 times/minute to 4 times/minute, so as to improve the recall rate. As another example, assume that in step S42' the recall rate among the evaluation parameters obtained by the assessment equipment 1 is lower than the preset recall rate threshold; then, in step S5', based on the comparative analysis result obtained in step S41', in which the junk content contains on average 2 junk terms per piece, the assessment equipment 1 lowers the junk vocabulary cumulative quantity threshold for junk content from 3 per piece to 2 per piece, so as to improve the recall rate. Those skilled in the art will understand that the above ways of optimizing the mining rule are only examples; other existing or future ways of optimizing the mining rule, if applicable to the present invention, should also fall within the scope of protection of the present invention and are incorporated herein by reference.
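The recall-driven adjustment can be sketched in the same style. The text presents the two threshold reductions as alternatives; for brevity this sketch applies both, which is an assumption, as are the dictionary keys and the function name.

```python
RECALL_THRESHOLD = 0.50  # preset recall rate threshold from the example

def optimize_for_recall(rule, recall_rate, comparison, threshold=RECALL_THRESHOLD):
    """Lower rule thresholds toward the averages observed in the comparison result."""
    if recall_rate >= threshold:
        return rule
    updated = dict(rule)
    # Never raise a threshold: take the smaller of the current and observed values.
    updated["frequency_threshold"] = min(
        rule["frequency_threshold"], comparison["avg_junk_publisher_frequency"]
    )
    updated["junk_word_count_threshold"] = min(
        rule["junk_word_count_threshold"], comparison["avg_junk_words_per_item"]
    )
    return updated

rule = {"frequency_threshold": 5, "junk_word_count_threshold": 3}
comparison = {"avg_junk_publisher_frequency": 4, "avg_junk_words_per_item": 2}
updated = optimize_for_recall(rule, recall_rate=0.40, comparison=comparison)
```

With the numbers of the two examples, the frequency threshold drops from 5 to 4 times/minute and the junk vocabulary threshold from 3 to 2 per piece.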
More preferably, in step S5', the assessment equipment 1 may also optimize the mining rule according to the evaluation parameters in combination with preset parameter priority information of the evaluation parameters. Specifically, in step S5', the assessment equipment 1 selects a suitable way of optimizing the mining rule according to the evaluation parameters, such as the recall rate and the accuracy rate, and according to the preset parameter priority information, for example that the accuracy rate has a higher priority than the recall rate, so as to improve the evaluation parameters. For example, assume that in step S42' the accuracy rate among the evaluation parameters obtained by the assessment equipment 1 is 50%, which is lower than the preset accuracy rate threshold of 60%, and the recall rate is 40%, which is lower than the preset recall rate threshold of 50%; then, in step S5', according to the parameter priority information that the accuracy rate has a higher priority than the recall rate, the assessment equipment 1 adjusts the mining rule so that information published by high-quality users is not mined, thereby improving the accuracy rate. Those skilled in the art will understand that the above way of optimizing the mining rule is only an example; other existing or future ways of optimizing the mining rule, if applicable to the present invention, should also fall within the scope of protection of the present invention and are incorporated herein by reference.
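The role of the parameter priority information can be shown with a small selection function: among the evaluation parameters that fall below their thresholds, the one with the highest priority decides which optimization is applied. The list-based priority encoding and all names here are assumptions made for illustration.

```python
def parameter_to_improve(parameters, thresholds, priority):
    """Return the highest-priority parameter that is below its threshold, or None."""
    for name in priority:  # priority is ordered from highest to lowest
        if parameters[name] < thresholds[name]:
            return name
    return None

parameters = {"accuracy": 0.50, "recall": 0.40}
thresholds = {"accuracy": 0.60, "recall": 0.50}
priority = ["accuracy", "recall"]  # accuracy rate ranks above recall rate

target = parameter_to_improve(parameters, thresholds, priority)
```

In the example both parameters are below their thresholds, so the accuracy rate is selected and the rule would be adjusted to skip high-quality publishers; reversing the priority list would select the recall rate instead.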
More preferably, the process further includes step S6' (not shown), in which the assessment equipment 1 may update the parameter priority information according to the evaluation parameters. Specifically, in step S6', according to the evaluation parameters obtained in step S42', for example when the recall rate is lower than the preset recall rate threshold and the accuracy rate is higher than the preset accuracy rate threshold, the assessment equipment 1 updates the parameter priority so that the recall rate has a higher priority than the accuracy rate. For example, in step S42', the recall rate among the evaluation parameters obtained by the assessment equipment 1 is lower than the preset recall rate threshold and the accuracy rate is higher than the preset accuracy rate threshold; in step S6', the assessment equipment 1 updates the preset parameter priority information, in which the accuracy rate has a higher priority than the recall rate, so that the recall rate has a higher priority than the accuracy rate. Those skilled in the art will understand that the above way of updating the parameter priority information is only an example; other existing or future ways of updating the parameter priority information, if applicable to the present invention, should also fall within the scope of protection of the present invention and are incorporated herein by reference.
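The priority update of step S6' reduces to a single condition. In this sketch the priority is an ordered list of parameter names, which is an assumed encoding; only the case described in the text is handled, and any other combination leaves the priority unchanged.

```python
def update_priority(parameters, thresholds, priority):
    """Promote the recall rate when it is too low while the accuracy rate is already sufficient."""
    recall_low = parameters["recall"] < thresholds["recall"]
    accuracy_ok = parameters["accuracy"] > thresholds["accuracy"]
    if recall_low and accuracy_ok:
        return ["recall", "accuracy"]
    return priority

thresholds = {"accuracy": 0.60, "recall": 0.50}
new_priority = update_priority(
    {"accuracy": 0.70, "recall": 0.40}, thresholds, ["accuracy", "recall"]
)
```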
Preferably, the process further includes step S7' (not shown), in which the assessment equipment 1 may stop optimizing the mining rule when the evaluation parameter reaches an evaluation parameter threshold. Specifically, in step S3', the assessment equipment 1 performs junk information mining on the information issue sample based on the mining rule and obtains the junk information corresponding to the information issue sample; then, in step S41', the assessment equipment 1 compares the preset actual junk information in the information issue sample with the junk information and obtains the comparative analysis result corresponding to the junk information; then, in step S42', the assessment equipment 1 obtains at least one evaluation parameter according to the comparative analysis result. The assessment equipment 1 repeatedly performs step S3' and step S4' on the basis of the mining rule as updated in step S5'; in step S7', the assessment equipment 1 checks the evaluation parameter obtained in each round and, when the evaluation parameter reaches the evaluation parameter threshold, stops optimizing the mining rule. Here, the evaluation parameter threshold means a preset expected value of the evaluation parameter. For example, when in step S7' the assessment equipment 1 detects that the accuracy rate is greater than the predetermined accuracy rate threshold and the recall rate is greater than the predetermined recall rate threshold, the assessment equipment 1 stops optimizing the mining rule. Those skilled in the art will understand that the above way of ending the optimization of the mining rule is only an example; other existing or future ways of ending the optimization of the mining rule, if applicable to the present invention, should also fall within the scope of protection of the present invention and are incorporated herein by reference.
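The mine, evaluate, optimize cycle that ends in step S7' can be sketched as a loop. Everything concrete below is a toy assumption: the sample data, the single-threshold rule, the adjustment that lowers the threshold by one per round, and the `max_rounds` safeguard, which the text does not mention. The strict "greater than" test follows the example in the text.

```python
def optimize_until_threshold(rule, mine, evaluate, adjust, thresholds, max_rounds=20):
    """Repeat mining and evaluation, adjusting the rule until both thresholds are exceeded."""
    for _ in range(max_rounds):  # max_rounds is a safeguard added in this sketch
        accuracy_rate, recall_rate = evaluate(mine(rule))
        if accuracy_rate > thresholds["accuracy"] and recall_rate > thresholds["recall"]:
            break  # step S7': stop optimizing
        rule = adjust(rule)  # step S5': update the rule and go around again
    return rule

# Toy sample: (publisher frequency in times/minute, labelled as junk?)
SAMPLE = [(9, True), (7, True), (6, True), (5, True), (3, False), (2, False)]

def mine(rule):
    return {i for i, (freq, _) in enumerate(SAMPLE) if freq >= rule["frequency_threshold"]}

def evaluate(mined):
    actual = {i for i, (_, is_junk) in enumerate(SAMPLE) if is_junk}
    hits = len(mined & actual)
    return (hits / len(mined) if mined else 0.0, hits / len(actual))

def adjust(rule):
    return {"frequency_threshold": rule["frequency_threshold"] - 1}

final_rule = optimize_until_threshold(
    {"frequency_threshold": 9}, mine, evaluate, adjust,
    thresholds={"accuracy": 0.6, "recall": 0.5},
)
```

Starting from a threshold of 9, the recall rate stays at or below 50% until the threshold reaches 6, at which point three of the four junk items are found and the loop stops.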
Preferably, the evaluation parameter includes at least any one of the following:
- the recall rate corresponding to the mining rule;
- the accuracy rate corresponding to the mining rule.
Specifically, in step S42', the evaluation parameters obtained by the assessment equipment 1 include, but are not limited to: the recall rate corresponding to the mining rule and the accuracy rate corresponding to the mining rule. The recall rate is the ratio of the quantity of real junk information obtained by the assessment equipment 1 through junk information mining in step S3' to the quantity of actual junk information in the information issue sample. The accuracy rate is the ratio of the quantity of real junk information obtained by the assessment equipment 1 through junk information mining in step S3' to the total quantity of junk information obtained through that mining. The accuracy rate and the recall rate are two evaluation parameters that may constrain each other: a high accuracy rate may lead to a low recall rate, and a high recall rate may lead to a low accuracy rate. A balance therefore needs to be found between the recall rate and the accuracy rate, so that junk information is mined in an optimal manner. Those skilled in the art will understand that the above evaluation parameters are only examples; other existing or future evaluation parameters, if applicable to the present invention, should also fall within the scope of protection of the present invention and are incorporated herein by reference.
It is obvious to those skilled in the art that the present invention is not limited to the details of the above exemplary embodiments, and that the present invention can be implemented in other specific forms without departing from its spirit or essential characteristics. Therefore, the embodiments should be regarded in all respects as exemplary and non-restrictive; the scope of the present invention is defined by the appended claims rather than by the above description, and all changes falling within the meaning and range of equivalency of the claims are intended to be embraced by the present invention. No reference sign in a claim should be construed as limiting the claim concerned. Furthermore, the word "comprising" does not exclude other units or steps, and the singular does not exclude the plural. Multiple units or devices recited in a device claim may also be implemented by a single unit or device through software or hardware. Words such as "first" and "second" are used to denote names and do not denote any particular order.

Claims (16)

1. A computer-implemented method for assessing a junk information mining rule, wherein the method comprises the following steps:
a. obtaining a mining rule to be assessed, wherein the mining rule performs junk information mining based on the information issue frequency and/or the historical behavior record of the information publisher and/or the attributes of the information publisher;
wherein the method further comprises:
i. obtaining, according to the mining rule, an information issue sample for assessing the mining rule;
wherein the method further comprises:
b. performing junk information mining on the information issue sample based on the mining rule, to obtain junk information corresponding to the information issue sample;
c. obtaining, according to the junk information and in combination with the information issue sample, at least one evaluation parameter corresponding to the mining rule, and optimizing the mining rule according to the evaluation parameter in combination with preset parameter priority information of the evaluation parameter.
2. The method according to claim 1, wherein step i further comprises:
- obtaining, according to the mining rule, the information issue sample corresponding to the mining rule from an information issue sample library.
3. The method according to claim 1 or 2, wherein step c further comprises:
- comparing the preset actual junk information in the information issue sample with the junk information, to obtain a comparative analysis result corresponding to the junk information;
- obtaining at least one said evaluation parameter according to the comparative analysis result.
4. The method according to claim 3, wherein the method further comprises step X:
X. optimizing the mining rule according to the evaluation parameter.
5. The method according to claim 4, wherein step X further comprises:
- optimizing the mining rule according to the evaluation parameter in combination with the comparative analysis result.
6. The method according to claim 1, wherein the method further comprises:
- updating the parameter priority information according to the evaluation parameter.
7. The method according to any one of claims 4 to 6, wherein the method further comprises:
- repeating steps b and c based on the optimized mining rule until the evaluation parameter reaches an evaluation parameter threshold.
8. The method according to claim 1 or 2, wherein the evaluation parameter includes at least any one of the following:
- the recall rate corresponding to the mining rule;
- the accuracy rate corresponding to the mining rule.
9. An equipment for assessing a junk information mining rule, wherein the equipment comprises:
a rule acquisition device for obtaining a mining rule to be assessed, wherein the mining rule performs junk information mining based on the information issue frequency and/or the historical behavior record of the information publisher and/or the attributes of the information publisher;
a sample acquisition device for obtaining, according to the mining rule, an information issue sample for assessing the mining rule;
a junk information acquisition device for performing junk information mining on the information issue sample based on the mining rule, to obtain junk information corresponding to the information issue sample;
a parameter acquisition device for obtaining, according to the junk information and in combination with the information issue sample, at least one evaluation parameter corresponding to the mining rule, and for optimizing the mining rule according to the evaluation parameter in combination with preset parameter priority information of the evaluation parameter.
10. The equipment according to claim 9, wherein the sample acquisition device is further configured to obtain, according to the mining rule, the information issue sample corresponding to the mining rule from an information issue sample library.
11. The equipment according to claim 9 or 10, wherein the parameter acquisition device further comprises:
a result acquisition unit for comparing the preset actual junk information in the information issue sample with the junk information, to obtain a comparative analysis result corresponding to the junk information;
a parameter acquisition unit for obtaining at least one said evaluation parameter according to the comparative analysis result.
12. The equipment according to claim 11, wherein the equipment further comprises:
a rule optimization device for optimizing the mining rule according to the evaluation parameter.
13. The equipment according to claim 12, wherein the rule optimization device is further configured to optimize the mining rule according to the evaluation parameter in combination with the comparative analysis result.
14. The equipment according to claim 9, wherein the equipment further comprises:
a priority update device for updating the parameter priority information according to the evaluation parameter.
15. The equipment according to any one of claims 12 to 14, wherein the equipment further comprises:
an optimization control device for stopping the optimization of the mining rule when the evaluation parameter reaches an evaluation parameter threshold.
16. The equipment according to claim 9 or 10, wherein the evaluation parameter includes at least any one of the following:
- the recall rate corresponding to the mining rule;
- the accuracy rate corresponding to the mining rule.
CN201110264221.6A 2011-09-07 2011-09-07 A kind of method and apparatus for being used to assess junk information mining rule Active CN102982048B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110264221.6A CN102982048B (en) 2011-09-07 2011-09-07 A kind of method and apparatus for being used to assess junk information mining rule

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201110264221.6A CN102982048B (en) 2011-09-07 2011-09-07 A kind of method and apparatus for being used to assess junk information mining rule

Publications (2)

Publication Number Publication Date
CN102982048A CN102982048A (en) 2013-03-20
CN102982048B true CN102982048B (en) 2017-08-01

Family

ID=47856084

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110264221.6A Active CN102982048B (en) 2011-09-07 2011-09-07 A kind of method and apparatus for being used to assess junk information mining rule

Country Status (1)

Country Link
CN (1) CN102982048B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104216872B (en) * 2013-05-31 2017-12-01 腾讯科技(深圳)有限公司 The method and device of rubbish chapters and sections in a kind of identification network novel
CN104009970A (en) * 2013-09-17 2014-08-27 宁波公众信息产业有限公司 Network information acquisition method
CN106376002B (en) * 2015-07-20 2021-10-12 中兴通讯股份有限公司 Management method and device and spam monitoring system
CN107705828A (en) * 2017-09-20 2018-02-16 广西金域医学检验所有限公司 Prejudge detection and processing method and processing device, terminal device, the storage medium of rule
CN108182234B (en) * 2017-12-27 2021-07-09 鼎富智能科技有限公司 Regular expression screening method and device
CN109726312B (en) * 2018-12-25 2021-10-08 广州虎牙信息科技有限公司 Regular expression detection method, device, equipment and storage medium
CN110427577B (en) * 2019-06-26 2022-04-19 五八有限公司 Content influence evaluation method and device, electronic equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101166159A (en) * 2006-10-18 2008-04-23 阿里巴巴公司 A method and system for identifying rubbish information
CN101996203A (en) * 2009-08-13 2011-03-30 阿里巴巴集团控股有限公司 Web information filtering method and system

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9471712B2 (en) * 2004-02-09 2016-10-18 Dell Software Inc. Approximate matching of strings for message filtering
CN101389085B (en) * 2008-10-14 2012-03-21 中国联合网络通信集团有限公司 Rubbish short message recognition system and method based on sending behavior

Also Published As

Publication number Publication date
CN102982048A (en) 2013-03-20

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant