CN105786792A - Information processing method and device - Google Patents

Info

Publication number
CN105786792A
Authority
CN
China
Prior art keywords
fingerprint
note
fingerprint base
refuse messages
base
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410832128.4A
Other languages
Chinese (zh)
Inventor
邓超
张峰
粟栗
冉鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd filed Critical China Mobile Communications Group Co Ltd
Priority to CN201410832128.4A priority Critical patent/CN105786792A/en
Publication of CN105786792A publication Critical patent/CN105786792A/en
Pending legal-status Critical Current

Landscapes

  • Collating Specific Patterns (AREA)

Abstract

The invention discloses an information processing method. The method comprises: generating a fingerprint of a short message to be labeled according to the text content of that message; comparing the fingerprint simultaneously with the fingerprints in a spam-message black fingerprint library and in a normal-message white fingerprint library; and labeling the message as a spam message or a normal message according to the result of the comparison with the black fingerprint library and the result of the comparison with the white fingerprint library. The invention further discloses an information processing device.

Description

Information processing method and device
Technical field
The present invention relates to data services in wireless communication, and in particular to an information processing method and device.
Background art
With the rapid growth of the telecom subscriber base and of Internet instant-messaging and social applications, information in short-text form is accumulating and spreading rapidly. Junk information of various kinds, involving illegal content, fraud, pornography, advertising, harassment or other harmful content, has become a headache for users and operators alike.
At present, techniques for identifying and filtering spam short messages are heavily influenced by general spam identification and filtering techniques. They mainly include: the black/white list method, the user-behavior rule method, the message-text keyword rule method, and the method based on content-mining modeling of message text. However, automatic machine identification with these methods suffers from a high misjudgment rate, a high miss rate and a low filtering ratio, so they serve only as auxiliary schemes and cannot fully replace manual review. In other words, these methods act only as the discovery stage for suspected spam messages, which are then submitted for manual review, increasing labor costs.
Summary of the invention
To solve the existing technical problems, embodiments of the present invention provide an information processing method and device.
An embodiment of the present invention provides an information processing method, including:
generating a fingerprint of a short message to be labeled according to the text content of that message;
comparing the fingerprint of the message to be labeled simultaneously with the fingerprints in a spam-message black fingerprint library and in a normal-message white fingerprint library;
labeling the message as a spam message or a normal message according to the result of the comparison with the fingerprints in the black fingerprint library and the result of the comparison with the fingerprints in the white fingerprint library.
In the above scheme, before the fingerprint of the message to be labeled is generated, the method further includes:
preprocessing and denoising the text content;
correspondingly, generating the fingerprint of the message to be labeled from the preprocessed and denoised text content.
In the above scheme, preprocessing and denoising the text content includes:
denoising operations that remove English characters, telephone numbers and digits from the message text.
In the above scheme, generating the fingerprint of the message to be labeled is:
generating a SimHash code from the text content of the message to be labeled, and using it as the fingerprint of that message.
In the above scheme, before the fingerprint of the message to be labeled is compared simultaneously with the fingerprints in the spam-message black fingerprint library and the normal-message white fingerprint library, the method further includes:
building the black fingerprint library and the white fingerprint library according to the manual labeling results for all suspected messages.
In the above scheme, after the black and white fingerprint libraries are built and before the fingerprint of the message to be labeled is compared simultaneously with the fingerprints in them, the method further includes:
performing conflict detection on the white fingerprint library using the set of spam messages used when the libraries were built, and correcting the fingerprints in the white fingerprint library according to the detection result;
performing conflict detection on the black fingerprint library using the set of normal messages used when the libraries were built, and correcting the fingerprints in the black fingerprint library according to the detection result.
In the above scheme, before the fingerprint of the message to be labeled is compared simultaneously with the fingerprints in the black and white fingerprint libraries, the method further includes:
determining that the length of the character string corresponding to the preprocessed and denoised text content is greater than a preset string-length threshold, and only then comparing the fingerprint of the message to be labeled simultaneously with the fingerprints in the black and white fingerprint libraries.
In the above scheme, comparing the fingerprint of the message to be labeled simultaneously with the fingerprints in the black and white fingerprint libraries includes:
using the character string corresponding to the preprocessed and denoised text content as an index, comparing the fingerprint of the message to be labeled simultaneously with the fingerprints in the black and white fingerprint libraries.
In the above scheme, comparing the fingerprint of the message to be labeled simultaneously with the fingerprints in the black and white fingerprint libraries is:
comparing the fingerprint of the message to be labeled simultaneously with the fingerprints in the black and white fingerprint libraries to obtain the corresponding fingerprint similarities;
comparing each fingerprint similarity obtained with the corresponding fingerprint-similarity Hamming distance threshold to determine the comparison result.
In the above scheme, labeling the message as a spam message or a normal message according to the result of the comparison with the fingerprints in the black fingerprint library and the result of the comparison with the fingerprints in the white fingerprint library includes:
when the comparison with the black fingerprint library succeeds and the comparison with the white fingerprint library fails, labeling the message as a spam message; or,
when the comparison with the black fingerprint library fails and the comparison with the white fingerprint library succeeds, labeling the message as a normal message; or,
when the comparison with the black fingerprint library fails and the comparison with the white fingerprint library also fails, labeling the message as a message awaiting manual labeling.
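The three cases above can be sketched as a small decision function. This is a minimal illustration rather than the claimed implementation: the names `Label` and `decide` are hypothetical, and sending a message that matches both libraries to manual labeling is an assumption, since the cases listed above do not cover it.

```python
from enum import Enum


class Label(Enum):
    SPAM = "spam"
    NORMAL = "normal"
    MANUAL = "awaiting manual labeling"


def decide(black_hit: bool, white_hit: bool) -> Label:
    """Map the two comparison results to a label.

    black_hit: comparison with the black (spam) library succeeded.
    white_hit: comparison with the white (normal) library succeeded.
    """
    if black_hit and not white_hit:
        return Label.SPAM
    if white_hit and not black_hit:
        return Label.NORMAL
    # Both comparisons failed (stated in the text), or both succeeded
    # (not covered by the text; treated as manual labeling by assumption).
    return Label.MANUAL
```

Keeping the decision separate from the comparison makes it easy to change how the uncovered "both succeed" case is handled without touching the matching code.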
An embodiment of the present invention further provides an information processing device, including: a fingerprint generating unit, a comparing unit and a labeling unit; wherein,
the fingerprint generating unit is configured to generate a fingerprint of a short message to be labeled according to the text content of that message;
the comparing unit is configured to compare the fingerprint of the message to be labeled simultaneously with the fingerprints in the spam-message black fingerprint library and the normal-message white fingerprint library;
the labeling unit is configured to label the message as a spam message or a normal message according to the results, provided by the comparing unit, of the comparison with the fingerprints in the black fingerprint library and of the comparison with the fingerprints in the white fingerprint library.
In the above scheme, the device further includes: a preprocessing unit, configured to preprocess and denoise the text content;
correspondingly, the fingerprint generating unit is configured to generate the fingerprint of the message to be labeled from the preprocessed and denoised text content.
In the above scheme, the device further includes: a fingerprint library building unit, configured to build the black fingerprint library and the white fingerprint library according to the manual labeling results for all suspected messages.
In the above scheme, the fingerprint library building unit is further configured to, after the black and white fingerprint libraries are built, perform conflict detection on the white fingerprint library using the set of spam messages used when the libraries were built, and correct the fingerprints in the white fingerprint library according to the detection result; and to perform conflict detection on the black fingerprint library using the set of normal messages used when the libraries were built, and correct the fingerprints in the black fingerprint library according to the detection result.
In the above scheme, the comparing unit is further configured to compare the fingerprint of the message to be labeled simultaneously with the fingerprints in the black and white fingerprint libraries when it determines that the length of the character string corresponding to the preprocessed and denoised text content is greater than the preset string-length threshold.
With the information processing method and device provided by the embodiments of the present invention, a fingerprint of the message to be labeled is generated according to its text content; the fingerprint is compared simultaneously with the fingerprints in the spam-message black fingerprint library and the normal-message white fingerprint library; and the message is labeled as a spam message or a normal message according to the two comparison results. In this way, the accuracy of message labeling can be effectively improved, and the misjudgment rate and miss rate effectively reduced.
Brief description of the drawings
In the accompanying drawings (which are not necessarily drawn to scale), like reference numerals may describe similar components in different views. Like reference numerals with different letter suffixes may represent different instances of similar components. The drawings illustrate, by way of example and not limitation, the embodiments discussed herein.
Fig. 1 is a schematic flowchart of the information processing method of embodiment one of the present invention;
Fig. 2 is a schematic diagram of the overall framework and main workflow of the black-and-white dual fingerprint library collaborative filtering of embodiment two of the present invention;
Fig. 3 is a schematic flowchart of building the spam-message black fingerprint library and the normal-message white fingerprint library in embodiment two of the present invention;
Fig. 4 is a schematic flowchart of the method for performing conflict detection on the fingerprint libraries in embodiment two of the present invention;
Fig. 5 is a schematic flowchart of the method for labeling and filtering new messages in embodiment two of the present invention;
Fig. 6 is a schematic structural diagram of the information processing device of embodiment three of the present invention.
Detailed description
The present invention is described in further detail below with reference to the drawings and embodiments.
Before the embodiments of the present invention are described, the existing schemes for identifying and filtering spam messages are first reviewed in detail.
At present, techniques for identifying and filtering spam short messages are heavily influenced by general spam identification and filtering techniques. They mainly include: the black/white list method, the user-behavior rule method, the message-text keyword rule method, and the method based on content-mining modeling of message text. Each method is described in detail below.
1. Black/white list method
This scheme is implemented as follows: a blacklist and a whitelist are built from the sender numbers of short messages. If the sender of a message is on the blacklist, the message is treated as spam and intercepted regardless of whether its content is actually spam; if the sender is on the whitelist, the message is treated as normal and let through regardless of whether its content is actually normal; for a sender on neither list, other strategies or rules decide whether the message is spam. The black/white list method is therefore a mandatory interception method, and an auxiliary one whose interception targets and scale are extremely limited.
2. User-behavior rule method
This scheme is implemented as follows: thresholds or constraint rules are set for users' sending behavior to judge whether a user is a rule-violating spam sender. It is likewise an indiscriminate interception method that does not consider message content directly. One example is the traffic rule: the number of messages sent consecutively by the same user within a short time window is taken as the traffic; if the traffic exceeds the sending frequency of ordinary users, the sending number is judged to be a spam sender. Many spam senders evade the traffic rule by sending the same spam content from multiple numbers.
Another example is the reply-rate rule: if, among the many messages sent from one number, few receive a reply within a specified time window, the messages sent by that number are judged to be spam.
3. Message-text keyword rule method
This scheme is implemented as follows: based on the characteristics of spam content, a number of keywords most likely to appear in spam are preset, such as "buy now" or "passion"; if the content of a message matches certain important keywords, the message is deemed spam. In practice, however, spam senders deliberately insert interfering characters or use variants, which defeats the keyword rules; the keyword table also has to be updated as spam content changes over time; and, conversely, some normal messages match certain keywords and are intercepted. Text-based keyword strategies can therefore serve only as a supplementary means that produces suspected spam messages for manual review.
4. Method based on content-mining modeling of message text
This method applies complex machine-learning algorithms to mine and model spam text content. Typically, a classification model is trained on labeled spam and normal messages and then used to classify new messages. Because normal messages are spread uniformly across the problem space, the model's ability to distinguish spam from normal messages is insufficient; and because normal messages vastly outnumber spam messages, the spam may be "drowned out" during modeling. Methods based on content-mining modeling therefore have high misjudgment and miss rates for spam, are difficult to use in a real system for fully automatic labeling that effectively replaces manual review, and serve only as auxiliary systems.
Because a spam message is short text, limited to 140 bytes or 70 Chinese characters, identifying messages with similar content and their variants among massive volumes of short messages is extremely difficult. First, extracting effective features for distinguishing messages from no more than 70 Chinese characters is very hard. Second, the daily message volume is huge, reaching several billion, and the proportions of normal and spam messages are extremely unbalanced. In addition, message wording is often colloquial and differs considerably from written language. Moreover, to evade interception, spam senders produce all kinds of variants and distortions, and even use abbreviations and popular Internet vocabulary.
Therefore, the conventional spam identification and filtering techniques described above cannot serve as the primary system for automatic filtering in place of manual review. Their main problems include:
1. Automatic machine identification of spam has a high misjudgment rate, a high miss rate and a low filtering ratio, leading to a high complaint rate;
2. Conflicting results inevitably exist among user complaint reports and manual labeling results, and the prior art cannot adapt to them or correct them automatically;
3. The prior art serves only as an auxiliary system that produces suspected messages and then submits them for manual review, so the manual review workload is large and the cost is high.
In addition, the main difficulties that automatic spam filtering technology needs to solve include:
First, short text cannot be modeled effectively;
Specifically, the 70-character limit on short messages makes feature-based modeling ineffective, resulting in high miss and misjudgment rates.
Second, the recognition model that is built cannot effectively identify normal messages;
Specifically, the proportions of normal and spam messages differ enormously, and the massive volume of normal messages is spread uniformly across the problem space, so the recognition model that is built cannot effectively identify them; the misjudgment rate is therefore high and user complaints increase. Even after filtering by the prior art, the ratio of normal messages to spam messages among the suspected messages submitted for manual review still reaches 10:1.
Third, repeated or highly similar messages are recognized poorly;
Here, because of the large number of interfering characters and distorted messages among both spam and normal messages, many repeated or highly similar messages cannot be hit by rules such as keywords.
Fourth, ambiguity (conflict) in labeling results cannot be avoided;
Here, in manual labeling and user complaint reports, there is no guarantee that different people, or the same person at different times, give fully consistent labels for the same or highly similar message content; moreover, much message content is itself ambiguous and needs to be labeled by manual voting. Most prior art neither considers nor solves this labeling-quality problem, which directly affects modeling and subsequent recognition results. If conflicts in the labeling results could be found automatically and corrected automatically according to the labels given by multiple people, the quality of manual labeling would improve greatly and the manual review workload would fall.
Fifth, poor timeliness;
Here, because spam content and its generation cycle follow no regular pattern, its irregular time-sensitivity is pronounced. Content-mining models cannot update their discriminative features automatically; typically, spam content is collected again and the model retrained only after the complaint rate has risen. Keyword-based rules likewise cannot update old rules automatically.
Sixth, the low miss rate and high filtering ratio required under different misjudgment-rate targets cannot be accommodated;
Misjudging a normal message as spam leads to user complaints. Therefore, usually on the premise of a low misjudgment rate (for example, below 1%), users specify the miss rate and filtering ratio they need, and the system must be able to adapt to such combinations of targets through parameter adjustment so that it can be applied in real systems.
The techniques currently adopted (the black/white list method, the user-behavior rule method and the keyword rule method) all focus on spam messages, formulating strategies to match spam, and do not consider identifying normal messages in order to rule out spam from the opposite direction. The method based on content-mining modeling does consider mining typical features of both normal and spam messages to build a recognition model, but because the proportions of the two are extremely unbalanced and the problem space covered by normal messages is essentially uniformly spread, it is difficult to find an accurate model with strong discriminative power.
Therefore, the embodiments of the present invention consider the full set of normal and spam samples simultaneously. Rather than building a feature-discriminating mining model over the uniformly spread space of normal and spam short texts, they retain the existing full set of normal and spam samples as far as possible and characterize each message by a fingerprint, so that exactly repeated or very similar samples are stored in compressed form. While spam fingerprints are used for comparison and matching, fingerprints of normal messages are used to label new messages in the opposite direction, that is, as normal messages. This avoids the low recognition rate, and the resulting high misjudgment and miss rates, of models built around matching spam alone.
Based on this, in the embodiments of the present invention: a fingerprint of the message to be labeled is generated according to its text content; the fingerprint is compared simultaneously with the fingerprints in the spam-message black fingerprint library and the normal-message white fingerprint library; and the message is labeled as a spam message or a normal message according to the result of the comparison with the black fingerprint library and the result of the comparison with the white fingerprint library.
Embodiment one
The information processing method provided by this embodiment, as shown in Fig. 1, includes the following steps:
Step 101: generate a fingerprint of the short message to be labeled according to the text content of that message;
Here, before the fingerprint of the message to be labeled is generated, the method may further include:
preprocessing and denoising the text content;
correspondingly, generating the fingerprint of the message to be labeled from the preprocessed and denoised text content.
Preprocessing and denoising the text content may specifically include:
conventional preprocessing and denoising operations on the message text, such as word segmentation, stop-word removal, removal of special strings and special symbols, conversion between traditional and simplified Chinese characters, and conversion between upper-case and lower-case forms of numerals.
When the language of the message text is Chinese, in addition to the conventional preprocessing and denoising operations above, preprocessing and denoising the text content may further include:
denoising operations that remove English characters, telephone numbers and digits from the message text.
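The denoising operations named above can be sketched as follows. This is a minimal illustration under stated assumptions: the source names the operations but not their exact form, so the function name `denoise` and all four regular expressions are hypothetical, and word segmentation, stop-word removal and traditional/simplified conversion are omitted.

```python
import re

# Hypothetical patterns; the source does not define what counts as a
# telephone number or a special symbol.
PHONE_RE = re.compile(r"\+?\d[\d\-\s]{6,}\d")  # phone-number-like digit runs
DIGIT_RE = re.compile(r"\d+")                  # any remaining digits
LATIN_RE = re.compile(r"[A-Za-z]+")            # English letters
SYMBOL_RE = re.compile(r"[\W_]+")              # whitespace, punctuation, symbols


def denoise(text: str) -> str:
    """Remove telephone numbers, digits, English letters and symbols."""
    text = PHONE_RE.sub("", text)
    text = DIGIT_RE.sub("", text)
    text = LATIN_RE.sub("", text)
    return SYMBOL_RE.sub("", text)
```

Under these assumed patterns, a message such as "恭喜中奖 call 13800138000 领取100元!!" reduces to "恭喜中奖领取元", so variants that differ only in inserted digits, letters or punctuation yield the same cleaned text.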
Generating the fingerprint of the message to be labeled is specifically:
generating a SimHash code from the text content of the message to be labeled, and using it as the fingerprint of that message.
Here, when the text content has been preprocessed and denoised, the text content used to generate the SimHash code is the preprocessed and denoised text content.
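A SimHash code of the kind referred to above can be sketched as follows. This is an illustrative version, not the implementation of this embodiment: the 64-bit width, the use of character bigrams as features (in place of the word segmentation mentioned earlier), the equal feature weights and the MD5-based feature hash are all assumptions.

```python
import hashlib


def simhash(text: str, bits: int = 64) -> int:
    """Return a SimHash fingerprint of `text` as an integer.

    `bits` must be a multiple of 8 and at most 128, because each feature
    hash is taken from the first bits // 8 bytes of an MD5 digest.
    """
    # Assumed features: overlapping character bigrams, each with weight 1.
    features = [text[i:i + 2] for i in range(len(text) - 1)] or [text]
    weights = [0] * bits
    for feature in features:
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            # Each feature votes +1 or -1 on every bit position.
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint
```

Because every feature only contributes votes to each bit position, texts that share most of their features tend to produce fingerprints that differ in few bits, which is what makes a Hamming distance comparison between fingerprints meaningful.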
Step 102: compare the fingerprint of the message to be labeled simultaneously with the fingerprints in the spam-message black fingerprint library and the normal-message white fingerprint library;
Here, before this step is performed, the method may further include:
building the black fingerprint library and the white fingerprint library according to the manual labeling results for all suspected messages.
In other words, manual reviewers label every reported suspected message as a spam message or a normal message; the suspected messages labeled as normal are then used as the input source for creating the normal-message white fingerprint library, and those labeled as spam as the input source for creating the spam-message black fingerprint library, thereby building the two libraries.
During creation, the text content of each message used as an input source is preprocessed and denoised, and a SimHash code is generated from the processed text as the fingerprint of that input source. The fingerprint is compared with the fingerprints already in the library, and the fingerprint similarity obtained is compared with the corresponding preset fingerprint-similarity Hamming distance threshold; if it is greater than the threshold, the comparison has failed, and the fingerprint of the input source is added to the corresponding fingerprint library.
When the fingerprint library does not yet contain any fingerprint, the comparison is deemed to have failed, and the fingerprint of the input source is added directly to the corresponding fingerprint library.
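The library-building rule just described (add a fingerprint only when it matches nothing already stored) can be sketched as follows. This is a minimal illustration assuming fingerprints are integers and the library is a plain list scanned linearly; the name `add_if_new` is hypothetical, and the indexing described in the text is not modeled.

```python
from typing import List


def add_if_new(library: List[int], fingerprint: int, max_distance: int) -> bool:
    """Add `fingerprint` to `library` unless a near-duplicate is present.

    Returns True if the fingerprint was added, i.e. the comparison failed
    against every stored fingerprint (always the case for an empty library).
    """
    for existing in library:
        # Hamming distance: number of bit positions in which the two differ.
        if bin(existing ^ fingerprint).count("1") <= max_distance:
            return False  # comparison succeeded, nothing new to store
    library.append(fingerprint)
    return True
```

Storing only fingerprints that fail the comparison is what compresses repeated or very similar samples into a single library entry.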
In practical application, the fingerprints in a fingerprint library are indexed, with the corresponding character string as the index, to allow fast subsequent comparison.
After the black and white fingerprint libraries are built, and before the fingerprint of the message to be labeled is compared simultaneously with the fingerprints in them, the method may further include:
performing conflict detection on the white fingerprint library using the set of spam messages used when the libraries were built, and correcting the fingerprints in the white fingerprint library according to the detection result;
performing conflict detection on the black fingerprint library using the set of normal messages used when the libraries were built, and correcting the fingerprints in the black fingerprint library according to the detection result.
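The detection half of this step can be sketched as follows. This is only an illustration: the function name `find_conflicts` and the reuse of a Hamming distance threshold for conflict detection are assumptions, and because the passage above does not say how the conflicting fingerprints are corrected, the sketch reports them and leaves the correction to the caller.

```python
from typing import Iterable, List


def find_conflicts(sample_fps: Iterable[int], library: List[int],
                   max_distance: int) -> List[int]:
    """Return the library fingerprints hit by any opposite-class sample.

    sample_fps: fingerprints of the opposite-class messages, for example
                the spam set when checking the white library.
    """
    samples = list(sample_fps)
    return [
        fp for fp in library
        if any(bin(fp ^ s).count("1") <= max_distance for s in samples)
    ]
```

For example, passing the spam-set fingerprints together with the white library yields the white-library entries that would wrongly match spam, which are the candidates for correction.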
Before being compared with the fingerprint in the black fingerprint base of refuse messages and the white fingerprint base of normal note by the fingerprint of described note to be calibrated, the method can also include simultaneously:
When determining string length that the content of text after pretreatment and Denoising disposal is corresponding more than the string length thresholding arranged, the fingerprint of described note to be calibrated is compared with the fingerprint in the black fingerprint base of refuse messages and the white fingerprint base of normal note simultaneously.
Wherein it is determined that when string length corresponding to content of text after pretreatment and Denoising disposal is less than or equal to the string length thresholding arranged, illustrates that described note to be calibrated is normal note, be directly demarcated as normal note.
Comparing the fingerprint of the short message to be calibrated simultaneously with the fingerprints in the junk short message black fingerprint library and the normal short message white fingerprint library specifically includes:
using the character string corresponding to the preprocessed and denoised text content as an index, comparing the fingerprint of the short message to be calibrated simultaneously with the fingerprints in the black fingerprint library and the white fingerprint library.
Here, the fingerprint of the short message to be calibrated is compared simultaneously with the fingerprints in the black fingerprint library and the white fingerprint library to obtain the corresponding fingerprint similarities;
each fingerprint similarity obtained is compared with the corresponding fingerprint-similarity Hamming-distance threshold to determine the comparison result.
When the fingerprint similarity obtained is less than or equal to the corresponding Hamming-distance threshold, the comparison succeeds; when it is greater than the corresponding Hamming-distance threshold, the comparison fails.
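As a minimal sketch of the comparison rule above, the following Python fragment treats each fingerprint as an integer and takes the fingerprint similarity to be the Hamming distance between two fingerprints. The names `hamming_distance` and `comparison_succeeds` are illustrative assumptions and do not appear in the original description; the linear scan stands in for the indexed lookup mentioned in the text.

```python
from typing import Iterable


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def comparison_succeeds(fingerprint: int, library: Iterable[int], k: int) -> bool:
    """Return True if any stored fingerprint lies within distance k.

    A distance less than or equal to k is a successful comparison;
    an empty library always yields a failed comparison.
    """
    return any(hamming_distance(fingerprint, stored) <= k for stored in library)
```

With a threshold of 2, a fingerprint that differs from a stored one in two bits matches; with a threshold of 1 it does not.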
Step 103: According to the result of comparison with the fingerprints in the junk short message black fingerprint library and the result of comparison with the fingerprints in the normal short message white fingerprint library, calibrate the short message to be calibrated as a junk short message or a normal short message.
Specifically, when the comparison with the fingerprints in the black fingerprint library succeeds and the comparison with the fingerprints in the white fingerprint library fails, the short message to be calibrated is calibrated as a junk short message; or,
when the comparison with the fingerprints in the black fingerprint library fails and the comparison with the fingerprints in the white fingerprint library succeeds, the short message to be calibrated is calibrated as a normal short message; or,
when the comparison with the fingerprints in the black fingerprint library fails and the comparison with the fingerprints in the white fingerprint library also fails, the short message to be calibrated is calibrated as a short message awaiting manual calibration.
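The three cases of step 103 can be sketched as a small decision function. The names `Label` and `calibrate` are hypothetical. The description does not say what happens when both comparisons succeed; routing that case to manual calibration is an assumption made here, not something stated in the source.

```python
from enum import Enum


class Label(Enum):
    JUNK = "junk"
    NORMAL = "normal"
    MANUAL = "manual"  # awaiting manual calibration


def calibrate(black_hit: bool, white_hit: bool) -> Label:
    """Map the two comparison results to a calibration label."""
    if black_hit and not white_hit:
        return Label.JUNK
    if white_hit and not black_hit:
        return Label.NORMAL
    # Both comparisons failed (stated in the text), or both succeeded
    # (unspecified in the text; sent to manual calibration by assumption).
    return Label.MANUAL
```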
In the information processing method provided by this embodiment, the fingerprint of the short message to be calibrated is generated according to its text content; the fingerprint is compared simultaneously with the fingerprints in the junk short message black fingerprint library and the normal short message white fingerprint library; and, according to the two comparison results, the short message to be calibrated is calibrated as a junk short message or a normal short message. In this way, the accuracy of short message calibration is effectively improved, and the misjudgment rate and the missed-detection rate are effectively reduced. Manual review can thus be replaced completely, which greatly reduces labor cost.
Performing denoising operations that remove English characters, telephone numbers, and digits from the short message text content further improves the accuracy of short message calibration.
The junk short message black fingerprint library and the normal short message white fingerprint library are established according to the manual calibration results for suspected short messages. Conflict detection is performed on the white fingerprint library using the junk short message set that was used to establish the libraries, and the fingerprints in the white fingerprint library are corrected according to the detection result. Conflict detection is likewise performed on the black fingerprint library using the normal short message set, and the fingerprints in the black fingerprint library are corrected according to the detection result. This ensures the correctness of the fingerprints in the libraries and thereby further improves the accuracy of short message calibration.
Using the character string corresponding to the preprocessed and denoised text content as an index when comparing the fingerprint of the short message to be calibrated with the fingerprints in the black and white fingerprint libraries effectively increases the speed of comparison.
Embodiment two
On the basis of Embodiment one, this embodiment describes in detail the calibration process for normal short messages and junk short messages.
Fig. 2 is a schematic diagram of the overall framework and main workflow of this embodiment, which is based on collaborative filtering with dual black and white fingerprint libraries. As shown in Fig. 2, the overall framework is divided into the following two parts:
Part one: establish the dual fingerprint libraries of junk short messages and normal short messages according to the users' calibration results, and automatically verify and identify conflicts in the fingerprint libraries. This part is implemented by steps 201 to 204 of the main flow;
Part two: for a new short message, perform collaborative comparison against the black and white fingerprint libraries and calibrate it automatically. This part is implemented by steps 205 to 208 of the main flow.
As shown in Fig. 2, the main workflow provided by this embodiment mainly includes the following steps:
Step 201: Manual calibration;
Here, manual reviewers calibrate junk short messages and normal short messages.
Step 202: Establish the dual black and white fingerprint libraries;
Specifically, every short message calibrated by the manual reviewers is traversed: each short message labeled as junk is inserted into the junk short message black fingerprint library, and each short message labeled as normal is inserted into the normal short message white fingerprint library.
Step 203: Fingerprint library conflict detection;
Specifically, the manually calibrated junk short messages are compared against the established normal short message white fingerprint library; if any comparison succeeds, the white fingerprint library contains a conflict. The manually calibrated normal short messages are compared against the established junk short message black fingerprint library; if any comparison succeeds, the black fingerprint library contains a conflict.
Step 204: Correct errors using the conflicts found among the manual calibrations in the black and white fingerprint libraries, to ensure the correctness of the fingerprints in the libraries;
Step 205: Compare the new short message simultaneously with the fingerprints in the junk short message black fingerprint library and the normal short message white fingerprint library;
Step 206: When the comparison with a fingerprint in the white fingerprint library succeeds, label the new short message as a normal short message; when the comparison with a fingerprint in the black fingerprint library succeeds, label the new short message as a junk short message;
Step 207: If the new short message fails to match any fingerprint in either the black fingerprint library or the white fingerprint library, calibrate it as a suspected short message pending review;
Step 208: Send the new short message calibrated as a suspected short message pending review to a manual reviewer for calibration.
In this example, the SimHash technique, which is suited to deduplicating massive volumes of text content, is chosen as the basis for building the fingerprint libraries and for comparing similar fingerprints. That is, the text content of every junk short message and normal short message entered into a fingerprint library is mapped to a 64-bit SimHash code. The closeness between the SimHash code of a given short message text and the SimHash codes of the short messages already in the library then determines whether the short message is a duplicate or near-duplicate of a short message corresponding to an existing fingerprint, that is, whether it is hit by a fingerprint in the library. The SimHash code may be 64 bits long. The fingerprint-similarity Hamming-distance threshold K, used to judge the similarity between the SimHash codes of two short messages, can be selected and set automatically from K = 0, 1, 2, 3 by triggering a parameter policy rule according to the actual requirements on misjudgment rate and filtering ratio.
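For illustration, a 64-bit SimHash over a list of tokens can be sketched as follows. The description does not specify the per-token hash function or any token weighting; truncating an MD5 digest to 64 bits and giving every token a weight of 1 are assumptions made only for this sketch, and the function name `simhash64` is hypothetical. The tokens are assumed to come from the word segmentation and denoising steps.

```python
import hashlib
from typing import Iterable


def simhash64(tokens: Iterable[str], bits: int = 64) -> int:
    """Compute a SimHash fingerprint from tokens with equal weights."""
    counts = [0] * bits
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")  # 64-bit token hash
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            counts[i] += 1 if (token_hash >> i) & 1 else -1
    fingerprint = 0
    for i, count in enumerate(counts):
        if count > 0:  # positive vote total sets the bit
            fingerprint |= 1 << i
    return fingerprint
```

Because the bit votes are summed, the fingerprint does not depend on token order, and texts that share most of their tokens tend to produce fingerprints that differ in few bits.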
Part one is introduced first below: establishing the dual fingerprint libraries of junk short messages and normal short messages according to the users' calibration results, and automatically verifying and identifying conflicts in the fingerprint libraries.
The flow by which this embodiment builds the junk short message black fingerprint library and the normal short message white fingerprint library, as shown in Fig. 3, includes the following steps:
Step 301: Suspected junk short messages produced by keyword rules or other behavior and traffic rules, together with suspected junk short messages reported through conventional channels such as users actively reporting junk short messages, are taken as the input source for building the black and white fingerprint libraries. These suspected junk short messages are reviewed manually, and their content is calibrated;
Here, the calibration results are divided into a junk short message set and a normal short message set. According to practical experience, nearly 3,000,000 suspected junk short messages are submitted for manual review every day, of which only less than 9% to 10% are calibrated as junk information, while more than 90% are calibrated as normal information.
For each manually labeled junk short message and normal short message, the following steps are performed in turn, so that the junk short message black fingerprint library and the normal short message white fingerprint library are created at the same time.
Step 302: Preprocess and denoise the text content of the short message currently being processed;
Here, preprocessing and denoising of the short message text content include conventional operations such as word segmentation, stop-word removal, removal of special strings and special symbols, conversion between traditional and simplified Chinese characters, and conversion between uppercase and lowercase numeral forms.
In view of the particular characteristics of Chinese short messages, the preprocessing and denoising of short message text content in this embodiment further include denoising operations that remove English characters, telephone numbers, and digits.
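The Chinese-specific denoising operations can be sketched with regular expressions. This fragment covers only the removal of English characters, digits (which also removes telephone numbers), and symbols; word segmentation, stop-word removal, and traditional/simplified conversion need language resources and are left out. The function name `denoise` and the exact patterns are assumptions for illustration.

```python
import re


def denoise(text: str) -> str:
    """Strip ASCII letters, digits, punctuation, and whitespace from text."""
    text = re.sub(r"[A-Za-z]+", "", text)  # English characters
    text = re.sub(r"\d+", "", text)        # digits, including phone numbers
    text = re.sub(r"[\W_]+", "", text)     # symbols, punctuation, whitespace
    return text
```

For example, `denoise("恭喜中奖! Call 13800138000 now")` leaves only the Chinese characters `恭喜中奖`.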
Step 303: Map the body content of the short message currently being processed to a 64-bit SimHash code, which serves as the fingerprint of the short message;
Step 304: Compare the fingerprint of the short message currently being processed with all existing fingerprints in the junk short message black fingerprint library or the normal short message white fingerprint library;
Specifically, the 64-bit SimHash code of the short message currently being processed is compared with all existing fingerprints in the junk short message black fingerprint library or the normal short message white fingerprint library.
Here, the fingerprint-similarity Hamming-distance threshold K is used during fingerprint comparison: the fingerprint similarity obtained by comparison is compared with K. When the similarity is less than or equal to K, the fingerprint is hit, that is, the comparison succeeds; when the similarity is greater than K, the fingerprint is missed, that is, the comparison fails;
A configuration table mapping K values to misjudgment rate, missed-detection rate, and filtering ratio may be preset; the user's index requirements are then read, and the corresponding K value is selected.
In practical application, to speed up comparison, index-based comparison may be adopted, that is, fingerprints are compared using the character string as an index.
At the start of creation, when the fingerprint library does not contain any fingerprint, the comparison is deemed to have failed, and the fingerprint of the input source is added directly to the corresponding fingerprint library.
Step 305: According to the fingerprint comparison result, determine whether the fingerprint of the short message is added to the corresponding fingerprint library as a new fingerprint.
Here, if the comparison succeeds, the fingerprint library already contains a short message whose content duplicates or is highly similar to this one; a corresponding fingerprint is already in the library, so the fingerprint need not be added again;
If the comparison fails, the fingerprint library does not yet contain content similar to this short message, and the fingerprint of the short message is added to the library as a new fingerprint.
It should be noted that, when a new fingerprint is added to the fingerprint library, it must be indexed with the character string as the index key, to support fast comparison against the library.
The next short message with a calibration result is then taken in traversal order, and steps 302 to 305 are performed on it for fingerprint comparison and insertion into the library.
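The traversal of steps 304 and 305 can be sketched as a loop that adds a fingerprint only when it fails to match anything already stored. The function names are hypothetical, and a plain list with a linear scan is used here in place of the indexed library described in the text.

```python
from typing import Iterable, List


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def build_library(fingerprints: Iterable[int], k: int) -> List[int]:
    """Build a fingerprint library, skipping near-duplicates within distance k."""
    library: List[int] = []
    for fingerprint in fingerprints:
        hit = any(hamming_distance(fingerprint, stored) <= k for stored in library)
        # An empty library never produces a hit, so the first fingerprint
        # is always added, as the description requires.
        if not hit:
            library.append(fingerprint)
    return library
```

With K = 1, the fingerprints `0b0000`, `0b0001`, `0b1111` yield a library of two entries, because the second fingerprint is within distance 1 of the first.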
After the fingerprint libraries are established, it must be considered that the understanding of actual junk short message content is unavoidably ambiguous. During manual review and calibration, different reviewers, or the same reviewer at different times, may give completely opposite calibration results for the same or similar short messages. Therefore, to ensure the accuracy of the subsequent comparison-based automatic calibration, and to provide effective calibration quality detection and feedback to the manual review stage, this embodiment adds a flow for conflict detection and error correction of the calibration results in the dual fingerprint libraries.
The basic idea is to cross-compare the black and white fingerprint libraries against the junk short message set and the normal short message set that were used to create them. That is, the black fingerprint library is checked with the manually calibrated normal short message set, and the white fingerprint library is checked with the manually calibrated junk short message set. The specific flow, as shown in Fig. 4, includes the following steps:
Step 401: Preprocess and denoise the text content of the verification short message;
Here, the specific processing is the same as the processing used when creating the fingerprint libraries and is not repeated.
Step 402: Generate a 64-bit SimHash code from the body content of the verification short message to form the fingerprint of the short message;
Step 403: Cross-validate the fingerprint of the verification short message against the fingerprints in the black and white fingerprint libraries;
Here, when the verification short message is a normal short message, its fingerprint is compared with all fingerprints in the black fingerprint library, and the fingerprint similarity obtained is compared with the user-configured K value to determine whether the comparison succeeds. If it succeeds, this normal short message is hit by the junk fingerprint library: the library contains a fingerprint of some junk short message that duplicates or is very close to the fingerprint of the verification short message. This shows that, for the same or very similar short messages, manual calibration gave one a normal label and the other a junk label, producing a conflict.
When the verification short message is a junk short message, its fingerprint is compared with all fingerprints in the white fingerprint library, and the fingerprint similarity obtained is compared with the user-configured K value to determine whether the comparison succeeds. If it succeeds, this verification short message is hit by the normal short message fingerprint library: the library contains a fingerprint of some normal short message that duplicates or is very close to the fingerprint of the verification short message. This shows that, for the same or very similar short messages, manual calibration gave one a junk label and the other a normal label, producing a conflict.
Step 404: For each verification short message that is hit, output a conflict information record for use in conflict error correction;
Here, if a verification short message is a normal short message and its fingerprint hits a fingerprint in the black fingerprint library, a conflict information triple <normal short message text: hit fingerprint in the black fingerprint library: original junk short message text corresponding to the hit black fingerprint> is output.
If a verification short message is a junk short message and its fingerprint hits a fingerprint in the white fingerprint library, a conflict information triple <junk short message text: hit fingerprint in the white fingerprint library: original normal short message text corresponding to the hit white fingerprint> is output.
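A minimal sketch of the cross-check in steps 403 and 404 follows. It assumes, for illustration only, that the opposite library is held as a mapping from fingerprint to the original short message text, and that every matching stored fingerprint is reported rather than only the first. The names `Conflict` and `detect_conflicts` are hypothetical.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Conflict(NamedTuple):
    verification_text: str  # text of the verification short message
    hit_fingerprint: int    # fingerprint hit in the opposite library
    original_text: str      # original text behind the hit fingerprint


def detect_conflicts(
    verification: Iterable[Tuple[str, int]],
    opposite_library: Dict[int, str],
    k: int,
) -> List[Conflict]:
    """Return one conflict triple per (message, hit fingerprint) pair."""
    conflicts: List[Conflict] = []
    for text, fingerprint in verification:
        for stored, original_text in opposite_library.items():
            if bin(fingerprint ^ stored).count("1") <= k:  # Hamming distance
                conflicts.append(Conflict(text, stored, original_text))
    return conflicts
```

The same function serves both directions: normal short messages checked against the black library, and junk short messages checked against the white library.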
Step 405: For each conflict information triple output by detection, the automatic error correction module corrects the error automatically by voting under the majority rule, according to the ratio of normal labels to junk labels among the calibration results given to the short message text by multiple people and over multiple calibrations.
Here, the conflict content whose calibration result has been automatically corrected according to the voting principle is fed back to manual review for error correction.
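The majority vote of step 405 can be sketched as below. The label strings and the function name `majority_label` are illustrative. The description does not say how a tie is resolved; returning `None` so that the case can go to manual review is an assumption of this sketch.

```python
from collections import Counter
from typing import Iterable, Optional


def majority_label(labels: Iterable[str]) -> Optional[str]:
    """Pick "junk" or "normal" by majority; return None on a tie."""
    counts = Counter(labels)
    junk, normal = counts["junk"], counts["normal"]
    if junk > normal:
        return "junk"
    if normal > junk:
        return "normal"
    return None  # tie: left for manual review (assumption)
```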
After error correction of the fingerprint libraries is complete, the black and white fingerprint libraries can be used for collaborative automatic calibration and filtering. The specific processing, as shown in Fig. 5, includes the following steps:
Step 501: Preprocess and denoise the text content of the new short message;
Here, preprocessing and denoising of the short message text content include conventional operations such as word segmentation, stop-word removal, removal of special strings and special symbols, conversion between traditional and simplified Chinese characters, and conversion between uppercase and lowercase numeral forms.
In view of the particular characteristics of Chinese short messages, the preprocessing and denoising of short message text content in this embodiment further include denoising operations that remove English characters, telephone numbers, and digits.
Step 502: For the preprocessed and denoised text content, decide according to the valid string length policy whether fingerprint comparison is necessary; if so, perform step 504; otherwise, perform step 503;
Here, a string length threshold can be set according to an analysis of the content lengths of junk short messages and normal short messages; if the string length is less than or equal to the configured threshold, the new short message is directly calibrated as a normal short message. For example, the string length threshold can be set to 5, so that short messages whose string length is less than or equal to 5 are directly calibrated as normal short messages.
Step 503: Calibrate the new short message as a normal short message;
Step 504: Generate a 64-bit SimHash code from the preprocessed and denoised text content to serve as the fingerprint of the new short message;
Step 505: Compare the fingerprint of the new short message simultaneously with the fingerprints in the junk short message black fingerprint library and in the normal short message white fingerprint library;
Here, the K values used for comparison against the two fingerprint libraries may be read as different values according to the index requirements.
Step 506: Complete the automatic calibration of the new short message according to the collaborative comparison result.
Specifically, if the fingerprint of the new short message matches a fingerprint in the normal short message white fingerprint library, the new short message is calibrated as a normal short message; if it matches a fingerprint in the junk short message black fingerprint library, the new short message is calibrated as a junk short message; if the comparisons with both the white fingerprint library and the black fingerprint library fail, the new short message is calibrated as a new short message awaiting manual calibration.
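Steps 502 to 506 can be put together in a short sketch. It assumes the text has already been denoised and its fingerprint computed, uses separate K values for the two libraries as step 505 allows, and uses the example length threshold of 5. The function names and default K values are hypothetical, and sending a message that hits both libraries to manual calibration is an assumption, since the description does not cover that case.

```python
from typing import Iterable


def hits(fingerprint: int, library: Iterable[int], k: int) -> bool:
    """True if any stored fingerprint is within Hamming distance k."""
    return any(bin(fingerprint ^ stored).count("1") <= k for stored in library)


def label_new_message(
    cleaned_text: str,
    fingerprint: int,
    black_library: Iterable[int],
    white_library: Iterable[int],
    k_black: int = 3,
    k_white: int = 3,
    length_threshold: int = 5,
) -> str:
    """Return "normal", "junk", or "manual" for a new short message."""
    # Steps 502-503: very short texts are calibrated as normal directly.
    if len(cleaned_text) <= length_threshold:
        return "normal"
    # Step 505: compare against both libraries, each with its own K.
    black_hit = hits(fingerprint, black_library, k_black)
    white_hit = hits(fingerprint, white_library, k_white)
    # Step 506: collaborative decision.
    if white_hit and not black_hit:
        return "normal"
    if black_hit and not white_hit:
        return "junk"
    return "manual"  # neither library hit, or both hit (assumption)
```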
This embodiment is based on short message content. Rather than performing feature extraction and modeling on the preprocessed and denoised short message text, it adopts text fingerprinting: a fingerprint is built for each short message that manual review has determined to be junk or normal, and, according to whether the calibration result is junk, the fingerprint is placed in the junk short message black fingerprint library or the normal short message white fingerprint library. Each reported suspected short message is then compared with the fingerprints in the black and white fingerprint libraries; if a comparison succeeds, the short message is directly calibrated as junk or normal and need not be submitted for manual review.
In addition, during manual review it is inevitable that identical or similar short messages are calibrated as both junk and non-junk, which causes conflicts in the original fingerprint libraries of this proposal. To ensure correct calibration of new short messages, and also to improve calibration quality in the manual review stage, this embodiment uses the calibration result sets to cross-check the black and white fingerprint libraries, automatically discovers calibration conflicts, and attempts automatic error correction.
As can be seen from the above description, the scheme of this embodiment, based on collaborative automatic calibration with dual black and white fingerprint libraries over the full volume of short message content, effectively solves the problem that, because short message texts are short, deformed, and highly repetitive, the misjudgment rate and missed-detection rate cannot meet practical requirements.
In the scheme provided by this embodiment, a fingerprint is formed from the content of a new short message and compared collaboratively against the dual black and white fingerprint libraries. This calibration method makes full use of the large-scale, full-volume normal short message texts to calibrate the large proportion of normal short messages, and effectively avoids the difficulty that mining-based modeling algorithms cannot effectively model normal short messages, which are dispersed across the full space. At the same time, it serves as an effective reverse collaborative calibration for the traditional junk short message black fingerprint filtering scheme, improving the accuracy of junk short message calibration. The goal of reducing both the misjudgment rate and the missed-detection rate is thereby achieved.
In verification on the live network, with the scheme of this embodiment, reported suspected short messages are identified and filtered automatically; the misjudgment rate for normal short messages is less than 0.5%, and the filtering ratio can reach 50%, which effectively reduces the workload and cost of manual review.
Further, because fingerprints are built by one-to-one mapping of the full volume of short messages, the scheme is naturally immune to the series of problems caused by the disparity in proportion between normal and junk short messages, and is therefore better suited to this application scenario.
Conflicts in the manual calibration results are detected automatically, which assists automatic error correction and automatic feedback of the content of short messages complained about by customers.
The scheme provided by this embodiment suits the time-sensitive way in which junk short message content and topics change: a newly calibrated or newly complained-about junk short message need only be added to the fingerprint library as a new fingerprint for the fingerprints used in comparing new short messages to be updated automatically, without retraining a model.
In addition, in this embodiment, the distance parameter K used for comparison can be adjusted to support users' varying requirements on indicators such as misjudgment rate, missed-detection rate, and filtering ratio in actual operation.
In addition, the fingerprint comparison process of the system uses an indexing mechanism, which ensures that calibration results are provided quickly even with the full volume of short messages. In live-network tests, the highest calibration speed reached 15,000 short messages per second.
Embodiment three
To implement the methods of Embodiments one and two, this embodiment provides an information processing device. As shown in Fig. 6, the device includes a fingerprint generating unit 61, a comparing unit 62, and a calibration unit 63, wherein:
the fingerprint generating unit 61 is configured to generate the fingerprint of the short message to be calibrated according to the text content of the short message to be calibrated;
the comparing unit 62 is configured to compare the fingerprint of the short message to be calibrated simultaneously with the fingerprints in the junk short message black fingerprint library and the normal short message white fingerprint library;
the calibration unit 63 is configured to calibrate the short message to be calibrated as a junk short message or a normal short message according to the results, provided by the comparing unit 62, of comparison with the fingerprints in the black fingerprint library and with the fingerprints in the white fingerprint library.
The device may further include a preprocessing unit, configured to preprocess and denoise the text content;
Correspondingly, the fingerprint generating unit 61 is configured to generate the fingerprint of the short message to be calibrated according to the preprocessed and denoised text content.
Preprocessing and denoising the text content may specifically include:
conventional preprocessing and denoising operations on the short message text content, such as word segmentation, stop-word removal, removal of special strings and special symbols, conversion between traditional and simplified Chinese characters, and conversion between uppercase and lowercase numeral forms.
When the language of the text content of the short message is Chinese, in addition to the above conventional preprocessing and denoising operations, preprocessing and denoising the text content may specifically further include:
denoising operations that remove English characters, telephone numbers, and digits from the short message text content.
The fingerprint generating unit 61 is specifically configured to generate a SimHash code from the text content of the short message to be calibrated, to serve as the fingerprint of the short message to be calibrated.
Here, when the text content is preprocessed and denoised, the text content used to generate the SimHash code is the preprocessed and denoised text content.
The device may further include a fingerprint library establishing unit, configured to establish the junk short message black fingerprint library and the normal short message white fingerprint library according to the manual calibration results for all suspected short messages.
In other words, manual reviewers calibrate every reported suspected short message as a junk short message or a normal short message. The fingerprint library establishing unit then takes the suspected short messages calibrated as normal as the input source for creating the normal short message white fingerprint library, and the suspected short messages calibrated as junk as the input source for creating the junk short message black fingerprint library, thereby establishing both libraries.
During establishment, the fingerprint library establishing unit preprocesses and denoises the text content of each short message used as an input source and generates a SimHash code from the processed text content to serve as the fingerprint of that input source. It compares this fingerprint with the fingerprints already in the fingerprint library and compares the resulting fingerprint similarity with the configured fingerprint-similarity Hamming-distance threshold. When the similarity is greater than the threshold, the comparison has failed, and the fingerprint of the input source is added to the corresponding fingerprint library.
When the fingerprint library does not yet contain any fingerprint, the comparison is deemed to have failed, and the fingerprint of the input source is added directly to the corresponding fingerprint library.
In practical application, the fingerprints in the fingerprint library are indexed, with the corresponding character string as the index key, to allow fast subsequent comparison.
The fingerprint library establishing unit is further configured, after the junk short message black fingerprint library and the normal short message white fingerprint library are established, to perform conflict detection on the white fingerprint library using the junk short message set that was used to establish the libraries and to correct the fingerprints in the white fingerprint library according to the detection result; and to perform conflict detection on the black fingerprint library using the normal short message set that was used to establish the libraries and to correct the fingerprints in the black fingerprint library according to the detection result.
Described comparing unit 62, when being additionally operable to string length corresponding to the content of text after determining pretreatment and Denoising disposal more than the string length thresholding arranged, the fingerprint of described note to be calibrated is compared with the fingerprint in the black fingerprint base of refuse messages and the white fingerprint base of normal note simultaneously.
Wherein, when determining string length that the content of text after pretreatment and Denoising disposal is corresponding less than or equal to the string length thresholding arranged, illustrating that described note to be calibrated is normal note, described comparing unit 62 triggers described demarcation unit 63 and directly described short beacon to be calibrated is decided to be normal note.
The comparing unit 62 is specifically configured to: use the character string corresponding to the text content after preprocessing and denoising as an index, and simultaneously compare the fingerprint of the to-be-calibrated short message with the fingerprints in the junk short message black fingerprint library and the normal short message white fingerprint library.
Specifically, the comparing unit 62 simultaneously compares the fingerprint of the to-be-calibrated short message with the fingerprints in the junk short message black fingerprint library and the normal short message white fingerprint library to obtain the corresponding fingerprint similarities;
each obtained fingerprint similarity is then compared with the corresponding fingerprint similarity Hamming distance threshold to determine the comparison result.
When the obtained fingerprint similarity is less than or equal to the corresponding Hamming distance threshold, the comparison succeeds; when the obtained fingerprint similarity is greater than the corresponding Hamming distance threshold, the comparison fails. The similarity here is expressed as a Hamming distance, so a smaller value indicates a closer match.
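The threshold comparison above can be illustrated with a short sketch. It assumes fingerprints are stored as Python integers and that a library is a plain iterable of fingerprints; the function names and the default threshold of 3 bits are hypothetical and not specified in the text.

```python
from typing import Iterable


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two integer fingerprints differ."""
    return bin(a ^ b).count("1")


def comparison_succeeds(fingerprint: int, library: Iterable[int], max_distance: int = 3) -> bool:
    """Return True if any library fingerprint lies within max_distance bits.

    max_distance plays the role of the Hamming distance threshold; the
    default value is an assumption made for this sketch.
    """
    return any(hamming_distance(fingerprint, fp) <= max_distance for fp in library)
```

Because the text speaks of a "corresponding" threshold, the black and white libraries could each be given their own `max_distance` value.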
The calibrating unit 63 is specifically configured to:
when the comparison with the fingerprints in the junk short message black fingerprint library succeeds and the comparison with the fingerprints in the normal short message white fingerprint library fails, calibrate the to-be-calibrated short message as a junk short message; or,
when the comparison with the fingerprints in the junk short message black fingerprint library fails and the comparison with the fingerprints in the normal short message white fingerprint library succeeds, calibrate the to-be-calibrated short message as a normal short message; or,
when the comparison with the fingerprints in the junk short message black fingerprint library fails and the comparison with the fingerprints in the normal short message white fingerprint library also fails, calibrate the to-be-calibrated short message as a short message awaiting manual calibration.
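The three calibration cases above can be written as a small decision function. This is a sketch under stated assumptions: the names are hypothetical, and the case in which both comparisons succeed is not covered by the text, so routing it to manual calibration here is an assumption rather than the described behavior.

```python
from enum import Enum


class Label(Enum):
    JUNK = "junk"
    NORMAL = "normal"
    MANUAL = "awaiting manual calibration"


def calibrate(black_hit: bool, white_hit: bool) -> Label:
    """Map the two comparison results to a calibration label."""
    if black_hit and not white_hit:
        return Label.JUNK      # matches the black library only
    if white_hit and not black_hit:
        return Label.NORMAL    # matches the white library only
    # Both comparisons failed (the case stated in the text), or both
    # succeeded (not covered by the text; sent to manual calibration
    # here as an assumption).
    return Label.MANUAL
```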
In practical applications, the fingerprint generating unit 61, the comparing unit 62, the calibrating unit 63, the preprocessing unit, and the fingerprint library establishing unit may be implemented by a central processing unit (CPU, Central Processing Unit), a microprocessor (MCU, Micro Control Unit), a digital signal processor (DSP, Digital Signal Processor), or a field-programmable gate array (FPGA, Field-Programmable Gate Array) in the information processing device.
In the information processing device provided by this embodiment, the fingerprint generating unit 61 generates the fingerprint of the to-be-calibrated short message according to the text content of the to-be-calibrated short message; the comparing unit 62 simultaneously compares the fingerprint of the to-be-calibrated short message with the fingerprints in the junk short message black fingerprint library and the normal short message white fingerprint library; and the calibrating unit 63 calibrates the to-be-calibrated short message as a junk short message or a normal short message according to the comparison result against the junk short message black fingerprint library and the comparison result against the normal short message white fingerprint library. This effectively improves the accuracy of short message calibration and thereby reduces both the misjudgment rate and the missed-detection rate. In this way, manual review can be replaced, which greatly reduces labor costs.
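Claim 4 names SimHash encoding as the fingerprint. The following is a minimal sketch of a generic SimHash, not the patent's exact procedure: the character-bigram tokenization, the uniform token weight, the MD5-based token hash, and the 64-bit width are all assumptions chosen for illustration.

```python
import hashlib


def simhash(text: str, bits: int = 64, ngram: int = 2) -> int:
    """Compute a SimHash fingerprint of the given text.

    Assumptions of this sketch: tokens are overlapping character n-grams,
    every token has weight 1, and tokens are hashed with MD5. `bits` must
    be a multiple of 8 and at most 128.
    """
    # Overlapping character n-grams; a text shorter than `ngram` yields one token.
    tokens = [text[i:i + ngram] for i in range(max(len(text) - ngram + 1, 1))]
    weights = [0] * bits
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:bits // 8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(bits):
            weights[i] += 1 if (token_hash >> i) & 1 else -1
    # A bit of the fingerprint is set where the positive votes win.
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint
```

Texts that share most of their tokens tend to produce fingerprints that differ in only a few bits, which is what makes a Hamming distance threshold usable for comparison.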
The preprocessing unit performs denoising on the short message text content by removing English characters, telephone numbers, and digits, which further improves the accuracy of short message calibration.
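The denoising step can be sketched with regular expressions. The patterns below are assumptions: the text only says that English characters, telephone numbers, and digits are removed, and does not define what counts as a telephone number.

```python
import re

# Hypothetical patterns; the source does not specify exact matching rules.
PHONE_RE = re.compile(r"\+?\d[\d\- ]{6,}\d")   # digit runs with optional separators
LATIN_RE = re.compile(r"[A-Za-z]+")            # English letters
DIGIT_RE = re.compile(r"\d+")                  # any remaining digits


def denoise(text: str) -> str:
    """Remove telephone numbers, English characters, and digits from the text."""
    text = PHONE_RE.sub("", text)   # phone numbers first, so separators go too
    text = LATIN_RE.sub("", text)
    text = DIGIT_RE.sub("", text)
    return text
```

Removing telephone numbers before bare digits lets the separators inside a number (hyphens, spaces) be dropped together with it.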
The fingerprint library establishing unit establishes the junk short message black fingerprint library and the normal short message white fingerprint library according to the manual calibration results of suspected short messages. It then performs collision detection on the normal short message white fingerprint library using the junk short message set that was used to establish the fingerprint libraries, and corrects the fingerprints in the normal short message white fingerprint library according to the detection result; it likewise performs collision detection on the junk short message black fingerprint library using the normal short message set that was used to establish the fingerprint libraries, and corrects the fingerprints in the junk short message black fingerprint library according to the detection result. This ensures the correctness of the fingerprints in the fingerprint libraries and thereby further improves the accuracy of short message calibration.
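One way to picture the collision detection and correction is shown below. The text does not say how a colliding fingerprint is "corrected"; this sketch assumes that a library fingerprint lying within the Hamming distance threshold of any sample from the opposite class is simply removed. The function names and the default threshold are hypothetical.

```python
from typing import Iterable, Set


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two integer fingerprints."""
    return bin(a ^ b).count("1")


def correct_library(library: Set[int], opposite_samples: Iterable[int], max_distance: int = 3) -> Set[int]:
    """Return the library without fingerprints that collide with opposite-class samples.

    A collision is assumed to mean: within max_distance bits of a fingerprint
    generated from the opposite training set. Removal is an assumed form of
    correction, not one stated in the source.
    """
    samples = list(opposite_samples)
    return {
        fp for fp in library
        if all(hamming_distance(fp, s) > max_distance for s in samples)
    }
```

The same function would be applied twice: to the white library with junk sample fingerprints, and to the black library with normal sample fingerprints.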
The comparing unit 62 uses the character string corresponding to the text content after preprocessing and denoising as an index when simultaneously comparing the fingerprint of the to-be-calibrated short message with the fingerprints in the junk short message black fingerprint library and the normal short message white fingerprint library, which effectively increases the comparison speed.
Those skilled in the art should understand that embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage and optical storage) that contain computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce a computer-implemented process, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
The above are merely preferred embodiments of the present invention and are not intended to limit the protection scope of the present invention.

Claims (15)

1. An information processing method, characterized in that the method comprises:
generating a fingerprint of a to-be-calibrated short message according to text content of the to-be-calibrated short message;
simultaneously comparing the fingerprint of the to-be-calibrated short message with fingerprints in a junk short message black fingerprint library and a normal short message white fingerprint library;
calibrating the to-be-calibrated short message as a junk short message or a normal short message according to the comparison result against the fingerprints in the junk short message black fingerprint library and the comparison result against the fingerprints in the normal short message white fingerprint library.
2. The method according to claim 1, characterized in that, before generating the fingerprint of the to-be-calibrated short message, the method further comprises:
performing preprocessing and denoising on the text content;
correspondingly, generating the fingerprint of the to-be-calibrated short message according to the text content after preprocessing and denoising.
3. The method according to claim 2, characterized in that performing preprocessing and denoising on the text content comprises:
performing, on the short message text content, a denoising operation that removes English characters, telephone numbers, and digits.
4. The method according to claim 1, characterized in that generating the fingerprint of the to-be-calibrated short message comprises:
generating a SimHash code from the text content of the to-be-calibrated short message and using the SimHash code as the fingerprint of the to-be-calibrated short message.
5. The method according to claim 1, characterized in that, before simultaneously comparing the fingerprint of the to-be-calibrated short message with the fingerprints in the junk short message black fingerprint library and the normal short message white fingerprint library, the method further comprises:
establishing the junk short message black fingerprint library and the normal short message white fingerprint library according to manual calibration results of all suspected short messages.
6. The method according to claim 5, characterized in that, after the junk short message black fingerprint library and the normal short message white fingerprint library are established, and before the fingerprint of the to-be-calibrated short message is simultaneously compared with the fingerprints in the junk short message black fingerprint library and the normal short message white fingerprint library, the method further comprises:
performing collision detection on the normal short message white fingerprint library using the junk short message set that was used to establish the fingerprint libraries, and correcting the fingerprints in the normal short message white fingerprint library according to the detection result;
performing collision detection on the junk short message black fingerprint library using the normal short message set that was used to establish the fingerprint libraries, and correcting the fingerprints in the junk short message black fingerprint library according to the detection result.
7. The method according to claim 2, characterized in that, before simultaneously comparing the fingerprint of the to-be-calibrated short message with the fingerprints in the junk short message black fingerprint library and the normal short message white fingerprint library, the method further comprises:
when it is determined that the length of the character string corresponding to the text content after preprocessing and denoising is greater than a set string length threshold, simultaneously comparing the fingerprint of the to-be-calibrated short message with the fingerprints in the junk short message black fingerprint library and the normal short message white fingerprint library.
8. The method according to claim 2, characterized in that simultaneously comparing the fingerprint of the to-be-calibrated short message with the fingerprints in the junk short message black fingerprint library and the normal short message white fingerprint library comprises:
using the character string corresponding to the text content after preprocessing and denoising as an index, simultaneously comparing the fingerprint of the to-be-calibrated short message with the fingerprints in the junk short message black fingerprint library and the normal short message white fingerprint library.
9. The method according to claim 8, characterized in that simultaneously comparing the fingerprint of the to-be-calibrated short message with the fingerprints in the junk short message black fingerprint library and the normal short message white fingerprint library comprises:
simultaneously comparing the fingerprint of the to-be-calibrated short message with the fingerprints in the junk short message black fingerprint library and the normal short message white fingerprint library to obtain the corresponding fingerprint similarities;
comparing each obtained fingerprint similarity with the corresponding fingerprint similarity Hamming distance threshold to determine the comparison result.
10. The method according to claim 1, characterized in that calibrating the to-be-calibrated short message as a junk short message or a normal short message according to the comparison result against the fingerprints in the junk short message black fingerprint library and the comparison result against the fingerprints in the normal short message white fingerprint library comprises:
when the comparison with the fingerprints in the junk short message black fingerprint library succeeds and the comparison with the fingerprints in the normal short message white fingerprint library fails, calibrating the to-be-calibrated short message as a junk short message; or,
when the comparison with the fingerprints in the junk short message black fingerprint library fails and the comparison with the fingerprints in the normal short message white fingerprint library succeeds, calibrating the to-be-calibrated short message as a normal short message; or,
when the comparison with the fingerprints in the junk short message black fingerprint library fails and the comparison with the fingerprints in the normal short message white fingerprint library also fails, calibrating the to-be-calibrated short message as a short message awaiting manual calibration.
11. An information processing device, characterized in that the device comprises: a fingerprint generating unit, a comparing unit, and a calibrating unit; wherein,
the fingerprint generating unit is configured to generate a fingerprint of a to-be-calibrated short message according to text content of the to-be-calibrated short message;
the comparing unit is configured to simultaneously compare the fingerprint of the to-be-calibrated short message with fingerprints in a junk short message black fingerprint library and a normal short message white fingerprint library;
the calibrating unit is configured to calibrate the to-be-calibrated short message as a junk short message or a normal short message according to the comparison result against the fingerprints in the junk short message black fingerprint library and the comparison result against the fingerprints in the normal short message white fingerprint library, both provided by the comparing unit.
12. The device according to claim 11, characterized in that the device further comprises: a preprocessing unit, configured to perform preprocessing and denoising on the text content;
correspondingly, the fingerprint generating unit is configured to generate the fingerprint of the to-be-calibrated short message according to the text content after preprocessing and denoising.
13. The device according to claim 11, characterized in that the device further comprises: a fingerprint library establishing unit, configured to establish the junk short message black fingerprint library and the normal short message white fingerprint library according to manual calibration results of all suspected short messages.
14. The device according to claim 13, characterized in that the fingerprint library establishing unit is further configured to, after the junk short message black fingerprint library and the normal short message white fingerprint library are established: perform collision detection on the normal short message white fingerprint library using the junk short message set that was used to establish the fingerprint libraries, and correct the fingerprints in the normal short message white fingerprint library according to the detection result; and perform collision detection on the junk short message black fingerprint library using the normal short message set that was used to establish the fingerprint libraries, and correct the fingerprints in the junk short message black fingerprint library according to the detection result.
15. The device according to claim 12, characterized in that the comparing unit is further configured to, when it is determined that the length of the character string corresponding to the text content after preprocessing and denoising is greater than a set string length threshold, simultaneously compare the fingerprint of the to-be-calibrated short message with the fingerprints in the junk short message black fingerprint library and the normal short message white fingerprint library.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410832128.4A CN105786792A (en) 2014-12-26 2014-12-26 Information processing method and device

Publications (1)

Publication Number Publication Date
CN105786792A true CN105786792A (en) 2016-07-20

Family

ID=56389071

Country Status (1)

Country Link
CN (1) CN105786792A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101137087A (en) * 2007-08-01 2008-03-05 浙江大学 Short message monitoring center and monitoring method
CN101257671A (en) * 2007-07-06 2008-09-03 浙江大学 Method for real time filtering large scale rubbish SMS based on content
CN101711013A (en) * 2009-12-08 2010-05-19 中兴通讯股份有限公司 Method for processing multimedia message and device thereof
CN101977360A (en) * 2010-09-30 2011-02-16 北京新媒传信科技有限公司 Junk short message filter method
CN102096703A (en) * 2010-12-29 2011-06-15 北京新媒传信科技有限公司 Filtering method and equipment of short messages
CN102480702A (en) * 2010-11-24 2012-05-30 腾讯科技(深圳)有限公司 Short message intercepting method and system
CN103024746A (en) * 2012-12-30 2013-04-03 清华大学 System and method for processing spam short messages for telecommunication operator
CN103729384A (en) * 2012-10-16 2014-04-16 中国移动通信集团公司 Information filtering method, system and device

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
BINGFENG PI 等: "SimHash-based Effective and Efficient Detecting of Near-Duplicate Short Messages", 《PROCEEDINGS OF THE SECOND SYMPOSIUM INTERNATIONAL COMPUTER SCIENCE AND COMPUTATIONAL TECHNOLOGY》 *
孙德才 等: "《近似串匹配关键技术及实用算法》", 30 June 2014, 《东北大学出版社》 *
罗刚: "《使用C#开发搜索引擎》", 29 February 2012, 《清华大学出版社》 *
黄文良 等: "一种高效垃圾短信过滤系统的实现", 《电信科学》 *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106681980A (en) * 2015-11-05 2017-05-17 中国移动通信集团公司 Method and device for analyzing junk short messages
CN106681980B (en) * 2015-11-05 2019-06-28 中国移动通信集团公司 A kind of refuse messages analysis method and device
CN107705166A (en) * 2016-08-11 2018-02-16 杭州朗和科技有限公司 Information processing method and device
CN106446149B (en) * 2016-09-21 2020-01-10 联动优势科技有限公司 Notification information filtering method and device
CN106446149A (en) * 2016-09-21 2017-02-22 联动优势科技有限公司 Filtering method and device for notification message
CN109408795B (en) * 2017-08-17 2022-04-15 中国移动通信集团公司 Text recognition method, text recognition equipment, computer readable storage medium and device
CN109408795A (en) * 2017-08-17 2019-03-01 中国移动通信集团公司 A kind of text recognition method, equipment, computer readable storage medium and device
CN109547319A (en) * 2017-09-22 2019-03-29 中移(杭州)信息技术有限公司 A kind of message treatment method and device
CN108703824A (en) * 2018-03-15 2018-10-26 哈工大机器人(合肥)国际创新研究院 A kind of bionic hand control system and control method based on myoelectricity bracelet
CN108874777A (en) * 2018-06-11 2018-11-23 北京奇艺世纪科技有限公司 A kind of method and device of text anti-spam
CN108874777B (en) * 2018-06-11 2023-03-07 北京奇艺世纪科技有限公司 Text anti-spam method and device
CN112275438A (en) * 2020-10-13 2021-01-29 成都智叟智能科技有限公司 Dry and wet garbage separation and crushing control method and system based on data analysis
CN112275438B (en) * 2020-10-13 2022-03-01 成都智叟智能科技有限公司 Dry and wet garbage separation and crushing control method and system based on data analysis

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20160720