CN105404670A

CN105404670A - Harassing text message determining method and apparatus

Info

Publication number: CN105404670A
Application number: CN201510784065.4A
Authority: CN
Inventors: 李强; 张金晶; 常富洋
Original assignee: Beijing Qihoo Technology Co Ltd; Qizhi Software Beijing Co Ltd
Current assignee: Beijing Qihoo Technology Co Ltd
Priority date: 2015-11-16
Filing date: 2015-11-16
Publication date: 2016-03-16
Anticipated expiration: 2035-11-16
Also published as: CN105404670B

Abstract

The present invention provides a harassing text message determining method. The method comprises the steps of: parsing the original text message content to acquire each word and the digit information therein; with the digit information as a reference, calculating the distance from each word to each digit; using the distance as a weight and each word as a dimension, describing the text message content to generate a plurality of corresponding feature vectors; inputting each feature vector into a classification model to obtain an output result; and, based on the output result, determining whether the text message is a harassing text message. With the method and apparatus provided by the present invention, a harassing text message can be determined more accurately, the probability of intercepting a harassing text message is increased, and the probability of intercepting a text message useful to the user is reduced.

Description

Harassing text message determining method and apparatus

Technical field

The present invention relates to mobile terminal technology, and in particular to a harassing text message determining method and apparatus.

Background art

With the development of information technology, mobile communication has become the main means of everyday communication. Besides telephone and video calls, text messaging is a convenient mode of communication and has become the most cost-effective and most widely available one. However, the harassing text messages that come with it cause users great trouble. Merchants advertising and criminals sending phishing URLs both reach users through harassing text messages. To avoid receiving large numbers of such messages, the prior art judges, based on some policy, whether a received message is useful to the user, and automatically places messages determined to be harassing into a blacklist or deletes them, thereby preventing the inconvenience that large numbers of harassing messages cause.

The prior art usually checks against the address book whether the sender of a text message is an unfamiliar number in order to determine whether the message is harassing, or filters harassing messages with simple policies. As a result, messages useful to the user are misjudged as harassing text messages. A more accurate harassing text message determining method is therefore needed to improve the accuracy of the judgment.

Summary of the invention

The object of the present invention is to solve at least one of the above problems by providing a harassing text message determining method and apparatus that determine harassing text messages as correctly as possible.

To achieve this object, the invention provides a harassing text message determining method, comprising the following steps:

Parsing the original text message content to obtain each word and the numerical information therein;

Calculating, with the numerical information as a reference, the distance from each word to each number;

Using the distances as weights and each word as a dimension, describing the text message content to generate multiple corresponding feature vectors;

Inputting each feature vector into a classification model to obtain an output result;

Determining, based on the output results, whether the text message is a harassing text message.

Specifically, the classification model is a pre-trained model, and its training steps are as follows:

Parsing each original text message in a sample set to obtain each word and the numerical information therein;

Calculating, with the numerical information as a reference, the distance from each word to each number;

Using the corresponding distances as weights and each word as a dimension, describing each text message to generate corresponding training samples;

Manually labeling the training samples as positive samples and negative samples;

Training the classification model with the positive and negative samples.

Further, the specific step of describing the text message content to generate the corresponding feature vectors is: taking each number in turn as a reference and the words before and after it as dimensions, calculating the distance from each of those words to the number, and using the corresponding distances as weights, so as to generate multiple feature vectors that describe the message.

Specifically, the distance from each word to each number is expressed with one word as the unit distance.

Specifically, the parsing step is as follows:

Deleting the specific information in the original text message;

Segmenting the message content into words based on grammar, to obtain the words, the numbers, and the corresponding parts of speech in the message;

Extracting the numerical information therein.

Specifically, the specific information includes URLs, IP addresses, mobile phone numbers, customer phone numbers, and landline phone numbers.

Preferably, the classification model is trained with the AdaBoost algorithm.

Specifically, the step of determining, based on the output results, whether the text message is a harassing text message is:

When the output result of at least one of the message's feature vectors is correct, the message is determined to be a normal text message;

Otherwise, the message is determined to be a harassing text message.

Preferably, an output result of 1 denotes correct and an output result of 0 denotes wrong.

Further, the method also comprises the step of storing text messages that the classification model determines to be harassing in a blacklist.

Further, the method also comprises the step of deleting text messages determined to be harassing from the user's message list.

A harassing text message determining apparatus, comprising:

A parsing module, for parsing the original text message content to obtain each word and the numerical information therein;

A distance calculation module, for calculating, with the numerical information as a reference, the distance from each word to each number;

A feature vector generation module, for describing the text message content, using the distances as weights and each word as a dimension, to generate multiple corresponding feature vectors;

A classification module, for inputting each feature vector into a classification model to obtain an output result;

A determination module, for determining, based on the output results, whether the text message is a harassing text message.

Specifically, the classification model is a pre-trained model generated by a training module, and the steps performed by the training module are as follows:

Manually labeling the training samples as positive samples and negative samples;

Specifically, the steps performed by the feature vector generation module are: taking each number in turn as a reference and the words before and after it as dimensions, calculating the distance from each of those words to the number, and using the corresponding distances as weights, so as to generate multiple feature vectors that describe the message.

Specifically, the steps performed by the parsing module are as follows:

Deleting the specific information in the original text message;

Extracting the numerical information therein.

Preferably, the training module trains the classification model with the AdaBoost algorithm.

Specifically, the steps performed by the determination module are:

Otherwise, the message is determined to be a harassing text message.

Specifically, an output result of 1 denotes correct and an output result of 0 denotes wrong.

Further, the apparatus also comprises a blacklist module, for storing text messages that the classification model determines to be harassing in a blacklist.

Further, the apparatus also comprises a deletion module, for deleting text messages determined to be harassing from the user's message list.

Compared with the prior art, the solution of the present invention has the following advantages:

The present invention performs word segmentation on the text message content, extracts the numerical information in it, and, taking each number as a reference, uses the distance from each word to that number as the feature describing the message to generate feature vectors. A pre-trained classification model judges from each feature vector whether its number is a correct number. If at least one number is correct, the message is judged to be a normal text message; otherwise it is a harassing text message. Determining harassing text messages with the method of the invention judges more accurately whether a message received by the user's mobile terminal is harassing. In particular, for messages carrying meaningful numbers, such as spending notifications pushed by banks and delivery notifications pushed by logistics companies, it reduces the probability that they are misjudged as harassing text messages and further improves the precision of the determination.

Additional aspects and advantages of the invention will be given in part in the following description, and will become apparent from that description or be learned through practice of the invention.

Brief description of the drawings

The above and/or additional aspects and advantages of the invention will become apparent and easy to understand from the following description of the embodiments with reference to the accompanying drawings, in which:

Fig. 1 is a flowchart of the harassing text message determining method of the invention;

Fig. 2 is a flowchart of the text message parsing step of the invention;

Fig. 3 is a flowchart of the classification model training step of the invention;

Fig. 4 is a structural diagram of the harassing text message determining apparatus of the invention.

Embodiment

Embodiments of the invention are described in detail below. Examples of the embodiments are shown in the drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary, serve only to explain the invention, and are not to be construed as limiting it.

Those skilled in the art will understand that, unless expressly stated otherwise, the singular forms "a", "an", "said", and "the" used herein may also include the plural forms. It should be further understood that the word "comprising" used in this specification indicates the presence of the stated features, integers, steps, operations, elements, and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It should be understood that when an element is said to be "connected" or "coupled" to another element, it may be directly connected or coupled to the other element, or intervening elements may be present. In addition, "connected" or "coupled" as used herein may include wireless connection or wireless coupling. The term "and/or" used herein includes all or any unit of, and all combinations of, one or more of the associated listed items.

Those skilled in the art will understand that, unless otherwise defined, all terms used herein (including technical and scientific terms) have the same meaning as commonly understood by a person of ordinary skill in the art to which the invention belongs. It should also be understood that terms such as those defined in general-purpose dictionaries should be understood to have meanings consistent with their meanings in the context of the prior art and, unless specifically defined as here, are not to be interpreted in an idealized or overly formal sense.

Those skilled in the art will understand that "terminal" and "terminal device" as used herein include both devices that have only a wireless signal receiver without transmitting capability and devices that have receiving and transmitting hardware capable of two-way communication over a bidirectional communication link. Such a device may include: a cellular or other communication device, with a single-line display, a multi-line display, or no multi-line display; a PCS (Personal Communications Service) device, which may combine voice, data processing, fax, and/or data communication capabilities; a PDA (Personal Digital Assistant), which may include a radio frequency receiver, pager, Internet/intranet access, web browser, notepad, calendar, and/or GPS (Global Positioning System) receiver; and a conventional laptop and/or palmtop computer or other device that has and/or includes a radio frequency receiver. A "terminal" or "terminal device" as used herein may be portable, transportable, installed in a vehicle (air, sea, and/or land), or adapted and/or configured to run locally and/or, in distributed form, at any other location on Earth and/or in space. A "terminal" or "terminal device" as used herein may also be a communication terminal, a network access terminal, or a music/video playback terminal, for example a PDA, an MID (Mobile Internet Device), and/or a mobile phone with music/video playback, or a device such as a smart TV or set-top box.

Those skilled in the art will understand that a remote network device as used herein includes, but is not limited to, a computer, a network host, a single network server, a set of multiple network servers, or a cloud formed by multiple servers. Here, a cloud consists of a large number of computers or network servers based on cloud computing, where cloud computing is a kind of distributed computing: a super virtual computer composed of a group of loosely coupled computers. In embodiments of the invention, the remote network device, the terminal device, and the WNS server may communicate by any communication mode, including but not limited to mobile communication based on 3GPP, LTE, or WIMAX, computer network communication based on the TCP/IP and UDP protocols, and short-range wireless transmission based on the Bluetooth and infrared transmission standards.

Referring to Fig. 1, in order to determine more accurately whether a text message received by a user's mobile terminal is harassing, the invention provides a harassing text message determining method, which comprises the following steps:

S11: parse the original text message content to obtain each word and the numerical information therein;

This embodiment of the invention is mainly used to determine whether text messages that contain important numerical information, such as spending notifications and balance notifications sent by banks, are harassing text messages. The message content is therefore parsed first to obtain the numerical information in it. Referring to Fig. 2, the parsing process is as follows:

Step 1: delete the specific information in the original text message;

The specific information includes URLs, IP addresses, mobile phone numbers, customer phone numbers, landline phone numbers, and similar information in the message. Deleting it prevents interference with the useful numerical information, such as spending amounts and logistics tracking numbers.

Step 2: segment the message content into words based on grammar, to obtain the words, the numbers, and the corresponding parts of speech in the message;

The message content is segmented according to standard Chinese grammar, so that each word that expresses a complete meaning forms one token. The content is thereby split into different words and one or more numbers. The part of speech of each word is determined, and the part of speech of a number is set to m. When a number such as an amount contains punctuation, as in 200.00 yuan, the punctuation is deleted so that the number forms a single string of digits.

Step 3: extract the numerical information therein.

Based on the segmented message content, with part of speech as the distinguishing feature, the tokens whose part of speech is m are extracted, which yields all the numerical information.

Thus, by segmenting and parsing the message content, each word and the numerical information are extracted for subsequent processing.
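The three parsing steps above can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the regular expressions for the specific information are assumed formats, the regex tokenizer only stands in for grammar-based Chinese word segmentation (which would need a real segmenter), and the tag "w" for non-number tokens is a placeholder. Only the tag m for numbers comes from the text.

```python
import re

# Hypothetical patterns for the "specific information" to delete (step 1).
SPECIFIC_INFO = re.compile(
    r"https?://\S+"                      # URLs
    r"|\b\d{1,3}(?:\.\d{1,3}){3}\b"      # IPv4 addresses
    r"|\b1\d{10}\b"                      # 11-digit mobile numbers (assumed format)
    r"|\b\d{3,4}-\d{7,8}\b"              # landline numbers (assumed format)
)

# A token is a number (optionally with . or , inside) or a run of letters.
# This is a stand-in for grammar-based segmentation (step 2).
TOKEN = re.compile(r"\d+(?:[.,]\d+)*|[^\W\d_]+")


def parse_sms(text):
    """Return a list of (token, pos) pairs; numbers get part of speech 'm'."""
    cleaned = SPECIFIC_INFO.sub(" ", text)
    tokens = []
    for match in TOKEN.finditer(cleaned):
        tok = match.group()
        if tok[0].isdigit():
            # Delete punctuation inside the number, e.g. "200.00" -> "20000".
            tokens.append((re.sub(r"[.,]", "", tok), "m"))
        else:
            tokens.append((tok, "w"))  # placeholder tag for ordinary words
    return tokens


def extract_numbers(tokens):
    """Step 3: return the positions of all tokens tagged 'm'."""
    return [i for i, (_, pos) in enumerate(tokens) if pos == "m"]
```

For example, parsing "Your card spent 200.00 yuan, see http://x.cn/a1 or call 13812345678" with this sketch drops the URL and the phone number and leaves a single number token, "20000", at position 3.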

S12: calculate, with the numerical information as a reference, the distance from each word to each number;

Taking the one or more pieces of numerical information extracted in the above step as references, the distance from each extracted word in the message to each number is calculated, with one word as the unit distance. That is, the number of words from each word to each number is counted, and this count is the distance from that word to that number.
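A small sketch of this distance calculation is given below. It assumes the parsed message is a list of (token, part-of-speech) pairs in which numbers carry the tag m, and it assumes that a word directly next to a number has distance 1 and that other numbers lying in between also count as one position each. The text does not settle either point, and the function name is illustrative.

```python
def distances_to_numbers(tokens):
    """Map each number's position to a list of (word, distance) pairs.

    tokens: list of (token, pos) pairs; numbers carry pos 'm'.
    Distance is the difference in token positions (one word = one unit).
    """
    number_positions = [i for i, (_, pos) in enumerate(tokens) if pos == "m"]
    result = {}
    for n in number_positions:
        result[n] = [
            (tok, abs(i - n))
            for i, (tok, pos) in enumerate(tokens)
            if pos != "m"
        ]
    return result
```

With the tokens balance, 20000, yuan, discount, 8, the word "discount" is at distance 2 from the first number and distance 1 from the second.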

S13: using the distances as weights and each word as a dimension, describe the text message content to generate multiple corresponding feature vectors;

The calculated distance from each word to each number in the message serves as a weight, which represents how strongly each word influences whether that number is a correct number. Each word serves as a dimension, which represents how many words in the message affect the correctness of each number. With the distances as weights and the words as dimensions, multiple feature vectors are generated to describe the message content. Specifically, each number is taken in turn as a reference and the words before and after it as dimensions; the distance from each of those words to the number is calculated and used as the weight, so that multiple different feature vectors are generated.
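One way to turn these distances into fixed-length feature vectors is sketched below. The fixed vocabulary, the use of 0 for a word that does not occur in the message, and keeping the smallest distance when a word occurs more than once are all assumptions made for illustration; the text only specifies that words are dimensions and distances are weights.

```python
def feature_vectors(tokens, vocabulary):
    """Build one feature vector per number in the message.

    tokens: list of (token, pos) pairs; numbers carry pos 'm'.
    vocabulary: list of words, one per dimension (assumed to be fixed in advance).
    """
    index = {word: k for k, word in enumerate(vocabulary)}
    vectors = []
    for n, (_, pos_n) in enumerate(tokens):
        if pos_n != "m":
            continue  # only numbers serve as references
        vec = [0] * len(vocabulary)  # 0 means "word absent" in this sketch
        for i, (tok, pos) in enumerate(tokens):
            if pos == "m" or tok not in index:
                continue
            d = abs(i - n)
            k = index[tok]
            # Keep the nearest occurrence if the word appears several times.
            vec[k] = d if vec[k] == 0 else min(vec[k], d)
        vectors.append(vec)
    return vectors
```

For the tokens balance, 20000, yuan, discount, 8 and the vocabulary balance, yuan, discount, free, this yields [1, 1, 2, 0] for the first number and [4, 2, 1, 0] for the second. Because a word can never be at distance 0 from a number, 0 is unambiguous as the "absent" marker here.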

S14: input each feature vector into the classification model to obtain an output result;

The classification model is a pre-trained model: it is trained on a sample set prepared in advance and is subsequently used to classify text messages, thereby determining whether a message is harassing. Preferably, the classification model is trained with the AdaBoost algorithm. Referring to Fig. 3, the training steps are as follows:

Step 11: parse each original text message in the sample set to obtain each word and the numerical information therein;

The sample set is a collection of text messages prepared in advance and contains n messages; the content of each is parsed. The parsing process is as described for step S11 above and is not repeated here. The parsing yields each word and the numerical information in each message, together with their parts of speech.

Step 12: calculate, with the numerical information as a reference, the distance from each word to each number;

Taking the one or more pieces of numerical information extracted in the above step as references, the distance from each extracted word in the message to each number is calculated, with one word as the unit distance. That is, the number of words from each word to each number is counted, and this count is the distance from that word to that number. The corresponding distances are calculated in this way for every message in the sample set.

Step 13: using the corresponding distances as weights and each word as a dimension, describe each text message to generate corresponding training samples;

The calculated distance from each word to each number in the message serves as a weight, which represents how strongly each word influences whether that number is a correct number. Each word serves as a dimension, which represents how many words in the message affect the correctness of each number. With the distances as weights and the words as dimensions, multiple feature vectors are generated to describe the message content. Specifically, each number is taken in turn as a reference and the words before and after it as dimensions; the distance from each of those words to the number is calculated and used as the weight, so that multiple different feature vectors are generated. Here, the feature vector generated from all the words of a message with one of its numbers as the reference counts as one feature vector. Every message is described in this way to generate multiple feature vectors, that is, multiple training samples.

Step 14: manually label the training samples as positive samples and negative samples;

The training samples generated above are labeled manually. For example, to recognize information such as spending amounts and balances sent by banks, the feature vectors generated with the numbers that represent the amount, balance, and the like as references are labeled as positive samples. Feature vectors generated with other numbers in the message as references, such as the number in a discount rate, are labeled as negative samples. Further, the result for a positive sample is defined as correct, with output value 1, and the result for a negative sample is defined as wrong, with output value 0.

Step 15: train the classification model with the positive and negative samples.

With the positive and negative samples as inputs and their corresponding correct or wrong results as outputs, training is performed with the AdaBoost algorithm to obtain the classification model.
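To make the training step concrete, here is a compact sketch of binary AdaBoost in plain Python. The patent names AdaBoost but not the weak learner, so the one-feature threshold stumps, the number of rounds, and the function names are assumptions made for illustration. Labels follow the text: 1 for correct (positive sample), 0 for wrong (negative sample).

```python
import math


def train_adaboost(X, y, rounds=10):
    """Train AdaBoost with one-feature threshold stumps (assumed weak learner).

    X: list of feature vectors; y: list of labels, 1 (correct) or 0 (wrong).
    Returns a list of (alpha, feature, threshold, polarity) stumps.
    """
    n = len(X)
    signs = [1 if label == 1 else -1 for label in y]
    weights = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        best = None
        # Exhaustively pick the stump with the lowest weighted error.
        for f in range(len(X[0])):
            for thr in sorted({row[f] for row in X}):
                for polarity in (1, -1):
                    preds = [polarity if row[f] <= thr else -polarity for row in X]
                    err = sum(w for w, p, s in zip(weights, preds, signs) if p != s)
                    if best is None or err < best[0]:
                        best = (err, f, thr, polarity, preds)
        err, f, thr, polarity, preds = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the logarithm finite
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, f, thr, polarity))
        # Re-weight so that misclassified samples count more in the next round.
        weights = [w * math.exp(-alpha * s * p) for w, s, p in zip(weights, signs, preds)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model


def predict(model, x):
    """Return 1 (correct) or 0 (wrong) for one feature vector."""
    score = sum(a * (pol if x[f] <= thr else -pol) for a, f, thr, pol in model)
    return 1 if score > 0 else 0
```

The exhaustive stump search is quadratic in the sample count per feature and is only meant to show the mechanics; a library implementation would be the practical choice for a real sample set.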

Each feature vector of a text message is taken as input, and the output result is obtained from the trained classification model. An output result of 1 denotes correct, and an output result of 0 denotes wrong.

S15: determine, based on the output results, whether the text message is a harassing text message.

Whether the message is harassing is determined from the output results of the classification model. When the feature vectors of a message are each input to the classification model and at least one of the output results is correct, the message is a normal text message. Otherwise, the message is judged to be a harassing text message.
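The decision rule of S15 is small enough to state directly in code. The sketch below assumes the classifier outputs have already been collected into a list, one per feature vector; the function name is illustrative.

```python
def is_harassing(outputs):
    """Apply the S15 rule to the classifier outputs of one message.

    outputs: list of results, 1 (correct) or 0 (wrong), one per feature vector.
    The message is normal if at least one output is 1, otherwise harassing.
    Note: in this sketch a message with no numbers yields an empty list and is
    therefore treated as harassing; the text does not address that case.
    """
    return not any(o == 1 for o in outputs)
```

For instance, outputs of [0, 1, 0] give a normal message, while [0, 0] give a harassing one.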

After a message is judged to be a harassing text message, it is intercepted and stored in a blacklist. In other embodiments, after a message is judged to be harassing, it is deleted from the user's message list, which avoids the inconvenience that harassing text messages cause and improves the user's experience of communicating by text message.

Referring to Fig. 4, to further explain the harassing text message determining method of the invention, it is described in modular form. A harassing text message determining apparatus is provided, comprising a parsing module 11, a distance calculation module 12, a feature vector generation module 13, a classification module 14, and a determination module 15, as well as a training module 16, a blacklist module 17, and a deletion module 18 in some variant embodiments, wherein:

Parsing module 11: for parsing the original text message content to obtain each word and the numerical information therein;

This embodiment of the invention is mainly used to determine whether text messages that contain important numerical information, such as spending notifications and balance notifications sent by banks, are harassing text messages. The parsing module 11 therefore first parses the message content to obtain the numerical information in it. The steps performed by the parsing module are as follows:

Step 1: delete the specific information in the original text message;

Step 3: extract the numerical information therein.

Thus, the parsing module 11 segments and parses the message content and extracts each word and the numerical information for subsequent processing.

Distance calculation module 12: for calculating, with the numerical information as a reference, the distance from each word to each number;

Taking the one or more pieces of numerical information extracted by the parsing module 11 as references, the distance calculation module 12 calculates the distance from each word extracted by the parsing module 11 to each number, with one word as the unit distance. That is, the number of words from each word to each number is counted, and this count is the distance from that word to that number.

Feature vector generation module 13: for describing the text message content, using the distances as weights and each word as a dimension, to generate multiple corresponding feature vectors;

The distance from each word to each number in the message, as calculated by the distance calculation module 12, serves as a weight, which represents how strongly each word influences whether that number is a correct number. Each word serves as a dimension, which represents how many words in the message affect the correctness of each number. With the distances as weights and the words as dimensions, the feature vector generation module 13 generates multiple feature vectors to describe the message content. Specifically, each number is taken in turn as a reference and the words before and after it as dimensions; the distance from each of those words to the number is calculated and used as the weight, so that multiple different feature vectors are generated.

Classification module 14: for inputting each feature vector into the classification model to obtain an output result;

The classification model is a pre-trained model: it is trained on a sample set prepared in advance and is subsequently used to classify text messages, thereby determining whether a message is harassing. Preferably, the training module 16 trains the classification model with the AdaBoost algorithm. The steps by which the training module 16 trains the classification model are as follows:

Step 14: manually label the training samples as positive samples and negative samples;

The classification module 14 takes each feature vector of a text message as input and obtains the output result from the trained classification model. An output result of 1 denotes correct, and an output result of 0 denotes wrong.

Determination module 15: for determining, based on the output results, whether the text message is a harassing text message.

The determination module 15 determines from the output results of the classification model whether the message is harassing. Specifically, when the feature vectors of a message are each input to the classification model and at least one of the output results is correct, the message is a normal text message. Otherwise, the message is judged to be a harassing text message.

After a message is judged to be a harassing text message, the blacklist module 17 intercepts it and stores it in a blacklist. In other embodiments, after a message is judged to be harassing, the deletion module 18 deletes it from the user's message list, which avoids the inconvenience that harassing text messages cause and improves the user's experience of communicating by text message.
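The modular structure can be pictured as a small object that wires the modules together. The sketch below is only an illustration under assumptions: the class name, the use of plain callables for the modules, and the merging of the distance calculation and feature vector generation modules into one `vectorize` callable are choices made here, not details from the text. The module numbers in the comments refer to Fig. 4.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class HarassingSmsDevice:
    parse: Callable[[str], list]                # parsing module 11
    vectorize: Callable[[list], List[list]]     # modules 12 and 13 combined
    classify: Callable[[list], int]             # classification module 14
    blacklist: List[str] = field(default_factory=list)  # blacklist module 17

    def handle(self, sms: str) -> bool:
        """Return True if the message is judged harassing, storing it if so."""
        vectors = self.vectorize(self.parse(sms))
        # Determination module 15: normal if any feature vector is "correct" (1).
        harassing = not any(self.classify(v) == 1 for v in vectors)
        if harassing:
            self.blacklist.append(sms)
        return harassing
```

Passing the modules in as callables keeps the determination rule independent of how parsing, vectorization, and classification are implemented, so a trained model can be swapped in without touching this class.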

In summary, the harassing text message determining method or apparatus of the invention judges harassing text messages more accurately, increases the probability that harassing text messages are intercepted, and at the same time reduces the probability that text messages useful to the user are intercepted.

The above are only some embodiments of the invention. It should be noted that those of ordinary skill in the art can make improvements and modifications without departing from the principles of the invention, and these improvements and modifications are also to be regarded as falling within the scope of protection of the invention.

Claims

1. A harassing text message determining method, characterized by comprising the following steps:

2. The method according to claim 1, characterized in that the classification model is a pre-trained model, and its training steps are as follows:

Manually labeling the training samples as positive samples and negative samples;

3. The method according to claim 1, characterized in that the specific step of describing the text message content to generate the corresponding feature vectors is: taking each number in turn as a reference and the words before and after it as dimensions, calculating the distance from each of those words to the number, and using the corresponding distances as weights, so as to generate multiple feature vectors that describe the message.

4. The method according to any one of claims 1 to 3, characterized in that the distance from each word to each number is expressed with one word as the unit distance.

5. The method according to claim 1 or 2, characterized in that the parsing step is as follows:

Deleting the specific information in the original text message;

Extracting the numerical information therein.

6. A harassing text message determining apparatus, characterized by comprising:

7. The apparatus according to claim 6, characterized in that the classification model is a pre-trained model generated by a training module, and the steps performed by the training module are as follows:

Manually labeling the training samples as positive samples and negative samples;

8. The apparatus according to claim 6, characterized in that the steps performed by the feature vector generation module are: taking each number in turn as a reference and the words before and after it as dimensions, calculating the distance from each of those words to the number, and using the corresponding distances as weights, so as to generate multiple feature vectors that describe the message.

9. The apparatus according to any one of claims 6 to 8, characterized in that the distance from each word to each number is expressed with one word as the unit distance.

10. The apparatus according to claim 6, characterized in that the steps performed by the parsing module are as follows:

Deleting the specific information in the original text message;

Extracting the numerical information therein.