CN105404670A - Harassing text message determining method and apparatus - Google Patents

Harassing text message determining method and apparatus

Info

Publication number
CN105404670A
Authority
CN
China
Prior art keywords
word
note
distance
numeral
short message
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510784065.4A
Other languages
Chinese (zh)
Other versions
CN105404670B (en
Inventor
李强
张金晶
常富洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qihoo Technology Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd, Qizhi Software Beijing Co Ltd filed Critical Beijing Qihoo Technology Co Ltd
Priority to CN201510784065.4A priority Critical patent/CN105404670B/en
Publication of CN105404670A publication Critical patent/CN105404670A/en
Application granted granted Critical
Publication of CN105404670B publication Critical patent/CN105404670B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M1/00 Substation equipment, e.g. for use by subscribers
    • H04M1/72 Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M1/724 User interfaces specially adapted for cordless or mobile telephones
    • H04M1/72403 User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality
    • H04M1/7243 User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality with interactive means for internal management of messages
    • H04M1/72436 User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality with interactive means for internal management of messages for text messaging, e.g. SMS or e-mail

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Business, Economics & Management (AREA)
  • General Business, Economics & Management (AREA)
  • Human Computer Interaction (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Mobile Radio Communication Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present invention provides a harassing text message determining method. The method comprises the steps of: parsing the original text message content to obtain each word and the numeric information therein; taking the numeric information as a reference, calculating the distance from each word to each number; describing the text message content with the distances as weights and the words as dimensions, so as to generate a plurality of corresponding feature vectors; inputting each feature vector into a classification model to obtain an output result; and determining, based on the output result, whether the text message is a harassing text message. With the method and apparatus provided by the present invention, harassing text messages can be determined more accurately, so that the probability of intercepting a harassing text message is increased while the probability of intercepting a text message useful to the user is reduced.

Description

Harassing text message determining method and apparatus
Technical field
The present invention relates to mobile terminal technology, and in particular to a method and apparatus for determining whether a text message is a harassing text message.
Background
With the development of information technology, mobile communication has become the main means of everyday communication. Alongside telephone and video calls, the text message is a convenient mode of communication that has become the most cost-effective and most widely available one. However, the harassing text messages that have come with it cause users considerable trouble: merchants send advertising and criminals send phishing URLs and the like by text message, so users receive harassing text messages. To avoid a flood of pushed harassing text messages, the prior art therefore judges, based on some strategy, whether a newly received text message is useful to the user, and automatically places messages determined to be harassing into a blacklist or deletes them, thereby sparing the user the inconvenience caused by large numbers of harassing text messages.
The prior art usually uses the address book to determine whether the sender of a text message is an unknown number, and from that whether the message is a harassing text message, or filters harassing text messages with a simple strategy. As a result, text messages that are useful to the user are misjudged as harassing text messages. A more accurate harassing text message determining method is therefore needed to improve the accuracy of the determination.
Summary of the invention
The object of the present invention is to solve at least one of the above problems by providing a harassing text message determining method and apparatus that determine harassing text messages as correctly as possible.
To achieve this object, the present invention provides a harassing text message determining method comprising the following steps:
parsing the original text message content to obtain each word and the numeric information therein;
taking the numeric information as a reference, calculating the distance from each word to each number;
describing the text message content with the distances as weights and the words as dimensions, to generate a plurality of corresponding feature vectors;
inputting each feature vector into a classification model to obtain an output result;
determining, based on the output result, whether the text message is a harassing text message.
Specifically, the classification model is a pre-trained model, and it is trained as follows:
parsing the original content of every text message in a sample set to obtain each word and the numeric information therein;
taking the numeric information as a reference, calculating the distance from each word to each number;
describing the content of every text message with the respective distances as weights and the words as dimensions, to generate corresponding training samples;
manually labelling the training samples as positive samples and negative samples;
training the classification model with the positive samples and negative samples.
Further, the specific step of describing the text message content to generate the corresponding feature vectors is: taking each number in turn as the reference and the words before and after it as dimensions, calculating the distance from each of those words to that number, and using the respective distances as weights, so as to generate the plurality of feature vectors describing the text message.
Specifically, the distance from each word to each number is expressed with one word as the unit of distance.
Specifically, the parsing step is as follows:
deleting the specific information in the original text message;
performing grammar-based word segmentation on the text message content to obtain the words, the numbers and their parts of speech;
extracting the numeric information therein.
Specifically, the specific information includes URLs, IP addresses, mobile phone numbers, customer service phone numbers and landline phone numbers.
Preferably, the classification model is trained with the AdaBoost algorithm.
Specifically, the step of determining, based on the output result, whether the text message is a harassing text message is:
when the output result of at least one of the plurality of feature vectors of the text message is correct, determining that the text message is a normal text message;
otherwise, determining that the text message is a harassing text message.
Preferably, an output result of 1 denotes correct and an output result of 0 denotes wrong.
Further, the method also comprises the step of storing text messages that the classification model determines to be harassing text messages in a blacklist.
Further, the method also comprises the step of deleting text messages determined to be harassing text messages from the user's message list.
A harassing text message determining apparatus, comprising:
a parsing module, for parsing the original text message content to obtain each word and the numeric information therein;
a distance calculation module, for calculating, with the numeric information as a reference, the distance from each word to each number;
a feature vector generation module, for describing the text message content with the distances as weights and the words as dimensions, to generate a plurality of corresponding feature vectors;
a classification module, for inputting each feature vector into a classification model to obtain an output result;
a determination module, for determining, based on the output result, whether the text message is a harassing text message.
Specifically, the classification model is a pre-trained model generated by a training module, and the training module performs the following steps:
parsing the original content of every text message in a sample set to obtain each word and the numeric information therein;
taking the numeric information as a reference, calculating the distance from each word to each number;
describing the content of every text message with the respective distances as weights and the words as dimensions, to generate corresponding training samples;
manually labelling the training samples as positive samples and negative samples;
training the classification model with the positive samples and negative samples.
Specifically, the feature vector generation module performs the following step: taking each number in turn as the reference and the words before and after it as dimensions, calculating the distance from each of those words to that number, and using the respective distances as weights, so as to generate the plurality of feature vectors describing the text message.
Specifically, the distance from each word to each number is expressed with one word as the unit of distance.
Specifically, the parsing module performs the following steps:
deleting the specific information in the original text message;
performing grammar-based word segmentation on the text message content to obtain the words, the numbers and their parts of speech;
extracting the numeric information therein.
Specifically, the specific information includes URLs, IP addresses, mobile phone numbers, customer service phone numbers and landline phone numbers.
Preferably, the training module trains the classification model with the AdaBoost algorithm.
Specifically, the determination module performs the following steps:
when the output result of at least one of the plurality of feature vectors of the text message is correct, determining that the text message is a normal text message;
otherwise, determining that the text message is a harassing text message.
Specifically, an output result of 1 denotes correct and an output result of 0 denotes wrong.
Further, the apparatus also comprises a blacklist module, for storing text messages that the classification model determines to be harassing text messages in a blacklist.
Further, the apparatus also comprises a deletion module, for deleting text messages determined to be harassing text messages from the user's message list.
Compared with the prior art, the solution of the present invention has the following advantages:
The present invention performs word segmentation and parsing on the text message content, extracts the numeric information therein, and, taking each number as a reference, describes the text message by the distance from each word to each number so as to generate feature vectors. A pre-trained classification model judges whether each feature vector corresponds to a correct number; if at least one number is correct, the text message is judged to be a normal text message, and otherwise a harassing text message. Determining harassing text messages with the method of the invention makes it possible to decide more accurately whether a text message received by the user's mobile terminal is harassing. In particular, for messages containing meaningful numbers, such as spending notifications pushed by banks and delivery notifications pushed by logistics companies, the probability of misjudging them as harassing text messages is reduced, which further improves the precision of harassing text message determination.
Additional aspects and advantages of the present invention will be set forth in part in the following description, and will in part become apparent from that description or be learned through practice of the invention.
Brief description of the drawings
The above and/or additional aspects and advantages of the present invention will become apparent and easy to understand from the following description of the embodiments in conjunction with the accompanying drawings, in which:
Fig. 1 is a schematic flowchart of the harassing text message determining method of the present invention;
Fig. 2 is a schematic flowchart of the text message parsing step of the present invention;
Fig. 3 is a schematic flowchart of the classification model training step of the present invention;
Fig. 4 is a schematic structural diagram of the harassing text message determining apparatus of the present invention.
Detailed description of the embodiments
Embodiments of the present invention are described in detail below. Examples of the embodiments are shown in the accompanying drawings, in which identical or similar reference numerals denote identical or similar elements, or elements having identical or similar functions, throughout. The embodiments described below with reference to the drawings are exemplary; they serve only to explain the present invention and are not to be construed as limiting it.
Those skilled in the art will appreciate that, unless expressly stated otherwise, the singular forms "a", "an", "said" and "the" used herein may also include the plural forms. It should be further understood that the word "comprising" used in this specification indicates the presence of the stated features, integers, steps, operations, elements and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof. It should be understood that when an element is said to be "connected" or "coupled" to another element, it may be directly connected or coupled to the other element, or intermediate elements may be present. In addition, "connected" or "coupled" as used herein may include wireless connection or wireless coupling. The wording "and/or" used herein includes any one of, and all combinations of, one or more of the associated listed items.
Those skilled in the art will appreciate that, unless otherwise defined, all terms used herein (including technical and scientific terms) have the same meaning as commonly understood by a person of ordinary skill in the field to which the present invention belongs. It should also be understood that terms such as those defined in general dictionaries should be understood to have a meaning consistent with their meaning in the context of the prior art and, unless specifically defined as here, are not to be interpreted in an idealized or overly formal sense.
Those skilled in the art will appreciate that "terminal" and "terminal device" as used herein include both devices having only a wireless signal receiver without transmitting capability and devices having receiving and transmitting hardware capable of two-way communication over a bidirectional communication link. Such devices may include: cellular or other communication devices, with a single-line display, a multi-line display, or no multi-line display; PCS (Personal Communications Service) devices, which may combine voice, data processing, fax and/or data communication capabilities; PDAs (Personal Digital Assistants), which may include a radio frequency receiver, pager, Internet/intranet access, web browser, notepad, calendar and/or GPS (Global Positioning System) receiver; and conventional laptop and/or palmtop computers or other devices that have and/or include a radio frequency receiver. A "terminal" or "terminal device" as used herein may be portable, transportable, installed in a vehicle (air, sea and/or land), or suited and/or configured to run locally and/or in distributed form at any other location on earth and/or in space. A "terminal" or "terminal device" as used herein may also be a communication terminal, an access terminal or a music/video playback terminal, for example a PDA, a MID (Mobile Internet Device) and/or a mobile phone with music/video playback functions, or a device such as a smart TV or set-top box.
Those skilled in the art will appreciate that the remote network devices referred to herein include, but are not limited to, a computer, a network host, a single network server, a set of multiple network servers, or a cloud formed by multiple servers. Here, a cloud is formed by a large number of computers or network servers based on cloud computing, where cloud computing is a form of distributed computing: a super virtual computer made up of a group of loosely coupled computers. In embodiments of the present invention, remote network devices, terminal devices and the WNS server may communicate by any means of communication, including but not limited to mobile communication based on 3GPP, LTE or WIMAX, computer network communication based on the TCP/IP or UDP protocols, and short-range wireless transmission based on the Bluetooth or infrared transmission standards.
As shown in Fig. 1, in order to determine more accurately whether a text message received by the user's mobile terminal is a harassing text message, the present invention provides a harassing text message determining method comprising the following steps:
S11: parsing the original text message content to obtain each word and the numeric information therein;
This embodiment of the invention is mainly used to determine whether text messages containing important numeric information, such as spending and balance notifications sent by banks, are harassing text messages, so the text message content is parsed first to obtain the numeric information in it. As shown in Fig. 2, the parsing process is as follows:
Step 1: deleting the specific information in the original text message;
The specific information includes URLs, IP addresses, mobile phone numbers, customer service phone numbers, landline phone numbers and similar information in the text message. This information is deleted to prevent it from interfering with the useful numeric information, such as the amount spent or a logistics tracking number.
Step 2: performing grammar-based word segmentation on the text message content to obtain the words, the numbers and their parts of speech;
The text message content is segmented according to standard Chinese grammar, so that each word expressing a complete meaning becomes one segment. The content is thereby split into words and one or more numbers; the part of speech of each word is determined, and the part of speech of a number is set to m. If a number, such as an amount of money, contains punctuation, as in 200.00 yuan, the punctuation is deleted so that the number forms a single digit string.
Step 3: extracting the numeric information therein.
In the segmented text message content, part of speech is the distinguishing feature: the items whose part of speech is m are extracted, which yields all of the numeric information.
Thus, word segmentation and parsing of the text message content yields each word and the numeric information for subsequent processing.
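The three parsing steps can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the regular expressions for URLs, IP addresses and phone numbers are hypothetical, and a plain regex tokenizer stands in for the grammar-based Chinese word segmentation with part-of-speech tagging that the description calls for (a real system would need a proper segmenter). Only the tag "m" for numbers comes from the description; the tag "w" for all other words is an assumption.

```python
import re

# Hypothetical patterns for the "specific information" deleted in step 1.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
PHONE_RE = re.compile(
    r"\b\d{3,4}-\d{7,8}\b"        # landline, e.g. 010-12345678
    r"|\b1\d{10}\b"               # 11-digit mobile number
    r"|\b[48]00-?\d{3}-?\d{4}\b"  # customer service number
)
# Stand-in tokenizer: a number (optionally with a decimal part) or a run of letters.
TOKEN_RE = re.compile(r"\d+(?:\.\d+)?|[^\W\d_]+")


def parse_message(text):
    """Return (tokens, number_positions) for one message.

    tokens is a list of (surface, pos) pairs, where pos is "m" for numbers
    and "w" (an assumed placeholder tag) for every other word.
    """
    # Step 1: delete URLs, IP addresses and phone numbers.
    for pattern in (URL_RE, IP_RE, PHONE_RE):
        text = pattern.sub(" ", text)
    # Step 2: split into words and numbers and tag each token.
    tokens = []
    for match in TOKEN_RE.finditer(text):
        surface = match.group()
        if surface[0].isdigit():
            # Remove punctuation inside the number: "200.00" becomes "20000".
            tokens.append((surface.replace(".", ""), "m"))
        else:
            tokens.append((surface, "w"))
    # Step 3: extract the numeric information by its part of speech.
    number_positions = [i for i, (_, pos) in enumerate(tokens) if pos == "m"]
    return tokens, number_positions
```

Deleting the phone number and URL before tokenizing means they can never be picked up as reference numbers later, which is the stated purpose of step 1.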
S12: taking the numeric information as a reference, calculating the distance from each word to each number;
With the one or more numbers extracted in the above step as references, the distance from each word extracted from the text message to each number is calculated, with one word as the unit of distance. That is, the number of words separating each word from each number is counted, and that count is used as the distance from the word to the number.
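A small sketch of this distance calculation is given below. It assumes that a parsed message is a list of (surface, pos) pairs with pos equal to "m" for numbers, and that the distance is simply the difference in token positions, so that the word directly next to a number is at distance 1. The description does not say whether intervening numbers count towards the distance; here every token counts as one unit.

```python
def distances_to_numbers(tokens):
    """For every number token, map each word position to its distance.

    tokens is a list of (surface, pos) pairs with pos == "m" for numbers.
    The result maps number position -> {word position: distance in tokens}.
    """
    number_positions = [i for i, (_, pos) in enumerate(tokens) if pos == "m"]
    return {
        n: {i: abs(i - n) for i, (_, pos) in enumerate(tokens) if pos != "m"}
        for n in number_positions
    }


# Example: two reference numbers give two distance tables.
example = [("spent", "w"), ("20000", "m"), ("yuan", "w"), ("balance", "w"), ("500", "m")]
tables = distances_to_numbers(example)
```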
S13: describing the text message content with the distances as weights and the words as dimensions, to generate a plurality of corresponding feature vectors;
The calculated distance from each word in the text message to each number is used as a weight; it represents how strongly the word influences whether that number is a correct number. Each word is used as a dimension, so the dimensions represent how many words in the message affect the correctness of each number. With the distances as weights and the words as dimensions, a plurality of feature vectors is generated to describe the text message content. Specifically, each number in turn is taken as the reference and the words before and after it as dimensions, the distance from each of those words to that number is calculated, and the respective distances are used as weights, which yields a plurality of different feature vectors, one per reference number.
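The construction of one feature vector per reference number might look like the sketch below. The sparse dictionary representation and the rule that a repeated word keeps its smallest distance are assumptions made for illustration; the description only states that words are the dimensions and distances are the weights.

```python
def feature_vectors(tokens):
    """Build one sparse feature vector per number in a parsed message.

    tokens is a list of (surface, pos) pairs with pos == "m" for numbers.
    Each vector maps a word (dimension) to its distance from the reference
    number (weight). Returns a list of (number_surface, vector) pairs.
    """
    vectors = []
    for n, (surface, pos) in enumerate(tokens):
        if pos != "m":
            continue
        vector = {}
        for i, (word, word_pos) in enumerate(tokens):
            if word_pos == "m":
                continue
            distance = abs(i - n)
            # Assumption: a word that occurs twice keeps its nearest occurrence.
            vector[word] = min(distance, vector.get(word, distance))
        vectors.append((surface, vector))
    return vectors
```

A message with two numbers therefore produces two vectors over the same words but with different weights, which is what lets the classifier treat each number separately.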
S14: inputting each feature vector into the classification model to obtain an output result;
The classification model is a pre-trained model; that is, the classification model is trained on a sample set prepared in advance and is subsequently used to classify text messages and thereby determine whether a text message is a harassing text message. Preferably, the classification model is trained with the AdaBoost algorithm. As shown in Fig. 3, the classification model is trained as follows:
Step 11: parsing the original content of every text message in the sample set to obtain each word and the numeric information therein;
The sample set is a collection of text messages prepared in advance, containing n messages, and the content of every message in it is parsed. The parsing process is as described for step S11 above and is not repeated here. This parsing step yields each word and the numeric information in every message, together with their parts of speech.
Step 12: taking the numeric information as a reference, calculating the distance from each word to each number;
With the one or more numbers extracted in the above step as references, the distance from each word extracted from the text message to each number is calculated, with one word as the unit of distance. That is, the number of words separating each word from each number is counted, and that count is used as the distance from the word to the number. The distances for every text message in the sample set are calculated in this way.
Step 13: describing the content of every text message with the respective distances as weights and the words as dimensions, to generate corresponding training samples;
The calculated distance from each word in the text message to each number is used as a weight; it represents how strongly the word influences whether that number is a correct number. Each word is used as a dimension, so the dimensions represent how many words in the message affect the correctness of each number. With the distances as weights and the words as dimensions, a plurality of feature vectors is generated to describe the text message content. Specifically, each number in turn is taken as the reference and the words before and after it as dimensions, the distance from each of those words to that number is calculated, and the respective distances are used as weights, which yields a plurality of different feature vectors. Here, the vector generated from all the words in a message with one of its numbers as the reference is one feature vector. The content of every text message is described in this way, generating a plurality of feature vectors, that is, a plurality of training samples.
Step 14: manually labelling the training samples as positive samples and negative samples;
The training samples generated above are labelled manually. For example, where the aim is to judge information such as spending amounts and balances sent by banks, the feature vectors generated with reference to numbers denoting an amount, a balance or the like are labelled as positive samples; the feature vectors generated with reference to other numbers in the message, such as a discount rate, are labelled as negative samples. Further, the result for a positive sample is defined as correct, with output value 1, and the result for a negative sample is defined as wrong, with output value 0.
Step 15: training the classification model with the positive samples and negative samples.
With the positive and negative samples as input and their corresponding correct or wrong results as output, training is performed with the AdaBoost algorithm to obtain the classification model.
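To make the training step concrete, the following is a minimal sketch of generic AdaBoost with one-feature threshold classifiers (decision stumps) as weak learners. The description names AdaBoost but does not specify the weak learner, the number of rounds or the vector encoding, so all of those are assumptions here: samples are dense lists of distances over a fixed, hypothetical vocabulary, and labels are 1 (positive, correct) or 0 (negative, wrong).

```python
import math


def train_adaboost(samples, labels, rounds=10):
    """Train AdaBoost over decision stumps.

    samples: list of equal-length numeric lists; labels: list of 0/1.
    Returns a list of (alpha, feature, threshold, polarity) stumps.
    """
    targets = [1 if label == 1 else -1 for label in labels]
    weights = [1.0 / len(samples)] * len(samples)
    model = []
    for _ in range(rounds):
        best = None
        # Exhaustively pick the stump with the lowest weighted error.
        for feature in range(len(samples[0])):
            for threshold in sorted({x[feature] for x in samples}):
                for polarity in (1, -1):
                    preds = [polarity if x[feature] <= threshold else -polarity
                             for x in samples]
                    error = sum(w for w, p, t in zip(weights, preds, targets) if p != t)
                    if best is None or error < best[0]:
                        best = (error, feature, threshold, polarity, preds)
        error, feature, threshold, polarity, preds = best
        error = min(max(error, 1e-10), 1 - 1e-10)  # keep the logarithm finite
        alpha = 0.5 * math.log((1 - error) / error)
        model.append((alpha, feature, threshold, polarity))
        if error <= 1e-10:
            break  # the stump already separates the weighted samples
        # Increase the weight of misclassified samples, then renormalize.
        weights = [w * math.exp(-alpha * t * p)
                   for w, t, p in zip(weights, targets, preds)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model


def predict(model, x):
    """Return 1 (correct number) or 0 (wrong number) for one feature vector."""
    score = sum(alpha * (polarity if x[feature] <= threshold else -polarity)
                for alpha, feature, threshold, polarity in model)
    return 1 if score > 0 else 0
```

As a usage illustration with a made-up two-word vocabulary ("spent", "discount"), vectors such as [1, 5] (number right next to "spent") would be labelled 1 and vectors such as [6, 1] (number right next to "discount") would be labelled 0; the trained model then returns 1 or 0 for each new vector.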
Each feature vector of the text message is used as input, and the trained classification model produces an output result. An output result of 1 denotes correct and an output result of 0 denotes wrong.
S15: determining, based on the output result, whether the text message is a harassing text message.
Whether the text message is a harassing text message is determined from the output results of the classification model. When the feature vectors of the text message are each input into the classification model and at least one of the output results obtained is correct, the text message is a normal text message. Otherwise, the text message is judged to be a harassing text message.
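The decision rule reduces to a single check over the per-vector outputs, as in this sketch. The function name is hypothetical. Note one edge case that the description does not discuss: a message with no numbers yields no feature vectors, and a literal reading of the rule then classifies it as harassing.

```python
def is_harassing(outputs):
    """Apply the decision rule to the classifier outputs of one message.

    outputs holds one result per feature vector: 1 = correct, 0 = wrong.
    The message is normal if at least one result is correct.
    """
    return not any(result == 1 for result in outputs)
```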
After a text message is judged to be a harassing text message, it is intercepted and stored in a blacklist. In other embodiments, after a text message is judged to be a harassing text message, it is deleted from the user's message list, which spares the user the inconvenience caused by harassing text messages and improves the user's experience of communicating by text message.
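The two handling options can be sketched as below. The inbox and blacklist are modelled as plain Python lists and the function name is hypothetical; how a real terminal stores messages is outside what the description specifies.

```python
def handle_harassing(message, inbox, blacklist, delete=False):
    """Take a message judged harassing out of the user's message list.

    With delete=False the message is stored in the blacklist;
    with delete=True it is simply discarded.
    """
    if message in inbox:
        inbox.remove(message)
    if not delete:
        blacklist.append(message)
```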
As shown in Fig. 4, to further explain the harassing text message determining method of the present invention, the method is described in modular form: a harassing text message determining apparatus is provided, comprising a parsing module 11, a distance calculation module 12, a feature vector generation module 13, a classification module 14 and a determination module 15, as well as a training module 16, a blacklist module 17 and a deletion module 18 in some variant embodiments, wherein:
Parsing module 11: for parsing the original text message content to obtain each word and the numeric information therein;
This embodiment of the invention is mainly used to determine whether text messages containing important numeric information, such as spending and balance notifications sent by banks, are harassing text messages, so the parsing module 11 first parses the text message content to obtain the numeric information in it. The parsing module performs the following steps:
Step 1: deleting the specific information in the original text message;
Step 2: performing grammar-based word segmentation on the text message content to obtain the words, the numbers and their parts of speech;
Step 3: extracting the numeric information therein.
Thus, the parsing module 11 performs word segmentation and parsing on the text message content and extracts each word and the numeric information for subsequent processing.
Distance calculation module 12: for calculating, with the numeric information as a reference, the distance from each word to each number;
With the one or more numbers extracted by the parsing module 11 as references, the distance calculation module 12 calculates the distance from each word extracted from the text message by the parsing module 11 to each number, with one word as the unit of distance. That is, the number of words separating each word from each number is counted, and that count is used as the distance from the word to the number.
Feature vector generation module 13: for describing the text message content with the distances as weights and the words as dimensions, to generate a plurality of corresponding feature vectors;
The distance from each word in the text message to each number, as calculated by the distance calculation module 12, is used as a weight; it represents how strongly the word influences whether that number is a correct number. Each word is used as a dimension, so the dimensions represent how many words in the message affect the correctness of each number. With the distances as weights and the words as dimensions, the feature vector generation module 13 generates a plurality of feature vectors to describe the text message content. Specifically, each number in turn is taken as the reference and the words before and after it as dimensions, the distance from each of those words to that number is calculated, and the respective distances are used as weights, which yields a plurality of different feature vectors.
Classification module 14: configured to input each feature vector into a classification model to obtain an output result.
The classification model is a model trained in advance: it is trained on a prepared sample set so that messages can subsequently be classified with it, thereby determining whether a message is a harassing message. Preferably, the training module 16 trains the classification model using the AdaBoost algorithm. The training module 16 trains the classification model through the following steps:
Step 11: parse the original content of each message in the sample set to obtain each word and the numeric information it contains;
The sample set is a prepared collection of n messages, and the content of each message in it is parsed. The parsing process is as described in step S11 above and is not repeated here. This parsing step yields each word and the numeric information in each message, together with their corresponding parts of speech.
Step 12: calculate, with the numeric information as the reference, the distance from each word to each number;
Taking the one or more pieces of numeric information extracted in the preceding step as the reference, the distance from each word extracted in the preceding step to each piece of numeric information is calculated, where the distance is measured with one word as the unit distance. That is, the number of words from each word to each number is counted, and this count is taken as the distance from that word to that number. The distances for every message in the sample set are calculated in this way.
Step 13: describe the content of each message with the corresponding distances as weights and the words as dimensions, so as to generate corresponding training samples;
The calculated distance from each word in the message to each number in the message is used as a weight; it characterizes the degree to which that word influences whether the number is correct. Each word is used as a dimension; the dimensions characterize how many words in the message influence the correctness of each number. With the distances as weights and the words as dimensions, a plurality of feature vectors is generated to describe the message content. Specifically, each number is taken in turn as the reference, the words before and after it are taken as dimensions, the distance from each of those words to the number is calculated and used as the corresponding weight, and a plurality of different feature vectors is thereby generated. Here, for each message, the feature vector generated from all of its words with one of its numbers as the reference counts as one feature vector. Every message is described in this way, yielding a plurality of feature vectors, that is, a plurality of training samples.
Step 14: manually label the training samples as positive samples and negative samples;
The training samples generated above are labeled manually. For example, to identify information such as spending amounts and balances sent by banks and similar institutions, the feature vectors generated with a number that represents an amount, a balance, or the like as the reference are labeled as positive samples; the feature vectors generated with other numbers in the message, such as a discount rate, as the reference are labeled as negative samples. Further, the result for a positive sample is defined as correct, with an output value of 1, and the result for a negative sample is defined as incorrect, with an output value of 0.
Step 15: train the classification model using the positive samples and negative samples.
With the positive and negative samples as input and their corresponding correct or incorrect results as output, training is performed with the AdaBoost algorithm to obtain the classification model.
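As an illustration of step 15, the following is a small pure-Python AdaBoost sketch. The source names AdaBoost but gives no further detail, so everything else here is an assumption: the weak learner (one-feature threshold stumps), the dense vector layout over a two-word vocabulary `["balance", "discount"]` with 0 meaning the word is absent, the toy training data, and the function names.

```python
import math


def stump_predict(stump, x):
    """Predict +1 or -1 with a one-feature threshold rule."""
    feature, threshold, polarity = stump
    return polarity if x[feature] <= threshold else -polarity


def train_adaboost(X, y, rounds=10):
    """Train AdaBoost on dense vectors X with labels y (1 or 0).

    Returns a list of (alpha, stump) pairs.
    """
    signs = [1 if label == 1 else -1 for label in y]
    weights = [1.0 / len(X)] * len(X)
    model = []
    for _ in range(rounds):
        # Pick the stump with the lowest weighted error.
        best = None
        for feature in range(len(X[0])):
            for threshold in sorted({x[feature] for x in X}):
                for polarity in (1, -1):
                    stump = (feature, threshold, polarity)
                    error = sum(w for w, x, s in zip(weights, X, signs)
                                if stump_predict(stump, x) != s)
                    if best is None or error < best[0]:
                        best = (error, stump)
        error, stump = best
        error = min(max(error, 1e-10), 1 - 1e-10)  # avoid log(0)
        alpha = 0.5 * math.log((1 - error) / error)
        model.append((alpha, stump))
        if error <= 1e-10:  # a perfect stump; nothing left to correct
            break
        # Re-weight so misclassified samples count more in the next round.
        weights = [w * math.exp(-alpha * s * stump_predict(stump, x))
                   for w, x, s in zip(weights, X, signs)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model


def classify(model, x):
    """Return 1 (correct) or 0 (incorrect), matching the labels in step 14."""
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in model)
    return 1 if score > 0 else 0


# Toy samples: [distance to "balance", distance to "discount"], 0 = absent.
X = [[1, 0], [2, 0], [1, 0], [0, 1], [0, 2], [5, 1]]
y = [1, 1, 1, 0, 0, 0]
model = train_adaboost(X, y)
print([classify(model, x) for x in X])
```

In practice a library implementation of AdaBoost would likely replace this hand-written loop; the sketch only shows how labeled feature vectors map to a 1/0 output.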
The classification module 14 takes each feature vector corresponding to the message as input and obtains an output result from the trained classification model. An output result of 1 means correct, and an output result of 0 means incorrect.
Discrimination module 15: configured to determine, based on the output results, whether the message is a harassing message.
Based on the output results of the classification model, the discrimination module 15 determines whether the message is a harassing message. Specifically, when the plurality of feature vectors corresponding to the message are each input into the classification model and at least one of the output results obtained is correct, the message is a normal message. Otherwise, the message is judged to be a harassing message.
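The decision rule above reduces to a single check over the classifier outputs. A minimal sketch, assuming the outputs arrive as a list of 1/0 values (the function name `is_harassing` is hypothetical):

```python
def is_harassing(outputs):
    """Apply the discrimination rule to per-vector classifier outputs.

    outputs holds one result per feature vector: 1 (correct) or 0
    (incorrect). The message is normal if at least one output is 1.
    """
    return not any(result == 1 for result in outputs)


print(is_harassing([0, 1]))  # one number looks like an amount or balance
print(is_harassing([0, 0]))  # no number was classified as correct
```

Note that under this literal reading an empty list (a message with no numbers) is also reported as harassing; the source does not say how such messages are handled.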
Once a message is judged to be a harassing message, the blacklist module 17 intercepts it and stores it in a blacklist. In other embodiments, after a message is judged to be a harassing message, the deletion module 18 deletes it from the user's message list, which spares the user the inconvenience caused by harassing messages and improves the user's experience of communicating by text message.
In summary, the harassing message discrimination method or apparatus of the present invention can identify harassing messages more accurately, increasing the probability that harassing messages are intercepted while reducing the probability that messages useful to the user are intercepted.
The above are only some embodiments of the present invention. It should be noted that those of ordinary skill in the art may make improvements and modifications without departing from the principles of the invention, and such improvements and modifications shall also fall within the scope of protection of the present invention.

Claims (10)

1. A harassing message discrimination method, characterized by comprising the following steps:
parsing the original message content to obtain each word and the numeric information therein;
calculating, with the numeric information as the reference, the distance from each word to each number;
describing the message content with the distances as weights and the words as dimensions, to generate a corresponding plurality of feature vectors;
inputting each feature vector into a classification model to obtain an output result;
determining, based on the output result, whether the message is a harassing message.
2. The method according to claim 1, characterized in that the classification model is a model trained in advance, and its training steps are as follows:
parsing the original content of each message in a sample set to obtain each word and the numeric information therein;
calculating, with the numeric information as the reference, the distance from each word to each number;
describing the content of each message with the corresponding distances as weights and the words as dimensions, to generate corresponding training samples;
manually labeling the training samples as positive samples and negative samples;
training the classification model using the positive samples and negative samples.
3. The method according to claim 1, characterized in that the specific steps of describing the message content to generate the corresponding feature vectors are: taking each number in turn as the reference and the words before and after it as dimensions, calculating the distance from each of those words to the number, and using the corresponding distance as the weight, to generate a plurality of feature vectors describing the message.
4. The method according to any one of claims 1 to 3, characterized in that the distance from each word to each number is measured with one word as the unit distance.
5. The method according to claim 1 or 2, characterized in that the parsing step is specifically as follows:
deleting specific information from the original message;
segmenting the message content into words based on grammar, to obtain the words and numbers in the message and their corresponding parts of speech;
extracting the numeric information therein.
6. A harassing message discrimination apparatus, characterized by comprising:
a parsing module, configured to parse the original message content to obtain each word and the numeric information therein;
a distance calculation module, configured to calculate, with the numeric information as the reference, the distance from each word to each number;
a feature vector generation module, configured to describe the message content with the distances as weights and the words as dimensions, to generate a corresponding plurality of feature vectors;
a classification module, configured to input each feature vector into a classification model to obtain an output result;
a discrimination module, configured to determine, based on the output result, whether the message is a harassing message.
7. The apparatus according to claim 6, characterized in that the classification model is a model trained in advance and is generated through training by a training module, the training module performing the following steps:
parsing the original content of each message in a sample set to obtain each word and the numeric information therein;
calculating, with the numeric information as the reference, the distance from each word to each number;
describing the content of each message with the corresponding distances as weights and the words as dimensions, to generate corresponding training samples;
manually labeling the training samples as positive samples and negative samples;
training the classification model using the positive samples and negative samples.
8. The apparatus according to claim 6, characterized in that the specific steps performed by the feature vector generation module are: taking each number in turn as the reference and the words before and after it as dimensions, calculating the distance from each of those words to the number, and using the corresponding distance as the weight, to generate a plurality of feature vectors describing the message.
9. The apparatus according to any one of claims 6 to 8, characterized in that the distance from each word to each number is measured with one word as the unit distance.
10. The apparatus according to claim 6, characterized in that the steps performed by the parsing module are specifically as follows:
deleting specific information from the original message;
segmenting the message content into words based on grammar, to obtain the words and numbers in the message and their corresponding parts of speech;
extracting the numeric information therein.
CN201510784065.4A 2015-11-16 2015-11-16 Harassing text message determining method and apparatus Active CN105404670B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510784065.4A CN105404670B (en) 2015-11-16 2015-11-16 Harassing text message determining method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510784065.4A CN105404670B (en) 2015-11-16 2015-11-16 Harassing text message determining method and apparatus

Publications (2)

Publication Number Publication Date
CN105404670A (en) 2016-03-16
CN105404670B (en) 2018-09-25

Family

ID=55470159

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510784065.4A Active CN105404670B (en) 2015-11-16 2015-11-16 Harassing text message determining method and apparatus

Country Status (1)

Country Link
CN (1) CN105404670B (en)

Also Published As

Publication number Publication date
CN105404670B (en) 2018-09-25

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220727

Address after: Room 801, 8th floor, No. 104, floors 1-19, building 2, yard 6, Jiuxianqiao Road, Chaoyang District, Beijing 100015

Patentee after: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Address before: 100088 room 112, block D, 28 new street, new street, Xicheng District, Beijing (Desheng Park)

Patentee before: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Patentee before: Qizhi software (Beijing) Co.,Ltd.

TR01 Transfer of patent right