CN102572744A - Recognition feature library acquisition method and device as well as short message identification method and device - Google Patents

Publication number
CN102572744A
Authority
CN
China
Prior art keywords
short message
sample set
short
character string
message
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2010106022631A
Other languages
Chinese (zh)
Other versions
CN102572744B (en
Inventor
万狄飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Group Design Institute Co Ltd
Original Assignee
China Mobile Group Design Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Group Design Institute Co Ltd filed Critical China Mobile Group Design Institute Co Ltd
Priority to CN201010602263.1A priority Critical patent/CN102572744B/en
Publication of CN102572744A publication Critical patent/CN102572744A/en
Application granted granted Critical
Publication of CN102572744B publication Critical patent/CN102572744B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Character Discrimination (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention provides a recognition feature library acquisition method and device, and a short message identification method and device. The recognition feature library acquisition method comprises the following steps: forming a sample set from a plurality of short messages that come from a user and have predetermined message types; performing character string extraction on each short message in the sample set to obtain a first character string set in which all character strings are different; for each character string in the first character string set, counting the number of short messages in the sample set that contain the character string; calculating, from the counting result, the mutual information of each character string with respect to the short message categories; and selecting some or all of the character strings in the first character string set, in descending order of mutual information, to form a recognition feature library. The efficiency of short message identification is thereby improved.

Description

Recognition feature library acquisition method and device, and short message identification method and device
Technical field
The present invention relates to short message identification technology for communication networks, and in particular to a recognition feature library acquisition method and device, and a short message identification method and device.
Background
China's Ministry of Industry and Information Technology has long paid close attention to the control of junk short messages. It requires every operator to carry out targeted self-inspection, to standardize its management practices, and to firmly stop all kinds of illegal behavior and behavior that infringes user rights. For operators and regulators, controlling junk short messages requires technical means in addition to strict management.
Opinions differ on what counts as a junk short message. Content that opposes the party or the state, undermines national stability and unity, or is pornographic and harmful to public morals is always junk. Beyond that, whether a message with given content is junk varies from person to person, especially for commercial advertising messages.
In the prior art, junk short message interception on the operator side can only block messages whose content opposes the party or the state, undermines national stability and unity, or is pornographic and harmful to public morals, together with messages from senders whose sending volume exceeds a threshold. It cannot perform distinctive, personalized interception from the point of view of the individual mobile phone user. If a unified standard is applied, some users will have wanted messages deleted by mistake while others will still receive messages they regard as junk, so processing efficiency is low, as illustrated below.
Suppose user A strongly dislikes a certain artist X while user B strongly likes X, and a news item about X is pushed to users by short message. With a unified discrimination standard, either the news item is judged to be a junk short message and is not delivered, so a message that B wants is deleted by mistake, or the news item is delivered to both A and B although for A it is a junk short message. Both approaches are inefficient.
Summary of the invention
The object of the present invention is to provide a recognition feature library acquisition method and device, and a short message identification method and device, which improve the efficiency of short message identification.
To achieve this object, an embodiment of the invention provides a recognition feature library acquisition method, comprising:
forming a sample set from a plurality of short messages that come from a user and have predetermined message types;
performing character string extraction on each short message in the sample set to obtain a first character string set, the character strings in the first character string set all being different;
for each character string in the first character string set, counting the number of short messages in the sample set that contain the character string;
calculating, from the counting result, the mutual information of each character string with respect to the short message categories;
selecting some or all of the character strings from the first character string set, in descending order of mutual information, to form the recognition feature library.
To achieve this object, an embodiment of the invention provides a recognition feature library acquisition device, comprising:
a sample set generation module, configured to form a sample set from a plurality of short messages that come from a user and have predetermined message types;
a first character string extraction module, configured to perform character string extraction on each short message in the sample set to obtain a first character string set, the character strings in the first character string set all being different;
a statistics module, configured to count, for each character string in the first character string set, the number of short messages in the sample set that contain the character string;
a mutual information calculation module, configured to calculate, from the counting result, the mutual information of each character string with respect to the short message categories;
a character string selection module, configured to select some or all of the character strings from the first character string set, in descending order of mutual information, to form the recognition feature library.
The mutual information MI of a character string with respect to the short message categories is as follows:
$$MI(t_m, c_i) = \sum_{i=1}^{n} P(t_m, c_i) \log \frac{P(t_m, c_i)}{P(t_m) P(c_i)}$$
where:
$t_m$ denotes the m-th character string in the first character string set, m = 1, ..., L, where L is the number of character strings recorded in the first character string set;
$c_i$ denotes the i-th category among the predefined short message categories;
$MI(t_m, c_i)$ denotes the mutual information between $t_m$ and category $c_i$;
$P(t_m)$ denotes the ratio of the number of short messages in the sample set that contain the character string $t_m$ to the number of short messages in the sample set;
$P(c_i)$ denotes the ratio of the number of short messages in the sample set whose category is $c_i$ to the number of short messages in the sample set;
$P(t_m, c_i)$ denotes the ratio of the number of short messages in the sample set that contain the character string $t_m$ and whose category is $c_i$ to the number of short messages in the sample set.
To achieve this object, an embodiment of the invention provides a short message identification method that uses the above recognition feature library, comprising:
obtaining a short message to be identified, and performing character string extraction on it to obtain a second character string set;
selecting from the recognition feature library the character strings that are included in the second character string set to form a third character string set;
determining the coordinates (x, y) of the short message to be identified in a coordinate system according to the message type distribution of first short messages in the sample set, the first short messages being the short messages in the sample set that contain a character string in the third character string set;
judging whether the short message to be identified is a junk short message according to the position of the coordinates (x, y) relative to a standard line in the coordinate system; the standard line is determined according to the type information of the short messages in the sample set and the message type distribution of second short messages, the second short messages being the short messages in the sample set that contain a character string in the feature library.
In the above short message identification method, the standard line is $x - y + Con = 0$, where:
$$Con = \log \frac{P\{c_1\}}{P\{c_2\}} + \sum_{k=1}^{M} \frac{\log(1 - p_{k1})}{\log(1 - p_{k2})}$$
$P\{c_1\}$ denotes the ratio of the number of short messages in the sample set whose message type is predetermined as junk short message to the number of short messages in the sample set;
$P\{c_2\}$ denotes the ratio of the number of short messages in the sample set whose message type is predetermined as normal short message to the number of short messages in the sample set;
$p_{k1}$ denotes the probability that a short message in the sample set containing the k-th feature in the recognition feature library is a junk short message;
$p_{k2}$ denotes the probability that a short message in the sample set containing the k-th feature in the recognition feature library is a normal short message;
k = 1, ..., M, where M is the number of character strings recorded in the recognition feature library.
In the above short message identification method, the standard line comprises a first standard line and a second standard line. The first standard line is $x - y + Con = 0$ and the second standard line is $\alpha x - y + \beta \, Con = 0$, where:
$$Con = \log \frac{P\{c_1\}}{P\{c_2\}} + \sum_{k=1}^{M} \frac{\log(1 - p_{k1})}{\log(1 - p_{k2})}$$
$P\{c_1\}$ denotes the ratio of the number of short messages in the sample set whose message type is predetermined as junk short message to the number of short messages in the sample set;
$P\{c_2\}$ denotes the ratio of the number of short messages in the sample set whose message type is predetermined as normal short message to the number of short messages in the sample set;
$p_{k1}$ denotes the probability that a short message in the sample set containing the k-th feature in the recognition feature library is a junk short message;
$p_{k2}$ denotes the probability that a short message in the sample set containing the k-th feature in the recognition feature library is a normal short message;
k = 1, ..., M, where M is the number of character strings recorded in the recognition feature library;
$\alpha$ is a rotation factor and $\beta$ is a translation factor.
Judging whether the short message to be identified is a junk short message according to the position of the coordinates (x, y) relative to the standard line in the coordinate system specifically comprises:
judging whether the coordinate point (x, y) lies in an unreliable region, the unreliable region being the region formed by the coordinate points whose distance to the first standard line lies within a predetermined interval;
when the coordinate point (x, y) lies in the unreliable region, judging whether the short message to be identified is a junk short message according to the position of the coordinates (x, y) relative to the second standard line; otherwise, judging according to the position of the coordinates (x, y) relative to the first standard line.
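The two-line decision can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation: the function name is hypothetical, the "predetermined interval" is taken to be distances from 0 up to `max_distance`, and the sign convention (a non-negative left-hand side means junk) is taken from mode one of the description and assumed to carry over to the second line.

```python
import math


def classify_two_lines(x, y, con_value, alpha, beta, max_distance):
    """Return True for junk, False for normal.

    Points within max_distance of the first line x - y + Con = 0 are treated
    as unreliable and re-judged with the second line alpha*x - y + beta*Con = 0.
    """
    first = x - y + con_value
    # Point-to-line distance for the line x - y + Con = 0.
    distance = abs(first) / math.sqrt(2)
    if distance <= max_distance:
        # Unreliable region: fall back to the second standard line.
        return alpha * x - y + beta * con_value >= 0
    return first >= 0


# Hypothetical coordinates and factors, chosen only to exercise both branches.
far = classify_two_lines(5.0, 0.0, 1.0, alpha=0.5, beta=0.5, max_distance=1.0)
near = classify_two_lines(0.5, 1.0, 1.0, alpha=0.5, beta=0.5, max_distance=1.0)
```

In this toy run the first point is far from the first line and is judged by it, while the second point falls in the unreliable region and is judged by the second line, which reverses the verdict the first line alone would give.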
In the above short message identification method:
$$F = \frac{(\mu + 1) P R}{\mu P + R}$$
$$P = A / B$$
$$R = A / C$$
A is the number of short messages in the sample set that are correctly identified as junk short messages when the second standard line is used for discrimination; B is the number of short messages in the sample set that are identified as junk short messages when the second standard line is used for discrimination; C is the number of short messages in the sample set that are predetermined as junk short messages;
$\mu$ is an importance adjustment factor, and $\mu$ is greater than or equal to 0;
the values of $\alpha$ and $\beta$ are the values that maximize F.
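One way to choose α and β is an exhaustive search over candidate values, sketched below in Python. The source only states that α and β maximize F; the grid search, the function names, the candidate grids and the convention that a non-negative left-hand side means junk are all assumptions made for illustration.

```python
def f_score(a, b, c, mu):
    """F = (mu + 1) * P * R / (mu * P + R) with P = A / B and R = A / C."""
    if a == 0 or b == 0 or c == 0:
        return 0.0  # assumption: score 0 when nothing is correctly identified
    p, r = a / b, a / c
    return (mu + 1) * p * r / (mu * p + r)


def best_alpha_beta(points, labels, con_value, alphas, betas, mu=1.0):
    """Grid search for the (alpha, beta) pair that maximizes F on the sample set.

    points -- list of (x, y) coordinates of sample messages
    labels -- list of booleans, True where the user marked the message as junk
    """
    c = sum(labels)  # C: messages predetermined as junk
    best = (None, None, -1.0)
    for alpha in alphas:
        for beta in betas:
            flagged = [alpha * x - y + beta * con_value >= 0 for x, y in points]
            b = sum(flagged)  # B: messages identified as junk
            a = sum(f and lab for f, lab in zip(flagged, labels))  # A: correct
            score = f_score(a, b, c, mu)
            if score > best[2]:
                best = (alpha, beta, score)
    return best
```

Ties are resolved in favor of the first candidate pair encountered, which is a design choice of this sketch rather than something the source specifies.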
The above short message identification method further comprises, after the short message to be identified is obtained:
judging whether the calling number of the short message to be identified is present in a contact list or a blacklist;
when the calling number is present in the contact list, saving the short message to be identified directly to the inbox and ending;
when the calling number is present in the blacklist, saving the short message to be identified directly to the trash folder and ending;
when the calling number is present in neither the contact list nor the blacklist, proceeding to the step of performing character string extraction on the short message to be identified.
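The number-based pre-check above amounts to a three-way dispatch, sketched here in Python. The function name, the return labels and the phone numbers are hypothetical.

```python
def prefilter(sender, contacts, blacklist):
    """Return 'inbox', 'trash', or 'classify' for an incoming message."""
    if sender in contacts:
        return "inbox"     # known contact: save directly and stop
    if sender in blacklist:
        return "trash"     # blacklisted number: save to the trash folder and stop
    return "classify"      # unknown number: continue with character string extraction


contacts = {"13800000001"}     # hypothetical numbers
blacklist = {"13800000002"}
```

Only messages that return `"classify"` need the character string extraction and coordinate computation, so the check saves work for senders whose status is already known.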
To achieve this object, an embodiment of the invention provides a short message identification device that uses the above recognition feature library, comprising:
a second character string extraction module, configured to obtain a short message to be identified and perform character string extraction on it to obtain a second character string set;
a set generation module, configured to select from the recognition feature library the character strings that are included in the second character string set to form a third character string set;
a coordinate determination module, configured to determine the coordinates (x, y) of the short message to be identified in a coordinate system according to the message type distribution of first short messages in the sample set, the first short messages being the short messages in the sample set that contain a character string in the third character string set;
a recognition processing module, configured to judge whether the short message to be identified is a junk short message according to the position of the coordinates (x, y) relative to a standard line in the coordinate system; the standard line is determined according to the type information of the short messages in the sample set and the message type distribution of second short messages, the second short messages being the short messages in the sample set that contain a character string in the feature library.
In the above short message identification device, the standard line is $x - y + Con = 0$, where:
$$Con = \log \frac{P\{c_1\}}{P\{c_2\}} + \sum_{k=1}^{M} \frac{\log(1 - p_{k1})}{\log(1 - p_{k2})}$$
$P\{c_1\}$ denotes the ratio of the number of short messages in the sample set whose message type is predetermined as junk short message to the number of short messages in the sample set;
$P\{c_2\}$ denotes the ratio of the number of short messages in the sample set whose message type is predetermined as normal short message to the number of short messages in the sample set;
$p_{k1}$ denotes the probability that a short message in the sample set containing the k-th feature in the recognition feature library is a junk short message;
$p_{k2}$ denotes the probability that a short message in the sample set containing the k-th feature in the recognition feature library is a normal short message;
k = 1, ..., M, where M is the number of character strings recorded in the recognition feature library.
In the above short message identification device, the standard line comprises a first standard line and a second standard line. The first standard line is $x - y + Con = 0$ and the second standard line is $\alpha x - y + \beta \, Con = 0$, where:
$$Con = \log \frac{P\{c_1\}}{P\{c_2\}} + \sum_{k=1}^{M} \frac{\log(1 - p_{k1})}{\log(1 - p_{k2})}$$
$P\{c_1\}$ denotes the ratio of the number of short messages in the sample set whose message type is predetermined as junk short message to the number of short messages in the sample set;
$P\{c_2\}$ denotes the ratio of the number of short messages in the sample set whose message type is predetermined as normal short message to the number of short messages in the sample set;
$p_{k1}$ denotes the probability that a short message in the sample set containing the k-th feature in the recognition feature library is a junk short message;
$p_{k2}$ denotes the probability that a short message in the sample set containing the k-th feature in the recognition feature library is a normal short message;
k = 1, ..., M, where M is the number of character strings recorded in the recognition feature library;
$\alpha$ is a rotation factor and $\beta$ is a translation factor. The recognition processing module specifically comprises:
a judging unit, configured to judge whether the coordinate point (x, y) lies in an unreliable region, the unreliable region being the region formed by the coordinate points whose distance to the first standard line lies within a predetermined interval;
a classification and identification unit, configured to judge, when the coordinate point (x, y) lies in the unreliable region, whether the short message to be identified is a junk short message according to the position of the coordinates (x, y) relative to the second standard line, and otherwise to judge according to the position of the coordinates (x, y) relative to the first standard line.
In the above short message identification device:
$$F = \frac{(\mu + 1) P R}{\mu P + R}$$
$$P = A / B$$
$$R = A / C$$
A is the number of short messages in the sample set that are correctly identified as junk short messages when the second standard line is used for discrimination; B is the number of short messages in the sample set that are identified as junk short messages when the second standard line is used for discrimination; C is the number of short messages in the sample set that are predetermined as junk short messages;
$\mu$ is an importance adjustment factor, and $\mu$ is greater than or equal to 0;
the values of $\alpha$ and $\beta$ are the values that maximize F.
The embodiments of the invention have the following beneficial effects:
In the embodiments of the invention, the coordinates of the short message to be identified in a coordinate system are determined from the messages in the sample set and used for discrimination. Because the short messages in the sample set come from the user, and their message types (that is, whether each is a junk short message) are determined by the user in advance, the embodiments can serve different individual users and provide personalized junk short message interception, and can therefore improve the efficiency of short message identification.
Description of drawings
Fig. 1 is a schematic flow chart of the recognition feature library acquisition method of an embodiment of the invention;
Fig. 2 is a schematic flow chart of the short message identification method of an embodiment of the invention;
Fig. 3 is a schematic diagram of the unreliable region.
Detailed description of the embodiments
In the recognition feature library acquisition method and device and the short message identification method and device of the embodiments of the invention, short messages whose types have been determined and reported by the user form an analysis sample set, and a junk short message feature library for that user is obtained from this sample set. A naive Bayes model is then used to discriminate the short message to be identified. Because the junk short message feature library is obtained by analyzing short messages whose types the user has reported, distinctive, personalized junk short message interception can be provided for the individual user.
As shown in Fig. 1, the recognition feature library acquisition method of the embodiment of the invention comprises:
Step 11: forming a sample set from a plurality of short messages that come from a user and have predetermined message types;
Step 12: performing character string extraction on each short message in the sample set to obtain a first character string set, the character strings in the first character string set all being different;
Step 13: for each character string in the first character string set, counting the number of short messages in the sample set that contain the character string;
Step 14: calculating, from the counting result, the mutual information of each character string with respect to the short message categories;
Step 15: selecting some or all of the character strings from the first character string set, in descending order of mutual information, to form the recognition feature library.
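The counting in step 13 can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the function names, the fixed two-character window and the toy sample set are assumptions.

```python
from collections import Counter


def extract_strings(text, n=2):
    """Distinct n-character windows of a message (hypothetical helper)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def count_messages_containing(sample_set, n=2):
    """Step 13: for each distinct string, count the messages that contain it."""
    counts = Counter()
    for text, _label in sample_set:
        # A set is used so that each message contributes at most 1 per string.
        counts.update(extract_strings(text, n))
    return counts


# Toy sample set of (text, user-assigned type) pairs; purely illustrative.
sample_set = [("buy now", "junk"), ("buy milk", "normal"), ("call now", "junk")]
counts = count_messages_containing(sample_set)
```

The keys of `counts` play the role of the first character string set, since a `Counter` holds each distinct string once.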
In a specific embodiment of the invention, the mutual information MI of each character string with respect to the short message categories is calculated from the counting result, using the following formula:
$$MI(t_m, c_i) = \sum_{i=1}^{n} P(t_m, c_i) \log \frac{P(t_m, c_i)}{P(t_m) P(c_i)}$$
where:
$t_m$ denotes the m-th character string in the first character string set, m = 1, ..., L, where L is the number of character strings recorded in the first character string set;
$c_i$ denotes the i-th category among the predefined short message categories, for example the two categories junk short message and normal short message;
$MI(t_m, c_i)$ denotes the mutual information between $t_m$ and category $c_i$;
$P(t_m)$ denotes the ratio of the number of short messages in the sample set that contain the character string $t_m$ to the number of short messages in the sample set. For example, if the sample set contains 5 short messages and the character string "XX" occurs in 3 of them, then $P(t_m)$ is 3/5;
$P(c_i)$ denotes the ratio of the number of short messages in the sample set whose category is $c_i$ to the number of short messages in the sample set. For example, if the sample set contains 5 short messages and 3 of them are predetermined as the junk type $c_1$, then $P(c_1)$ is 3/5;
$P(t_m, c_i)$ denotes the ratio of the number of short messages in the sample set that contain the character string $t_m$ and whose category is $c_i$ to the number of short messages in the sample set. For example, if the sample set contains 5 short messages, 3 of them contain the character string $t_m$, and 1 of those 3 belongs to the junk type $c_1$, then $P(t_m, c_1)$ is 1/5.
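As a worked check of the formula, the following Python sketch computes MI from raw counts, using the numbers of the examples above. The function name and argument layout are hypothetical, the natural logarithm is assumed because the source does not state a base, and a zero joint probability is assumed to contribute nothing to the sum.

```python
import math


def mutual_information(n_total, n_t, class_counts, joint_counts):
    """MI of one string t over all classes, following the formula above.

    n_total      -- number of messages in the sample set
    n_t          -- number of messages containing t
    class_counts -- {class: number of messages of that class}
    joint_counts -- {class: number of messages of that class containing t}
    """
    p_t = n_t / n_total
    mi = 0.0
    for c, n_c in class_counts.items():
        p_c = n_c / n_total
        p_tc = joint_counts.get(c, 0) / n_total
        if p_tc > 0:  # assumption: a zero joint probability contributes 0
            mi += p_tc * math.log(p_tc / (p_t * p_c))
    return mi


# 5 messages, 3 contain the string, 3 are junk, and 1 of the messages
# containing the string is junk; the "normal" counts follow by subtraction.
mi = mutual_information(
    n_total=5, n_t=3,
    class_counts={"junk": 3, "normal": 2},
    joint_counts={"junk": 1, "normal": 2},
)
```

With these counts the two terms are 0.2·log(0.2/0.36) and 0.4·log(0.4/0.24), so the string scores a small positive MI because it occurs disproportionately in normal messages.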
The recognition feature library acquisition device of the embodiment of the invention comprises:
a sample set generation module, configured to form a sample set from a plurality of short messages that come from a user and have predetermined message types;
a character string extraction module, configured to perform character string extraction on each short message in the sample set to obtain a first character string set, the character strings in the first character string set all being different;
a statistics module, configured to count, for each character string in the first character string set, the number of short messages in the sample set that contain the character string;
a mutual information calculation module, configured to calculate, from the counting result, the mutual information of each character string with respect to the short message categories;
a character string selection module, configured to select some or all of the character strings from the first character string set, in descending order of mutual information, to form the recognition feature library.
In a specific embodiment of the invention, classification capability increases with the number of character strings in the recognition feature library, but the relationship is not linear. When the total number of character strings is small, adding character strings clearly strengthens classification capability. Once the total exceeds a certain threshold, adding more character strings no longer strengthens it significantly, while the amount of computation required for classification keeps growing. Therefore, in a specific embodiment of the invention, the number of character strings (features) in the recognition feature library can be limited to a certain scale.
For example, when the increase in classification capability (such as classification accuracy) brought by adding a certain number of character strings falls below a preset threshold, no more character strings are added to the recognition feature library.
Of course, if recognition capability is to be maximized, or if the amount of computation is not a concern, the number of character strings in the recognition feature library need not be limited.
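The stopping rule described above might look like the following Python sketch. It is only an illustration: the function name is hypothetical, strings are added one at a time, and `evaluate` stands for whatever accuracy measurement an implementation uses, replaced here by a toy function with diminishing returns.

```python
def grow_feature_library(ranked_strings, evaluate, min_gain):
    """Add strings in descending-MI order until the gain drops below min_gain.

    ranked_strings -- strings already sorted by mutual information, largest first
    evaluate       -- hypothetical callback: list of strings -> accuracy in [0, 1]
    min_gain       -- preset threshold on the accuracy gain per added string
    """
    library = []
    best = 0.0
    for s in ranked_strings:
        score = evaluate(library + [s])
        if library and score - best < min_gain:
            break  # further strings no longer pay for the extra computation
        library.append(s)
        best = score
    return library


# Illustrative stand-in for a real accuracy measurement: diminishing returns.
def toy_accuracy(strings):
    return 1.0 - 0.5 ** len(strings)


library = grow_feature_library(["aa", "bb", "cc", "dd", "ee"], toy_accuracy, 0.1)
```

Here the fourth string would raise the toy accuracy by only 0.0625, which is below the 0.1 threshold, so the library stops at three strings.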
The recognition feature library acquisition device of the embodiment of the invention may exist on its own as a server, or may run on the mobile phone.
Once the recognition feature library has been obtained, it can be used for short message identification. As shown in Fig. 2, the short message identification method of the embodiment of the invention comprises:
Step 21: obtaining a short message to be identified, and performing character string extraction on it to obtain a second character string set;
Step 22: selecting from the feature library the character strings that are included in the second character string set to form a third character string set; the character strings in the feature library are obtained by performing character string extraction on the short messages in a sample set and selecting among the extracted character strings according to the mutual information between character strings and message types; the sample set comprises a plurality of short messages that come from a user and have predetermined message types;
Step 23: determining the coordinates (x, y) of the short message to be identified in a coordinate system according to the message type distribution of first short messages in the sample set, the first short messages being the short messages in the sample set that contain a character string in the third character string set;
Step 24: judging whether the short message to be identified is a junk short message according to the position of the coordinates (x, y) relative to a standard line in the coordinate system; the standard line is determined according to the type information of the short messages in the sample set and the message type distribution of second short messages, the second short messages being the short messages in the sample set that contain a character string in the feature library.
In a specific embodiment of the invention, the coordinates of the short message to be identified in a coordinate system are determined from the messages in the sample set and used for discrimination. Because the short messages in the sample set come from the user, and their message types (that is, whether each is a junk short message) are determined by the user in advance, the short message identification method of the embodiment can serve different individual users and provide personalized junk short message interception.
Steps 12 and 21 both require character string extraction from short messages. In a specific embodiment of the invention, N-character string extraction is used, with N ranging from 2 to 4. Extraction with 2-character strings is described below as an example.
Suppose the text of the short message to be identified is as follows: "Group purchase South Mountain countdown! Blue Light Ten Li Lanshan, 5% off this weekend, last chance to group-purchase a South Mountain forest garden home, with special surprise discounts on selected unit types, inquiries 62586969." The result of 2-character string extraction, in which each item corresponds to one overlapping two-character window of the original Chinese text, is then as follows:
purchase by group, purchase south, South Mountain, mountain fall, fall meter, ....
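N-character extraction is a sliding window over the characters of the message. The Python sketch below shows it on short ASCII strings; the function names and inputs are hypothetical, and the same code applies unchanged to Chinese text because Python strings are indexed by character.

```python
def ngrams(text, n):
    """All overlapping n-character windows of text, in order of appearance."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def extract_character_strings(text, n_values=(2, 3, 4)):
    """Distinct character strings for every N in n_values (order preserved)."""
    seen = {}
    for n in n_values:
        for gram in ngrams(text, n):
            seen.setdefault(gram, None)  # dict keeps first-seen order, drops repeats
    return list(seen)


bigrams = ngrams("ABCDE", 2)
all_strings = extract_character_strings("ABAB")
```

Each window overlaps the previous one by N - 1 characters, which is why consecutive extracted strings in the example above share a character.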
In a specific embodiment of the invention, after character string extraction has been performed on the short message to be identified, and taking a recognition feature library of M character strings as an example, the following text vector is obtained:
$$d = (W_1, W_2, \ldots, W_M)$$
where $W_i$ is 0 or 1: $W_i = 1$ if the i-th feature in the recognition feature library appears in the short message to be identified, and $W_i = 0$ otherwise.
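Building the text vector d is a membership test of each library feature against the strings extracted from the message, as in this Python sketch. The function name and the example library are hypothetical.

```python
def to_text_vector(message_strings, feature_library):
    """d = (W_1, ..., W_M): W_i is 1 if feature i occurs in the message, else 0."""
    present = set(message_strings)
    return [1 if feature in present else 0 for feature in feature_library]


feature_library = ["AB", "BC", "XY"]      # hypothetical library, M = 3
message_strings = ["AB", "BC", "CD"]      # strings extracted from one message
d = to_text_vector(message_strings, feature_library)
```

The features with $W_i = 1$ are exactly the third character string set of step 22.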
The decision parameter f(d) of the short message to be identified is defined as follows:
$$f(d) = \log \frac{P\{c_1\}}{P\{c_2\}} + \sum_{k=1}^{M} \frac{\log(1 - p_{k1})}{\log(1 - p_{k2})} + \sum_{k=1}^{M} W_k \log \frac{p_{k1}}{1 - p_{k1}} - \sum_{k=1}^{M} W_k \log \frac{p_{k2}}{1 - p_{k2}}$$
where:
$P\{c_1\}$ denotes the ratio of the number of short messages in the sample set whose message type is predetermined as junk short message to the number of short messages in the sample set;
$P\{c_2\}$ denotes the ratio of the number of short messages in the sample set whose message type is predetermined as normal short message to the number of short messages in the sample set;
$p_{ki}$ (k = 1, ..., M) denotes the probability that a short message in the sample set containing the k-th feature in the recognition feature library belongs to the i-th type of short message.
In a specific embodiment of the invention, i = 1, 2: when i = 1 the short message is a junk short message, and when i = 2 the short message is a normal short message.
In step 23, the coordinates $(x, y)$ of the short message to be identified in a coordinate system are determined according to the message type distribution of the first short messages in the sample set, where:
$$x = \sum_{k=1}^{M} W_k \log\frac{p_{k1}}{1-p_{k1}}$$
$$y = \sum_{k=1}^{M} W_k \log\frac{p_{k2}}{1-p_{k2}}$$
Here $x$ is a measure, estimated from the features, of how strongly the short message to be identified belongs to the first type (junk short messages), and $y$ is the corresponding measure for the second type (normal short messages).
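The two sums can be computed directly from the text vector and the per-feature probabilities. The sketch below is illustrative only: the function name is hypothetical, and it assumes every probability lies strictly between 0 and 1 so that the logarithms are defined (the text does not say how probabilities of exactly 0 or 1 are handled).

```python
import math


def coordinates(w: list[int], p1: list[float], p2: list[float]) -> tuple[float, float]:
    """Return (x, y) for a message with 0/1 vector w.

    p1[k] and p2[k] play the roles of p_k1 (junk) and p_k2 (normal).
    Assumes 0 < p < 1 for every entry.
    """
    x = sum(wk * math.log(a / (1 - a)) for wk, a in zip(w, p1))
    y = sum(wk * math.log(b / (1 - b)) for wk, b in zip(w, p2))
    return x, y
```

Only the features with $W_k = 1$ contribute, so a message that matches no library feature lands at the origin.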
After the coordinates are determined, whether the short message to be identified is a junk short message is judged from the positional relationship between the coordinates $(x, y)$ and a standard straight line in the coordinate system, so a standard straight line must be determined. In specific embodiments of the present invention, the standard straight line can take several forms, which are explained separately below.
In mode one, the standard straight line is: $x - y + Con = 0$
where:
$$Con = \log\frac{P\{c_1\}}{P\{c_2\}} + \sum_{k=1}^{M}\frac{\log(1-p_{k1})}{\log(1-p_{k2})}$$
Once the sample set is fixed, $Con$ is a constant.
With the standard straight line determined, the judgment is made according to whether the following inequality holds:
$$x - y + Con \ge 0$$
When the inequality holds, $f(d)$ is greater than or equal to 0 and the short message to be identified is a junk short message; otherwise the short message to be identified is a normal short message.
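The link between $f(d)$ and the straight line can be made explicit by grouping the terms of $f(d)$ according to the definitions of $Con$, $x$, and $y$ given in the text; no new symbols are introduced:

$$
\begin{aligned}
f(d) &= \underbrace{\log\frac{P\{c_1\}}{P\{c_2\}} + \sum_{k=1}^{M}\frac{\log(1-p_{k1})}{\log(1-p_{k2})}}_{Con}
 + \underbrace{\sum_{k=1}^{M} W_k \log\frac{p_{k1}}{1-p_{k1}}}_{x}
 - \underbrace{\sum_{k=1}^{M} W_k \log\frac{p_{k2}}{1-p_{k2}}}_{y} \\
 &= x - y + Con
\end{aligned}
$$

The test $x - y + Con \ge 0$ is therefore the same test as $f(d) \ge 0$, and only $x$ and $y$ depend on the individual short message.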
In mode one, the standard straight line is $x - y + Con = 0$. If the coordinates of some of the short messages in the sample set are computed and each is judged by its position relative to $x - y + Con = 0$, a discrimination result is obtained for those short messages. Analysis shows that for some of them the discrimination result (junk or not) differs from the predetermined message type. The number is small, but inaccurate results do occur.
Compared with correctly classified short messages, misclassified short messages lie closer to the dividing line in the coordinate system. Based on this observation, the two-dimensional plane formed by X and Y can be divided into a reliable region and an unreliable region.
As shown in Figure 3, the unreliable region is the region (the region between the dotted lines) formed by the coordinate points whose distance to $x - y + Con = 0$ lies within a predetermined interval [dist2, dist1]; all other regions are reliable. (The distance expression is reproduced in the original publication only as image BSA00000396198700122.)
In a specific embodiment of the present invention, the predetermined interval [dist2, dist1] can be obtained in the following ways.
Using the straight line $x - y + Con = 0$ as the judgment criterion, each short message in the sample set is projected into the coordinate system, and a judgment result is obtained from the position of each projected point relative to the line. The distribution of the projected points that were judged wrongly (inconsistent with the predetermined message type) is then analyzed to decide [dist2, dist1], for example:
dist2 is the maximum distance to $x - y + Con = 0$ among the wrongly judged short messages whose projected points lie on the first side of the line, and dist1 is the maximum distance to $x - y + Con = 0$ among the wrongly judged short messages whose projected points lie on the other side of the line.
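The first way of choosing the interval can be sketched as below. Several points are assumptions rather than statements from the text: the distance is taken as the signed point-to-line distance $(x - y + Con)/\sqrt{2}$, the "first side" is taken to be the negative (normal) side so that dist2 is at most 0 and dist1 is at least 0, and all names are hypothetical.

```python
import math


def signed_distance(x: float, y: float, con: float) -> float:
    """Signed distance from (x, y) to the line x - y + con = 0."""
    return (x - y + con) / math.sqrt(2)


def unreliable_interval(points, labels, con):
    """points: list of (x, y); labels: True where the predetermined type is junk.

    Returns (dist2, dist1): the extreme signed distances reached by
    wrongly judged samples on the negative and positive side of the line.
    """
    dist1 = 0.0  # furthest wrongly judged point on the side judged as junk
    dist2 = 0.0  # furthest wrongly judged point on the side judged as normal
    for (x, y), is_junk in zip(points, labels):
        d = signed_distance(x, y, con)
        judged_junk = d >= 0
        if judged_junk != is_junk:
            dist1 = max(dist1, d)
            dist2 = min(dist2, d)
    return dist2, dist1
```

If no sample is judged wrongly, the interval collapses to (0.0, 0.0) and the unreliable region reduces to the line itself.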
Alternatively, [dist2, dist1] is set according to the short message identification accuracy: as shown in Figure 3, [dist2, dist1] is chosen so that the probability that a short message whose projected point lies outside the dotted-line region is correctly identified exceeds a preset threshold (for example, 95%).
To improve discrimination accuracy, when the position of the short message to be identified in the coordinate system falls in the unreliable region, a second form of the standard straight line is used for discrimination:
$$\alpha x - y + \beta\, Con = 0$$
where α is the rotation factor and β is the translation factor.
This standard straight line is obtained from $x - y + Con = 0$ by rotation and translation: β translates the original dividing line $x - y + Con = 0$, and α rotates it. α and β are introduced to improve discrimination accuracy; how the two parameters are obtained is described in detail below.
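The resulting two-stage decision can be sketched as follows. This is an illustration under stated assumptions: the distance to the first line is taken as the signed distance $(x - y + Con)/\sqrt{2}$, the second line is applied with the same "greater than or equal to 0 means junk" convention as the first, and the function name is hypothetical.

```python
import math


def classify(x, y, con, dist2, dist1, alpha, beta):
    """Return True when the message at (x, y) is judged to be junk.

    Points whose signed distance to x - y + con = 0 lies in [dist2, dist1]
    are treated as unreliable and re-judged with alpha*x - y + beta*con = 0.
    """
    d = (x - y + con) / math.sqrt(2)
    if dist2 <= d <= dist1:
        return alpha * x - y + beta * con >= 0
    return x - y + con >= 0
```

With `con=0`, `dist2=-1`, `dist1=1`, `alpha=2`, `beta=1`, the point (0.5, 0.8) lies inside the unreliable band and is judged junk by the second line, although the first line alone would judge it normal.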
In a specific embodiment of the present invention, the optimal dividing line within the unreliable region of the text distribution can be determined by searching for the parameters β and α with a genetic algorithm.
The value ranges of β and α depend on the extent of the unreliable region in the two-dimensional text space. In a specific embodiment of the present invention, the value range of β is as follows:
When $Con > 0$: $\beta \in \left(1 - \frac{2\,\lvert dist2\rvert}{Con},\ 1 + \frac{2\,\lvert dist1\rvert}{Con}\right)$;
When $Con < 0$: $\beta \in \left(1 + \frac{2\,\lvert dist2\rvert}{Con},\ 1 - \frac{2\,\lvert dist1\rvert}{Con}\right)$;
When $Con = 0$: $\beta = 0$.
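The three cases translate into a small helper. The sketch follows the ranges as printed in the text; the function name is hypothetical, and the open interval is returned simply as a (lower, upper) pair.

```python
def beta_range(con: float, dist2: float, dist1: float) -> tuple[float, float]:
    """Return the (lower, upper) bounds of beta for the given Con, dist2, dist1."""
    if con == 0:
        return (0.0, 0.0)  # the text fixes beta = 0 in this case
    if con > 0:
        return (1 - 2 * abs(dist2) / con, 1 + 2 * abs(dist1) / con)
    return (1 + 2 * abs(dist2) / con, 1 - 2 * abs(dist1) / con)
```

In both non-zero cases the lower bound is below 1 and the upper bound is above 1, so β = 1 (no translation) always lies inside the range.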
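Within the unreliable region of the two-dimensional text space, the angle between the text dividing line and the X axis can in theory range from 0 to 90 degrees. In a specific embodiment of the present invention, the preferred range of α is 0.36 to 2.75 (if α is read as the slope of the line $\alpha x - y + \beta\, Con = 0$, this corresponds to angles of roughly 20 to 70 degrees).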
A genetic algorithm (GA) is a global probabilistic search algorithm based on mechanisms of biological evolution such as natural selection and genetic variation. Like derivative-based analytical methods and other heuristic search methods (such as hill climbing, simulated annealing, and the Monte Carlo method), a genetic algorithm is formally an iterative method.
Starting from a selected initial solution, it improves the current solution step by step through repeated iteration until an optimal or satisfactory solution is found. In evolutionary computation, the iteration imitates the evolutionary mechanism of living organisms: starting from a group of solutions (a population), it applies operations analogous to natural selection and sexual reproduction to produce, while inheriting the good genes of the current population, a next-generation population of solutions with better performance.
When the offspring population is generated, the chromosomes of the current population are first sorted from best to worst, and a certain proportion of the lowest-ranked individuals is eliminated; the elimination ratio can be set to 40%. Uniform crossover is then performed among the higher-ranked individuals, and the resulting child individuals are added to the population so that the population size stays constant. Finally, mutation is applied according to the configured mutation probability, producing the offspring population.
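One generation of the procedure described above might look like the sketch below. It is not the patented implementation: the function name, the default mutation probability of 0.01, the choice to mutate every individual (survivors included), and the representation of a chromosome as a list of 0/1 integers are all assumptions; only the sort, the 40% elimination, the uniform crossover, and the final mutation step come from the text.

```python
import random


def next_generation(population, fitness, eliminate_ratio=0.4, mutation_prob=0.01):
    """population: list of equal-length bit lists (at least two individuals).
    fitness: callable scoring one individual, higher is better."""
    # Sort from best to worst and drop the lowest-ranked share.
    ranked = sorted(population, key=fitness, reverse=True)
    n_keep = max(2, len(ranked) - int(len(ranked) * eliminate_ratio))
    survivors = ranked[:n_keep]

    # Uniform crossover among survivors refills the population.
    children = []
    while len(survivors) + len(children) < len(population):
        a, b = random.sample(survivors, 2)
        children.append([ga if random.random() < 0.5 else gb for ga, gb in zip(a, b)])

    # Copy, then flip each bit with the configured mutation probability.
    new_population = [list(ind) for ind in survivors] + children
    for ind in new_population:
        for i in range(len(ind)):
            if random.random() < mutation_prob:
                ind[i] ^= 1
    return new_population
```

Calling the function repeatedly, with the F value of the decoded (α, β) pair as `fitness`, gives the iterative search the text describes.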
Because a GA shows good properties when searching a problem space for an optimum, in a specific embodiment of the present invention a GA is introduced into the optimized naive Bayes classification model to determine β and α.
β and α are real numbers within certain ranges and can be regarded as the phenotype in the genetic algorithm; the mapping from phenotype to genotype is called encoding. Binary encoding is used here: the individual represented by the values of β and α is expressed as a binary string over {0, 1}, and the string length depends on the required solution precision. For example, if the solution must be accurate to 3 decimal places and the interval length is 0.5, the interval must be divided into $0.5 \times 10^3$ equal parts. Because $256 = 2^8 < 0.5 \times 10^3 < 2^9 = 512$, the binary string needs at least 9 bits.
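The bit-length arithmetic, and one possible way to map a bit string back to a real value, can be sketched as follows. Both function names are hypothetical, and the linear decoding over the closed interval is an assumption; the text only states how the string length is chosen.

```python
def bits_needed(interval_length: float, decimals: int) -> int:
    """Smallest number of bits b such that 2**b covers the required equal parts."""
    steps = round(interval_length * 10 ** decimals)
    return max(1, (steps - 1).bit_length())


def decode(bits: list[int], low: float, high: float) -> float:
    """Map a 0/1 list linearly onto [low, high] (all zeros -> low, all ones -> high)."""
    value = int("".join(map(str, bits)), 2)
    return low + (high - low) * value / (2 ** len(bits) - 1)
```

`bits_needed(0.5, 3)` returns 9, matching the worked example in the text (500 equal parts fall between $2^8$ and $2^9$).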
There are three main performance evaluation indices in short message classification: precision P, recall R, and the F-measure, where:
$$P = A / B$$
Here A is the number of short messages in the sample set that are correctly identified as junk short messages when the second standard straight line is used for discrimination, and B is the number of short messages in the sample set that are identified as junk short messages when the second standard straight line is used for discrimination. P measures how accurate the classification result is, that is, what fraction of the messages identified as junk really are junk.
$$R = A / C$$
Here C is the number of short messages in the sample set predetermined as junk short messages. R describes the ability to classify correctly, that is, what fraction of the predetermined junk short messages are identified as junk.
For a given test, precision and recall generally trade off against each other: raising precision tends to lower recall, and raising recall tends to lower precision. The F-measure combines the two indices P and R into an overall evaluation of the classifier, as follows:
$$F = \frac{(\mu + 1)\,PR}{\mu P + R}$$
where $\mu \ge 0$ is a constant that adjusts the relative importance of P and R. The larger μ is, the more important R becomes; when $\mu = 0$, $F = P$, which is the precision.
Because F can be rewritten as:
$$F = \frac{(\mu + 1)\,PR}{\mu P + R} = \frac{\mu + 1}{\mu} \cdot \frac{PR}{P + R/\mu}$$
when $\mu \to \infty$, $F \to R$, which is the recall.
In a specific embodiment of the present invention, once μ has been chosen, the computation in the embodiment becomes:
$$\arg\max_{(\alpha,\beta)} \frac{(\mu + 1)\,PR}{\mu P + R}$$
that is, finding the α and β that maximize F.
Normally, if P and R are treated as equally important, $\mu = 1$ is used, which gives the most commonly used form of F (abbreviated F1):
$$F_1 = \frac{2 \times P \times R}{P + R}$$
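The three indices can be computed from the counts A, B, and C as sketched below. The function name is hypothetical, and returning 0.0 when no junk message is correctly identified is an assumption made to avoid division by zero; the text does not cover that case.

```python
def f_measure(a: int, b: int, c: int, mu: float = 1.0) -> float:
    """F = (mu + 1) * P * R / (mu * P + R), with P = a / b and R = a / c.

    a: messages correctly identified as junk
    b: messages identified as junk
    c: messages predetermined as junk
    """
    if a == 0:
        return 0.0  # assumption: treat the undefined case as the worst score
    p = a / b
    r = a / c
    return (mu + 1) * p * r / (mu * p + r)
```

With `a=8, b=10, c=16` (P = 0.8, R = 0.5), `mu=0` gives F equal to P, and a very large `mu` gives F close to R, in line with the two limiting cases stated above.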
Because β and α are real numbers within certain ranges and can be regarded as the phenotype in the genetic algorithm, a genetic algorithm can be used to find the α and β that maximize F.
Alternatively, the ranges of α and β can each be divided into equal parts. For each combination of α and β values, every short message in the sample set is projected into the coordinate system, a judgment result is obtained from the position of each projected point relative to the line, P and R are computed from the judgment results, and F is computed from P and R. The α and β that maximize F are then selected as the final result.
An example follows.
Suppose α and β take values in [0.36, 2.75] and [1, 3] respectively. Dividing each of [0.36, 2.75] and [1, 3] into 100 equal parts gives 10000 possible combinations.
The sample set is then identified under each of these 10000 combinations; each combination corresponds to one value of F, and the α and β that maximize F are selected as the final result.
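The exhaustive search can be sketched as follows. The sketch makes several assumptions not fixed by the text: each range is sampled at `steps` evenly spaced values including both endpoints (100 values each gives the 10000 combinations of the example), every sample is judged with the second line using the "greater than or equal to 0 means junk" convention, ties keep the first combination found, and all names are hypothetical.

```python
def grid_search(samples, con, alpha_range=(0.36, 2.75), beta_range=(1.0, 3.0),
                steps=100, mu=1.0):
    """samples: list of (x, y, is_junk). Returns (best_f, best_alpha, best_beta)."""
    c = sum(1 for _, _, junk in samples if junk)  # predetermined junk count
    best_f, best_alpha, best_beta = -1.0, None, None
    for i in range(steps):
        alpha = alpha_range[0] + (alpha_range[1] - alpha_range[0]) * i / (steps - 1)
        for j in range(steps):
            beta = beta_range[0] + (beta_range[1] - beta_range[0]) * j / (steps - 1)
            a = b = 0
            for x, y, junk in samples:
                if alpha * x - y + beta * con >= 0:  # judged junk
                    b += 1
                    if junk:
                        a += 1
            if a == 0:
                f = 0.0
            else:
                p, r = a / b, a / c
                f = (mu + 1) * p * r / (mu * p + r)
            if f > best_f:
                best_f, best_alpha, best_beta = f, alpha, beta
    return best_f, best_alpha, best_beta
```

The cost grows with `steps` squared times the sample count, which is why the text also offers the genetic algorithm as a way to search the same space.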
Of course, the values of α and β that maximize F can also be computed with other existing algorithms, which are not described one by one here.
Each short message discrimination consumes processing capacity of the terminal. If the sender number of the short message to be identified is in the contact list, the short message was sent by someone the user knows, and identification is unnecessary. If the sender number is in the blacklist, the short message is one the user does not want to receive, and identification is likewise unnecessary. Therefore, to improve processing efficiency, in a specific embodiment of the present invention, after the short message to be identified is obtained, the method further comprises:
Judging whether the sender number of the short message to be identified is in the contact list or the blacklist;
When the sender number of the short message to be identified is in the contact list, saving the short message to be identified directly to the inbox and ending;
When the sender number of the short message to be identified is in the blacklist, saving the short message to be identified directly to the trash folder and ending;
When the sender number of the short message to be identified is in neither the contact list nor the blacklist, proceeding to the step of extracting character strings from the short message to be identified.
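The pre-check above amounts to a short routing function. The sketch is illustrative: the function name and the returned labels are hypothetical, and the contact list is checked before the blacklist, following the order in which the steps are listed.

```python
def route_by_sender(sender: str, contacts: set[str], blacklist: set[str]) -> str:
    """Decide what to do with a message before any content analysis."""
    if sender in contacts:
        return "inbox"      # known sender: store directly, skip identification
    if sender in blacklist:
        return "trash"      # unwanted sender: store in trash, skip identification
    return "classify"       # unknown sender: continue with string extraction
```

Only messages routed to `"classify"` incur the cost of character string extraction and coordinate computation.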
The short message identification device of the embodiment of the present invention comprises:
A second character string extraction module, configured to obtain a short message to be identified and extract character strings from it to obtain a second character string set;
A set generation module, configured to select from the recognition feature library the character strings that are contained in the second character string set to form a third character string set;
A coordinate determination module, configured to determine the coordinates $(x, y)$ of the short message to be identified in a coordinate system according to the message type distribution of first short messages in the sample set, the first short messages being the short messages in the sample set that contain character strings in the third character string set;
An identification processing module, configured to judge whether the short message to be identified is a junk short message according to the positional relationship between the coordinates $(x, y)$ and a standard straight line in the coordinate system, the standard straight line being determined according to the type information of the short messages in the sample set and the message type distribution of second short messages, the second short messages being the short messages in the sample set that contain character strings in the recognition feature library.
In the above short message identification device, the standard straight line is: $x - y + Con = 0$, where:
$$Con = \log\frac{P\{c_1\}}{P\{c_2\}} + \sum_{k=1}^{M}\frac{\log(1-p_{k1})}{\log(1-p_{k2})}$$
$P\{c_1\}$ denotes the ratio of the number of short messages in the sample set whose message type is predetermined as junk short message to the total number of short messages in the sample set;
$P\{c_2\}$ denotes the ratio of the number of short messages in the sample set whose message type is predetermined as normal short message to the total number of short messages in the sample set;
$p_{k1}$ denotes the probability that a short message in the sample set containing the k-th feature in the recognition feature library is a junk short message;
$p_{k2}$ denotes the probability that a short message in the sample set containing the k-th feature in the recognition feature library is a normal short message;
$k = 1, \ldots, M$, where M is the number of character strings recorded in the recognition feature library.
In the above short message identification device, the standard straight line can also comprise a first standard straight line and a second standard straight line, the first standard straight line being $x - y + Con = 0$ and the second standard straight line being $\alpha x - y + \beta\, Con = 0$, where:
$$Con = \log\frac{P\{c_1\}}{P\{c_2\}} + \sum_{k=1}^{M}\frac{\log(1-p_{k1})}{\log(1-p_{k2})}$$
$P\{c_1\}$ denotes the ratio of the number of short messages in the sample set whose message type is predetermined as junk short message to the total number of short messages in the sample set;
$P\{c_2\}$ denotes the ratio of the number of short messages in the sample set whose message type is predetermined as normal short message to the total number of short messages in the sample set;
$p_{k1}$ denotes the probability that a short message in the sample set containing the k-th feature in the recognition feature library is a junk short message;
$p_{k2}$ denotes the probability that a short message in the sample set containing the k-th feature in the recognition feature library is a normal short message;
$k = 1, \ldots, M$, where M is the number of character strings recorded in the recognition feature library;
α and β are the rotation factor and the translation factor respectively;
The identification processing module specifically comprises:
A judging unit, configured to judge whether the coordinate point $(x, y)$ lies in the unreliable region, the unreliable region being the region formed by the coordinate points whose distance to the first standard straight line lies within a predetermined interval;
A classification and identification unit, configured to judge, when the coordinate point $(x, y)$ lies in the unreliable region, whether the short message to be identified is a junk short message according to the positional relationship between the coordinates $(x, y)$ and the second standard straight line, and otherwise to judge whether the short message to be identified is a junk short message according to the positional relationship between the coordinates $(x, y)$ and the first standard straight line.
In the above short message identification device:
F=(μ+1)·PR/(μP+R);
P=A/B;
R=A/C;
A is the number of short messages in the sample set that are correctly identified as junk short messages when the second standard straight line is used for discrimination; B is the number of short messages in the sample set that are identified as junk short messages when the second standard straight line is used for discrimination; and C is the number of short messages in the sample set predetermined as junk short messages;
μ is the importance adjustment factor, and μ is greater than or equal to 0;
The values of α and β are the values that maximize F.
When P and R are considered equally important, μ is set to 1; in this case, the values of α and β are the values that maximize 2PR/(P+R).
In a specific embodiment of the present invention, when the recognition feature library acquisition device is implemented as a server, users need to upload short messages whose message type has been determined, and the terminal also needs to synchronize from the server the recognition feature library and the values of α and β computed by the server, so that short message identification can be performed locally.
The above is only a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art can make improvements and refinements without departing from the principle of the present invention, and such improvements and refinements shall also be regarded as falling within the protection scope of the present invention.

Claims (13)

1. A recognition feature library acquisition method, characterized by comprising:
Forming a sample set from a plurality of short messages that come from users and whose message types are predetermined;
Extracting character strings from each short message in the sample set to obtain a first character string set, the character strings in the first character string set all being different from one another;
For each character string in the first character string set, counting the number of short messages in the sample set that contain the character string;
Calculating, according to the statistical results, the mutual information of each character string with respect to the short message categories;
Selecting, in descending order of mutual information, some or all of the character strings from the first character string set to form the recognition feature library.
2. The recognition feature library acquisition method according to claim 1, characterized in that the mutual information MI of a character string with respect to the short message categories is as follows:
$$MI(t_m, c_i) = \sum_{i=1}^{n} P(t_m, c_i) \log\frac{P(t_m, c_i)}{P(t_m)\,P(c_i)}$$
where:
$t_m$ denotes the m-th character string in the first character string set, $m = 1, \ldots, L$, and L is the number of character strings recorded in the first character string set;
$c_i$ denotes the i-th category among the predefined short message categories;
$MI(t_m, c_i)$ denotes the mutual information between $t_m$ and category $c_i$;
$P(t_m)$ denotes the ratio of the number of short messages in the sample set that contain the character string $t_m$ to the number of short messages in the sample set;
$P(c_i)$ denotes the ratio of the number of short messages in the sample set whose category is $c_i$ to the number of short messages in the sample set;
$P(t_m, c_i)$ denotes the ratio of the number of short messages in the sample set that contain the character string $t_m$ and whose category is $c_i$ to the number of short messages in the sample set.
3. A recognition feature library acquisition device, characterized by comprising:
A sample set generation module, configured to form a sample set from a plurality of short messages that come from users and whose message types are predetermined;
A first character string extraction module, configured to extract character strings from each short message in the sample set to obtain a first character string set, the character strings in the first character string set all being different from one another;
A statistics module, configured to count, for each character string in the first character string set, the number of short messages in the sample set that contain the character string;
A mutual information calculation module, configured to calculate, according to the statistical results, the mutual information of each character string with respect to the short message categories;
A character string selection module, configured to select, in descending order of mutual information, some or all of the character strings from the first character string set to form the recognition feature library.
4. The recognition feature library acquisition device according to claim 3, characterized in that the mutual information MI of a character string with respect to the short message categories is as follows:
$$MI(t_m, c_i) = \sum_{i=1}^{n} P(t_m, c_i) \log\frac{P(t_m, c_i)}{P(t_m)\,P(c_i)}$$
where:
$t_m$ denotes the m-th character string in the first character string set, $m = 1, \ldots, L$, and L is the number of character strings recorded in the first character string set;
$c_i$ denotes the i-th category among the predefined short message categories;
$MI(t_m, c_i)$ denotes the mutual information between $t_m$ and category $c_i$;
$P(t_m)$ denotes the ratio of the number of short messages in the sample set that contain the character string $t_m$ to the number of short messages in the sample set;
$P(c_i)$ denotes the ratio of the number of short messages in the sample set whose category is $c_i$ to the number of short messages in the sample set;
$P(t_m, c_i)$ denotes the ratio of the number of short messages in the sample set that contain the character string $t_m$ and whose category is $c_i$ to the number of short messages in the sample set.
5. A short message identification method using the recognition feature library obtained by the recognition feature library acquisition method of claim 1 or 2, characterized by comprising:
Obtaining a short message to be identified, and extracting character strings from the short message to be identified to obtain a second character string set;
Selecting from the recognition feature library the character strings that are contained in the second character string set to form a third character string set;
Determining the coordinates $(x, y)$ of the short message to be identified in a coordinate system according to the message type distribution of first short messages in the sample set, the first short messages being the short messages in the sample set that contain character strings in the third character string set;
Judging whether the short message to be identified is a junk short message according to the positional relationship between the coordinates $(x, y)$ and a standard straight line in the coordinate system, the standard straight line being determined according to the type information of the short messages in the sample set and the message type distribution of second short messages, the second short messages being the short messages in the sample set that contain character strings in the recognition feature library.
6. The short message identification method according to claim 5, characterized in that the standard straight line is: $x - y + Con = 0$, where:
$$Con = \log\frac{P\{c_1\}}{P\{c_2\}} + \sum_{k=1}^{M}\frac{\log(1-p_{k1})}{\log(1-p_{k2})}$$
$P\{c_1\}$ denotes the ratio of the number of short messages in the sample set whose message type is predetermined as junk short message to the total number of short messages in the sample set;
$P\{c_2\}$ denotes the ratio of the number of short messages in the sample set whose message type is predetermined as normal short message to the total number of short messages in the sample set;
$p_{k1}$ denotes the probability that a short message in the sample set containing the k-th feature in the recognition feature library is a junk short message;
$p_{k2}$ denotes the probability that a short message in the sample set containing the k-th feature in the recognition feature library is a normal short message;
$k = 1, \ldots, M$, where M is the number of character strings recorded in the recognition feature library.
7. The short message identification method according to claim 5, characterized in that the standard straight line comprises a first standard straight line and a second standard straight line, the first standard straight line being $x - y + Con = 0$ and the second standard straight line being $\alpha x - y + \beta\, Con = 0$, where:
$$Con = \log\frac{P\{c_1\}}{P\{c_2\}} + \sum_{k=1}^{M}\frac{\log(1-p_{k1})}{\log(1-p_{k2})}$$
$P\{c_1\}$ denotes the ratio of the number of short messages in the sample set whose message type is predetermined as junk short message to the total number of short messages in the sample set;
$P\{c_2\}$ denotes the ratio of the number of short messages in the sample set whose message type is predetermined as normal short message to the total number of short messages in the sample set;
$p_{k1}$ denotes the probability that a short message in the sample set containing the k-th feature in the recognition feature library is a junk short message;
$p_{k2}$ denotes the probability that a short message in the sample set containing the k-th feature in the recognition feature library is a normal short message;
$k = 1, \ldots, M$, where M is the number of character strings recorded in the recognition feature library;
α is the rotation factor, and β is the translation factor;
The judging of whether the short message to be identified is a junk short message according to the positional relationship between the coordinates $(x, y)$ and the standard straight line in the coordinate system specifically comprises:
Judging whether the coordinate point $(x, y)$ lies in the unreliable region, the unreliable region being the region formed by the coordinate points whose distance to the first standard straight line lies within a predetermined interval;
When the coordinate point $(x, y)$ lies in the unreliable region, judging whether the short message to be identified is a junk short message according to the positional relationship between the coordinates $(x, y)$ and the second standard straight line, and otherwise judging whether the short message to be identified is a junk short message according to the positional relationship between the coordinates $(x, y)$ and the first standard straight line.
8. The short message identification method according to claim 7, characterized in that:
F=(μ+1)·PR/(μP+R);
P=A/B;
R=A/C;
A is the number of short messages in the sample set that are correctly identified as junk short messages when the second standard straight line is used for discrimination; B is the number of short messages in the sample set that are identified as junk short messages when the second standard straight line is used for discrimination; and C is the number of short messages in the sample set predetermined as junk short messages;
μ is the importance adjustment factor, and μ is greater than or equal to 0;
The values of α and β are the values that maximize F.
9. The short message identification method according to any one of claims 5 to 8, characterized in that, after the short message to be identified is obtained, the method further comprises:
Judging whether the sender number of the short message to be identified is in the contact list or the blacklist;
When the sender number of the short message to be identified is in the contact list, saving the short message to be identified directly to the inbox and ending;
When the sender number of the short message to be identified is in the blacklist, saving the short message to be identified directly to the trash folder and ending;
When the sender number of the short message to be identified is in neither the contact list nor the blacklist, proceeding to the step of extracting character strings from the short message to be identified.
10. A short message identification device using the recognition feature library obtained by the recognition feature library acquisition method of claim 1 or 2, characterized by comprising:
A second character string extraction module, configured to obtain a short message to be identified and extract character strings from it to obtain a second character string set;
A set generation module, configured to select from the recognition feature library the character strings that are contained in the second character string set to form a third character string set;
A coordinate determination module, configured to determine the coordinates $(x, y)$ of the short message to be identified in a coordinate system according to the message type distribution of first short messages in the sample set, the first short messages being the short messages in the sample set that contain character strings in the third character string set;
An identification processing module, configured to judge whether the short message to be identified is a junk short message according to the positional relationship between the coordinates $(x, y)$ and a standard straight line in the coordinate system, the standard straight line being determined according to the type information of the short messages in the sample set and the message type distribution of second short messages, the second short messages being the short messages in the sample set that contain character strings in the recognition feature library.
11. The short message identification device according to claim 10, characterized in that the standard straight line is: $x - y + Con = 0$, where:
$$Con = \log\frac{P\{c_1\}}{P\{c_2\}} + \sum_{k=1}^{M}\frac{\log(1-p_{k1})}{\log(1-p_{k2})}$$
$P\{c_1\}$ denotes the ratio of the number of short messages in the sample set whose message type is predetermined as junk short message to the total number of short messages in the sample set;
$P\{c_2\}$ denotes the ratio of the number of short messages in the sample set whose message type is predetermined as normal short message to the total number of short messages in the sample set;
$p_{k1}$ denotes the probability that a short message in the sample set containing the k-th feature in the recognition feature library is a junk short message;
$p_{k2}$ denotes the probability that a short message in the sample set containing the k-th feature in the recognition feature library is a normal short message;
$k = 1, \ldots, M$, where M is the number of character strings recorded in the recognition feature library.
12. The short message identification device according to claim 10, characterized in that said standard straight line comprises a first standard straight line and a second standard straight line, said first standard straight line being: x - y + Con = 0, and said second standard straight line being: α·x - y + β·Con = 0, wherein:
$$Con = \log\frac{P\{c_1\}}{P\{c_2\}} + \sum_{k=1}^{M} \frac{\log(1 - p_{k1})}{\log(1 - p_{k2})}$$
$P\{c_1\}$ denotes the ratio of the number of short messages in the sample set whose message type is predetermined as junk short message to the total number of short messages in the sample set;
$P\{c_2\}$ denotes the ratio of the number of short messages in the sample set whose message type is predetermined as normal short message to the total number of short messages in the sample set;
$p_{k1}$ denotes the probability that a short message in the sample set containing the k-th feature of the recognition feature library is a junk short message;
$p_{k2}$ denotes the probability that a short message in the sample set containing the k-th feature of the recognition feature library is a normal short message;
k = 1, ..., M, where M is the number of character strings recorded in said recognition feature library;
Said α is a rotation factor, and said β is a shift factor; said recognition processing module specifically comprises:
A judging unit, configured to judge whether said coordinate point (x, y) lies in an unreliable region; said unreliable region is the region formed by the coordinate points whose distance to said first standard straight line falls within a predetermined interval;
A classification and recognition unit, configured to, when said coordinate point (x, y) lies in the unreliable region, judge whether said short message to be identified is a junk short message according to the positional relationship between said coordinates (x, y) and said second standard straight line; and otherwise, to judge whether said short message to be identified is a junk short message according to the positional relationship between said coordinates (x, y) and said first standard straight line.
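The two-line decision of claim 12 can be sketched in Python as follows. This is only an illustration under stated assumptions: the distance to the first standard straight line is taken as the ordinary perpendicular distance, the predetermined interval is treated as a closed interval `(lo, hi)`, and the flag `junk_if_positive` (which side of a line counts as junk) is a hypothetical parameter, since the claim does not fix the sign convention. All function and parameter names are illustrative.

```python
import math


def distance_to_first_line(x, y, con):
    """Perpendicular distance from (x, y) to the line x - y + Con = 0."""
    return abs(x - y + con) / math.sqrt(2.0)


def is_junk(x, y, con, alpha, beta, interval, junk_if_positive=True):
    """Classify (x, y) with the first or the second standard straight line.

    interval is the assumed closed range (lo, hi) of distances that make
    up the unreliable region around the first line. Points lying exactly
    on the line in use are grouped with the negative side.
    """
    lo, hi = interval
    if lo <= distance_to_first_line(x, y, con) <= hi:
        # Unreliable region: use the second line, alpha*x - y + beta*Con = 0.
        value = alpha * x - y + beta * con
    else:
        # Reliable region: use the first line, x - y + Con = 0.
        value = x - y + con
    on_positive_side = value > 0
    return on_positive_side if junk_if_positive else not on_positive_side
```

For example, with Con = 0, α = 2, β = 1 and the interval (0, 1), the point (1, 1.5) lies about 0.35 from the first line, so the second line decides its class, whereas the point (5, 0) lies outside the unreliable region and is decided by the first line.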
13. The short message identification device according to claim 12, characterized in that:
F=(μ+1)·PR/(μP+R);
P=A/B;
R=A/C;
A is the number of short messages in the sample set that are correctly identified as junk short messages when discriminating with said second standard straight line; B is the number of short messages in the sample set that are identified as junk short messages when discriminating with said second standard straight line; and C is the number of short messages in the sample set that are predetermined as junk short messages;
μ is an importance adjustment factor, and said μ is greater than or equal to 0;
The values of said α and β are the values that maximize said F.
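The selection of α and β in claim 13 can be sketched as a search over candidate values. The claim only requires that α and β maximize F; the exhaustive grid search, the candidate lists, the convention that the positive side of the second line means junk, and the choice to return F = 0 when A, B or C is zero are all assumptions made for this illustration.

```python
def f_measure(a, b, c, mu):
    """F = (mu + 1) * P * R / (mu * P + R), with P = A / B and R = A / C."""
    if a == 0 or b == 0 or c == 0:
        return 0.0  # assumed convention when P or R is zero or undefined
    p = a / b
    r = a / c
    return (mu + 1) * p * r / (mu * p + r)


def tune_alpha_beta(points, labels, con, alphas, betas, mu=1.0):
    """Return (alpha, beta, F) with the largest F over the candidate grid.

    points is a list of (x, y) sample coordinates; labels[i] is True when
    sample i is predetermined as a junk short message.
    """
    c = sum(1 for is_junk in labels if is_junk)
    best = (None, None, -1.0)
    for alpha in alphas:
        for beta in betas:
            # Assumed sign convention: positive side of the second line = junk.
            predicted = [alpha * x - y + beta * con > 0 for x, y in points]
            a = sum(1 for p, l in zip(predicted, labels) if p and l)
            b = sum(1 for p in predicted if p)
            f = f_measure(a, b, c, mu)
            if f > best[2]:
                best = (alpha, beta, f)
    return best
```

With μ = 0 the measure reduces to the precision P, and larger μ gives the recall R more weight, which is how the importance adjustment factor shifts the chosen α and β.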

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201010602263.1A CN102572744B (en) 2010-12-13 2010-12-13 Recognition feature library acquisition method and device as well as short message identification method and device

Publications (2)

Publication Number Publication Date
CN102572744A (en) 2012-07-11
CN102572744B (en) 2014-11-05

Family

ID=46416970

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201010602263.1A Active CN102572744B (en) 2010-12-13 2010-12-13 Recognition feature library acquisition method and device as well as short message identification method and device

Country Status (1)

Country Link
CN (1) CN102572744B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101021838A (en) * 2007-03-02 2007-08-22 华为技术有限公司 Text handling method and system
CN101600178A (en) * 2009-06-26 2009-12-09 成都市华为赛门铁克科技有限公司 Junk information confirmation method and device, terminal

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015039478A1 (en) * 2013-09-17 2015-03-26 中兴通讯股份有限公司 Method and apparatus for recognizing junk messages
CN103501487A (en) * 2013-09-18 2014-01-08 小米科技有限责任公司 Method, device, terminal, server and system for updating classifier
CN105744493A (en) * 2014-12-08 2016-07-06 中国移动通信集团河北有限公司 Information identification method and apparatus
CN105744493B (en) * 2014-12-08 2019-09-10 中国移动通信集团河北有限公司 A kind of information identifying method and device
CN105404670A (en) * 2015-11-16 2016-03-16 北京奇虎科技有限公司 Harassing text message determining method and apparatus
CN105404670B (en) * 2015-11-16 2018-09-25 北京奇虎科技有限公司 Harass short message method of discrimination and device
CN105893501A (en) * 2016-03-30 2016-08-24 中国联合网络通信集团有限公司 Information inquiry short-message processing method and system
CN109906587B (en) * 2017-07-26 2022-05-13 松下电器(美国)知识产权公司 In-vehicle relay device, in-vehicle monitoring device, in-vehicle control network system, communication monitoring method, and computer-readable recording medium
CN109906587A (en) * 2017-07-26 2019-06-18 松下电器(美国)知识产权公司 Vehicle-mounted relay, vehicle mounted surveillance device, vehicle-mounted control network system, communication monitoring method and program
CN108763209A (en) * 2018-05-22 2018-11-06 阿里巴巴集团控股有限公司 A kind of method, apparatus and equipment of feature extraction and risk identification
CN111259207A (en) * 2018-11-30 2020-06-09 阿里巴巴集团控股有限公司 Short message identification method, device and equipment
CN110730270A (en) * 2019-09-09 2020-01-24 上海凯京信达科技集团有限公司 Short message grouping method and device, computer storage medium and electronic equipment
CN111740969A (en) * 2020-06-12 2020-10-02 北京三快在线科技有限公司 Method, device, equipment and storage medium for verifying electronic certificate information
CN111740969B (en) * 2020-06-12 2022-09-16 北京三快在线科技有限公司 Method, device, equipment and storage medium for verifying electronic certificate information

Also Published As

Publication number Publication date
CN102572744B (en) 2014-11-05

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant