CN102572744A

CN102572744A - Recognition feature library acquisition method and device as well as short message identification method and device

Info

Publication number: CN102572744A
Application number: CN2010106022631A
Authority: CN
Inventors: 万狄飞
Original assignee: China Mobile Group Design Institute Co Ltd
Current assignee: China Mobile Group Design Institute Co Ltd
Priority date: 2010-12-13
Filing date: 2010-12-13
Publication date: 2012-07-11
Anticipated expiration: 2030-12-13
Also published as: CN102572744B

Abstract

The invention provides a method and a device for acquiring a recognition feature library, and a method and a device for identifying short messages. The recognition feature library acquisition method comprises the following steps: forming a sample set from a plurality of short messages that come from a user and whose message types are predetermined; performing character string extraction on each short message in the sample set to obtain a first character string set in which all character strings are distinct; for each character string in the first character string set, counting the number of short messages in the sample set that contain it; calculating, from the counting result, the mutual information of each character string with respect to the short message categories; and selecting some or all of the character strings in the first character string set, in descending order of mutual information, to form the recognition feature library. The efficiency of short message identification is thereby improved.

Description

Recognition feature library acquisition method and device, and short message identification method and device

Technical field

The present invention relates to short message identification technology in communication networks, and in particular to a recognition feature library acquisition method and device and a short message identification method and device.

Background technology

China's Ministry of Industry and Information Technology has consistently paid close attention to the control of junk short messages. It requires each operator to carry out targeted self-inspection, to standardize its management practices in earnest, and to firmly stop all kinds of illegal behavior and behavior that infringes users' rights. For operators and administrative departments, controlling junk short messages requires technical means in addition to strict management.

Opinions differ on what counts as a junk short message. Messages that are anti-Party or anti-state, that affect national stability and unity, or whose pornographic content offends public morals are junk without question. For any other content, different people may judge the same message differently, and this is especially true of commercial promotional advertising messages.

In the prior art, junk short message interception on the operator side can only block, by content, messages that are anti-Party or anti-state, that affect national stability and unity, or whose pornographic content offends public morals, and, by volume, messages from numbers whose sending traffic exceeds a threshold. It cannot perform distinctive, personalized interception from the viewpoint of the individual mobile phone user. If a unified standard is applied, then for some users messages are either deleted by mistake or not deleted when they should be, so processing efficiency is low, as illustrated below.

Suppose user A strongly dislikes a certain artist X, while user B is very fond of X. If news about X is pushed to users by short message under a unified discrimination standard, then either the news is judged to be junk and is not delivered, so that a message B wants is deleted by mistake, or the news is delivered to both A and B, in which case it is a junk short message for A. Both approaches are inefficient.

Summary of the invention

The object of the present invention is to provide a recognition feature library acquisition method and device, and a short message identification method and device, that improve the efficiency of short message identification.

To achieve the above object, an embodiment of the invention provides a recognition feature library acquisition method, comprising:

forming a sample set from a plurality of short messages that come from a user and whose message types are predetermined;

performing character string extraction on each short message in the sample set to obtain a first character string set, all character strings in the first character string set being distinct;

for each character string in the first character string set, counting the number of short messages in the sample set that contain the character string;

calculating, from the counting result, the mutual information of each character string with respect to the short message categories; and

selecting some or all of the character strings from the first character string set, in descending order of mutual information, to form the recognition feature library.

To achieve the above object, an embodiment of the invention provides a recognition feature library acquisition device, comprising:

a sample set generation module, configured to form a sample set from a plurality of short messages that come from a user and whose message types are predetermined;

a first character string extraction module, configured to perform character string extraction on each short message in the sample set to obtain a first character string set, all character strings in the first character string set being distinct;

a statistical module, configured to count, for each character string in the first character string set, the number of short messages in the sample set that contain the character string;

a mutual information calculation module, configured to calculate, from the counting result, the mutual information of each character string with respect to the short message categories; and

a character string selection module, configured to select some or all of the character strings from the first character string set, in descending order of mutual information, to form the recognition feature library.

The mutual information MI of a character string with respect to the short message categories is as follows:

MI(t_m, c_i) = \sum_{i=1}^{n} P(t_m, c_i) \log \frac{P(t_m, c_i)}{P(t_m) P(c_i)}

where:

t_m denotes the m-th character string in the first character string set, m = 1, ..., L, where L is the number of character strings recorded in the first character string set;

c_i denotes the i-th category among the predefined short message categories;

MI(t_m, c_i) denotes the mutual information between t_m and category c_i;

P(t_m) denotes the ratio of the number of short messages in the sample set that contain the character string t_m to the total number of short messages in the sample set;

P(c_i) denotes the ratio of the number of short messages in the sample set whose category is c_i to the total number of short messages in the sample set;

P(t_m, c_i) denotes the ratio of the number of short messages in the sample set that contain the character string t_m and whose category is c_i to the total number of short messages in the sample set.

To achieve the above object, an embodiment of the invention provides a short message identification method using the above recognition feature library, comprising:

obtaining a short message to be discriminated, and performing character string extraction on it to obtain a second character string set;

selecting from the recognition feature library the character strings that are contained in the second character string set, to form a third character string set;

determining the coordinate (x, y) of the short message to be discriminated in a coordinate system according to the message type distribution of first short messages in the sample set, the first short messages being the short messages in the sample set that contain a character string of the third character string set; and

judging whether the short message to be discriminated is a junk short message according to the position of the coordinate (x, y) relative to a standard straight line in the coordinate system, the standard straight line being determined from the type information of the short messages in the sample set and the message type distribution of second short messages, the second short messages being the short messages in the sample set that contain a character string of the feature library.

In the above short message identification method, the standard straight line is x - y + Con = 0, where:

Con = \log \frac{P\{c_1\}}{P\{c_2\}} + \sum_{k=1}^{M} \frac{\log(1 - p_{k1})}{\log(1 - p_{k2})}

P{c_1} denotes the ratio of the number of short messages in the sample set predetermined to be junk short messages to the total number of short messages in the sample set;

P{c_2} denotes the ratio of the number of short messages in the sample set predetermined to be normal short messages to the total number of short messages in the sample set;

p_{k1} denotes the probability that a short message in the sample set containing the k-th feature of the recognition feature library is a junk short message;

p_{k2} denotes the probability that a short message in the sample set containing the k-th feature of the recognition feature library is a normal short message;

k = 1, ..., M, where M is the number of character strings recorded in the recognition feature library.

In the above short message identification method, the standard straight line comprises a first standard straight line, x - y + Con = 0, and a second standard straight line, α*x - y + β*Con = 0, where:

Con = \log \frac{P\{c_1\}}{P\{c_2\}} + \sum_{k=1}^{M} \frac{\log(1 - p_{k1})}{\log(1 - p_{k2})}

k = 1, ..., M, where M is the number of character strings recorded in the recognition feature library;

α is a rotation factor and β is a translation factor.

Judging whether the short message to be discriminated is a junk short message according to the position of the coordinate (x, y) relative to the standard straight line in the coordinate system specifically comprises:

judging whether the coordinate point (x, y) lies in an unreliable region, the unreliable region being the region formed by the coordinate points whose distance to the first standard straight line lies within a predetermined interval; and

when the coordinate point (x, y) lies in the unreliable region, judging whether the short message to be discriminated is a junk short message according to the position of the coordinate (x, y) relative to the second standard straight line; otherwise, judging it according to the position of the coordinate (x, y) relative to the first standard straight line.

In the above short message identification method:

F = (μ + 1)·P·R / (μ·P + R);

P = A/B;

R = A/C;

A is the number of short messages in the sample set correctly identified as junk short messages when discriminating with the second standard straight line; B is the number of short messages in the sample set identified as junk short messages when discriminating with the second standard straight line; and C is the number of short messages in the sample set predetermined to be junk short messages;

μ is an importance adjustment factor, with μ greater than or equal to 0;

the values of α and β are those that maximize F.
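The F value defined above can be sketched in Python as follows. This is an illustrative sketch, not the patent's implementation: the function name `f_measure` and the boolean-list inputs are assumptions, and returning 0 when A, B, or C is zero is a convention chosen here because the text does not cover that case. P and R correspond to what is usually called precision and recall.

```python
def f_measure(predicted, actual, mu=1.0):
    """Compute F = (mu + 1) * P * R / (mu * P + R).

    predicted: booleans, True if the message was identified as junk.
    actual:    booleans, True if the message was predetermined to be junk.
    mu:        importance adjustment factor, must be >= 0.
    """
    if mu < 0:
        raise ValueError("mu must be greater than or equal to 0")
    a = sum(1 for p, t in zip(predicted, actual) if p and t)  # A: correctly identified junk
    b = sum(1 for p in predicted if p)                         # B: identified as junk
    c = sum(1 for t in actual if t)                            # C: predetermined junk
    if a == 0 or b == 0 or c == 0:
        return 0.0  # assumption: treat an undefined or zero P/R as F = 0
    precision = a / b
    recall = a / c
    return (mu + 1) * precision * recall / (mu * precision + recall)
```

With μ = 0 the expression reduces to P, so larger values of μ give R more weight.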

The above short message identification method further comprises, after obtaining the short message to be discriminated:

judging whether the calling number of the short message to be discriminated is present in a contact list or a blacklist;

when the calling number is present in the contact list, saving the short message to be discriminated directly to the inbox and ending;

when the calling number is present in the blacklist, saving the short message to be discriminated directly to the junk box and ending;

when the calling number is present in neither the contact list nor the blacklist, proceeding to the step of performing character string extraction on the short message to be discriminated.
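The pre-check on the calling number can be sketched as a small routing function. The name `route_by_calling_number` and the return labels are hypothetical; checking the contact list before the blacklist is an assumption, since the text does not say what happens when a number appears in both.

```python
def route_by_calling_number(calling_number, contacts, blacklist):
    """Decide what to do with a message before any content analysis.

    Returns "inbox", "junk_box", or "classify" (continue with
    character string extraction and classification).
    """
    if calling_number in contacts:    # known contact: keep the message
        return "inbox"
    if calling_number in blacklist:   # blacklisted sender: discard as junk
        return "junk_box"
    return "classify"                 # unknown sender: run the classifier
```

Using sets for `contacts` and `blacklist` keeps each lookup constant-time.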

To achieve the above object, an embodiment of the invention provides a short message identification device using the above recognition feature library, comprising:

a second character string extraction module, configured to obtain a short message to be discriminated and perform character string extraction on it to obtain a second character string set;

a set generation module, configured to select from the recognition feature library the character strings that are contained in the second character string set, to form a third character string set;

a coordinate determination module, configured to determine the coordinate (x, y) of the short message to be discriminated in a coordinate system according to the message type distribution of first short messages in the sample set, the first short messages being the short messages in the sample set that contain a character string of the third character string set; and

a recognition processing module, configured to judge whether the short message to be discriminated is a junk short message according to the position of the coordinate (x, y) relative to a standard straight line in the coordinate system, the standard straight line being determined from the type information of the short messages in the sample set and the message type distribution of second short messages, the second short messages being the short messages in the sample set that contain a character string of the feature library.

In the above short message identification device, the standard straight line is x - y + Con = 0, where:

Con = \log \frac{P\{c_1\}}{P\{c_2\}} + \sum_{k=1}^{M} \frac{\log(1 - p_{k1})}{\log(1 - p_{k2})}

In the above short message identification device, the standard straight line comprises a first standard straight line, x - y + Con = 0, and a second standard straight line, α*x - y + β*Con = 0, where:

Con = \log \frac{P\{c_1\}}{P\{c_2\}} + \sum_{k=1}^{M} \frac{\log(1 - p_{k1})}{\log(1 - p_{k2})}

α is a rotation factor and β is a translation factor. The recognition processing module specifically comprises:

a judging unit, configured to judge whether the coordinate point (x, y) lies in an unreliable region, the unreliable region being the region formed by the coordinate points whose distance to the first standard straight line lies within a predetermined interval; and

a classification and identification unit, configured to judge, when the coordinate point (x, y) lies in the unreliable region, whether the short message to be discriminated is a junk short message according to the position of the coordinate (x, y) relative to the second standard straight line, and otherwise to judge it according to the position of the coordinate (x, y) relative to the first standard straight line.

In the above short message identification device:

F = (μ + 1)·P·R / (μ·P + R);

P = A/B;

R = A/C;

μ is an importance adjustment factor, with μ greater than or equal to 0;

the values of α and β are those that maximize F.

Embodiments of the invention have the following beneficial effects:

In the embodiments of the invention, the coordinate of the short message to be discriminated in a coordinate system is determined from the messages in the sample set and used for discrimination. Because the short messages in the sample set come from the user, and their message types (that is, whether each is a junk short message) are determined in advance by the user, the embodiments can serve different individual users with personalized junk short message interception, and can therefore improve the efficiency of short message identification.

Description of drawings

Fig. 1 is a schematic flowchart of the recognition feature library acquisition method of an embodiment of the invention;

Fig. 2 is a schematic flowchart of the short message identification method of an embodiment of the invention;

Fig. 3 is a schematic diagram of the unreliable region.

Embodiment

In the recognition feature library acquisition method and device and the short message identification method and device of the embodiments of the invention, short messages whose types have been determined through user reports form an analysis sample set; a junk short message feature library corresponding to the user is obtained from this sample set; and a naive Bayes model is then used to discriminate the short message to be identified. Because the junk short message feature library is obtained by analyzing short messages whose types were determined through user reports, distinctive, personalized junk short message interception can be provided for the individual user.

As shown in Fig. 1, the recognition feature library acquisition method of an embodiment of the invention comprises:

Step 11: form a sample set from a plurality of short messages that come from a user and whose message types are predetermined;

Step 12: perform character string extraction on each short message in the sample set to obtain a first character string set, all character strings in the first character string set being distinct;

Step 13: for each character string in the first character string set, count the number of short messages in the sample set that contain the character string;

Step 14: calculate, from the counting result, the mutual information of each character string with respect to the short message categories;

Step 15: select some or all of the character strings from the first character string set, in descending order of mutual information, to form the recognition feature library.

In a specific embodiment of the invention, the mutual information MI of each character string with respect to the short message categories is calculated from the counting result by the following formula:

MI(t_m, c_i) = \sum_{i=1}^{n} P(t_m, c_i) \log \frac{P(t_m, c_i)}{P(t_m) P(c_i)}

where:

c_i denotes the i-th category among the predefined short message categories, for example the two categories junk short message and normal short message;

P(t_m) denotes the ratio of the number of short messages in the sample set that contain the character string t_m to the total number of short messages in the sample set; for example, if the sample set contains 5 short messages and the character string "XX" occurs in 3 of them, then P(t_m) is 3/5;

P(c_i) denotes the ratio of the number of short messages in the sample set whose category is c_i to the total number of short messages in the sample set; for example, if the sample set contains 5 short messages and 3 of them are predetermined to be of the junk type c_1, then P(c_1) is 3/5;

P(t_m, c_i) denotes the ratio of the number of short messages in the sample set that contain the character string t_m and whose category is c_i to the total number of short messages in the sample set. For example, if the sample set contains 5 short messages, 3 of them contain the character string t_m, and 1 of those 3 belongs to the junk type c_1, then P(t_m, c_1) is 1/5.
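The mutual information calculation can be sketched in Python directly from these ratios. The function name `mutual_information`, the use of substring containment to test whether a message contains a string, the natural logarithm, and the rule that a category with zero joint count contributes nothing to the sum are all assumptions made for illustration.

```python
import math


def mutual_information(messages, labels, term):
    """MI of `term` with respect to the message categories.

    messages: list of message texts in the sample set.
    labels:   parallel list of predetermined category labels.
    """
    total = len(messages)
    contains = [term in text for text in messages]
    p_t = sum(contains) / total                      # P(t_m)
    mi = 0.0
    for category in set(labels):
        p_c = labels.count(category) / total         # P(c_i)
        joint = sum(1 for has, lab in zip(contains, labels)
                    if has and lab == category)
        p_tc = joint / total                         # P(t_m, c_i)
        if p_tc > 0:  # assumption: a zero joint probability adds nothing
            mi += p_tc * math.log(p_tc / (p_t * p_c))
    return mi
```

For the five-message example above (string present in 3 messages, 3 junk messages, 1 junk message containing the string), the junk term uses P(t_m, c_1) = 1/5 and the normal term uses P(t_m, c_2) = 2/5.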

The recognition feature library acquisition device of an embodiment of the invention comprises:

a character string extraction module, configured to perform character string extraction on each short message in the sample set to obtain a first character string set, all character strings in the first character string set being distinct.

In a specific embodiment of the invention, classification ability increases with the number of character strings in the recognition feature library, but the relationship is not linear. When the total number of character strings is small, classification ability improves markedly as strings are added. Once the total exceeds a certain threshold, adding strings no longer improves classification ability significantly, while the computational cost of classification keeps growing. In a specific embodiment of the invention, the number of character strings (features) in the recognition feature library can therefore be limited to a certain scale.

For example, when the gain in classification ability (such as classification accuracy) brought by adding a certain number of character strings falls below a preset threshold, no more character strings are added to the recognition feature library.

Of course, if recognition ability is to be maximized, or if the processing cost is not a concern, the number of character strings in the recognition feature library need not be limited.
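Both ways of sizing the library can be sketched as follows. The names `rank_by_mi` and `grow_feature_library` are hypothetical, and `evaluate` stands for any caller-supplied function that scores a candidate library (for example, classification accuracy on the sample set); the text does not prescribe how that score is computed.

```python
def rank_by_mi(mi_scores, max_features=None):
    """Character strings sorted by mutual information, largest first."""
    ranked = sorted(mi_scores, key=mi_scores.get, reverse=True)
    return ranked if max_features is None else ranked[:max_features]


def grow_feature_library(ranked, evaluate, min_gain, step=1):
    """Add strings in ranked order until the gain drops below min_gain.

    ranked:   character strings in descending order of mutual information.
    evaluate: callable taking a list of strings and returning a score.
    """
    library = []
    best = evaluate(library)
    for start in range(0, len(ranked), step):
        candidate = library + ranked[start:start + step]
        score = evaluate(candidate)
        if score - best < min_gain:   # gain too small: stop growing
            break
        library, best = candidate, score
    return library
```

Calling `rank_by_mi` with `max_features=None` corresponds to the unrestricted case in which all character strings are kept.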

The recognition feature library acquisition device of an embodiment of the invention can exist on its own as a server, or can run on the mobile phone.

Once the recognition feature library has been obtained, it can be used for short message identification. As shown in Fig. 2, the short message identification method of an embodiment of the invention comprises:

Step 21: obtain a short message to be discriminated, and perform character string extraction on it to obtain a second character string set;

Step 22: select from the feature library the character strings that are contained in the second character string set, to form a third character string set; the character strings in the feature library are obtained by extracting character strings from the short messages in the sample set and selecting among them according to the mutual information between character string and message type; the sample set comprises a plurality of short messages that come from a user and whose message types are predetermined;

Step 23: determine the coordinate (x, y) of the short message to be discriminated in a coordinate system according to the message type distribution of first short messages in the sample set, the first short messages being the short messages in the sample set that contain a character string of the third character string set;

Step 24: judge whether the short message to be discriminated is a junk short message according to the position of the coordinate (x, y) relative to a standard straight line in the coordinate system, the standard straight line being determined from the type information of the short messages in the sample set and the message type distribution of second short messages, the second short messages being the short messages in the sample set that contain a character string of the feature library.

In a specific embodiment of the invention, the coordinate of the short message to be discriminated in a coordinate system is determined from the messages in the sample set and used for discrimination. Because the short messages in the sample set come from the user, and their message types (that is, whether each is a junk short message) are determined in advance by the user, the short message identification method of the embodiment can serve different individual users with personalized junk short message interception.

Steps 12 and 21 both require character string extraction from a short message. In a specific embodiment of the invention, N-character string extraction is used, with N ranging from 2 to 4. Extraction of 2-character strings is taken as an example below.

Suppose the text of the short message to be discriminated is as follows (given here in translation): Group purchase South Mountain countdown! Blue Light Ten Li Lanshan: last chance this weekend to group-purchase South Mountain forest garden houses at 5% off, plus surprise discounts reserved for special floor plans; for details call 62586969.

2-character string extraction takes each pair of adjacent characters of the message text in turn. Rendered in translation, the result begins:

purchase by group, purchase south, South Mountain, mountain fall, fall meter, ...
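N-character string extraction can be sketched as a sliding window over the message text. The function name `extract_ngrams` is hypothetical, and two points are assumptions because the text does not settle them: whether one value of N or several are used at once (the `n_values` parameter allows either), and whether punctuation or spaces are removed first (this sketch removes nothing).

```python
def extract_ngrams(text, n_values=(2,)):
    """Return the set of distinct N-character substrings of `text`.

    n_values: the window lengths to extract, each expected in 2..4.
    Using a set makes every extracted character string distinct.
    """
    grams = set()
    for n in n_values:
        # Slide a window of length n one character at a time.
        for i in range(len(text) - n + 1):
            grams.add(text[i:i + n])
    return grams
```

Applying the same function to every message in the sample set and taking the union of the results gives the first character string set; applying it to a single incoming message gives the second character string set.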

In a specific embodiment of the invention, after character string extraction is performed on the short message to be discriminated, and taking a recognition feature library containing M character strings as an example, the following text vector is obtained:

d = (W_1, W_2, ..., W_M)

where W_i = 0 or 1: W_i = 1 if the i-th feature of the recognition feature library appears in the short message to be identified, and W_i = 0 otherwise.
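The text vector and the third character string set of step 22 can be sketched together. Both function names are hypothetical, and the feature library is assumed to be an ordered list so that position i corresponds to W_i.

```python
def third_string_set(feature_library, second_string_set):
    """Features of the library that occur in the message (step 22)."""
    return [f for f in feature_library if f in second_string_set]


def text_vector(feature_library, second_string_set):
    """Binary vector d = (W_1, ..., W_M) for one message.

    W_i is 1 if the i-th feature of the library is among the character
    strings extracted from the message, and 0 otherwise.
    """
    return [1 if f in second_string_set else 0 for f in feature_library]
```

The positions where the vector holds 1 are exactly the members of the third character string set.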

The judgment parameter f(d) of the short message to be identified is defined as follows:

f(d) = \log \frac{P\{c_1\}}{P\{c_2\}} + \sum_{k=1}^{M} \frac{\log(1 - p_{k1})}{\log(1 - p_{k2})} + \sum_{k=1}^{M} W_k \log \frac{p_{k1}}{1 - p_{k1}} - \sum_{k=1}^{M} W_k \log \frac{p_{k2}}{1 - p_{k2}}

where:

p_{ki} (k = 1, ..., M) denotes the probability that a short message in the sample set containing the k-th feature of the recognition feature library is a short message of the i-th type.

In a specific embodiment of the invention, i = 1, 2: when i = 1 the short message is a junk short message, and when i = 2 it is a normal short message.

In step 23, the coordinate (x, y) of the short message to be discriminated in a coordinate system is determined according to the message type distribution of the first short messages in the sample set, where:

x = \sum_{k=1}^{M} W_k \log \frac{p_{k1}}{1 - p_{k1}}

y = \sum_{k=1}^{M} W_k \log \frac{p_{k2}}{1 - p_{k2}}

Here x is a measure, estimated from the features, of the short message to be identified belonging to the first type (junk short message), and y is the corresponding measure of it belonging to the second type (normal short message).

Once the coordinate is determined, a standard straight line is needed, because whether the short message to be discriminated is a junk short message is judged from the position of the coordinate (x, y) relative to the standard straight line in the coordinate system. In a specific embodiment of the invention, the standard straight line can take several forms, described in turn below.

In mode one, the standard straight line is: x - y + Con = 0

where:

Con = \log \frac{P\{c_1\}}{P\{c_2\}} + \sum_{k=1}^{M} \frac{\log(1 - p_{k1})}{\log(1 - p_{k2})}

For a given sample set, Con is a constant.

With the standard straight line determined, the judgment is made according to whether the following inequality holds:

x - y + Con ≥ 0

When the inequality holds, f(d) is greater than or equal to 0 and the short message to be discriminated is a junk short message; otherwise it is a normal short message.
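The mode-one decision can be sketched in Python under stated assumptions. All function names are illustrative, and every p_k1 and p_k2 is assumed to lie strictly between 0 and 1 so that the logarithms are defined. One deliberate substitution should be noted: the per-feature term of Con is computed here as log((1 - p_k1) / (1 - p_k2)), the form that follows from a Bernoulli naive Bayes model, whereas the formula as printed above shows a ratio of two logarithms. Treat that choice as an assumption of this sketch rather than as the patent's definition.

```python
import math


def constant_term(prior_junk, prior_normal, p_junk, p_normal):
    """Con, using log((1 - p_k1) / (1 - p_k2)) per feature (assumed form)."""
    con = math.log(prior_junk / prior_normal)
    for pk1, pk2 in zip(p_junk, p_normal):
        con += math.log((1 - pk1) / (1 - pk2))
    return con


def coordinates(w, p_junk, p_normal):
    """Coordinate (x, y) of a message with text vector w."""
    x = sum(wk * math.log(pk1 / (1 - pk1)) for wk, pk1 in zip(w, p_junk))
    y = sum(wk * math.log(pk2 / (1 - pk2)) for wk, pk2 in zip(w, p_normal))
    return x, y


def judgment(w, prior_junk, prior_normal, p_junk, p_normal):
    """f(d) = x - y + Con for the message with text vector w."""
    x, y = coordinates(w, p_junk, p_normal)
    return x - y + constant_term(prior_junk, prior_normal, p_junk, p_normal)


def is_junk(w, prior_junk, prior_normal, p_junk, p_normal):
    """Mode one: junk when the point lies on or above x - y + Con = 0."""
    return judgment(w, prior_junk, prior_normal, p_junk, p_normal) >= 0
```

Because Con depends only on the sample set, it can be computed once and reused for every incoming message; only x and y change from message to message.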

In mode one, the standard straight line is x - y + Con = 0. If the coordinates of a portion of the short messages in the sample set are calculated and each is judged by its position relative to x - y + Con = 0, a discrimination result is obtained for those short messages. Analysis shows that for some of them the discrimination result (junk or not) differs from the predetermined message type. Such cases are few, but they show that the discrimination result is not always accurate.

Compared with correctly classified short messages, misclassified short messages lie closer to the dividing line in the coordinate system. Based on this observation, the two-dimensional plane formed by X and Y can be divided into a reliable region and an unreliable region.

As shown in Fig. 3, the unreliable region is the region formed by the coordinate points whose distance to the line x - y + Con = 0 lies within a predetermined interval [dist2, dist1] (the region marked by the dashed lines); all other regions are reliable.

In a specific embodiment of the invention, the predetermined interval [dist2, dist1] can be obtained in either of the following ways.

In the first way, the line x - y + Con = 0 is used as the judgment criterion: each short message in the sample set is projected into the coordinate system, a judgment is obtained from the position of its projected point relative to the line, and [dist2, dist1] is then decided from the distribution of the projected points that were judged incorrectly (inconsistently with the predetermined message type). For example:

dist2 is the maximum distance to x - y + Con = 0 among the projected points of misjudged short messages lying on one side of the line, and dist1 is the maximum distance to x - y + Con = 0 among the projected points of misjudged short messages lying on the other side.

In the second way, [dist2, dist1] is set according to the short message identification accuracy: as shown in Fig. 3, [dist2, dist1] is chosen so that the proportion of correctly identified short messages among those whose projected points lie outside the dashed lines exceeds a preset threshold (such as 95%).
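The first way of obtaining [dist2, dist1] can be sketched as follows. The sketch assumes a signed distance, (x - y + Con) / sqrt(2), so that points on the normal side of the line have negative distance and dist2 ≤ 0 ≤ dist1; the text does not state the sign convention, and the function names are hypothetical. Con is passed in as a precomputed number.

```python
import math


def signed_distance(x, y, con):
    """Signed distance from (x, y) to the line x - y + Con = 0."""
    return (x - y + con) / math.sqrt(2)


def unreliable_interval(points, is_junk_labels, con):
    """[dist2, dist1] from the misjudged sample messages.

    points:         list of (x, y) coordinates of sample messages.
    is_junk_labels: parallel list, True if predetermined to be junk.
    """
    dist2, dist1 = 0.0, 0.0
    for (x, y), junk in zip(points, is_junk_labels):
        d = signed_distance(x, y, con)
        predicted_junk = d >= 0            # mode-one judgment
        if predicted_junk != junk:         # misjudged message
            if d < 0:
                dist2 = min(dist2, d)      # farthest error on the normal side
            else:
                dist1 = max(dist1, d)      # farthest error on the junk side
    return dist2, dist1


def in_unreliable_region(x, y, con, dist2, dist1):
    """True if (x, y) lies within the interval around the first line."""
    return dist2 <= signed_distance(x, y, con) <= dist1
```

If no sample message is misjudged, the interval collapses to [0, 0] and only points exactly on the line count as unreliable.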

To improve discrimination accuracy, when the position of the short message to be discriminated in the coordinate system falls in the unreliable region, a second form of the standard straight line is used for discrimination:

α*X - Y + β*Con = 0

where α is a rotation factor and β is a translation factor.

This standard straight line is obtained from x - y + Con = 0 by rotation and translation. The purpose of introducing α and β is to improve discrimination accuracy; how the two parameters are obtained is described in detail below.

β translates the original dividing line x - y + Con = 0, and α rotates it.
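The two-stage decision can be sketched as a single function. The name `classify` is hypothetical; the signed distance to the first line and the rule that α*x - y + β*Con ≥ 0 means junk (by analogy with mode one) are assumptions, and Con, dist2, dist1, α, and β are taken as precomputed inputs.

```python
import math


def classify(x, y, con, dist2, dist1, alpha, beta):
    """Return True if the message at (x, y) is judged to be junk.

    Points in the unreliable region are judged against the second line
    alpha*x - y + beta*Con = 0; all others against x - y + Con = 0.
    """
    distance = (x - y + con) / math.sqrt(2)   # signed distance to first line
    if dist2 <= distance <= dist1:
        return alpha * x - y + beta * con >= 0
    return x - y + con >= 0
```

With α = 1 and β = 1 the second line coincides with the first, so the function reduces to the mode-one decision.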

In a specific embodiment of the invention, the optimal dividing line within the unreliable region of the text distribution can be determined by searching for the parameters β and α with a genetic algorithm.

The span of threshold value beta and α is relevant with the scope in insecure zone in the two-dimensional textual space, and in the specific embodiment of the invention, the span of concrete β is following:

When Con greater than 0 the time,

β &Element; (1 - \sqrt{2} * \frac{| Dist 2 |}{Con}, 1 + \sqrt{2} * \frac{| Dist 1 |}{Con});

When Con less than 0 the time,

β &Element; (1 + \sqrt{2} * \frac{| Dist 2 |}{Con}, 1 - \sqrt{2} * \frac{| Dist 1 |}{Con});

When Con equals 0, β=0.
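The three cases for β can be collected into a small helper. The sketch below is illustrative and the name `beta_range` is hypothetical. One plausible reading of the bounds is that, for α = 1, the distance between the lines x-y+Con=0 and x-y+β*Con=0 equals |β-1|·|Con|/√2, so keeping the translated line inside the unreliable region bounds β by 1 ± √2·|dist|/|Con|.

```python
import math

def beta_range(con, dist2, dist1):
    """Return the (lower, upper) search interval for beta."""
    if con == 0:
        return (0.0, 0.0)  # the source fixes beta = 0 in this case
    k2 = math.sqrt(2) * abs(dist2) / con
    k1 = math.sqrt(2) * abs(dist1) / con
    if con > 0:
        return (1 - k2, 1 + k1)
    # con < 0: k2 and k1 are negative, so the lower bound still comes first
    return (1 + k2, 1 - k1)
```

Note that for Con < 0 the sign flip in the formula keeps the interval correctly ordered, since dividing by a negative Con makes both fractions negative.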

In the unreliable region of the two-dimensional text space, the angle between the text dividing line and the X axis can in theory range from 0 to 90 degrees; in a specific embodiment of the present invention, the preferred value range of α is 0.36 to 2.75.

A genetic algorithm (GA) is a global probabilistic search algorithm based on mechanisms of biological evolution such as natural selection and genetic variation. Like derivative-based analytic methods and other heuristic search methods (such as hill climbing, simulated annealing, and the Monte Carlo method), a genetic algorithm is formally an iterative method.

Starting from a selected initial solution, it progressively improves the current solution through continuous iteration until an optimal or satisfactory solution is found. In evolutionary computation, the iterative process simulates the evolutionary mechanism of living organisms: starting from a group of solutions (a population), it uses operations similar to natural selection and sexual reproduction to generate, on the basis of inheriting the original good genes, a next-generation population of solutions with better performance.

When generating the offspring population, the chromosomes of the current population are first sorted from best to worst, and a certain proportion of the lowest-ranked individuals is eliminated; the elimination ratio may be set to 40%. Uniform crossover is then performed among the higher-ranked individuals, and the resulting child individuals are added to the population to keep the population size constant. Finally, mutation is performed according to the set mutation probability to produce the offspring population.
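One generation of the procedure just described (rank, eliminate the worst 40%, refill by uniform crossover, then mutate) might look like the following sketch. The function name, the bit-list chromosome representation, the default mutation probability, and the choice to apply mutation to every individual including the survivors are assumptions; the source does not specify them.

```python
import random

def next_generation(population, fitness, elim_ratio=0.4, p_mut=0.01, rng=random):
    """Produce the offspring population from a list of equal-length bit lists."""
    # Sort chromosomes from best to worst.
    ranked = sorted(population, key=fitness, reverse=True)
    # Eliminate the lowest-ranked share, keeping at least two parents.
    n_keep = max(2, len(ranked) - int(len(ranked) * elim_ratio))
    survivors = ranked[:n_keep]
    children = []
    while len(survivors) + len(children) < len(population):
        a, b = rng.sample(survivors, 2)
        # Uniform crossover: each bit comes from either parent with probability 1/2.
        children.append([x if rng.random() < 0.5 else y for x, y in zip(a, b)])
    new_pop = [list(c) for c in survivors] + children
    # Mutation: flip each bit independently with probability p_mut.
    for chrom in new_pop:
        for i in range(len(chrom)):
            if rng.random() < p_mut:
                chrom[i] ^= 1
    return new_pop
```

The population size stays constant because exactly as many children are created as individuals were eliminated.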

Because of the good performance GA shows in searching a problem space for an optimal value, in a specific embodiment of the present invention GA is introduced into the optimized Naive Bayes classification model to determine the thresholds β and α.

β and α are real numbers that take values within a certain range and can be regarded as the phenotype in the genetic algorithm; the mapping from phenotype to genotype is called encoding. Binary encoding is adopted here: the individual represented by the values of β and α is expressed as a {0, 1} binary string, whose length depends on the required precision of the solution. For example, if the solution must be accurate to 3 decimal places and the interval length is 0.5, the interval must be divided into 0.5 × 10³ equal parts. Since 256 = 2⁸ < 0.5 × 10³ < 2⁹ = 512, the encoded binary string needs at least 9 bits.
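The string-length calculation in this example, together with one common way of decoding a binary string back into a real value, can be sketched as below. The helper names and the linear decoding rule are assumptions; the source gives only the length calculation.

```python
def bits_needed(interval_length, decimals):
    """Smallest n such that 2**n is at least the number of equal parts."""
    parts = round(interval_length * 10 ** decimals)
    return max(1, (parts - 1).bit_length())

def decode(bits, lo, hi):
    """Map a list of 0/1 bits linearly onto the real interval [lo, hi]."""
    value = int("".join(str(b) for b in bits), 2)
    return lo + (hi - lo) * value / (2 ** len(bits) - 1)
```

For an interval of length 0.5 and 3 decimal places, `bits_needed(0.5, 3)` gives 9, matching the worked example in the text.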

There are three main performance and efficiency evaluation indexes in short message classification: precision P, recall R, and F-measure, where:

P = A/B

Here A is the number of short messages in the sample set that are correctly identified as junk short messages when the second standard line is used for identification, and B is the number of short messages in the sample set that are identified as junk short messages when the second standard line is used. P defines the accuracy of the classification result, that is, how much of the classification result is correct.

R = A/C

Here C is the number of short messages in the sample set that are predefined as junk short messages. R describes the ability to classify correctly, that is, what proportion of the short messages predefined as junk short messages are correctly identified.

For a given test, precision and recall generally trade off against each other: raising precision tends to lower recall, and raising recall tends to lower precision. F-measure combines the two indexes P and R and can be used to evaluate the classifier as a whole, as follows:

F = \frac{(μ + 1) \cdot PR}{μP + R}

where μ is greater than or equal to 0 and is a constant that adjusts the relative importance of P and R; the larger μ is, the more important R is. When μ=0, F=P, that is, the precision.

F can also be rewritten in the following form:

F = \frac{(μ + 1) \cdot PR}{μP + R} = \frac{\frac{μ + 1}{μ} \cdot PR}{P + R / μ}

so when μ → ∞, F=R, that is, the recall.

In a specific embodiment of the present invention, once μ has been selected, the problem in this embodiment becomes computing:

\underset{(α, β)}{\arg \max} \frac{(μ + 1) \cdot PR}{μP + R}

That is, the α and β that maximize F are computed.

Normally, if P and R are treated as equally important, μ=1 is taken, which gives the most commonly used F (F1 for short), as follows:

F_{1} = \frac{2 \times P \times R}{P + R}
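The definitions of P, R, and F above translate directly into a few lines of Python. The function name and the choice to return 0 when no junk short message is correctly identified are assumptions made for this sketch.

```python
def f_measure(a, b, c, mu=1.0):
    """F-measure from the counts A, B, C defined in the text.

    a: messages correctly identified as junk
    b: messages identified as junk
    c: messages predefined as junk
    """
    if a == 0 or b == 0 or c == 0:
        return 0.0  # assumed convention to avoid division by zero
    p = a / b  # precision
    r = a / c  # recall
    return (mu + 1) * p * r / (mu * p + r)
```

With μ=0 the function returns P, with μ=1 it returns F1 = 2PR/(P+R), and for very large μ it approaches R, in line with the limiting cases stated in the text.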

Because β and α are real numbers that take values within a certain range and can be regarded as the phenotype in the genetic algorithm, a genetic algorithm can be used to compute the α and β that maximize F.

Of course, the value ranges of α and β can also be divided into equal parts. For each combination of the divided values of α and β, each short message in the sample set is projected into the coordinate system, a judgment result is obtained from the position of the projected point relative to the line, P and R are computed from the judgment result, and F is computed from P and R.

Finally, the α and β that maximize F are selected as the final result.

An example follows.

Suppose the value ranges of α and β are [0.36, 2.75] and [1, 3] respectively. Dividing [0.36, 2.75] and [1, 3] into 100 equal parts each yields 10000 possible combinations.

The sample set is then identified under each of these 10000 combinations. Each combination corresponds to one value of F, and the α and β that maximize F are selected as the final result.
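The exhaustive search over the 10000 combinations can be sketched as follows. Several details are assumptions not given in the source: one candidate value is taken at the midpoint of each of the 100 equal sub-intervals, a point is judged to be a junk short message when α*x-y+β*Con > 0, and the function names are hypothetical.

```python
def grid(lo, hi, n=100):
    """Midpoints of n equal sub-intervals of [lo, hi] (assumed sampling)."""
    step = (hi - lo) / n
    return [lo + (i + 0.5) * step for i in range(n)]

def f_score(points, is_junk, con, alpha, beta, mu=1.0):
    """F for one (alpha, beta) pair on the projected sample set."""
    # Assumption: the positive side of the second line means "junk".
    judged = [alpha * x - y + beta * con > 0 for x, y in points]
    a = sum(j and t for j, t in zip(judged, is_junk))  # correctly judged junk
    b = sum(judged)                                    # judged as junk
    c = sum(is_junk)                                   # predefined as junk
    if a == 0:
        return 0.0
    p, r = a / b, a / c
    return (mu + 1) * p * r / (mu * p + r)

def grid_search(points, is_junk, con, alphas, betas, mu=1.0):
    """Return (best F, alpha, beta) over all combinations."""
    return max(((f_score(points, is_junk, con, al, be, mu), al, be)
                for al in alphas for be in betas), key=lambda t: t[0])
```

A call such as `grid_search(points, is_junk, con, grid(0.36, 2.75), grid(1, 3))` evaluates all 100 × 100 combinations; when several combinations tie for the maximum F, the first one encountered is returned.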

Of course, the values of α and β that maximize F can also be computed by other existing algorithms, which are not described in detail here.

Each identification of a short message consumes processing capacity of the terminal. When the calling number of the short message to be identified is in the contact list, the short message was sent by someone the user knows and does not need to be identified; when the calling number is in the blacklist, the short message is one the user does not want to receive and likewise does not need to be identified. Therefore, to improve processing efficiency, in a specific embodiment of the present invention, after the short message to be identified is obtained, the method further comprises:

The short message identification device of the embodiment of the invention comprises:

In the above short message identification device, the standard line is: x-y+Con=0, where:

Con = \log \frac{P(c_{1})}{P(c_{2})} + \sum_{k = 1}^{M} \frac{\log (1 - p_{k1})}{\log (1 - p_{k2})}

In the above short message identification device, the standard line may also comprise a first standard line and a second standard line, where the first standard line is x-y+Con=0 and the second standard line is α*x-y+β*Con=0, where:

Con = \log \frac{P(c_{1})}{P(c_{2})} + \sum_{k = 1}^{M} \frac{\log (1 - p_{k1})}{\log (1 - p_{k2})}

α and β are the rotation factor and the translation factor, respectively;

Said recognition processing module specifically comprises:

In the above short message identification device:

F = (μ+1)·PR/(μP+R);

P = A/B;

R = A/C;

μ is the importance adjustment factor, and μ is greater than or equal to 0.

The values of α and β are the values that maximize F.

When P and R are considered equally important, μ is 1; in this case, the values of α and β are the values that maximize 2PR/(P+R).

In a specific embodiment of the present invention, when the recognition feature library acquisition device is implemented as a server, users need to upload short messages whose message type has been determined. At the same time, the terminal needs to synchronize from the server the recognition feature library computed by the server and the values of α and β, so that short message identification can be performed locally.

The above is only a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art may make improvements and refinements without departing from the principle of the present invention, and these improvements and refinements shall also be regarded as falling within the protection scope of the present invention.

Claims

1. A recognition feature library acquisition method, characterized by comprising:

2. The recognition feature library acquisition method according to claim 1, characterized in that the mutual information MI of the character string with respect to the short message categories is as follows:

MI(t_{m}, c_{i}) = \sum_{i = 1}^{n} P(t_{m}, c_{i}) \log \frac{P(t_{m}, c_{i})}{P(t_{m}) P(c_{i})}

where:

c_{i} denotes the i-th category among the predefined short message categories;

3. A recognition feature library acquisition device, characterized by comprising:

4. The recognition feature library acquisition device according to claim 3, characterized in that the mutual information MI of the character string with respect to the short message categories is as follows:

MI(t_{m}, c_{i}) = \sum_{i = 1}^{n} P(t_{m}, c_{i}) \log \frac{P(t_{m}, c_{i})}{P(t_{m}) P(c_{i})}

where:

c_{i} denotes the i-th category among the predefined short message categories;

5. A short message identification method using the recognition feature library obtained by the recognition feature library acquisition method according to claim 1 or 2, characterized by comprising:

6. The short message identification method according to claim 5, characterized in that the standard line is: x-y+Con=0, where:

Con = \log \frac{P(c_{1})}{P(c_{2})} + \sum_{k = 1}^{M} \frac{\log (1 - p_{k1})}{\log (1 - p_{k2})}

7. The short message identification method according to claim 5, characterized in that the standard line comprises a first standard line and a second standard line, the first standard line being x-y+Con=0 and the second standard line being α*x-y+β*Con=0, where:

Con = \log \frac{P(c_{1})}{P(c_{2})} + \sum_{k = 1}^{M} \frac{\log (1 - p_{k1})}{\log (1 - p_{k2})}

α is the rotation factor, and β is the translation factor;

8. The short message identification method according to claim 7, characterized in that:

F = (μ+1)·PR/(μP+R);

P = A/B;

R = A/C;

μ is the importance adjustment factor, and μ is greater than or equal to 0;

the values of α and β are the values that maximize F.

9. The short message identification method according to any one of claims 5-8, characterized in that, after the short message to be identified is obtained, the method further comprises:

10. A short message identification device using the recognition feature library obtained by the recognition feature library acquisition method according to claim 1 or 2, characterized by comprising:

11. The short message identification device according to claim 10, characterized in that the standard line is: x-y+Con=0, where:

Con = \log \frac{P(c_{1})}{P(c_{2})} + \sum_{k = 1}^{M} \frac{\log (1 - p_{k1})}{\log (1 - p_{k2})}

12. The short message identification device according to claim 10, characterized in that the standard line comprises a first standard line and a second standard line, the first standard line being x-y+Con=0 and the second standard line being α*x-y+β*Con=0, where:

Con = \log \frac{P(c_{1})}{P(c_{2})} + \sum_{k = 1}^{M} \frac{\log (1 - p_{k1})}{\log (1 - p_{k2})}

13. The short message identification device according to claim 12, characterized in that:

F = (μ+1)·PR/(μP+R);

P = A/B;

R = A/C;

μ is the importance adjustment factor, and μ is greater than or equal to 0;

the values of α and β are the values that maximize F.