CN106156120A

CN106156120A - Method and apparatus for classifying character strings

Info

Publication number: CN106156120A
Application number: CN201510162076.9A
Authority: CN
Inventors: 李家宏
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba Group Holding Ltd
Priority date: 2015-04-07
Filing date: 2015-04-07
Publication date: 2016-11-23
Anticipated expiration: 2035-04-07
Also published as: CN106156120B

Abstract

The invention discloses a method and apparatus for classifying character strings, belonging to the field of computer communication technology. The method includes: obtaining a character string to be classified; extracting multiple classification features from the character string to be classified; normalizing each classification feature separately to obtain multiple normalized classification features; and classifying the character string to be classified with a classification model obtained by offline training, according to the multiple normalized classification features, to obtain the classification result of the character string to be classified. The apparatus includes an acquisition module, a first extraction module, a normalization module, and a classification module. Because the classification is performed by an offline-trained classification model on multiple normalized classification features, it requires no manual work, can be carried out automatically, and is highly efficient.

Description

Method and apparatus for classifying character strings

Technical field

The present invention relates to the field of computer communication technology, and in particular to a method and apparatus for classifying character strings.

Background

With the development of computer communication technology, terminal devices such as computers, tablets, and mobile phones have increasingly become indispensable tools in people's life and work. At the same time, more and more service devices provide back-end services such as networking and computing, and ever higher demands are placed on the service capability of computing devices such as terminal devices and service devices. In many scenarios (for example, a registration bot maliciously registering a large number of invalid accounts, or an attacking machine maliciously forging a large number of invalid domain name requests), a computing device receives a large number of random character strings (such as "aaaxbhzqegs-2", "4s7pTDAOV-L#", "！OC | w4&s", etc.). These random strings carry no meaning, but the computing device does not know this when it first receives them, so it processes them as normal, meaningful character strings (such as "alibaba-inc", "helloworld", etc.), which affects the normal operation of the computing device.

To avoid affecting the normal operation of the computing device, the character strings it receives can be classified to distinguish which are random strings and which are normal strings, so that the computing device can process different strings differently. At present, character strings are classified manually, according to the semantics and context of the string itself.

The existing method of classifying character strings thus relies on manual work, and its efficiency is very low.

Summary of the invention

To solve the problem in the prior art, the invention provides a method and apparatus for classifying character strings. A classification model obtained by offline training classifies the character string to be classified according to multiple normalized classification features and yields its classification result; this requires no manual work, can be carried out automatically, and is highly efficient.

To solve the above problem, the invention discloses a method for classifying character strings, the method including:

Obtaining a character string to be classified;

Extracting multiple classification features from the character string to be classified;

Normalizing each classification feature separately to obtain multiple normalized classification features;

Classifying the character string to be classified with a classification model obtained by offline training, according to the multiple normalized classification features, to obtain the classification result of the character string to be classified.

Further, before the character string to be classified is obtained, the method also includes:

Extracting the multiple classification features from each character string in the test set and normalizing them, to obtain the multiple normalized classification features of each character string in the test set;

Testing the classification model whose undetermined parameters are set to the trained values, using the multiple normalized classification features of each character string in the test set and the classification result of each character string in the test set, to obtain a test result;

Comparing the accuracy of the test result with a preset accuracy threshold;

If the accuracy of the test result is greater than the preset accuracy threshold, determining the classification model whose undetermined parameters are set to the trained values to be the classification model obtained by offline training, and then performing the step of obtaining the character string to be classified.

Further, before the multiple classification features are extracted from each character string in the test set, the method also includes:

Collecting a sample set for a preset classification model and dividing the sample set into a training set and a test set, where the sample set includes a preset number of character strings and the classification result of each of those character strings;

Extracting the multiple classification features from each character string in the training set and normalizing them, to obtain the multiple normalized classification features of each character string in the training set;

Training the undetermined parameters of the preset classification model, using the multiple normalized classification features of each character string in the training set and the classification result of each character string in the training set, to obtain the trained values of the undetermined parameters.

Further, after the accuracy of the test result is compared with the preset accuracy threshold, the method also includes:

If the accuracy of the test result is less than or equal to the preset accuracy threshold, determining that the classification model whose undetermined parameters are set to the trained values cannot serve as the classification model obtained by offline training, and then performing the step of collecting the sample set for the preset classification model.

Further, the classification result of the character string to be classified is one of the following:

The character string to be classified is a random character string, or the character string to be classified is a normal character string.

Further, the classification model includes:

A support vector machine (SVM) classification model, a decision tree classification model, a Bayesian classification model, or a K-nearest-neighbor classification model.

Further, the classification features include:

The longest adjacent vowel distance, the string information entropy, or the string length, where the longest adjacent vowel distance is the largest of the spacing distances between all adjacent vowel characters of a character string.

To solve the above problem, the invention also discloses an apparatus for classifying character strings, the apparatus including:

An acquisition module, configured to obtain a character string to be classified;

A first extraction module, configured to extract multiple classification features from the character string to be classified;

A normalization module, configured to normalize each classification feature separately to obtain multiple normalized classification features;

A classification module, configured to classify the character string to be classified with a classification model obtained by offline training, according to the multiple normalized classification features, to obtain the classification result of the character string to be classified.

Further, the apparatus also includes:

A collection module, configured to collect a sample set for a preset classification model and divide the sample set into a training set and a test set, where the sample set includes a preset number of character strings and the classification result of each of those character strings;

A second extraction module, configured to extract the multiple classification features from each character string in the training set and normalize them, to obtain the multiple normalized classification features of each character string in the training set;

A training module, configured to train the undetermined parameters of the preset classification model, using the multiple normalized classification features of each character string in the training set and the classification result of each character string in the training set, to obtain the trained values of the undetermined parameters;

A third extraction module, configured to extract the multiple classification features from each character string in the test set and normalize them, to obtain the multiple normalized classification features of each character string in the test set;

A test module, configured to test the classification model whose undetermined parameters are set to the trained values, using the multiple normalized classification features of each character string in the test set and the classification result of each character string in the test set, to obtain a test result;

A comparison module, configured to compare the accuracy of the test result with a preset accuracy threshold;

A first determining module, configured to, if the accuracy of the test result is greater than the preset accuracy threshold, determine the classification model whose undetermined parameters are set to the trained values to be the classification model obtained by offline training, and then notify the acquisition module to perform the step of obtaining the character string to be classified.

Further, the apparatus also includes:

A second determining module, configured to, if the accuracy of the test result is less than or equal to the preset accuracy threshold, determine that the classification model whose undetermined parameters are set to the trained values cannot serve as the classification model obtained by offline training, and then notify the collection module to perform the step of collecting the sample set for the preset classification model.

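Further, the classification model includes:

A support vector machine (SVM) classification model, a decision tree classification model, a Bayesian classification model, or a K-nearest-neighbor classification model.

Further, the classification features include:

The longest adjacent vowel distance, the string information entropy, or the string length, where the longest adjacent vowel distance is the largest of the spacing distances between all adjacent vowel characters of a character string.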

Compared with the prior art, the invention can achieve the following technical effects:

1) A classification model obtained by offline training classifies the character string to be classified according to multiple normalized classification features and yields its classification result; this requires no manual work, can be carried out automatically, and is highly efficient.

2) Testing the trained classification model on a test set can improve the accuracy of the classification model.

3) The classification features, which include the longest adjacent vowel distance, the string information entropy, or the string length, capture the characteristics of a character string well and improve the accuracy of the classification result.

Of course, a product implementing the invention does not necessarily need to achieve all of the above technical effects at the same time.

Brief description of the drawings

The accompanying drawings described here provide a further understanding of the invention and form a part of it. The illustrative embodiments of the invention and their descriptions serve to explain the invention and do not unduly limit it. In the drawings:

Fig. 1 is a flowchart of a first method for classifying character strings according to an embodiment of the invention;

Fig. 2 is a flowchart of a second method for classifying character strings according to an embodiment of the invention;

Fig. 3 is a schematic structural diagram of a first apparatus for classifying character strings according to an embodiment of the invention;

Fig. 4 is a schematic structural diagram of a second apparatus for classifying character strings according to an embodiment of the invention;

Fig. 5 is a schematic structural diagram of a third apparatus for classifying character strings according to an embodiment of the invention.

Detailed description of the invention

Embodiments of the invention are described in detail below with reference to the drawings and examples, so that how the invention applies technical means to solve the technical problem and achieve the technical effect can be fully understood and implemented accordingly.

In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

Memory may include volatile memory in computer-readable media, in forms such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media include permanent and non-permanent, removable and non-removable media, which can store information by any method or technology. The information can be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.

Certain terms are used throughout the specification and claims to refer to particular components. Those skilled in the art will appreciate that hardware manufacturers may refer to the same component by different names. The specification and claims do not distinguish components by differences in name, but by differences in function. "Comprising", as used throughout the specification and claims, is an open-ended term and should be interpreted as "including but not limited to". "Substantially" means that, within an acceptable error range, those skilled in the art can solve the technical problem and basically achieve the technical effect. In addition, "coupled" here covers any direct or indirect means of electrical coupling. Therefore, if the text describes a first device as coupled to a second device, the first device may be electrically coupled to the second device directly, or indirectly through other devices or coupling means. The following description presents preferred embodiments of the invention; it is intended to illustrate the general principles of the invention and not to limit its scope. The scope of protection of the invention is defined by the appended claims.

It should also be noted that the terms "include", "comprise", and any variants of them are intended to be non-exclusive, so that a product or system that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to that product or system. Without further limitation, an element introduced by the phrase "including a ..." does not exclude the presence of other identical elements in the product or system that includes the element.

Description of embodiments

The implementation of the method of the invention is further described below through an embodiment. As shown in Fig. 1, which is a flowchart of a method for classifying character strings according to an embodiment of the invention, the method includes:

S101: Obtain a character string to be classified.

Specifically, any character string input to the computing device can be obtained, and the obtained string is treated as the character string to be classified.

S102: Extract multiple classification features from the character string to be classified.

Specifically, the classification features include the longest adjacent vowel distance, the string information entropy, or the string length.

Specifically, the longest adjacent vowel distance is the largest of the spacing distances between all adjacent vowel characters of a character string. In this embodiment, "-" and the end of the string are also treated as vowel characters; this is not limiting, and in practice other special characters can be treated as vowel characters as needed. For example, the adjacent vowel pairs of the string "alibaba-inc" are (a, i), (i, a), (a, a), (a, -), (-, i), and (i, end of string). The spacing distances between these adjacent vowel characters are, in order, 1, 1, 1, 0, 0, and 2 characters, so the longest adjacent vowel distance of "alibaba-inc" is 2 characters.

It should be noted that a vowel is produced by vibrating the vocal cords. The adjacent vowel distance characterizes the length of each syllable in a string and reflects the rhythm of its pronunciation. A normal string (a meaningful word or phrase) has relatively short syllables and a fairly even rhythm so that it is easy to pronounce, and its longest adjacent vowel distance tends to be short: for "alibaba-inc", the adjacent vowel distances are [1, 1, 1, 0, 0, 2] and the longest adjacent vowel distance is 2. A meaningless random string is not constrained by pronunciation, so its syllables are longer and have no rhythm. In addition, the probability that a vowel occurs (< 5/26) is much smaller than the probability that a non-vowel occurs, so runs of consecutive non-vowels are more likely and the longest adjacent vowel distance tends to be long: for "aaaxbhzqegs-2", the adjacent vowel distances are [0, 0, 5, 2, 1] and the longest adjacent vowel distance is 5.
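The two worked examples above can be reproduced with a short Python sketch. The function names are hypothetical, and two details are assumptions the text does not settle: vowels are matched case-insensitively against "aeiou", and characters before the first vowel are not counted as a gap.

```python
VOWELS = frozenset("aeiou")

def adjacent_vowel_gaps(s, extra_vowels="-"):
    """Return the number of characters between each pair of adjacent vowels.

    Characters in `extra_vowels` and the end of the string are treated as
    vowels, following the convention of this embodiment.
    """
    marks = [i for i, ch in enumerate(s)
             if ch.lower() in VOWELS or ch in extra_vowels]
    marks.append(len(s))  # the end of the string counts as a vowel
    return [b - a - 1 for a, b in zip(marks, marks[1:])]

def longest_adjacent_vowel_gap(s, extra_vowels="-"):
    gaps = adjacent_vowel_gaps(s, extra_vowels)
    # A string with no vowel pair yields 0 here (an assumption).
    return max(gaps) if gaps else 0

print(adjacent_vowel_gaps("alibaba-inc"))            # [1, 1, 1, 0, 0, 2]
print(longest_adjacent_vowel_gap("aaaxbhzqegs-2"))   # 5
```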

Specifically, the string information entropy H characterizes how random a string is. It is computed as:

H = -\sum_{i=1}^{N} p_{i} \log_{2} p_{i}

where N is the number of (distinct) characters in the string and p_i is the probability with which the i-th character occurs in the string.

It should be noted that the character arrangement of a normal string (a meaningful word or phrase) follows writing conventions and cannot be arbitrary, so its degree of randomness is not high and its string information entropy tends to be low; for example, the string information entropy of "alibaba-inc" is 2.44. The character arrangement of a meaningless random string is unconstrained, so its degree of randomness is higher and its string information entropy tends to be high; for example, the information entropy of "aaaxbhzqegs-2" is 3.19.
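A minimal Python sketch of the entropy computation follows. It assumes that p_i is the relative frequency of each distinct character in the string; the function name and the `ignore` parameter are hypothetical and only there for experimentation.

```python
import math
from collections import Counter

def string_entropy(s, ignore=""):
    """Shannon entropy (base 2) of the character distribution of `s`.

    Characters listed in `ignore` are dropped before counting.
    """
    chars = [ch for ch in s if ch not in ignore]
    if not chars:
        return 0.0
    total = len(chars)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(chars).values())
```

With `ignore="-"`, this sketch gives about 2.45 for "alibaba-inc" and about 3.19 for "aaaxbhzqegs-2", close to the 2.44 and 3.19 quoted above; with the hyphen counted it gives about 2.66 and 3.33. Whether the embodiment excludes "-" from the entropy is not stated in the text, so this is only an observation about the arithmetic.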

S103: Normalize each classification feature separately to obtain multiple normalized classification features.

Specifically, in this embodiment the normalization can use the Z-score method, computed as:

X_{j} = \frac{x_{j} - μ_{j}}{δ_{j}}

where X_j is the j-th normalized classification feature, x_j is the j-th classification feature, μ_j is the sample mean of the j-th classification feature, and δ_j is the sample standard deviation of the j-th classification feature. μ_j and δ_j can be computed from statistics over the sample set while the classification model is trained offline on the collected sample set of character strings.

It should be noted that normalization is not limited to the Z-score method; any other feasible method can be used, and this is not specifically limited.
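As a sketch of the Z-score step, the following Python estimates μ_j and δ_j from the sample set and applies them to a new feature vector. The function names are hypothetical; `statistics.stdev` is the sample standard deviation, matching the description above.

```python
import statistics

def fit_zscore(columns):
    """Estimate (mean, sample standard deviation) for each feature.

    `columns` holds one list of raw values per classification feature,
    taken from the sample set.
    """
    return [(statistics.mean(col), statistics.stdev(col)) for col in columns]

def normalize(features, params):
    """Apply X_j = (x_j - mu_j) / delta_j to one raw feature vector.

    A feature whose standard deviation is 0 would divide by zero; this
    sketch does not handle that case.
    """
    return [(x - mu) / sd for x, (mu, sd) in zip(features, params)]

params = fit_zscore([[1, 2, 3], [2, 4, 6]])
print(normalize([3, 2], params))  # [1.0, -1.0]
```

The statistics are fitted once on the sample set and reused at classification time, so that online strings are scaled in the same way as the training data.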

S104: Classify the character string to be classified with the classification model obtained by offline training, according to the multiple normalized classification features, to obtain the classification result of the character string to be classified.

Specifically, the classification model obtained by offline training can be an SVM (Support Vector Machine) classification model, a decision tree classification model, a Bayesian classification model, or a K-nearest-neighbor (K-NN) classification model. Each classification model is described in S105.

Specifically, classifying the character string to be classified requires the classification model obtained by offline training, so the classification model must first be obtained offline. In a preferred embodiment of the invention, referring to Fig. 2, the method also includes the following before the character string to be classified is obtained:

S105: Collect a sample set for the preset classification model and divide the sample set into a training set and a test set, where the sample set includes a preset number of character strings and the classification result of each of those character strings.

Specifically, a large number (for example, 1,000,000) of random strings and a large number (for example, 300,000) of normal strings can be selected at random as the sample set for the preset classification model. The sample set is randomly divided into a training set and a test set in a certain ratio (for example, 6:4), where the training set is used to train the preset classification model and the test set is used to test the trained classification model.

It should be noted that, because the sample set is collected from known random strings and known normal strings, the classification result of every string in the sample set is known. For convenience in later steps, a string classified as a random string can be labeled with type 0 and a string classified as a normal string with type 1. The two types need not be distinguished by 0 and 1; any other feasible scheme can be used, and this is not specifically limited.
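The labeling and the random 6:4 split can be sketched as follows. The names, the fixed seed, and the tiny sample list are illustrative assumptions; a real sample set would be far larger.

```python
import random

RANDOM_LABEL, NORMAL_LABEL = 0, 1  # the 0/1 convention described above

def split_samples(samples, train_ratio=0.6, seed=0):
    """Shuffle labeled samples and split them into (training set, test set)."""
    shuffled = samples[:]                  # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # seeded only for reproducibility
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

samples = [("alibaba-inc", NORMAL_LABEL), ("helloworld", NORMAL_LABEL),
           ("aaaxbhzqegs-2", RANDOM_LABEL), ("4s7pTDAOV-L#", RANDOM_LABEL),
           ("example", NORMAL_LABEL)]
train_set, test_set = split_samples(samples)
print(len(train_set), len(test_set))  # 3 2
```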

Specifically, the classification model can be an SVM classification model, a Bayesian classification model, a decision tree classification model, a K-nearest-neighbor (K-NN) classification model, or the like.

The SVM classification model is expressed by the following formula:

D(y) = w^{T} y + b

where y is the normalized feature vector composed of the multiple normalized classification features, w^T is the coefficient vector, b is the intercept, and w^T and b are the undetermined parameters.

During training, it can be assumed that a string classified as a normal string has the corresponding value 1 and a string classified as a random string has the corresponding value 0. Accordingly, for D(y), it is assumed that if the result of D(y) is greater than 0, the corresponding string is judged to be a positive sample (i.e., a normal string); if the result of D(y) is less than or equal to 0, the corresponding string is judged to be a negative sample (i.e., a random string). In practice, other conventions can be assumed, as long as they are applied consistently; this is not specifically limited.
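The decision rule for a linear SVM, where the decision value is the dot product of w and y plus b, can be illustrated with the sketch below. It only evaluates an already trained model; the coefficient vector `w` and intercept `b` are made-up values for illustration, not trained parameters, and the function names are hypothetical.

```python
def decision_value(w, y, b):
    """Dot product of coefficient vector w and feature vector y, plus b."""
    return sum(wi * yi for wi, yi in zip(w, y)) + b

def classify(w, y, b):
    """Return 1 (normal string) if the decision value is > 0, else 0 (random)."""
    return 1 if decision_value(w, y, b) > 0 else 0

# Hypothetical parameters: a long vowel gap and a high entropy push toward 0.
w, b = [-1.0, -1.0, 0.5], 0.0
print(classify(w, [-0.5, -0.5, 0.2], b))  # 1
print(classify(w, [1.0, 1.0, 0.0], b))    # 0
```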

The Bayesian classification model is expressed by the following formula:

c(y) = \underset{k}{\arg\max} \, p(k) \, p(y_{1} | k) \, p(y_{2} | k) \cdots p(y_{m} | k) = \underset{k}{\arg\max} \, p(k) \prod_{j=1}^{m} p(y_{j} | k)

where k takes two values (for example, 0 and 1) corresponding to the two classes of character string (random string, normal string), y is the vector of normalized classification features, y_j is the j-th normalized classification feature, j ∈ [1, m], and, when y_j given class k follows the normal distribution N(μ_jk, σ_jk),

p(y_{j} | k) = \frac{1}{\sqrt{2 π} σ_{jk}} e^{- \frac{(y_{j} - μ_{jk})^{2}}{2 σ_{jk}^{2}}}.

During training, it can be assumed that a string classified as a normal string has the corresponding value 1 and a string classified as a random string has the corresponding value 0. Accordingly, for c(y), it is assumed that if the result of c(y) is greater than 0, the corresponding string is judged to be a positive sample (i.e., a normal string); if the result of c(y) is less than or equal to 0, the corresponding string is judged to be a negative sample (i.e., a random string). In practice, other conventions can be assumed, as long as they are applied consistently; this is not specifically limited.
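A Gaussian naive Bayes classifier of this form can be sketched in a few lines of Python. The function names and the toy data are hypothetical. Two choices are assumptions of the sketch and not taken from the text: the product is evaluated in log space, which leaves the arg max unchanged and avoids underflow, and a zero standard deviation is replaced by a small constant.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, labels):
    """Estimate p(k), mu_jk and sigma_jk from normalized feature vectors."""
    by_class = defaultdict(list)
    for y, k in zip(X, labels):
        by_class[k].append(y)
    model = {}
    for k, rows in by_class.items():
        stats = []
        for col in zip(*rows):
            mu = sum(col) / len(col)
            var = sum((v - mu) ** 2 for v in col) / len(col)
            stats.append((mu, math.sqrt(var) or 1e-9))  # avoid sigma = 0
        model[k] = (len(rows) / len(X), stats)
    return model

def log_gaussian(v, mu, sigma):
    """Logarithm of the normal density p(y_j | k)."""
    return (-math.log(math.sqrt(2 * math.pi) * sigma)
            - (v - mu) ** 2 / (2 * sigma ** 2))

def classify(model, y):
    """Return arg max over k of p(k) * prod_j p(y_j | k), in log space."""
    def score(k):
        prior, stats = model[k]
        return math.log(prior) + sum(
            log_gaussian(v, mu, s) for v, (mu, s) in zip(y, stats))
    return max(model, key=score)

X = [[-1.0, -1.2], [-0.8, -0.9], [1.0, 1.1], [0.9, 1.3]]
model = fit_gaussian_nb(X, [1, 1, 0, 0])
print(classify(model, [-0.9, -1.0]))  # 1
```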

The decision tree classification model can be built with algorithms such as ID3, C4.5, or CART. When C4.5 is used, the normalized longest adjacent vowel distance, the normalized string information entropy, and the normalized string length serve as the candidate split attributes, the classification result (normal string or random string) serves as the decision outcome, and the training set is split using maximum information gain ratio as the split criterion, so that the decision tree classification model is constructed step by step.
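The split criterion can be illustrated with a sketch that computes the information gain ratio of one binary threshold split on a single normalized feature. This is only the scoring step, under the assumption that continuous attributes are split by a threshold; a full C4.5 implementation also searches thresholds, recurses, and prunes. The function names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def gain_ratio(values, labels, threshold):
    """Information gain ratio of splitting on `value <= threshold`."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # a split that separates nothing carries no information
    n = len(labels)
    remainder = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    gain = entropy(labels) - remainder
    split_info = entropy(["L"] * len(left) + ["R"] * len(right))
    return gain / split_info
```

When building a node, the candidate attribute and threshold with the largest gain ratio would be chosen, and the procedure repeated on each resulting subset.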

In practice, any classification model can be built to suit the application; the K-nearest-neighbor classification model and others are not described by example here.

S106: Extract multiple classification features from each character string in the training set and normalize them, to obtain the multiple normalized classification features of each character string in the training set.

Specifically, the normalization step is similar to S103 and is not repeated here.

S107: Train the undetermined parameters of the preset classification model, using the multiple normalized classification features of each character string in the training set and the classification result of each character string in the training set, to obtain the trained values of the undetermined parameters.

S108: Extract multiple classification features from each character string in the test set and normalize them, to obtain the multiple normalized classification features of each character string in the test set.

S109: Test the classification model whose undetermined parameters are set to the trained values, using the multiple normalized classification features of each character string in the test set and the classification result of each character string in the test set, to obtain a test result.

S110: Compare the accuracy of the test result with the preset accuracy threshold. If the accuracy of the test result is greater than the preset accuracy threshold, perform S111; if it is less than or equal to the preset accuracy threshold, perform S112.

The preset accuracy threshold can be set according to the characteristics of the application, for example to 50% or 70%; this is not specifically limited.

S111: Determine the classification model whose undetermined parameters are set to the trained values to be the classification model obtained by offline training, and then perform step S101 of obtaining the character string to be classified.

That is, online classification can now be carried out.

It should be noted that, in practice, once S111 has determined the classification model whose undetermined parameters are set to the trained values to be the classification model obtained by offline training, the model usable for online classification is available. When online classification is actually carried out, that is, when S101-S104 are performed, can be decided according to the application; S101-S104 need not be performed immediately after S111.

S112: Determine that the classification model whose undetermined parameters are set to the trained values cannot serve as the classification model obtained by offline training, and then perform step S105 of collecting the sample set for the classification model.

That is, offline training is carried out again.
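The control flow of S105 to S112 can be summarized in a short sketch. The callables `collect` and `train` are hypothetical placeholders for sample collection and model training, and `max_rounds` is an added safeguard that the text does not mention; the text simply returns to S105 whenever the threshold is not exceeded.

```python
def accuracy(predict, test_set):
    """Fraction of (features, label) pairs that `predict` labels correctly."""
    hits = sum(1 for features, label in test_set if predict(features) == label)
    return hits / len(test_set)

def train_until_accepted(collect, train, threshold=0.7, max_rounds=5):
    """Repeat collection, training and testing until the accuracy is accepted.

    Returns the accepted prediction function, or None if no model exceeded
    `threshold` within `max_rounds` attempts.
    """
    for _ in range(max_rounds):
        train_set, test_set = collect()               # S105
        predict = train(train_set)                    # S106-S107
        if accuracy(predict, test_set) > threshold:   # S108-S110
            return predict                            # S111: model accepted
        # S112: model rejected, go back to S105
    return None
```

The comparison is strict, matching the rule that an accuracy equal to the threshold leads to retraining.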

In addition, it should be noted that, because real-world conditions keep changing, after the classification model whose undetermined parameters are set to the trained values has been determined to be the classification model obtained by offline training, a new sample set can be collected at preset time intervals and a new classification model trained from it to update the original classification model, so as to ensure the accuracy of the classification results.

It should also be noted that the offline training of the classification model preferably runs on a distributed big data processing system (such as ODPS or Hadoop), so that processing and modeling of large-scale samples can be completed efficiently within an acceptable time.

In the method for classifying character strings described in this embodiment, a classification model obtained by offline training classifies the character string to be classified according to multiple normalized classification features and yields its classification result; this requires no manual work, can be carried out automatically, and is highly efficient. Testing the trained classification model on a test set can improve the accuracy of the classification model. The classification features, which include the longest adjacent vowel distance, the string information entropy, or the string length, capture the characteristics of a character string well and improve the accuracy of the classification result.

Fig. 3 is a structural diagram of a device for classifying a character string according to an embodiment of the present invention. The device includes:

Acquisition module 201, configured to obtain a character string to be classified;

First extraction module 202, configured to extract multiple classification features from the character string to be classified;

Normalization module 203, configured to normalize each classification feature separately to obtain multiple normalized classification features;

Classification module 204, configured to classify the character string to be classified by the classification model obtained by offline training, according to the multiple normalized classification features, to obtain the classification result of the character string to be classified.
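The chain formed by modules 201-204 can be sketched as below. The sketch assumes min-max normalization with clipping, per-feature bounds supplied by the caller, and a model passed in as a plain callable; none of these choices is stated in the text above, and the names `normalize` and `classify_string` are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def normalize(features: Sequence[float],
              bounds: Sequence[Tuple[float, float]]) -> List[float]:
    """Min-max scale each feature into [0, 1], clipping out-of-range values.

    Assumption: bounds holds one (low, high) pair per feature, for example
    taken from the training set.
    """
    scaled = []
    for value, (low, high) in zip(features, bounds):
        if high <= low:
            scaled.append(0.0)  # degenerate range: no information in this feature
        else:
            scaled.append(min(1.0, max(0.0, (value - low) / (high - low))))
    return scaled


def classify_string(
    s: str,
    extract: Callable[[str], Sequence[float]],
    bounds: Sequence[Tuple[float, float]],
    model: Callable[[Sequence[float]], str],
) -> str:
    """Run one string through extraction, normalization, and classification."""
    raw = extract(s)                  # first extraction module 202
    scaled = normalize(raw, bounds)   # normalization module 203
    return model(scaled)              # classification module 204
```

Passing the extractor and the model as arguments keeps the online path independent of how the offline training produced them.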

Further, referring to Fig. 4, the device also includes:

Collection module 205, configured to collect a sample set for the preset classification model and divide the sample set into a training set and a test set; wherein the sample set includes a preset number of character strings and the classification result of each of those character strings;

Second extraction module 206, configured to extract multiple classification features from each character string in the training set and normalize them, to obtain multiple normalized classification features of each character string in the training set;

Training module 207, configured to train the undetermined parameters in the preset classification model by using the multiple normalized classification features of each character string in the training set and the classification result of each character string in the training set, to obtain the trained values of the undetermined parameters;

Third extraction module 208, configured to extract multiple classification features from each character string in the test set and normalize them, to obtain multiple normalized classification features of each character string in the test set;

Test module 209, configured to test the classification model whose undetermined parameters are set to the trained values by using the multiple normalized classification features of each character string in the test set and the classification result of each character string in the test set, to obtain a test result;

Comparison module 210, configured to compare the accuracy of the test result with a preset accuracy threshold;

First determining module 211, configured to, if the accuracy of the test result is greater than the preset accuracy threshold, determine the classification model whose undetermined parameters are set to the trained values as the classification model obtained by offline training, and then notify acquisition module 201 to perform the step of obtaining the character string to be classified.
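The division of a labeled sample set into a training set and a test set might look like the following sketch. The 20% test fraction, the fixed shuffle seed, and the label encoding (1 for random, 0 for normal) are assumptions made for illustration; the text above does not specify how the sample set is divided.

```python
import random
from typing import List, Sequence, Tuple

LabeledString = Tuple[str, int]  # (string, label); assumed encoding: 1 = random, 0 = normal


def split_sample_set(
    samples: Sequence[LabeledString],
    test_fraction: float = 0.2,  # assumption: not given in the source
    seed: int = 0,               # fixed seed so the split is reproducible
) -> Tuple[List[LabeledString], List[LabeledString]]:
    """Shuffle the samples and return (training_set, test_set)."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))  # keep at least one test sample
    return shuffled[n_test:], shuffled[:n_test]
```

Shuffling before the split avoids a test set made up only of the most recently collected strings.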

Further, referring to Fig. 5, the device also includes:

Second determining module 212, configured to, if the accuracy of the test result is less than or equal to the preset accuracy threshold, determine that the classification model whose undetermined parameters are set to the trained values cannot serve as the classification model obtained by offline training, and then notify collection module 205 to perform the step of collecting a sample set for the preset classification model.

Further, the classification result of the character string to be classified includes:

The character string to be classified is a random character string, or the character string to be classified is a normal character string.

Further, the classification model includes:

Further, the classification features include:

The longest adjacent vowel distance, character string information entropy, or character string length; wherein the longest adjacent vowel distance denotes the largest of the spacing distances between all pairs of adjacent vowel characters in a character string.
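The three classification features can be computed along the following lines. This is a sketch under assumptions: vowels are taken to be the English letters a, e, i, o, u in either case, the spacing distance is taken to be the difference between the positions of two consecutive vowels, and the information entropy is taken to be the Shannon entropy of the character frequencies in bits. The function names are hypothetical.

```python
import math
from collections import Counter
from typing import List

VOWELS = set("aeiouAEIOU")  # assumption: English vowels only


def longest_adjacent_vowel_distance(s: str) -> int:
    """Largest position gap between consecutive vowels; 0 if fewer than two vowels."""
    positions = [i for i, ch in enumerate(s) if ch in VOWELS]
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return max(gaps, default=0)


def information_entropy(s: str) -> float:
    """Shannon entropy, in bits, of the character frequency distribution of s."""
    if not s:
        return 0.0
    n = len(s)
    return -sum((count / n) * math.log2(count / n) for count in Counter(s).values())


def extract_features(s: str) -> List[float]:
    """Return [longest adjacent vowel distance, information entropy, length]."""
    return [
        float(longest_adjacent_vowel_distance(s)),
        information_entropy(s),
        float(len(s)),
    ]
```

Under these assumptions, "helloworld" has vowels at positions 1, 4 and 6 and therefore a longest adjacent vowel distance of 3, whereas "aaaxbhzqegs-2" has vowels at positions 0, 1, 2 and 8 and a distance of 6.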

In the device for classifying a character string described in this embodiment, the character string to be classified is classified by the classification model obtained by offline training according to multiple normalized classification features, yielding the classification result of the character string to be classified. No manual work is required, the process can be carried out automatically, and efficiency is high. Testing the trained classification model on the test set can improve the accuracy of the classification model. The classification features include the longest adjacent vowel distance, character string information entropy, or character string length, which capture the characteristics of a character string well and improve the accuracy of the classification result.

The device corresponds to the method flow described above; for anything not covered here, refer to the description of that method flow, which is not repeated.

The above description shows and describes several preferred embodiments of the present invention. As stated above, however, it should be understood that the invention is not limited to the forms disclosed herein, which should not be regarded as excluding other embodiments; the invention can be used in various other combinations, modifications, and environments, and can be modified within the scope of the inventive concept described herein through the above teachings or the skill or knowledge of the related art. Modifications and changes made by those skilled in the art that do not depart from the spirit and scope of the invention shall all fall within the protection scope of the appended claims.

Claims

1. A method for classifying a character string, characterized in that the method includes:

Obtaining a character string to be classified;

2. The method of claim 1, characterized in that, before obtaining the character string to be classified, the method further includes:

Extracting multiple said classification features from each character string in a test set and normalizing them, to obtain multiple said normalized classification features of each character string in the test set;

3. The method of claim 2, characterized in that, before extracting multiple said classification features from each character string in the test set, the method further includes:

4. The method of claim 3, characterized in that, after comparing the accuracy of the test result with the preset accuracy threshold, the method further includes:

5. The method of claim 1, characterized in that the classification result of the character string to be classified includes:

6. The method of claim 1, characterized in that the classification model includes:

7. The method of any one of claims 1-6, characterized in that the classification features include:

8. A device for classifying a character string, characterized in that the device includes:

An acquisition module, configured to obtain a character string to be classified;

9. The device of claim 8, characterized in that the device further includes:

10. The device of claim 9, characterized in that the device further includes:

11. The device of claim 8, characterized in that the classification result of the character string to be classified includes:

12. The device of claim 8, characterized in that the classification model includes:

13. The device of any one of claims 8-12, characterized in that the classification features include: