CN106156120A - Method and apparatus for classifying character strings - Google Patents

Method and apparatus for classifying character strings

Info

Publication number
CN106156120A
Authority
CN
China
Prior art keywords
character string
classification model
to be classified
feature
classification
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510162076.9A
Other languages
Chinese (zh)
Other versions
CN106156120B (en)
Inventor
李家宏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Alibaba Group Holding Ltd
Priority to CN201510162076.9A
Publication of CN106156120A
Application granted
Publication of CN106156120B
Current legal status: Active

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and apparatus for classifying character strings, belonging to the field of computer communication technology. The method includes: obtaining a character string to be classified; extracting multiple classification features from the character string to be classified; normalizing each classification feature separately to obtain multiple normalized classification features; and classifying the character string to be classified with a classification model obtained by offline training, according to the multiple normalized classification features, to obtain the classification result of the character string to be classified. The apparatus includes an acquisition module, a first extraction module, a normalization module, and a classification module. Because the classification is performed by an offline-trained classification model on the normalized classification features, it requires no manual work, can be carried out automatically, and is highly efficient.

Description

Method and apparatus for classifying character strings
Technical field
The present invention relates to the field of computer communication technology, and in particular to a method and apparatus for classifying character strings.
Background
With the development of computer communication technology, terminal devices such as computers, tablets, and mobile phones have increasingly become indispensable tools for daily life and work. At the same time, the number of service devices that provide background services such as networking and computing keeps growing, and the demands placed on the service capability of computing devices, both terminal devices and service devices, keep rising. In many scenarios (for example, an automated registration tool maliciously registering a large number of invalid accounts, or an attacking machine maliciously forging a large number of invalid domain name requests), a computing device receives a large number of random character strings (such as "aaaxbhzqegs-2", "4s7pTDAOV-L#", "!OC | w4&s", etc.). These random strings carry no meaning, but the computing device does not know this when it first receives them, so it processes them as if they were normal, meaningful strings (such as "alibaba-inc", "helloworld", etc.), which affects the normal operation of the computing device.
To avoid affecting the normal operation of the computing device, the strings it receives can be classified to distinguish which are random strings and which are normal strings, so that the computing device can handle different strings differently. At present, strings are classified manually, according to the semantics and context of the string itself.
The existing method of classifying character strings relies on manual work and is very inefficient.
Summary of the invention
To solve the existing technical problem, the present invention provides a method and apparatus for classifying character strings. A classification model obtained by offline training classifies the character string to be classified according to multiple normalized classification features and yields its classification result; this requires no manual work, can be carried out automatically, and is highly efficient.
To solve the above problem, the present invention discloses a method for classifying character strings, the method including:
obtaining a character string to be classified;
extracting multiple classification features from the character string to be classified;
normalizing each classification feature separately to obtain multiple normalized classification features;
classifying the character string to be classified with a classification model obtained by offline training, according to the multiple normalized classification features, to obtain the classification result of the character string to be classified.
Further, before the character string to be classified is obtained, the method also includes:
extracting the multiple classification features from each character string in the test set and normalizing them, to obtain the multiple normalized classification features of each character string in the test set;
testing the classification model whose parameters to be determined are set to the trained values, using the multiple normalized classification features of each character string in the test set and the classification result of each character string in the test set, to obtain a test result;
comparing the accuracy of the test result with a preset accuracy threshold;
if the accuracy of the test result is greater than the preset accuracy threshold, determining the classification model whose parameters to be determined are set to the trained values as the classification model obtained by offline training, and then performing the step of obtaining the character string to be classified.
Further, before the multiple classification features are extracted from each character string in the test set, the method also includes:
collecting a sample set for a preset classification model and dividing the sample set into a training set and a test set, where the sample set includes a preset number of character strings and the classification result of each of those character strings;
extracting the multiple classification features from each character string in the training set and normalizing them, to obtain the multiple normalized classification features of each character string in the training set;
training the parameters to be determined in the preset classification model, using the multiple normalized classification features of each character string in the training set and the classification result of each character string in the training set, to obtain the trained values of the parameters to be determined.
Further, after the accuracy of the test result is compared with the preset accuracy threshold, the method also includes:
if the accuracy of the test result is less than or equal to the preset accuracy threshold, determining that the classification model whose parameters to be determined are set to the trained values cannot serve as the classification model obtained by offline training, and then performing the step of collecting the sample set for the preset classification model.
Further, the classification result of the character string to be classified includes:
the character string to be classified is a random string, or the character string to be classified is a normal string.
Further, the classification model includes:
a support vector machine (SVM) classification model, a decision tree classification model, a Bayesian classification model, or a K-nearest-neighbor classification model.
Further, the classification features include:
the longest adjacent vowel distance, the string information entropy, or the string length; where the longest adjacent vowel distance is the largest of the spacings between all pairs of adjacent vowel characters in a character string.
To solve the above problem, the present invention also discloses an apparatus for classifying character strings, the apparatus including:
an acquisition module, configured to obtain a character string to be classified;
a first extraction module, configured to extract multiple classification features from the character string to be classified;
a normalization module, configured to normalize each classification feature separately to obtain multiple normalized classification features;
a classification module, configured to classify the character string to be classified with a classification model obtained by offline training, according to the multiple normalized classification features, to obtain the classification result of the character string to be classified.
Further, the apparatus also includes:
a collection module, configured to collect a sample set for a preset classification model and divide the sample set into a training set and a test set, where the sample set includes a preset number of character strings and the classification result of each of those character strings;
a second extraction module, configured to extract the multiple classification features from each character string in the training set and normalize them, to obtain the multiple normalized classification features of each character string in the training set;
a training module, configured to train the parameters to be determined in the preset classification model, using the multiple normalized classification features of each character string in the training set and the classification result of each character string in the training set, to obtain the trained values of the parameters to be determined;
a third extraction module, configured to extract the multiple classification features from each character string in the test set and normalize them, to obtain the multiple normalized classification features of each character string in the test set;
a test module, configured to test the classification model whose parameters to be determined are set to the trained values, using the multiple normalized classification features of each character string in the test set and the classification result of each character string in the test set, to obtain a test result;
a comparison module, configured to compare the accuracy of the test result with a preset accuracy threshold;
a first determining module, configured to, if the accuracy of the test result is greater than the preset accuracy threshold, determine the classification model whose parameters to be determined are set to the trained values as the classification model obtained by offline training, and then notify the acquisition module to perform the step of obtaining the character string to be classified.
Further, the apparatus also includes:
a second determining module, configured to, if the accuracy of the test result is less than or equal to the preset accuracy threshold, determine that the classification model whose parameters to be determined are set to the trained values cannot serve as the classification model obtained by offline training, and then notify the collection module to perform the step of collecting the sample set for the preset classification model.
Further, the classification result of the character string to be classified includes:
the character string to be classified is a random string, or the character string to be classified is a normal string.
Further, the classification model includes:
a support vector machine (SVM) classification model, a decision tree classification model, a Bayesian classification model, or a K-nearest-neighbor classification model.
Further, the classification features include:
the longest adjacent vowel distance, the string information entropy, or the string length; where the longest adjacent vowel distance is the largest of the spacings between all pairs of adjacent vowel characters in a character string.
Compared with the prior art, the present invention can achieve the following technical effects:
1) A classification model obtained by offline training classifies the character string to be classified according to multiple normalized classification features and yields its classification result; this requires no manual work, can be carried out automatically, and is highly efficient.
2) Testing the trained classification model on the test set can improve the accuracy of the classification model.
3) Classification features that include the longest adjacent vowel distance, the string information entropy, or the string length capture the characteristics of a string well and improve the accuracy of the classification result.
Of course, a product implementing the present invention need not achieve all of the above technical effects at the same time.
Brief description of the drawings
The drawings described here provide a further understanding of the invention and form a part of it. The illustrative embodiments of the invention and their description serve to explain the invention and do not unduly limit it. In the drawings:
Fig. 1 is a flowchart of a first method for classifying character strings according to an embodiment of the invention;
Fig. 2 is a flowchart of a second method for classifying character strings according to an embodiment of the invention;
Fig. 3 is a schematic structural diagram of a first apparatus for classifying character strings according to an embodiment of the invention;
Fig. 4 is a schematic structural diagram of a second apparatus for classifying character strings according to an embodiment of the invention;
Fig. 5 is a schematic structural diagram of a third apparatus for classifying character strings according to an embodiment of the invention.
Detailed description
Embodiments of the invention are described in detail below with reference to the drawings and examples, so that the way the invention applies technical means to solve the technical problem and achieve the technical effect can be fully understood and implemented.
In a typical configuration, a computing device includes one or more processors (CPUs), an input/output interface, a network interface, and memory.
Memory may take the form of volatile memory in computer-readable media, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, which can store information by any method or technology. The information can be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined here, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
Certain terms are used throughout the description and claims to refer to particular components. Those skilled in the art will appreciate that hardware manufacturers may refer to the same component by different names. This description and the claims do not distinguish components by differences in name, but by differences in function. "Comprising", as used throughout the description and claims, is an open-ended term and should be interpreted as "including but not limited to". "Substantially" means that, within an acceptable error range, a person skilled in the art can solve the technical problem and substantially achieve the technical effect. In addition, "coupled" here covers any direct or indirect means of electrical coupling. Thus, if the text says that a first device is coupled to a second device, the first device may be electrically coupled to the second device directly, or indirectly through other devices or coupling means. The description that follows sets out preferred embodiments of the invention; it is given to illustrate the general principles of the invention and is not intended to limit its scope. The scope of protection of the invention is defined by the appended claims.
It should also be noted that the terms "include", "comprise", and any variants of them are intended to cover non-exclusive inclusion, so that a product or system that includes a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to the product or system. Without further limitation, an element introduced by "including a ..." does not exclude the presence of additional identical elements in the product or system that includes it.
Description of embodiments
The implementation of the method of the invention is further described below with an embodiment. Fig. 1 is a flowchart of a method for classifying character strings according to an embodiment of the invention. The method includes:
S101: Obtain the character string to be classified.
Specifically, any character string input to the computing device can be obtained and treated as the character string to be classified.
S102: Extract multiple classification features from the character string to be classified.
Specifically, the classification features include the longest adjacent vowel distance, the string information entropy, or the string length.
Specifically, the longest adjacent vowel distance is the largest of the spacings between all pairs of adjacent vowel characters in a character string. This embodiment also treats "-" and the end of the string as vowel characters; it is not limited to this, and in practice other special characters can be treated as vowel characters as needed. For example, the adjacent vowel pairs of the string "alibaba-inc" are, in order: (a, i), (i, a), (a, a), (a, -), (-, i), and (i, end of string). The spacings between them are 1, 1, 1, 0, 0, and 2 characters respectively, so the longest adjacent vowel distance of "alibaba-inc" is 2 characters.
It should be noted that vowels make the vocal cords vibrate and produce sound, so the adjacent vowel distance characterizes the length of each syllable in the string and reflects the rhythm of its pronunciation. A normal string (a meaningful word or phrase) has fairly short syllables and a fairly even rhythm so that it is easy to pronounce, and its longest adjacent vowel distance is therefore short: the adjacent vowel distances of "alibaba-inc" are [1, 1, 1, 0, 0, 2], and the longest is 2. A meaningless random string is not constrained by pronunciation, so its syllables are longer and it has no rhythm. In addition, the probability that a vowel occurs (< 5/26) is much smaller than the probability that a non-vowel occurs, so runs of consecutive non-vowels are more likely and the longest adjacent vowel distance is long: the adjacent vowel distances of "aaaxbhzqegs-2" are [0, 0, 5, 2, 1], and the longest is 5.
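The feature described above can be sketched in a few lines of Python. This is an illustrative sketch, not the patent's implementation: the function names and the `extra_vowels` parameter are hypothetical, the vowel set is assumed to be a, e, i, o, u, and characters before the first vowel are assumed not to contribute a gap (the source's examples both start with a vowel and do not settle this case).

```python
VOWELS = set("aeiou")  # assumed vowel set; the source does not enumerate it


def adjacent_vowel_gaps(s, extra_vowels="-"):
    """Return the number of characters between each pair of adjacent vowel markers.

    Markers are the vowels, any character in extra_vowels, and the end of the
    string, mirroring the worked examples in this embodiment.
    """
    s = s.lower()
    markers = VOWELS | set(extra_vowels)
    positions = [i for i, ch in enumerate(s) if ch in markers]
    positions.append(len(s))  # the end of the string counts as a vowel marker
    # A gap is the count of characters strictly between two consecutive markers.
    return [b - a - 1 for a, b in zip(positions, positions[1:])]


def longest_adjacent_vowel_distance(s, extra_vowels="-"):
    gaps = adjacent_vowel_gaps(s, extra_vowels)
    return max(gaps) if gaps else 0


print(adjacent_vowel_gaps("alibaba-inc"))
print(longest_adjacent_vowel_distance("aaaxbhzqegs-2"))
```

Run on the two strings from the text, the sketch yields the gap lists [1, 1, 1, 0, 0, 2] and [0, 0, 5, 2, 1] given above.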
Specifically, the string information entropy H characterizes how random the string is. It is computed as:
$$H = -\sum_{i=1}^{N} p_i \log_2 p_i$$
where N is the number of characters in the string and $p_i$ is the probability that the i-th character occurs in the string.
It should be noted that in a normal string (a meaningful word or phrase) the arrangement of characters follows writing conventions and cannot be arbitrary, so its randomness is low and its information entropy is low; for example, the information entropy of "alibaba-inc" is 2.44. The arrangement of characters in a meaningless random string is unconstrained, so its randomness is higher and its information entropy is higher; for example, the information entropy of "aaaxbhzqegs-2" is 3.19.
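A minimal Python sketch of the entropy feature follows. It assumes the sum runs over the distinct characters of the string, with each probability estimated as the character's relative frequency; the function name and the `ignore` parameter are hypothetical. The two values quoted in the text are approximately reproduced only when "-" is left out of the count. That exclusion is inferred from the numbers and is not stated in the source.

```python
import math
from collections import Counter


def string_entropy(s, ignore=""):
    """Shannon entropy (base 2) of the character distribution of s.

    Characters listed in `ignore` are dropped before counting (an assumption
    made here to match the values quoted in the text).
    """
    chars = [ch for ch in s if ch not in ignore]
    if not chars:
        return 0.0
    total = len(chars)
    counts = Counter(chars)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


print(round(string_entropy("alibaba-inc", ignore="-"), 3))
print(round(string_entropy("aaaxbhzqegs-2", ignore="-"), 3))
```

With "-" ignored, the sketch gives about 2.446 for "alibaba-inc" and about 3.189 for "aaaxbhzqegs-2"; with "-" counted, "alibaba-inc" comes out near 2.66 instead.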
S103: Normalize each classification feature separately to obtain multiple normalized classification features.
Specifically, in this embodiment normalization can use the Z-score method, computed as:
$$X_j = \frac{x_j - \mu_j}{\delta_j}$$
where $X_j$ is the j-th normalized classification feature, $x_j$ is the j-th classification feature, $\mu_j$ is the sample mean of the j-th classification feature, and $\delta_j$ is the sample standard deviation of the j-th classification feature. $\mu_j$ and $\delta_j$ can be computed statistically from the sample set while the classification model is trained offline on the collected sample set of character strings.
It should be noted that normalization is not limited to the Z-score method; any other feasible method can be used, and no specific limitation is imposed here.
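As a sketch of the Z-score step, the following Python computes the per-feature mean and standard deviation from a sample set and applies them to a new feature vector. The function names are hypothetical. The use of `statistics.stdev` (the n-1 estimator) and the choice to map a zero-variance feature to 0.0 are assumptions; the source only says "sample standard deviation".

```python
from statistics import mean, stdev


def fit_zscore(samples):
    """Estimate (mu_j, delta_j) for every feature column of the sample set.

    samples: list of feature vectors, one per training string (at least two).
    """
    columns = list(zip(*samples))
    return [(mean(col), stdev(col)) for col in columns]


def zscore(x, params):
    """Normalize one feature vector x with previously fitted parameters."""
    return [
        (xj - mu) / delta if delta > 0 else 0.0  # guard against zero variance
        for xj, (mu, delta) in zip(x, params)
    ]


# Hypothetical feature vectors: [longest vowel distance, entropy, length]
train = [[2.0, 2.4, 11.0], [5.0, 3.2, 13.0], [1.0, 2.0, 10.0]]
params = fit_zscore(train)
print(zscore([2.0, 2.4, 11.0], params))
```

The same fitted parameters would be reused for the training set, the test set, and online classification, so that all vectors share one scale.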
S104: Classify the character string to be classified with the classification model obtained by offline training, according to the multiple normalized classification features, to obtain the classification result of the character string to be classified.
Specifically, the classification model obtained by offline training can be an SVM (Support Vector Machine) classification model, a decision tree classification model, a Bayesian classification model, or a K-nearest-neighbor (K-NN) classification model. Each classification model is described in S105.
Specifically, classifying the character string to be classified requires the classification model obtained by offline training, so the model must first be obtained offline. In a preferred embodiment of the invention, referring to Fig. 2, the following steps are performed before the character string to be classified is obtained:
S105: Collect a sample set for a preset classification model and divide the sample set into a training set and a test set, where the sample set includes a preset number of character strings and the classification result of each of those character strings.
Specifically, a large number of random strings (for example, 1,000,000) and a large number of normal strings (for example, 300,000) can be selected at random as the sample set of the preset classification model. The sample set is randomly divided into a training set and a test set in a certain ratio (for example, 6:4); the training set is used to train the preset classification model and the test set is used to test the trained model.
It should be noted that the sample set is collected from known random strings and known normal strings, so the classification result of every string in the sample set is known. For convenience in later steps, a string classified as random can have its type represented by 0 and a string classified as normal by 1. The two types need not be distinguished by 0 and 1; any other feasible representation can be used, and no specific limitation is imposed here.
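The random division of a labeled sample set can be sketched as below. The function name, the `seed` parameter, and the tiny example data are hypothetical; only the 6:4 ratio and the 0/1 labels come from the text.

```python
import random


def split_samples(samples, train_ratio=0.6, seed=0):
    """Shuffle (string, label) pairs and split them into training and test sets."""
    shuffled = list(samples)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


# Label 0 marks a random string, label 1 a normal string, as in the text.
samples = [
    ("aaaxbhzqegs-2", 0), ("4s7pTDAOV-L#", 0), ("qzxkvbnw", 0),
    ("alibaba-inc", 1), ("helloworld", 1),
]
train_set, test_set = split_samples(samples)
print(len(train_set), len(test_set))
```

A fixed seed is used here only so that the split is repeatable while experimenting.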
Specifically, the classification model can be an SVM classification model, a Bayesian classification model, a decision tree classification model, a K-nearest-neighbor (K-NN) classification model, or the like.
The SVM classification model is expressed by the following formula:
$$D(y) = w^{T} y + b$$
where y is the normalized feature vector made up of the multiple normalized classification features, $w^{T}$ is the coefficient vector, and b is the intercept; $w^{T}$ and b are the parameters to be determined.
During training, it can be assumed that a string classified as normal corresponds to the value 1 and a string classified as random corresponds to the value 0. For D(y), it is then assumed that if the result of D(y) is greater than 0, the corresponding string is judged a positive sample (a normal string), and if the result of D(y) is less than or equal to 0, the corresponding string is judged a negative sample (a random string). In practice other conventions can be assumed, as long as they are applied consistently throughout; no specific limitation is imposed here.
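The decision rule described above can be illustrated with a short Python sketch. It assumes the linear form D(y) = w·y + b, which is what the coefficient vector and intercept in the text suggest; the function names and the numeric values of `w` and `b` are made up for illustration and are not trained values from the source.

```python
def decision_value(w, y, b):
    """Linear decision function D(y) = w . y + b."""
    return sum(wi * yi for wi, yi in zip(w, y)) + b


def classify(w, y, b):
    """Return 1 (normal string) if D(y) > 0, otherwise 0 (random string)."""
    return 1 if decision_value(w, y, b) > 0 else 0


# Hypothetical parameters over [vowel distance, entropy, length], all normalized.
w = [-1.0, -1.0, 0.5]
b = 0.5
print(classify(w, [0.2, 0.1, 0.0], b))
print(classify(w, [1.0, 1.0, 0.0], b))
```

In a real system `w` and `b` would come from training on the training set rather than being set by hand.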
The Bayesian classification model is expressed by the following formula:
$$c(y) = \arg\max_{k} \, p(k)\, p(y_1 \mid k)\, p(y_2 \mid k) \cdots p(y_m \mid k) = \arg\max_{k} \, p(k) \prod_{j=1}^{m} p(y_j \mid k)$$
where k takes two values (for example, 0 and 1) corresponding to the two classes of string (random string, normal string), y denotes the normalized classification features, $y_j$ denotes the j-th normalized classification feature, $j \in [1, m]$, and, when $Y_{jk}$ follows $N(\mu_{jk}, \sigma_{jk})$,
$$P(y_j \mid k) = \frac{1}{\sqrt{2\pi}\,\sigma_{jk}} \, e^{-\frac{(y_j - \mu_{jk})^2}{2\sigma_{jk}^2}}$$
During training, it can be assumed that a string classified as normal corresponds to the value 1 and a string classified as random corresponds to the value 0. For c(y), it is then assumed that if the result of c(y) is greater than 0, the corresponding string is judged a positive sample (a normal string), and if the result of c(y) is less than or equal to 0, the corresponding string is judged a negative sample (a random string). In practice other conventions can be assumed, as long as they are applied consistently throughout; no specific limitation is imposed here.
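A compact Gaussian naive Bayes sketch in Python shows how the formula above can be evaluated. The function names and the toy data are hypothetical. Two choices here are assumptions rather than anything stated in the source: the score is computed in log space (which leaves the arg max unchanged but avoids underflow), and a small floor is applied to each standard deviation to avoid division by zero.

```python
import math


def fit_gaussian_nb(samples, labels):
    """Estimate p(k), mu_jk and sigma_jk for each class k from labeled vectors."""
    model = {}
    for k in set(labels):
        rows = [x for x, lab in zip(samples, labels) if lab == k]
        cols = list(zip(*rows))
        mus = [sum(c) / len(c) for c in cols]
        sigmas = [
            max(math.sqrt(sum((v - m) ** 2 for v in c) / len(c)), 1e-9)
            for c, m in zip(cols, mus)
        ]
        model[k] = (len(rows) / len(labels), mus, sigmas)
    return model


def predict(model, y):
    """Return arg max over k of log p(k) + sum_j log p(y_j | k)."""
    def log_score(k):
        prior, mus, sigmas = model[k]
        total = math.log(prior)
        for yj, mu, sigma in zip(y, mus, sigmas):
            total += (-0.5 * math.log(2 * math.pi) - math.log(sigma)
                      - (yj - mu) ** 2 / (2 * sigma ** 2))
        return total
    return max(model, key=log_score)


# Toy normalized feature vectors; label 1 = normal string, 0 = random string.
X = [[-1.0, -0.9], [-1.2, -1.1], [1.0, 0.9], [1.1, 1.2]]
labels = [1, 1, 0, 0]
nb = fit_gaussian_nb(X, labels)
print(predict(nb, [-1.1, -1.0]), predict(nb, [1.05, 1.0]))
```

Because `predict` returns the class label itself, a result of 1 corresponds to the "greater than 0" case in the text and a result of 0 to the other case.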
The decision tree classification model can be built with algorithms such as ID3, C4.5, or CART. When C4.5 is used, the normalized longest adjacent vowel distance, the normalized string information entropy, and the normalized string length serve as the candidate split attributes, the classification result (normal string or random string) serves as the decision outcome, and the training set is partitioned step by step with maximum information gain ratio as the split criterion to construct the decision tree classification model.
In practice, any of the classification models can be built to suit the application; the K-nearest-neighbor classification model and others are not illustrated further here.
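For the C4.5 case mentioned above, the split criterion can be illustrated by computing the information gain ratio of one candidate split. The sketch below evaluates a binary threshold split on a single normalized feature; the function names and the toy values are hypothetical, and building the full tree (recursion, pruning) is left out.

```python
import math
from collections import Counter


def label_entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain_ratio(values, labels, threshold):
    """Information gain ratio of the split `value <= threshold` on one feature."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # a split that separates nothing carries no information
    n = len(labels)
    conditional = (len(left) / n) * label_entropy(left) \
        + (len(right) / n) * label_entropy(right)
    gain = label_entropy(labels) - conditional
    split_info = label_entropy(["L"] * len(left) + ["R"] * len(right))
    return gain / split_info


# Toy normalized vowel distances; label 1 = normal string, 0 = random string.
values = [-1.0, -0.5, 0.5, 1.0]
labels = [1, 1, 0, 0]
print(gain_ratio(values, labels, 0.0))
```

A tree builder would evaluate this ratio for each candidate attribute and threshold and split on the one with the largest value.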
S106: Extract the multiple classification features from each character string in the training set and normalize them, to obtain the multiple normalized classification features of each character string in the training set.
Specifically, the normalization step is similar to S103 and is not repeated here.
S107: Train the parameters to be determined in the preset classification model, using the multiple normalized classification features of each character string in the training set and the classification result of each character string in the training set, to obtain the trained values of the parameters to be determined.
S108: Extract the multiple classification features from each character string in the test set and normalize them, to obtain the multiple normalized classification features of each character string in the test set.
Specifically, the normalization step is similar to S103 and is not repeated here.
S109: Test the classification model whose parameters to be determined are set to the trained values, using the multiple normalized classification features of each character string in the test set and the classification result of each character string in the test set, to obtain a test result.
S110: Compare the accuracy of the test result with the preset accuracy threshold. If the accuracy of the test result is greater than the preset accuracy threshold, perform S111; if it is less than or equal to the preset accuracy threshold, perform S112.
The preset accuracy threshold can be set according to the characteristics of the application, for example 50% or 70%; no specific limitation is imposed here.
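The accept-or-retrain decision of S110 can be sketched as follows. The function names are hypothetical; the strict "greater than" comparison follows the text, and the 70% default is one of the example thresholds it mentions.

```python
def accuracy(predicted, actual):
    """Fraction of test strings whose predicted class matches the known class."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


def accept_model(predicted, actual, threshold=0.7):
    """True means go to S111 (use the model); False means go to S112 (retrain)."""
    return accuracy(predicted, actual) > threshold


predicted = [1, 0, 1, 1]
actual = [1, 0, 0, 1]
print(accuracy(predicted, actual), accept_model(predicted, actual))
```

Note that an accuracy exactly equal to the threshold leads to retraining, since the text sends the "less than or equal" case to S112.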
S111: determine the disaggregated model that undetermined parameter is set as trained values as dividing that off-line training obtains Class model, then performs the step that S101 obtains character string to be sorted.
I.e. can carry out online classification operation.
It should be noted that, in practice, once S111 has determined the classification model whose undetermined parameters are set to the trained values as the classification model obtained by offline training, the model usable for online classification is available. When the online classification operation is actually carried out, that is, when S101-S104 are performed, can be set according to the actual application; S101-S104 need not be performed immediately after S111.
S112: Determine that the classification model whose undetermined parameters are set to the trained values cannot serve as the classification model obtained by offline training, then perform step S105 of collecting the sample set for the classification model.
That is, offline training is performed again.
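The control flow of S105 through S112 can be sketched as a loop that collects samples, trains, tests, and either accepts the model or starts over. This is only an illustration: the function name `train_until_accepted`, the three callables it receives, and the `max_rounds` cap are hypothetical and do not appear in the patent, which simply returns to S105 without stating a limit.

```python
def train_until_accepted(collect_samples, train, evaluate, threshold, max_rounds=5):
    """Repeat S105-S110 until the tested accuracy exceeds the threshold.

    collect_samples() -> (training_set, test_set)    # S105
    train(training_set) -> model                     # S106-S107
    evaluate(model, test_set) -> accuracy in [0, 1]  # S108-S109
    Returns (model, accuracy, round_number), or None if no round passed.
    max_rounds is an assumption added so that the sketch always terminates.
    """
    for round_number in range(1, max_rounds + 1):
        training_set, test_set = collect_samples()
        model = train(training_set)
        accuracy = evaluate(model, test_set)
        # S110: strictly greater than the threshold accepts the model (S111);
        # less than or equal sends the flow back to sample collection (S112).
        if accuracy > threshold:
            return model, accuracy, round_number
    return None
```

A caller would proceed to online classification (S101-S104) only when the function returns a model.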
Furthermore, because actual application conditions keep changing, after the classification model whose undetermined parameters are set to the trained values has been determined as the classification model obtained by offline training, a new sample set can be collected again at every preset time interval and a new classification model retrained from it, so that the original classification model is updated and the accuracy of the classification results is maintained.
In addition, the offline training of the classification model is preferably run on a distributed big-data processing system (such as ODPS or Hadoop), so that processing and modeling of large-scale samples can be completed efficiently within an acceptable time.
With the method for classifying character strings described in this embodiment, the classification model obtained by offline training classifies the character string to be classified according to the multiple normalized classification features and yields its classification result. No manual work is needed, the process is automatic, and efficiency is high. Testing the trained classification model on the test set can improve the accuracy of the model. The classification features include the longest adjacent vowel distance, the character string information entropy, or the string length, which reflect the characteristics of a character string well and improve the accuracy of the classification result.
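Steps S107 and S109 can be illustrated with a small sketch. It assumes a Gaussian naive Bayes model as one possible instance of the Bayesian classification model mentioned in this document; the patent does not prescribe this particular model. The function names, the variance floor, and the label strings are all hypothetical. The "undetermined parameters" here are the per-class prior, means, and variances.

```python
import math
from collections import defaultdict


def train_gaussian_nb(features, labels, var_floor=1e-6):
    """S107 sketch: estimate per-class prior, feature means and variances."""
    grouped = defaultdict(list)
    for row, label in zip(features, labels):
        grouped[label].append(row)
    total = len(features)
    params = {}
    for label, rows in grouped.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # var_floor (an assumption) avoids division by zero for constant features.
        variances = [
            max(sum((x - m) ** 2 for x in col) / n, var_floor)
            for col, m in zip(columns, means)
        ]
        params[label] = {"prior": n / total, "means": means, "variances": variances}
    return params


def predict_gaussian_nb(params, row):
    """Return the label with the highest log posterior for one feature vector."""
    best_label, best_score = None, -math.inf
    for label, p in params.items():
        score = math.log(p["prior"])
        for x, m, v in zip(row, p["means"], p["variances"]):
            score += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(params, test_features, test_labels):
    """S109 sketch: fraction of test strings whose predicted label is correct."""
    if not test_labels:
        raise ValueError("test set must not be empty")
    correct = sum(
        1
        for row, label in zip(test_features, test_labels)
        if predict_gaussian_nb(params, row) == label
    )
    return correct / len(test_labels)
```

The value returned by `accuracy` is what S110 compares against the preset accuracy threshold.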
Fig. 3 is a structural diagram of a device for classifying character strings according to an embodiment of the present invention. The device includes:
Acquisition module 201, configured to obtain a character string to be classified;
First extraction module 202, configured to extract multiple classification features from the character string to be classified;
Normalization module 203, configured to normalize each classification feature separately, obtaining multiple normalized classification features;
Classification module 204, configured to classify the character string to be classified using the classification model obtained by offline training, according to the multiple normalized classification features, obtaining the classification result of the character string to be classified.
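The four modules above form a simple pipeline, which can be sketched as follows. The sketch assumes min-max scaling with clipping as the normalization; the normalization actually used in S103 is not described in this part of the document, so this choice is an assumption. The names `min_max_normalize` and `classify_string`, and the `lows`/`highs` bounds (imagined as taken from the training set), are hypothetical.

```python
def min_max_normalize(values, lows, highs):
    """Scale each feature into [0, 1] using per-feature bounds (assumed scheme)."""
    normalized = []
    for value, low, high in zip(values, lows, highs):
        if high == low:
            normalized.append(0.0)  # constant feature: no spread to scale by
        else:
            scaled = (value - low) / (high - low)
            normalized.append(min(1.0, max(0.0, scaled)))  # clip unseen extremes
    return normalized


def classify_string(s, extract, lows, highs, model_predict):
    """Chain the modules: extraction -> normalization -> classification."""
    raw_features = extract(s)                               # first extraction module
    features = min_max_normalize(raw_features, lows, highs)  # normalization module
    return model_predict(features)                          # classification module
```

Here `extract` and `model_predict` are passed in as callables so that any feature extractor and any trained model can be plugged in.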
Further, referring to Fig. 4, the device also includes:
Collection module 205, configured to collect a sample set for the preset classification model and divide the sample set into a training set and a test set; wherein the sample set includes a preset number of character strings and the classification result of each of those character strings;
Second extraction module 206, configured to extract multiple classification features from each character string in the training set and normalize them, obtaining multiple normalized classification features for each character string in the training set;
Training module 207, configured to train the undetermined parameters of the preset classification model using the multiple normalized classification features of each character string in the training set, together with the classification result of each character string in the training set, obtaining the trained values of the undetermined parameters;
Third extraction module 208, configured to extract multiple classification features from each character string in the test set and normalize them, obtaining multiple normalized classification features for each character string in the test set;
Test module 209, configured to test the classification model whose undetermined parameters are set to the trained values using the multiple normalized classification features of each character string in the test set, together with the classification result of each character string in the test set, obtaining a test result;
Comparison module 210, configured to compare the accuracy of the test result with the preset accuracy threshold;
First determining module 211, configured to, if the accuracy of the test result is greater than the preset accuracy threshold, determine the classification model whose undetermined parameters are set to the trained values as the classification model obtained by offline training, and then notify acquisition module 201 to perform the step of obtaining the character string to be classified.
Further, referring to Fig. 5, the device also includes:
Second determining module 212, configured to, if the accuracy of the test result is less than or equal to the preset accuracy threshold, determine that the classification model whose undetermined parameters are set to the trained values cannot serve as the classification model obtained by offline training, and then notify collection module 205 to perform the step of collecting the sample set for the preset classification model.
Further, the classification result of the character string to be classified includes:
The character string to be classified is a random character string, or the character string to be classified is a normal character string.
Further, the classification model includes:
A support vector machine (SVM) classification model, a decision tree classification model, a Bayesian classification model, or a K-nearest neighbor classification model.
Further, the classification features include:
The longest adjacent vowel distance, the character string information entropy, or the string length; wherein the longest adjacent vowel distance is the largest of the spacing distances between all adjacent vowel characters of a given character string.
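The three classification features can be computed with a few lines of code. The following is a minimal sketch under stated assumptions: vowels are taken to be the English letters a, e, i, o, u; the "spacing distance" is measured as the difference in index between consecutive vowels; the entropy is Shannon entropy over character frequencies in bits; and a string with fewer than two vowels falls back to its own length. None of these details is fixed by the text here, and the function names are hypothetical.

```python
import math
from collections import Counter

VOWELS = set("aeiouAEIOU")  # assumption: English vowels only


def longest_adjacent_vowel_distance(s):
    """Largest index gap between consecutive vowel characters in s."""
    positions = [i for i, ch in enumerate(s) if ch in VOWELS]
    if len(positions) < 2:
        # Assumption: with fewer than two vowels, use the string length.
        return len(s)
    return max(b - a for a, b in zip(positions, positions[1:]))


def shannon_entropy(s):
    """Shannon entropy (bits per character) of the character distribution."""
    if not s:
        return 0.0
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())


def extract_features(s):
    """Return [longest adjacent vowel distance, information entropy, length]."""
    return [
        float(longest_adjacent_vowel_distance(s)),
        shannon_entropy(s),
        float(len(s)),
    ]
```

Intuitively, a random string such as a generated identifier tends to have long vowel-free runs and a flatter character distribution than an ordinary word, which is why these features help separate the two classes.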
With the device for classifying character strings described in this embodiment, the classification model obtained by offline training classifies the character string to be classified according to the multiple normalized classification features and yields its classification result. No manual work is needed, the process is automatic, and efficiency is high. Testing the trained classification model on the test set can improve the accuracy of the model. The classification features include the longest adjacent vowel distance, the character string information entropy, or the string length, which reflect the characteristics of a character string well and improve the accuracy of the classification result.
The description of the device corresponds to the foregoing method flow; for anything not covered here, refer to the description of the method flow, which is not repeated.
The above description shows and describes several preferred embodiments of the present invention. As stated earlier, however, it should be understood that the invention is not limited to the forms disclosed herein, which should not be regarded as excluding other embodiments; the invention can be used in various other combinations, modifications, and environments, and can be modified within the scope of the inventive concept described herein through the above teachings or the skill or knowledge of the related art. Modifications and changes made by those skilled in the art that do not depart from the spirit and scope of the invention shall all fall within the protection scope of the appended claims.

Claims (13)

1. A method for classifying character strings, characterized in that the method includes:
Obtaining a character string to be classified;
Extracting multiple classification features from the character string to be classified;
Normalizing each of the classification features separately, obtaining multiple normalized classification features;
Classifying the character string to be classified using a classification model obtained by offline training, according to the multiple normalized classification features, obtaining a classification result of the character string to be classified.
2. The method of claim 1, characterized in that, before obtaining the character string to be classified, the method further includes:
Extracting multiple classification features from each character string in a test set and normalizing them, obtaining multiple normalized classification features for each character string in the test set;
Testing the classification model whose undetermined parameters are set to the trained values, using the multiple normalized classification features of each character string in the test set and the classification result of each character string in the test set, obtaining a test result;
Comparing the accuracy of the test result with a preset accuracy threshold;
If the accuracy of the test result is greater than the preset accuracy threshold, determining the classification model whose undetermined parameters are set to the trained values as the classification model obtained by offline training, and then performing the step of obtaining the character string to be classified.
3. The method of claim 2, characterized in that, before extracting multiple classification features from each character string in the test set, the method further includes:
Collecting a sample set for the preset classification model and dividing the sample set into a training set and a test set; wherein the sample set includes a preset number of character strings and the classification result of each of the preset number of character strings;
Extracting multiple classification features from each character string in the training set and normalizing them, obtaining multiple normalized classification features for each character string in the training set;
Training the undetermined parameters of the preset classification model using the multiple normalized classification features of each character string in the training set and the classification result of each character string in the training set, obtaining the trained values of the undetermined parameters.
4. The method of claim 3, characterized in that, after comparing the accuracy of the test result with the preset accuracy threshold, the method further includes:
If the accuracy of the test result is less than or equal to the preset accuracy threshold, determining that the classification model whose undetermined parameters are set to the trained values cannot serve as the classification model obtained by offline training, and then performing the step of collecting the sample set for the preset classification model.
5. The method of claim 1, characterized in that the classification result of the character string to be classified includes:
The character string to be classified is a random character string, or the character string to be classified is a normal character string.
6. The method of claim 1, characterized in that the classification model includes:
A support vector machine (SVM) classification model, a decision tree classification model, a Bayesian classification model, or a K-nearest neighbor classification model.
7. The method of any one of claims 1-6, characterized in that the classification features include:
The longest adjacent vowel distance, the character string information entropy, or the string length; wherein the longest adjacent vowel distance is the largest of the spacing distances between all adjacent vowel characters of a given character string.
8. A device for classifying character strings, characterized in that the device includes:
An acquisition module, configured to obtain a character string to be classified;
A first extraction module, configured to extract multiple classification features from the character string to be classified;
A normalization module, configured to normalize each of the classification features separately, obtaining multiple normalized classification features;
A classification module, configured to classify the character string to be classified using a classification model obtained by offline training, according to the multiple normalized classification features, obtaining a classification result of the character string to be classified.
9. The device of claim 8, characterized in that the device further includes:
A collection module, configured to collect a sample set for the preset classification model and divide the sample set into a training set and a test set; wherein the sample set includes a preset number of character strings and the classification result of each of the preset number of character strings;
A second extraction module, configured to extract multiple classification features from each character string in the training set and normalize them, obtaining multiple normalized classification features for each character string in the training set;
A training module, configured to train the undetermined parameters of the preset classification model using the multiple normalized classification features of each character string in the training set and the classification result of each character string in the training set, obtaining the trained values of the undetermined parameters;
A third extraction module, configured to extract multiple classification features from each character string in the test set and normalize them, obtaining multiple normalized classification features for each character string in the test set;
A test module, configured to test the classification model whose undetermined parameters are set to the trained values using the multiple normalized classification features of each character string in the test set and the classification result of each character string in the test set, obtaining a test result;
A comparison module, configured to compare the accuracy of the test result with a preset accuracy threshold;
A first determining module, configured to, if the accuracy of the test result is greater than the preset accuracy threshold, determine the classification model whose undetermined parameters are set to the trained values as the classification model obtained by offline training, and then notify the acquisition module to perform the step of obtaining the character string to be classified.
10. The device of claim 9, characterized in that the device further includes:
A second determining module, configured to, if the accuracy of the test result is less than or equal to the preset accuracy threshold, determine that the classification model whose undetermined parameters are set to the trained values cannot serve as the classification model obtained by offline training, and then notify the collection module to perform the step of collecting the sample set for the preset classification model.
11. The device of claim 8, characterized in that the classification result of the character string to be classified includes:
The character string to be classified is a random character string, or the character string to be classified is a normal character string.
12. The device of claim 8, characterized in that the classification model includes:
A support vector machine (SVM) classification model, a decision tree classification model, a Bayesian classification model, or a K-nearest neighbor classification model.
13. The device of any one of claims 8-12, characterized in that the classification features include:
The longest adjacent vowel distance, the character string information entropy, or the string length; wherein the longest adjacent vowel distance is the largest of the spacing distances between all adjacent vowel characters of a given character string.
CN201510162076.9A 2015-04-07 2015-04-07 Method and device for classifying character strings Active CN106156120B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510162076.9A CN106156120B (en) 2015-04-07 2015-04-07 Method and device for classifying character strings

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510162076.9A CN106156120B (en) 2015-04-07 2015-04-07 Method and device for classifying character strings

Publications (2)

Publication Number Publication Date
CN106156120A true CN106156120A (en) 2016-11-23
CN106156120B CN106156120B (en) 2020-02-28

Family

ID=57335674

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510162076.9A Active CN106156120B (en) 2015-04-07 2015-04-07 Method and device for classifying character strings

Country Status (1)

Country Link
CN (1) CN106156120B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102737186A (en) * 2012-06-26 2012-10-17 腾讯科技(深圳)有限公司 Malicious file identification method, device and storage medium
US20140093173A1 (en) * 2012-10-01 2014-04-03 Silverbrook Research Pty Ltd Classifying a string formed from hand-written characters
CN103221960A (en) * 2012-12-10 2013-07-24 华为技术有限公司 Detection method and apparatus of malicious code
CN103593062A (en) * 2013-11-08 2014-02-19 北京奇虎科技有限公司 Data detection method and device
CN104462058A (en) * 2014-10-24 2015-03-25 腾讯科技(深圳)有限公司 Character string identification method and device

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018107953A1 (en) * 2016-12-12 2018-06-21 惠州Tcl移动通信有限公司 Smart terminal, and automatic application sorting method thereof
CN106650449A (en) * 2016-12-29 2017-05-10 哈尔滨安天科技股份有限公司 Script heuristic detection method and system based on variable name confusion degree
CN106650449B (en) * 2016-12-29 2020-05-22 哈尔滨安天科技集团股份有限公司 Script heuristic detection method and system based on variable name confusion degree
CN109997104B (en) * 2017-09-30 2021-06-22 华为技术有限公司 Notification display method and terminal
US11500507B2 (en) 2017-09-30 2022-11-15 Huawei Technologies Co.. Ltd. Notification display method and terminal
CN109997104A (en) * 2017-09-30 2019-07-09 华为技术有限公司 A kind of notice display methods and terminal
CN107807987A (en) * 2017-10-31 2018-03-16 广东工业大学 A kind of string sort method, system and a kind of string sort equipment
US11463476B2 (en) 2017-10-31 2022-10-04 Guangdong University Of Technology Character string classification method and system, and character string classification device
CN107807987B (en) * 2017-10-31 2021-07-02 广东工业大学 Character string classification method and system and character string classification equipment
CN108920694A (en) * 2018-07-13 2018-11-30 北京神州泰岳软件股份有限公司 A kind of short text multi-tag classification method and device
CN108920694B (en) * 2018-07-13 2020-08-28 鼎富智能科技有限公司 Short text multi-label classification method and device
CN110875959A (en) * 2018-08-13 2020-03-10 阿里巴巴集团控股有限公司 Data identification method, junk mailbox identification method and file identification method
CN110875959B (en) * 2018-08-13 2022-10-18 阿里巴巴集团控股有限公司 Data identification method, junk mailbox identification method and file identification method
CN109343802A (en) * 2018-09-04 2019-02-15 中国平安人寿保险股份有限公司 Declaration form printing data generating method, device, computer equipment and storage medium
CN109343802B (en) * 2018-09-04 2023-11-03 中国平安人寿保险股份有限公司 Policy print data generation method, device, computer device and storage medium

Also Published As

Publication number Publication date
CN106156120B (en) 2020-02-28

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant