CN106156120A - Method and apparatus for classifying character strings - Google Patents
- Publication number: CN106156120A
- Application number: CN201510162076.9A
- Authority: CN (China)
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method and an apparatus for classifying character strings, and belongs to the field of computer communication technology. The method includes: obtaining a character string to be classified; extracting multiple classification features from the character string to be classified; normalizing each classification feature separately to obtain multiple normalized classification features; and classifying the character string to be classified with a classification model obtained by offline training, according to the multiple normalized classification features, to obtain a classification result for the character string. The apparatus includes an acquisition module, a first extraction module, a normalization module, and a classification module. Because the classification is performed by the offline-trained classification model on the normalized classification features, it does not rely on manual work, can be carried out automatically, and is highly efficient.
Description
Technical field
The present invention relates to the field of computer communication technology, and in particular to a method and an apparatus for classifying character strings.
Background
With the development of computer communication technology, terminal devices such as computers, tablets, and mobile phones have increasingly become indispensable tools for daily life and work, while service devices that provide background services such as networking and computing are also growing in number, and the demands on the service capability of computing devices such as terminal devices and service devices keep rising. In many scenarios (for example, registration bots maliciously registering large numbers of invalid accounts, or attack machines maliciously forging large numbers of invalid domain name requests), a computing device receives a large number of random character strings (such as "aaaxbhzqegs-2", "4s7pTDAOV-L#", "!OC | w4&s"). These random character strings carry no meaning, but the computing device does not know this when it first receives them, and it processes them as if they were normal, meaningful character strings (such as "alibaba-inc" or "helloworld"), which affects the normal operation of the computing device.
To avoid affecting the normal operation of the computing device, the character strings it receives can be classified to separate random character strings from normal character strings, so that the computing device can process the two kinds differently. At present, character strings are classified manually, according to the semantics and context of the string itself.
The existing method of classifying character strings relies on manual work and is very inefficient.
Summary of the invention
To solve the existing technical problem, the invention provides a method and an apparatus for classifying character strings. A classification model obtained by offline training classifies the character string to be classified according to multiple normalized classification features and yields its classification result, so the classification does not rely on manual work, can be carried out automatically, and is highly efficient.
To solve the above problem, the invention discloses a method for classifying character strings, the method including:
obtaining a character string to be classified;
extracting multiple classification features from the character string to be classified;
normalizing each of the classification features separately to obtain multiple normalized classification features;
classifying the character string to be classified with a classification model obtained by offline training, according to the multiple normalized classification features, to obtain a classification result for the character string to be classified.
Further, before the character string to be classified is obtained, the method also includes:
extracting the multiple classification features from each character string in the test set and normalizing them, to obtain the multiple normalized classification features of each character string in the test set;
testing the classification model whose undetermined parameters are set to the trained values, using the multiple normalized classification features of each character string in the test set and the classification result of each character string in the test set, to obtain a test result;
comparing the accuracy of the test result with a preset accuracy threshold;
if the accuracy of the test result is greater than the preset accuracy threshold, determining the classification model whose undetermined parameters are set to the trained values as the classification model obtained by offline training, and then performing the step of obtaining a character string to be classified.
Further, before the multiple classification features are extracted from each character string in the test set, the method also includes:
collecting a sample set for a preset classification model and dividing the sample set into a training set and a test set, where the sample set includes a preset number of character strings and the classification result of each of those character strings;
extracting the multiple classification features from each character string in the training set and normalizing them, to obtain the multiple normalized classification features of each character string in the training set;
training the undetermined parameters of the preset classification model, using the multiple normalized classification features of each character string in the training set and the classification result of each character string in the training set, to obtain the trained values of the undetermined parameters.
Further, after the accuracy of the test result is compared with the preset accuracy threshold, the method also includes:
if the accuracy of the test result is less than or equal to the preset accuracy threshold, determining that the classification model whose undetermined parameters are set to the trained values cannot serve as the classification model obtained by offline training, and then performing the step of collecting a sample set for the preset classification model.
Further, the classification result of the character string to be classified includes:
the character string to be classified is a random character string, or the character string to be classified is a normal character string.
Further, the classification model includes:
a support vector machine (SVM) classification model, a decision tree classification model, a Bayesian classification model, or a K-nearest-neighbor classification model.
Further, the classification features include:
the longest adjacent vowel distance, the character string information entropy, or the character string length, where the longest adjacent vowel distance denotes the largest of the spacing distances between all adjacent vowel characters in a character string.
To solve the above problem, the invention also discloses an apparatus for classifying character strings, the apparatus including:
an acquisition module, configured to obtain a character string to be classified;
a first extraction module, configured to extract multiple classification features from the character string to be classified;
a normalization module, configured to normalize each of the classification features separately to obtain multiple normalized classification features;
a classification module, configured to classify the character string to be classified with a classification model obtained by offline training, according to the multiple normalized classification features, to obtain a classification result for the character string to be classified.
Further, the apparatus also includes:
a collection module, configured to collect a sample set for a preset classification model and divide the sample set into a training set and a test set, where the sample set includes a preset number of character strings and the classification result of each of those character strings;
a second extraction module, configured to extract the multiple classification features from each character string in the training set and normalize them, to obtain the multiple normalized classification features of each character string in the training set;
a training module, configured to train the undetermined parameters of the preset classification model, using the multiple normalized classification features of each character string in the training set and the classification result of each character string in the training set, to obtain the trained values of the undetermined parameters;
a third extraction module, configured to extract the multiple classification features from each character string in the test set and normalize them, to obtain the multiple normalized classification features of each character string in the test set;
a test module, configured to test the classification model whose undetermined parameters are set to the trained values, using the multiple normalized classification features of each character string in the test set and the classification result of each character string in the test set, to obtain a test result;
a comparison module, configured to compare the accuracy of the test result with a preset accuracy threshold;
a first determining module, configured to, if the accuracy of the test result is greater than the preset accuracy threshold, determine the classification model whose undetermined parameters are set to the trained values as the classification model obtained by offline training, and then notify the acquisition module to perform the step of obtaining a character string to be classified.
Further, the apparatus also includes:
a second determining module, configured to, if the accuracy of the test result is less than or equal to the preset accuracy threshold, determine that the classification model whose undetermined parameters are set to the trained values cannot serve as the classification model obtained by offline training, and then notify the collection module to perform the step of collecting a sample set for the preset classification model.
Further, the classification result of the character string to be classified includes:
the character string to be classified is a random character string, or the character string to be classified is a normal character string.
Further, the classification model includes:
a support vector machine (SVM) classification model, a decision tree classification model, a Bayesian classification model, or a K-nearest-neighbor classification model.
Further, the classification features include:
the longest adjacent vowel distance, the character string information entropy, or the character string length, where the longest adjacent vowel distance denotes the largest of the spacing distances between all adjacent vowel characters in a character string.
Compared with the prior art, the invention can achieve the following technical effects:
1) The classification model obtained by offline training classifies the character string to be classified according to multiple normalized classification features and yields its classification result, so the classification does not rely on manual work, can be carried out automatically, and is highly efficient.
2) Testing the trained classification model on the test set can improve the accuracy of the classification model.
3) Classification features that include the longest adjacent vowel distance, the character string information entropy, or the character string length capture the characteristics of a character string well and improve the accuracy of the classification result.
Of course, a product implementing the invention does not necessarily need to achieve all of the above technical effects at the same time.
Brief description of the drawings
The drawings described here provide a further understanding of the invention and form a part of it. The illustrative embodiments of the invention and their description explain the invention and do not unduly limit it. In the drawings:
Fig. 1 is a flowchart of a first method for classifying character strings according to an embodiment of the invention;
Fig. 2 is a flowchart of a second method for classifying character strings according to an embodiment of the invention;
Fig. 3 is a structural diagram of a first apparatus for classifying character strings according to an embodiment of the invention;
Fig. 4 is a structural diagram of a second apparatus for classifying character strings according to an embodiment of the invention;
Fig. 5 is a structural diagram of a third apparatus for classifying character strings according to an embodiment of the invention.
Detailed description of the invention
Embodiments of the invention are described in detail below with reference to the drawings and examples, so that the process by which the invention applies technical means to solve the technical problem and achieve the technical effect can be fully understood and implemented accordingly.
In a typical configuration, a computing device includes one or more processors (CPUs), an input/output interface, a network interface, and memory.
Memory may include non-permanent memory in a computer-readable medium, random access memory (RAM), and/or non-volatile memory such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, which can store information by any method or technology. The information can be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible to a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
Certain terms are used throughout the specification and claims to refer to particular components. Those skilled in the art will appreciate that hardware manufacturers may refer to the same component by different names. The specification and claims do not distinguish components by differences in name, but by differences in function. The term "comprising", as used throughout the specification and claims, is open-ended and should be interpreted as "including but not limited to". "Substantially" means that, within an acceptable error range, a person skilled in the art can solve the technical problem and basically achieve the technical effect. In addition, the term "coupled" here covers any direct or indirect means of electrical coupling. Therefore, if the text describes a first device as coupled to a second device, the first device may be electrically coupled to the second device directly, or indirectly through other devices or coupling means. The description that follows presents preferred embodiments of the invention; it is given to illustrate the general principles of the invention and is not intended to limit its scope. The scope of protection of the invention is defined by the appended claims.
It should also be noted that the terms "include", "comprise", and any of their variants are intended to cover non-exclusive inclusion, so that a product or system that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to that product or system. Without further limitation, an element introduced by the phrase "including a ..." does not exclude the presence of additional identical elements in the product or system that includes it.
Description of embodiments
The implementation of the method of the invention is further described below with an embodiment. As shown in Fig. 1, which is a flowchart of a method for classifying character strings according to an embodiment of the invention, the method includes:
S101: Obtain a character string to be classified.
Specifically, any character string input to a computing device can be obtained and classified as the character string to be classified.
S102: Extract multiple classification features from the character string to be classified.
Specifically, the classification features include the longest adjacent vowel distance, the character string information entropy, or the character string length.
Specifically, the longest adjacent vowel distance denotes the largest of the spacing distances between all adjacent vowel characters in a character string. This embodiment also treats "-" and the end of the string as vowel characters; it is not limited to these, and in practice other special characters can be treated as vowel characters as needed. For example, the adjacent vowel character pairs of the string "alibaba-inc" are, in order: (a, i), (i, a), (a, a), (a, "-"), ("-", i), and (i, end of string). The spacing distances between them are, in order, 1, 1, 1, 0, 0, and 2 characters, so the longest adjacent vowel distance of "alibaba-inc" is 2 characters.
It should be noted that a vowel is voiced by vibrating the vocal cords, so the adjacent vowel distance characterizes the length of each syllable in a character string and reflects the rhythm of its pronunciation. A normal character string (a meaningful word, phrase, etc.) has fairly short syllables and a fairly even rhythm so that it is easy to pronounce, and its longest adjacent vowel distance tends to be short. For example, the adjacent vowel distances of "alibaba-inc" are [1,1,1,0,0,2], so its longest adjacent vowel distance is 2. A meaningless random character string is not constrained by pronunciation, so its syllables are longer and it has no rhythm. In addition, the probability that a vowel occurs in it (< 5/26) is much smaller than the probability that a non-vowel occurs, so runs of consecutive non-vowels are more likely and its longest adjacent vowel distance tends to be long. For example, the adjacent vowel distances of "aaaxbhzqegs-2" are [0,0,5,2,1], so its longest adjacent vowel distance is 5.
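The adjacent vowel distance described above can be sketched in a few lines of Python. This is a minimal illustration rather than the patented implementation: the function names, the choice of "aeiou" plus "-" as the vowel set, the lowercasing of the input, and the decision not to count consonants that precede the first vowel are all assumptions of this sketch.

```python
# Characters treated as vowels. Including "-" follows the embodiment;
# lowercasing the input first is an assumption of this sketch.
VOWELS = frozenset("aeiou-")


def adjacent_vowel_gaps(s, vowels=VOWELS):
    """Return the number of characters between each pair of adjacent vowels.

    The end of the string is treated as one more vowel position.
    Characters before the first vowel are not counted (an assumption).
    """
    lowered = s.lower()
    positions = [i for i, ch in enumerate(lowered) if ch in vowels]
    positions.append(len(lowered))  # end of string counts as a vowel
    return [b - a - 1 for a, b in zip(positions, positions[1:])]


def longest_adjacent_vowel_gap(s, vowels=VOWELS):
    """Return the largest adjacent vowel distance, or 0 if there are no gaps."""
    return max(adjacent_vowel_gaps(s, vowels), default=0)
```

With these definitions, `adjacent_vowel_gaps("alibaba-inc")` gives `[1, 1, 1, 0, 0, 2]` and `adjacent_vowel_gaps("aaaxbhzqegs-2")` gives `[0, 0, 5, 2, 1]`, matching the two worked examples in the text.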
Specifically, the character string information entropy H characterizes how random a character string is. It is computed as:
$$H = -\sum_{i=1}^{N} p_i \log_2 p_i$$
where N denotes the number of characters in the character string and $p_i$ denotes the probability that the i-th character occurs in the character string.
It should be noted that in a normal character string (a meaningful word, phrase, etc.) the arrangement of characters follows writing conventions and cannot be arbitrary, so its degree of randomness is low and its information entropy tends to be low; for example, the information entropy of "alibaba-inc" is 2.44. The arrangement of characters in a meaningless random character string is unconstrained, so its degree of randomness is higher and its information entropy is higher; for example, the information entropy of "aaaxbhzqegs-2" is 3.19.
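A short Python sketch of the entropy feature follows. It assumes a base-2 Shannon entropy over the frequencies of the distinct characters in the string; the function name and the `ignore` parameter are hypothetical and are not taken from the text.

```python
import math
from collections import Counter


def string_entropy(s, ignore=""):
    """Shannon entropy (base 2) of the character frequencies in s.

    Characters listed in `ignore` are skipped before counting
    (a hypothetical option added for this sketch).
    """
    chars = [ch for ch in s if ch not in ignore]
    if not chars:
        return 0.0
    total = len(chars)
    counts = Counter(chars)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

Counting every character, this sketch gives about 2.66 for "alibaba-inc" and about 3.34 for "aaaxbhzqegs-2". Skipping the hyphen (`ignore="-"`) gives about 2.45 and 3.19, which is close to the figures quoted in the text; that is an observation from the arithmetic, not something the text states, so the treatment of "-" should be regarded as a guess.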
S103: Normalize each classification feature separately to obtain multiple normalized classification features.
Specifically, in this embodiment the normalization can use the Z-score method, computed as:
$$X_j = \frac{x_j - \mu_j}{\delta_j}$$
where $X_j$ denotes the j-th normalized classification feature, $x_j$ denotes the j-th classification feature, $\mu_j$ denotes the sample mean of the j-th classification feature, and $\delta_j$ denotes the sample standard deviation of the j-th classification feature. $\mu_j$ and $\delta_j$ can be computed from statistics over the sample set of collected character strings when the classification model is trained offline.
It should be noted that normalization is not limited to the Z-score method; any other feasible method can be used, and no specific limitation is imposed here.
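The Z-score step can be illustrated with Python's standard `statistics` module. In this sketch the mean and standard deviation are fitted once on the sample set and then reused for every string classified later; the function names and the handling of a zero standard deviation are assumptions.

```python
import statistics


def fit_zscore(samples):
    """Return (mean, sample standard deviation) for one feature column.

    Needs at least two samples, because statistics.stdev uses n - 1.
    """
    return statistics.mean(samples), statistics.stdev(samples)


def zscore(x, mean, std):
    """Normalize one feature value; a zero deviation maps to 0.0 (assumption)."""
    if std == 0:
        return 0.0
    return (x - mean) / std
```

For example, fitting on the feature values 1.0, 2.0, and 3.0 gives a mean of 2.0 and a sample standard deviation of 1.0, so a new value of 3.0 normalizes to 1.0.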
S104: Classify the character string to be classified with the classification model obtained by offline training, according to the multiple normalized classification features, to obtain its classification result.
Specifically, the classification model obtained by offline training can be an SVM (Support Vector Machine) classification model, a decision tree classification model, a Bayesian classification model, or a K-nearest-neighbor (K-NN) classification model. Each classification model is described in S105.
Specifically, classifying the character string to be classified requires the classification model obtained by offline training, so the classification model must first be obtained offline. In a preferred embodiment of the invention, referring to Fig. 2, before the character string to be classified is obtained, the method also includes:
S105: Collect a sample set for a preset classification model and divide the sample set into a training set and a test set, where the sample set includes a preset number of character strings and the classification result of each of those character strings.
Specifically, a large number of random character strings (for example 1,000,000) and a large number of normal character strings (for example 300,000) can be selected at random as the sample set for the preset classification model. The sample set is randomly divided into a training set and a test set in a certain ratio (for example 6:4), where the training set is used to train the preset classification model and the test set is used to test the trained classification model.
It should be noted that the sample set is collected from character strings already known to be random or normal, so the classification result of each character string in the sample set is known. For convenience in later steps, a character string classified as random can be labeled with type 0 and a character string classified as normal with type 1. The two types need not be distinguished by 0 and 1; any other feasible way of distinguishing them can be used, and no specific limitation is imposed here.
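The random 6:4 division of a labeled sample set can be sketched as follows. The label constants mirror the 0 and 1 convention mentioned in the text; the function name, the `seed` parameter, and the representation of a sample as a `(string, label)` pair are assumptions of this sketch.

```python
import random

RANDOM_LABEL = 0  # random character string
NORMAL_LABEL = 1  # normal character string


def split_samples(samples, train_ratio=0.6, seed=None):
    """Shuffle labeled samples and split them into (training set, test set).

    `samples` is any iterable of (string, label) pairs; the input is not
    modified. `seed` makes the shuffle repeatable.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]
```

With ten samples and the default ratio, the training set receives six samples and the test set four.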
Specifically, the classification model can be an SVM classification model, a Bayesian classification model, a decision tree classification model, a K-nearest-neighbor (K-NN) classification model, etc.
The SVM classification model is expressed as:
$$D(y) = w^T y + b$$
where y denotes the normalized feature vector formed from the multiple normalized classification features, $w^T$ denotes the coefficient vector, and b denotes the intercept; $w^T$ and b are the undetermined parameters.
During training, it can be assumed that a character string classified as a normal character string has the value 1 and a character string classified as a random character string has the value 0. Correspondingly for D(y), it is assumed that if the result of D(y) is greater than 0, the corresponding character string is judged to be a positive sample (a normal character string), and if the result of D(y) is less than or equal to 0, the corresponding character string is judged to be a negative sample (a random character string). In practice other conventions can be assumed, as long as they are applied consistently; no specific limitation is imposed here.
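The decision rule for the SVM model can be sketched as below, assuming the linear form D(y) = w·y + b implied by the coefficient vector and intercept described above. The function names are hypothetical, and the coefficient values used in any call are illustrative only; real values would come from training.

```python
def decision_value(w, y, b):
    """Compute D(y) = w . y + b for coefficient vector w and feature vector y."""
    if len(w) != len(y):
        raise ValueError("w and y must have the same length")
    return sum(wi * yi for wi, yi in zip(w, y)) + b


def classify(w, y, b):
    """Return 1 (normal string) if D(y) > 0, otherwise 0 (random string)."""
    return 1 if decision_value(w, y, b) > 0 else 0
```

A value of exactly 0 falls on the "random" side, which follows the "less than or equal to 0" rule in the text.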
The Bayesian classification model is expressed as follows:
where k takes two values (for example 0 and 1), corresponding to the two classes of character string (random character string and normal character string), y denotes the normalized classification features, $y_j$ denotes the j-th normalized classification feature, j ∈ [1, m], and $Y_{jk}$ is taken to satisfy $N(\mu_{jk}, \delta_{jk})$.
During training, it can be assumed that a character string classified as a normal character string has the value 1 and a character string classified as a random character string has the value 0. Correspondingly for c(y), it is assumed that if the result of c(y) is greater than 0, the corresponding character string is judged to be a positive sample (a normal character string), and if the result of c(y) is less than or equal to 0, the corresponding character string is judged to be a negative sample (a random character string). In practice other conventions can be assumed, as long as they are applied consistently; no specific limitation is imposed here.
The decision tree classification model can be built with algorithms such as ID3, C4.5, or CART. When the C4.5 algorithm is used, the normalized longest adjacent vowel distance, the normalized character string information entropy, and the normalized character string length serve as candidate split attributes, the classification result (normal character string or random character string) serves as the decision outcome, the training set is partitioned using the maximum information gain ratio as the split criterion, and the decision tree classification model is constructed step by step.
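The split criterion named above, the information gain ratio, can be sketched for one continuous attribute and one candidate threshold. This shows only how a single binary split is scored, in the way C4.5 is commonly described for continuous attributes; it is not a full tree builder, and the function names and the choice to return 0.0 for a split that leaves one side empty are assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a non-empty list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def gain_ratio(values, labels, threshold):
    """Score the binary split `value <= threshold` by information gain ratio."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # degenerate split carries no information (assumption)
    total = len(labels)
    weights = [len(left) / total, len(right) / total]
    # Information gain: parent entropy minus weighted child entropy.
    gain = entropy(labels) - (weights[0] * entropy(left)
                              + weights[1] * entropy(right))
    # Split information penalizes splits into many or uneven parts.
    split_info = -sum(w * math.log2(w) for w in weights)
    return gain / split_info
```

A tree builder would evaluate this score for each candidate attribute and threshold and split on the pair with the highest value.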
In practice, any of these classification models can be built as the application requires; the K-nearest-neighbor classification model and others are not described by example here.
S106: Extract multiple classification features from each character string in the training set and normalize them, to obtain the multiple normalized classification features of each character string in the training set.
Specifically, the normalization step is similar to S103 and is not repeated here.
S107: Train the undetermined parameters of the preset classification model, using the multiple normalized classification features of each character string in the training set and the classification result of each character string in the training set, to obtain the trained values of the undetermined parameters.
S108: Extract multiple classification features from each character string in the test set and normalize them, to obtain the multiple normalized classification features of each character string in the test set.
Specifically, the normalization step is similar to S103 and is not repeated here.
S109: Test the classification model whose undetermined parameters are set to the trained values, using the multiple normalized classification features of each character string in the test set and the classification result of each character string in the test set, to obtain a test result.
S110: Compare the accuracy of the test result with the preset accuracy threshold. If the accuracy of the test result is greater than the preset accuracy threshold, perform S111; if it is less than or equal to the preset accuracy threshold, perform S112.
The preset accuracy threshold can be set according to the characteristics of the application, for example to 50% or 70%; no specific limitation is imposed here.
S111: Determine the classification model whose undetermined parameters are set to the trained values as the classification model obtained by offline training, and then perform step S101 of obtaining a character string to be classified.
That is, online classification can now be carried out.
It should be noted that, in practice, S111 yields the classification model that can be used for online classification, but when online classification is actually carried out, that is, when S101-S104 are performed, can be set according to the application; S101-S104 need not be performed immediately after S111.
S112: Determine that the classification model whose undetermined parameters are set to the trained values cannot serve as the classification model obtained by offline training, and then perform step S105 of collecting a sample set for the classification model.
That is, offline training is carried out again.
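The accept-or-retrain flow of S105 to S112 can be sketched as a small loop. The `collect`, `train`, and `predict` callables are hypothetical placeholders for whatever sampling and model code is used, and the `max_rounds` cap is an assumption added so the sketch always terminates; the text itself simply returns to S105.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the known labels."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


def accept_model(test_accuracy, threshold):
    """S110: accept only when accuracy is strictly greater than the threshold."""
    return test_accuracy > threshold


def train_until_accepted(collect, train, predict, threshold, max_rounds=10):
    """Repeat S105-S110 until a model passes; return it, or None if none does.

    collect() -> (training set, test set) of (features, label) pairs
    train(training_set) -> model
    predict(model, features) -> label
    """
    for _ in range(max_rounds):
        train_set, test_set = collect()                      # S105
        model = train(train_set)                             # S106-S107
        labels = [label for _, label in test_set]
        guesses = [predict(model, x) for x, _ in test_set]   # S108-S109
        if accept_model(accuracy(guesses, labels), threshold):  # S110
            return model                                     # S111
    return None  # every round ended in S112
```

Note that an accuracy exactly equal to the threshold is rejected, matching the "less than or equal to" branch that leads to S112.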
Furthermore, it is desirable to explanation, owing to practical situations constantly changes, when determining undetermined
After parameter is set as the disaggregated model that the disaggregated model of trained values obtains as off-line training, can be every one
The new sample set of time interval Resurvey preset, re-training obtains new disaggregated model, to original
Disaggregated model be updated, to ensure the accuracy of classification results.
It should also be noted that the offline training of the classification model is preferably run on a distributed big data processing system (such as ODPS or Hadoop), so that processing and modeling of large-scale samples can be completed within an acceptable time.
In the method for classifying character strings described in this embodiment, the character string to be classified is classified according to multiple normalized classification features by a classification model obtained by offline training, and the classification result of the character string is obtained. This requires no manual work, can be carried out automatically, and is highly efficient. Testing the trained classification model on the test set helps improve the accuracy of the classification model. The classification features, which include the longest adjacent vowel distance, the character string information entropy, or the string length, reflect the characteristics of a character string well and thus improve the accuracy of the classification results.
Fig. 3 is a structural diagram of a device for classifying character strings according to an embodiment of the present invention. As shown in Fig. 3, the device includes:
An acquisition module 201, configured to obtain a character string to be classified;
A first extraction module 202, configured to extract multiple classification features from the character string to be classified;
A normalization module 203, configured to normalize each classification feature separately to obtain multiple normalized classification features;
A classification module 204, configured to classify the character string to be classified according to the multiple normalized classification features, using a classification model obtained by offline training, to obtain the classification result of the character string to be classified.
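The four modules above form a simple pipeline: obtain, extract, normalize, classify. The sketch below shows one way this could look in Python. It is illustrative only: min-max scaling is an assumed normalization (this passage does not say which normalization is used), and `extract`, `model`, `lows`, and `highs` are hypothetical parameters supplied by the caller.

```python
from typing import Callable, List, Sequence


def min_max_normalize(values: Sequence[float], lows: Sequence[float],
                      highs: Sequence[float]) -> List[float]:
    """Scale each feature into [0, 1] using per-feature bounds.

    Assumption: the bounds come from the training data; values outside
    the bounds are clipped.
    """
    normalized = []
    for value, low, high in zip(values, lows, highs):
        if high == low:
            normalized.append(0.0)  # degenerate feature range
        else:
            scaled = (value - low) / (high - low)
            normalized.append(min(1.0, max(0.0, scaled)))
    return normalized


def classify_string(s: str,
                    extract: Callable[[str], Sequence[float]],
                    lows: Sequence[float],
                    highs: Sequence[float],
                    model: Callable[[Sequence[float]], str]) -> str:
    """Extract features, normalize them, and apply the trained model."""
    raw_features = extract(s)                                    # module 202
    features = min_max_normalize(raw_features, lows, highs)      # module 203
    return model(features)                                       # module 204
```

Passing the extractor and the model as callables keeps the pipeline independent of which features and which classifier are chosen.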
Further, referring to Fig. 4, the device also includes:
A collection module 205, configured to collect a sample set for the preset classification model and divide the sample set into a training set and a test set; wherein the sample set includes a preset number of character strings and the classification result of each of those character strings;
A second extraction module 206, configured to extract multiple classification features from each character string in the training set and normalize them, to obtain multiple normalized classification features of each character string in the training set;
A training module 207, configured to train the undetermined parameters of the preset classification model using the multiple normalized classification features and the classification result of each character string in the training set, to obtain trained values of the undetermined parameters;
A third extraction module 208, configured to extract multiple classification features from each character string in the test set and normalize them, to obtain multiple normalized classification features of each character string in the test set;
A test module 209, configured to test the classification model whose undetermined parameters are set to the trained values, using the multiple normalized classification features and the classification result of each character string in the test set, to obtain a test result;
A comparison module 210, configured to compare the accuracy of the test result with a preset accuracy threshold;
A first determining module 211, configured to, if the accuracy of the test result is greater than the preset accuracy threshold, determine that the classification model whose undetermined parameters are set to the trained values is the classification model obtained by offline training, and then notify the acquisition module 201 to perform the step of obtaining the character string to be classified.
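The split into a training set and a test set, and the accuracy measured on the test set, can be sketched as follows. The function names, the 80/20 split ratio, and the fixed shuffle seed are assumptions for illustration; the source does not specify how the sample set is divided.

```python
import random
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[Sequence[float], str]  # (normalized features, known class)


def split_samples(samples: Sequence, train_ratio: float = 0.8,
                  seed: int = 0) -> Tuple[List, List]:
    """Shuffle the sample set and divide it into training and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # deterministic for a given seed
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def accuracy(predict: Callable[[Sequence[float]], str],
             labelled: Sequence[Sample]) -> float:
    """Fraction of labelled samples whose predicted class matches the known one."""
    if not labelled:
        return 0.0
    correct = sum(1 for features, label in labelled
                  if predict(features) == label)
    return correct / len(labelled)
```

The value returned by `accuracy` is what the comparison module would check against the preset accuracy threshold.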
Further, referring to Fig. 5, the device also includes:
A second determining module 212, configured to, if the accuracy of the test result is less than or equal to the preset accuracy threshold, determine that the classification model whose undetermined parameters are set to the trained values cannot serve as the classification model obtained by offline training, and then notify the collection module 205 to perform the step of collecting a sample set for the preset classification model.
Further, the classification result of the character string to be classified includes:
The character string to be classified is a random character string, or the character string to be classified is a normal character string.
Further, the classification model includes:
A support vector machine (SVM) classification model, a decision tree classification model, a Bayesian classification model, or a K-nearest-neighbor classification model.
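Of the listed models, K-nearest-neighbor is the simplest to illustrate without external libraries. The sketch below is an assumption-laden example rather than the patent's implementation: the function name `knn_predict`, the Euclidean distance, the default `k = 3`, and majority voting are all illustrative choices.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def knn_predict(train: List[Tuple[Sequence[float], str]],
                query: Sequence[float], k: int = 3) -> str:
    """Classify `query` by majority vote among its k nearest training samples."""
    # Sort the labelled training samples by Euclidean distance to the query.
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

Here each training sample is a pair of normalized classification features and a known class such as "random" or "normal". K-nearest-neighbor has no parameters fitted in the usual sense, so "training" amounts to storing the labelled samples and choosing `k`.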
Further, the classification features include:
The longest adjacent vowel distance, the character string information entropy, or the string length; wherein the longest adjacent vowel distance denotes the largest of the spacing distances between all adjacent vowel characters of any given character string.
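The three features can be computed directly from a string. The sketch below makes several assumptions that this passage does not settle: vowels are the ASCII letters a, e, i, o, u (case-insensitive); the spacing between two adjacent vowels is measured as the difference of their indices; a string with fewer than two vowels is given its own length as the distance; and the information entropy is taken to be the Shannon entropy (base 2) of the character frequencies.

```python
import math
from collections import Counter
from typing import List

VOWELS = set("aeiou")  # assumption: ASCII vowels only, case-insensitive


def longest_adjacent_vowel_distance(s: str) -> int:
    """Largest index gap between consecutive vowels in the string."""
    positions = [i for i, ch in enumerate(s.lower()) if ch in VOWELS]
    if len(positions) < 2:
        return len(s)  # assumption: fallback when no vowel pair exists
    return max(b - a for a, b in zip(positions, positions[1:]))


def information_entropy(s: str) -> float:
    """Shannon entropy (bits per character) of the character distribution."""
    if not s:
        return 0.0
    n = len(s)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(s).values())


def extract_features(s: str) -> List[float]:
    """Return [longest adjacent vowel distance, entropy, length]."""
    return [float(longest_adjacent_vowel_distance(s)),
            information_entropy(s),
            float(len(s))]
```

Intuitively, randomly generated strings tend to have long vowel-free runs and a flatter character distribution (higher entropy) than ordinary words, which is why such features can help separate the two classes.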
In the device for classifying character strings described in this embodiment, the character string to be classified is classified according to multiple normalized classification features by a classification model obtained by offline training, and the classification result of the character string is obtained. This requires no manual work, can be carried out automatically, and is highly efficient. Testing the trained classification model on the test set helps improve the accuracy of the classification model. The classification features, which include the longest adjacent vowel distance, the character string information entropy, or the string length, reflect the characteristics of a character string well and thus improve the accuracy of the classification results.
The device corresponds to the method flow described above. For details not covered here, refer to the description of the method flow; they are not repeated one by one.
The above description illustrates and describes some preferred embodiments of the present invention. As stated above, however, it should be understood that the invention is not limited to the forms disclosed herein, which should not be regarded as excluding other embodiments. The invention can be used in various other combinations, modifications, and environments, and can be modified within the scope of the inventive concept described herein through the above teachings or the skill or knowledge of the related art. Changes and variations made by those skilled in the art that do not depart from the spirit and scope of the invention shall all fall within the protection scope of the appended claims.
Claims (13)
1. A method for classifying a character string, characterized in that the method comprises:
Obtaining a character string to be classified;
Extracting multiple classification features from the character string to be classified;
Normalizing each of the classification features separately to obtain multiple normalized classification features;
Classifying the character string to be classified according to the multiple normalized classification features, using a classification model obtained by offline training, to obtain a classification result of the character string to be classified.
2. The method of claim 1, characterized in that, before obtaining the character string to be classified, the method further comprises:
Extracting multiple classification features from each character string in a test set and normalizing them, to obtain multiple normalized classification features of each character string in the test set;
Testing the classification model whose undetermined parameters are set to the trained values, using the multiple normalized classification features of each character string in the test set and the classification result of each character string in the test set, to obtain a test result;
Comparing the accuracy of the test result with a preset accuracy threshold;
If the accuracy of the test result is greater than the preset accuracy threshold, determining that the classification model whose undetermined parameters are set to the trained values is the classification model obtained by offline training, and then performing the step of obtaining the character string to be classified.
3. The method of claim 2, characterized in that, before extracting the multiple classification features from each character string in the test set, the method further comprises:
Collecting a sample set for the preset classification model and dividing the sample set into a training set and a test set; wherein the sample set includes a preset number of character strings and the classification result of each of those character strings;
Extracting multiple classification features from each character string in the training set and normalizing them, to obtain multiple normalized classification features of each character string in the training set;
Training the undetermined parameters of the preset classification model, using the multiple normalized classification features of each character string in the training set and the classification result of each character string in the training set, to obtain trained values of the undetermined parameters.
4. The method of claim 3, characterized in that, after comparing the accuracy of the test result with the preset accuracy threshold, the method further comprises:
If the accuracy of the test result is less than or equal to the preset accuracy threshold, determining that the classification model whose undetermined parameters are set to the trained values cannot serve as the classification model obtained by offline training, and then performing the step of collecting a sample set for the preset classification model.
5. The method of claim 1, characterized in that the classification result of the character string to be classified includes:
The character string to be classified is a random character string, or the character string to be classified is a normal character string.
6. The method of claim 1, characterized in that the classification model includes:
A support vector machine (SVM) classification model, a decision tree classification model, a Bayesian classification model, or a K-nearest-neighbor classification model.
7. The method of any one of claims 1-6, characterized in that the classification features include:
The longest adjacent vowel distance, the character string information entropy, or the string length; wherein the longest adjacent vowel distance denotes the largest of the spacing distances between all adjacent vowel characters of any given character string.
8. A device for classifying a character string, characterized in that the device comprises:
An acquisition module, configured to obtain a character string to be classified;
A first extraction module, configured to extract multiple classification features from the character string to be classified;
A normalization module, configured to normalize each of the classification features separately to obtain multiple normalized classification features;
A classification module, configured to classify the character string to be classified according to the multiple normalized classification features, using a classification model obtained by offline training, to obtain a classification result of the character string to be classified.
9. The device of claim 8, characterized in that the device further comprises:
A collection module, configured to collect a sample set for the preset classification model and divide the sample set into a training set and a test set; wherein the sample set includes a preset number of character strings and the classification result of each of those character strings;
A second extraction module, configured to extract multiple classification features from each character string in the training set and normalize them, to obtain multiple normalized classification features of each character string in the training set;
A training module, configured to train the undetermined parameters of the preset classification model, using the multiple normalized classification features of each character string in the training set and the classification result of each character string in the training set, to obtain trained values of the undetermined parameters;
A third extraction module, configured to extract multiple classification features from each character string in the test set and normalize them, to obtain multiple normalized classification features of each character string in the test set;
A test module, configured to test the classification model whose undetermined parameters are set to the trained values, using the multiple normalized classification features of each character string in the test set and the classification result of each character string in the test set, to obtain a test result;
A comparison module, configured to compare the accuracy of the test result with a preset accuracy threshold;
A first determining module, configured to, if the accuracy of the test result is greater than the preset accuracy threshold, determine that the classification model whose undetermined parameters are set to the trained values is the classification model obtained by offline training, and then notify the acquisition module to perform the step of obtaining the character string to be classified.
10. The device of claim 9, characterized in that the device further comprises:
A second determining module, configured to, if the accuracy of the test result is less than or equal to the preset accuracy threshold, determine that the classification model whose undetermined parameters are set to the trained values cannot serve as the classification model obtained by offline training, and then notify the collection module to perform the step of collecting a sample set for the preset classification model.
11. The device of claim 8, characterized in that the classification result of the character string to be classified includes:
The character string to be classified is a random character string, or the character string to be classified is a normal character string.
12. The device of claim 8, characterized in that the classification model includes:
A support vector machine (SVM) classification model, a decision tree classification model, a Bayesian classification model, or a K-nearest-neighbor classification model.
13. The device of any one of claims 8-12, characterized in that the classification features include:
The longest adjacent vowel distance, the character string information entropy, or the string length; wherein the longest adjacent vowel distance denotes the largest of the spacing distances between all adjacent vowel characters of any given character string.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510162076.9A CN106156120B (en) | 2015-04-07 | 2015-04-07 | Method and device for classifying character strings |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106156120A (en) | 2016-11-23 |
CN106156120B (en) | 2020-02-28 |
Family
ID=57335674
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510162076.9A Active CN106156120B (en) | 2015-04-07 | 2015-04-07 | Method and device for classifying character strings |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106156120B (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106650449A (en) * | 2016-12-29 | 2017-05-10 | 哈尔滨安天科技股份有限公司 | Script heuristic detection method and system based on variable name confusion degree |
CN107807987A (en) * | 2017-10-31 | 2018-03-16 | 广东工业大学 | A kind of string sort method, system and a kind of string sort equipment |
WO2018107953A1 (en) * | 2016-12-12 | 2018-06-21 | 惠州Tcl移动通信有限公司 | Smart terminal, and automatic application sorting method thereof |
CN108920694A (en) * | 2018-07-13 | 2018-11-30 | 北京神州泰岳软件股份有限公司 | A kind of short text multi-tag classification method and device |
CN109343802A (en) * | 2018-09-04 | 2019-02-15 | 中国平安人寿保险股份有限公司 | Declaration form printing data generating method, device, computer equipment and storage medium |
CN109997104A (en) * | 2017-09-30 | 2019-07-09 | 华为技术有限公司 | A kind of notice display methods and terminal |
CN110875959A (en) * | 2018-08-13 | 2020-03-10 | 阿里巴巴集团控股有限公司 | Data identification method, junk mailbox identification method and file identification method |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102737186A (en) * | 2012-06-26 | 2012-10-17 | 腾讯科技(深圳)有限公司 | Malicious file identification method, device and storage medium |
CN103221960A (en) * | 2012-12-10 | 2013-07-24 | 华为技术有限公司 | Detection method and apparatus of malicious code |
CN103593062A (en) * | 2013-11-08 | 2014-02-19 | 北京奇虎科技有限公司 | Data detection method and device |
US20140093173A1 (en) * | 2012-10-01 | 2014-04-03 | Silverbrook Research Pty Ltd | Classifying a string formed from hand-written characters |
CN104462058A (en) * | 2014-10-24 | 2015-03-25 | 腾讯科技(深圳)有限公司 | Character string identification method and device |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2018107953A1 (en) * | 2016-12-12 | 2018-06-21 | 惠州Tcl移动通信有限公司 | Smart terminal, and automatic application sorting method thereof |
CN106650449A (en) * | 2016-12-29 | 2017-05-10 | 哈尔滨安天科技股份有限公司 | Script heuristic detection method and system based on variable name confusion degree |
CN106650449B (en) * | 2016-12-29 | 2020-05-22 | 哈尔滨安天科技集团股份有限公司 | Script heuristic detection method and system based on variable name confusion degree |
CN109997104B (en) * | 2017-09-30 | 2021-06-22 | 华为技术有限公司 | Notification display method and terminal |
US11500507B2 (en) | 2017-09-30 | 2022-11-15 | Huawei Technologies Co.. Ltd. | Notification display method and terminal |
CN109997104A (en) * | 2017-09-30 | 2019-07-09 | 华为技术有限公司 | A kind of notice display methods and terminal |
CN107807987A (en) * | 2017-10-31 | 2018-03-16 | 广东工业大学 | A kind of string sort method, system and a kind of string sort equipment |
US11463476B2 (en) | 2017-10-31 | 2022-10-04 | Guangdong University Of Technology | Character string classification method and system, and character string classification device |
CN107807987B (en) * | 2017-10-31 | 2021-07-02 | 广东工业大学 | Character string classification method and system and character string classification equipment |
CN108920694A (en) * | 2018-07-13 | 2018-11-30 | 北京神州泰岳软件股份有限公司 | A kind of short text multi-tag classification method and device |
CN108920694B (en) * | 2018-07-13 | 2020-08-28 | 鼎富智能科技有限公司 | Short text multi-label classification method and device |
CN110875959A (en) * | 2018-08-13 | 2020-03-10 | 阿里巴巴集团控股有限公司 | Data identification method, junk mailbox identification method and file identification method |
CN110875959B (en) * | 2018-08-13 | 2022-10-18 | 阿里巴巴集团控股有限公司 | Data identification method, junk mailbox identification method and file identification method |
CN109343802A (en) * | 2018-09-04 | 2019-02-15 | 中国平安人寿保险股份有限公司 | Declaration form printing data generating method, device, computer equipment and storage medium |
CN109343802B (en) * | 2018-09-04 | 2023-11-03 | 中国平安人寿保险股份有限公司 | Policy print data generation method, device, computer device and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN106156120B (en) | 2020-02-28 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |