CN107180022A - object classification method and device - Google Patents
- Publication number
- CN107180022A (application CN201610134575.1A)
- Authority
- CN
- China
- Prior art keywords
- classification
- string
- objects
- feature
- word combination
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Abstract
The invention relates to an object classification method and device. The method includes: first determining the object feature words of an object to be classified; then grouping the object feature words into object feature word combinations and converting each combination into a feature string; finally selecting, from the feature strings, the target feature strings that match preset feature strings, and determining the target class of the object to be classified from the probability values with which the target feature strings belong to preset classes. This improves both the classification efficiency and the classification accuracy for objects.
Description
Technical field
The present application relates to the field of computer technology, and in particular to an object classification method and device.
Background art
Object classification is an important part of object mining: according to predefined subject categories, each object in an object set is assigned a class. Taking text objects as an example, sorting documents with an automatic text classification system helps people find the information and knowledge they need; for humans, classification is one of the most basic forms of cognition. Traditional document classification research has produced rich results and reached considerable practical scale. With the rapid growth of object information, and especially the surge of online object information on the Internet, automatic object classification has become a key technology for processing and organizing large volumes of text data. Object classification is now widely used in many fields. For example, on an Internet platform, a server can classify the object information corresponding to a query received from a user through a client, determine the class of that object information, and then, according to the class, automatically answer the user's query and push related information.
In the prior art, objects to be classified are mainly classified with one of the following three algorithms. The first is the Bayes algorithm: the probability that the object to be classified belongs to each class is predicted according to the Bayes rule, and the class with the highest probability is selected as the final class of the object. However, the Bayes algorithm assumes that the attributes are mutually independent, an assumption that often does not hold in practice, so classifying objects this way can produce inaccurate results. The second is the support vector machine (SVM) algorithm: a decision surface is computed from the training samples in a region, from which the class of an object to be classified in that region is determined. The algorithm is comparatively complex and hard to understand, which makes the classification process complicated and difficult to optimize. The third is the k-nearest neighbor (kNN) algorithm: the K training samples nearest to the object to be classified are found, and the class to which the K training samples belong is taken as the class of the object. This algorithm, however, must compute the distance between the object to be classified and every training sample, which hurts classification efficiency.
Summary of the invention
The embodiments of the present application provide an object classification method and device that can improve the classification efficiency and classification accuracy of objects to be classified.
In a first aspect, an object classification method is provided. The method includes:
pre-processing an acquired object to be classified to obtain at least one object feature word of the object;
grouping the at least one object feature word according to a preset algorithm to obtain at least one object feature word combination, where each object feature word combination contains one or more object feature words;
converting the at least one object feature word combination into corresponding feature strings;
selecting target feature strings from the at least one feature string, where the target feature strings are stored in a preset storage unit, the preset storage unit storing multiple preset feature strings and, for each, the probability values with which it belongs to at least one class;
determining, from the probability values with which the target feature strings belong to the at least one class, the target class to which the object to be classified belongs.
In a second aspect, an object classification device is provided. The device includes a pre-processing unit, a grouping unit, a converting unit, a selecting unit, and a determining unit;
the pre-processing unit is configured to pre-process an acquired object to be classified to obtain at least one object feature word of the object;
the grouping unit is configured to group, according to a preset algorithm, the at least one object feature word obtained by the pre-processing unit, to obtain at least one object feature word combination, where each object feature word combination contains one or more object feature words;
the converting unit is configured to convert the at least one object feature word combination obtained by the grouping unit into corresponding feature strings;
the selecting unit is configured to select target feature strings from the at least one feature string obtained by the converting unit, where the target feature strings are stored in a preset storage unit, the preset storage unit storing multiple preset feature strings and, for each, the probability values with which it belongs to at least one class;
the determining unit is configured to determine, from the probability values with which the target feature strings selected by the selecting unit belong to the at least one class, the target class to which the object to be classified belongs.
With the object classification method and device provided by the application, the object feature words of an object to be classified are determined first; the object feature words are then grouped into object feature word combinations, which are further converted into feature strings; finally, the target feature strings that match preset feature strings are selected from the feature strings, and the target class of the object is determined by calculating the probability values with which the target feature strings belong to preset classes. The application thereby avoids the prior art's inaccurate classification caused by a mistaken independence assumption, its hard-to-optimize classification process caused by overly complex algorithms, and its low classification efficiency caused by cumbersome distance computations.
Brief description of the drawings
Fig. 1 is a flowchart of the method for establishing the preset storage unit provided by the application;
Fig. 2 is a flowchart of the object classification method provided by one embodiment of the application;
Fig. 3 is a schematic diagram of the object classification device provided by another embodiment of the application.
Detailed description of the embodiments
The embodiments of the invention are described below with reference to the accompanying drawings.
In the Internet field, user consultations, complaints, and suggestions must often be handled. When an Internet service has a sizeable user base, the number of similar problems to process can be very large. If all of them were handled manually, enormous manpower would be consumed, and, being so limited, similar problems could not be handled in time. Therefore, in the Internet field, a training set of object contents is usually established, which generally records object contents and the classes they belong to. After a user's question is obtained, it is matched one by one, using a matching algorithm, against the large number of sample contents in the training set to find the best-matching sample; once the best-matching sample is found, the class of that sample is selected, and the user's question is handled according to that class, saving manual work.
However, as the volume of data grows, the number of content entries in the training set grows accordingly. Once the number of sample entries reaches tens of thousands, matching the object to be classified against the content of every sample for similarity, one query at a time, obviously takes a long time. To reduce the time consumed by matching queries, the prior art may prune the training set, discarding a certain number of sample object contents so as to reduce the time needed for object classification. Its defect is that, during pruning of the training set, the sample closest to the object to be classified may be discarded, so that the final classification result is wrong and classification accuracy drops. Further, this may also waste part of the server's effective resources.
Therefore, an embodiment of the present application provides an object classification method. The embodiment can be applied in the Internet field, including but not limited to business platforms such as Alipay and Taobao, and can also be applied to Internet search platforms.
Before the steps of the object classification method provided by this embodiment of the application are performed, a preset storage unit can first be established. The application proposes establishing the preset storage unit through the steps shown in Fig. 1:
Step 110: collect multiple training samples in advance, the multiple training samples belonging to at least one class.
The training samples here may be text messages collected in advance by the server from forums, complaint platforms, clients, and so on, for example complaints, suggestions, and inquiries; they may also be any non-textual information having one or more attributes. The classes can be preset manually, e.g. "password forgotten" and "how to log in". Note that for the multiple training samples, the class each training sample belongs to is known.
Taking text messages as an example, the collected training samples and the class each belongs to can be as shown in Table 1.
Table 1
Table 1 collects four training samples belonging to two classes: the first and fourth training samples belong to the class "password forgotten", while the second and third training samples belong to the class "how to log in".
For brevity, Table 1 lists only a few examples; the training samples actually collected may contain tens of thousands of entries.
Each of the training samples collected in step 110 is processed through the following steps 120 to 150.
Step 120: pre-process the training sample to obtain at least one training sample feature word.
Taking a text-message training sample as an example, the pre-processing includes, without limitation, word segmentation, stop-word filtering, and synonym merging, so that the most important feature words are extracted from the training sample.
Word segmentation divides the text message into word groups; stop-word filtering removes the useless words from the text message; synonym merging merges one or two words with the same meaning, or replaces them with a canonical word from a thesaurus.
For example, for the training sample "My password is forgotten, how do I retrieve it?" in Table 1, the useless words to filter out include "my", "is", "how", and "do I"; the word groups remaining after filtering are "password", "forget", and "retrieve". Synonym merging can then be performed against a preset thesaurus, finally yielding the training sample's at least one feature word. In this example, the training sample feature words obtained are "password", "forget", and "retrieve".
Note that when the training sample is non-textual information, the training sample's feature words can be obtained in other ways. For example, when classifying questions a user may ask in the current scenario of the Alipay system, the user's user id can be obtained first; the user's login information (whether the user has logged in) is then obtained from the information system according to the user id, and the user's transfer information (whether a transfer failed) is obtained from the fund transfer system according to the user id; the obtained login information and transfer information serve as the training sample's feature words.
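As a rough illustration of the pre-processing in step 120, the sketch below tokenizes an English stand-in for the sample sentence, filters stop words, and merges synonyms. The stop-word list, synonym table, and whitespace tokenizer are all illustrative assumptions; the patent itself targets Chinese text, which would require a real word segmenter rather than whitespace splitting.

```python
# Hypothetical stand-ins for the pre-processing resources described in step 120.
STOP_WORDS = {"my", "is", "the", "how", "do", "i", "it"}
SYNONYMS = {"forgotten": "forget", "recover": "retrieve"}  # map to canonical forms

def extract_feature_words(text):
    """Segment the text, filter useless words, and merge synonyms (step 120)."""
    tokens = text.lower().replace(",", " ").replace("?", " ").split()
    kept = [t for t in tokens if t not in STOP_WORDS]
    return [SYNONYMS.get(t, t) for t in kept]

print(extract_feature_words("My password is forgotten, how do I retrieve it?"))
# ['password', 'forget', 'retrieve']
```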
Step 130: group the at least one training sample feature word according to a preset algorithm to obtain at least one training sample feature word combination, where each combination contains one or more training sample feature words.
Here the preset algorithm may be a combination algorithm. When the obtained training sample feature words are grouped according to the combination algorithm, the number of feature words in each resulting combination is at most n, where n is the number of training sample feature words. In this description, the number is taken to be at most 2 as an example. Continuing the previous example, grouping the three acquired feature words "password", "forget", and "retrieve" according to the combination algorithm yields six training sample feature word combinations: (password), (forget), (retrieve), (password, forget), (password, retrieve), and (forget, retrieve).
Of course, in practical applications the obtained training sample feature words can also be combined according to other algorithms, e.g. by directly forming pairwise combinations; the application does not limit this.
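Under the reading used in the examples (unordered combinations of at most two words), step 130 can be sketched with the standard library; `max_size` is an assumed name for the cutoff described above.

```python
from itertools import combinations

def feature_word_groups(words, max_size=2):
    """Group feature words into all combinations of 1..max_size words (step 130)."""
    groups = []
    for k in range(1, max_size + 1):
        groups.extend(combinations(words, k))
    return groups

groups = feature_word_groups(["password", "forget", "retrieve"])
print(groups)  # 3 single-word + 3 two-word combinations = 6 groups
```

With this reading, three feature words give 3 + 3 = 6 combinations and four feature words give 4 + 6 = 10, matching the counts in the examples below.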
Step 140: convert the at least one training sample feature word combination into corresponding training feature strings.
Optionally, before step 140 is performed, the feature words within each combination containing two or more training sample feature words can be sorted. In one example, they can be sorted by the pinyin initial of each feature word: for the combination (password, forget), the pinyin of "password" (密码) is "mima", whose initial is "m", and the pinyin of "forget" (忘记) is "wangji", whose initial is "w"; since "m" precedes "w" in the alphabet, the sorted combination is (password, forget). Of course, in practical applications the sorting can also be done in other ways, e.g. nouns before verbs; the application does not limit this.
Returning to step 140, converting the at least one training sample feature word combination into corresponding training feature strings is specifically:
for a first training sample feature word combination among the at least one combination, join all the training sample feature words in the combination with a split string to obtain the corresponding training feature string, where the split string is a predefined character or character combination.
Here the first training sample feature word combination can be any one of the at least one combination. In addition, the split string can be one of the following special characters or character combinations: "_", "#", "*", "<>", "(.*)", etc. In this description "_" is used as an example. Converting the six training sample feature word combinations of the previous example — (password), (forget), (retrieve), (password, forget), (password, retrieve), and (forget, retrieve) — into training feature strings yields "password", "forget", "retrieve", "password_forget", "password_retrieve", and "forget_retrieve" respectively.
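The sort-then-join conversion of step 140 can be sketched as below. Plain alphabetical order stands in for the pinyin-initial ordering described above (so the English pair sorts as "forget_password", whereas the Chinese originals sort to "password_forget"); "_" is the split string.

```python
def to_feature_string(group, sep="_"):
    """Sort the words in a combination and join them with the split string (step 140)."""
    return sep.join(sorted(group))

strings = [to_feature_string(g) for g in
           [("password",), ("forget",), ("password", "forget")]]
print(strings)  # ['password', 'forget', 'forget_password']
```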
Converting the training samples in Table 1 into training feature strings gives the result shown in Table 2.
Table 2
Step 150: count the probability value with which each training feature string belongs to each of the at least one class, and store each training feature string, together with the probability values with which it belongs to each class, in the preset storage unit.
In one example, the probability values can be counted according to Formula 1:
P(f1) = C1 / C2    (Formula 1)
where f1 is any training feature string; P(f1) is the probability value of f1; C1 is the number of times f1 occurs in the current class; and C2 is the number of times f1 occurs across all classes. Taking "password forgotten" in Table 2 as the current class and "password_forget" as f1, the value of C1 is 2 and the value of C2 is also 2.
After the probability value with which each training feature string in Table 2 belongs to each of the two classes is obtained according to Formula 1, each training feature string and its probability values for the two classes can be stored in the preset storage unit shown in Table 3; the preset storage unit here may also be called the preset classification model.
Table 3
Training feature string | Class | Occurrence count | Probability value |
password_forget | password forgotten | 2 | 1.0 |
forget_login | password forgotten | 1 | 0.33 |
forget_login | how to log in | 2 | 0.67 |
login_password | password forgotten | 1 | 1.0 |
login | how to log in | 2 | 0.67 |
login | password forgotten | 1 | 0.33 |
forget | password forgotten | 2 | 0.5 |
forget | how to log in | 2 | 0.5 |
Note that, for brevity, Table 3 lists only some of the training feature strings and their probability values for the two classes; in reality Table 3 contains all the training feature strings of Table 2 and the probability values with which each belongs to each of the two classes.
Table 3 also records the number of times each training feature string occurs in each class, from which the total number of occurrences of each training feature string can be counted. For example, in Table 3, "forget_login" occurs 1 time in the class "password forgotten" and 2 times in the class "how to log in", so its total occurrence count is 1 + 2 = 3. After the total occurrence count of each training string is obtained, the application can further filter Table 3 according to the totals. For example, a threshold (e.g. 1) can be set in advance, and for each training feature string it is judged whether its total occurrence count is no greater than the threshold; if so, the correspondence between that training feature string and its per-class probability values can be ignored. For example, in Table 3 the total occurrence count of "login_password" is 1, so the rows containing that training feature string can be deleted.
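Steps 110 through 150 can be condensed into the following sketch of building the preset storage unit. The sample data is a toy, hand-made encoding of Table 2 (each entry pairs a sample's feature strings with its known class), chosen so the resulting probabilities match Table 3; the `min_total` cutoff is the total-occurrence filter just described.

```python
from collections import Counter, defaultdict

def build_model(samples, min_total=2):
    """Count per-class occurrences of each training feature string, apply
    Formula 1 (P = count_in_class / total_count), and drop strings whose
    total occurrence count falls below the threshold."""
    per_class = defaultdict(Counter)          # feature string -> class -> count
    for feature_strings, label in samples:
        for fs in feature_strings:
            per_class[fs][label] += 1
    model = {}
    for fs, counts in per_class.items():
        total = sum(counts.values())
        if total < min_total:                 # filter rare strings, as in the text
            continue
        model[fs] = {label: c / total for label, c in counts.items()}
    return model

samples = [
    (["password_forget"], "password forgotten"),
    (["forget_login", "login"], "how to log in"),
    (["forget_login", "login"], "how to log in"),
    (["password_forget", "forget_login", "login", "login_password"],
     "password forgotten"),
]
model = build_model(samples)
# forget_login: 2/3 for "how to log in", 1/3 for "password forgotten";
# login_password occurs only once in total and is filtered out.
```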
Of course, the steps described here for establishing the preset storage unit may be replaced by any other feasible steps, as long as the preset storage unit is established correctly and safely. In addition, the content recorded in the preset storage unit is not limited to what is listed in Table 3; it may, for example, also include the training sample feature word combination corresponding to each training feature string. The application does not limit this.
After the preset storage unit is established, the steps of the object classification method provided by this embodiment of the application can be performed. Fig. 2 is the flowchart of the object classification method provided by one embodiment of the application; as can be seen from Fig. 2, the present embodiment can include:
Step 210: pre-process the acquired object to be classified to obtain at least one object feature word of the object.
The object to be classified here may be a text message received by the server from forums, complaint platforms, clients, and so on, for example a complaint, suggestion, or inquiry; it may also be any non-textual information with one or more attributes. The pre-processing of the acquired object to be classified is similar to the pre-processing of the training samples and is not repeated here; for details not covered, see the related description of step 120.
Taking the object to be classified "My login password is forgotten, how do I find it?" as an example, the at least one object feature word obtained can be "login", "password", "forget", and "find".
Step 220: group the at least one object feature word according to a preset algorithm to obtain at least one object feature word combination, where each combination contains one or more object feature words.
The preset algorithm here can be the combination algorithm mentioned above. The grouping of the object feature words in step 220 is similar to the grouping of the training sample feature words and is not repeated here; for details not covered, see the related description of step 130. Continuing the previous example, grouping the four acquired object feature words "login", "password", "forget", and "find" can yield ten object feature word combinations: (login), (password), (forget), (find), (login, password), (login, forget), ..., (forget, find).
Optionally, when multiple object feature word combinations are obtained, this embodiment of the application can further include the following step:
for any two object feature word combinations among the multiple combinations, judge whether all the object feature words in a first combination are contained in a second combination; if so, reject the first combination from the multiple combinations. Removing the first object feature word combination better matches the actual need, and avoids inaccurate object classification caused by the weight of a single object feature word being too high.
Here the first object feature word combination is any of the multiple object feature word combinations, and the second object feature word combination is any of the multiple combinations other than the first. For example, since all the object feature words in (login) and (password) are contained in (login, password), all the object feature words in (forget) are contained in (password, forget), and all the object feature words in (find) are contained in (password, find), the combinations (login), (password), (forget), and (find) can be rejected, leaving six object feature word combinations: (login, password), (login, forget), (login, find), ..., (forget, find).
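The optional rejection step can be sketched as follows, under the assumption (consistent with the example, where every single-word combination is absorbed by a two-word one) that "contained in" means a strict subset relation:

```python
def reject_subsets(groups):
    """Drop any combination whose words are all contained in another, larger
    combination, keeping only the maximal combinations."""
    sets = [set(g) for g in groups]
    kept = []
    for i, g in enumerate(groups):
        if any(i != j and sets[i] < sets[j] for j in range(len(groups))):
            continue  # strictly contained in another combination: reject
        kept.append(g)
    return kept

groups = [("login",), ("password",), ("forget",), ("find",),
          ("login", "password"), ("login", "forget"), ("login", "find"),
          ("password", "forget"), ("password", "find"), ("forget", "find")]
remaining = reject_subsets(groups)
print(remaining)  # only the six two-word combinations remain
```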
Step 230: convert the at least one object feature word combination into corresponding feature strings.
Optionally, before step 230 is performed, the object feature words within each combination containing two or more feature words can be sorted; after sorting and filtering, the available object feature word combinations are: (login, password), (forget, login), (login, find), (password, forget), (password, find), and (forget, find).
Step 230 is specifically: for a first object feature word combination among the at least one combination, join all the object feature words in the combination with a split string to obtain the corresponding feature string, where the split string is a predefined character or character combination.
Here the first object feature word combination can be any of the at least one combination. In addition, the split string can be one of the following special characters or character combinations: "_", "#", "*", "<>", "(.*)", etc. In this description "_" is used as an example. Converting the six object feature word combinations obtained after rejection, sorting, and filtering — (login, password), (forget, login), (login, find), (password, forget), (password, find), and (forget, find) — into feature strings yields "login_password", "forget_login", "login_find", "password_forget", "password_find", and "forget_find" respectively.
Step 240: select target feature strings from the at least one feature string, where the target feature strings are stored in a preset storage unit, the preset storage unit storing multiple preset feature strings and the probability values with which each belongs to at least one class.
The number of target feature strings here can be one or more; the preset storage unit can be as shown in Table 3, where the training feature strings of Table 3 are the preset feature strings referred to above. The selection of target feature strings in step 240 can be: compare the first of the at least one feature string, one by one, with each training feature string in Table 3; if they match, select that first feature string as a target feature string, otherwise ignore it; and so on, until every one of the at least one feature string has been compared. That is, the application directly compares feature strings (i.e. several words of the object to be classified) with the preset feature strings to look up target feature strings, avoiding the low classification efficiency caused in the prior art by having to compute the similarity between the object to be classified and every training sample in the training set.
Continuing the previous example, the selected target feature strings can be "forget_login" and "password_forget".
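Because the selection is a verbatim lookup rather than a per-sample similarity computation, step 240 reduces to membership tests against the preset storage unit. A minimal sketch, with a toy model standing in for (the filtered) Table 3; note that "login_password" is absent from the model because the frequency filter of step 150 removed it:

```python
def select_target_strings(feature_strings, model):
    """Keep only the feature strings that appear verbatim in the preset
    storage unit; unseen strings are skipped (step 240)."""
    return [fs for fs in feature_strings if fs in model]

preset = {"forget_login":    {"how to log in": 0.67, "password forgotten": 0.33},
          "password_forget": {"password forgotten": 1.0}}
candidates = ["login_password", "forget_login", "login_find",
              "password_forget", "password_find", "forget_find"]
targets = select_target_strings(candidates, preset)
print(targets)  # ['forget_login', 'password_forget']
```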
Note that when the preset storage unit also contains the training sample feature word combination corresponding to each training feature string, the above step of rejecting the first object feature word combination from the at least one combination can also be performed after step 240 and before step 250.
Step 250: determine, according to the probability values of the target feature strings belonging to each category in the at least one category, the target category to which the object to be classified belongs.
Step 250 may specifically be: determine, according to the probability values of the target feature strings belonging to each category in the at least one category, the probability value of the object to be classified belonging to each category; and when the probability value of the object to be classified belonging to a second category is the highest, determine the second category as the target category to which the object to be classified belongs. The second category here may be any category in the at least one category.
Following the earlier example, the probability values of the object to be classified "I have forgotten my login password; how do I retrieve it?" belonging to each of the two categories can be determined from the probability values of "forget_login" and "password_forget" belonging to "password forgotten" and of "forget_login" belonging to "how to log in". Specifically, the probability value of the object to be classified belonging to "password forgotten" may be 0.33 + 1.0 = 1.33, while its probability value of belonging to "how to log in" may be 0.67. Because 1.33 > 0.67, the target category to which the object to be classified belongs is "password forgotten".
Of course, in practical applications, the probability value of the object to be classified may also be determined in other ways, for example by computing a weighted average of the probability values of the target feature strings; the present application imposes no limitation on this.
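Step 250 can be sketched as a per-category summation followed by an arg-max. The table contents below are illustrative assumptions matching the worked example:

```python
preset_store = {
    "forget_login":    {"password forgotten": 0.33, "how to log in": 0.67},
    "password_forget": {"password forgotten": 1.0},
}

def classify(target_strings, store):
    """Sum, per category, the probability values of all target feature
    strings (step 250) and return the highest-scoring category."""
    scores = {}
    for s in target_strings:
        for category, p in store[s].items():
            scores[category] = scores.get(category, 0.0) + p
    return max(scores, key=scores.get), scores

category, scores = classify(["forget_login", "password_forget"], preset_store)
print(category, scores)
# "password forgotten": 0.33 + 1.0 = 1.33 beats "how to log in": 0.67
```

Replacing the sum with a weighted average, as the text suggests, only changes the accumulation inside the loop; the arg-max over categories stays the same.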
After the category of the object to be classified is determined, the user's query can be answered automatically according to the corresponding category, and related information can be pushed.
In the object classification method provided by the present application, the object feature words of the object to be classified are first determined; the object feature words are then grouped into object feature word combinations, which are further converted into feature strings; finally, the target feature strings that match the preset feature strings are selected from the feature strings, and the target category of the object to be classified is determined from the probability values of the target feature strings belonging to the preset categories. That is, the present application classifies the object to be classified according to an easily understood probability distribution, thereby solving the prior-art problem that classification algorithms are too complex to optimize.
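The whole method summarized above can be sketched end to end. Word segmentation, the grouping algorithm (all word pairs here), the separator "_", and the table contents are illustrative assumptions:

```python
from itertools import combinations

preset_store = {
    "forget_login":    {"password forgotten": 0.33, "how to log in": 0.67},
    "password_forget": {"password forgotten": 1.0},
}

SEP = "_"  # the predefined separator string

def feature_strings(feature_words, size=2):
    """Group the object feature words into combinations (here: all pairs,
    an assumed grouping algorithm) and join each with the separator."""
    return [SEP.join(c) for c in combinations(feature_words, size)]

def classify(obj_feature_words, store):
    strings = feature_strings(obj_feature_words)
    targets = [s for s in strings if s in store]      # step 240: lookup
    scores = {}
    for s in targets:                                  # step 250: sum
        for cat, p in store[s].items():
            scores[cat] = scores.get(cat, 0.0) + p
    return max(scores, key=scores.get)

# "I have forgotten my login password" after assumed preprocessing:
print(classify(["password", "forget", "login"], preset_store))
```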
Corresponding to the above object classification method, an embodiment of the present application further provides an object classification apparatus. As shown in Fig. 3, the apparatus includes: a preprocessing unit 301, a grouping unit 302, a conversion unit 303, a selection unit 304, and a determination unit 305.
The preprocessing unit 301 is configured to preprocess an acquired object to be classified, so as to obtain at least one object feature word of the object to be classified.
The grouping unit 302 is configured to group, according to a preset algorithm, the at least one object feature word obtained by the preprocessing unit 301, so as to obtain at least one object feature word combination, where each object feature word combination includes one or more object feature words.
The conversion unit 303 is configured to convert the at least one object feature word combination obtained by the grouping unit 302 into corresponding feature strings.
The conversion unit 303 may specifically be configured to: for a first object feature word combination in the at least one feature word combination, join all the object feature words in the first object feature word combination with a separator string to obtain the corresponding feature string, where the separator string is a predefined character or character combination.
The selection unit 304 is configured to select a target feature string from the at least one feature string obtained by the conversion unit 303, where the target feature string is stored in a preset storage unit, and the preset storage unit is used to store multiple preset feature strings together with the probability value of each preset feature string belonging to at least one category.
The determination unit 305 is configured to determine, according to the probability values of the target feature strings selected by the selection unit 304 belonging to the at least one category, the target category to which the object to be classified belongs.
The determination unit 305 may specifically be configured to: determine, according to the probability values of the target feature strings belonging to each category in the at least one category, the probability value of the object to be classified belonging to each category; and when the probability value of the object to be classified belonging to a second category is the highest, determine the second category as the target category to which the object to be classified belongs.
Optionally, the apparatus may further include: an establishing unit 305, configured to collect multiple training samples in advance, the multiple training samples belonging to the at least one category, and to perform the following process for each training sample in the multiple training samples:
preprocess the training sample to obtain at least one training sample feature word of the training sample;
group the at least one training sample feature word according to the preset algorithm to obtain at least one training sample feature word combination, where each training sample feature word combination includes one or more training sample feature words;
convert the at least one training sample feature word combination into corresponding training feature strings; and
count the probability values of each training feature string belonging to each category in the at least one category, and store the training feature string and its probability values of belonging to each category in the at least one category in the preset storage unit.
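The establishing process above can be sketched as counting category co-occurrences per training feature string and normalizing the counts into probability values. The grouping into pairs, the separator, and the sample data are illustrative assumptions:

```python
from collections import defaultdict
from itertools import combinations

SEP = "_"

def build_preset_store(training_samples):
    """training_samples: list of (feature_words, category) pairs.
    Counts, for each training feature string, how often it occurs in
    each category, then normalizes counts into probability values."""
    counts = defaultdict(lambda: defaultdict(int))
    for words, category in training_samples:
        for combo in combinations(words, 2):  # assumed grouping: pairs
            counts[SEP.join(combo)][category] += 1
    store = {}
    for s, per_cat in counts.items():
        total = sum(per_cat.values())
        store[s] = {c: n / total for c, n in per_cat.items()}
    return store

samples = [
    (["password", "forget"], "password forgotten"),
    (["forget", "login"],    "password forgotten"),
    (["forget", "login"],    "how to log in"),
    (["forget", "login"],    "how to log in"),
]
store = build_preset_store(samples)
print(store["forget_login"])  # roughly 1/3 vs 2/3 across the two categories
```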
Optionally, the apparatus may further include: a processing unit 306, configured to count the total number of times a training feature string occurs and judge whether the total number is less than a preset threshold; if so, the probability values of that training feature string belonging to each category in the at least one category are ignored.
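The thresholding done by the processing unit 306 can be sketched as filtering the store by occurrence counts. The store contents, the counts, the threshold value, and the string `rare_string` are illustrative assumptions:

```python
def prune_rare(store, total_counts, threshold=2):
    """Drop the probability values of any training feature string whose
    total number of occurrences is below the preset threshold."""
    return {s: probs for s, probs in store.items()
            if total_counts.get(s, 0) >= threshold}

store = {"forget_login": {"how to log in": 0.67, "password forgotten": 0.33},
         "rare_string":  {"how to log in": 1.0}}
totals = {"forget_login": 3, "rare_string": 1}
pruned = prune_rare(store, totals)
print(sorted(pruned))
```

Strings seen too rarely carry unreliable probability estimates, so dropping them trades a little recall for more stable classification.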
Optionally, the apparatus may further include: a judging unit 307 and a rejecting unit 308.
The judging unit 307 is configured to judge, for any two object feature word combinations in the multiple object feature word combinations, whether all the object feature words in a first object feature word combination are included in a second object feature word combination.
The rejecting unit 308 is configured to reject the first object feature word combination from the multiple object feature word combinations if the judging unit 307 judges that all the object feature words in the first object feature word combination are included in the second object feature word combination.
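The judging and rejecting units together implement a subsumption filter over the word combinations. A minimal sketch, using proper-subset comparison so that identical combinations are not all discarded (an assumption about the intended behavior; the example data is illustrative):

```python
def reject_subsumed(combos):
    """Reject any object feature word combination whose words are all
    contained in a strictly larger combination (units 307 and 308)."""
    sets = [set(c) for c in combos]
    kept = []
    for i, c in enumerate(combos):
        # keep c only if no other combination strictly contains it
        if not any(i != j and sets[i] < sets[j] for j in range(len(combos))):
            kept.append(c)
    return kept

combos = [["forget"], ["login"], ["forget", "login"]]
print(reject_subsumed(combos))  # only the maximal combination survives
```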
The functions of the functional modules of the apparatus in this embodiment of the present application can be implemented through the steps of the above method embodiment; therefore, the specific working process of the apparatus provided by the present application is not repeated here.
In the object classification apparatus provided by the present application, the preprocessing unit 301 preprocesses an acquired object to be classified to obtain at least one object feature word of the object to be classified; the grouping unit 302 groups the at least one object feature word according to a preset algorithm to obtain at least one object feature word combination; the conversion unit 303 converts the at least one object feature word combination into corresponding feature strings; the selection unit 304 selects target feature strings from the at least one feature string; and the determination unit 305 determines, according to the probability values of the target feature strings belonging to the at least one category, the target category to which the object to be classified belongs. In this way, both the classification efficiency and the classification accuracy of objects can be improved.
Those skilled in the art should further appreciate that the objects and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are executed in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled artisans may implement the described functions in different ways for each particular application, but such implementations should not be considered beyond the scope of the present application.
The steps of the methods or algorithms described in connection with the embodiments disclosed herein may be implemented in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, a register, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.
The above embodiments further describe the purposes, technical solutions, and beneficial effects of the present application in detail. It should be understood that the foregoing is merely an embodiment of the present application and is not intended to limit the protection scope of the present application. Any modification, equivalent substitution, improvement, or the like made within the spirit and principles of the present application shall be included within its protection scope.
Claims (12)
1. An object classification method, characterized in that the method includes:
preprocessing an acquired object to be classified, so as to obtain at least one object feature word of the object to be classified;
grouping the at least one object feature word according to a preset algorithm, so as to obtain at least one object feature word combination, where each object feature word combination includes one or more object feature words;
converting the at least one object feature word combination into corresponding feature strings;
selecting a target feature string from the at least one feature string, where the target feature string is stored in a preset storage unit, and the preset storage unit is used to store multiple preset feature strings and the probability value of each preset feature string belonging to at least one category; and
determining, according to the probability values of the target feature string belonging to the at least one category, the target category to which the object to be classified belongs.
2. The method according to claim 1, characterized in that the method further includes a step of establishing the preset storage unit, including:
collecting multiple training samples in advance, the multiple training samples belonging to the at least one category; and
performing the following process for each training sample in the multiple training samples:
preprocessing the training sample to obtain at least one training sample feature word of the training sample;
grouping the at least one training sample feature word according to the preset algorithm to obtain at least one training sample feature word combination, where each training sample feature word combination includes one or more training sample feature words;
converting the at least one training sample feature word combination into corresponding training feature strings; and
counting the probability values of the training feature string belonging to each category in the at least one category, and storing the training feature string and its probability values of belonging to each category in the at least one category in the preset storage unit.
3. The method according to claim 2, characterized in that the method further includes:
counting the total number of times the training feature string occurs, and judging whether the total number is less than a preset threshold; if so, ignoring the probability values of the training feature string belonging to each category in the at least one category.
4. The method according to any one of claims 1-3, characterized in that, when multiple object feature word combinations are obtained, the method further includes:
for any two object feature word combinations in the multiple object feature word combinations, judging whether all the object feature words in a first object feature word combination are included in a second object feature word combination, and if so, rejecting the first object feature word combination from the multiple object feature word combinations.
5. The method according to any one of claims 1-4, characterized in that converting the at least one object feature word combination into corresponding feature strings is specifically:
for a first object feature word combination in the at least one feature word combination, joining all the object feature words in the first object feature word combination with a separator string to obtain the corresponding feature string, where the separator string is a predefined character or character combination.
6. The method according to any one of claims 1-5, characterized in that determining the target category to which the object to be classified belongs according to the probability values of the target feature string belonging to each category in the at least one category is specifically:
determining, according to the probability values of the target feature string belonging to each category in the at least one category, the probability value of the object to be classified belonging to each category; and
when the probability value of the object to be classified belonging to a second category is the highest, determining the second category as the target category to which the object to be classified belongs.
7. An object classification apparatus, characterized in that the apparatus includes: a preprocessing unit, a grouping unit, a conversion unit, a selection unit, and a determination unit;
the preprocessing unit is configured to preprocess an acquired object to be classified, so as to obtain at least one object feature word of the object to be classified;
the grouping unit is configured to group, according to a preset algorithm, the at least one object feature word obtained by the preprocessing unit, so as to obtain at least one object feature word combination, where each object feature word combination includes one or more object feature words;
the conversion unit is configured to convert the at least one object feature word combination obtained by the grouping unit into corresponding feature strings;
the selection unit is configured to select a target feature string from the at least one feature string obtained by the conversion unit, where the target feature string is stored in a preset storage unit, and the preset storage unit is used to store multiple preset feature strings and the probability value of each preset feature string belonging to at least one category; and
the determination unit is configured to determine, according to the probability values of the target feature string selected by the selection unit belonging to the at least one category, the target category to which the object to be classified belongs.
8. The apparatus according to claim 7, characterized in that the apparatus further includes: an establishing unit, configured to collect multiple training samples in advance, the multiple training samples belonging to the at least one category, and to perform the following process for each training sample in the multiple training samples:
preprocessing the training sample to obtain at least one training sample feature word of the training sample;
grouping the at least one training sample feature word according to the preset algorithm to obtain at least one training sample feature word combination, where each training sample feature word combination includes one or more training sample feature words;
converting the at least one training sample feature word combination into corresponding training feature strings; and
counting the probability values of the training feature string belonging to each category in the at least one category, and storing the training feature string and its probability values of belonging to each category in the at least one category in the preset storage unit.
9. The apparatus according to claim 8, characterized in that the apparatus further includes: a processing unit, configured to count the total number of times the training feature string occurs, and judge whether the total number is less than a preset threshold; if so, the probability values of the training feature string belonging to each category in the at least one category are ignored.
10. The apparatus according to any one of claims 7-9, characterized in that the apparatus further includes: a judging unit and a rejecting unit;
the judging unit is configured to judge, for any two object feature word combinations in the multiple object feature word combinations, whether all the object feature words in a first object feature word combination are included in a second object feature word combination; and
the rejecting unit is configured to reject the first object feature word combination from the multiple object feature word combinations if the judging unit judges that all the object feature words in the first object feature word combination are included in the second object feature word combination.
11. The apparatus according to any one of claims 7-10, characterized in that the conversion unit is specifically configured to:
for a first object feature word combination in the at least one feature word combination, join all the object feature words in the first object feature word combination with a separator string to obtain the corresponding feature string, where the separator string is a predefined character or character combination.
12. The apparatus according to any one of claims 7-11, characterized in that the determination unit is specifically configured to:
determine, according to the probability values of the target feature string belonging to each category in the at least one category, the probability value of the object to be classified belonging to each category; and
when the probability value of the object to be classified belonging to a second category is the highest, determine the second category as the target category to which the object to be classified belongs.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610134575.1A CN107180022A (en) | 2016-03-09 | 2016-03-09 | object classification method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610134575.1A CN107180022A (en) | 2016-03-09 | 2016-03-09 | object classification method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107180022A true CN107180022A (en) | 2017-09-19 |
Family
ID=59829534
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610134575.1A Pending CN107180022A (en) | 2016-03-09 | 2016-03-09 | object classification method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107180022A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101477544A (en) * | 2009-01-12 | 2009-07-08 | 腾讯科技(深圳)有限公司 | Rubbish text recognition method and system |
US20100070339A1 (en) * | 2008-09-15 | 2010-03-18 | Google Inc. | Associating an Entity with a Category |
CN102426585A (en) * | 2011-08-09 | 2012-04-25 | 中国科学技术信息研究所 | Automatic webpage classification method based on Bayesian network |
US20120278336A1 (en) * | 2011-04-29 | 2012-11-01 | Malik Hassan H | Representing information from documents |
CN103336766A (en) * | 2013-07-04 | 2013-10-02 | 微梦创科网络科技(中国)有限公司 | Short text garbage identification and modeling method and device |
CN104392006A (en) * | 2014-12-17 | 2015-03-04 | 中国农业银行股份有限公司 | Event query processing method and device |
- 2016-03-09 CN CN201610134575.1A patent/CN107180022A/en active Pending
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110321557A (en) * | 2019-06-14 | 2019-10-11 | 广州多益网络股份有限公司 | A kind of file classification method, device, electronic equipment and storage medium |
CN111488950A (en) * | 2020-05-14 | 2020-08-04 | 支付宝(杭州)信息技术有限公司 | Classification model information output method and device |
CN111488950B (en) * | 2020-05-14 | 2021-10-15 | 支付宝(杭州)信息技术有限公司 | Classification model information output method and device |
WO2021228152A1 (en) * | 2020-05-14 | 2021-11-18 | 支付宝(杭州)信息技术有限公司 | Classification model information output |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20170919 |