CN107180022A - object classification method and device - Google Patents

Object classification method and device

Info

Publication number
CN107180022A
Authority
CN
China
Prior art keywords
classification
string
objects
feature
word combination
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610134575.1A
Other languages
Chinese (zh)
Inventor
焦盼盼
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201610134575.1A
Publication of CN107180022A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/205 - Parsing
    • G06F40/216 - Parsing using statistical methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to an object classification method and device. The method includes: first determining the object feature words of an object to be classified; then grouping the object feature words into object feature word combinations, and further converting the combinations into feature strings; and finally selecting, from the feature strings, the target feature strings that match preset feature strings, and calculating the probability that each target feature string belongs to a preset category so as to determine the target category of the object to be classified. The classification efficiency and classification accuracy of objects can thereby be improved.

Description

Object classification method and device
Technical field
The present application relates to the field of computer technology, and in particular to an object classification method and device.
Background technology
Object classification is an important part of object mining. It refers to assigning each object in an object set to one of a number of predefined subject categories. Taking text objects as an example, sorting documents with an automatic text classification system can help people find the information and knowledge they need; for people, classification is one of the most basic cognitive ways of organizing information. Traditional document classification research has produced rich results and reached a considerable practical scale. However, with the rapid growth of object information, and in particular the surge of online object information on the Internet, automatic object classification has become a key technology for processing and organizing large volumes of text data. Object classification is now widely used in many fields. For example, on an Internet platform, a server can classify the object information corresponding to a query received from a user through a client, determine the category of that object information, and then automatically answer the user's query and push related information according to the determined category.
In the prior art, objects to be classified are mainly classified using one of the following three algorithms. The first is the Bayes algorithm, whose classification process is: predict, according to the Bayes algorithm, the likelihood that the object to be classified belongs to each category, and select the category with the highest likelihood as the final category of the object. However, the Bayes algorithm assumes that the attributes are mutually independent, and this assumption often does not hold in practice, so classifying objects with this algorithm can lead to inaccurate classification results. The second is the support vector machine (SVM) algorithm, whose classification process is: compute a decision surface from the training samples in a region, and use it to determine the category of objects to be classified in that region. However, this algorithm is comparatively complex and hard to understand, so the classification process is likewise complex and difficult to optimize. The third is the K-nearest-neighbor (kNN) algorithm, whose classification process is: find the K training samples closest to the object to be classified, and take the category to which those K training samples belong as the category of the object. However, this algorithm needs to compute the distance between the object to be classified and every training sample, which affects classification efficiency.
Summary of the invention
The embodiments of the present application provide an object classification method and device, which can improve the classification efficiency and classification accuracy of objects to be classified.
In a first aspect, an object classification method is provided. The method includes:
preprocessing an acquired object to be classified to obtain at least one object feature word of the object to be classified;
grouping the at least one object feature word according to a preset algorithm to obtain at least one object feature word combination, where each object feature word combination contains one or more object feature words;
converting the at least one object feature word combination into corresponding feature strings;
selecting target feature strings from the at least one feature string, where each target feature string is stored in a preset storage unit, and the preset storage unit stores a plurality of preset feature strings and, for each preset feature string, the probability that it belongs to each of at least one category;
determining, according to the probabilities of the at least one category to which the target feature strings belong, the target category of the object to be classified.
In a second aspect, an object classification device is provided. The device includes: a preprocessing unit, a grouping unit, a converting unit, a selecting unit and a determining unit.
The preprocessing unit is configured to preprocess an acquired object to be classified to obtain at least one object feature word of the object to be classified.
The grouping unit is configured to group, according to a preset algorithm, the at least one object feature word obtained by the preprocessing unit to obtain at least one object feature word combination, where each object feature word combination contains one or more object feature words.
The converting unit is configured to convert the at least one object feature word combination obtained by the grouping unit into corresponding feature strings.
The selecting unit is configured to select target feature strings from the at least one feature string obtained by the converting unit, where each target feature string is stored in a preset storage unit, and the preset storage unit stores a plurality of preset feature strings and, for each preset feature string, the probability that it belongs to each of at least one category.
The determining unit is configured to determine, according to the probabilities of the at least one category to which the target feature strings selected by the selecting unit belong, the target category of the object to be classified.
With the object classification method and device provided by the present application, the object feature words of an object to be classified are first determined; the object feature words are then grouped into object feature word combinations, and the combinations are further converted into feature strings; finally, target feature strings that match preset feature strings are selected from the feature strings, and the probability that each target feature string belongs to a preset category is calculated to determine the target category of the object to be classified. The present application thereby overcomes the prior-art problems of inaccurate classification caused by incorrect independence assumptions, of classification processes that are difficult to optimize because the algorithm is too complex, and of low classification efficiency caused by cumbersome calculation.
Brief description of the drawings
Fig. 1 is a flow chart of a method for establishing a preset storage unit provided by the present application;
Fig. 2 is a flow chart of an object classification method provided by an embodiment of the present application;
Fig. 3 is a schematic diagram of an object classification device provided by another embodiment of the present application.
Detailed description of the embodiments
Embodiments of the present invention are described below with reference to the accompanying drawings.
In the Internet field, user consultations, complaints and suggestions often need to be handled. When an Internet service has a sizable user base, the number of similar problems that need to be handled can be very large. If all of them were handled manually, great manpower would be consumed, and, limited by this, similar problems could not be handled in time. Therefore, in the Internet field, a training set of object content is usually established, in which each entry generally includes the object content and the category it belongs to. After a user's question is received, it is matched one by one, using a matching algorithm, against the large number of sample contents in the training set to query for the best-matching sample; once the best-matching sample is found, the category of that sample is selected, and the user's question is handled according to that category, thereby saving manual labor.
However, as the volume of data grows, the number of content entries in the training set grows accordingly. Once the number of sample entries reaches tens of thousands, matching the object to be classified against the object content of every sample one by one obviously takes a long time. To reduce the time consumed by matching queries, in the prior art the training set may be pruned by removing a certain number of sample object contents, thereby reducing the time needed for object classification. The drawback is that, during the pruning of the training set, the sample closest to the object to be classified may be removed, so that the final classification result is wrong and the accuracy of object classification is reduced. Furthermore, this problem may cause effective server resources to be occupied unnecessarily.
Therefore, the embodiments of the present application provide an object classification method. The embodiments can be applied in the Internet field, including but not limited to business platforms such as Alipay and Taobao, and can also be applied to Internet search platforms.
Before the steps of the object classification method provided by the embodiments of the present application are performed, a preset storage unit may first be established. The present application proposes the following steps, shown in Fig. 1, for establishing the preset storage unit:
Step 110: collect a plurality of training samples in advance, the plurality of training samples belonging to at least one category.
The training samples here may be text messages collected in advance by the server from forums, complaint platforms, clients and the like, for example complaints, suggestions and inquiries; they may also be any non-textual information having one or more attributes. The above categories may be set manually in advance, e.g. "password forgotten" and "how to log in". It should be noted that, for the above plurality of training samples, the category to which each training sample belongs is known.
Taking text-message training samples as an example, the plurality of training samples collected in advance and the category of each training sample may be as shown in Table 1.
Table 1
Table 1 collects four training samples belonging to two categories: the first and fourth training samples belong to the category "password forgotten", while the second and third training samples belong to the category "how to log in".
It is noted here that, limited by space, Table 1 lists only a few examples; the training samples actually collected may comprise tens of thousands of entries.
The processing of the following steps 120 to 150 is performed on each of the plurality of training samples collected in step 110.
Step 120: preprocess the training sample to obtain at least one training sample feature word of the training sample.
Taking a text-message training sample as an example, the process of preprocessing the training sample includes, but is not limited to: performing word segmentation, word filtering and synonym merging on the training sample, so as to extract the several most important feature words from it.
Here, word segmentation refers to dividing the text message into phrases; word filtering refers to filtering out the useless words in the text message; synonym merging refers to merging one or more words with the same meaning in the text message, or replacing them with words from a thesaurus.
For example, for the training sample "my password is forgotten, how do I retrieve it" in Table 1, the useless words that need to be filtered out include "my", "is", "how" and "it", and the phrases remaining after filtering are "password", "forget" and "retrieve". Synonym merging can then be performed according to a preset thesaurus, after which the at least one training sample feature word of the training sample is obtained. In this example, the training sample feature words obtained are: "password", "forget" and "retrieve".
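The preprocessing pipeline described above (segmentation, filtering of useless words, synonym merging) can be sketched as follows. The stop-word list and synonym map here are illustrative assumptions rather than the patent's actual dictionaries, and whitespace splitting stands in for a real word-segmentation step, which for Chinese text would require a proper segmentation library:

```python
# Illustrative stop-word list and synonym map (assumed, not from the patent).
STOP_WORDS = {"my", "is", "it", "how", "do", "i", "a", "the"}
SYNONYMS = {"recover": "retrieve", "lost": "forget"}  # hypothetical thesaurus

def preprocess(text):
    """Return the feature words of a training sample or object to be classified."""
    tokens = text.lower().replace(",", " ").replace("?", " ").split()  # naive segmentation
    kept = [t for t in tokens if t not in STOP_WORDS]                  # filter useless words
    return [SYNONYMS.get(t, t) for t in kept]                          # merge synonyms

print(preprocess("My password is lost, how do I recover it?"))
# → ['password', 'forget', 'retrieve']
```

The output matches the feature words of the worked example above: "password", "forget", "retrieve".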
It should be noted that, when the training sample is non-textual information, the at least one training sample feature word of the training sample may be obtained in another way. For example, when classifying a question that a user may ask in the current scenario of the Alipay system, the user id of the user may first be obtained; the login information of the user (whether the user has logged in) may then be obtained from an information system according to the user id, and the transfer information of the user (whether a transfer has failed) may be obtained from a fund transfer system according to the user id; the obtained login information and transfer information then serve as the training sample feature words of the training sample.
Step 130: group the at least one training sample feature word according to a preset algorithm to obtain at least one training sample feature word combination, where each training sample feature word combination contains one or more training sample feature words.
Here, the preset algorithm may be a permutation-and-combination algorithm. When the at least one training sample feature word is grouped according to the permutation-and-combination algorithm, the number of training sample feature words in each resulting combination is at most n, where n is the number of training sample feature words. In this specification, combinations of at most two words are used as an example. As in the previous example, when the three training sample feature words "password", "forget" and "retrieve" are grouped according to the permutation-and-combination algorithm, nine training sample feature word combinations can be obtained, such as (password), (forget), (retrieve), (password, forget), (password, retrieve), ..., (forget, retrieve).
Of course, in practical applications, the obtained at least one training sample feature word may also be combined according to other algorithms, e.g. directly combined pairwise; the present application does not limit this.
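The grouping step can be sketched with unordered combinations of up to two feature words. This is a simplification: the ordered permutations described above yield nine groups for three words, but since the subsequent sorting step collapses groups that differ only in word order, generating unordered combinations directly gives the same six distinct groups:

```python
from itertools import combinations

def group_feature_words(words, max_size=2):
    """All unordered feature-word combinations of 1..max_size words."""
    combos = []
    for size in range(1, max_size + 1):
        combos.extend(combinations(words, size))
    return combos

combos = group_feature_words(["password", "forget", "retrieve"])
print(len(combos))  # → 6 (3 single-word + 3 two-word combinations)
```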
Step 140: convert the at least one training sample feature word combination into corresponding training feature strings.
Optionally, before step 140 is performed, the training sample feature words in each combination containing two or more training sample feature words may be sorted. In one example, they may be sorted by the pinyin initial of the feature word: for the combination (password, forget), the pinyin of "password" (密码) is "mima", with initial "m", while the pinyin of "forget" (忘记) is "wangji", with initial "w"; since "m" comes before "w" in the alphabet, the sorted combination is (password, forget). After sorting, combinations that differ only in word order collapse into one, so the nine combinations of the previous example yield six distinct combinations. Of course, in practical applications the sorting may also be done in other ways, e.g. nouns first and verbs last; the present application does not limit this.
Returning to step 140, converting the at least one training sample feature word combination into corresponding training feature strings is specifically:
for a first training sample feature word combination among the at least one training sample feature word combination, joining all the training sample feature words in the first combination with a separator string to obtain the corresponding training feature string, where the separator string is a predefined character or character combination.
Here, the first training sample feature word combination may be any one of the at least one training sample feature word combination. The separator string may be one of the following special characters or character combinations: "_", "#", "*", "<>" or "(.*)". In this specification, the separator string "_" is used as an example. For instance, after the six training sample feature word combinations of the previous example, namely (password), (forget), (retrieve), (password, forget), (password, retrieve) and (forget, retrieve), are converted into their corresponding training feature strings, the results are: "password", "forget", "retrieve", "password_forget", "password_retrieve" and "forget_retrieve".
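The sort-then-join conversion can be sketched as below. Plain alphabetical order on the English words is used as the sort key here as an assumption; the document's example sorts by pinyin initial, which for 密码/忘记 gives "password" before "forget". Any deterministic order works, since its only job is to make ordered duplicates collapse to the same string:

```python
SEPARATOR = "_"  # the predefined separator string; "#", "*", etc. would also work

def to_feature_string(combo):
    """Join a feature-word combination into one feature string,
    sorting first so that word order does not matter."""
    return SEPARATOR.join(sorted(combo))

print(to_feature_string(("password", "forget")))  # → 'forget_password'
print(to_feature_string(("forget", "password")))  # → 'forget_password' (same string)
```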
After the training samples in Table 1 are converted into the corresponding training feature strings, the result may be as shown in Table 2.
Table 2
Step 150: count the probability that each training feature string belongs to each of the at least one category, and store the training feature string, together with the probability that it belongs to each category, in the preset storage unit.
In one example, the above probability may be computed according to formula 1:

P(f1) = C1 / C2    (formula 1)

where f1 is any training feature string; P(f1) is the probability of f1; C1 is the number of times f1 occurs in the current category; and C2 is the number of times f1 occurs across all categories. Taking "password forgotten" in Table 2 as the current category and f1 = "password_forget" as an example, the value of C1 is 2, and the value of C2 is also 2.
After the probability of each training feature string in Table 2 belonging to each of the two categories is obtained according to formula 1, each training feature string and its probabilities of belonging to each of the two categories can be stored in the preset storage unit shown in Table 3. The preset storage unit here may also be called a preset classification model.
Table 3
Training feature string | Category           | Occurrences | Probability
password_forget         | password forgotten | 2           | 1.0
forget_login            | password forgotten | 1           | 0.33
forget_login            | how to log in      | 2           | 0.67
login_password          | password forgotten | 1           | 1.0
login                   | how to log in      | 2           | 0.67
login                   | password forgotten | 1           | 0.33
forget                  | password forgotten | 2           | 0.5
forget                  | how to log in      | 2           | 0.5
It is noted here that, limited by space, Table 3 lists only some of the training feature strings and their probabilities of belonging to each of the two categories; in practice, Table 3 contains all the training feature strings of Table 2 together with their per-category probabilities.
Table 3 also records the number of times each training feature string occurs in each category, so that the total number of occurrences of each training feature string can be counted. For example, in Table 3, "forget_login" occurs 1 time in the category "password forgotten" and 2 times in the category "how to log in", so the total number of occurrences of "forget_login" is 1 + 2 = 3. After the total number of occurrences of each training feature string is counted, the present application may further filter Table 3 according to these totals. For example, a threshold (e.g. 1) may be set in advance, and it is judged whether the total number of occurrences of each training feature string is no greater than this preset threshold; if so, the correspondence between that training feature string and its per-category probabilities can be ignored. For example, in Table 3, the total number of occurrences of "login_password" is 1, so the row for that training feature string can be deleted.
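Steps 150 and the threshold filtering just described can be sketched together: count per-category occurrences, drop rare strings, and compute formula 1's C1/C2 ratio. The sample data is an illustrative fragment chosen to mirror a few rows of Table 3, not the full training set:

```python
from collections import defaultdict

def build_model(samples, min_total=2):
    """samples: list of (feature_strings, category) pairs.
    Returns {feature_string: {category: probability}} with strings whose
    total occurrences fall below min_total filtered out."""
    counts = defaultdict(lambda: defaultdict(int))
    for strings, category in samples:
        for s in strings:
            counts[s][category] += 1
    model = {}
    for s, per_cat in counts.items():
        total = sum(per_cat.values())        # C2: occurrences over all categories
        if total < min_total:                # drop strings seen too rarely
            continue
        model[s] = {c: n / total for c, n in per_cat.items()}  # C1 / C2
    return model

samples = [
    (["password_forget", "forget_retrieve"], "password forgotten"),
    (["forget_login", "login"], "how to log in"),
    (["forget_login", "login"], "how to log in"),
    (["password_forget", "forget_login"], "password forgotten"),
]
model = build_model(samples)
print(round(model["forget_login"]["how to log in"], 2))  # → 0.67, as in Table 3
```

Note that "forget_retrieve" occurs only once overall, so the threshold filter removes it, just as "login_password" is removed in the worked example above.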
Of course, the steps of establishing the preset storage unit described here may also be carried out in other feasible ways, as long as the preset storage unit is established correctly and safely. In addition, the content recorded in the preset storage unit is not limited to that listed in Table 3; for example, it may also include, for each training feature string, the corresponding training sample feature word combination; the present application does not limit this.
After the above preset storage unit is established, the steps of the object classification method provided by the embodiments of the present application can be performed. Fig. 2 is a flow chart of the object classification method provided by an embodiment of the present application. As can be seen from Fig. 2, the present embodiment may include the following steps:
Step 210: preprocess the acquired object to be classified to obtain at least one object feature word of the object to be classified.
The object to be classified here may be a text message received by the server from a forum, complaint platform, client or the like, for example a complaint, suggestion or inquiry; it may also be any non-textual information having one or more attributes. The process of preprocessing the acquired object to be classified is similar to that of preprocessing a training sample and is not repeated here; for details not covered, refer to the related description of step 120.
Taking the object to be classified "my login password is forgotten, how do I find it" as an example, the at least one object feature word obtained may be "login", "password", "forget" and "find".
Step 220: group the at least one object feature word according to a preset algorithm to obtain at least one object feature word combination, where each object feature word combination contains one or more object feature words.
The preset algorithm here may be the permutation-and-combination algorithm mentioned above. The process of grouping the at least one object feature word in step 220 is similar to that of grouping the at least one training sample feature word and is not repeated here; for details not covered, refer to the related description of step 130. As in the previous example, when the four obtained object feature words "login", "password", "forget" and "find" are grouped, ten object feature word combinations can be obtained, such as (login), (password), (forget), (find), (login, password), (login, forget), ..., (forget, find).
Optionally, when a plurality of object feature word combinations is obtained, the embodiment of the present application may further include the following step:
for any two object feature word combinations among the plurality of object feature word combinations, judging whether all the object feature words in a first object feature word combination are contained in a second object feature word combination, and, if so, rejecting the first object feature word combination from the plurality of object feature word combinations. Removing the first object feature word combination here better matches actual needs, and avoids inaccurate object classification caused by giving a single object feature word too much weight.
Here, the first object feature word combination is any one of the plurality of object feature word combinations, and the second object feature word combination is any one of the plurality other than the first. For example, since all the object feature words in (login) and (password) are contained in (login, password), all the object feature words in (forget) are contained in (password, forget), and all the object feature words in (find) are contained in (password, find), the combinations (login), (password), (forget) and (find) can be rejected, leaving six object feature word combinations such as (login, password), (login, forget), (login, find), ..., (forget, find).
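The subset-rejection step can be sketched as a strict-subset test over word sets. On the document's example of four object feature words, the ten combinations reduce to the six two-word combinations:

```python
from itertools import combinations

def reject_subsets(combos):
    """Drop every combination whose word set is strictly contained
    in the word set of another combination."""
    return [c for c in combos
            if not any(set(c) < set(other) for other in combos)]

words = ["login", "password", "forget", "find"]
combos = [c for size in (1, 2) for c in combinations(words, size)]
kept = reject_subsets(combos)
print(len(combos), len(kept))  # → 10 6
```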
Step 230: convert the at least one object feature word combination into corresponding feature strings.
Optionally, before step 230 is performed, the object feature words in each combination containing two or more object feature words may be sorted; the object feature word combinations obtained after sorting and filtering are: (login, password), (forget, login), (login, find), (password, forget), (password, find) and (forget, find).
Step 230 may specifically be: for a first object feature word combination among the at least one object feature word combination, joining all the object feature words in the first combination with a separator string to obtain the corresponding feature string, where the separator string is a predefined character or character combination.
Here, the first object feature word combination may be any one of the at least one object feature word combination. The separator string may be one of the following special characters or character combinations: "_", "#", "*", "<>" or "(.*)". In this specification, the separator string "_" is used as an example. For instance, after the foregoing six object feature word combinations obtained through rejection, sorting and filtering, namely (login, password), (forget, login), (login, find), (password, forget), (password, find) and (forget, find), are converted into their corresponding feature strings, the results are: "login_password", "forget_login", "login_find", "password_forget", "password_find" and "forget_find".
Step 240, target signature string is chosen from least one feature string, wherein, the storage of target signature string In default memory cell, default memory cell is used to store multiple default feature strings and each Default feature string belongs to the probable value of at least one classification.
The number of target signature string herein can be one or more;Default memory cell can be such as table 3 It is shown, wherein, the training characteristics string in table 3 is above-mentioned default feature string.In step 240 to The process of target signature string is chosen in a few feature string to be:By first at least one feature string Individual feature string is compared one by one with each training characteristics string in table 3, if compare it is consistent, by this first Individual feature string is chosen for target signature string;Otherwise, first feature string is ignored;By that analogy, until Each feature string at least one feature string is compared and completed.Namely the application is directly by by feature String (several words in object i.e. to be sorted) is compared with default feature string, to search target signature String, so that avoiding needs to calculate object to be sorted and training sample in training sample set in the prior art Similarity caused by classification effectiveness it is low the problem of.
Following the previous example, the selected target feature strings may be "forget_login" and "password_forget".
It should be noted that when the preset storage unit also stores the training sample feature string combination corresponding to each training feature string, the step of rejecting the first object feature string combination from the at least one object feature string combination described above may instead be performed after step 240 and before step 250.
Step 250: determine, according to the probability value of the target feature string belonging to each class in the at least one class, the target class to which the object to be classified belongs.

Specifically, step 250 may be performed as follows:

determining, according to the probability value of the target feature string belonging to each class in the at least one class, the probability value of the object to be classified belonging to each class;

when the probability value of the object to be classified belonging to a second class is the highest, determining the second class as the target class to which the object to be classified belongs.

The second class here may be any class in the at least one class.
Following the previous example, the probability values of "forget_login" and "password_forget" belonging to "password forgotten", together with their probability values belonging to "how to log in", can be used to determine the probability values of the object to be classified ("I have forgotten my login password, how do I recover it") belonging to each of the two classes. Specifically, the probability value of the object to be classified belonging to "password forgotten" may be 0.33 + 1.0 = 1.33, while its probability value of belonging to "how to log in" may be 0.67. Because 1.33 > 0.67, the target class to which the object to be classified belongs is "password forgotten".
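The per-class scoring of this worked example can be sketched as a simple sum-then-argmax; the probability values below are those of the example, and the data layout is an assumption of this sketch:

```python
# Score each class by summing, over the target feature strings, the
# probability value of that string belonging to the class; pick the maximum.
preset = {
    "forget_login":    {"password forgotten": 0.33, "how to log in": 0.67},
    "password_forget": {"password forgotten": 1.0},
}

def classify(target_strings, preset):
    scores = {}
    for s in target_strings:
        for cls, p in preset[s].items():
            scores[cls] = scores.get(cls, 0.0) + p
    return max(scores, key=scores.get), scores

target, scores = classify(["forget_login", "password_forget"], preset)
print(scores)   # → {'password forgotten': 1.33, 'how to log in': 0.67}
print(target)   # → password forgotten
```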
Of course, in practical applications the probability value of the object to be classified may also be determined by other methods — for example, by computing a weighted average of the probability values of the target feature strings. This application places no limitation on this.
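A weighted-average variant could be sketched as follows; the weight values are purely illustrative assumptions, since the description leaves the weighting scheme open:

```python
# Alternative aggregation: weighted average of the probability values of the
# target feature strings (weights are assumed for illustration only).
def classify_weighted(target_strings, preset, weights):
    scores = {}
    total = sum(weights[s] for s in target_strings)
    for s in target_strings:
        for cls, p in preset[s].items():
            scores[cls] = scores.get(cls, 0.0) + weights[s] * p / total
    return max(scores, key=scores.get)

preset = {
    "forget_login":    {"password forgotten": 0.33, "how to log in": 0.67},
    "password_forget": {"password forgotten": 1.0},
}
weights = {"forget_login": 1.0, "password_forget": 2.0}
print(classify_weighted(["forget_login", "password_forget"], preset, weights))
# → password forgotten
```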
After the class of the object to be classified has been determined, the user's query can be answered automatically according to the corresponding class, and related information can be pushed.
In the object classification method provided by this application, the object feature words of the object to be classified are first determined; the object feature words are then grouped into object feature word combinations, which are further converted into feature strings; finally, the target feature strings that match the preset feature strings are selected from the feature strings, and the target class of the object to be classified is determined by computing the probability values of the target feature strings belonging to the preset classes. That is, this application classifies the object to be classified according to an easily understood probability distribution, thereby solving the prior-art problem that classification algorithms are too complex to optimize.
Corresponding to the object classification method above, an embodiment of this application also provides an object classification device. As shown in Fig. 3, the device includes: a pre-processing unit 301, a grouping unit 302, a conversion unit 303, a selection unit 304, and a determination unit 305.

The pre-processing unit 301 is configured to pre-process an acquired object to be classified to obtain at least one object feature word of the object to be classified.

The grouping unit 302 is configured to group, according to a preset algorithm, the at least one object feature word obtained by the pre-processing unit 301 to obtain at least one object feature word combination, where each object feature word combination contains more than one object feature word.
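The description leaves the "preset algorithm" open; one plausible choice, assumed here purely for illustration, is to form all pairwise combinations of the feature words, which reproduces the six pairs of the worked example:

```python
# One possible "preset algorithm": group the object feature words into all
# pairwise combinations. (An assumption of this sketch; the description only
# requires that each combination contain more than one feature word.)
from itertools import combinations

def group_feature_words(words, size=2):
    return list(combinations(words, size))

words = ["login", "password", "forget", "find"]
print(group_feature_words(words))
# 4 words yield C(4, 2) = 6 combinations, matching the six pairs in the
# worked example of the description.
```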
The conversion unit 303 is configured to convert the at least one object feature word combination obtained by the grouping unit 302 into corresponding feature strings.

Specifically, the conversion unit 303 may be configured to:

for a first object feature word combination in the at least one feature word combination, join all the object feature words in the first object feature word combination with a segmentation string to obtain the corresponding feature string, where the segmentation string is a predefined character or character combination.

The selection unit 304 is configured to select a target feature string from the at least one feature string converted by the conversion unit 303, where the target feature string is stored in a preset storage unit, and the preset storage unit is used to store multiple preset feature strings together with the probability value of each preset feature string belonging to at least one class.

The determination unit 305 is configured to determine, according to the probability value of the target feature string selected by the selection unit 304 belonging to the at least one class, the target class to which the object to be classified belongs.

Specifically, the determination unit 305 may be configured to:

determine, according to the probability value of the target feature string belonging to each class in the at least one class, the probability value of the object to be classified belonging to each class;

when the probability value of the object to be classified belonging to a second class is the highest, determine the second class as the target class to which the object to be classified belongs.
Optionally, the device may further include an establishing unit, configured to collect multiple training samples in advance, the multiple training samples belonging to the at least one class, and to perform the following process on each training sample among the multiple training samples:

pre-processing the training sample to obtain at least one training sample feature word of the training sample;

grouping, according to the preset algorithm, the at least one training sample feature word to obtain at least one training sample feature word combination, where each training sample feature word combination contains more than one training sample feature word;

converting the at least one training sample feature word combination into corresponding training feature strings;

computing the probability value of the training feature string belonging to each class in the at least one class, and storing the training feature string, together with the probability values of the training feature string belonging to each class in the at least one class, in the preset storage unit.
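The establishing step above can be sketched as counting, per training feature string, its occurrences under each class and normalizing the counts into probability values. The sample data and dictionary layout are assumptions of this sketch:

```python
# Build the preset storage unit: for each training feature string, count how
# often it occurs under each class, then normalize into probability values.
from collections import Counter, defaultdict

def build_storage(labeled_feature_strings):
    # labeled_feature_strings: iterable of (training_feature_string, class)
    per_string = defaultdict(Counter)
    for s, cls in labeled_feature_strings:
        per_string[s][cls] += 1
    storage = {}
    for s, counts in per_string.items():
        total = sum(counts.values())
        storage[s] = {cls: n / total for cls, n in counts.items()}
    return storage

samples = [("forget_login", "password forgotten"),
           ("forget_login", "how to log in"),
           ("forget_login", "how to log in"),
           ("password_forget", "password forgotten")]
storage = build_storage(samples)
# storage["forget_login"] ≈ {'password forgotten': 1/3, 'how to log in': 2/3}
# storage["password_forget"] == {'password forgotten': 1.0}
```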
Optionally, the device may further include: a processing unit 306, configured to count the total number of times the training feature string occurs and to judge whether the total number is less than a preset threshold; if so, the probability values of the training feature string belonging to each class in the at least one class are ignored.
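The thresholding performed by the processing unit can be sketched as follows; the threshold value and data layout are assumptions of this sketch:

```python
# Ignore training feature strings whose total number of occurrences is below
# a preset threshold (threshold value assumed for illustration).
def filter_rare(storage, counts, threshold=2):
    return {s: probs for s, probs in storage.items()
            if counts.get(s, 0) >= threshold}

storage = {"forget_login": {"how to log in": 0.67},
           "rare_string":  {"password forgotten": 1.0}}
counts = {"forget_login": 3, "rare_string": 1}
print(filter_rare(storage, counts))
# → {'forget_login': {'how to log in': 0.67}}
```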
Optionally, the device may further include: a judging unit 307 and a rejecting unit 308.

The judging unit 307 is configured to judge, for any two object feature word combinations among the multiple object feature word combinations, whether all the object feature words in a first object feature word combination are contained in a second object feature word combination.

The rejecting unit 308 is configured to reject the first object feature word combination from the multiple object feature word combinations if the judging unit 307 judges that all the object feature words in the first object feature word combination are contained in the second object feature word combination.
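The judging and rejecting units together implement a subset check; a minimal sketch (using a proper-subset test so that two identical combinations do not reject each other — an assumption of this sketch) follows:

```python
# Reject any object feature word combination whose feature words are all
# contained in another, larger combination.
def reject_subsumed(word_combinations):
    sets = [set(c) for c in word_combinations]
    kept = []
    for i, combo in enumerate(word_combinations):
        subsumed = any(i != j and sets[i] < sets[j]
                       for j in range(len(sets)))
        if not subsumed:
            kept.append(combo)
    return kept

combos = [("login",), ("login", "password"), ("forget", "find")]
print(reject_subsumed(combos))
# → [('login', 'password'), ('forget', 'find')]
```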
The functions of the functional modules of the device in this embodiment of the application can be realized through the steps of the method embodiment above; the specific working process of the device provided by this application is therefore not repeated here.

In the object classification device provided by this application, the pre-processing unit 301 pre-processes an acquired object to be classified to obtain at least one object feature word of the object to be classified; the grouping unit 302 groups the at least one object feature word according to a preset algorithm to obtain at least one object feature word combination; the conversion unit 303 converts the at least one object feature word combination into corresponding feature strings; the selection unit 304 selects a target feature string from the at least one feature string; and the determination unit 305 determines, according to the probability value of the target feature string belonging to the at least one class, the target class to which the object to be classified belongs. The classification efficiency and classification accuracy of objects can thus be improved.
Those skilled in the art should further appreciate that the objects and algorithm steps of the examples described in the embodiments herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are executed in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled practitioners may use different methods to realize the described functions for each specific application, but such realizations should not be considered beyond the scope of this application.

The steps of the method or algorithm described with reference to the embodiments herein can be implemented in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable ROM, a register, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium well known in the art.

The above embodiments further describe the purpose, technical solution, and beneficial effects of this application in detail. It should be understood that the foregoing is only a specific embodiment of this application and is not intended to limit the scope of protection of this application; any modification, equivalent substitution, or improvement made within the spirit and principles of this application shall be included within the scope of protection of this application.

Claims (12)

1. An object classification method, characterized in that the method comprises:
pre-processing an acquired object to be classified to obtain at least one object feature word of the object to be classified;
grouping the at least one object feature word according to a preset algorithm to obtain at least one object feature word combination, wherein each object feature word combination contains more than one object feature word;
converting the at least one object feature word combination into corresponding feature strings;
selecting a target feature string from the at least one feature string, wherein the target feature string is stored in a preset storage unit, and the preset storage unit is used to store multiple preset feature strings and the probability value of each preset feature string belonging to at least one class;
determining, according to the probability value of the target feature string belonging to the at least one class, the target class to which the object to be classified belongs.
2. The method according to claim 1, characterized in that the method further comprises a step of establishing the preset storage unit, including:
collecting multiple training samples in advance, the multiple training samples belonging to the at least one class;
performing the following process on each training sample among the multiple training samples:
pre-processing the training sample to obtain at least one training sample feature word of the training sample;
grouping the at least one training sample feature word according to the preset algorithm to obtain at least one training sample feature word combination, wherein each training sample feature word combination contains more than one training sample feature word;
converting the at least one training sample feature word combination into corresponding training feature strings;
computing the probability value of the training feature string belonging to each class in the at least one class, and storing the training feature string and the probability values of the training feature string belonging to each class in the at least one class in the preset storage unit.
3. The method according to claim 2, characterized in that the method further comprises:
counting the total number of times the training feature string occurs, and judging whether the total number is less than a preset threshold; if so, ignoring the probability values of the training feature string belonging to each class in the at least one class.
4. The method according to any one of claims 1-3, characterized in that when multiple object feature word combinations are obtained, the method further comprises:
for any two object feature word combinations among the multiple object feature word combinations, judging whether all the object feature words in a first object feature word combination are contained in a second object feature word combination, and if so, rejecting the first object feature word combination from the multiple object feature word combinations.
5. The method according to any one of claims 1-4, characterized in that converting the at least one object feature word combination into corresponding feature strings specifically comprises:
for a first object feature word combination in the at least one feature word combination, joining all the object feature words in the first object feature word combination with a segmentation string to obtain the corresponding feature string, wherein the segmentation string is a predefined character or character combination.
6. The method according to any one of claims 1-5, characterized in that determining, according to the probability value of the target feature string belonging to each class in the at least one class, the target class to which the object to be classified belongs specifically comprises:
determining, according to the probability value of the target feature string belonging to each class in the at least one class, the probability value of the object to be classified belonging to each class;
when the probability value of the object to be classified belonging to a second class is the highest, determining the second class as the target class to which the object to be classified belongs.
7. An object classification device, characterized in that the device comprises: a pre-processing unit, a grouping unit, a conversion unit, a selection unit, and a determination unit;
the pre-processing unit is configured to pre-process an acquired object to be classified to obtain at least one object feature word of the object to be classified;
the grouping unit is configured to group, according to a preset algorithm, the at least one object feature word obtained by the pre-processing unit to obtain at least one object feature word combination, wherein each object feature word combination contains more than one object feature word;
the conversion unit is configured to convert the at least one object feature word combination obtained by the grouping unit into corresponding feature strings;
the selection unit is configured to select a target feature string from the at least one feature string converted by the conversion unit, wherein the target feature string is stored in a preset storage unit, and the preset storage unit is used to store multiple preset feature strings and the probability value of each preset feature string belonging to at least one class;
the determination unit is configured to determine, according to the probability value of the target feature string selected by the selection unit belonging to the at least one class, the target class to which the object to be classified belongs.
8. The device according to claim 7, characterized in that the device further comprises: an establishing unit, configured to collect multiple training samples in advance, the multiple training samples belonging to the at least one class, and to perform the following process on each training sample among the multiple training samples:
pre-processing the training sample to obtain at least one training sample feature word of the training sample;
grouping the at least one training sample feature word according to the preset algorithm to obtain at least one training sample feature word combination, wherein each training sample feature word combination contains more than one training sample feature word;
converting the at least one training sample feature word combination into corresponding training feature strings;
computing the probability value of the training feature string belonging to each class in the at least one class, and storing the training feature string and the probability values of the training feature string belonging to each class in the at least one class in the preset storage unit.
9. The device according to claim 8, characterized in that the device further comprises: a processing unit, configured to count the total number of times the training feature string occurs and to judge whether the total number is less than a preset threshold; if so, the probability values of the training feature string belonging to each class in the at least one class are ignored.
10. The device according to any one of claims 7-9, characterized in that the device further comprises: a judging unit and a rejecting unit;
the judging unit is configured to judge, for any two object feature word combinations among the multiple object feature word combinations, whether all the object feature words in a first object feature word combination are contained in a second object feature word combination;
the rejecting unit is configured to reject the first object feature word combination from the multiple object feature word combinations if the judging unit judges that all the object feature words in the first object feature word combination are contained in the second object feature word combination.
11. The device according to any one of claims 7-10, characterized in that the conversion unit is specifically configured to:
for a first object feature word combination in the at least one feature word combination, join all the object feature words in the first object feature word combination with a segmentation string to obtain the corresponding feature string, wherein the segmentation string is a predefined character or character combination.
12. The device according to any one of claims 7-11, characterized in that the determination unit is specifically configured to:
determine, according to the probability value of the target feature string belonging to each class in the at least one class, the probability value of the object to be classified belonging to each class;
when the probability value of the object to be classified belonging to a second class is the highest, determine the second class as the target class to which the object to be classified belongs.
CN201610134575.1A 2016-03-09 2016-03-09 object classification method and device Pending CN107180022A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610134575.1A CN107180022A (en) 2016-03-09 2016-03-09 object classification method and device


Publications (1)

Publication Number Publication Date
CN107180022A true CN107180022A (en) 2017-09-19

Family

ID=59829534

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610134575.1A Pending CN107180022A (en) 2016-03-09 2016-03-09 object classification method and device

Country Status (1)

Country Link
CN (1) CN107180022A (en)


Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101477544A (en) * 2009-01-12 2009-07-08 腾讯科技(深圳)有限公司 Rubbish text recognition method and system
US20100070339A1 (en) * 2008-09-15 2010-03-18 Google Inc. Associating an Entity with a Category
CN102426585A (en) * 2011-08-09 2012-04-25 中国科学技术信息研究所 Automatic webpage classification method based on Bayesian network
US20120278336A1 (en) * 2011-04-29 2012-11-01 Malik Hassan H Representing information from documents
CN103336766A (en) * 2013-07-04 2013-10-02 微梦创科网络科技(中国)有限公司 Short text garbage identification and modeling method and device
CN104392006A (en) * 2014-12-17 2015-03-04 中国农业银行股份有限公司 Event query processing method and device


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110321557A (en) * 2019-06-14 2019-10-11 广州多益网络股份有限公司 A kind of file classification method, device, electronic equipment and storage medium
CN111488950A (en) * 2020-05-14 2020-08-04 支付宝(杭州)信息技术有限公司 Classification model information output method and device
CN111488950B (en) * 2020-05-14 2021-10-15 支付宝(杭州)信息技术有限公司 Classification model information output method and device
WO2021228152A1 (en) * 2020-05-14 2021-11-18 支付宝(杭州)信息技术有限公司 Classification model information output


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication — Application publication date: 20170919