CN110442722A - Method and device for training classification model and method and device for data classification - Google Patents

Method and device for training classification model and method and device for data classification

Info

Publication number
CN110442722A
Authority
CN
China
Prior art keywords
characteristic
sample
label
class label
classification model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910746175.XA
Other languages
Chinese (zh)
Other versions
CN110442722B (en)
Inventor
王献
唐剑波
李长亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Kingsoft Digital Entertainment Co Ltd
Chengdu Kingsoft Digital Entertainment Co Ltd
Original Assignee
Beijing Kingsoft Digital Entertainment Co Ltd
Chengdu Kingsoft Digital Entertainment Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Kingsoft Digital Entertainment Co Ltd, Chengdu Kingsoft Digital Entertainment Co Ltd filed Critical Beijing Kingsoft Digital Entertainment Co Ltd
Priority to CN201910746175.XA priority Critical patent/CN110442722B/en
Publication of CN110442722A publication Critical patent/CN110442722A/en
Application granted granted Critical
Publication of CN110442722B publication Critical patent/CN110442722B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a method and a device for training a classification model and a method and a device for data classification, wherein the method for training the classification model comprises the following steps: acquiring a sample data set, wherein the sample data set comprises at least three class labels and the characteristic data corresponding to the class labels, and counting the proportion of the number of each class label in the sample data set; dividing the class labels in the sample data set into at least two sample groups according to the proportion of the number of each class label in the sample data set; and inputting each sample group into the corresponding classification model for training until a training condition is reached. Even when the class label proportions in the sample data set are unbalanced, the quality of the processed sample data set is greatly improved, which ensures the training effect of the classification model and greatly improves the classification accuracy of the trained classification model when it is used for actual classification prediction.

Description

Method and device for training a classification model, and method and device for data classification
Technical field
The present application relates to the technical field of data classification, and in particular to a method and device for training a classification model, a method and device for data classification, a computing device, and a computer-readable storage medium.
Background
Data classification is the automatic labeling of data according to a certain classification system or standard; for example, text classification automatically classifies input text according to a certain classification scheme. Text classification technology has been widely used in natural language processing fields such as text auditing, advertisement filtering, sentiment analysis, and pornographic content detection.
In existing methods for training a classification model, sample data is usually selected directly from the sample data set and input into the classification model for training. However, the sample data set often contains far more samples of one class than of the other classes, so that the classes in the training sample set are imbalanced, i.e., the numbers of samples of the classes in the sample data set are unbalanced. A classification model trained in this way only classifies the different classes in the training sample set well; when it is used to classify a raw data set, the class proportions of the raw data set differ greatly from those of the training sample set, the error rate of the trained classification model's predictions is high, and the existing trained classification model is difficult to apply in practice.
In training a classification model, building a training set in which every sample class is balanced requires spending a great deal of manpower and material resources to find and process material, which greatly increases the cost of training the classification model.
Summary of the invention
In view of this, embodiments of the present application provide a method and device for training a classification model, a method and device for data classification, a computing device, and a computer-readable storage medium, so as to solve the technical deficiencies in the prior art.
An embodiment of the present application discloses a method for training a classification model, comprising:
acquiring a sample data set, wherein the sample data set comprises at least three class labels and the characteristic data corresponding to each class label, and counting the proportion of the number of each class label in the sample data set;
dividing the class labels in the sample data set and their corresponding characteristic data into at least two sample groups according to the proportion of the number of each class label in the sample data set;
inputting each sample group into the corresponding classification model for training until a training condition is reached.
An embodiment of the present application also discloses a method for data classification, comprising:
receiving characteristic data to be classified;
inputting the characteristic data to be classified into a first classification model;
in the case that the first classification model outputs a first class label, determining that the characteristic data to be classified belongs to the category corresponding to the first class label;
in the case that the first classification model outputs one of the remaining class labels other than the first class label, inputting the characteristic data to be classified into a second classification model, and determining the category corresponding to the characteristic data to be classified according to the output result of the second classification model.
An embodiment of the present application discloses a device for training a classification model, comprising:
a processing module, configured to acquire a sample data set, wherein the sample data set comprises at least three class labels and the characteristic data corresponding to each class label, and to count the proportion of the number of each class label in the sample data set;
a division module, configured to divide the class labels in the sample data set and their corresponding characteristic data into at least two sample groups according to the proportion of the number of each class label in the sample data set;
a training module, configured to input each sample group into the corresponding classification model for training until a training condition is reached.
An embodiment of the present application discloses a device for data classification, comprising:
a receiving module, configured to receive characteristic data to be classified;
an input module, configured to input the characteristic data to be classified into a first classification model;
a first determining module, configured to, in the case that the first classification model outputs a first class label, determine that the characteristic data to be classified belongs to the category corresponding to the first class label;
a second determining module, configured to, in the case that the first classification model outputs one of the remaining class labels other than the first class label, input the characteristic data to be classified into a second classification model and determine the category corresponding to the characteristic data to be classified according to the output result of the second classification model.
An embodiment of the present application discloses a computing device, comprising a memory, a processor, and computer instructions stored in the memory and executable on the processor, wherein the processor, when executing the instructions, implements the steps of the method for training a classification model or the method for data classification as described above.
An embodiment of the present application discloses a computer-readable storage medium storing computer instructions which, when executed by a processor, implement the steps of the method for training a classification model or the method for data classification as described above.
In the method and device for training a classification model and the method and device for data classification provided by the present application, the proportion of the number of each class label in the sample data set is counted, and the class labels in the sample data set are divided into at least two sample groups according to those proportions. Even when the class label proportions in the sample data set are unbalanced, the quality of the processed sample data set is greatly improved; each sample group is then input into the corresponding classification model for training until a training condition is reached, which ensures the training effect of the classification model and greatly improves the classification accuracy of the trained classification model in actual classification prediction.
Brief description of the drawings
Fig. 1 is a schematic structural diagram of a computing device according to an embodiment of the present application;
Fig. 2 is a schematic flow chart of a method for training a classification model according to an embodiment of the present application;
Fig. 3 is a schematic flow chart of a method for training a classification model according to another embodiment of the present application;
Fig. 4 is a schematic flow chart of a method for data classification according to an embodiment of the present application;
Fig. 5 is a schematic flow chart of determining the category corresponding to the characteristic data to be classified in the present application;
Fig. 6 is a schematic structural diagram of a device for training a classification model according to an embodiment of the present application;
Fig. 7 is a schematic structural diagram of a device for data classification according to an embodiment of the present application.
Detailed description of the embodiments
Many details are set forth in the following description in order to facilitate a full understanding of the present application. However, the present application can be implemented in many ways other than those described herein, and those skilled in the art can make similar extensions without departing from the spirit of the present application; therefore, the present application is not limited by the specific implementations disclosed below.
The terminology used in one or more embodiments of this specification is for the purpose of describing particular embodiments only and is not intended to limit one or more embodiments of this specification. The singular forms "a", "the", and "said" used in one or more embodiments of this specification and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" used in one or more embodiments of this specification refers to and includes any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, etc. may be used in one or more embodiments of this specification to describe various pieces of information, such information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of one or more embodiments of this specification, "first" may also be referred to as "second", and similarly, "second" may also be referred to as "first". Depending on the context, the word "if" as used herein may be interpreted as "when", "while", or "in response to determining".
First, the terms involved in one or more embodiments of the present invention are explained.
Binary classification model: a model that performs binary classification on data. A binary classification model may be a generalized linear classifier that performs binary classification on data by means of supervised learning; based on the training samples, the linear classifier finds a hyperplane in the sample space that separates the two classes of samples.
Multi-class classification model: a model that can classify data belonging to multiple categories. A multi-class classification model may be a boosted tree model, in which many tree models are combined to form a very strong classifier; in other words, many weak classifiers are combined to form a strong classifier for use in multi-class problems.
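For illustration only, the two model types described above could be instantiated as follows. This is a minimal sketch assuming scikit-learn estimators (a logistic regression as the generalized linear binary classifier and gradient-boosted trees as the multi-class classifier); the application itself does not name any particular library or algorithm.

```python
# A minimal sketch of the two model types, assuming scikit-learn; the choice of
# estimators is an assumption for illustration, not prescribed by the application.
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

def make_binary_model():
    # Generalized linear classifier trained by supervised learning; it learns a
    # hyperplane that separates the two classes of samples.
    return LogisticRegression(max_iter=1000)

def make_multiclass_model():
    # Boosted tree model: many weak tree learners combined into one strong
    # classifier for the multi-class problem.
    return GradientBoostingClassifier(n_estimators=100)
```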
Fig. 1 shows a structural block diagram of a computing device 100 according to an embodiment of this specification. Components of the computing device 100 include, but are not limited to, a memory 110 and a processor 120. The processor 120 is connected to the memory 110 through a bus 130, and a database 150 is used to store data.
The computing device 100 further includes an access device 140, which enables the computing device 100 to communicate via one or more networks 160. Examples of these networks include the public switched telephone network (PSTN), a local area network (LAN), a wide area network (WAN), a personal area network (PAN), or a combination of communication networks such as the Internet. The access device 140 may include one or more of any type of wired or wireless network interface (for example, a network interface card (NIC)), such as an IEEE 802.11 wireless local area network (WLAN) interface, a worldwide interoperability for microwave access (Wi-MAX) interface, an Ethernet interface, a universal serial bus (USB) interface, a cellular network interface, a Bluetooth interface, a near-field communication (NFC) interface, and so on.
In an embodiment of this specification, the above components of the computing device 100 and other components not shown in Fig. 1 may also be connected to each other, for example, through the bus. It should be understood that the structural block diagram of the computing device shown in Fig. 1 is for exemplary purposes only and is not a limitation on the scope of this specification. Those skilled in the art may add or replace other components as needed.
The computing device 100 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (for example, a tablet computer, a personal digital assistant, a laptop, a notebook, a netbook, etc.), a mobile phone (for example, a smartphone), a wearable computing device (for example, a smart watch, smart glasses, etc.) or other type of mobile device, or a stationary computing device such as a desktop computer or a PC. The computing device 100 may also be a mobile or stationary server.
The processor 120 may execute the steps of the method shown in Fig. 2. Fig. 2 is a schematic flow chart of a method for training a classification model according to an embodiment of the present application, including steps 202 to 206.
Step 202: acquire a sample data set, wherein the sample data set comprises at least three class labels and the characteristic data corresponding to each class label, and count the proportion of the number of each class label in the sample data set.
The sample data set is the training set used for training the classification model. The sample data set may be a text sample set or a picture sample set. For example, a class label in a text sample set may indicate the category of legal documents a company receives most often; the categories of legal documents may be litigation opinions, indictments, decisions not to prosecute, public prosecution opinions, protest documents, criminal protest documents, civil protest documents, administrative protest documents, and procuratorial suggestion documents.
The characteristic data corresponding to a class label is the data input into the classification model for computation. The characteristic data may be a company's historical dispute data, financial statement data, company category data, project data, funding data, technical data, market data, and business environment data.
It should be noted that, in the above example, the task of practical application after the classification model is trained is: according to the historical dispute data, financial statement data, company category data, and project data of the company in the above characteristic data, the classification model outputs the category of legal documents the company will receive most often within a period of time, i.e., it predicts which category of legal documents the company will receive most often within a period of time.
The proportion of the number of each class label in the sample data set is counted.
For example, the following is a schematic illustration using a sample data set that contains four class labels A, B, C, and D, as shown in Table 1.
Table 1

  Class label   Category                      Proportion
  A             Litigation opinion            70%
  B             Indictment                    15%
  C             Public prosecution opinion    10%
  D             Criminal protest document     5%
Class label A is the litigation opinion, and the characteristic data corresponds to class label A; it can be understood that the characteristic data of a company corresponds to class label A. The proportion of class label A is 70%, which means that in the sample data set, for 70% of the companies the legal documents received most often within a certain period of time are litigation opinions.
Class label B is the indictment, and the characteristic data corresponds to class label B; it can be understood that the characteristic data of a company corresponds to class label B. The proportion of class label B is 15%, which means that in the sample data set, for 15% of the companies the legal documents received most often within a certain period of time are indictments.
Class label C is the public prosecution opinion, with a proportion of 10%; class label D is the criminal protest document, with a proportion of 5%. Refer to the description above for details, which are not repeated here.
The proportions of the four class labels A, B, C, and D in the above sample data set are 70%, 15%, 10%, and 5%, respectively.
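As a minimal sketch of how the counting in step 202 could be done, assuming the sample data set is held as a list of (characteristic data, class label) pairs; this representation is chosen here for illustration and is not prescribed by the application:

```python
# Count the proportion of each class label in the sample data set (step 202).
# The (features, label) pair representation is an assumption for illustration.
from collections import Counter

def label_proportions(samples):
    counts = Counter(label for _, label in samples)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Example mirroring the proportions of Table 1: 70% A, 15% B, 10% C, 5% D.
samples = [(None, "A")] * 70 + [(None, "B")] * 15 + [(None, "C")] * 10 + [(None, "D")] * 5
print(label_proportions(samples))  # {'A': 0.7, 'B': 0.15, 'C': 0.1, 'D': 0.05}
```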
Step 204: divide the class labels in the sample data set and their corresponding characteristic data into at least two sample groups according to the proportion of the number of each class label in the sample data set.
For example, the proportion of class label A in the above example reaches 70%, so the proportions of the four class labels are unbalanced. The above is only a schematic illustration of class label proportions and characteristic data; in a sample data set actually used for training, the number of class label types may reach ten or more and the imbalance of class label proportions is even more severe. The problem that imbalanced class label proportions affect the training effect of the classification model is solved by the following steps.
The first class label with the highest proportion among the class labels and its corresponding characteristic data are divided into a first sample group, and the remaining class labels other than the first class label and their corresponding characteristic data in the sample data set are divided into a second sample group.
In other words, the class label with the highest proportion and its corresponding characteristic data are divided into the first sample group; in the above example, class label A and its corresponding characteristic data form the first sample group, and the remaining class labels other than class label A and their corresponding characteristic data in the sample data set are divided into the second sample group, i.e., class labels B, C, and D and their corresponding characteristic data form the second sample group. The second sample group can also be understood as the non-first sample group; the first sample group may be assigned a positive class label and the second sample group a negative class label, and the positive and negative class labels may be set to 1 and 0.
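The division in step 204 could be sketched as follows, reusing the representation assumed above; the helper name and the 1/0 encoding of the positive and negative class labels are illustrative:

```python
# Split the samples into the first sample group (the single most frequent label)
# and the second sample group (every other label), as in step 204.
def split_by_top_label(samples, proportions):
    first_label = max(proportions, key=proportions.get)          # e.g. class label A
    first_group = [s for s in samples if s[1] == first_label]    # positive class, label 1
    second_group = [s for s in samples if s[1] != first_label]   # negative class, label 0
    return first_label, first_group, second_group
```

The first classification model is then trained on the binary task of distinguishing the first class label (1) from everything else (0), while the second classification model is trained only on the samples of the second sample group.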
Step 206: input each sample group into the corresponding classification model for training until a training condition is reached.
For example, in the above example, it is determined that the class label proportions in the second sample group are balanced, because the ratio of the proportion of class label B (15%) to the proportion of class label C (10%) is 1.5, which is lower than the trimming threshold of 3.5. The first sample group and the second sample group are therefore input into the first classification model for training until a training condition is reached, and the characteristic data divided into the second sample group and the corresponding class labels are input into the second classification model for training until a training condition is reached.
In the above embodiment of the present application, the class label with the highest proportion and its corresponding characteristic data are divided directly into the first sample group, and the remaining class labels other than the first class label with the highest proportion and their corresponding characteristic data in the sample data set are divided into the second sample group. In the above example, the overall proportions of the first class label of the first sample group and of all class labels of the second sample group are 70% and 30%, respectively, so the proportions of the first sample group and the second sample group tend toward balance, which improves the training effect of the first classification model. The characteristic data divided into the second sample group and the corresponding class labels are then input into the second classification model, in which the proportions of class labels B, C, and D are 15%, 10%, and 5%, respectively; the class label proportions within the second sample group likewise tend toward balance, which improves the training effect of the second classification model. Thus, even when the class label proportions in the sample data set are unbalanced, the quality of the processed sample data set is greatly improved, which ensures the training effect of the classification model; when the trained classification model is used for actual classification prediction, its classification accuracy is high, which guarantees the practical effect of applications in natural language processing fields such as text auditing, advertisement filtering, sentiment analysis, and pornographic content detection.
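The balance check mentioned above, which compares the ratio of the two largest label proportions in a group against the trimming threshold, could be sketched as follows; the function name and the default threshold of 3.5 taken from the example are assumptions for illustration:

```python
# Decide whether a group's class label proportions are balanced by comparing the
# largest and second-largest proportions against the trimming threshold.
def is_balanced(proportions, trimming_threshold=3.5):
    top_two = sorted(proportions.values(), reverse=True)[:2]
    if len(top_two) < 2:
        return True
    return top_two[0] / top_two[1] <= trimming_threshold

# Second sample group of Table 1: 0.15 / 0.10 = 1.5 <= 3.5, so it is balanced.
print(is_balanced({"B": 0.15, "C": 0.10, "D": 0.05}))  # True
```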
The following describes an embodiment of the present application in detail with reference to a specific example.
Assume that the acquired sample data set contains five class labels A, B, C, D, and E and their corresponding characteristic data.
The proportion of the number of each class label in the sample data set is counted. Table 2 shows the proportion of the number of each of the five class labels in the sample data set.
Table 2

  Class label   Category                          Proportion
  A             Litigation opinion                50%
  B             Indictment                        30%
  C             Public prosecution opinion        8%
  D             Criminal protest document         7%
  E             Administrative protest document   5%
The class labels in the sample data set are divided into at least two sample groups according to the proportion of the number of each class label in the sample data set.
The first class label with the highest proportion among the class labels and its corresponding characteristic data are divided into a first sample group, and the remaining class labels other than the first class label and their corresponding characteristic data in the sample data set are divided into a second sample group: class label A and its corresponding characteristic data form the first sample group.
The remaining class labels other than class label A and their corresponding characteristic data in the sample data set are divided into the second sample group, i.e., class labels B, C, D, and E and their corresponding characteristic data form the second sample group.
Among the remaining class labels other than class label A, i.e., among class labels B, C, D, and E, the ratio of the proportion of the class label with the highest proportion, B, to the proportion of the class label with the second highest proportion, C, exceeds the trimming threshold: the ratio of the proportion of class label B (30%) to the proportion of class label C (8%) is 3.75, which exceeds the trimming threshold of 3.5. It is therefore determined that the class label proportions in the second sample group are unbalanced, so the second class label with the highest proportion among the class labels of the second sample group and its corresponding characteristic data are divided into a third sample group, i.e., class label B and its corresponding characteristic data form the third sample group, and class labels C, D, and E and their corresponding characteristic data are divided into a fourth sample group.
Since the class labels in the fourth sample group are balanced, the fourth sample group is not divided further; of course, if the fourth sample group were unbalanced, it would continue to be divided.
The proportion of class label A in the first sample group is 50%, and the total proportion of class labels B, C, D, and E in the second sample group is 50%; the first sample group and the second sample group are input into a binary classification model for training until a training condition is reached.
The proportion of class label B in the third sample group is 30%, and the total proportion of class labels C, D, and E in the fourth sample group is 20%; the third sample group and the fourth sample group are input into another binary classification model for training until a training condition is reached.
The proportions of class labels C, D, and E in the fourth sample group are 8%, 7%, and 5%, respectively; the characteristic data divided into the fourth sample group and the corresponding class labels are input into a multi-class classification model for training until a training condition is reached.
In the above example, the class label proportions of the first and second sample groups used to train the first binary classification model are balanced, the class label proportions of the third and fourth sample groups used to train the other binary classification model likewise tend toward balance, and the proportions of the class labels in the fourth sample group used to train the multi-class classification model also tend toward balance. Because the quality of the processed sample data set is greatly improved, the training effect of the classification models is ensured, and the classification accuracy of the trained classification models in actual classification prediction is greatly improved.
It should be noted that, based on the class label proportions in Table 2, the classification models to be trained comprise two binary classification models and one multi-class classification model; in actual classification model training, the types and number of classification models are determined according to the class label proportions in the sample data set.
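The grouping of the Table 2 example could be sketched as the following loop, which reuses the label_proportions, split_by_top_label, and is_balanced helpers sketched above and again assumes scikit-learn estimators; all names and the data layout are illustrative, and the application does not mandate this particular structure:

```python
# Repeatedly split off the most frequent label for a binary model and recurse on the
# remainder until the remainder is balanced, then train one multi-class model on it.
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

def build_model_cascade(samples, trimming_threshold=3.5):
    models = []        # list of ("binary", top_label, model) plus a final ("multi", None, model)
    remaining = list(samples)
    while True:
        props = label_proportions(remaining)
        top_label, _, rest_group = split_by_top_label(remaining, props)
        # Binary model: top label (1) versus all remaining labels (0).
        X = [features for features, _ in remaining]
        y = [1 if label == top_label else 0 for _, label in remaining]
        models.append(("binary", top_label, LogisticRegression(max_iter=1000).fit(X, y)))
        remaining = rest_group
        if is_balanced(label_proportions(remaining), trimming_threshold):
            break
    # Multi-class model over the final, balanced group (labels C, D, E in the Table 2 example).
    X = [features for features, _ in remaining]
    y = [label for _, label in remaining]
    models.append(("multi", None, GradientBoostingClassifier().fit(X, y)))
    return models
```

On the Table 2 proportions this produces two binary models (A versus the rest, then B versus the rest) and one multi-class model over C, D, and E, matching the example.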
Fig. 3 shows a schematic flow chart of a method for training a classification model according to another embodiment of the present application, including steps 302 to 310.
Step 302: acquire a sample data set, wherein the sample data set comprises at least three class labels and the characteristic data corresponding to each class label, and count the proportion of the number of each class label in the sample data set.
Step 304: set a first threshold, and delete from the sample data set the class labels whose proportion is lower than the first threshold and their corresponding characteristic data.
For example, the first threshold is set to 1%, and the class labels whose proportion is less than 1%, i.e., lower than the first threshold, and their corresponding characteristic data are deleted from the sample data set.
Since class labels whose proportion is lower than the first threshold and their corresponding characteristic data have very little influence on model training, deleting from the sample data set the class labels whose proportion is lower than the first threshold ensures that the class label proportions in the sample data set tend toward balance, which improves the overall effect of classification model training in the following steps.
Step 306: set a second threshold, wherein the second threshold is greater than the first threshold, and merge the class labels whose proportion lies between the first threshold and the second threshold into a combined class label.
The second threshold may be set to 5%; class labels whose proportion lies between the first threshold of 1% and the second threshold of 5% are merged into a combined class label, which further improves the balance of class labels in the sample data set. By setting the first threshold and the second threshold and processing the sample data set accordingly, the training effect of the classification model in the following steps is greatly improved.
It should be noted that the specific values of the first threshold and the second threshold may be determined according to the number of class labels and the proportions of the class labels in the actual sample data set.
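Steps 304 and 306 could be sketched as a single preprocessing pass over the sample data set, reusing the label_proportions helper from above; the 1% and 5% values follow the example, and the name of the combined class label as well as the treatment of the threshold boundaries are illustrative assumptions:

```python
# Delete labels below the first threshold (step 304) and merge labels between the
# first and second thresholds into one combined class label (step 306).
COMBINED_LABEL = "combined"

def filter_and_merge(samples, first_threshold=0.01, second_threshold=0.05):
    props = label_proportions(samples)
    cleaned = []
    for features, label in samples:
        p = props[label]
        if p < first_threshold:
            continue                    # step 304: drop rare class labels entirely
        if p < second_threshold:
            label = COMBINED_LABEL      # step 306: merge into the combined class label
        cleaned.append((features, label))
    return cleaned
```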
Step 308: divide the first class label with the highest proportion among the class labels and its corresponding characteristic data into a first sample group, and divide the remaining class labels other than the first class label and their corresponding characteristic data in the sample data set into a second sample group.
The first class label with the highest proportion and its corresponding characteristic data form the first sample group; the remaining class labels other than the first class label in the sample data set, together with the combined class label, are divided into the second sample group.
Step 310: input the first sample group and the second sample group into a binary classification model for training until a training condition is reached, and input the characteristic data divided into the negative samples (the second sample group) and the corresponding class labels into a multi-class classification model for training until a training condition is reached.
In the above embodiment of the present application, on the one hand, the class label with the highest proportion and its corresponding characteristic data are divided directly into the first sample group, and the remaining class labels other than the first class label with the highest proportion and their corresponding characteristic data in the sample data set are divided into the second sample group; on the other hand, the class labels whose proportion is lower than the first threshold are deleted from the sample data set, and the class labels whose proportion lies between the first threshold and the second threshold are merged into a combined class label, which improves the balance of class labels in the sample data set. Thus, even when the class label proportions in the sample data set are unbalanced, the present application can still ensure the training effect of the classification model, greatly improve the accuracy of the predictions of the trained classification model, and guarantee the practical effect of applications in natural language processing fields such as text auditing, advertisement filtering, sentiment analysis, and pornographic content detection.
Fig. 4 is a schematic flow chart of a method for data classification according to an embodiment of the present application, including steps 402 to 408.
Step 402: receive characteristic data to be classified.
In the above example, for instance, to predict the category of legal documents a company will receive most often within the coming period of time, the company's historical dispute data, financial statement data, company category data, project data, funding data, technical data, marketing data, and business environment data are received.
Step 404: input the characteristic data to be classified into the first classification model.
Step 406: in the case that the first classification model outputs the first class label, determine that the characteristic data to be classified belongs to the category corresponding to the first class label.
Step 408: in the case that the first classification model outputs one of the remaining class labels other than the first class label, input the characteristic data to be classified into the second classification model, and determine the category corresponding to the characteristic data to be classified according to the output result of the second classification model.
As shown in Fig. 5, step 408 specifically includes steps 502 to 506.
Step 502: judge whether the characteristic data to be classified belongs to a combined class; if so, execute step 504; if not, execute step 506.
Step 504: determine the category corresponding to the characteristic data to be classified according to the categories corresponding to the characteristic data in the combined class.
Step 504 includes steps 5042 to 5044.
Step 5042: obtain the proportions, in the sample data set, of the class labels corresponding to the at least two categories in the combined class.
The proportion, in the sample data set, of the class label corresponding to each category in the combined class is used to determine the probability that the company will receive mostly that category of legal documents within the coming period of time.
Step 5044: determine one category in the combined class as the category of the characteristic data to be classified.
According to the probability corresponding to each category in the combined class, one category in the combined class is randomly selected as the category of legal documents the company will receive most often within the coming period of time, which can further improve the accuracy of classifying the company.
Step 506: take the output result of the second classification model as the category corresponding to the characteristic data to be classified.
In the above embodiment of the present application, the characteristic data to be classified is input into the first classification model; if the output result of the first classification model is the first class label, the category of the characteristic data to be classified is determined directly; if the output result of the first classification model is one of the remaining class labels other than the first class label, the characteristic data to be classified is input into the second classification model, and the category corresponding to the characteristic data to be classified is determined according to the output result of the second classification model, which greatly improves the accuracy of data classification.
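The two-stage prediction of Fig. 4 could be sketched as follows, assuming the first classification model is a binary classifier trained on the task of first class label versus the rest, as in the training sketches above, and the second classification model handles the remaining labels; all names are illustrative:

```python
# Two-stage prediction: the first (binary) model either settles the prediction or
# hands the sample over to the second (multi-class) model.
def classify(features, first_model, second_model, first_label):
    if first_model.predict([features])[0] == 1:   # first model outputs the first class label
        return first_label
    # Otherwise fall through to the second classification model.
    return second_model.predict([features])[0]
```

In the hierarchical case with more than two models, the same fall-through pattern is simply chained from one stage to the next.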
To facilitate understanding of the technical solution of the present application, the following illustrates the specific implementation process of the method for model training and the method for data classification of the present application with an example.
Assume that the acquired sample data set contains seven class labels A, B, C, D, E, F, and G and their corresponding characteristic data.
The proportion of the number of each class label in the sample data set is counted. Table 3 shows the proportion of the number of each of the seven class labels in the sample data set.
Table 3
For example, the first threshold is set to 1%. The proportion of class label G is 0.5%, which is less than 1%, so class label G, whose proportion is lower than the first threshold, and its corresponding characteristic data are deleted from the sample data set.
The second threshold is set to 5%. The proportion of class label E is 2% and the proportion of class label F is 1%, both lying between the first threshold of 1% and the second threshold of 5%, so class label E and class label F are merged into a combined class label.
Among the class labels, the first class label with the highest proportion is A. Class label A and its corresponding characteristic data form the first sample group, and the remaining class labels other than class label A and their corresponding characteristic data in the sample data set are divided into the second sample group, i.e., class labels B, C, and D, the combined class label, and their respective corresponding characteristic data form the second sample group.
The first sample group and the second sample group are input into a binary classification model for training until a training condition is reached.
The characteristic data divided into the second sample group and the corresponding class labels are input into a multi-class classification model for training until a training condition is reached, i.e., class labels B, C, and D, the combined class label, and their corresponding characteristic data in the second sample group are input into the multi-class classification model, completing the training of the classification models.
The following illustrates the method for data classification, taking the classification models trained on the above sample data set as an example.
Suppose it is now required to predict which category of legal documents a company will receive most often within the coming period of time.
Characteristic data to be classified is received; the characteristic data to be classified may be the company's historical dispute data, financial statement data, company category data, project data, funding data, technical data, marketing data, and business environment data.
The characteristic data to be classified is input into the first classification model.
If the binary classification model outputs class label A, the category corresponding to the characteristic data to be classified is directly determined to be the litigation opinion, i.e., the category of legal documents the company will receive most often within the coming period of time is the litigation opinion.
If the output result of the binary classification model is one of the remaining class labels other than class label A, the characteristic data to be classified is input into the multi-class classification model, and the category corresponding to the characteristic data to be classified is determined according to the output result of the multi-class classification model; for example, the category of legal documents the company will receive most often within the coming period of time is the indictment, the public prosecution opinion, or the criminal protest document.
If the output result of the multi-class classification model is the combined class, then, according to the proportions in the sample data set of the class labels corresponding to the civil protest document and the administrative protest document in the combined class, which are 2% and 1% respectively, the probability that the predicted category is the civil protest document is 2/3 and the probability that it is the administrative protest document is 1/3, and the category of legal documents the company will receive most often within the coming period of time is predicted at random according to the probabilities of the two categories.
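The probability-weighted random selection for a combined class, as in the 2/3 versus 1/3 example above, could be sketched as follows; the function and label names are illustrative assumptions:

```python
# Resolve a combined-class prediction by drawing one member category at random,
# weighted by each category's proportion in the sample data set (steps 5042-5044).
import random

def resolve_combined_class(member_proportions):
    labels = list(member_proportions)
    weights = list(member_proportions.values())
    return random.choices(labels, weights=weights, k=1)[0]

# Civil protest document (2%) versus administrative protest document (1%),
# i.e. selection probabilities of 2/3 and 1/3.
print(resolve_combined_class({"civil protest document": 0.02,
                              "administrative protest document": 0.01}))
```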
Fig. 6 is a schematic structural diagram of a device for training a classification model according to an embodiment of the present application. The device for training a classification model comprises:
a processing module 602, configured to acquire a sample data set, wherein the sample data set comprises at least three class labels and the characteristic data corresponding to each class label, and to count the proportion of the number of each class label in the sample data set;
a division module 604, configured to divide the class labels in the sample data set and their corresponding characteristic data into at least two sample groups according to the proportion of the number of each class label in the sample data set;
a training module 606, configured to input each sample group into the corresponding classification model for training until a training condition is reached.
Preferably, the division module 604 is further configured to divide the first class label with the highest proportion among the class labels and its corresponding characteristic data into a first sample group, and to divide the remaining class labels other than the first class label and their corresponding characteristic data in the sample data set into a second sample group.
Preferably, the training module 606 is further configured to input the first sample group and the second sample group into a first classification model for training until a training condition is reached, and to input the characteristic data divided into the second sample group and the corresponding class labels into a second classification model for training until a training condition is reached.
In the device for training a classification model of the present application, the class label with the highest proportion and its corresponding characteristic data are divided directly into the first sample group, and the remaining class labels other than the first class label with the highest proportion and their corresponding characteristic data in the sample data set are divided into the second sample group; the proportions of the first class label in the first sample group and of all class labels in the second sample group tend toward balance, which improves the training effect of the first classification model. The characteristic data of the negative samples divided into the second sample group and the corresponding class labels are then input into the second classification model; the class label proportions within the negative samples likewise tend toward balance, which improves the training effect of the second classification model, ensures the training effect of the classification models, and improves the accuracy of the predictions of the trained classification models.
Preferably, the division module 604 is further configured to divide the first class label with the highest proportion among the class labels and its corresponding characteristic data into a first sample group, and to divide the remaining class labels other than the first class label and their corresponding characteristic data in the sample data set into a second sample group;
and, in the case that the class label proportions in the second sample group are determined to be unbalanced according to the proportions in the sample data set of the remaining class labels other than the first class label, to divide the second class label with the highest proportion among the class labels of the second sample group and its corresponding characteristic data into a third sample group, and to divide the remaining class labels other than the second class label among the class labels of the second sample group and their corresponding characteristic data into a fourth sample group.
The training module 606 is further configured to input the first sample group and the second sample group into a binary classification model for training until a training condition is reached; to input the third sample group and the fourth sample group into a binary classification model for training until a training condition is reached; and to input the fourth sample group into a multi-class classification model for training until a training condition is reached.
Preferably, the device for training a classification model further comprises:
a deletion module, configured to set a first threshold, and to delete from the sample data set the class labels whose proportion is lower than the first threshold and their corresponding characteristic data.
Preferably, the device for training a classification model further comprises:
a merging module, configured to set a second threshold, wherein the second threshold is greater than the first threshold, and to merge the class labels whose proportion lies between the first threshold and the second threshold into a combined class label.
The training module 606 is further configured to input the first sample group and the second sample group into a binary classification model for training until a training condition is reached.
The training module 606 is further configured to input the characteristic data divided into the second sample group and the corresponding class labels into a multi-class classification model for training until a training condition is reached.
Fig. 7 is a schematic structural diagram of a device for data classification according to an embodiment of the present application. The device for data classification comprises:
a receiving module 702, configured to receive characteristic data to be classified;
an input module 704, configured to input the characteristic data to be classified into a first classification model;
a first determining module 706, configured to, in the case that the first classification model outputs a first class label, determine that the characteristic data to be classified belongs to the category corresponding to the first class label;
a second determining module 708, configured to, in the case that the first classification model outputs one of the remaining class labels other than the first class label, input the characteristic data to be classified into a second classification model and determine the category corresponding to the characteristic data to be classified according to the output result of the second classification model.
Preferably, the second determining module 708 is further configured to judge whether the characteristic data to be classified belongs to a combined class;
if so, to determine the category corresponding to the characteristic data to be classified according to the categories corresponding to the characteristic data in the combined class;
if not, to take the output result of the second classification model as the category corresponding to the characteristic data to be classified.
Preferably, the second determining module 708 is further configured to obtain the proportions, in the sample data set, of the class labels corresponding to the at least two categories in the combined class;
and to determine one category in the combined class as the category of the characteristic data to be classified.
In the above device for data classification of the present application, the characteristic data to be classified is input into the first classification model; if the output result of the first classification model is the first class label, the category of the characteristic data to be classified is determined directly; if the output result of the first classification model is one of the remaining class labels other than the first class label, the characteristic data to be classified is input into the second classification model, and the category corresponding to the characteristic data to be classified is determined according to the output result of the second classification model, which greatly improves the accuracy of data classification.
An embodiment of the present application also provides a computing device, comprising a memory, a processor, and computer instructions stored in the memory and executable on the processor, wherein the processor, when executing the instructions, implements the steps of the method for training a classification model or the method for data classification as described above.
An embodiment of the present application also provides a computer-readable storage medium storing computer instructions which, when executed by a processor, implement the steps of the method for training a classification model or the method for data classification as described above.
The above is an exemplary scheme of the computer-readable storage medium of this embodiment. It should be noted that the technical solution of the storage medium and the technical solution of the above method for training a classification model or method for data classification belong to the same concept; for details of the technical solution of the storage medium that are not described in detail, refer to the description of the technical solution of the above method for training a classification model or method for data classification.
An embodiment of the present application also provides a chip storing computer instructions which, when executed by a processor, implement the steps of the method for training a classification model or the method for data classification as described above.
The computer instructions comprise computer program code, which may be in source code form, object code form, an executable file, some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), an electric carrier signal, a telecommunication signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in the jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the computer-readable medium does not include electric carrier signals and telecommunication signals.
It should be noted that, for the sake of brevity, the foregoing method embodiments are described as a series of combinations of actions, but those skilled in the art should understand that the present application is not limited by the described order of actions, because according to the present application some steps may be performed in other orders or simultaneously. Secondly, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present application.
In the above embodiments, each embodiment has its own emphasis; for parts not described in detail in a certain embodiment, refer to the related descriptions of other embodiments.
The preferred embodiments of the present application disclosed above are only intended to help illustrate the present application. The alternative embodiments do not describe all the details exhaustively, nor do they limit the invention to the specific implementations described. Obviously, many modifications and variations can be made according to the content of this specification. These embodiments were selected and specifically described in order to better explain the principles and practical applications of the present application, so that those skilled in the art can better understand and use the present application. The present application is limited only by the claims and their full scope and equivalents.

Claims (14)

1. a kind of method of disaggregated model training characterized by comprising
Sample data set is obtained, the sample data set includes at least three kinds of class labels and the corresponding characteristic of class label According to counting the accounting that the quantity of each class label is concentrated in the sample data;
According to the accounting that the quantity of each class label is concentrated in the sample data, the classification mark that the sample data is concentrated Label and its corresponding characteristic are divided at least two sample groups;
The sample group is input in corresponding disaggregated model and is trained until reaching training condition.
2. the method according to claim 1, wherein according to the quantity of each class label in the sample data Class label that the sample data is concentrated and its corresponding characteristic are divided at least two samples by the accounting of concentration Group, comprising:
The highest first category label of accounting in the class label and its corresponding characteristic are divided into first sample group, Remaining class label and its corresponding characteristic in addition to first category label is concentrated to be divided into second the sample data Sample group;
The sample group is input in corresponding disaggregated model and is trained until reaching training condition, comprising:
In the case that class label accounting is balanced in determining second sample group, by the first sample group and the second sample Group is input to the first disaggregated model and is trained until reach training condition, will be divided into the characteristic of the second sample group and right The class label answered, which is input in the second disaggregated model, to be trained until reaching training condition.
3. The method according to claim 1, characterized in that dividing the class labels in the sample data set and their corresponding characteristic data into at least two sample groups according to the proportion that the quantity of each class label accounts for in the sample data set comprises:
dividing a first category label with the highest proportion among the class labels and its corresponding characteristic data into a first sample group, and dividing the remaining class labels in the sample data set other than the first category label and their corresponding characteristic data into a second sample group;
in the case where the class-label proportions in the second sample group are determined to be unbalanced, dividing a second category label with the highest proportion among the class labels corresponding to the second sample group and its corresponding characteristic data into a third sample group, and dividing the remaining class labels corresponding to the second sample group other than the second category label and their corresponding characteristic data into a fourth sample group;
and inputting the sample groups into corresponding classification models for training until the training condition is reached comprises:
in the case where the class-label proportions in the fourth sample group are determined to be balanced, inputting the first sample group and the second sample group into a two-class classification model for training until the training condition is reached;
inputting the third sample group and the fourth sample group into a two-class classification model for training until the training condition is reached;
inputting the fourth sample group into a multi-class classification model for training until the training condition is reached.
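Claim 3 extends the same idea one level deeper when the second sample group is still unbalanced. The sketch below, again an assumption-laden illustration rather than the patented implementation, trains two two-class routers plus one multi-class model; it requires the fourth group to contain at least two distinct labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_two_level_cascade(X, y, first_label, second_label):
    X, y = np.asarray(X), np.asarray(y)

    # Two-class model separating the first sample group from the second.
    router_1 = LogisticRegression().fit(X, (y == first_label).astype(int))

    # Two-class model separating the third sample group from the fourth.
    rest = y != first_label
    router_2 = LogisticRegression().fit(
        X[rest], (y[rest] == second_label).astype(int))

    # Multi-class model trained on the fourth sample group alone.
    tail = rest & (y != second_label)
    tail_model = LogisticRegression().fit(X[tail], y[tail])
    return router_1, router_2, tail_model
```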
4. The method according to claim 1, characterized in that, after counting the proportion that the quantity of each class label accounts for in the sample data set, the method further comprises:
setting a first threshold;
deleting the class labels whose proportion in the sample data set is lower than the first threshold, together with their corresponding characteristic data.
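As an illustration of claim 4, the helper below drops every class label (and its samples) whose share falls below a first threshold; the 1% default is a hypothetical value, since the claim leaves the threshold unspecified.

```python
from collections import Counter

def drop_rare_labels(samples, first_threshold=0.01):
    # Keep only the labels whose share of the sample set reaches the threshold.
    counts = Counter(label for _, label in samples)
    total = sum(counts.values())
    keep = {label for label, n in counts.items() if n / total >= first_threshold}
    return [(x, y) for x, y in samples if y in keep]
```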
5. The method according to claim 2, characterized in that, before dividing the remaining class labels in the sample data set other than the first category label and their corresponding characteristic data into the second sample group, the method further comprises:
setting a second threshold, the second threshold being greater than the first threshold;
merging the class labels whose proportions lie between the first threshold and the second threshold into a combined class label.
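Claim 5 can be read as relabelling the mid-frequency classes with a single combined class label before the split. The sketch below does exactly that; the threshold values and the "__combined__" label name are illustrative assumptions.

```python
from collections import Counter

def merge_mid_share_labels(samples, first_threshold=0.01,
                           second_threshold=0.05,
                           combined_label="__combined__"):
    counts = Counter(label for _, label in samples)
    total = sum(counts.values())
    share = {label: n / total for label, n in counts.items()}

    # Labels whose share lies between the two thresholds are merged.
    merged = {label for label, s in share.items()
              if first_threshold <= s < second_threshold}
    relabelled = [(x, combined_label if y in merged else y) for x, y in samples]
    return relabelled, merged
```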
6. The method according to claim 2, characterized in that inputting the first sample group and the second sample group into the first classification model for training until the training condition is reached comprises:
inputting the first sample group and the second sample group into a two-class classification model for training until the training condition is reached.
7. The method according to claim 2 or 6, characterized in that inputting the characteristic data divided into the second sample group and the corresponding class labels into the second classification model for training until the training condition is reached comprises:
inputting the characteristic data divided into the second sample group and the corresponding class labels into a multi-class classification model for training until the training condition is reached.
8. A method of data classification, characterized by comprising:
receiving characteristic data to be classified;
inputting the characteristic data to be classified into a first classification model;
in the case where the output of the first classification model is a first category label, determining that the characteristic data to be classified belongs to the category corresponding to the first category label;
in the case where the output of the first classification model is a remaining class label other than the first category label, inputting the characteristic data to be classified into a second classification model, and determining the category corresponding to the characteristic data to be classified according to the output result of the second classification model.
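At inference time, claim 8 routes each sample through the two models in turn. A minimal sketch, reusing the binary router trained in the earlier snippet (so a prediction of 1 is taken to mean the first category label), could look like this; a scikit-learn style predict() is assumed.

```python
def classify(x, first_model, second_model, first_label):
    # The first classification model decides whether the sample belongs to the
    # dominant first category; otherwise the second model decides among the rest.
    if first_model.predict([x])[0] == 1:
        return first_label
    return second_model.predict([x])[0]
```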
9. The method according to claim 8, characterized in that determining the category corresponding to the characteristic data to be classified according to the output result of the second classification model comprises:
judging whether the characteristic data to be classified belongs to a combined class;
if so, determining the category corresponding to the characteristic data to be classified according to the categories corresponding to the characteristic data in the combined class;
if not, taking the output result of the second classification model as the category corresponding to the characteristic data to be classified.
10. The method according to claim 9, characterized in that determining the category corresponding to the characteristic data to be classified according to the categories corresponding to the characteristic data in the combined class comprises:
determining, according to the proportions that the class labels corresponding to at least two categories in the combined class account for in the sample data set, a category in the combined class as the category of the characteristic data to be classified.
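Claims 9 and 10 leave open exactly how a prediction of the combined class is resolved; one plausible reading, sketched below under that assumption, is to fall back to the member class that held the largest share of the original sample data set.

```python
from collections import Counter

def resolve_combined(prediction, samples, merged_labels,
                     combined_label="__combined__"):
    if prediction != combined_label:
        return prediction
    # Pick the member of the combined class with the largest share of the
    # original sample data set (one reading of claim 10, not the only one).
    counts = Counter(y for _, y in samples if y in merged_labels)
    return counts.most_common(1)[0][0]
```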
11. An apparatus for training a classification model, characterized by comprising:
a processing module configured to obtain a sample data set, wherein the sample data set comprises at least three class labels and the characteristic data corresponding to each class label, and to count the proportion that the quantity of each class label accounts for in the sample data set;
a dividing module configured to divide the class labels in the sample data set and their corresponding characteristic data into at least two sample groups according to the proportion that the quantity of each class label accounts for in the sample data set;
a training module configured to input the sample groups into corresponding classification models for training until a training condition is reached.
12. An apparatus for data classification, characterized by comprising:
a receiving module configured to receive characteristic data to be classified;
an input module configured to input the characteristic data to be classified into a first classification model;
a first determining module configured to determine, in the case where the output of the first classification model is a first category label, that the characteristic data to be classified belongs to the category corresponding to the first category label;
a second determining module configured to, in the case where the output of the first classification model is a remaining class label other than the first category label, input the characteristic data to be classified into a second classification model and determine the category corresponding to the characteristic data to be classified according to the output result of the second classification model.
13. A computing device, comprising a memory, a processor, and computer instructions stored in the memory and executable on the processor, characterized in that the processor, when executing the instructions, implements the steps of the method according to any one of claims 1-7 or 8-10.
14. A computer-readable storage medium storing computer instructions, characterized in that the instructions, when executed by a processor, implement the steps of the method according to any one of claims 1-7 or 8-10.
CN201910746175.XA 2019-08-13 2019-08-13 Method and device for training classification model and method and device for data classification Active CN110442722B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910746175.XA CN110442722B (en) 2019-08-13 2019-08-13 Method and device for training classification model and method and device for data classification

Publications (2)

Publication Number Publication Date
CN110442722A true CN110442722A (en) 2019-11-12
CN110442722B CN110442722B (en) 2022-05-13

Family

ID=68435192

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910746175.XA Active CN110442722B (en) 2019-08-13 2019-08-13 Method and device for training classification model and method and device for data classification

Country Status (1)

Country Link
CN (1) CN110442722B (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101980202A (en) * 2010-11-04 2011-02-23 西安电子科技大学 Semi-supervised classification method of unbalance data
CN105446988A (en) * 2014-06-30 2016-03-30 华为技术有限公司 Classification predicting method and device
CN106204083A (en) * 2015-04-30 2016-12-07 中国移动通信集团山东有限公司 A kind of targeted customer's sorting technique, Apparatus and system
CN107577785A (en) * 2017-09-15 2018-01-12 南京大学 A kind of level multi-tag sorting technique suitable for law identification
CN108090503A (en) * 2017-11-28 2018-05-29 东软集团股份有限公司 On-line tuning method, apparatus, storage medium and the electronic equipment of multi-categorizer
CN108470187A (en) * 2018-02-26 2018-08-31 华南理工大学 A kind of class imbalance question classification method based on expansion training dataset
CN108628971A (en) * 2018-04-24 2018-10-09 深圳前海微众银行股份有限公司 File classification method, text classifier and the storage medium of imbalanced data sets
CN109117862A (en) * 2018-06-29 2019-01-01 北京达佳互联信息技术有限公司 Image tag recognition methods, device and server
CN109376179A (en) * 2018-08-24 2019-02-22 苏宁消费金融有限公司 A kind of sample equilibrating method in data mining
CN109344884A (en) * 2018-09-14 2019-02-15 腾讯科技(深圳)有限公司 The method and device of media information classification method, training picture classification model
CN110111344A (en) * 2019-05-13 2019-08-09 广州锟元方青医疗科技有限公司 Pathological section image grading method, apparatus, computer equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
JIANG Siyu et al., "A Multi-label Learning Model Combining Label Correlation and Imbalance" (结合标签相关性和不均衡性的多标签学习模型), Journal of Harbin Institute of Technology (哈尔滨工业大学学报) *

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110929785B (en) * 2019-11-21 2023-12-05 中国科学院深圳先进技术研究院 Data classification method, device, terminal equipment and readable storage medium
CN110929785A (en) * 2019-11-21 2020-03-27 中国科学院深圳先进技术研究院 Data classification method and device, terminal equipment and readable storage medium
CN110889457B (en) * 2019-12-03 2022-08-19 深圳奇迹智慧网络有限公司 Sample image classification training method and device, computer equipment and storage medium
CN110889457A (en) * 2019-12-03 2020-03-17 深圳奇迹智慧网络有限公司 Sample image classification training method and device, computer equipment and storage medium
CN111190973A (en) * 2019-12-31 2020-05-22 税友软件集团股份有限公司 Method, device, equipment and storage medium for classifying statement forms
CN111737520A (en) * 2020-06-22 2020-10-02 Oppo广东移动通信有限公司 Video classification method, video classification device, electronic equipment and storage medium
CN111737520B (en) * 2020-06-22 2023-07-25 Oppo广东移动通信有限公司 Video classification method, video classification device, electronic equipment and storage medium
CN112070138A (en) * 2020-08-31 2020-12-11 新华智云科技有限公司 Multi-label mixed classification model construction method, news classification method and system
CN112070138B (en) * 2020-08-31 2023-09-05 新华智云科技有限公司 Construction method of multi-label mixed classification model, news classification method and system
CN112434157B (en) * 2020-11-05 2024-05-17 平安直通咨询有限公司上海分公司 Method and device for classifying documents in multiple labels, electronic equipment and storage medium
CN112434157A (en) * 2020-11-05 2021-03-02 平安直通咨询有限公司上海分公司 Document multi-label classification method and device, electronic equipment and storage medium
CN113222043B (en) * 2021-05-25 2024-02-02 北京有竹居网络技术有限公司 Image classification method, device, equipment and storage medium
CN113240032A (en) * 2021-05-25 2021-08-10 北京有竹居网络技术有限公司 Image classification method, device, equipment and storage medium
CN113240032B (en) * 2021-05-25 2024-01-30 北京有竹居网络技术有限公司 Image classification method, device, equipment and storage medium
CN113222043A (en) * 2021-05-25 2021-08-06 北京有竹居网络技术有限公司 Image classification method, device, equipment and storage medium
CN113297382A (en) * 2021-06-21 2021-08-24 西南大学 Method for processing instrument and equipment function labeling
CN113297382B (en) * 2021-06-21 2023-04-25 西南大学 Instrument and equipment function labeling processing method
CN113673866A (en) * 2021-08-20 2021-11-19 上海寻梦信息技术有限公司 Crop decision method, model training method and related equipment
CN113723507A (en) * 2021-08-30 2021-11-30 联仁健康医疗大数据科技股份有限公司 Data classification identification determination method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN110442722B (en) 2022-05-13

Similar Documents

Publication Publication Date Title
CN110442722A (en) Method and device for training classification model and method and device for data classification
CN108108902B (en) Risk event warning method and device
CN107844559A (en) A kind of file classifying method, device and electronic equipment
CN107657267B (en) Product potential user mining method and device
CN106951925A (en) Data processing method, device, server and system
CN105389480B (en) Multiclass imbalance genomics data iteration Ensemble feature selection method and system
CN109872162B (en) Wind control classification and identification method and system for processing user complaint information
CN106651057A (en) Mobile terminal user age prediction method based on installation package sequence table
CN105095755A (en) File recognition method and apparatus
JP2019519042A (en) Method and device for pushing information
CN103984703B (en) Mail classification method and device
CN109960727B (en) Personal privacy information automatic detection method and system for unstructured text
US11429810B2 (en) Question answering method, terminal, and non-transitory computer readable storage medium
CN108280542A (en) A kind of optimization method, medium and the equipment of user's portrait model
CN102722713A (en) Handwritten numeral recognition method based on lie group structure data and system thereof
CN105045913B (en) File classification method based on WordNet and latent semantic analysis
WO2021136315A1 (en) Mail classification method and apparatus based on conjoint analysis of behavior structures and semantic content
CN106445908A (en) Text identification method and apparatus
CN107194815B (en) Client segmentation method and system
CN107958270A (en) Classification recognition methods, device, electronic equipment and computer-readable recording medium
CN110543898A (en) Supervised learning method for noise label, data classification processing method and device
CN104809104A (en) Method and system for identifying micro-blog textual emotion
CN114663002A (en) Method and equipment for automatically matching performance assessment indexes
CN109166012A (en) The method and apparatus of classification and information push for stroke predetermined class user
CN104850540A (en) Sentence recognizing method and sentence recognizing device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant