CN106933847B - Method and device for establishing data classification model - Google Patents

Method and device for establishing data classification model

Info

Publication number
CN106933847B
CN106933847B (application number CN201511020749.3A)
Authority
CN
China
Prior art keywords
data
classification model
classification
classifier
category
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201511020749.3A
Other languages
Chinese (zh)
Other versions
CN106933847A (en)
Inventor
赵磊
吕伟胜
梁德兴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Shenzhou Taiyue Software Co Ltd
Original Assignee
Beijing Shenzhou Taiyue Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Shenzhou Taiyue Software Co Ltd filed Critical Beijing Shenzhou Taiyue Software Co Ltd
Priority to CN201511020749.3A
Publication of CN106933847A
Application granted
Publication of CN106933847B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a device for establishing a data classification model. The method comprises: acquiring source data of a specified service type and the category information of the source data, and establishing a plurality of classification models; and taking the classification model with the highest test score as the optimal classification model. Establishing each classification model comprises: randomly extracting partial data from the source data; performing word segmentation on the partial data; calculating, with a feature selection algorithm, a correlation value between each word and each item of category information, and putting the words whose correlation values exceed a first preset value into the feature word set of that category; and inputting all the feature word sets and their category information into a classifier to establish a corresponding classification model. The classification models obtained in this way differ in the data extracted from the source data, the feature selection algorithm used, and/or the classifier used, so the optimal classification model selected from among them is an optimal classification strategy that takes all of these varying parameters into account, giving high accuracy and stability.

Description

Method and device for establishing data classification model
Technical Field
The invention relates to the technical field of data processing, in particular to a method and a device for establishing a data classification model.
Background
Text classification technology analyzes text information statistically, partitions the feature space into regions belonging to the different categories, and finally creates a classification model so that subsequent texts can be classified accurately. Existing classification techniques use a single text feature selection algorithm to characterize the text and produce a feature file, and a single classifier to read that feature file, create the classification model and perform classification prediction. The disadvantage of this approach is that the strategy is fixed, so accuracy and stability cannot be guaranteed for training data from different fields and of different quality.
Disclosure of Invention
In view of the above, the present invention has been developed to provide a method and apparatus for building a data classification model that overcomes, or at least partially solves, the above-mentioned problems.
According to an aspect of the present invention, there is provided a method of building a data classification model, the method comprising:
acquiring source data of a specified service type and category information of the source data, and establishing a plurality of classification models;
respectively testing the plurality of classification models, taking the classification model with the highest test score as an optimal classification model, and classifying data of the specified service type according to the optimal classification model;
wherein, establishing each classification model comprises:
randomly extracting partial data from the source data;
performing word segmentation on the partial data to obtain a plurality of words;
for each category of information, calculating a correlation value between each word and the category of information by using a feature selection algorithm, and putting the words with the correlation values higher than a first preset value into a feature word set of the category of information;
and inputting all the feature word sets and the category information thereof into a classifier, and establishing a corresponding classification model by using the classifier.
Optionally, the plurality of classification models includes:
a plurality of classification models established by using the same feature selection algorithm and the same classifier;
and/or,
a plurality of classification models created using different feature selection algorithms and/or different classifiers.
Optionally, the method further comprises: providing a user input interface, and receiving training times and expected values input by a user through the user input interface;
the establishing of the plurality of classification models comprises: for each feature selection algorithm and each classifier, establishing a number of classification models equal to the training times by using that feature selection algorithm and classifier;
the taking of the classification model with the highest test score as the optimal classification model comprises: taking the classification model that has the highest test score and meets the expected value as the final data classification model.
Optionally, before the performing the word segmentation operation on the partial data, the method further includes:
providing a user input interface through which user input of filter rules is received;
and filtering the partial data according to the filtering rule.
Optionally, the separately testing the plurality of classification models includes:
acquiring a plurality of test data and the category information of each test data;
for each classification model, respectively inputting each test data into the classification model to obtain the classification result of each test data by the classification model; and taking the probability of the classification result in the plurality of test data hitting the corresponding classification information as the test score of the classification model.
Optionally, the calculating the association value between each word and the category information by using the feature selection algorithm includes:
for each word, the association value between the word and the category information is calculated using the CHI algorithm or the CLOSE algorithm.
Optionally, the inputting each feature word set and the corresponding category information into the classifier includes:
and inputting each feature word set and corresponding category information into a LibLinear classifier or a LibSVM classifier.
According to another aspect of the present invention, there is provided an apparatus for building a data classification model, the apparatus comprising:
a data acquisition unit adapted to acquire source data of a specified service type and category information of the source data,
the model establishing unit is suitable for establishing a plurality of classification models according to the acquired source data and the category information of the source data; wherein, establishing each classification model comprises: randomly extracting partial data from the source data; performing word segmentation on the partial data to obtain a plurality of words; for each category of information, calculating a correlation value between each word and the category of information by using a feature selection algorithm, and putting the words with the correlation values higher than a first preset value into a feature word set of the category of information; inputting all the feature word sets and the category information thereof into a classifier, and establishing a corresponding classification model by using the classifier;
and the model selection unit is suitable for respectively testing the plurality of classification models, taking the classification model with the highest test score as an optimal classification model, and classifying the data of the specified service type according to the optimal classification model.
Optionally, the plurality of classification models includes:
a plurality of classification models established by using the same feature selection algorithm and the same classifier;
and/or,
a plurality of classification models created using different feature selection algorithms and/or different classifiers.
Optionally, the data acquisition unit is further adapted to provide a user input interface through which the training times and the expected values input by the user are received;
the model establishing unit is suitable for establishing, for each feature selection algorithm and each classifier, a number of classification models equal to the training times by using that feature selection algorithm and classifier;
and the model selection unit is suitable for taking the classification model with the highest test score and meeting the expected value as the final data classification model.
Optionally, the data obtaining unit is further adapted to provide a user input interface through which the filtering rules input by the user are received;
and the model establishing unit is further suitable for filtering the partial data according to the filtering rule before the word segmentation operation is carried out on the partial data.
Optionally, the model selecting unit is adapted to obtain a plurality of test data and category information of each test data; for each classification model, respectively inputting each test data into the classification model to obtain the classification result of each test data by the classification model; and taking the probability of the classification result in the plurality of test data hitting the corresponding classification information as the test score of the classification model.
Optionally, the model building unit is adapted to calculate, for each word, a correlation value between the word and the category information using a CHI algorithm or a CLOSE algorithm.
Optionally, the model establishing unit is adapted to input each feature word set and the corresponding category information into the LibLinear classifier or the LibSVM classifier together.
It can be known from the above that, in the technical scheme provided by the present invention, for a specified service type, the source data and the category information of the source data are obtained; each classification model is obtained by extracting part of the data, segmenting it into words, calculating the feature word set corresponding to each item of category information, and inputting the feature word sets into a classifier; multiple such runs produce multiple classification models, which are tested separately, and the model with the best test result is selected as the optimal classification model for classifying other data of that service type. The classification models obtained in this way differ in the data extracted from the source data, the feature selection algorithm used, and/or the classifier used, so the optimal classification model selected from among them is an optimal classification strategy that takes all of these varying parameters into account, giving high accuracy and stability.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 illustrates a flow diagram of a method of building a data classification model according to one embodiment of the invention;
FIG. 2 is a schematic diagram of an apparatus for modeling data classification according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
FIG. 1 shows a flow diagram of a method of building a data classification model according to one embodiment of the invention. As shown in fig. 1, the method includes:
step S110, obtaining the source data of the appointed service type and the class information of the source data, and establishing a plurality of classification models.
And step S120, respectively testing the plurality of classification models, taking the classification model with the highest test score as an optimal classification model, and classifying the data of the specified service type according to the optimal classification model.
The process of establishing each classification model in step S110 is as follows:
in step S112, partial data is randomly extracted from the source data.
And step S114, performing word segmentation on the partial data to obtain a plurality of words.
And step S116, for each category of information, calculating a correlation value between each word and the category of information by using a feature selection algorithm, and putting the words with the correlation values higher than a first preset value into a feature word set of the category of information.
And step S118, inputting all the feature word sets and the category information thereof into a classifier, and establishing a corresponding classification model by using the classifier.
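By way of illustration only, the following is a minimal sketch of one model-building pass (steps S112 to S118) in Python, assuming jieba for word segmentation and scikit-learn's LinearSVC (which is implemented with LibLinear) as the classifier; the helper names and the sampling ratio are illustrative and not part of the patent.

```python
import random

import jieba
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC


def build_one_model(source_docs, labels, sample_ratio=0.7):
    # Step S112: randomly extract part of the source data.
    idx = random.sample(range(len(source_docs)), int(len(source_docs) * sample_ratio))
    docs = [source_docs[i] for i in idx]
    y = [labels[i] for i in idx]

    # Step S114: word segmentation.
    segmented = [" ".join(jieba.lcut(d)) for d in docs]

    # Steps S116 and S118: build term features from the segmented words and train
    # the classifier; a feature selection step (e.g. CHI) would filter the term
    # matrix here before training.
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(segmented)
    classifier = LinearSVC().fit(X, y)
    return vectorizer, classifier
```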
It can be seen that, in the method shown in fig. 1, for a specified service type, the source data and the category information of the source data are obtained; each classification model is obtained by extracting part of the data, segmenting it into words, calculating the feature word set corresponding to each item of category information, and inputting the feature word sets into a classifier; multiple such runs produce multiple classification models, which are tested separately, and the model with the best test result is selected as the optimal classification model for classifying other data of that service type. The classification models obtained in this way differ in the data extracted from the source data, the feature selection algorithm used, and/or the classifier used, so the optimal classification model selected from among them is an optimal classification strategy that takes all of these varying parameters into account, giving high accuracy and stability.
In this embodiment, each time a classification model is built through the above steps S112 to S118, the part of data randomly extracted in step S112 may be different, which results in different word segmentation results in step S114, the feature selection algorithms used in step S116 may be different, which results in different feature word sets, and the classifiers used in step S118 may be different, which results in different finally built classification models. Thus, the plurality of classification models established in fig. 1 includes: a plurality of classification models established by using the same feature selection algorithm and the same classifier; and/or a plurality of classification models established using different feature selection algorithms and/or different classifiers.
For example, in a first run, partial data is randomly extracted from the data source, segmented into words, the feature word set of each category of the data source is obtained with feature selection algorithm a, and a corresponding first classification model is established with classifier b. In a second run, partial data is again randomly extracted, segmented into words, processed with feature selection algorithm a, and a corresponding second classification model is established with classifier b. Because the data extracted in the first run differs from that extracted in the second run, multiple different classification models can be obtained from multiple runs even with the same algorithm and classifier. In addition, a third run may extract partial data, segment it into words, obtain the feature word sets with feature selection algorithm a, and establish a third classification model with classifier b, while a fourth run may extract different partial data, obtain the feature word sets with feature selection algorithm a, and establish a fourth classification model with classifier c. Because both the extracted data and the classifiers differ, different classification models can also be obtained on the basis of different feature selection algorithms and/or different classifiers.
Further, before the word segmentation operation is performed on the partial data in step S114, the method shown in fig. 1 may further include: filtering the partial data extracted from the source data to remove meaningless characters, such as digits and English letters, that are irrelevant to the specified service type.
In a specific embodiment, a user input interface can be provided externally, through which user-defined filtering rules are received; the partial data is then filtered according to these rules, reducing the data processing burden in the subsequent model building process.
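A minimal sketch of such rule-based filtering, assuming the user-supplied rules are regular expressions (an assumption; the description only says the rules are user-defined):

```python
import re


def apply_filter_rules(text, rules=(r"\d+", r"[A-Za-z]+")):
    # The default rules strip digits and English letters, the examples of
    # service-irrelevant characters given above; user-defined rules replace them.
    for pattern in rules:
        text = re.sub(pattern, "", text)
    return text
```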
In an embodiment of the present invention, the step S120 of testing the plurality of classification models respectively includes: acquiring a plurality of test data and the category information of each test data; for each classification model, respectively inputting each test data into the classification model to obtain the classification result of each test data by the classification model; and taking the probability of the classification result in the plurality of test data hitting the corresponding classification information as the test score of the classification model.
The implementation of the scheme is illustrated with a specific example: an article belonging to the field of physics research is selected as the source data for this field, and the category information of the article comprises: "microstructure" and "laser". Multiple runs are performed on the article to establish corresponding classification models, where each run proceeds as follows:
step S1102, acquiring data: portions of data (e.g., one or more paragraphs) are randomly extracted from the article.
Step S1104, preprocessing: filtering meaningless information in the partial data, such as publication time, title and author information, and performing word segmentation on the filtered partial data to obtain a plurality of words.
Step S1106, selecting a policy: in this example, the candidate feature selection algorithms are the CHI algorithm and the CLOSE algorithm. Taking the CHI algorithm as an example, to calculate the association between a word T and a category M, a plurality of articles is examined and the following counts are obtained: A is the number of articles that contain T and belong to category M, B is the number of articles that contain T and do not belong to M, C is the number of articles that do not contain T and belong to M, and D is the number of articles that neither contain T nor belong to M. The association between the word T and the category M is then (AD − BC)² / ((A + B)(C + D)), which reflects how representative the word T is of the category M; a larger association value indicates greater representativeness. Similarly, the CLOSE algorithm also characterizes the representativeness of a word for a category by calculating the association between the word and the category. In this example, the candidate classifiers are the LibLinear classifier and the LibSVM classifier. The LibSVM kernel uses nonlinear computation, whereas the LibLinear kernel uses linear computation (for example, the division between male and female is linear). Depending on the data volume and the complexity of the class division, LibLinear kernel computation is far more efficient than LibSVM when analysing large amounts of data. The feature word sets of the different categories are input into a classifier, which models the relevant information to obtain a corresponding classification model. With the feature selection algorithms and classifiers provided in this example, there are in total 4 combinations of feature selection algorithm and classifier, namely CHI + LibSVM, CLOSE + LibSVM, CHI + LibLinear and CLOSE + LibLinear, and one combination is selected for each run.
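A minimal sketch of this association computation, using the simplified CHI statistic reconstructed above, (AD − BC)² / ((A + B)(C + D)); documents are represented as sets of words, and all names are illustrative rather than taken from the patent:

```python
def chi_association(word, category, doc_word_sets, doc_labels):
    a = b = c = d = 0
    for words, label in zip(doc_word_sets, doc_labels):
        contains, in_cat = word in words, label == category
        if contains and in_cat:
            a += 1      # A: contains the word and belongs to the category
        elif contains:
            b += 1      # B: contains the word, other category
        elif in_cat:
            c += 1      # C: lacks the word, belongs to the category
        else:
            d += 1      # D: lacks the word, other category
    denom = (a + b) * (c + d)
    return 0.0 if denom == 0 else (a * d - b * c) ** 2 / denom


def feature_word_set(vocabulary, category, doc_word_sets, doc_labels, threshold):
    # Keep only the words whose association value exceeds the first preset
    # threshold (used in the characterizing step described next).
    return {w for w in vocabulary
            if chi_association(w, category, doc_word_sets, doc_labels) > threshold}
```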
Step S1108, characterizing: the selected feature selection algorithm is used to calculate the association value between each word and the category "microstructure", and words whose association values exceed the first preset threshold are put into the feature word set corresponding to "microstructure"; the same algorithm is then used to calculate the association values between the words and the category "laser", and words whose association values exceed the first preset threshold are put into the feature word set corresponding to "laser". The feature word set obtained for "microstructure" means that each word in it has some degree of influence on judging whether a text belongs to the "microstructure" category; likewise, each word in the feature word set for "laser" has some degree of influence on judging whether a text belongs to the "laser" category.
Step S1110, creating a classification model: after the feature word set corresponding to the category "microstructure" and the feature word set corresponding to the category "laser" are obtained, the two feature word sets and the category information corresponding to the two feature word sets are input into the selected classifier, and the classification model corresponding to the operation is obtained.
Step S1112, testing the accuracy of each classification model: a group of test data is provided, in which the category information corresponding to each test datum is known. For each classification model, the test data are input into the model to obtain its classification result for each test datum, and each classification result is matched against the known category information of that test datum; a match means the test is passed. The pass rate over the group of test data is taken as the test score of the classification model and represents its accuracy.
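A minimal sketch of this test scoring, where predict_fn stands for the trained classification model's prediction function (an illustrative interface, not the patent's):

```python
def test_score(predict_fn, test_items, true_labels):
    # The score is the fraction of test items whose predicted category
    # matches the known category information.
    hits = sum(1 for item, label in zip(test_items, true_labels)
               if predict_fn(item) == label)
    return hits / len(test_items)
```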
The above steps S1102 to S1112 constitute one training run of the classification model. More specifically, a user input interface can be provided externally to receive the training times and an expected value input by the user, where the training times is the number of classification models the user wants to obtain for each combination of feature selection algorithm and classifier, and the expected value is the test score the user expects the resulting classification model to reach. Table 1 shows the classification model training results with an expected value of 80% and a default of 4 training runs:
TABLE 1
Feature selection algorithm and classifier    Run 1    Run 2    Run 3    Run 4
CHI-Liblinear 78.6666% 82.3333% 80.3333% 76.6666%
CHI-LibSVM 72.6666% 74.6666% 76.6666% 74%
CLOSE-Liblinear 79.3333% 80.3333% 78.6666% 72.6666%
CLOSE-LibSVM 76.6666% 74% 72.6666% 77.3333%
As can be seen from Table 1, the second CHI-LibLinear classification model, i.e. the model with the test score of 82.3333%, is finally selected as the optimal classification model for classifying data in the physics research field in the future. Of course, if the expected value is not reached after the 4 training runs, the model with the highest test score is automatically taken as the optimal classification model. This automatic preference technique leaves the choice to the system, so the user does not need to worry about which scheme is most suitable and only needs to provide the expected value and the training times for the system to compute automatically.
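A minimal sketch of this automatic preference step, assuming hypothetical helpers build_model and score_model along the lines of the earlier sketches; it trains the requested number of models per combination and returns the highest-scoring one together with whether the expected value was met:

```python
def select_best_model(combinations, times, expected, build_model, score_model):
    scored = []
    for feature_algo, classifier in combinations:  # e.g. the 4 CHI/CLOSE and LibLinear/LibSVM pairs
        for _ in range(times):                     # the user-supplied training times
            model = build_model(feature_algo, classifier)
            scored.append((score_model(model), model))
    best_score, best_model = max(scored, key=lambda pair: pair[0])
    # The highest-scoring model is preferred when it reaches the expected value;
    # if no model reaches it, the highest-scoring model is still used.
    return best_model, best_score, best_score >= expected
```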
Fig. 2 is a schematic diagram of an apparatus for modeling data classification according to an embodiment of the present invention, and as shown in fig. 2, the apparatus 200 for modeling data classification includes:
the data obtaining unit 210 is adapted to obtain source data of a specified service type and category information of the source data.
A model establishing unit 220 adapted to establish a plurality of classification models according to the acquired source data and the category information of the source data; wherein, establishing each classification model comprises: randomly extracting partial data from the source data; performing word segmentation on the partial data to obtain a plurality of words; for each category of information, calculating a correlation value between each word and the category of information by using a feature selection algorithm, and putting the words with the correlation values higher than a first preset value into a feature word set of the category of information; and inputting all the feature word sets and the category information thereof into a classifier, and establishing a corresponding classification model by using the classifier.
The model selecting unit 230 is adapted to test the plurality of classification models respectively, take the classification model with the highest test score as the optimal classification model, and classify the data of the specified service type according to the optimal classification model.
It can be seen that, for a specified service type, the apparatus shown in fig. 2 obtains the source data and the category information of the source data; each classification model is obtained by extracting part of the data, segmenting it into words, calculating the feature word set corresponding to each item of category information, and inputting the feature word sets into a classifier; multiple such runs produce multiple classification models, which are tested separately, and the model with the best test result is selected as the optimal classification model for classifying other data of that service type. The classification models obtained in this way differ in the data extracted from the source data, the feature selection algorithm used, and/or the classifier used, so the optimal classification model selected from among them is an optimal classification strategy that takes all of these varying parameters into account, giving high accuracy and stability.
In one embodiment of the present invention, the plurality of classification models includes: a plurality of classification models established by using the same feature selection algorithm and the same classifier; and/or a plurality of classification models established using different feature selection algorithms and/or different classifiers.
Specifically, the data obtaining unit 210 is further adapted to provide a user input interface, through which the training times and the expected values input by the user are received; a model establishing unit 220 adapted to establish, for each feature selection algorithm and each classifier, a classification model in accordance with the number of training times using the feature selection algorithm and the classifier; the model selecting unit 230 is adapted to use the classification model with the highest test score and satisfying the expected value as the final data classification model.
In an embodiment of the present invention, the data obtaining unit 210 is further adapted to provide a user input interface through which the user input of the filtering rules is received; the model building unit 220 is further adapted to filter the partial data according to the filtering rule before the word segmentation operation is performed on the partial data.
In an embodiment of the present invention, the model selecting unit 230 is adapted to obtain a plurality of test data and category information of each test data; for each classification model, respectively inputting each test data into the classification model to obtain the classification result of each test data by the classification model; and taking the probability of the classification result in the plurality of test data hitting the corresponding classification information as the test score of the classification model.
In an embodiment of the invention the model establishing unit 220 is adapted to calculate for each word an association value between the word and the category information using the CHI algorithm or the CLOSE algorithm.
In an embodiment of the present invention, the model establishing unit 220 is adapted to input each feature word set and the corresponding category information into the LibLinear classifier or the LibSVM classifier.
It should be noted that the embodiments of the apparatus shown in fig. 2 are the same as the embodiments shown in fig. 1, and the detailed description is given above and will not be repeated herein.
In summary, for a specified service type, the technical solution provided by the present invention obtains the source data and the category information of the source data; each classification model is obtained by extracting part of the data, segmenting it into words, calculating the feature word set corresponding to each item of category information, and inputting the feature word sets into a classifier; multiple such runs produce multiple classification models, which are tested separately, and the model with the best test result is selected as the optimal classification model for classifying other data of that service type. The classification models obtained in this way differ in the data extracted from the source data, the feature selection algorithm used, and/or the classifier used, so the optimal classification model selected from among them is an optimal classification strategy that takes all of these varying parameters into account, giving high accuracy and stability.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (8)

1. A method of building a data classification model, the method comprising:
acquiring source data of a specified service type and category information of the source data, and establishing a plurality of classification models;
respectively testing the plurality of classification models, taking the classification model with the highest test score as an optimal classification model, and classifying data of the specified service type according to the optimal classification model;
wherein, establishing each classification model comprises:
randomly extracting partial data from the source data;
performing word segmentation on the partial data to obtain a plurality of words;
for each category of information, calculating a correlation value between each word and the category of information by using a feature selection algorithm, and putting the words with the correlation values higher than a first preset value into a feature word set of the category of information;
inputting all the feature word sets and the category information thereof into a classifier, and establishing a corresponding classification model by using the classifier;
the plurality of classification models includes:
a plurality of classification models established by using the same feature selection algorithm and the same classifier;
and/or,
a plurality of classification models created using different feature selection algorithms and/or different classifiers.
2. The method of claim 1, further comprising: providing a user input interface, and receiving training times and expected values input by a user through the user input interface;
the establishing of the plurality of classification models comprises: for each feature selection algorithm and each classifier, establishing a number of classification models equal to the training times by using that feature selection algorithm and classifier;
the taking of the classification model with the highest test score as the optimal classification model comprises: taking the classification model that has the highest test score and meets the expected value as the final data classification model.
3. The method of claim 1, wherein prior to said performing a word segmentation operation on said portion of data, the method further comprises:
providing a user input interface through which user input of filter rules is received;
and filtering the partial data according to the filtering rule.
4. The method of claim 1, wherein said separately testing the plurality of classification models comprises:
acquiring a plurality of test data and the category information of each test data;
for each classification model, respectively inputting each test data into the classification model to obtain the classification result of each test data by the classification model; and taking the probability of the classification result in the plurality of test data hitting the corresponding classification information as the test score of the classification model.
5. The method of claim 1, wherein calculating the association value between each word and the category information using a feature selection algorithm comprises:
for each word, the association value between the word and the category information is calculated using the CHI algorithm or the CLOSE algorithm.
6. The method of claim 1, wherein inputting each feature word set into a classifier along with corresponding category information comprises:
and inputting each feature word set and corresponding category information into a LibLinear classifier or a LibSVM classifier.
7. An apparatus for modeling a classification of data, the apparatus comprising:
a data acquisition unit adapted to acquire source data of a specified service type and category information of the source data,
the model establishing unit is suitable for establishing a plurality of classification models according to the acquired source data and the category information of the source data; wherein, establishing each classification model comprises: randomly extracting partial data from the source data; performing word segmentation on the partial data to obtain a plurality of words; for each category of information, calculating a correlation value between each word and the category of information by using a feature selection algorithm, and putting the words with the correlation values higher than a first preset value into a feature word set of the category of information; inputting all the feature word sets and the category information thereof into a classifier, and establishing a corresponding classification model by using the classifier;
the model selection unit is suitable for respectively testing the plurality of classification models, taking the classification model with the highest test score as an optimal classification model, and classifying the data of the specified service type according to the optimal classification model;
the plurality of classification models includes:
a plurality of classification models established by using the same feature selection algorithm and the same classifier;
and/or,
a plurality of classification models created using different feature selection algorithms and/or different classifiers.
8. The apparatus of claim 7,
the data acquisition unit is further suitable for providing a user input interface, and the training times and the expected values input by the user are received through the user input interface;
the model establishing unit is suitable for establishing a classification model which accords with the number of the training times by utilizing the feature selection algorithm and the classifier for each feature selection algorithm and each classifier;
and the model selection unit is suitable for taking the classification model with the highest test score and meeting the expected value as the final data classification model.
CN201511020749.3A 2015-12-30 2015-12-30 Method and device for establishing data classification model Active CN106933847B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201511020749.3A CN106933847B (en) 2015-12-30 2015-12-30 Method and device for establishing data classification model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201511020749.3A CN106933847B (en) 2015-12-30 2015-12-30 Method and device for establishing data classification model

Publications (2)

Publication Number Publication Date
CN106933847A CN106933847A (en) 2017-07-07
CN106933847B (en) 2019-12-27

Family

ID=59441623

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201511020749.3A Active CN106933847B (en) 2015-12-30 2015-12-30 Method and device for establishing data classification model

Country Status (1)

Country Link
CN (1) CN106933847B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110427457A (en) * 2019-06-28 2019-11-08 厦门美域中央信息科技有限公司 A feature selection method for ANN-based database text classification

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6192360B1 (en) * 1998-06-23 2001-02-20 Microsoft Corporation Methods and apparatus for classifying text and for building a text classifier
CN104063458A (en) * 2014-06-26 2014-09-24 北京奇虎科技有限公司 Method and device for providing corresponding solution for terminal fault problem
CN104331498A (en) * 2014-11-19 2015-02-04 亚信科技(南京)有限公司 Method for automatically classifying webpage content visited by Internet users
CN104699772A (en) * 2015-03-05 2015-06-10 孟海东 Big data text classifying method based on cloud computing
CN104965822A (en) * 2015-07-29 2015-10-07 中南大学 Emotion analysis method for Chinese texts based on computer information processing technology
CN104965867A (en) * 2015-06-08 2015-10-07 南京师范大学 Text event classification method based on CHI feature selection
CN105069141A (en) * 2015-08-19 2015-11-18 北京工商大学 Construction method and construction system for stock standard news library

Also Published As

Publication number Publication date
CN106933847A (en) 2017-07-07

Similar Documents

Publication Publication Date Title
CN110674881B (en) Trademark image retrieval model training method, system, storage medium and computer equipment
CN111126386B (en) Sequence domain adaptation method based on countermeasure learning in scene text recognition
CN113590764B (en) Training sample construction method and device, electronic equipment and storage medium
CN110069630B (en) Improved mutual information feature selection method
US9824313B2 (en) Filtering content in an online system based on text and image signals extracted from the content
CN106776566B (en) Method and device for recognizing emotion vocabulary
CN107273500A (en) Text classifier generation method, file classification method, device and computer equipment
CN110503143B (en) Threshold selection method, device, storage medium and device based on intention recognition
CN106897290B (en) Method and device for establishing keyword model
CN112711983B (en) Nuclear analysis system, method, electronic device, and readable storage medium
CN103824090A (en) Adaptive face low-level feature selection method and face attribute recognition method
CN111210402A (en) Face image quality scoring method and device, computer equipment and storage medium
CN105117740A (en) Font identification method and device
CN108280164A (en) A kind of short text filtering and sorting technique based on classification related words
CN105912525A (en) Sentiment classification method for semi-supervised learning based on theme characteristics
CN106202274A (en) A kind of defective data automatic abstract sorting technique based on Bayesian network
CN112200862B (en) Training method of target detection model, target detection method and device
CN104809229A (en) Method and system for extracting text characteristic words
CN112785566B (en) Metaphase image scoring method, metaphase image scoring device, electronic equipment and storage medium
CN106933847B (en) Method and device for establishing data classification model
CN108090040A (en) A kind of text message sorting technique and system
CN108363967A (en) A kind of categorizing system of remote sensing images scene
CN107729909B (en) Application method and device of attribute classifier
CN112489689A (en) Cross-database voice emotion recognition method and device based on multi-scale difference confrontation
CN109214275B (en) Vulgar picture identification method based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP02 Change in the address of a patent holder

Address after: Room 818, 8 / F, 34 Haidian Street, Haidian District, Beijing 100080

Patentee after: BEIJING ULTRAPOWER SOFTWARE Co.,Ltd.

Address before: 100089 Beijing city Haidian District wanquanzhuang Road No. 28 Wanliu new building 6 storey block A Room 601

Patentee before: BEIJING ULTRAPOWER SOFTWARE Co.,Ltd.