CN106933847B - Method and device for establishing data classification model - Google Patents

Method and device for establishing data classification model

Info

Publication number
CN106933847B
CN106933847B (application number CN201511020749.3A)
Authority
CN
China
Prior art keywords
data
classification model
classification
classifier
category
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201511020749.3A
Other languages
Chinese (zh)
Other versions
CN106933847A (en)
Inventor
赵磊
吕伟胜
梁德兴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Shenzhou Taiyue Software Co Ltd
Original Assignee
Beijing Shenzhou Taiyue Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Shenzhou Taiyue Software Co Ltd filed Critical Beijing Shenzhou Taiyue Software Co Ltd
Priority to CN201511020749.3A
Publication of CN106933847A
Application granted
Publication of CN106933847B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a device for establishing a data classification model. The method comprises: acquiring source data of a specified service type and the category information of the source data, and establishing a plurality of classification models; and taking the classification model with the highest test score as the optimal classification model. Establishing each classification model comprises: randomly extracting partial data from the source data; performing word segmentation on the partial data; calculating, with a feature selection algorithm, a correlation value between each word and each item of category information, and putting the words whose correlation values exceed a first preset value into the feature word set of that category; and inputting all the feature word sets and their category information into a classifier to establish a corresponding classification model. The classification models obtained in this way differ in the data extracted from the source data, the feature selection algorithm used, and/or the classifier used, so the optimal classification model selected from among them is an optimal classification strategy that takes all of these varying parameters into account, giving high accuracy and stability.

Description

Method and device for establishing data classification model
Technical Field
The invention relates to the technical field of data processing, in particular to a method and a device for establishing a data classification model.
Background
Text classification technology analyzes text information statistically, partitions the feature space into regions belonging to the different categories, and finally creates a classification model so that subsequent texts can be classified accurately. Existing classification techniques use a single text feature selection algorithm to characterize the text and produce a feature file, and a single classifier to read that feature file, create the classification model and perform classification prediction. The disadvantage of this approach is that the strategy is fixed, so accuracy and stability cannot be guaranteed for training data from different fields and of different quality.
Disclosure of Invention
In view of the above, the present invention has been developed to provide a method and apparatus for building a data classification model that overcomes, or at least partially solves, the above-mentioned problems.
According to an aspect of the present invention, there is provided a method of building a data classification model, the method comprising:
acquiring source data of a specified service type and category information of the source data, and establishing a plurality of classification models;
respectively testing the plurality of classification models, taking the classification model with the highest test score as an optimal classification model, and classifying data of the specified service type according to the optimal classification model;
wherein, establishing each classification model comprises:
randomly extracting partial data from the source data;
performing word segmentation on the partial data to obtain a plurality of words;
for each category of information, calculating a correlation value between each word and the category of information by using a feature selection algorithm, and putting the words with the correlation values higher than a first preset value into a feature word set of the category of information;
and inputting all the feature word sets and the category information thereof into a classifier, and establishing a corresponding classification model by using the classifier.
Optionally, the plurality of classification models includes:
a plurality of classification models established by using the same feature selection algorithm and the same classifier;
and/or,
a plurality of classification models created using different feature selection algorithms and/or different classifiers.
Optionally, the method further comprises: providing a user input interface, and receiving training times and expected values input by a user through the user input interface;
the establishing of the plurality of classification models comprises: for each feature selection algorithm and each classifier, establishing a number of classification models equal to the training times by using that feature selection algorithm and classifier;
the taking of the classification model with the highest test score as the optimal classification model comprises: taking the classification model that has the highest test score and meets the expected value as the final data classification model.
Optionally, before the performing the word segmentation operation on the partial data, the method further includes:
providing a user input interface through which user input of filter rules is received;
and filtering the partial data according to the filtering rule.
Optionally, the separately testing the plurality of classification models includes:
acquiring a plurality of test data and the category information of each test data;
for each classification model, respectively inputting each test data into the classification model to obtain the classification result of each test data by the classification model; and taking the probability of the classification result in the plurality of test data hitting the corresponding classification information as the test score of the classification model.
Optionally, the calculating the association value between each word and the category information by using the feature selection algorithm includes:
for each word, the association value between the word and the category information is calculated using the CHI algorithm or the CLOSE algorithm.
Optionally, the inputting each feature word set and the corresponding category information into the classifier includes:
and inputting each feature word set and corresponding category information into a LibLinear classifier or a LibSVM classifier.
According to another aspect of the present invention, there is provided an apparatus for building a data classification model, the apparatus comprising:
a data acquisition unit adapted to acquire source data of a specified service type and category information of the source data,
the model establishing unit is suitable for establishing a plurality of classification models according to the acquired source data and the category information of the source data; wherein, establishing each classification model comprises: randomly extracting partial data from the source data; performing word segmentation on the partial data to obtain a plurality of words; for each category of information, calculating a correlation value between each word and the category of information by using a feature selection algorithm, and putting the words with the correlation values higher than a first preset value into a feature word set of the category of information; inputting all the feature word sets and the category information thereof into a classifier, and establishing a corresponding classification model by using the classifier;
and the model selection unit is suitable for respectively testing the plurality of classification models, taking the classification model with the highest test score as an optimal classification model, and classifying the data of the specified service type according to the optimal classification model.
Optionally, the plurality of classification models includes:
a plurality of classification models established by using the same feature selection algorithm and the same classifier;
and/or,
a plurality of classification models created using different feature selection algorithms and/or different classifiers.
Optionally, the data acquisition unit is further adapted to provide a user input interface through which the training times and the expected values input by the user are received;
the model establishing unit is suitable for establishing, for each feature selection algorithm and each classifier, a number of classification models equal to the training times by using that feature selection algorithm and classifier;
and the model selection unit is suitable for taking the classification model with the highest test score and meeting the expected value as the final data classification model.
Optionally, the data obtaining unit is further adapted to provide a user input interface through which the filtering rules input by the user are received;
and the model establishing unit is further suitable for filtering the partial data according to the filtering rule before the word segmentation operation is carried out on the partial data.
Optionally, the model selecting unit is adapted to obtain a plurality of test data and category information of each test data; for each classification model, respectively inputting each test data into the classification model to obtain the classification result of each test data by the classification model; and taking the probability of the classification result in the plurality of test data hitting the corresponding classification information as the test score of the classification model.
Optionally, the model building unit is adapted to calculate, for each word, a correlation value between the word and the category information using a CHI algorithm or a CLOSE algorithm.
Optionally, the model establishing unit is adapted to input each feature word set and the corresponding category information into the LibLinear classifier or the LibSVM classifier together.
It can be known from the above that, in the technical scheme provided by the present invention, for a specified service type, the source data and the category information of the source data are obtained; each classification model is obtained by extracting part of the data, segmenting it into words, calculating the feature word set corresponding to each item of category information, and inputting the feature word sets into a classifier; multiple such runs produce multiple classification models, which are tested separately, and the model with the best test result is selected as the optimal classification model for classifying other data of that service type. The classification models obtained in this way differ in the data extracted from the source data, the feature selection algorithm used, and/or the classifier used, so the optimal classification model selected from among them is an optimal classification strategy that takes all of these varying parameters into account, giving high accuracy and stability.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 illustrates a flow diagram of a method of building a data classification model according to one embodiment of the invention;
FIG. 2 is a schematic diagram of an apparatus for modeling data classification according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
FIG. 1 shows a flow diagram of a method of building a data classification model according to one embodiment of the invention. As shown in fig. 1, the method includes:
step S110, obtaining the source data of the appointed service type and the class information of the source data, and establishing a plurality of classification models.
And step S120, respectively testing the plurality of classification models, taking the classification model with the highest test score as an optimal classification model, and classifying the data of the specified service type according to the optimal classification model.
The process of establishing each classification model in step S110 is as follows:
in step S112, partial data is randomly extracted from the source data.
And step S114, performing word segmentation on the partial data to obtain a plurality of words.
And step S116, for each category of information, calculating a correlation value between each word and the category of information by using a feature selection algorithm, and putting the words with the correlation values higher than a first preset value into a feature word set of the category of information.
And step S118, inputting all the feature word sets and the category information thereof into a classifier, and establishing a corresponding classification model by using the classifier.
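By way of illustration only, the following is a minimal sketch of one model-building pass (steps S112 to S118) in Python, assuming jieba for word segmentation and scikit-learn's LinearSVC (which is implemented with LibLinear) as the classifier; the helper names and the sampling ratio are illustrative and not part of the patent.

```python
import random

import jieba
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC


def build_one_model(source_docs, labels, sample_ratio=0.7):
    # Step S112: randomly extract part of the source data.
    idx = random.sample(range(len(source_docs)), int(len(source_docs) * sample_ratio))
    docs = [source_docs[i] for i in idx]
    y = [labels[i] for i in idx]

    # Step S114: word segmentation.
    segmented = [" ".join(jieba.lcut(d)) for d in docs]

    # Steps S116 and S118: build term features from the segmented words and train
    # the classifier; a feature selection step (e.g. CHI) would filter the term
    # matrix here before training.
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(segmented)
    classifier = LinearSVC().fit(X, y)
    return vectorizer, classifier
```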
It can be seen that, in the method shown in fig. 1, for a specified service type, the source data and the category information of the source data are obtained; each classification model is obtained by extracting part of the data, segmenting it into words, calculating the feature word set corresponding to each item of category information, and inputting the feature word sets into a classifier; multiple such runs produce multiple classification models, which are tested separately, and the model with the best test result is selected as the optimal classification model for classifying other data of that service type. The classification models obtained in this way differ in the data extracted from the source data, the feature selection algorithm used, and/or the classifier used, so the optimal classification model selected from among them is an optimal classification strategy that takes all of these varying parameters into account, giving high accuracy and stability.
In this embodiment, each time a classification model is built through the above steps S112 to S118, the part of data randomly extracted in step S112 may be different, which results in different word segmentation results in step S114, the feature selection algorithms used in step S116 may be different, which results in different feature word sets, and the classifiers used in step S118 may be different, which results in different finally built classification models. Thus, the plurality of classification models established in fig. 1 includes: a plurality of classification models established by using the same feature selection algorithm and the same classifier; and/or a plurality of classification models established using different feature selection algorithms and/or different classifiers.
For example, in a first run, partial data is randomly extracted from the data source, segmented into words, the feature word set of each category of the data source is obtained with feature selection algorithm a, and a corresponding first classification model is established with classifier b. In a second run, partial data is again randomly extracted, segmented into words, processed with feature selection algorithm a, and a corresponding second classification model is established with classifier b. Because the data extracted in the first run differs from that extracted in the second run, multiple different classification models can be obtained from multiple runs even with the same algorithm and classifier. In addition, a third run may extract partial data, segment it into words, obtain the feature word sets with feature selection algorithm a, and establish a third classification model with classifier b, while a fourth run may extract different partial data, obtain the feature word sets with feature selection algorithm a, and establish a fourth classification model with classifier c. Because both the extracted data and the classifiers differ, different classification models can also be obtained on the basis of different feature selection algorithms and/or different classifiers.
Further, before the word segmentation operation is performed on the partial data in step S114, the method shown in fig. 1 may further include: filtering the partial data extracted from the source data to remove meaningless characters, such as digits and English letters, that are irrelevant to the specified service type.
In a specific embodiment, a user input interface can be provided externally, through which user-defined filtering rules are received; the partial data is then filtered according to these rules, reducing the data processing burden in the subsequent model building process.
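A minimal sketch of such rule-based filtering, assuming the user-supplied rules are regular expressions (an assumption; the description only says the rules are user-defined):

```python
import re


def apply_filter_rules(text, rules=(r"\d+", r"[A-Za-z]+")):
    # The default rules strip digits and English letters, the examples of
    # service-irrelevant characters given above; user-defined rules replace them.
    for pattern in rules:
        text = re.sub(pattern, "", text)
    return text
```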
In an embodiment of the present invention, the step S120 of testing the plurality of classification models respectively includes: acquiring a plurality of test data and the category information of each test data; for each classification model, respectively inputting each test data into the classification model to obtain the classification result of each test data by the classification model; and taking the probability of the classification result in the plurality of test data hitting the corresponding classification information as the test score of the classification model.
The implementation of the scheme is illustrated with a specific example: an article belonging to the field of physics research is selected as the source data for this field, and the category information of the article comprises: "microstructure" and "laser". Multiple runs are performed on the article to establish corresponding classification models, where each run proceeds as follows:
step S1102, acquiring data: portions of data (e.g., one or more paragraphs) are randomly extracted from the article.
Step S1104, preprocessing: filtering meaningless information in the partial data, such as publication time, title and author information, and performing word segmentation on the filtered partial data to obtain a plurality of words.
Step S1106, selecting a policy: in this example, the candidate feature selection algorithms are the CHI algorithm and the CLOSE algorithm. Taking the CHI algorithm as an example, to calculate the association between a word T and a category M, a plurality of articles is examined and the following counts are obtained: A is the number of articles that contain T and belong to category M, B is the number of articles that contain T and do not belong to M, C is the number of articles that do not contain T and belong to M, and D is the number of articles that neither contain T nor belong to M. The association between the word T and the category M is then (AD − BC)² / ((A + B)(C + D)), which reflects how representative the word T is of the category M; a larger association value indicates greater representativeness. Similarly, the CLOSE algorithm also characterizes the representativeness of a word for a category by calculating the association between the word and the category. In this example, the candidate classifiers are the LibLinear classifier and the LibSVM classifier. The LibSVM kernel uses nonlinear computation, whereas the LibLinear kernel uses linear computation (for example, the division between male and female is linear). Depending on the data volume and the complexity of the class division, LibLinear kernel computation is far more efficient than LibSVM when analysing large amounts of data. The feature word sets of the different categories are input into a classifier, which models the relevant information to obtain a corresponding classification model. With the feature selection algorithms and classifiers provided in this example, there are in total 4 combinations of feature selection algorithm and classifier, namely CHI + LibSVM, CLOSE + LibSVM, CHI + LibLinear and CLOSE + LibLinear, and one combination is selected for each run.
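A minimal sketch of this association computation, using the simplified CHI statistic reconstructed above, (AD − BC)² / ((A + B)(C + D)); documents are represented as sets of words, and all names are illustrative rather than taken from the patent:

```python
def chi_association(word, category, doc_word_sets, doc_labels):
    a = b = c = d = 0
    for words, label in zip(doc_word_sets, doc_labels):
        contains, in_cat = word in words, label == category
        if contains and in_cat:
            a += 1      # A: contains the word and belongs to the category
        elif contains:
            b += 1      # B: contains the word, other category
        elif in_cat:
            c += 1      # C: lacks the word, belongs to the category
        else:
            d += 1      # D: lacks the word, other category
    denom = (a + b) * (c + d)
    return 0.0 if denom == 0 else (a * d - b * c) ** 2 / denom


def feature_word_set(vocabulary, category, doc_word_sets, doc_labels, threshold):
    # Keep only the words whose association value exceeds the first preset
    # threshold (used in the characterizing step described next).
    return {w for w in vocabulary
            if chi_association(w, category, doc_word_sets, doc_labels) > threshold}
```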
Step S1108, characterizing: the selected feature selection algorithm is used to calculate the association value between each word and the category "microstructure", and words whose association values exceed the first preset threshold are put into the feature word set corresponding to "microstructure"; the same algorithm is then used to calculate the association values between the words and the category "laser", and words whose association values exceed the first preset threshold are put into the feature word set corresponding to "laser". The feature word set obtained for "microstructure" means that each word in it has some degree of influence on judging whether a text belongs to the "microstructure" category; likewise, each word in the feature word set for "laser" has some degree of influence on judging whether a text belongs to the "laser" category.
Step S1110, creating a classification model: after the feature word set corresponding to the category "microstructure" and the feature word set corresponding to the category "laser" are obtained, the two feature word sets and the category information corresponding to the two feature word sets are input into the selected classifier, and the classification model corresponding to the operation is obtained.
Step S1112, testing the accuracy of each classification model: a group of test data is provided, in which the category information corresponding to each test datum is known. For each classification model, the test data are input into the model to obtain its classification result for each test datum, and each classification result is matched against the known category information of that test datum; a match means the test is passed. The pass rate over the group of test data is taken as the test score of the classification model and represents its accuracy.
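A minimal sketch of this test scoring, where predict_fn stands for the trained classification model's prediction function (an illustrative interface, not the patent's):

```python
def test_score(predict_fn, test_items, true_labels):
    # The score is the fraction of test items whose predicted category
    # matches the known category information.
    hits = sum(1 for item, label in zip(test_items, true_labels)
               if predict_fn(item) == label)
    return hits / len(test_items)
```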
The above steps S1102 to S1112 constitute one training run of the classification model. More specifically, a user input interface can be provided externally to receive the training times and an expected value input by the user, where the training times is the number of classification models the user wants to obtain for each combination of feature selection algorithm and classifier, and the expected value is the test score the user expects the resulting classification model to reach. Table 1 shows the classification model training results with an expected value of 80% and a default of 4 training runs:
TABLE 1
Feature selection algorithm and classifier    Run 1    Run 2    Run 3    Run 4
CHI-Liblinear 78.6666% 82.3333% 80.3333% 76.6666%
CHI-LibSVM 72.6666% 74.6666% 76.6666% 74%
CLOSE-Liblinear 79.3333% 80.3333% 78.6666% 72.6666%
CLOSE-LibSVM 76.6666% 74% 72.6666% 77.3333%
As can be seen from Table 1, the second CHI-LibLinear classification model, i.e. the model with the test score of 82.3333%, is finally selected as the optimal classification model for classifying data in the physics research field in the future. Of course, if the expected value is not reached after the 4 training runs, the model with the highest test score is automatically taken as the optimal classification model. This automatic preference technique leaves the choice to the system, so the user does not need to worry about which scheme is most suitable and only needs to provide the expected value and the training times for the system to compute automatically.
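A minimal sketch of this automatic preference step, assuming hypothetical helpers build_model and score_model along the lines of the earlier sketches; it trains the requested number of models per combination and returns the highest-scoring one together with whether the expected value was met:

```python
def select_best_model(combinations, times, expected, build_model, score_model):
    scored = []
    for feature_algo, classifier in combinations:  # e.g. the 4 CHI/CLOSE and LibLinear/LibSVM pairs
        for _ in range(times):                     # the user-supplied training times
            model = build_model(feature_algo, classifier)
            scored.append((score_model(model), model))
    best_score, best_model = max(scored, key=lambda pair: pair[0])
    # The highest-scoring model is preferred when it reaches the expected value;
    # if no model reaches it, the highest-scoring model is still used.
    return best_model, best_score, best_score >= expected
```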
Fig. 2 is a schematic diagram of an apparatus for modeling data classification according to an embodiment of the present invention, and as shown in fig. 2, the apparatus 200 for modeling data classification includes:
the data obtaining unit 210 is adapted to obtain source data of a specified service type and category information of the source data.
A model establishing unit 220 adapted to establish a plurality of classification models according to the acquired source data and the category information of the source data; wherein, establishing each classification model comprises: randomly extracting partial data from the source data; performing word segmentation on the partial data to obtain a plurality of words; for each category of information, calculating a correlation value between each word and the category of information by using a feature selection algorithm, and putting the words with the correlation values higher than a first preset value into a feature word set of the category of information; and inputting all the feature word sets and the category information thereof into a classifier, and establishing a corresponding classification model by using the classifier.
The model selecting unit 230 is adapted to test the plurality of classification models respectively, take the classification model with the highest test score as the optimal classification model, and classify the data of the specified service type according to the optimal classification model.
It can be seen that, for a specified service type, the apparatus shown in fig. 2 obtains the source data and the category information of the source data; each classification model is obtained by extracting part of the data, segmenting it into words, calculating the feature word set corresponding to each item of category information, and inputting the feature word sets into a classifier; multiple such runs produce multiple classification models, which are tested separately, and the model with the best test result is selected as the optimal classification model for classifying other data of that service type. The classification models obtained in this way differ in the data extracted from the source data, the feature selection algorithm used, and/or the classifier used, so the optimal classification model selected from among them is an optimal classification strategy that takes all of these varying parameters into account, giving high accuracy and stability.
In one embodiment of the present invention, the plurality of classification models includes: a plurality of classification models established by using the same feature selection algorithm and the same classifier; and/or a plurality of classification models established using different feature selection algorithms and/or different classifiers.
Specifically, the data obtaining unit 210 is further adapted to provide a user input interface, through which the training times and the expected values input by the user are received; a model establishing unit 220 adapted to establish, for each feature selection algorithm and each classifier, a classification model in accordance with the number of training times using the feature selection algorithm and the classifier; the model selecting unit 230 is adapted to use the classification model with the highest test score and satisfying the expected value as the final data classification model.
In an embodiment of the present invention, the data obtaining unit 210 is further adapted to provide a user input interface through which the user input of the filtering rules is received; the model building unit 220 is further adapted to filter the partial data according to the filtering rule before the word segmentation operation is performed on the partial data.
In an embodiment of the present invention, the model selecting unit 230 is adapted to obtain a plurality of test data and category information of each test data; for each classification model, respectively inputting each test data into the classification model to obtain the classification result of each test data by the classification model; and taking the probability of the classification result in the plurality of test data hitting the corresponding classification information as the test score of the classification model.
In an embodiment of the invention the model establishing unit 220 is adapted to calculate for each word an association value between the word and the category information using the CHI algorithm or the CLOSE algorithm.
In an embodiment of the present invention, the model establishing unit 220 is adapted to input each feature word set and the corresponding category information into the LibLinear classifier or the LibSVM classifier.
It should be noted that the embodiments of the apparatus shown in fig. 2 are the same as the embodiments shown in fig. 1, and the detailed description is given above and will not be repeated herein.
In summary, for a specified service type, the technical solution provided by the present invention obtains the source data and the category information of the source data; each classification model is obtained by extracting part of the data, segmenting it into words, calculating the feature word set corresponding to each item of category information, and inputting the feature word sets into a classifier; multiple such runs produce multiple classification models, which are tested separately, and the model with the best test result is selected as the optimal classification model for classifying other data of that service type. The classification models obtained in this way differ in the data extracted from the source data, the feature selection algorithm used, and/or the classifier used, so the optimal classification model selected from among them is an optimal classification strategy that takes all of these varying parameters into account, giving high accuracy and stability.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (8)

1. A method of building a data classification model, the method comprising:
acquiring source data of a specified service type and category information of the source data, and establishing a plurality of classification models;
respectively testing the plurality of classification models, taking the classification model with the highest test score as an optimal classification model, and classifying data of the specified service type according to the optimal classification model;
wherein, establishing each classification model comprises:
randomly extracting partial data from the source data;
performing word segmentation on the partial data to obtain a plurality of words;
for each category of information, calculating a correlation value between each word and the category of information by using a feature selection algorithm, and putting the words with the correlation values higher than a first preset value into a feature word set of the category of information;
inputting all the feature word sets and the category information thereof into a classifier, and establishing a corresponding classification model by using the classifier;
the plurality of classification models includes:
a plurality of classification models established by using the same feature selection algorithm and the same classifier;
and/or,
a plurality of classification models created using different feature selection algorithms and/or different classifiers.
2. The method of claim 1, further comprising: providing a user input interface, and receiving training times and expected values input by a user through the user input interface;
the establishing of the plurality of classification models comprises: for each feature selection algorithm and each classifier, establishing a number of classification models equal to the training times by using that feature selection algorithm and classifier;
the taking of the classification model with the highest test score as the optimal classification model comprises: taking the classification model that has the highest test score and meets the expected value as the final data classification model.
3. The method of claim 1, wherein prior to said performing a word segmentation operation on said portion of data, the method further comprises:
providing a user input interface through which user input of filter rules is received;
and filtering the partial data according to the filtering rule.
4. The method of claim 1, wherein said separately testing the plurality of classification models comprises:
acquiring a plurality of test data and the category information of each test data;
for each classification model, respectively inputting each test data into the classification model to obtain the classification result of each test data by the classification model; and taking the probability of the classification result in the plurality of test data hitting the corresponding classification information as the test score of the classification model.
5. The method of claim 1, wherein calculating the association value between each word and the category information using a feature selection algorithm comprises:
for each word, the association value between the word and the category information is calculated using the CHI algorithm or the CLOSE algorithm.
6. The method of claim 1, wherein inputting each feature word set into a classifier along with corresponding category information comprises:
and inputting each feature word set and corresponding category information into a LibLinear classifier or a LibSVM classifier.
7. An apparatus for modeling a classification of data, the apparatus comprising:
a data acquisition unit adapted to acquire source data of a specified service type and category information of the source data,
the model establishing unit is suitable for establishing a plurality of classification models according to the acquired source data and the category information of the source data; wherein, establishing each classification model comprises: randomly extracting partial data from the source data; performing word segmentation on the partial data to obtain a plurality of words; for each category of information, calculating a correlation value between each word and the category of information by using a feature selection algorithm, and putting the words with the correlation values higher than a first preset value into a feature word set of the category of information; inputting all the feature word sets and the category information thereof into a classifier, and establishing a corresponding classification model by using the classifier;
the model selection unit is suitable for respectively testing the plurality of classification models, taking the classification model with the highest test score as an optimal classification model, and classifying the data of the specified service type according to the optimal classification model;
the plurality of classification models includes:
a plurality of classification models established by using the same feature selection algorithm and the same classifier;
and/or,
a plurality of classification models created using different feature selection algorithms and/or different classifiers.
8. The apparatus of claim 7,
the data acquisition unit is further suitable for providing a user input interface, and the training times and the expected values input by the user are received through the user input interface;
the model establishing unit is suitable for establishing a classification model which accords with the number of the training times by utilizing the feature selection algorithm and the classifier for each feature selection algorithm and each classifier;
and the model selection unit is suitable for taking the classification model with the highest test score and meeting the expected value as the final data classification model.
CN201511020749.3A 2015-12-30 2015-12-30 Method and device for establishing data classification model Active CN106933847B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201511020749.3A CN106933847B (en) 2015-12-30 2015-12-30 Method and device for establishing data classification model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201511020749.3A CN106933847B (en) 2015-12-30 2015-12-30 Method and device for establishing data classification model

Publications (2)

Publication Number Publication Date
CN106933847A CN106933847A (en) 2017-07-07
CN106933847B (en) 2019-12-27

Family

ID=59441623

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201511020749.3A Active CN106933847B (en) 2015-12-30 2015-12-30 Method and device for establishing data classification model

Country Status (1)

Country Link
CN (1) CN106933847B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110427457A (en) * 2019-06-28 2019-11-08 厦门美域中央信息科技有限公司 A feature selection method for ANN-based database text classification

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6192360B1 (en) * 1998-06-23 2001-02-20 Microsoft Corporation Methods and apparatus for classifying text and for building a text classifier
CN104063458A (en) * 2014-06-26 2014-09-24 北京奇虎科技有限公司 Method and device for providing corresponding solution for terminal fault problem
CN104331498A (en) * 2014-11-19 2015-02-04 亚信科技(南京)有限公司 Method for automatically classifying webpage content visited by Internet users
CN104699772A (en) * 2015-03-05 2015-06-10 孟海东 Big data text classifying method based on cloud computing
CN104965822A (en) * 2015-07-29 2015-10-07 中南大学 Emotion analysis method for Chinese texts based on computer information processing technology
CN104965867A (en) * 2015-06-08 2015-10-07 南京师范大学 Text event classification method based on CHI feature selection
CN105069141A (en) * 2015-08-19 2015-11-18 北京工商大学 Construction method and construction system for stock standard news library

Also Published As

Publication number Publication date
CN106933847A (en) 2017-07-07

Similar Documents

Publication Publication Date Title
CN110674881B (en) Trademark image retrieval model training method, system, storage medium and computer equipment
CN111126386B (en) Sequence domain adaptation method based on countermeasure learning in scene text recognition
CN113590764B (en) Training sample construction method and device, electronic equipment and storage medium
CN110069630B (en) Improved mutual information feature selection method
US9824313B2 (en) Filtering content in an online system based on text and image signals extracted from the content
CN106776566B (en) Method and device for recognizing emotion vocabulary
CN107273500A (en) Text classifier generation method, file classification method, device and computer equipment
CN110503143B (en) Threshold selection method, device, storage medium and device based on intention recognition
CN106897290B (en) Method and device for establishing keyword model
CN112711983B (en) Nuclear analysis system, method, electronic device, and readable storage medium
CN103824090A (en) Adaptive face low-level feature selection method and face attribute recognition method
CN111210402A (en) Face image quality scoring method and device, computer equipment and storage medium
CN105117740A (en) Font identification method and device
CN108280164A (en) A kind of short text filtering and sorting technique based on classification related words
CN105912525A (en) Sentiment classification method for semi-supervised learning based on theme characteristics
CN106202274A (en) A kind of defective data automatic abstract sorting technique based on Bayesian network
CN112200862B (en) Training method of target detection model, target detection method and device
CN104809229A (en) Method and system for extracting text characteristic words
CN112785566B (en) Metaphase image scoring method, metaphase image scoring device, electronic equipment and storage medium
CN106933847B (en) Method and device for establishing data classification model
CN108090040A (en) A kind of text message sorting technique and system
CN108363967A (en) A kind of categorizing system of remote sensing images scene
CN107729909B (en) Application method and device of attribute classifier
CN112489689A (en) Cross-database voice emotion recognition method and device based on multi-scale difference confrontation
CN109214275B (en) Vulgar picture identification method based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP02 Change in the address of a patent holder

Address after: Room 818, 8 / F, 34 Haidian Street, Haidian District, Beijing 100080

Patentee after: BEIJING ULTRAPOWER SOFTWARE Co.,Ltd.

Address before: 100089 Beijing city Haidian District wanquanzhuang Road No. 28 Wanliu new building 6 storey block A Room 601

Patentee before: BEIJING ULTRAPOWER SOFTWARE Co.,Ltd.