CN106909654B - Multi-level classification system and method based on news text information - Google Patents

Publication number
CN106909654B
Authority
CN
China
Prior art keywords
classification
text information
training
news text
classifiers
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710103541.0A
Other languages
Chinese (zh)
Other versions
CN106909654A (en)
Inventor
赵毅强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing time Ltd.
Original Assignee
Beijing Time Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Time Co ltd
Priority to CN201710103541.0A
Publication of CN106909654A
Application granted
Publication of CN106909654B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification

Abstract

The invention discloses a multi-level classification system and method based on news text information, and relates to the technical field of file classification. The system includes: a training module, configured to train a preset training sample set with multiple machine learning algorithms for each level of classification of news text information, and to determine the number and types of classifiers corresponding to each level of classification according to the training results; a multi-level classification module, configured to configure a corresponding multi-level classification model according to the number and types of classifiers determined by the training module for each level of classification; and a result determining module, configured to input acquired news text information to be classified into the multi-level classification model for classification, and to determine the output of the multi-level classification model as the final classification result of that news text information. The invention thereby addresses the inaccurate classification results caused by unbalanced sample data, and improves both the accuracy and the efficiency of classification.

Description

Multi-level classification system and method based on news text information
Technical Field
The invention relates to the technical field of file classification, in particular to a multi-level classification system and method based on news text information.
Background
With the development of the internet era, network resources are more and more abundant and various. In order to effectively search and utilize various resources on the network, it is important to accurately and comprehensively classify the network resources. With the advent and development of machine learning algorithms, more and more people apply machine learning algorithms to news text information classification methods.
However, in the process of implementing the present invention, the inventors found at least the following problem in the prior art: in many application scenarios, the distribution of sample data is unbalanced for various reasons. When the data is unbalanced, the non-hierarchical news text classification methods of the prior art, which are implemented with machine learning algorithms, cause the algorithm to focus on the majority-class samples, so that minority-class samples cannot be accurately identified and the overall accuracy of news text classification is reduced.
Disclosure of Invention
In view of the above, the present invention has been developed to provide a multi-level classification system based on news text information and a corresponding method that overcome or at least partially solve the above-mentioned problems.
According to an aspect of the present invention, there is provided a multi-level classification system based on news text information, including: a training module, configured to train a preset training sample set with multiple machine learning algorithms for each level of classification of news text information, and to determine the number and types of classifiers corresponding to each level of classification according to the training results; a multi-level classification module, configured to configure a corresponding multi-level classification model according to the number and types of classifiers determined by the training module for each level of classification; and a result determining module, configured to input acquired news text information to be classified into the multi-level classification model for classification, and to determine the output of the multi-level classification model as the final classification result of the news text information to be classified.
According to another aspect of the present invention, there is provided a multi-level classification method based on news text information, including: for each level of classification of news text information, training a preset training sample set with multiple machine learning algorithms, and determining the number and types of classifiers corresponding to each level of classification according to the training results; configuring a corresponding multi-level classification model according to the number and types of classifiers corresponding to each level of classification; and inputting acquired news text information to be classified into the multi-level classification model for classification, and determining the output of the multi-level classification model as the final classification result of the news text information to be classified.
Therefore, the invention provides a multi-level classification system and method based on news text information. By constructing a multi-level classification framework for news text information and configuring different classifiers at each level according to the type of the news text information, the invention addresses the inaccurate classification results caused by unbalanced sample data, and effectively improves both the accuracy and the efficiency of news text classification.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
Fig. 1 is a schematic structural diagram of a multi-level classification system based on news text information according to a first embodiment of the present invention;
Fig. 2 is a schematic structural diagram of a multi-level classification system based on news text information according to a second embodiment of the present invention;
Fig. 3 is a flowchart of a multi-level classification method based on news text information according to a third embodiment of the present invention;
Fig. 4 is a flowchart of a multi-level classification method based on news text information according to a fourth embodiment of the present invention;
Fig. 5 is a workflow diagram of the multi-level classification system based on news text information according to the second embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The invention provides a multi-stage classification system and method based on news text information, which can at least solve the technical problem of inaccurate classification of the news text information caused by data imbalance in the prior art.
Example one
Fig. 1 shows a multi-level classification system based on news text information provided by the present invention, which includes: a training module 110, a multi-level classification module 120, and a result determination module 130.
The training module 110 is configured to train a preset training sample set through multiple machine learning algorithms for each level of classification of news text information, and determine the number and types of classifiers corresponding to each level of classification according to a training result.
When news text information is classified, different items can be assigned to different categories according to their content. To make the classification accurate and fine-grained, a multi-level classification system can be adopted, ordered either by increasing or by decreasing degree of abstraction of the categories. For convenience and in keeping with convention, this embodiment adopts a three-level classification system with successively decreasing abstraction; for example, for the term "hyperopic league", the first-level category is "sports", the second-level category is "international football", and the third-level category is "hyperopic league". The invention does not specifically limit the number of levels or the classification basis of the classification system, and those skilled in the art can set them flexibly according to the actual situation.
Data imbalance is often encountered when classifying news text information. If only one classification algorithm is used to classify all the data, the characteristics of that algorithm cause it to focus excessively on one part of the sample data while failing to identify the other part accurately, which reduces the classification accuracy of the whole system. To overcome this problem, this embodiment provides a multi-level classification system for news text information in which a corresponding classifier is set on each node of each level; the classifiers include but are not limited to a root node classifier, leaf node classifiers and intermediate node classifiers. In a specific application, the classifiers on the nodes may use the same classification algorithm or different ones; preferably, different classification algorithms are selected according to the data characteristics of the different nodes at the different levels.
Specifically, in the solution provided in this embodiment, a corresponding training sample set needs to be preset for each node of each hierarchy, and data in each training sample set should include all or at least most features of the corresponding node category data. The training module 110 trains the training sample set corresponding to each node through a plurality of classification algorithms, and selects an optimal classification algorithm for each node, thereby determining the number and types of classifiers corresponding to each class.
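As a rough illustration of the per-node selection described above, the following Python sketch trains several candidate classifiers on one node's sample set and keeps the one that scores best on held-out samples. It is only a sketch under stated assumptions: the two toy classifiers, the accuracy criterion, the function name `select_classifier`, and the sample texts are all hypothetical and stand in for whatever algorithms and selection rule an implementer actually uses.

```python
from collections import Counter, defaultdict


class MajorityClassifier:
    """Toy baseline: always predicts the most frequent training label."""

    def fit(self, texts, labels):
        self.label = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, text):
        return self.label


class KeywordVoteClassifier:
    """Toy model: each word votes for the labels it co-occurred with in training."""

    def fit(self, texts, labels):
        self.votes = defaultdict(Counter)
        self.fallback = Counter(labels).most_common(1)[0][0]
        for text, label in zip(texts, labels):
            for word in set(text.split()):
                self.votes[word][label] += 1
        return self

    def predict(self, text):
        tally = Counter()
        for word in text.split():
            tally.update(self.votes.get(word, {}))
        return tally.most_common(1)[0][0] if tally else self.fallback


def accuracy(model, texts, labels):
    """Fraction of samples whose predicted label equals the true label."""
    hits = sum(model.predict(t) == y for t, y in zip(texts, labels))
    return hits / len(labels)


def select_classifier(candidates, train, held_out):
    """Train every candidate on one node's sample set and keep the best scorer."""
    scored = []
    for name, factory in candidates.items():
        model = factory().fit(*train)
        scored.append((accuracy(model, *held_out), name, model))
    best_score, best_name, best_model = max(scored, key=lambda s: s[0])
    return best_name, best_model, best_score


# Hypothetical sample set for a second-level "sports" node.
train = (
    ["premier league goal striker", "league match goal", "nba playoffs dunk",
     "nba finals rebound", "league derby goal"],
    ["football", "football", "basketball", "basketball", "football"],
)
held_out = (["goal in the league", "nba dunk contest"], ["football", "basketball"])
candidates = {"majority": MajorityClassifier, "keyword_vote": KeywordVoteClassifier}

best_name, best_model, best_score = select_classifier(candidates, train, held_out)
```

Repeating this selection for every node yields the number and types of classifiers for each level; in practice the candidates would be real learners such as support vector machines or neural networks.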
To further improve the classification accuracy of the multi-level classification system, the classification algorithms in this embodiment are preferably machine learning algorithms, which include but are not limited to support vector machine algorithms, convolutional neural network algorithms, recurrent neural network algorithms, and the like. Because different algorithms have different strengths and weaknesses, the invention does not limit the specific machine learning algorithm adopted at each node, and those skilled in the art can choose it according to the actual application effect.
The multi-level classification module 120 is configured to configure a corresponding multi-level classification model according to the number and types of classifiers corresponding to each level of classification.
The multi-stage classification model is a mixed model containing a plurality of algorithms, contains different classification algorithms adopted by classifiers on all nodes in the system, and records the connection relation among the classifiers through configuration files. In this embodiment, the multi-stage classification module 120 configures a corresponding multi-stage classification model according to the number and types of classifiers corresponding to each stage of classification determined by the training module 110, and generates a configuration file for recording information of each node classifier; after the news text information to be classified is input into the multi-level classification model, the multi-level classification module 120 queries the configuration file according to the obtained output result of the current node classifier, so as to determine the next node classifier of the current node classifier. The multi-level classification model is preferably a tree-like classification model comprising multi-level node classifiers.
The result determining module 130 is configured to input the acquired news text information to be classified into the multi-level classification model for classification, and to determine the output of the multi-level classification model as the final classification result of the news text information to be classified.
Specifically, the result determining module 130 inputs the obtained text information of the news to be classified into the multi-level classification models in the multi-level classification module 120, the multi-level classification module 120 identifies and classifies the text information of the news to be classified according to the built-in classifier, and transmits the classification result to the result determining module 130, and the result determining module 130 determines the final classification result of the text information of the news to be classified according to the classification result output by the multi-level classification module 120.
Therefore, by constructing a multi-level classification framework for news text information and configuring different classifiers at each level, the multi-level classification system provided by the invention specifically addresses the inaccurate classification results caused by unbalanced sample data, and effectively improves both the accuracy and the efficiency of news text classification.
Example two
Fig. 2 shows a multi-level classification system based on news text information provided by the present invention, which includes: a training module 210, an evaluation module 220, a multi-level classification module 230, a model update module 240, and a result determination module 250.
The training module 210 is configured to train a preset training sample set through multiple machine learning algorithms for each level of classification of news text information, and determine the number and types of classifiers corresponding to each level of classification according to a training result.
Specifically, the training module 210 generates a training sample set from the acquired labeled data, extracts the training feature words contained in the training sample set, and assigns corresponding weights to the extracted training feature words; the training module 210 then generates corresponding training feature vectors from the extracted training feature words and their weights, and obtains the training results and the corresponding classifiers from the training feature vectors. The training feature words may be extracted according to a preset dictionary or according to other rules, which the present invention does not specifically limit. Nor does the present invention limit the method of weighting the extracted training feature words, which those skilled in the art can set flexibly. For example, when the news text information to be classified is a plain text file, the TF-IDF (term frequency-inverse document frequency) algorithm may be used to assign the weights.
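The TF-IDF weighting mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: it assumes whitespace-separated tokens (Chinese text would first need a word segmenter), uses the plain `tf * log(N / df)` variant without smoothing, and the function name `tfidf_vectors` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(documents, vocabulary):
    """Return one TF-IDF vector per document, ordered like `vocabulary`."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each feature word appears.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        vector = []
        for word in vocabulary:
            tf = counts[word] / total
            idf = math.log(n_docs / df[word]) if df[word] else 0.0
            vector.append(tf * idf)
        vectors.append(vector)
    return vectors


# The preset dictionary of feature words and three toy training documents.
dictionary = ["league", "goal", "vote"]
docs = ["league goal goal", "league budget vote", "league election vote"]
vectors = tfidf_vectors(docs, dictionary)
```

Here "league" occurs in every document, so its weight is zero everywhere, while "goal" receives a high weight in the only document that contains it. This is the behaviour that makes TF-IDF useful for separating categories.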
To further improve the classification accuracy of the multi-level classification system, the classification algorithms in this embodiment are preferably machine learning algorithms, which include but are not limited to support vector machine algorithms, convolutional neural network algorithms, recurrent neural network algorithms, and the like. Because different algorithms have different strengths and weaknesses, the invention does not limit the specific machine learning algorithm adopted at each node, and those skilled in the art can choose it according to the actual application effect.
The evaluation module 220 is configured to evaluate the training results of the training module 210, and to modify the number and types of classifiers corresponding to each level of classification according to the evaluation results.
To further improve the accuracy of the number and types of classifiers determined by the training module 210, an evaluation module 220 may be added. The evaluation module 220 evaluates the training results of the training module 210 against a preset verification set, and modifies the number and types of classifiers determined by the training module 210 for each level of classification according to the evaluation results, so that the determined classifiers better suit the level at which they are located. The modification includes deleting, adding and/or replacing classifiers. The verification set is a small portion of the labeled data that does not participate in model training and is used solely to evaluate which of the trained models performs better.
The evaluation module 220 may not only assist the training module 210 in determining a suitable classifier, but also perform continuous attempts on a newly added sample set and a newly adopted classification algorithm in the subsequent module operation process, and further evaluate each attempt result, thereby determining a better classifier. The specific evaluation method adopted by the evaluation module 220 is not specifically limited in the present invention, and those skilled in the art can flexibly set the evaluation method according to the actual situation.
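The patent leaves the evaluation method open. As one possible choice (an assumption, not something the source prescribes), the sketch below scores predictions on a verification set by per-class recall and its unweighted mean, which exposes a model that ignores minority classes even when its plain accuracy looks acceptable. The function names and the toy labels are hypothetical.

```python
from collections import Counter


def per_class_recall(true_labels, predicted_labels):
    """Recall for each class: correct predictions / samples of that class."""
    totals = Counter(true_labels)
    correct = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    return {label: correct[label] / totals[label] for label in totals}


def macro_recall(true_labels, predicted_labels):
    """Unweighted mean of per-class recall, so rare classes count equally."""
    recalls = per_class_recall(true_labels, predicted_labels)
    return sum(recalls.values()) / len(recalls)


# Hypothetical verification set: 8 "sports" items and 2 "finance" items.
truth = ["sports"] * 8 + ["finance"] * 2
always_majority = ["sports"] * 10                       # ignores the rare class
balanced_model = ["sports"] * 7 + ["finance"] + ["finance"] * 2

plain_accuracy = sum(t == p for t, p in zip(truth, always_majority)) / len(truth)
majority_score = macro_recall(truth, always_majority)
balanced_score = macro_recall(truth, balanced_model)
```

With these toy labels the majority-only model reaches 0.8 accuracy but only 0.5 macro recall, while the second model scores 0.9375, so an evaluator using this metric would replace the first classifier with the second.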
The multi-level classification module 230 is configured to configure a corresponding multi-level classification model according to the number and types of classifiers determined by the training module for each level of classification.
In this embodiment, the multi-stage classification module 230 configures a corresponding multi-stage classification model according to the number and types of classifiers corresponding to each stage of classification determined by the training module 210, and generates a configuration file corresponding to the multi-stage classification model; each time the output result of the current node classifier is obtained, the multi-stage classification module 230 determines the next-stage node classifier of the current node classifier by querying the configuration file. The configuration file stores a plurality of configuration items corresponding to the node classifiers, and specifically, each configuration item includes description information of the corresponding node classifier, a classification type adapted to the node classifier, and/or a correspondence between each output result of the node classifier and a next-stage node classifier thereof. Therefore, the multi-stage classification module 230 can automatically select the most suitable classifier from the plurality of classifiers classified at the next stage for further classification operation through the configuration file.
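The configuration-driven routing described above can be sketched as follows. The layout of `CONFIG`, the node names, the algorithm labels, and the keyword-based stand-in classifiers are all hypothetical; the patent only states that each configuration item holds a description, the adapted classification type, and the mapping from each output to the next-level node classifier.

```python
# Hypothetical configuration: one item per node classifier.
CONFIG = {
    "root": {
        "description": "first-level classifier",
        "algorithm": "svm",
        "next": {"sports": "sports_node", "finance": None},
    },
    "sports_node": {
        "description": "second-level classifier for sports",
        "algorithm": "naive_bayes",
        "next": {"international football": None, "basketball": None},
    },
}


def keyword_classifier(rules, default):
    """Build a stand-in classifier that returns the label of the first matching keyword."""
    def predict(text):
        for keyword, label in rules.items():
            if keyword in text:
                return label
        return default
    return predict


# Stand-ins for trained models; a real system would load them from storage.
CLASSIFIERS = {
    "root": keyword_classifier({"league": "sports", "nba": "sports"}, "finance"),
    "sports_node": keyword_classifier({"league": "international football"}, "basketball"),
}


def classify(text, config=CONFIG, classifiers=CLASSIFIERS, start="root"):
    """Walk the tree: run the current node, then look up the next node in the config."""
    path, node = [], start
    while node is not None:
        label = classifiers[node](text)
        path.append(label)
        node = config[node]["next"].get(label)
    return path


sports_path = classify("the league title race tightens")
finance_path = classify("central bank raises rates")
```

Because the routing lives in data rather than in code, adding a third level or swapping a node's algorithm only requires editing the configuration and registering the corresponding classifier.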
In this embodiment, the multi-level classification model is a tree-like classification model including multi-level node classifiers of several different types, for example a root node classifier, leaf node classifiers and intermediate node classifiers. There are usually a plurality of leaf node classifiers and intermediate node classifiers. The relationship between node classifiers and sub-classifications may be one-to-one, in which one node classifier corresponds to only one sub-classification; it may be many-to-one, in which a plurality of node classifiers correspond to the same sub-classification, and in that case a node classifier of a suitable type can further be selected to identify the sub-classification according to factors such as the type of the news text information; or it may be one-to-many, in which one node classifier corresponds to a plurality of sub-classifications whose classification rules are usually similar, so that the same node classifier can identify them. In addition, there is usually one root node classifier, but a plurality of root node classifiers of different types may also be adopted to suit different types of news text information. The above description of the structure of the multi-level classification model is only an example and does not limit the present invention; those skilled in the art may adopt other suitable structures according to the actual situation.
The model updating module 240 is configured to update the configured multi-level classification model according to the modifications made by the evaluation module 220.
In order to enable the system to achieve the optimal news text information recognition effect, the evaluation module 220 continuously modifies the number and types of classifiers determined by the training module 210, so the model updating module 240 updates the configured multi-level classification model according to the modification of the evaluation module 220, and meanwhile, the model updating module 240 also needs to correspondingly update the configuration file generated by the multi-level classification module 230 and perform matching update on the multi-level classification model according to the updated configuration file.
In order to improve the overall operating efficiency of the system, the model updating module 240 may update the multi-level classification model by hot switching: when, according to the modification result of the evaluation module 220, a new model performs better than the online model, the multi-level classification model used by the system is quickly replaced without shutting down the system. To support the hot-switch operation of the model updating module 240, the configuration file generated by the multi-level classification module 230 may include a plurality of metadata entries corresponding to different classification models. Each metadata entry records the path and description information (for example, the model type) of the corresponding classification model and is updated synchronously when that classification model is updated, so that the model updating module 240 can complete a hot-switch update automatically from the content recorded in the metadata.
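One way to picture the hot-switch update is a small registry that replaces the online model and its metadata together, under a lock, and only when the candidate scored better. This is a sketch under assumptions: the class name `ModelRegistry`, the `score` field, and the model paths are hypothetical, and the source does not say how the comparison or the swap is implemented.

```python
import threading


class ModelRegistry:
    """Holds the online model and swaps it without stopping the service."""

    def __init__(self, model, metadata):
        self._lock = threading.Lock()
        self._model = model
        self._metadata = metadata

    def current(self):
        """Return the online model together with its metadata entry."""
        with self._lock:
            return self._model, self._metadata

    def hot_switch(self, new_model, new_metadata):
        """Replace the online model only if the candidate scored higher."""
        with self._lock:
            if new_metadata["score"] <= self._metadata["score"]:
                return False
            # Model and metadata are replaced together under the lock.
            self._model, self._metadata = new_model, new_metadata
            return True


# Hypothetical models and metadata (path, type and evaluation score).
registry = ModelRegistry(
    model=lambda text: "sports",
    metadata={"path": "models/v1", "type": "svm", "score": 0.81},
)
rejected = registry.hot_switch(
    lambda text: "finance", {"path": "models/v0", "type": "svm", "score": 0.75}
)
accepted = registry.hot_switch(
    lambda text: "finance", {"path": "models/v2", "type": "cnn", "score": 0.88}
)
active_model, active_metadata = registry.current()
```

Requests that call `current()` always see a matching model and metadata pair, which is the property that keeps the service consistent while the update happens.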
The result determining module 250 is configured to input the acquired news text information to be classified into the multi-level classification model for classification, and to determine the output of the multi-level classification model as the final classification result of the news text information to be classified.
The news text information to be classified is generally a complete paragraph or article and cannot be input directly into the multi-level classification model for identification. Therefore, before inputting it into the multi-level classification model, the result determining module 250 performs a series of preprocessing operations that convert the news text information to be classified into a form that the multi-level classification model can recognize. Common preprocessing operations include extracting the file feature words contained in the news text information to be classified, assigning corresponding weights to the extracted file feature words, and generating corresponding file feature vectors from the extracted file feature words and their weights. The rules for extracting the file feature words and assigning the weights may be consistent with those of the corresponding operations in the training module 210, and are not described again here.
In addition, in practical applications, the sources of the news text messages to be classified are various, and therefore, the result determining module 250 also needs to perform a series of standardized processing on the news text messages to be classified, so as to facilitate subsequent preprocessing operations. The common normalization processing includes adjusting fonts in the news text information to be classified according to preset font setting rules and/or filtering vocabularies in the news text information to be classified according to preset filtering rules.
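The normalization step above can be illustrated with Python's standard `unicodedata` module. The sketch assumes that the "font setting rule" means unifying character forms (here, folding full-width characters with NFKC normalization) and that the filtering rule is a simple word list; both are assumptions, and `FILTER_WORDS` is a hypothetical list.

```python
import unicodedata

# Hypothetical filter list; a deployment would load its own rules.
FILTER_WORDS = {"advertisement", "click", "here"}


def normalize(text, filter_words=FILTER_WORDS):
    """Unify character forms, lowercase, and drop filtered words."""
    # NFKC folds full-width letters, digits and spaces into their ordinary forms.
    unified = unicodedata.normalize("NFKC", text).lower()
    kept = [word for word in unified.split() if word not in filter_words]
    return " ".join(kept)


# Full-width "NBA", an ideographic space, and a full-width "7", written as escapes.
raw = "\uff2e\uff22\uff21\u3000Finals  click here  Game \uff17"
cleaned = normalize(raw)
```

After this step, text from different sources arrives at feature extraction in one consistent form, so the same dictionary and weights apply to all of it.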
As described above, in the multi-level classification system based on news text information provided by the present invention, each node can be provided with different types and numbers of classifiers, so that the classifiers can be set in a targeted manner according to the type and content of the news text information to be classified. For example, when the news text information to be classified is a text type, the corresponding classifier can be set to adopt an algorithm suitable for text classification, such as a naive bayes algorithm; when the news text information to be classified is of a picture type, the corresponding classifier can be set to adopt algorithms suitable for picture classification, such as a deep learning algorithm. Therefore, the classifiers with different types and quantities can be arranged on different nodes to specifically identify various types of news text information to be classified, so that the final classification result of the news text information is more accurate. For example, when the news text information to be classified contains a picture type, the picture information contained in the news text information may be acquired first; then, determining a picture classification result corresponding to the picture information through a preset picture classification model; and finally, generating a file feature vector corresponding to the news text information according to the picture classification result, and determining a news text information classification result corresponding to the file feature vector through a preset news text information classification model. 
When news text information containing pictures is processed in this way, the pictures can be quantified quickly and accurately: pictures with a large data volume and highly variable form are reduced to corresponding picture classification results. Because a picture classification result is small in data volume, fast to process and effective for classification, determining the type of the news text information from picture classification results is likewise fast and accurate.
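The picture handling described above might be sketched as follows. Everything here is illustrative: `classify_picture` is a stub standing in for the preset picture classification model, `PICTURE_LABELS` is a hypothetical label set, and appending label counts to the text features is only one plausible way to build the file feature vector from the picture classification results.

```python
PICTURE_LABELS = ["stadium", "chart", "portrait"]  # hypothetical label set


def classify_picture(picture):
    """Stand-in for a preset picture classification model."""
    return picture["hint"]  # a real model would inspect the pixel data


def build_feature_vector(text_features, pictures):
    """Append picture-label counts to the text features."""
    counts = [0] * len(PICTURE_LABELS)
    for picture in pictures:
        label = classify_picture(picture)
        counts[PICTURE_LABELS.index(label)] += 1
    return list(text_features) + counts


vector = build_feature_vector(
    [0.4, 0.0, 0.7],
    [{"hint": "stadium"}, {"hint": "stadium"}, {"hint": "chart"}],
)
```

Each picture contributes a single label rather than its raw pixel data, which is why the resulting vector stays small and cheap for the news text classification model to process.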
In order to further understand the workflow of the multi-level classification system based on news text information provided by the present invention, the workflow of the system is described in detail with reference to fig. 5 below: the multi-stage classification system provided by the invention can be roughly divided into two parts, namely a training part and a prediction part, wherein the training part is used for constructing and correcting a model, and the prediction part is used for identifying and classifying news text information to be classified by utilizing the constructed classification model. For the training part, firstly, inputting a pre-prepared labeled document into a system, wherein a training module of the system acquires labeled data from the labeled document, generates a training sample set by using the labeled data, extracts training feature words from the training sample set and stores the training feature words in a corresponding dictionary; then, the training module performs model training by using the training sample set and the dictionary, so as to obtain different classification models and metadata and dictionaries corresponding to each classification model; and then, the evaluation module evaluates and selects the most appropriate classification model for the identification and classification operation of the specific news text information to be classified according to the actual application condition of the model. 
For the 'prediction part', specifically, news text information to be classified is input into a system, a result determination module preprocesses the news text information to be classified and sends the preprocessed news text information to be classified to a multistage classification module; the multi-stage classification module identifies and classifies the news text information to be classified according to the multi-stage classification algorithm (such as the first-stage classification algorithm, the second-stage classification algorithm and the third-stage classification algorithm shown in the figure) contained in the selected multi-stage classification model, and sends the classification result to the result determination module; and finally, the result determining module determines the output result of the multi-stage classification model sent by the multi-stage classification module as the final classification result of the news text information to be classified.
Therefore, by constructing a multi-level classification framework for news text information and configuring different classifiers at each level, the multi-level classification system based on news text information provided by the invention specifically addresses the problem of inaccurate classification results caused by unbalanced sample data, and effectively improves both the accuracy and the efficiency of news text information classification. In addition, the multi-stage classification system classifies news text information using machine learning algorithms, and corrects the system in real time through an evaluation mechanism and a model updating mechanism with a hot-switching function, so that the system can remain in its optimal working state. Meanwhile, through the preprocessing and standardization operations, the system can identify news text information to be classified of different types and from different sources, which further improves the adaptability of the system and widens its application range.
EXAMPLE III
Fig. 3 shows a multi-level classification method based on news text information, which includes:
Step S310: for each level of classification of news text information, training is performed on a preset training sample set using a plurality of machine learning algorithms, and the number and types of classifiers corresponding to each level of classification are determined according to the training results.
Specifically, in the scheme provided in this embodiment, a corresponding training sample set is preset for each node of each level, and the data in each training sample set should cover all, or at least most, of the features of the class data of the corresponding node. Training is then performed on the training sample set corresponding to each node using multiple classification algorithms, and the best-performing classification algorithm is selected for each node, thereby determining the number and types of classifiers corresponding to each level of classification.
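The per-node selection described above can be sketched as follows. This is a minimal illustration only: the two toy trainers stand in for the machine learning algorithms named in this embodiment, and all names (`train_majority`, `train_keyword`, `select_for_node`) and the sample data are hypothetical rather than taken from the patent.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

Sample = Tuple[str, str]                 # (text, label)
Predictor = Callable[[str], str]
Trainer = Callable[[List[Sample]], Predictor]

def train_majority(samples: List[Sample]) -> Predictor:
    """Toy baseline: always predict the most frequent training label."""
    top = Counter(label for _, label in samples).most_common(1)[0][0]
    return lambda text: top

def train_keyword(samples: List[Sample]) -> Predictor:
    """Toy word-count model: predict the label whose training words overlap most."""
    words: Dict[str, Counter] = {}
    for text, label in samples:
        words.setdefault(label, Counter()).update(text.lower().split())

    def predict(text: str) -> str:
        tokens = text.lower().split()
        return max(words, key=lambda lab: sum(words[lab][t] for t in tokens))
    return predict

def accuracy(predict: Predictor, samples: List[Sample]) -> float:
    return sum(predict(t) == y for t, y in samples) / len(samples)

def select_for_node(trainers: Dict[str, Trainer],
                    train: List[Sample],
                    held_out: List[Sample]) -> Tuple[str, Dict[str, float]]:
    """Train every candidate on one node's sample set and keep the most accurate."""
    scored = {name: accuracy(fit(train), held_out) for name, fit in trainers.items()}
    return max(scored, key=scored.get), scored

# Hypothetical sample set for a single node of the tree.
train = [("stocks fall on rate fears", "finance"), ("bank profits rise", "finance"),
         ("markets rally after earnings", "finance"),
         ("team wins cup final", "sports"), ("striker scores twice", "sports")]
held_out = [("bank stocks rally", "finance"), ("team scores in final", "sports")]

best, scored = select_for_node(
    {"majority": train_majority, "keyword": train_keyword}, train, held_out)
```

Repeating `select_for_node` for every node yields one chosen algorithm per node, which is the information needed to fix the number and types of classifiers at each level.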
In order to further improve the classification accuracy of the multi-level classification system, in the present embodiment the classification algorithms are preferably machine learning algorithms, which specifically include, but are not limited to, support vector machine algorithms, convolutional neural network algorithms, recurrent neural network algorithms, and the like. Different algorithms have different advantages and disadvantages, so the invention does not limit the specific machine learning algorithm adopted by each node, and those skilled in the art can choose the algorithm according to its actual application effect.
Step S320: configuring a corresponding multi-stage classification model according to the number and types of classifiers corresponding to each level of classification.
The multi-stage classification model is a hybrid model containing multiple algorithms: it contains the different classification algorithms adopted by the classifiers on all nodes in the system, and records the connection relations among the classifiers in a configuration file. In this embodiment, a corresponding multi-level classification model is first configured according to the number and types of classifiers corresponding to each level of classification determined in step S310, and a configuration file recording the information of each node classifier is generated. After the news text information to be classified is input into the multi-stage classification model, the configuration file is queried according to the output result of the current node classifier, so as to determine the next-level node classifier. The multi-level classification model is preferably a tree-like classification model comprising multi-level node classifiers.
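As a rough sketch of how a configuration file could drive the walk from one node classifier to the next, the example below keeps the configuration as an in-memory mapping. The node names, labels and keyword-based stub classifiers are hypothetical; the patent does not prescribe a configuration format.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical configuration: node name -> {output label -> next node name, or None at a leaf}.
CONFIG: Dict[str, Dict[str, Optional[str]]] = {
    "root": {"sports": "sports_node", "finance": None},
    "sports_node": {"football": None, "basketball": None},
}

# Stub node classifiers; a real system would load trained models here.
CLASSIFIERS: Dict[str, Callable[[str], str]] = {
    "root": lambda t: "sports" if any(w in t for w in ("match", "goal", "dunk")) else "finance",
    "sports_node": lambda t: "football" if "goal" in t else "basketball",
}

def classify(text: str,
             classifiers: Dict[str, Callable[[str], str]],
             config: Dict[str, Dict[str, Optional[str]]],
             start: str = "root") -> List[str]:
    """Walk the tree: run the current node, then look up its successor in the config."""
    path: List[str] = []
    node: Optional[str] = start
    while node is not None:
        label = classifiers[node](text)
        path.append(label)
        node = config[node].get(label)   # None (or a missing label) ends the walk
    return path

sports_path = classify("late goal decides the match", CLASSIFIERS, CONFIG)
finance_path = classify("central bank holds rates", CLASSIFIERS, CONFIG)
```

Because the routing lives in `CONFIG` rather than in code, changing the tree only requires editing the configuration, which is the property the later update step relies on.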
Step S330: inputting the obtained news text information to be classified into a multi-stage classification model for classification, and determining the output result of the multi-stage classification model as the final classification result of the news text information to be classified.
Specifically, the obtained news text information to be classified is input into a multi-level classification model, the multi-level classification model identifies and classifies the news text information to be classified according to a built-in classifier, a classification result is generated, and finally the output classification result is determined as a final classification result of the news text information to be classified.
Therefore, by constructing a multi-level classification framework for news text information and configuring different classifiers at each level, the multi-level classification method based on news text information provided by the invention specifically addresses the problem of inaccurate classification results caused by unbalanced sample data, and effectively improves both the accuracy and the efficiency of news text information classification.
EXAMPLE IV
Fig. 4 shows a multi-level classification method based on news text information, which includes:
Step S410: for each level of classification of news text information, training is performed on a preset training sample set using a plurality of machine learning algorithms, and the number and types of classifiers corresponding to each level of classification are determined according to the training results.
Specifically, a training sample set is generated according to the acquired labeling data, training feature words contained in the training sample set are extracted, and corresponding weights are given to the extracted training feature words; and then generating corresponding training feature vectors according to the extracted training feature words and the weights thereof, and obtaining training results and corresponding classifiers according to the training feature vectors. The training feature words may be extracted according to a preset dictionary, or may be extracted according to other rules, which is not specifically limited in the present invention. The present invention is not limited to a specific method for weighting the extracted training feature words, and those skilled in the art can flexibly set the method. For example, when the news text information to be classified is a plain text file, a TF-IDF (Term Frequency-Inverse Document Frequency) algorithm may be adopted to assign corresponding weights to the extracted training feature words.
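The TF-IDF weighting mentioned above can be illustrated with a short standard-library sketch. It assumes whitespace tokenization (Chinese text would first need word segmentation) and uses the plain `tf * log(N / df)` form; the function name `tfidf_vectors` and the sample documents are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

def tfidf_vectors(docs: List[str]) -> Tuple[List[str], Dict[str, float], List[List[float]]]:
    """Return (vocabulary, idf per word, one TF-IDF feature vector per document)."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    # Document frequency: in how many documents each word appears.
    df = Counter(w for toks in tokenized for w in set(toks))
    n = len(docs)
    idf = {w: math.log(n / df[w]) for w in vocab}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        total = len(toks) or 1           # guard against empty documents
        vectors.append([tf[w] / total * idf[w] for w in vocab])
    return vocab, idf, vectors

vocab, idf, vectors = tfidf_vectors(
    ["bank raises rates", "team wins match", "bank wins award"])
```

Words that occur in many documents (here "bank" and "wins") receive a lower weight than words confined to a single document, which is what makes the resulting feature vectors discriminative.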
In order to further improve the classification accuracy of the multi-stage classification method, in the present embodiment the classification algorithms are preferably machine learning algorithms, which specifically include, but are not limited to, support vector machine algorithms, convolutional neural network algorithms, recurrent neural network algorithms, and the like. Different algorithms have different advantages and disadvantages, so the invention does not limit the specific machine learning algorithm adopted by each node, and those skilled in the art can choose the algorithm according to its actual application effect.
Step S420: configuring a corresponding multi-stage classification model according to the number and types of classifiers corresponding to each level of classification.
In this embodiment, a corresponding multi-level classification model is configured according to the number and types of classifiers corresponding to each level of classification determined in step S410, and a configuration file corresponding to the multi-level classification model is generated. When the output result of the current node classifier is obtained, the next-level node classifier of the current node classifier is determined by querying the configuration file. The configuration file stores a plurality of configuration items corresponding to the node classifiers; specifically, each configuration item includes description information of the corresponding node classifier, the classification type to which the node classifier is adapted, and/or the correspondence between each output result of the node classifier and its next-level node classifier. In this way, the most suitable classifier can be selected automatically, through the configuration file, from among the classifiers of the next level for the further classification operation.
In this embodiment, the multi-level classification model is a tree-like classification model including multi-level node classifiers. The model includes a plurality of node classifiers of different types, for example a root node classifier, leaf node classifiers and intermediate node classifiers, where there are usually a plurality of leaf node classifiers and intermediate node classifiers. The relationship between node classifiers and sub-classifications may be one-to-one, in which one node classifier corresponds to only one sub-classification. It may also be many-to-one, in which a plurality of node classifiers correspond to the same sub-classification; in this case, node classifiers of different types can be further selected, according to factors such as the news text information type, to perform identification for that sub-classification. It may also be one-to-many, in which one node classifier corresponds to a plurality of sub-classifications; in this case the classification rules of those sub-classifications are usually similar, so the same node classifier can be used for identification. In addition, there is usually one root node classifier, but a plurality of root node classifiers of different types can also be adopted so as to suit different news text information types. The above description of the multi-level classification model structure is only an example and does not limit the structure of the multi-level classification model of the present invention; those skilled in the art may adopt other suitable structures according to the actual situation.
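The many-to-one case can be pictured as a small registry that maps one sub-classification to several candidate node classifiers keyed by text type. The sketch below is illustrative only: the length-based rule in `infer_text_type`, the two keyword classifiers and the registry layout are assumptions, since the patent does not say how the text type is determined.

```python
from typing import Callable, Dict, Tuple

def infer_text_type(text: str) -> str:
    """Hypothetical rule: short inputs are headlines, longer ones are articles."""
    return "headline" if len(text.split()) <= 12 else "article"

def classify_headline(text: str) -> str:
    return "football" if "goal" in text.lower() else "other"

def classify_article(text: str) -> str:
    words = text.lower().split()
    return "football" if words.count("goal") + words.count("penalty") >= 2 else "other"

# Many-to-one: two node classifiers registered for the same sub-classification.
CANDIDATES: Dict[str, Dict[str, Callable[[str], str]]] = {
    "sports": {"headline": classify_headline, "article": classify_article},
}

def classify_sub(sub_classification: str, text: str) -> Tuple[str, str]:
    """Pick the node classifier suited to the text type, then run it."""
    text_type = infer_text_type(text)
    return text_type, CANDIDATES[sub_classification][text_type](text)

short_result = classify_sub("sports", "Late goal wins derby")
long_result = classify_sub("sports", "the " * 15 + "goal penalty")
```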
Step S430: evaluating the training result, modifying the number and types of classifiers corresponding to each level of classification according to the evaluation result, and updating the configured multi-stage classification model according to the modification result.
To further improve the accuracy of the number and types of classifiers determined in step S410, an evaluation step, namely step S430, may be added. The training result of step S410 is evaluated against a preset verification set, and the number and types of classifiers corresponding to each level of classification determined in step S410 are modified according to the evaluation result, so that the determined classifiers are better suited to the level of classification at which they are located. The modification includes deletion, addition and/or replacement of classifiers. The verification set is a small portion of the labeled data that does not participate in model training and is used solely to evaluate which of the different trained models performs better.
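One way to turn the evaluation result into concrete modifications is sketched below. It compares each node's current classifier with a newly trained candidate on that node's verification samples and emits "add" or "replace" actions. The `min_gain` threshold, the function names and the sample data are hypothetical, and deletion is omitted for brevity.

```python
from typing import Callable, Dict, List, Tuple

Predictor = Callable[[str], str]
Sample = Tuple[str, str]                 # (text, label)

def accuracy(predict: Predictor, samples: List[Sample]) -> float:
    return sum(predict(t) == y for t, y in samples) / len(samples)

def plan_modifications(current: Dict[str, Predictor],
                       candidates: Dict[str, Predictor],
                       verification: Dict[str, List[Sample]],
                       min_gain: float = 0.05) -> List[Tuple[str, str]]:
    """Return (action, node) pairs; min_gain is an assumed improvement threshold."""
    plan: List[Tuple[str, str]] = []
    for node, samples in verification.items():
        candidate = candidates.get(node)
        if candidate is None:
            continue
        if node not in current:
            plan.append(("add", node))
        elif accuracy(candidate, samples) - accuracy(current[node], samples) >= min_gain:
            plan.append(("replace", node))
    return plan

# Verification samples are held out from training, one list per node.
verification = {
    "root": [("goal scored", "sports"), ("rates cut", "finance")],
    "sports_node": [("goal scored", "football")],
}
current = {"root": lambda t: "finance"}
candidates = {
    "root": lambda t: "sports" if "goal" in t else "finance",
    "sports_node": lambda t: "football",
}
plan = plan_modifications(current, candidates, verification)
```

The resulting plan can then be applied to the configuration file so that the multi-level classification model is updated to match.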
Step S430 not only assists step S410 in determining suitable classifiers, but can also, during subsequent operation, repeatedly try newly added sample sets and newly adopted classification algorithms and evaluate the result of each attempt, thereby determining better classifiers. The present invention does not specifically limit the evaluation method adopted in step S430, and those skilled in the art can set it flexibly according to actual conditions.
In order to achieve the optimal news text information identification effect, step S430 continuously modifies the number and types of classifiers determined in step S410, and meanwhile, correspondingly updates the configured multi-level classification model and the configuration file corresponding to the model, and performs matching update on the multi-level classification model according to the updated configuration file.
In order to improve the overall operating efficiency of the method, the updating operation performed on the multi-level classification model in step S430 may be a hot-switching update; that is, when a new model performs better than the model currently online, the multi-level classification model used by the system can be quickly replaced through a hot-switching operation without shutting down the system. To support the hot-switching operation, the configuration file generated in step S420 may include a plurality of pieces of metadata respectively corresponding to different classification models. Each piece of metadata records the path and description information (for example, the model type) of the corresponding classification model, and is updated synchronously when the classification model is updated, so that a hot-switching update of the model can be completed automatically according to the content recorded in the metadata.
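A hot-switching update can be approximated by keeping the live model behind a holder whose reference is swapped under a lock, so that requests in flight finish on the old model while new requests see the new one. The sketch below is one possible arrangement, not the patent's implementation; the class names, the metadata fields and the file paths are hypothetical.

```python
import threading
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class ModelMeta:
    path: str          # where the serialized model lives (hypothetical path)
    model_type: str    # description information, e.g. the algorithm family

class HotSwappableModel:
    """Holds the live model and its metadata; swap() replaces both without a restart."""

    def __init__(self, predict: Callable[[str], str], meta: ModelMeta) -> None:
        self._lock = threading.Lock()
        self._predict = predict
        self._meta = meta

    def predict(self, text: str) -> str:
        with self._lock:                 # take a consistent reference, then release
            predict = self._predict
        return predict(text)

    def swap(self, predict: Callable[[str], str], meta: ModelMeta) -> None:
        with self._lock:                 # model and metadata change together
            self._predict, self._meta = predict, meta

    @property
    def meta(self) -> ModelMeta:
        with self._lock:
            return self._meta

holder = HotSwappableModel(lambda t: "v1", ModelMeta("models/v1.bin", "svm"))
before = holder.predict("some news text")
holder.swap(lambda t: "v2", ModelMeta("models/v2.bin", "cnn"))
after = holder.predict("some news text")
```

Updating the metadata in the same critical section as the model mirrors the requirement that the metadata be updated synchronously with the classification model.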
Step S440: inputting the obtained news text information to be classified into a multi-stage classification model for classification, and determining the output result of the multi-stage classification model as the final classification result of the news text information to be classified.
The news text information to be classified is generally a complete paragraph or article and cannot be input directly into the multi-stage classification model for identification. Therefore, before it is input into the multi-stage classification model, a series of preprocessing operations needs to be performed to convert the news text information to be classified into a form that the multi-stage classification model can process. Common preprocessing operations include extracting the file feature words contained in the news text information to be classified, giving corresponding weights to the extracted file feature words, and generating corresponding file feature vectors according to the extracted file feature words and their weights. The rules for extracting the file feature words and assigning the corresponding weights may be consistent with the rules for the similar operations in step S410, and are not described again here.
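The preprocessing of a single incoming document can be sketched as a lookup against the dictionary and weights produced during training. The example assumes whitespace tokenization, a dictionary whose indices run contiguously from 0, and a term-frequency-times-weight scheme; `vectorize`, `DICTIONARY` and `WEIGHTS` are hypothetical names.

```python
from collections import Counter
from typing import Dict, List

def vectorize(text: str, dictionary: Dict[str, int], weights: Dict[str, float]) -> List[float]:
    """Map a document onto the training dictionary; unknown words are ignored."""
    tokens = [t for t in text.lower().split() if t in dictionary]
    counts = Counter(tokens)
    total = len(tokens) or 1             # avoid division by zero when nothing matches
    vector = [0.0] * len(dictionary)
    for word, count in counts.items():
        vector[dictionary[word]] = count / total * weights.get(word, 1.0)
    return vector

# Hypothetical dictionary (feature word -> vector index) and per-word weights.
DICTIONARY = {"bank": 0, "goal": 1, "rates": 2}
WEIGHTS = {"bank": 0.5, "goal": 2.0, "rates": 1.0}

features = vectorize("Bank cuts rates again", DICTIONARY, WEIGHTS)
```

Reusing the training-time dictionary keeps the feature positions of new documents aligned with those the classifiers were trained on.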
In addition, in practical application, the sources of the text information of the news to be classified are various, and therefore, a series of standardized processing needs to be performed on the text information of the news to be classified, so that the subsequent preprocessing operation is facilitated. The common normalization processing includes adjusting fonts in the news text information to be classified according to preset font setting rules and/or filtering vocabularies in the news text information to be classified according to preset filtering rules.
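A minimal sketch of such standardization follows. It reads "adjusting fonts" as normalizing character forms (for example full-width to half-width) via Unicode NFKC, which is an assumption about the intended meaning, and it implements the filtering rule as a simple blocked-word set; `standardize` and the sample inputs are hypothetical.

```python
import unicodedata
from typing import Set

def standardize(text: str, blocked: Set[str]) -> str:
    """Normalize character forms, then drop words that match the filtering rule."""
    # NFKC folds compatibility forms such as full-width letters and digits.
    normalized = unicodedata.normalize("NFKC", text)
    kept = [w for w in normalized.split() if w.lower() not in blocked]
    return " ".join(kept)

# Full-width "News123" written with escapes so the example is encoding-independent.
raw = "\uff2e\uff45\uff57\uff53\uff11\uff12\uff13 click here now"
cleaned = standardize(raw, {"click", "here"})
```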
Therefore, by constructing a multi-level classification framework for news text information and configuring different classifiers at each level, the multi-level classification method based on news text information provided by the invention specifically addresses the problem of inaccurate classification results caused by unbalanced sample data, and effectively improves both the accuracy and the efficiency of news text information classification. In addition, the multi-stage classification method classifies news text information using machine learning algorithms, and corrects the classification models in real time through an evaluation mechanism and a model updating mechanism with a hot-switching function, so that the method can remain in its best implementation state. Meanwhile, through the preprocessing and standardization operations, the method can identify news text information to be classified of different types and from different sources, which further improves the adaptability of the method and widens its application range.
The algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose systems may also be used with the teachings herein. The required structure for constructing such a system will be apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that: that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features that are included in other embodiments but not others, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of a multi-level classification system based on news text information according to embodiments of the present invention. The present invention may also be embodied as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present invention may be stored on computer-readable media or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. does not indicate any ordering. These words may be interpreted as names.

Claims (16)

1. A multi-level classification system based on news text information, comprising:
the training module is used for training a preset training sample set by various machine learning algorithms aiming at all levels of classification of news text information and determining the number and types of classifiers corresponding to all levels of classification according to training results;
the multi-stage classification module is used for configuring a corresponding multi-stage classification model according to the number and the types of classifiers corresponding to the classes determined by the training module;
the result determining module is used for inputting the obtained news text information to be classified into the multistage classification model for classification, and determining the output result of the multistage classification model as the final classification result of the news text information to be classified;
the multi-stage classification model comprises a plurality of node classifiers of different types, a many-to-one relation exists between the node classifiers corresponding to the same sub-classification, and when classification is carried out under the condition of the many-to-one relation, the node classifiers of different types are selected according to the news text information type to carry out sub-classification for identification;
wherein the system further comprises:
an evaluation module, configured to evaluate the training result of the training module, and modify the number and types of classifiers corresponding to the classes according to the evaluation result, where the modifying includes: deletion, addition and/or replacement of classifiers;
and the model updating module is used for updating the configured multi-stage classification model according to the modification of the evaluation module.
2. The system of claim 1, wherein the multi-level classification module is further to: generating a configuration file corresponding to the multi-level classification model, and the model update module is further to: and updating the configuration file, and updating the multistage classification model according to the updated configuration file.
3. The system of claim 1, wherein the multi-level classification model is a tree-like classification model comprising multi-level node classifiers.
4. The system of claim 3, wherein the multi-level classification module is further to: when an output result of the current node classifier is obtained, determining a next-stage node classifier of the current node classifier by inquiring a configuration file corresponding to the multi-stage classification model;
wherein, the configuration file stores: a plurality of configuration items respectively corresponding to the node classifiers, each configuration item comprising: the description information of the corresponding node classifier, the classification type adapted to the node classifier, and/or the corresponding relationship between each output result of the node classifier and the next-stage node classifier thereof.
5. The system of any of claims 1-4, wherein the training module is specifically configured to:
generating the training sample set according to the acquired labeling data, extracting training feature words contained in the training sample set, and giving corresponding weights to the extracted training feature words;
and generating a corresponding training feature vector according to the extracted training feature words and the weights thereof, and obtaining a training result and a corresponding classifier according to the training feature vector.
6. The system of any of claims 1-4, wherein the result determination module is specifically configured to: preprocessing the acquired news text information to be classified, and inputting the preprocessed news text information to be classified into the multistage classification model for classification;
wherein the pre-processing comprises: extracting the file feature words contained in the news text information to be classified, and giving corresponding weights to the extracted file feature words; and generating a corresponding file feature vector according to the extracted file feature words and the weights thereof.
7. The system of claim 6, wherein the result determination module, prior to preprocessing, is further to: and adjusting fonts in the news text information to be classified according to a preset font setting rule, and/or filtering vocabularies in the news text information to be classified according to a preset filtering rule.
8. The system of any of claims 1-4, wherein the plurality of machine learning algorithms comprises at least one of: support vector machine algorithms, convolutional neural network algorithms, and recurrent neural network algorithms.
9. A multi-level classification method based on news text information comprises the following steps:
aiming at each level of classification of news text information, training a preset training sample set through various machine learning algorithms, and determining the number and types of classifiers corresponding to each level of classification according to training results;
configuring corresponding multi-stage classification models according to the number and types of classifiers corresponding to the various stages of classifications;
inputting the obtained news text information to be classified into the multistage classification model for classification, and determining the output result of the multistage classification model as the final classification result of the news text information to be classified;
the multi-stage classification model comprises a plurality of node classifiers of different types, a many-to-one relation exists between the node classifiers corresponding to the same sub-classification, and when classification is carried out under the condition of the many-to-one relation, the node classifiers of different types are selected according to the news text information type to carry out sub-classification for identification;
wherein the method further comprises:
evaluating the training result, modifying the number and the type of classifiers corresponding to each grade of classification according to the evaluation result, and updating the configured multistage classification model according to the modification result; wherein the modifying comprises: deletion, addition, and/or replacement of classifiers.
10. The method of claim 9, wherein the step of configuring the corresponding multi-level classification model according to the number and types of classifiers corresponding to the respective levels of classification further comprises: generating a configuration file corresponding to the multi-level classification model, and updating the configured multi-level classification model according to the modification result further comprises: and updating the configuration file, and updating the multistage classification model according to the updated configuration file.
11. The method of claim 9, wherein the multi-level classification model is a tree-like classification model comprising multi-level node classifiers.
12. The method of claim 11, wherein the step of configuring the corresponding multi-level classification model according to the number and types of classifiers corresponding to the respective levels of classification further comprises: when an output result of the current node classifier is obtained, determining a next-stage node classifier of the current node classifier by inquiring a configuration file corresponding to the multi-stage classification model;
wherein, the configuration file stores: a plurality of configuration items respectively corresponding to the node classifiers, each configuration item comprising: the description information of the corresponding node classifier, the classification type adapted to the node classifier, and/or the corresponding relationship between each output result of the node classifier and the next-stage node classifier thereof.
13. The method according to any one of claims 9 to 12, wherein the step of training a preset training sample set by using a plurality of machine learning algorithms for each class of news text information, and determining the number and types of classifiers corresponding to each class according to the training results specifically comprises:
generating the training sample set according to the acquired labeling data, extracting training feature words contained in the training sample set, and giving corresponding weights to the extracted training feature words;
and generating a corresponding training feature vector according to the extracted training feature words and the weights thereof, and obtaining a training result and a corresponding classifier according to the training feature vector.
14. The method according to any one of claims 9 to 12, wherein the step of inputting the obtained news text information to be classified into the multistage classification model for classification, and determining the output result of the multistage classification model as the final classification result of the news text information to be classified specifically includes:
preprocessing the acquired news text information to be classified, and inputting the preprocessed news text information to be classified into the multistage classification model for classification;
wherein the pre-processing comprises: extracting the file feature words contained in the news text information to be classified, and giving corresponding weights to the extracted file feature words; and generating a corresponding file feature vector according to the extracted file feature words and the weights thereof.
15. The method of claim 14, wherein prior to the pre-processing further comprises: and adjusting fonts in the news text information to be classified according to a preset font setting rule, and/or filtering vocabularies in the news text information to be classified according to a preset filtering rule.
16. The method of any of claims 9-12, wherein the plurality of machine learning algorithms comprises at least one of: support vector machine algorithms, convolutional neural network algorithms, and recurrent neural network algorithms.
CN201710103541.0A 2017-02-24 2017-02-24 Multi-level classification system and method based on news text information Active CN106909654B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710103541.0A CN106909654B (en) 2017-02-24 2017-02-24 Multi-level classification system and method based on news text information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710103541.0A CN106909654B (en) 2017-02-24 2017-02-24 Multi-level classification system and method based on news text information

Publications (2)

Publication Number Publication Date
CN106909654A (en) 2017-06-30
CN106909654B (en) 2020-07-21

Family

ID=59208413

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710103541.0A Active CN106909654B (en) 2017-02-24 2017-02-24 Multi-level classification system and method based on news text information

Country Status (1)

Country Link
CN (1) CN106909654B (en)

Families Citing this family (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107402994B (en) * 2017-07-17 2021-01-19 云润大数据服务有限公司 Method and device for classifying multi-group hierarchical division
CN107562880A (en) * 2017-09-01 2018-01-09 北京神州泰岳软件股份有限公司 A kind of classification results screening technique and device based on multistage classifier
CN110019776B (en) * 2017-09-05 2023-04-28 腾讯科技(北京)有限公司 Article classification method and device and storage medium
CN108073677B (en) * 2017-11-02 2021-12-28 中国科学院信息工程研究所 Multi-level text multi-label classification method and system based on artificial intelligence
CN107943940A (en) * 2017-11-23 2018-04-20 网易(杭州)网络有限公司 Data processing method, medium, system and electronic equipment
CN108710651B (en) * 2018-05-08 2022-03-25 华南理工大学 Automatic classification method for large-scale customer complaint data
CN110781292A (en) * 2018-07-25 2020-02-11 百度在线网络技术(北京)有限公司 Text data multi-level classification method and device, electronic equipment and storage medium
CN109165380B (en) * 2018-07-26 2022-07-01 咪咕数字传媒有限公司 Neural network model training method and device and text label determining method and device
CN109189950B (en) * 2018-09-03 2023-04-07 腾讯科技(深圳)有限公司 Multimedia resource classification method and device, computer equipment and storage medium
CN109471938B (en) * 2018-10-11 2023-06-16 平安科技(深圳)有限公司 Text classification method and terminal
CN109960725A (en) * 2019-01-17 2019-07-02 平安科技(深圳)有限公司 Text classification processing method, device and computer equipment based on emotion
CN112052331A (en) * 2019-06-06 2020-12-08 武汉Tcl集团工业研究院有限公司 Method and terminal for processing text information
CN110633366B (en) * 2019-07-31 2022-12-16 国家计算机网络与信息安全管理中心 Short text classification method, device and storage medium
CN110442725B (en) * 2019-08-14 2022-02-25 科大讯飞股份有限公司 Entity relationship extraction method and device
CN110597985A (en) * 2019-08-15 2019-12-20 重庆金融资产交易所有限责任公司 Data classification method, device, terminal and medium based on data analysis
CN113139558B (en) * 2020-01-16 2023-09-05 北京京东振世信息技术有限公司 Method and device for determining multi-stage classification labels of articles
CN111625644B (en) * 2020-04-14 2023-09-12 北京捷通华声科技股份有限公司 Text classification method and device
CN111753197B (en) * 2020-06-18 2024-04-05 达观数据有限公司 News element extraction method, device, computer equipment and storage medium
CN112507121B (en) * 2020-12-01 2023-06-30 平安科技(深圳)有限公司 Customer service violation quality inspection method and device, computer equipment and storage medium
CN113254645B (en) * 2021-06-08 2021-09-28 南京冰鉴信息科技有限公司 Text classification method and device, computer equipment and readable storage medium
CN116777400B (en) * 2023-08-21 2023-10-31 江苏海外集团国际工程咨询有限公司 Engineering consultation information whole-flow management system and method based on deep learning

Citations (1)

Publication number Priority date Publication date Assignee Title
CN104978328A (en) * 2014-04-03 2015-10-14 北京奇虎科技有限公司 Hierarchical classifier obtaining method, text classification method, hierarchical classifier obtaining device and text classification device

Family Cites Families (7)

Publication number Priority date Publication date Assignee Title
CN100565523C (en) * 2007-04-05 2009-12-02 中国科学院自动化研究所 A kind of filtering sensitive web page method and system based on multiple Classifiers Combination
CN102117411B (en) * 2009-12-30 2015-03-11 日电(中国)有限公司 Method and system for constructing multi-level classification model
CN102193928B (en) * 2010-03-08 2013-04-03 三星电子(中国)研发中心 Method for matching lightweight ontologies based on multilayer text categorizer
CN103324758B (en) * 2013-07-10 2017-07-14 苏州大学 A kind of news category method and system
CN103426007B (en) * 2013-08-29 2016-12-28 人民搜索网络股份公司 A kind of machine learning classification method and device
CN103778569A (en) * 2014-02-13 2014-05-07 上海交通大学 Distributed generation island detection method based on meta learning
CN106453033B (en) * 2016-08-31 2019-03-15 电子科技大学 Multi-level process for sorting mailings based on Mail Contents

Non-Patent Citations (2)

Title
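A multi-text-classifier combination model based on Boost and belief functions; Wang Aihua et al.; Computer Engineering and Applications; 2002-02-28; p. 52, Section 2.1 *
Systematic Study of Machine Learning: Combining Multiple Classifiers; Eason.wxd; CSDN blog https://blog.csdn.net/app_12062011/article/details/50424776; 2015-12-29; from the fourth-to-last line of page 1 to the end of the main text *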

Also Published As

Publication number Publication date
CN106909654A (en) 2017-06-30

Similar Documents

Publication Publication Date Title
CN106909654B (en) Multi-level classification system and method based on news text information
CN109471938B (en) Text classification method and terminal
WO2016179938A1 (en) Method and device for question recommendation
KR20200007969A (en) Information processing methods, terminals, and computer storage media
US20190102655A1 (en) Training data acquisition method and device, server and storage medium
US11349680B2 (en) Method and apparatus for pushing information based on artificial intelligence
Halibas et al. Application of text classification and clustering of Twitter data for business analytics
CN109598307B (en) Data screening method and device, server and storage medium
CN110472043B (en) Clustering method and device for comment text
CN104361037B (en) Microblogging sorting technique and device
US10387805B2 (en) System and method for ranking news feeds
CN110019779B (en) Text classification method, model training method and device
CN109558482B (en) Parallelization method of text clustering model PW-LDA based on Spark framework
CN111159404B (en) Text classification method and device
US20190026650A1 (en) Bootstrapping multiple varieties of ground truth for a cognitive system
CN106445908A (en) Text identification method and apparatus
CN104035955B (en) searching method and device
CN108717459B (en) A kind of mobile application defect positioning method of user oriented comment information
CN110717090A (en) Network public praise evaluation method and system for scenic spots and electronic equipment
CN110990563A (en) Artificial intelligence-based traditional culture material library construction method and system
CN104899310B (en) Information sorting method, the method and device for generating information sorting model
CN112685374B (en) Log classification method and device and electronic equipment
US20130054553A1 (en) Method and apparatus for automatically extracting information of products
CN108491423B (en) Sorting method and device
CN113641823B (en) Text classification model training, text classification method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: 100089 710, 7 / F, building 1, zone 1, No.3, Xisanhuan North Road, Haidian District, Beijing

Patentee after: Beijing time Ltd.

Address before: 100089 710, 7 / F, building 1, zone 1, No.3, Xisanhuan North Road, Haidian District, Beijing

Patentee before: BEIJING TIME Co.,Ltd.
