CN110298032B - Text classification corpus labeling training system - Google Patents

Text classification corpus labeling training system

Info

Publication number
CN110298032B
CN110298032B
Authority
CN
China
Prior art keywords
classification
labeling
model
text
corpus
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910455049.9A
Other languages
Chinese (zh)
Other versions
CN110298032A (en
Inventor
崔莹
代翔
王侃
丁洪丽
杨露
陈涛
余博
王日冬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southwest Electronic Technology Institute No 10 Institute of Cetc
Original Assignee
Southwest Electronic Technology Institute No 10 Institute of Cetc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southwest Electronic Technology Institute No 10 Institute of Cetc filed Critical Southwest Electronic Technology Institute No 10 Institute of Cetc
Priority to CN201910455049.9A priority Critical patent/CN110298032B/en
Publication of CN110298032A publication Critical patent/CN110298032A/en
Application granted granted Critical
Publication of CN110298032B publication Critical patent/CN110298032B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/217Validation; Performance evaluation; Active pattern learning techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text classification corpus labeling training system, and aims to provide a semi-automatic labeling training device that reduces the repetitiveness of manual labeling and improves the accuracy of pre-labeling results. The invention is realized by the following technical scheme: a text classification corpus labeling preparation module counts text word frequencies and removes noise information from the text; a semi-automatic text corpus classification labeling module selects among CNN, KNN, ANN and deep learning algorithms for the classification labeling task, converts unstructured and semi-structured text into a vector space model, generates the word vector space of the text, and extracts features reflecting the document topics; after a labeling task is completed, a feedback-type classification labeling model learning and training module feeds the results back to refine and update the classification labeling model; a text classification labeling model effect evaluation module quantifies evaluation indexes based on classification index rules, establishes a comprehensive evaluation model for labeling algorithms, analyzes test results, evaluates classification results, and quantitatively assesses the labeling effect from the model indexes.

Description

Text classification corpus labeling training system
Technical Field
The invention relates to the technical field of text mining, in particular to a text classification corpus semi-automatic labeling training system.
Background
Labeling large volumes of corpus data has an important influence on the training of algorithm models. As basic work in the big data analysis process, it mainly supports routine research and development, algorithm tuning and demonstration/verification of big data applications, and is the core foundation of big data mining and analysis. Text classification assigns a large number of text documents to one category or a group of categories, so that each category represents a different conceptual topic; it also involves feature extraction (feature vector representation) and feature selection (dimension reduction) of the documents. Text classification is an important basic tool in information extraction, question-answering systems, machine translation and Semantic Web applications, and plays an important role in putting natural language processing technology into practical use. At present, classified corpora in this field are relatively scarce, and the labeling of classified corpora is mainly done manually, which leads to widespread problems such as poor labeling quality, a tedious labeling process, low labeling efficiency and high human resource costs. Meanwhile, existing classified corpus labeling tools suffer from drawbacks such as supporting only a single labeling method and being unable to update the labeling model automatically. A semi-automatic classification labeling and training platform that assists manual corpus labeling is therefore urgently needed.
Text preprocessing: in a text classification experiment, the choice of corpus is important and may influence the final classification result. Because corpora suffer from inconsistent storage formats, incomplete documents, duplicate documents and similar problems, the corpus must be preprocessed to remove noise and normalize the content so that it meets the data input requirements of text classification; this improves classification performance and prevents such problems from affecting the later stages of the system. Chinese word segmentation is an important step in text preprocessing. Words are the smallest language units that can be used independently, and unlike English, Chinese has no spaces between words, so the first problem facing computer processing is automatic Chinese word segmentation. In brief, automatic Chinese word segmentation means cutting a sequence of Chinese characters into individual words, i.e., letting the computer automatically insert spaces or other boundary marks between the words of a text. The main difficulties are word segmentation standards, ambiguity resolution and the recognition of out-of-vocabulary words. Existing automatic Chinese word segmentation methods fall roughly into three categories. The mechanical (string-matching) segmentation method works from a segmentation word list according to a string matching principle. Depending on the matching direction, it can be divided into forward matching and reverse matching; depending on whether long or short words are preferred, into maximum matching and minimum matching; and depending on the re-cutting strategy when matching fails, into character-adding and character-removing methods. The maximum matching method segments text against a given word list according to the basic principle of "longest word first", and can be further divided into forward maximum matching and reverse maximum matching. Based on years of research in statistical learning theory, Vapnik et al. proposed another optimization criterion for designing linear classifiers. The principle starts from linear separation and is then extended to the linearly inseparable case, and even to the use of nonlinear functions; such classifiers are known as Support Vector Machines (SVMs). The support vector machine has a deep theoretical foundation and is a comparatively recent method. It analyzes the linearly separable case directly; for the linearly inseparable case, a nonlinear mapping transforms the linearly inseparable samples of the low-dimensional input space into a high-dimensional feature space in which they become linearly separable. The SVM method maps the sample space into a high-dimensional or even infinite-dimensional feature space (a Hilbert space) through a nonlinear mapping, so that a nonlinearly separable problem in the original sample space becomes a linearly separable problem in the feature space. Simply stated, it raises the dimension and then linearizes.
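For illustration, the following is a minimal Python sketch of the forward maximum matching segmentation described above ("longest word first" against a word list). The toy dictionary and the maximum word length of four characters are assumptions for the example, not part of the patent.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Segment `text` greedily, always preferring the longest dictionary word."""
    words = []
    i = 0
    while i < len(text):
        matched = None
        # Try the longest candidate first ("long word first"), then shrink.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                matched = candidate
                break
        if matched is None:
            # Unknown character: emit it as a single-character word.
            matched = text[i]
        words.append(matched)
        i += len(matched)
    return words

if __name__ == "__main__":
    demo_dict = {"文本", "分类", "语料", "标注", "训练", "系统"}
    print(forward_max_match("文本分类语料标注训练系统", demo_dict))
    # -> ['文本', '分类', '语料', '标注', '训练', '系统']
```

Reverse maximum matching differs only in that the window is slid from the end of the string toward the beginning.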
Raising the dimension, that is, mapping samples to a high-dimensional space, generally increases computational complexity and can even cause the "curse of dimensionality", so it is usually avoided. However, for classification and regression problems, a sample set that cannot be handled linearly in a low-dimensional sample space may become linearly separable (or regressable) by a hyperplane in a high-dimensional feature space. Raising the dimension normally makes computation more complex, and the SVM method resolves this difficulty ingeniously. The statistics-based word segmentation method works as follows: Chinese words are formed by combining characters, and the more frequently adjacent characters co-occur in text, the more likely they are to form a word. The co-occurrence frequency of adjacent characters in the corpus can therefore be counted; the higher the frequency, the more credible the word formation. The statistical models used for statistics-based word segmentation mainly include mutual information, N-gram models, neural network models, hidden Markov models and maximum entropy models; these models mainly use the joint occurrence probability of characters as the basis for segmentation. The advantage of the statistics-based method is that it needs no segmentation dictionary and is not limited by the domain of the processed text. However, it often extracts frequently co-occurring character combinations that are not actually words, such as "one", "my" and "some". Therefore, a practical statistical word segmentation system still segments against a common segmentation dictionary by string matching, while using the statistical method to recognize new words. Combining character statistics with string matching exploits the speed and efficiency of dictionary matching while also making use of the statistical method's ability to recognize new words from context and to resolve ambiguity automatically. However, the ambiguity resolution depends heavily on the accuracy of the statistical language model and the decision algorithm, requires a large amount of labeled corpus, and slows segmentation down because the search space grows. The more documents a word appears in, the weaker its ability to distinguish between documents. Weights are generally computed from the statistical information of the text, mainly the word frequency. The absolute word frequency method cannot reflect the discriminative power of low-frequency feature items: some feature items occur very frequently but classify very weakly, such as many common words, while others occur less frequently but classify very well. The weight is therefore proportional to the frequency of the feature item in the document (the more often it appears in a document, the more important it is) and inversely proportional to the number of documents in the corpus that contain it (the more documents it appears in, the less important it is).
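The weighting just described, proportional to a term's frequency in the document and inversely proportional to the number of documents containing it, is the familiar TF-IDF scheme. A small sketch follows; the toy pre-segmented corpus and the common log(N/df) form of the inverse document frequency are assumptions for the example.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """corpus: list of documents, each a list of already-segmented words."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(doc))           # count each term once per document
    weights = []
    for doc in corpus:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in tf.items()
        })
    return weights

if __name__ == "__main__":
    corpus = [["文本", "分类", "分类"], ["文本", "标注"], ["训练", "标注"]]
    for w in tf_idf(corpus):
        print(w)
```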
The text classification method is the core of a text classification system. Many statistical learning and machine learning algorithms are now widely applied to text classification, and techniques based on statistics and machine learning have become mainstream. Common classification algorithms include naive Bayes, the k-nearest neighbor method, decision trees, support vector machines, neural networks, the Rocchio method, association rules and combined classifiers. The decision tree classifier is a common, simple and widely used rule-based classification method that is also suitable for text classification. During decision tree classification, the data is divided into branches according to a tree structure; each branch captures the commonalities of a data category, and useful information is extracted from each branch to form a rule. There are various decision tree algorithms, mainly including the information-gain-based heuristic algorithm ID3, the information-gain-ratio-based algorithm C4.5 that also handles continuous attributes, the Gini-coefficient-based algorithm CART, the scalable algorithm SLIQ for large sample sets, and the parallelizable algorithm SPRINT. A key characteristic of text classification is the very large number of attributes, which makes the decision tree structure extremely complex and limits the application of decision trees to large-scale text classification. Decision tree construction has two main steps: generation, in which the tree is built from training data, and pruning, in which the generated tree is checked, corrected and revised. The KNN algorithm (k-nearest neighbor classification) is a supervised learning classification algorithm; the k-nearest neighbor (k-NN) classifier is a classical classification method that makes predictions directly from specific training examples. Its principle is simple: given a test document to be classified, the system finds the k most similar known documents in the training set and then judges the category of the test document from the classification of those k documents, choosing the category to which most of the k neighboring documents belong. The k-nearest neighbor classifier is a lazy (passive) learning method and needs no model to be built, but because it predicts from local information it is very sensitive to noise and has high memory requirements, since the algorithm stores all training data. Because the similarity between the test text and every training text must be computed separately, classifying a single test sample with k-nearest neighbors is expensive. The choice of the parameter k is important: if k is too large, documents unrelated to the test document may be included, and the added noise reduces classification accuracy.
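As an illustration of the k-nearest neighbor behavior just described (no model is built; every test vector is compared against all training vectors and the majority class of the k most similar neighbors wins), a minimal sketch follows. Cosine similarity over pre-computed feature vectors and the toy data are assumptions for the example.

```python
import numpy as np
from collections import Counter

def cosine_similarity(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b) / denom if denom else 0.0

def knn_predict(train_vectors, train_labels, test_vector, k=3):
    sims = [cosine_similarity(test_vector, v) for v in train_vectors]
    top_k = np.argsort(sims)[-k:]                 # indices of the k most similar documents
    votes = Counter(train_labels[i] for i in top_k)
    return votes.most_common(1)[0][0]             # majority class among the k neighbors

if __name__ == "__main__":
    X = np.array([[1.0, 0.0, 0.2], [0.9, 0.1, 0.0], [0.0, 1.0, 0.8], [0.1, 0.9, 1.0]])
    y = ["sports", "sports", "finance", "finance"]
    print(knn_predict(X, y, np.array([0.05, 0.95, 0.9]), k=3))   # -> "finance"
```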
Conversely, if k is too small, the characteristics of the test document cannot be fully captured, which also hurts classification accuracy. A naive Bayes classifier is a classification method based on the Bayes probability formula. Support vector machine classifiers have become a classification technique of great current interest. Based on statistical learning theory and the principle of structural risk minimization, they seek the best compromise between model complexity (i.e., the learning accuracy on the given training samples) and learning capacity (i.e., the ability to classify any sample without error) from the limited training sample information, so as to obtain the best generalization ability. The basic idea behind the support vector machine is to find a decision plane in the vector space that "best" separates the data points of the two categories; i.e., to find the decision plane with the largest class margin in the training set. The basic SVM algorithm addresses the two-class problem; to recognize multiple classes, the support vector machine must be extended by building several two-class classifiers. Performance evaluation of the classification result is a very important step after the text classifier has completed learning, training and testing. When analyzing machine-learned data sources, the most common knowledge discovery task is to assign data objects or events to predetermined categories and then carry out specialized processing according to the category; this is the basic task of a classification system. There are currently two main approaches to text classification: one based on knowledge-engineering (expert) systems and the other on statistical and/or machine learning techniques. The expert system approach encodes expert knowledge as rules, such as regular expressions, in the classification system. The machine learning approach is a generalized inductive process that builds a classifier by training on a set of pre-classified examples. Existing text classification methods need a large training corpus and achieve good results only when the corpus is large enough; the scale of the training corpus directly influences the classification result. However, manually labeling corpora at large scale is difficult. Automatic text classification clearly has practical significance today, when the amount of information is growing explosively, but it requires a large corpus, and obtaining one remains a difficult point in text classification.
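The naive Bayes and support vector machine classifiers mentioned above can be sketched with an off-the-shelf library. The choice of scikit-learn, the whitespace-joined pre-segmented toy documents and the TF-IDF vectorization are assumptions for illustration, not the patent's implementation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Pre-segmented documents joined with spaces so TfidfVectorizer can tokenize them.
train_docs = ["足球 比赛 进球", "篮球 比赛 冠军", "股票 上涨 市场", "基金 市场 收益"]
train_labels = ["sports", "sports", "finance", "finance"]

for clf in (MultinomialNB(), LinearSVC()):
    model = make_pipeline(TfidfVectorizer(), clf)   # vector space model + classifier
    model.fit(train_docs, train_labels)
    print(type(clf).__name__, model.predict(["市场 基金 上涨"]))   # -> ['finance']
```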
Disclosure of Invention
The invention aims to overcome the defects in the prior art and provide a semi-automatic corpus classification and labeling training system that reduces repetition in the manual labeling process, lowers labor costs and improves the accuracy of the pre-labeling result.
The above object of the present invention is achieved by the following measures. A text classification corpus labeling training system comprises a text classification corpus labeling preparation module, a semi-automatic text corpus classification labeling module, a feedback-type classification labeling model learning and training module and a text classification labeling model effect evaluation module. The text classification corpus labeling preparation module distinguishes data from different sources containing training texts and test corpus texts, counts the word frequency of the texts, preprocesses them to remove noise information, selects the source of the text classification corpus for different classification corpora, and provides a selectable, applicable labeling algorithm during the labeling process. For different labeling requirements and corpus characteristics, the semi-automatic text corpus classification labeling module selects among a convolutional neural network (CNN), random forest, the classification algorithm KNN, an artificial neural network (ANN), a support vector machine (SVM) and deep learning algorithms in the classification labeling task to complete automatic labeling; it uses a Chinese word segmenter to segment the text and remove stop words, converts unstructured and semi-structured texts into a structured vector space model, autonomously selects a suitable algorithm and carries out automatic labeling; using at least one of the text classification extraction algorithms (CNN, random forest, KNN, ANN, SVM and deep learning), it performs pre-labeling of the corpus data to be labeled with a single text classification method or with a fusion of multiple methods, generates the word vector space of the text, extracts document topic features, provides a unified text classification model access standard, and completes the classification labeling of the corpus texts. After a labeling task is completed, the feedback-type classification labeling model learning and training module trains a classifier for both internal and external labeling model algorithms, retrains the text classification model with the labeled mature corpus, feeds the results back to refine and update the classification labeling model, and, through continuous iteration between model updating and corpus labeling, adjusts automatically by feedback to complete new text classification labeling tasks. The text classification labeling model effect evaluation module constructs classification evaluation indexes according to a classification evaluation index standard, quantifies the evaluation indexes based on classification index rules, establishes a comprehensive evaluation model for the labeling algorithms, analyzes test results, evaluates classification results, quantitatively assesses the labeling effect from the model indexes, and automatically adapts the classification labeling algorithm model according to the classified corpora to be labeled in different tasks.
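One plausible shape for the "unified text classification model access standard" mentioned above is a common train/predict interface behind which every internal or external labeling algorithm is registered. The sketch below is an assumption of what such an interface could look like; the class and function names are illustrative, not the patent's actual code.

```python
from abc import ABC, abstractmethod

class TextClassifier(ABC):
    """Unified access interface assumed here: every internal or external
    labeling algorithm is wrapped behind the same two methods."""

    @abstractmethod
    def train(self, documents, labels):
        """Unified training entry point (the patent's Train interface)."""

    @abstractmethod
    def predict(self, documents):
        """Return one predicted category label per input document."""

# Illustrative registry mapping algorithm names to wrapped implementations;
# concrete CNN / random forest / KNN / ANN / SVM classes would subclass
# TextClassifier elsewhere and register themselves here.
MODEL_REGISTRY = {}

def register_model(name, classifier_cls):
    MODEL_REGISTRY[name] = classifier_cls
```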
Compared with the prior art, the invention has the following beneficial effects:
the repetition in the manual labeling process can be reduced and labor costs lowered. The invention adopts a system composed mainly of four modules (text classification corpus labeling preparation, semi-automatic text corpus classification labeling, feedback-type classification labeling model learning and training, and text classification labeling model effect evaluation), and can provide an automatic labeling mode based on autonomously selected, adapted algorithms and multi-algorithm fusion for different labeling requirements and corpus characteristics.
The text classification corpus labeling efficiency is high. The method manages the text classification corpus by distinguishing data from different sources; in the classification labeling task it selects among the CNN, random forest, KNN, ANN, SVM and deep learning text classification extraction algorithms to complete automatic labeling, offers a suitable text classification labeling algorithm for each kind of text classification corpus during labeling, performs pre-labeling of the corpus data to be labeled with a single classification method or with a fusion of multiple methods, and introduces a manual judgment step; the system supports real-time, automatic feedback adjustment of the background text classification algorithm model to complete new labeling tasks, which greatly shortens the time needed to obtain information, improves the efficiency of obtaining it, and greatly improves corpus labeling efficiency.
For different labeling requirements and corpus characteristics, a suitable algorithm is selected autonomously and automatic labeling is carried out; by integrating at least one text classification extraction algorithm among CNN, random forest, KNN (k-nearest neighbor), artificial neural network (ANN), SVM and deep learning, the text corpus data to be labeled is pre-labeled with a single algorithm model or with a fusion of multiple text classification methods, a unified text classification model access standard is provided, the classification labeling of the corpus texts is completed, and the model training time is short. The prediction effect is good and the method is insensitive to outliers. After the labeling task is finished, the text classification model is retrained with the labeled corpus. The model labeling effect is evaluated by building a comprehensive evaluation model for the labeling algorithms, the result is fed back into the learning and training of the text classification model so that the model reaches its best effect, and new labeling tasks are added; through continuous iteration between model updating and corpus labeling, the quality of corpus text classification labeling and the effect of the algorithm model improve and the error rate of text classification labeling drops. Finally, the manual judgment step provides intervention on the labeling result, and the manual confirmation step modifies, confirms and submits the text classification labeling corpus to finish the corpus labeling work, which greatly improves the accuracy and precision of text classification extraction. Experiments verify the effectiveness of applying the active learning algorithm to text classification, and the workload of manually labeling the corpus is greatly reduced.
The invention simplifies the user's labeling workflow and, through a friendly interactive labeling interface, supports importing, training and using external models.
Drawings
FIG. 1 is a schematic diagram of the working principle of the text corpus tagging training system according to the present invention.
FIG. 2 is a schematic diagram of the processing flow for training the text classification corpus labeling algorithm model.
FIG. 3 is a schematic diagram of a process flow of updating a text classification corpus tagging algorithm model.
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the following embodiments and accompanying drawings.
Detailed Description
See fig. 1. In the preferred embodiment described below, a text classification corpus labeling training system comprises a text classification corpus labeling preparation module, a semi-automatic text corpus classification labeling module, a feedback-type classification labeling model learning and training module and a text classification labeling model effect evaluation module. The text classification corpus labeling preparation module distinguishes data from different sources containing training texts and test corpus texts, counts the word frequency of the texts, preprocesses them to remove noise information, selects the source of the text classification corpus for different classification corpora, and provides a selectable, applicable labeling algorithm during the labeling process. For different labeling requirements and corpus characteristics, the semi-automatic text corpus classification labeling module selects among CNN, random forest, KNN, ANN, SVM and deep learning algorithms in the classification labeling task to complete automatic labeling; it uses a Chinese word segmenter to segment the text and remove stop words, converts unstructured and semi-structured texts into a structured vector space model, autonomously selects a suitable algorithm and carries out automatic labeling; based on at least one of the text classification extraction algorithms (CNN, random forest, KNN, ANN, SVM and deep learning), it generates the word vector space of the text, extracts features reflecting the document topics, and performs pre-labeling of the corpus data to be labeled with a single text classification method or with a fusion of multiple methods; the fused labeling result is then judged manually against the classification labeling business standard and stored as mature corpus, which is managed by the classification corpus labeling preparation module for use in labeling algorithm model training; a unified text classification model access standard is provided, and the classification labeling of the corpus texts is completed. After a labeling task is completed, the feedback-type classification labeling model learning and training module performs model learning and training with an algorithm-trained classifier for both the internally integrated classification labeling algorithm models and the externally imported deep-enhanced labeling algorithm models, retrains the text classification model with the labeled corpus, refines and updates the fed-back model, and, through continuous iteration between model updating and corpus labeling, adjusts automatically by feedback to complete new text classification labeling tasks. The text classification labeling model effect evaluation module constructs classification evaluation indexes according to a classification evaluation index standard, quantifies the evaluation indexes based on classification index rules, establishes a comprehensive evaluation model for the labeling algorithms, analyzes test results, evaluates classification results, quantitatively assesses the labeling effect from the classification model indexes, and recommends the best-adapted algorithm model to the semi-automatic corpus classification labeling module.
The text classification corpus labeling preparation module manages the corpus to be labeled by source or topic and prepares the labeling tasks. For different labeling requirements and corpus characteristics, the semi-automatic text corpus classification labeling module autonomously selects a suitable algorithm and carries out automatic labeling, and realizes intervention on the labeling result through a manual judgment step, specifically as follows. The semi-automatic text corpus classification labeling module creates classification labeling tasks for corpora from different sources and selects an algorithm model with a suitable effect for each task, such as a single classification algorithm (CNN, random forest, KNN, ANN, SVM or deep learning) or a fusion of classification algorithms, to complete automatic labeling; the specific labeling algorithm can be configured according to the automatic labeling effect on the corpus. For special labeling tasks, the module creates business labeling rules and manages them; the labeling business rules mainly comprise a classification dictionary and are used to label the corpus automatically. The corpus data to be labeled is pre-labeled with a single classification method or with a fusion of multiple methods. Fusion of multi-method results uses a voting scheme: votes are counted from the judgment of each individual classifier, and the category with the most votes is the final category of the sample to be classified. Given a text d, the different classifiers produce different classification results c1, c2, ..., cn; each classifier casts one supporting vote for the category ci it assigns to text d, so each candidate category of the text receives a score in the final tally, and the category with the highest number of votes is the fused classification result. On the basis of this automatic labeling and fusion result, the labeling result is manually modified, confirmed and stored according to the labeling business standard. The feedback-type classification labeling model learning and training module mainly comprises two parts, model training and classification model updating, and provides model learning, training and feedback-updating capabilities for both internal and external labeling model algorithms; FIG. 2 shows the classification model training processing flow. The model training process is specifically: the labeled corpus data is used to train the trainable Bayes and KNN algorithms offline, and the unified training model interface Train is called so that the model accuracy reaches its best; an external algorithm model is imported through the unified model access interface, the model is updated or exported, a classification model file comprising the algorithm name, the model name and the serialized model file is stored, and the classification training model table is updated. Classification model updating: whether the classification model is updated automatically is set in the model update configuration file; if the classification model needs to be updated, the trained model replaces the model used for labeling in the platform, the serialized model file is loaded and deserialized, and the new labeling task is completed with the updated model. The labeling model effect evaluation module provides methods for constructing model evaluation indexes, rules and index quantification, and supports evaluating the model labeling effect by automatically building a comprehensive evaluation model for the labeling algorithms, specifically: a single-index algorithm is constructed and set according to the index standard; the indexes are quantified according to the index calculation rules; a comprehensive labeling-algorithm evaluation model is built with the corresponding indexes for different labeling tasks; and the comprehensive index value is calculated and the labeling model effect is fed back.
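The multi-classifier voting fusion described above can be sketched as follows; the tie-breaking rule (the first most-common category wins) is an assumption, since the patent does not specify one.

```python
from collections import Counter

def fuse_by_voting(predictions):
    """predictions: list of category labels, one per classifier, for a single text.
    The category with the most supporting votes is the fused pre-labeling result."""
    votes = Counter(predictions)
    return votes.most_common(1)[0][0]

if __name__ == "__main__":
    per_classifier = ["politics", "politics", "economy", "politics", "economy"]
    print(fuse_by_voting(per_classifier))   # -> "politics" (3 votes vs. 2)
```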
The basic evaluation indexes used for classification corpus labeling in the method comprise: precision, which measures the proportion of texts assigned to a concept category Ci that are correctly classified; recall, which measures the proportion of texts actually belonging to a concept category Ci that are correctly classified; the F value of a concept category Ci (the harmonic mean of the classification precision and recall); and the E value (a weighted value that accommodates the different requirements of different systems on precision and recall). They are defined as follows:
P = (number of texts correctly classified into Ci) / (total number of texts classified into Ci)
R = (number of texts correctly classified into Ci) / (total number of texts actually belonging to Ci)
F = 2PR / (P + R)
The precision P and recall R are generally in an inverse relationship: methods that increase precision tend to decrease recall, and vice versa. To express the different requirements that an application system places on precision and recall, a weighting value can be introduced, giving the value E:
E = 1 - (1 + b^2) * P * R / (P + b^2 * R)
where b is the added weight: the larger b is, the larger the weight of precision in the E value, and conversely the larger the weight of recall.
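The per-category indexes defined above can be computed as in the following sketch. The E-value expression used here is a reconstruction consistent with the stated role of the weight b (larger b gives precision more weight); the original formula image is not reproduced in this text, so it is an assumption.

```python
def evaluate_category(true_labels, predicted_labels, category, b=1.0):
    """Per-category precision, recall, F value and weighted E value."""
    assigned = sum(1 for p in predicted_labels if p == category)           # classified into Ci
    actual = sum(1 for t in true_labels if t == category)                  # actually in Ci
    correct = sum(1 for t, p in zip(true_labels, predicted_labels)
                  if t == p == category)                                   # correctly classified
    precision = correct / assigned if assigned else 0.0
    recall = correct / actual if actual else 0.0
    f_value = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    if precision or recall:
        e_value = 1 - (1 + b ** 2) * precision * recall / (precision + b ** 2 * recall)
    else:
        e_value = 1.0
    return {"P": precision, "R": recall, "F": f_value, "E": e_value}

if __name__ == "__main__":
    y_true = ["sports", "finance", "sports", "sports", "finance"]
    y_pred = ["sports", "sports", "sports", "finance", "finance"]
    print(evaluate_category(y_true, y_pred, "sports", b=1.0))
```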
See fig. 2. The feedback-type classification labeling model learning and training module provides model learning and training for both internal and external labeling model algorithms. In the text classification model training processing flow, it reads the corpus and selects the key algorithm to train; for non-trainable algorithms, no training is performed and the flow ends. It uses the labeled corpus data to train the trainable CNN, KNN, ANN, SVM and deep learning algorithms offline, calls the unified training model interface Train, and generates a serialized text classification model file (Kryo), so that the model accuracy reaches its best. It then judges whether the text classification model is to be stored; if not, the flow ends; if so, the external algorithm model is imported through the unified model access interface, the model is updated or exported, a classification model file comprising the algorithm name, the model name and the serialized model is stored, and the text classification training model table is updated.
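A hedged sketch of this offline-training step follows: train through a unified interface, write the serialized model file, and record the algorithm name, model name and file in a training-model table. The patent names a Kryo sequence file (a JVM serialization format); pickle and a JSON model table are stand-ins used here purely for illustration, as are the file names.

```python
import json
import pickle
import time
from pathlib import Path

MODEL_DIR = Path("models")
MODEL_TABLE = MODEL_DIR / "model_table.json"

def train_and_register(algorithm_name, model, documents, labels):
    """Offline-train a model, persist it, and record it in the training model table."""
    model.fit(documents, labels)                      # unified training call (scikit-learn style)
    MODEL_DIR.mkdir(exist_ok=True)
    model_name = f"{algorithm_name}_{int(time.time())}"
    model_file = MODEL_DIR / f"{model_name}.pkl"
    with open(model_file, "wb") as f:
        pickle.dump(model, f)                         # serialized model file (Kryo stand-in)
    table = json.loads(MODEL_TABLE.read_text()) if MODEL_TABLE.exists() else []
    table.append({"algorithm": algorithm_name,        # algorithm name
                  "model": model_name,                # model name
                  "file": str(model_file)})           # serialized model file path
    MODEL_TABLE.write_text(json.dumps(table, indent=2, ensure_ascii=False))
    return model_file
```

Any classifier exposing a scikit-learn-style fit method (for example, the pipelines sketched earlier) could be passed in as the model argument.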
See fig. 3. The feedback-type classification labeling model learning and training module updates the model used for text classification labeling in the platform with the trained model to complete new text classification labeling tasks. During text classification model updating, the module starts the text classification service and selects a text classification algorithm; for untrained algorithms the flow ends. For the selected trainable algorithm (CNN, random forest, KNN, ANN, SVM or deep learning), it decides whether to update the text classification model by parsing the classification-model-update switch in the configuration file; if no update is needed, it reads the specified text classification model file according to the text classification model name and the text classification training model table, deserializes it, loads the model and ends the program.
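The update step can be sketched as follows: a switch in the update configuration file decides whether the newest trained model is loaded; otherwise the model named in the configuration is read from the training-model table and deserialized. The JSON layout and key names are assumptions for illustration.

```python
import json
import pickle
from pathlib import Path

def load_labeling_model(config_path="update_config.json",
                        table_path="models/model_table.json"):
    """Return the classification model the labeling service should use next."""
    config = json.loads(Path(config_path).read_text())
    table = json.loads(Path(table_path).read_text())
    if config.get("auto_update_classification_model", False):
        entry = table[-1]                 # newest trained model replaces the labeling model
    else:
        wanted = config["classification_model_name"]
        entry = next(e for e in table if e["model"] == wanted)
    with open(entry["file"], "rb") as f:
        return pickle.load(f)             # deserialize and load for the new labeling task
```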
The foregoing is directed to the preferred embodiment of the present invention and it is noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. It will be apparent to those skilled in the art that various modifications and improvements can be made without departing from the spirit and substance of the invention, and these modifications and improvements are also considered to be within the scope of the invention.

Claims (6)

1. A text classification corpus labeling training system, comprising: a text classification corpus labeling preparation module, a semi-automatic text corpus classification labeling module, a feedback-type classification labeling model learning and training module and a text classification labeling model effect evaluation module, characterized in that: the text classification corpus labeling preparation module distinguishes data from different sources containing training and test corpus texts, counts the word frequency of the texts and preprocesses them to remove noise information; for different labeling requirements and corpus characteristics, the semi-automatic text corpus classification labeling module selects among a convolutional neural network (CNN), random forest, the classification algorithm KNN, an artificial neural network (ANN), a support vector machine (SVM) and deep learning algorithms in the classification labeling task to complete automatic labeling, uses a Chinese word segmenter for text word segmentation, removes stop words, converts unstructured and semi-structured texts into a structured vector space model, and completes the classification labeling of the corpus texts; the semi-automatic text corpus classification labeling module imports an external algorithm model through a unified model access interface, updates or exports the model, stores a classification model file comprising the algorithm name, the model name and the serialized model file, and updates the classification training model table; in classification model updating, whether the classification model is updated automatically is set in the model update configuration file; if the classification model needs to be updated, the trained model replaces the model used for labeling in the platform, the serialized model file is loaded and deserialized, and the new labeling task is completed with the updated model; after the labeling task is completed, the feedback-type classification labeling model learning and training module provides model learning, training and feedback-updating capabilities for internal and external labeling model algorithms, uses an algorithm-trained classifier for classification model learning and training, and retrains the text classification model with the labeled mature corpus; in the model training process, the labeled corpus data is used to train the trainable Bayes and KNN algorithms offline and the unified training model interface Train is called so that the model accuracy reaches its best; the classification labeling model is fed back for refinement and updating, and new text classification labeling tasks are completed by automatic feedback adjustment through continuous iteration between model updating and corpus labeling; for untrained algorithms, no training is performed and the flow ends; the labeled corpus data is used to train the trainable CNN, KNN, ANN, SVM and deep learning algorithms offline, the unified training model interface Train is called, and a serialized text classification model file (Kryo) is generated so that the model accuracy reaches its best; it is then judged whether the text classification model is to be stored; if not, the flow ends; if so, the external algorithm model is imported through the unified model access interface, the model is updated or exported, a classification model file comprising the algorithm name, the model name and the serialized model is stored, and the text classification training model table is updated; for the selected trainable CNN, random forest, KNN, ANN, SVM or deep learning algorithm, whether to update the text classification model is judged by parsing the classification-model-update switch in the configuration file; if no update is needed, the specified text classification model file is read according to the text classification model name and the text classification training model table, deserialized, the model is loaded and the program ends; the text classification labeling model effect evaluation module constructs classification evaluation indexes according to a classification evaluation index standard, quantifies the evaluation indexes based on classification index rules, establishes a comprehensive evaluation model for the labeling algorithms, analyzes test results, evaluates classification results, quantitatively assesses the labeling effect from the model indexes, and automatically adapts the classification labeling algorithm model according to the classified corpora to be labeled of different tasks.
2. The text classification corpus labeling training system of claim 1, wherein: the text classification corpus labeling preparation module selects the source of the text classification corpus for different text classification corpora, provides a selectable, applicable labeling algorithm during the labeling process, manages the corpus to be labeled by source or topic, and prepares the labeling tasks.
3. The text classification corpus labeling training system of claim 1, wherein: the semi-automatic text corpus classification labeling module autonomously selects a suitable algorithm and carries out automatic labeling; based on at least one of the text classification extraction algorithms CNN, random forest, KNN, ANN, SVM and deep learning, it performs pre-labeling of the text corpus data to be labeled with a single text classification method or with a fusion of multiple methods, generates the word vector space of the text, extracts features reflecting the document topics, and provides a unified text classification model access standard.
4. The text classification corpus labeling training system of claim 1, wherein: the semi-automatic text corpus classification labeling module creates classification labeling tasks for corpora from different sources and selects an algorithm model with a suitable effect for each task, completing automatic labeling with a single classification algorithm based on CNN, random forest, KNN, ANN, SVM or deep learning, or with a fusion of classification algorithms; the specific labeling algorithm can be configured according to the automatic labeling effect on the corpus.
5. The text classification corpus labeling training system of claim 1, wherein: the semi-automatic text corpus classification labeling module creates business labeling rules for special labeling tasks and manages them; the labeling business rules mainly comprise a classification dictionary and are used to label the corpus automatically; the corpus data to be labeled is pre-labeled with a single classification method or with a fusion of multiple methods; fusion of multi-method results uses voting, with votes counted from the judgment of each individual classifier, and the category with the most votes is the final category of the sample to be classified.
6. The text classification corpus labeling training system of claim 1, wherein: the text classification labeling model effect evaluation module provides rules for constructing model evaluation indexes and for index quantification, and supports evaluating the model labeling effect by automatically building a comprehensive evaluation model for the labeling algorithms: a single-index algorithm is constructed and set according to the index standard, the indexes are quantified according to the index calculation rules, a comprehensive labeling-algorithm evaluation model is built with the corresponding indexes for different labeling tasks, and the comprehensive index value is calculated and the labeling model effect is fed back.
CN201910455049.9A 2019-05-29 2019-05-29 Text classification corpus labeling training system Active CN110298032B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910455049.9A CN110298032B (en) 2019-05-29 2019-05-29 Text classification corpus labeling training system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910455049.9A CN110298032B (en) 2019-05-29 2019-05-29 Text classification corpus labeling training system

Publications (2)

Publication Number Publication Date
CN110298032A CN110298032A (en) 2019-10-01
CN110298032B true CN110298032B (en) 2022-06-14

Family

ID=68027381

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910455049.9A Active CN110298032B (en) 2019-05-29 2019-05-29 Text classification corpus labeling training system

Country Status (1)

Country Link
CN (1) CN110298032B (en)

Families Citing this family (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110750647B (en) * 2019-10-17 2020-07-31 北京华宇信息技术有限公司 Method for constructing E L P model of multi-source heterogeneous information data
CN110991279B (en) * 2019-11-20 2023-08-22 北京灵伴未来科技有限公司 Document Image Analysis and Recognition Method and System
CN113010667A (en) * 2019-12-20 2021-06-22 王道维 Training method for machine learning decision model by using natural language corpus
CN111476034B (en) * 2020-04-07 2023-05-12 同方赛威讯信息技术有限公司 Legal document information extraction method and system based on combination of rules and models
CN111563117B (en) * 2020-07-14 2020-11-20 北京每日优鲜电子商务有限公司 Structured information display method and device, electronic equipment and computer readable medium
CN111881105B (en) * 2020-07-30 2024-02-09 北京智能工场科技有限公司 Labeling model of business data and model training method thereof
CN111881294B (en) * 2020-07-30 2023-10-24 本识科技(深圳)有限公司 Corpus labeling system, corpus labeling method and storage medium
CN112163068B (en) * 2020-09-25 2022-11-01 国网山东省电力公司电力科学研究院 Information prediction method and system based on autonomous evolution learner
CN113407713B (en) * 2020-10-22 2024-04-05 腾讯科技(深圳)有限公司 Corpus mining method and device based on active learning and electronic equipment
CN112269880B (en) * 2020-11-04 2024-02-09 吾征智能技术(北京)有限公司 Sweet text classification matching system based on linear function
CN112802584A (en) * 2021-01-26 2021-05-14 武汉大学 Medical ultrasonic examination data classification method and device based on classifier
CN112445897A (en) * 2021-01-28 2021-03-05 京华信息科技股份有限公司 Method, system, device and storage medium for large-scale classification and labeling of text data
CN113065341A (en) * 2021-03-14 2021-07-02 北京工业大学 Automatic labeling and classifying method for environmental complaint report text
CN113064993B (en) * 2021-03-23 2023-07-21 南京视察者智能科技有限公司 Design method, optimization method and labeling method of automatic text classification labeling system based on big data
CN113282747B (en) * 2021-04-28 2023-07-18 南京大学 Text classification method based on automatic machine learning algorithm selection
CN113569988B (en) * 2021-08-23 2024-04-19 广州品唯软件有限公司 Algorithm model evaluation method and system
CN113887227B (en) * 2021-09-15 2023-05-02 北京三快在线科技有限公司 Model training and entity identification method and device
CN114398943B (en) * 2021-12-09 2023-04-07 北京百度网讯科技有限公司 Sample enhancement method and device thereof
CN115221871B (en) * 2022-06-24 2024-02-20 毕开龙 Multi-feature fusion English scientific literature keyword extraction method
CN116579339B (en) * 2023-07-12 2023-11-14 阿里巴巴(中国)有限公司 Task execution method and optimization task execution method

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102750268A (en) * 2012-06-19 2012-10-24 山东中创软件商用中间件股份有限公司 Object serializing method as well as object de-serializing method, device and system
CN103179133A (en) * 2013-04-12 2013-06-26 北京工业大学 Communication method between client side and server based on entity class
CN104598586A (en) * 2015-01-18 2015-05-06 北京工业大学 Large-scale text classifying method
CN107341262A (en) * 2017-07-14 2017-11-10 上海达梦数据库有限公司 The serializing of object type row, unserializing method and device in database
CN107992633A (en) * 2018-01-09 2018-05-04 国网福建省电力有限公司 Electronic document automatic classification method and system based on keyword feature
CN108009589A (en) * 2017-12-12 2018-05-08 腾讯科技(深圳)有限公司 Sample data processing method, device and computer-readable recording medium
CN108009643A (en) * 2017-12-15 2018-05-08 清华大学 A kind of machine learning algorithm automatic selecting method and system
CN109343836A (en) * 2018-08-31 2019-02-15 阿里巴巴集团控股有限公司 Data Serialization, data antitone sequence method, device and equipment
CN109460795A (en) * 2018-12-17 2019-03-12 北京三快在线科技有限公司 Classifier training method, apparatus, electronic equipment and computer-readable medium
CN109685056A (en) * 2019-01-04 2019-04-26 达而观信息科技(上海)有限公司 Obtain the method and device of document information

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102750268A (en) * 2012-06-19 2012-10-24 山东中创软件商用中间件股份有限公司 Object serializing method as well as object de-serializing method, device and system
CN103179133A (en) * 2013-04-12 2013-06-26 北京工业大学 Communication method between client side and server based on entity class
CN104598586A (en) * 2015-01-18 2015-05-06 北京工业大学 Large-scale text classifying method
CN107341262A (en) * 2017-07-14 2017-11-10 上海达梦数据库有限公司 The serializing of object type row, unserializing method and device in database
CN108009589A (en) * 2017-12-12 2018-05-08 腾讯科技(深圳)有限公司 Sample data processing method, device and computer-readable recording medium
CN108009643A (en) * 2017-12-15 2018-05-08 清华大学 A kind of machine learning algorithm automatic selecting method and system
CN107992633A (en) * 2018-01-09 2018-05-04 国网福建省电力有限公司 Electronic document automatic classification method and system based on keyword feature
CN109343836A (en) * 2018-08-31 2019-02-15 阿里巴巴集团控股有限公司 Data Serialization, data antitone sequence method, device and equipment
CN109460795A (en) * 2018-12-17 2019-03-12 北京三快在线科技有限公司 Classifier training method, apparatus, electronic equipment and computer-readable medium
CN109685056A (en) * 2019-01-04 2019-04-26 达而观信息科技(上海)有限公司 Obtain the method and device of document information

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Autonomic State Management for Optimistic; Alessandro Pellegrini; IEEE Transactions on Parallel and Distributed Systems; 2015-06-30; Vol. 26, No. 6; 1560-1569 *
Analysis and Application of Serialization and Deserialization Methods in the .NET Framework; 高立群; Microcomputer Applications; 2007-11-30; Vol. 29, No. 11; 1178-1182 *
Research and Implementation of a Chinese Lexical Analysis System Based on Neural Networks; 徐伟; China Masters' Theses Full-text Database (Information Science and Technology); 2018-02-15; No. 1; I138-1918 *
Reliability Testing of Embedded Airborne Software Using Stochastic Petri Nets; 罗玲; Computer Engineering and Applications; 2018-05-21; Vol. 55, No. 1; 233-240 *

Also Published As

Publication number Publication date
CN110298032A (en) 2019-10-01

Similar Documents

Publication Publication Date Title
CN110298032B (en) Text classification corpus labeling training system
CN116628172B (en) Dialogue method for multi-strategy fusion in government service field based on knowledge graph
CN110298033B (en) Keyword corpus labeling training extraction system
CN112100344B (en) Knowledge graph-based financial domain knowledge question-answering method
CN113011533A (en) Text classification method and device, computer equipment and storage medium
CN108874878A (en) A kind of building system and method for knowledge mapping
CN112307153B (en) Automatic construction method and device of industrial knowledge base and storage medium
CN110287482B (en) Semi-automatic participle corpus labeling training device
CN113392209B (en) Text clustering method based on artificial intelligence, related equipment and storage medium
CN110795525A (en) Text structuring method and device, electronic equipment and computer readable storage medium
CN112306494A (en) Code classification and clustering method based on convolution and cyclic neural network
CN110765277A (en) Online equipment fault diagnosis platform of mobile terminal based on knowledge graph
CN110222192A (en) Corpus method for building up and device
CN115827819A (en) Intelligent question and answer processing method and device, electronic equipment and storage medium
CN111831624A (en) Data table creating method and device, computer equipment and storage medium
CN115292490A (en) Analysis algorithm for policy interpretation semantics
CN104794209A (en) Chinese microblog sentiment classification method and system based on Markov logic network
CN110633468B (en) Information processing method and device for object feature extraction
CN116049376B (en) Method, device and system for retrieving and replying information and creating knowledge
CN114722159A (en) Multi-source heterogeneous data processing method and system for numerical control machine tool manufacturing resources
CN114064888A (en) Financial text classification method and system based on BERT-CNN
CN113434668A (en) Deep learning text classification method and system based on model fusion
CN116304110B (en) Working method for constructing knowledge graph by using English vocabulary data
Mukherjee et al. Immigration document classification and automated response generation
Wang Automatic Scoring of English Online Translation Based on Machine Learning Algorithm

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant