CN111046179B - Text classification method for open network question in specific field - Google Patents


Info

Publication number
CN111046179B
CN111046179B (application CN201911222868.5A)
Authority
CN
China
Prior art keywords
text
classification
coarse
fine
question
Prior art date
Legal status
Active
Application number
CN201911222868.5A
Other languages
Chinese (zh)
Other versions
CN111046179A (en)
Inventor
黄少滨
余日昌
刘汪洋
杨辉
李熔盛
申林山
李轶
张柏嘉
Current Assignee
Harbin Engineering University
Original Assignee
Harbin Engineering University
Priority date
Filing date
Publication date
Application filed by Harbin Engineering University filed Critical Harbin Engineering University
Priority to CN201911222868.5A priority Critical patent/CN111046179B/en
Publication of CN111046179A publication Critical patent/CN111046179A/en
Application granted granted Critical
Publication of CN111046179B publication Critical patent/CN111046179B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2155 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/24155 - Bayesian classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the technical field of text classification processing, and particularly relates to a text classification method for open network question sentences in a specific field. The invention overcomes the problems that, when performing open network text classification tasks in certain specific fields, sufficient class-labeled corpora are lacking and network texts carry little information and much noise, and provides a new method for the hierarchical classification of open network question sentences in such a field. The invention uses open network questions together with written texts of the specific field so that the domain word-embedding representation better matches the characteristics of domain knowledge, and at the same time a semi-supervised method is used to speed up the training of the classification model and reduce the number of labeled samples required. In addition, classification at multiple granularity levels is realized by combining conditional probabilities. The method can be used for auxiliary data extraction, judgment and construction in fields such as question-answering systems, sentiment analysis, and domain knowledge bases.

Description

Text classification method for open network question in specific field
Technical Field
The invention belongs to the technical field of text classification processing, and particularly relates to a text classification method for open network question sentences in a specific field.
Background
Human intelligence is closely related to language. Human logical thinking takes the form of language, and most human knowledge is recorded and transmitted in the form of language and words. Natural language processing is therefore an important, even central, part of artificial intelligence. From the beginning of artificial intelligence research, people have been looking for ways to let machines understand the world. Among these efforts, Text Classification is a topic that is widely used in the field of natural language processing. The text classification task is specifically described as using a computer to automatically classify and label a text set according to a certain classification system or standard. Since the internet began developing at an astonishing rate in the 1990s, today's networks hold a vast amount of data, including text, sound, images and so on, that is structurally diverse and rich in content. Text data occupies fewer network resources than sound and image data, making it easier to propagate through the network, so a large portion of network resources appear in text form. How to find valuable information in these vast texts is a major goal of information processing. A text classification system based on machine learning can automatically classify texts according to their content under a given classification model, helping people organize texts and mine text information. Because text is one of the important tools for describing the human world and transmitting information, natural language processing technology has become an important direction in computer science and artificial intelligence. With the rise of neural networks, text classification has become one of the fundamental technologies of natural language processing, and much progress has been made in this field. However, most current neural-network-based text classification methods face several problems: 1) neural network classifiers require a large amount of data for learning and need many labeled training samples, which is costly; 2) training time is long and the amount of computation is large; among the many neural network models, the convolutional neural network can both reduce the number of parameters to be trained and perform parallel computation to accelerate training; 3) most classification tasks face written, normalized texts that are relatively long, so the texts contain rich information, have obvious features, and the task difficulty is relatively low.
Because labeled training samples are costly to obtain while unlabeled samples are easy to obtain, semi-supervised learning uses unlabeled training samples to optimize classification models learned from a small number of labeled samples, so that classification models can be applied to a wider range of scenarios.
Research on text classification dates back to the 1960s. Early text classification was mainly based on Knowledge Engineering, performing classification with manually defined rules; this is time-consuming and labor-intensive, and requires sufficient knowledge of a given field to write appropriate rules.
Intelligent text classification techniques have matured over the decades. Marcin and a co-author evaluated numerous current advanced text classification research efforts, investigating six elements of a text classification task: data collection, labeled data analysis, feature construction and weighting, feature selection and projection, classification model training, and solution evaluation. They find that text classifiers can be grouped into supervised, semi-supervised, ensemble, active, transfer and multi-view learning classifiers. Most research works adopt a supervised learning method, and most use simple data instances; multi-label instances are quite rare. The text representation can significantly affect the quality of the classification.
Therefore, optimizing classifier performance by improving the text representation has become an important branch of text classification research. Word embedding has become the dominant text representation method due to its superior representational capability. Wang et al. believe that for the text classification task, labels play a central role in the final performance, and they further enhance the text representation by jointly embedding words and class labels into the same latent space, where the label acts as an anchor for its class and affects the embedding of words. Reinforcing the textual representation with additional information is a good way to optimize classification: using only a fully connected network model, this approach achieves 99.02% accuracy on the DBpedia 2014 dataset. Kim likewise optimizes classification performance by enhancing the text representation. In contrast to works that experiment with many different classification tasks but adopt a generic pre-trained word embedding model, Kim proposes that when performing the text classification task the same text is fed to the input layer twice, as two channels in the form of pre-trained word vectors; the text representation of one channel is adjustable by back propagation through the neural network while the other is fixed. Using both task-specific dynamic vectors and pre-trained static vectors yields better sentence classification accuracy. Rie and Zhang propose another method to optimize Convolutional Neural Networks (CNN) by improving the representation quality of the input samples. The method learns embeddings of text regions from unlabeled data and then integrates the learned embeddings into supervised training. It therefore requires a neural network for unlabeled data, a neural network for labeled data, and a Convolutional Neural Network (CNN) above them, resulting in relatively large computational overhead. Wang et al., to overcome the challenges of short text classification, consider that more semantic and grammatical information needs to be captured from short text, and that the key step to achieve this is a more advanced text representation model. Their method enriches the information of short texts with the help of an explicit knowledge base: each short text is associated with related concepts in the knowledge base, and the words of the short text and the related concepts are combined to generate an embedding using pre-training. This word-concept embedding is then fed into a Convolutional Neural Network (CNN). The method thus combines a knowledge base related to the short texts with text concept extraction. Agnihotri et al. consider another important property of classifier performance: the training speed of the classifier under the premise of preserving its predictive ability. Starting from feature selection for an improved classifier, they score words for informativeness and then select the top b highest-scoring n-grams (n = 1 to 3) as the feature set of an SVM classifier. Since the classifier they adopt is not the currently popular neural network model, a feature extraction step is performed on the text before it is input into the classifier.
Embedding text into a potential space often faces the problem of insufficient instances of corpus tagging. Zhang and Xiao et al propose to exploit potential correlations between related tasks to extract common features and generate performance gains through multitask learning. The article proposes a multi-task learning architecture comprising four recurrent neural layers to fuse information of multiple related tasks. Because one of the strong constraints of Deep Neural Networks (DNNs) is that they are strongly dependent on a large number of annotated corpuses due to the large number of parameters that need to be trained. While neural networks trained on limited data are prone to overfitting and cannot be well popularized. However, constructing large-scale high-quality marker data sets is very labor-intensive. Multi-task learning can exploit the potential correlations between related tasks to extract common features, implicitly increasing corpus size, thereby improving classification. Therefore, when there are a plurality of similar tasks and corpora thereof, it is considered to improve the classification effect by using multi-task learning. In addition, Lease and Zhang et al propose an active learning method for convolutional neural networks. This approach focuses the active learning strategy on selecting the instance that best affects the embedding space (i.e., produces a discriminative word representation), rather than targeting the final classification result. The method obtains a better classification result by improving the representation quality of input samples, and limits the AL strategy to a word embedding stage to reduce the operation cost, so that the AL strategy can be used for any neural network model. Through active learning, good classification performance can be achieved only by labeling part of corpora selected by the model.
When we input a text representation using an appropriate method into the classification model, cases where the text lengths of the instances are not uniform are also faced. The conventional research work uses a fixed length to cut off the redundant text. SPP-NET proposed by He et al is a method for classifying images of different sizes without clipping input samples in advance when they are input to a convolutional network. According to the method, the last pooling layer of the convolutional neural network is replaced by a spatial pyramid pooling layer, namely the number of pooling windows is fixed, and the size of the windows is variable, so that the characteristic of the input N-way softmax layer is fixed. And simultaneously, multi-scale pooling is adopted in parallel, and feature mapping of each scale is fused to obtain a better classification result. In the text classification task, the input samples are more often of different lengths, and it is also a feasible idea to perform dimension unification before the softmax layer.
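As a concrete illustration of the dimension-unification idea mentioned above, the following minimal sketch (PyTorch assumed; the tensor shapes and the global_max_pool helper are illustrative, not taken from the cited works) shows how a global max pooling step reduces variable-length convolutional feature maps to a fixed-size vector before a softmax layer:

```python
import torch

def global_max_pool(feature_map: torch.Tensor) -> torch.Tensor:
    """Reduce a (batch, channels, length) feature map of any length
    to a fixed (batch, channels) vector by taking the max over time."""
    return feature_map.max(dim=2).values

# Two texts of different lengths produce the same output dimensionality.
short = torch.randn(1, 64, 12)   # 12 tokens after convolution
long = torch.randn(1, 64, 57)    # 57 tokens after convolution
print(global_max_pool(short).shape, global_max_pool(long).shape)  # both (1, 64)
```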
Text classification, one of the techniques that constitute a question-answering system, also presents some unique classification methods under the work of numerous researchers. Yao et al propose a method for classifying question sentences in a question-and-answer system. The method is characterized in that the relation type of question requirements is determined by extracting subject words, verbs and named entities from the question as classification features. For example, if the question word "who" is extracted, the question requirement attribute may be determined to be person. The method based on information extraction can rapidly obtain the classification of the question sentence. Similarly, Dodiya and Jain [20] mention that when classifying questions in a question-answering system, it is not only necessary to classify the question with respect to its domain knowledge category, but also to differentiate the question's query request in order to help find or construct the appropriate answer, e.g., who, when, etc. The rule-based method firstly carries out preprocessing of stop word removal and word stem extraction on a question, and then extracts a keyword sequence in the question and maps the keyword sequence into a corresponding problem classification. The method needs manual writing rules to construct the mapping relation between the keyword sequence and the classification, and meanwhile, the accuracy rate of the classification is poor (lower than 66 percent), and the fitness when the method is applied to Chinese is not as good as that of English. Similarly, Silva et al consider question classification as a subtask of question-answering, and they also apply a similar rule method to match question sentences and trace the generic class of the core words through WordNet, such as: flower- - > plant. The entity class of the core word of the question is then used as one of the features of the text classification to enhance the performance of the classifier. Compared with a classification mode of pure rule matching, the method is improved, so that when the rules cannot be matched with the question successfully, the judgment of the question type can still be made through the information of the central words.
In addition to the above methods, many studies have applied Convolutional Neural Networks (CNNs) to text classification tasks. Le and Denies et al., faced with a variety of convolutional neural network classification models, focus on whether a "deep" convolutional network model is needed in the text classification task, studying the importance of depth in the convolution model for text classification with character-level and word-level input respectively. Their main conclusion is that deep models have not proven more effective than shallow models for text classification tasks, and further research is required to confirm or refute this observation on other datasets, natural language processing tasks and models. Indeed, the deep convolution model stems from depth models originally developed for image processing, and new deep architectures for text processing may challenge this conclusion in the near future. Therefore, when building a convolution model, a deep network structure is not necessarily needed: a shallow and wide network model can still obtain good results while reducing the number of parameters the model must compute. Chen et al. propose dilated (atrous) convolution, which adds a sampling-rate concept to the convolution kernel so that the receptive field of the convolution layer is wider; the method can be applied well to large data. It also combines the idea of spatial pyramid pooling, using several parallel dilated convolution layers with different sampling rates to extract features, and then fuses the features to generate results. Finally, a Deep Convolutional Neural Network (DCNN) is combined with a fully connected Conditional Random Field (CRF) to obtain accurate semantic segmentation results and object boundaries. Whether the dilated convolution method can be applied to text remains to be verified. Johnson and Zhang believe that, as networks deepen, CNNs can effectively discover long-range associations (and more global information) in text. They therefore developed a simple network architecture (DPCNN) that can achieve optimal accuracy by increasing the network depth without significantly increasing the computational cost. The DPCNN first performs text region embedding, i.e., adopts region word embeddings as the input of the convolutional network, then keeps the same number of feature maps at each convolutional layer and performs fixed-window max pooling with stride 2 after every 2 layers of convolution, so that the amount of computation is halved after each pooling. The method does not greatly change the structure of the CNN; it only constrains some hyper-parameters, and it improves classification precision through better word embedding representation and a deeper network structure.
Still other researchers have attempted to combine the Convolutional Neural Network (CNN) with another Recurrent Neural Network (RNN) that has been successfully applied in the text field to achieve better performance. Lai et al adds a pooling layer above the Bi-directional RNN to capture context information. And performing maximum pooling operation on all time step results of the recurrent neural network by using the convolution pooling layer to solve the problem of gradient dissipation during RNN long-distance memory. The method combines part of characteristics of RNN and CNN, and focuses on improving classification results by using context information. Wang also proposes a method of combining RNN and CNN, but it uses a RNN network of length k instead of a convolution kernel as a window of size k to slide through the input text, and the window RNN produces only one final result and not k results. Meanwhile, because k neurons in the window share parameters, when the window is enlarged, the number of parameters needing to be learned is not increased. The context vector generated for each window can be viewed as a representation of a text segment. The context vector is then passed into a multi-layer perceptron (MLP) to extract high-level features, and then passed into a max-pooling layer to extract the most salient features and location invariant features. And finally, applying a linear rectification function and a softmax function to predict the probability of each category. The method well utilizes the advantages of RNN sequence representation and retains the advantages of parallel operation and feature extraction invariance of CNN.
Since many knowledge structures have many levels of abstraction, classifying in different levels can present different difficulties. Kowsari et al propose a generic hierarchical classification architecture suitable for various types of neural networks. The structure is a tree-like neural network, the root node is a father neural network, the output result is the classification result of the uppermost level, then each classification result is respectively input into a son neural network to output the next level of classification, and the hierarchical classification is finally formed by analogy. Therefore, the method model is composed of a plurality of neural networks, and a classification field and a classification level which are set manually are needed to design a corresponding network model. Zhu and Bain also propose a similar branched convolutional neural network (B-CNN) structure. The difference is that each layer of the CNN is considered to contain different hierarchical features in the network, so that the convolutional neural network classified to the finest granularity can be used as a main network, the features of a plurality of middle layers of the main network are input into branch networks for classified prediction of each hierarchy, and the final loss function is the weighted sum of prediction losses of each granularity. The structure is more concise. Fu et al also studied the use of CNN to achieve hierarchical classification, which uses bayesian techniques for hierarchical classification, so that hierarchical classification can be achieved only by modifying the uppermost classification layer of the CNN network. The method does not need sub-networks corresponding to hierarchical classification, and only needs additional neurons to learn the conditional probability that the sample is divided into fine-grained sub-categories under the condition of being divided into coarse-grained categories. The method greatly reduces the calculation cost of CNN hierarchical classification.
Disclosure of Invention
The invention aims to provide a text classification method for open network question sentences in specific fields.
The purpose of the invention is realized by the following technical scheme: the method comprises the following steps:
step 1: inputting a text to be classified, and judging the field of the text;
step 2: setting classification levels and classification categories;
on a coarse-grained level, dividing a field into a+1 types according to the target of the classification task and the boundary of the field to which the text to be classified belongs, wherein a is the number of types required by the task, and the additional 1 type represents text outside the field;
on a fine-grained level, each coarse-grained category is subdivided into b+1 subclasses according to the target of the classification task and the knowledge system structure of the field to which the text to be classified belongs, and the extra 1 category represents the difference set between the parent category and the union of its b subclasses;
step 3: loading an open network question and answer text set and a written text set of the field according to the field to which the text to be classified belongs;
step 4: dividing the open network question and answer text set into question-answer pairs; dividing the written text set into written chapters;
step 5: based on syntactic characteristics, dividing all question-answer pairs and written chapters into sentences to obtain a sentence data set;
step 6: based on a jieba word segmentation module, establishing a dictionary and performing word segmentation on the sentence data set by utilizing a hidden Markov model and a Viterbi algorithm to obtain a word vector data set of the field to which the text to be classified belongs;
step 7: clearing invalid texts and symbols in the word vector data set of the field to which the text to be classified belongs;
step 8: pre-training the word vector data set of the field to which the text to be classified belongs according to a CBOW algorithm to obtain a question training set in word vector form;
step 9: constructing a semi-supervised convolutional neural network combined with Bayes;
step 10: inputting the question training set in word vector form into the semi-supervised Bayesian convolutional neural network for training to obtain an open network question text classifier for the field to which the text to be classified belongs;
step 11: inputting the text to be classified into the open network question text classifier to obtain a classification result.
The present invention may further comprise:
the step 10 of inputting the question training set in the form of word vectors into the semi-supervised Bayesian convolutional neural network for training specifically comprises the following steps:
step 2.1: inputting a question training set in a word vector form into an input layer of a semi-supervised Bayesian convolutional neural network;
step 2.2: establishing two convolution hidden layers after the input layer of the semi-supervised Bayesian convolutional neural network; the two convolution hidden layers execute convolution operations, each having r randomly initialized convolution kernels of dimension P × K, where r is the number of convolution layer channels, the convolution window sizes P are P1 and P2 respectively so as to capture detail features and region features respectively, and K is the word vector dimension;
step 2.3: establishing a pooling layer after the two convolution hidden layers respectively, adopting the maximum pooling with the same window size of P, and then splicing the results of the two pooling layers end to end;
step 2.4: repeatedly executing the convolution and the maximum pooling structure s times to obtain a characteristic matrix M; inputting the result into a global maximum pooling layer to obtain a feature vector F;
step 2.5: establishing a parallel convolutional layer C1 and a fully connected softmax layer D1 after the pooling layer;
step 2.6: inputting the feature vector F into the fully connected softmax layer D1 to obtain a C_coarse-dimensional feature vector, namely the probability distribution of the coarse-grained classification, where C_coarse is the number of coarse-grained categories;
step 2.7: inputting the feature matrix M into the convolutional layer C1 to obtain a weight matrix composed of (C_coarse × C_fine) feature vectors; inputting the weight matrix into a global maximum pooling layer to obtain a (C_coarse × C_fine)-dimensional vector representing the coarse-to-fine-grained conditional probability distribution, where C_fine is the number of fine-grained categories;
step 2.8: splicing the C_coarse-dimensional feature vector and the (C_coarse × C_fine)-dimensional vector end to end to obtain a ((C_coarse × C_fine) + C_coarse)-dimensional vector V1;
step 2.9: inputting the vector V1 into a fully-connected softmax layer D1 to obtain probability distribution of predicting fine-grained classification;
the final loss of the convolutional neural network model classified by the fully connected softmax layer D1 is the weighted sum of the coarse classification loss and the fine classification loss, plus an unlabeled mutual-exclusivity loss term that pushes the prediction probability as close as possible to a form in which only one element is 1 and the remaining elements are 0:
loss = λ_coarse · loss_coarse + λ_fine · loss_fine + λ_unlabeled · loss_unlabeled
where loss_unlabeled is the mutual-exclusivity loss computed over unlabeled samples from the prediction probabilities f_j(x_i);
wherein f_j(x_i) is the j-th dimension element of the prediction probability vector generated in the 8th step; λ_coarse and λ_fine are set weight values; loss_coarse and loss_fine are softmax-layer cross-entropy loss functions.
The invention has the beneficial effects that:
the invention provides a text classification method for open network question sentences in a specific field, which overcomes the problems of lack of enough available corpus with classification marks, low information quantity and high noise of network texts under the condition of executing network open text classification tasks in some specific fields, and provides a new method for hierarchical classification of open network question sentences in the field. The invention utilizes the open network question and the written text of the specific field to lead the word embedding expression of the field to be more consistent with the field knowledge characteristics, and simultaneously, the semi-supervised method is used to accelerate the training of the classification model and reduce the required marking samples. In addition, the classification of the categories at a multi-granularity level is realized by combining the conditional probability. The method can be used for extracting, judging and constructing auxiliary data in the fields of question-answering systems, emotion analysis, field knowledge bases and the like.
Drawings
FIG. 1 is an overall flow chart of the present invention.
Fig. 2 is a schematic diagram of a classifier structure.
Fig. 3 is a setting table of category classification and labels in the embodiment of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
The method belongs to the field of text classification processing, and further relates to a text classification method of open network question sentences in a certain specific knowledge field by utilizing semi-supervised learning and hierarchical classification in the field of short text classification. The method can be used for extracting, judging and constructing auxiliary data in fields such as question-answering systems, emotion analysis, field knowledge bases and the like.
The invention provides a text classification method for open network question sentences in a specific field. It overcomes the problems that sufficient class-labeled corpora are lacking and that network texts carry little information and much noise when performing open network text classification tasks in certain specific fields, and provides a new method for the hierarchical classification of open network question sentences in such a field. The invention comprises the following steps: (1) collecting related texts in the field; (2) setting classification levels and categories; (3) preprocessing the open network text and the written text; (4) selecting a part of the open network question samples and marking their categories; (5) inputting questions and performing semi-supervised hierarchical classifier training; (6) executing text classification with the classifier. The invention uses open network questions together with written texts of the specific field so that the domain word-embedding representation better matches the characteristics of domain knowledge, and at the same time a semi-supervised method is used to speed up the training of the classification model and reduce the number of labeled samples required. In addition, classification at multiple granularity levels is realized by combining conditional probabilities.
A text classification method for open network question sentences in specific fields comprises the following steps:
step 1: inputting a text to be classified, and judging the field of the text;
step 2: setting classification levels and classification categories;
on a coarse-grained level, dividing a field into a+1 types according to the target of the classification task and the boundary of the field to which the text to be classified belongs, wherein a is the number of types required by the task, and the additional 1 type represents text outside the field;
on a fine-grained level, each coarse-grained category is subdivided into b+1 subclasses according to the target of the classification task and the knowledge system structure of the field to which the text to be classified belongs, and the extra 1 category represents the difference set between the parent category and the union of its b subclasses;
step 3: loading an open network question and answer text set and a written text set of the field according to the field to which the text to be classified belongs;
step 4: dividing the open network question-answer text set into question-answer pairs; dividing the written text set into written chapters;
step 5: based on syntactic characteristics, dividing all question-answer pairs and written chapters into sentences to obtain a sentence data set;
step 6: based on a jieba word segmentation module, establishing a dictionary and performing word segmentation on the sentence data set by utilizing a hidden Markov model and a Viterbi algorithm to obtain a word vector data set of the field to which the text to be classified belongs (a preprocessing sketch is given after this list);
step 7: clearing invalid texts and symbols in the word vector data set of the field to which the text to be classified belongs;
step 8: pre-training the word vector data set of the field to which the text to be classified belongs according to a CBOW algorithm to obtain a question training set in word vector form;
step 9: constructing a semi-supervised convolutional neural network combined with Bayes;
step 10: inputting the question training set in word vector form into the semi-supervised Bayesian convolutional neural network for training to obtain an open network question text classifier for the field to which the text to be classified belongs;
step 11: inputting the text to be classified into the open network question text classifier to obtain a classification result.
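A minimal preprocessing sketch for steps 4 through 7 follows. It assumes Python with the jieba package; the file names pension_domain_dict.txt and qa_and_written_texts.txt, the punctuation set used for sentence splitting, and the character filter are illustrative assumptions rather than values prescribed by the method:

```python
import re
import jieba

# Hypothetical domain dictionary file; for words not covered by the dictionary,
# jieba applies an HMM/Viterbi decoding step when HMM=True.
jieba.load_userdict("pension_domain_dict.txt")  # assumed file name

def split_sentences(text: str) -> list[str]:
    """Split Chinese text into sentences on common end-of-sentence punctuation."""
    parts = re.split(r"[。！？?!；;\n]+", text)
    return [p.strip() for p in parts if p.strip()]

def tokenize(sentence: str) -> list[str]:
    """Segment a sentence into words; drop punctuation and other invalid symbols."""
    tokens = jieba.lcut(sentence, HMM=True)
    return [t for t in tokens if re.search(r"[\u4e00-\u9fa5A-Za-z0-9]", t)]

corpus = []  # list of token lists, later fed to the CBOW pre-training step
for document in open("qa_and_written_texts.txt", encoding="utf-8"):  # assumed corpus file
    for sent in split_sentences(document):
        corpus.append(tokenize(sent))
```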
In the step 10, inputting the question training set in the form of word vectors into a semi-supervised Bayesian convolutional neural network for training specifically comprises the following steps:
step 2.1: inputting a question training set in a word vector form into an input layer of a semi-supervised convolutional neural network combined with Bayes;
step 2.2: establishing two convolution hidden layers after the input layer of the semi-supervised Bayesian convolutional neural network; the two convolution hidden layers perform convolution operations, each having r randomly initialized convolution kernels of dimension P × K, where r is the number of convolution layer channels, the convolution window sizes P are P1 and P2 respectively so as to capture detail features and region features respectively, and K is the word vector dimension;
step 2.3: establishing a pooling layer after the two convolution hidden layers respectively, adopting the maximum pooling with the same window size of P, and splicing the results of the two pooling layers end to end;
step 2.4: repeatedly executing the convolution and the maximum pooling structure s times to obtain a characteristic matrix M; inputting the result into a global maximum pooling layer to obtain a feature vector F;
step 2.5: establishing a parallel convolutional layer C1 and a fully connected softmax layer D1 after the pooling layer;
step 2.6: inputting the feature vector F into the fully connected softmax layer D1 to obtain a C_coarse-dimensional feature vector, namely the probability distribution of the coarse-grained classification, where C_coarse is the number of coarse-grained categories;
step 2.7: inputting the feature matrix M into the convolutional layer C1 to obtain a weight matrix composed of (C_coarse × C_fine) feature vectors; inputting the weight matrix into a global maximum pooling layer to obtain a (C_coarse × C_fine)-dimensional vector representing the coarse-to-fine-grained conditional probability distribution, where C_fine is the number of fine-grained categories;
step 2.8: splicing the C_coarse-dimensional feature vector and the (C_coarse × C_fine)-dimensional vector end to end to obtain a ((C_coarse × C_fine) + C_coarse)-dimensional vector V1;
step 2.9: inputting the vector V1 into a fully-connected softmax layer D1 to obtain probability distribution of predicting fine-grained classification;
the final loss of the convolutional neural network model classified by the fully connected softmax layer D1 is the weighted sum of the coarse classification loss and the fine classification loss, plus an unlabeled mutual-exclusivity loss term that pushes the prediction probability as close as possible to a form in which only one element is 1 and the remaining elements are 0:
loss = λ_coarse · loss_coarse + λ_fine · loss_fine + λ_unlabeled · loss_unlabeled
where loss_unlabeled is the mutual-exclusivity loss computed over unlabeled samples from the prediction probabilities f_j(x_i);
wherein f_j(x_i) is the j-th dimension element of the prediction probability vector generated in the 8th step; λ_coarse and λ_fine are set weight values; loss_coarse and loss_fine are softmax-layer cross-entropy loss functions.
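The following sketch illustrates one possible realization of steps 2.1 through 2.9 in PyTorch. It is a simplified reading of the description, not the patented implementation: the convolution-plus-pooling block is shown only once rather than repeated s times, softmax outputs are returned directly instead of logits, and all hyper-parameter values (K, r, P1, P2 and the class counts) are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalTextCNN(nn.Module):
    """Sketch of the coarse/fine classifier described in steps 2.1-2.9."""

    def __init__(self, K=128, r=64, P1=3, P2=5, c_coarse=3, c_fine=6):
        super().__init__()
        self.c_coarse, self.c_fine = c_coarse, c_fine
        # Two parallel convolution layers capturing detail and region features.
        self.conv_detail = nn.Conv1d(K, r, kernel_size=P1, padding=P1 // 2)
        self.conv_region = nn.Conv1d(K, r, kernel_size=P2, padding=P2 // 2)
        self.pool = nn.MaxPool1d(kernel_size=2)
        # Coarse-grained head (softmax applied in forward).
        self.coarse_head = nn.Linear(2 * r, c_coarse)
        # Convolution producing the coarse-to-fine conditional weights.
        self.cond_conv = nn.Conv1d(2 * r, c_coarse * c_fine, kernel_size=1)
        # Fine-grained head over [coarse probabilities ; conditional weights].
        self.fine_head = nn.Linear(c_coarse + c_coarse * c_fine, c_fine)

    def forward(self, x):                        # x: (batch, K, T) word vectors
        d = self.pool(F.relu(self.conv_detail(x)))
        g = self.pool(F.relu(self.conv_region(x)))
        m = torch.cat([d, g], dim=1)             # feature matrix M: (batch, 2r, T')
        f = m.max(dim=2).values                  # global max pool -> feature vector F
        p_coarse = F.softmax(self.coarse_head(f), dim=1)
        cond = self.cond_conv(m).max(dim=2).values   # (batch, c_coarse * c_fine)
        v1 = torch.cat([p_coarse, cond], dim=1)      # spliced vector V1
        p_fine = F.softmax(self.fine_head(v1), dim=1)
        return p_coarse, p_fine
```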
Example 1:
the invention aims to provide a semi-supervised hierarchical classification method facing an open network question text in a specific field, simultaneously makes up the defects of the prior art, and can train a classification model by using extra knowledge and keep good classification precision under the condition of few labeled samples.
The specific idea of the invention for realizing the above purpose is as follows:
1. considering different classification difficulty among various categories, setting classification levels of coarse granularity and fine granularity, and setting categories in the levels;
2. preprocessing a specific field network question sentence, a question and answer text and a written text of a field oriented by a task;
3. generating a word-embedded representation of the domain-specific corpus by representation learning;
4. dividing a small part of balanced training sample set for labeling;
5. establishing a semi-supervised hierarchical convolutional neural network model;
6. and inputting a training data set to train a text classification model to obtain a final developed network question text classifier.
The overall flow chart of the method is shown in fig. 1, and the specific steps are as follows:
(1) inputting a text to be classified, and judging the field of the text; setting classification levels and classification categories.
(1a) On the coarse-grained level, the text to be classified is classified into a +1 classes according to the target of the classification task and the boundary of the domain to which the text belongs, wherein a is the number of classes required by the task, and the additional 1 class represents the outside of the domain.
(1b) On a fine-grained level, each coarse-grained class is subdivided into b+1 sub-classes according to the goal of the classification task and the knowledge architecture of the field to which the text to be classified belongs, and the additional 1 class represents the difference set between the parent class and the union of its b sub-classes.
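For illustration, the category hierarchy can be written down as a small mapping from coarse classes to fine sub-classes. The concrete class names below are taken from the endowment-insurance example described later (Fig. 3); the structure, with one extra "outside the domain" coarse class and one "other" sub-class per parent, follows steps (1a) and (1b):

```python
# Illustrative category hierarchy: a = 1 in-domain coarse class plus the extra
# "outside the domain" class; b = 4 fine sub-classes plus the extra "other" class.
CATEGORY_TREE = {
    "endowment insurance": [
        "participation",          # joining / eligibility questions
        "payment",                # contribution questions
        "account",                # personal account questions
        "rights/treatment",       # benefits and entitlements
        "comprehensive/others",   # remainder of the parent class not covered above
    ],
    "general/others (outside the domain)": ["others"],
}
```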
(2) The open web question and answer text and written text data are preprocessed.
(2a) And loading the open network question and answer text and the written text.
(2b) Dividing an open network question-answer text set into question-answer pairs; the written text set is divided into written chapters.
(2c) And based on syntactic characteristics, all question-answer pairs and written chapters are divided into sentences to obtain a sentence data set.
(2d) Based on the jieba word segmentation module, a dictionary is established and the sentence data set is segmented by utilizing a hidden Markov model and a Viterbi algorithm.
(2e) And clearing invalid text and symbols in the data set.
(2f) Pre-training a word vector data set of the field to which the text to be classified belongs according to a CBOW algorithm to obtain a question training set in a word vector form, wherein the step not only utilizes a task-related processed open network question text, but also utilizes processed answers and written texts.
(3) Semi-supervised hierarchical text classifier training is performed.
(3a) And setting a semi-supervised convolution neural network combined with Bayes.
(3b) And inputting a question training set in a word vector form into the convolutional neural network model to train the model.
(3c) And storing the convolutional neural network model obtained after training to obtain the open network question text classifier in a certain specific field.
(4) And inputting the text to be classified into an open network question text classifier to obtain a classification result.
The training for the word vector in the specific field described in step 2 uses not only the task-related question corpus but also the semi-spoken answer corpus and the book-oriented text corpus.
The step 2f comprises the following specific steps:
Step 1: CBOW is a neural network model. First, each word in the context of the target word in the text is mapped to a K-dimensional one-hot word vector; these vectors w_1, w_2, ..., w_T serve as the input of the network model.
Step 2: These vectors are multiplied by a weight matrix W_{K×N}, passed into the next layer of the network, and summed and averaged to obtain an N-dimensional vector H.
Step 3: The N-dimensional vector H is multiplied by a weight matrix H_{N×K} to obtain a K-dimensional vector z.
Step 4: The last layer of the neural network is a prediction probability layer, realized by the softmax function:
σ_j = exp(z_j) / Σ_{k=1}^{K} exp(z_k), for j = 1, ..., K
wherein z is the vector produced by the hidden layer before the softmax layer, K is the dimension of z and equals the size of the corpus vocabulary, and σ_j is the probability that the target word is predicted to be the j-th word in the vocabulary.
Step 5: Given a word sequence w_1, w_2, ..., w_T, the goal of the CBOW model is to maximize the average log-likelihood, expressed as:
(1/T) Σ_{t=1}^{T} log p(w_t | context(w_t))
i.e. predicting the target word in the text by its context.
Finally, the learned weight matrix W_{K×N} is the transformation matrix that embeds words into the latent space, where N is the dimension of the word embedding.
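In practice the CBOW pre-training of step (2f) can be carried out with an off-the-shelf implementation. The sketch below uses gensim (version 4 or later assumed); the vector size, window, epoch count and file names are illustrative choices, not values fixed by the method:

```python
from gensim.models import Word2Vec

# `corpus` is the list of token lists produced by the preprocessing sketch above.
# sg=0 selects the CBOW algorithm; vector_size corresponds to the embedding dimension N.
model = Word2Vec(
    sentences=corpus,
    vector_size=200,   # N, assumed value
    window=5,          # context window, assumed value
    min_count=2,
    sg=0,              # CBOW rather than skip-gram
    workers=4,
    epochs=10,
)
model.save("pension_cbow.model")   # assumed file name
vec = model.wv["养老保险"]          # look up a domain word vector
```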
The step 3 comprises the following steps:
step 1, the text classification model is a neural network model, and firstly, the vectorization training set obtained in step 2 is input into an input layer of the neural network.
Step 2: Establish the convolution hidden layers. After the input layer, two parallel convolutional layers perform convolution operations, each having r randomly initialized convolution kernels of dimension P × K (the number of convolution layer channels is r, the convolution window sizes P are P1 and P2 respectively so as to capture detail features and region features respectively, and the word vector dimension is K).
And 3, respectively establishing a pooling layer after the two convolution layers, adopting the maximum pooling with the same window size of P, and splicing the results of the two pooling layers end to end.
And step 4, repeatedly executing the convolution and the maximum pooling structure s times to obtain a feature matrix M, and inputting the result into the global maximum pooling layer to obtain a feature vector F.
In step 5, after the pooling layer, a parallel convolutional layer C1 and a fully connected layer D1 are established.
Step 6: Input F into the fully connected softmax layer D1 to obtain a C_coarse-dimensional feature vector, namely the probability distribution of the coarse-grained classification, where C_coarse is the number of coarse-grained categories.
Step 7: Input M into the convolutional layer C1 to obtain a weight matrix composed of (C_coarse × C_fine) feature vectors, and input this matrix into a global maximum pooling layer to obtain a (C_coarse × C_fine)-dimensional vector representing the coarse-to-fine conditional probability distribution, where C_fine is the number of fine-grained categories.
Step 8: Splice the output vector of the softmax layer D1 from step 6 and the output vector of the global maximum pooling layer from step 7 end to end to obtain a ((C_coarse × C_fine) + C_coarse)-dimensional vector V1.
And step 9, inputting the vector V1 into a fully-connected softmax layer to obtain the probability distribution for predicting fine-grained classification.
And step 10, the final loss of the convolutional neural network model of the hierarchical classification is a weighted sum of the losses of the coarse classification and the fine classification, and a mutual exclusivity loss term without a label enables the prediction probability to be as close as possible to the form that only one element is 1 and the rest elements are 0.
loss = λ_coarse · loss_coarse + λ_fine · loss_fine + λ_unlabeled · loss_unlabeled
where loss_unlabeled is the mutual-exclusivity loss computed over unlabeled samples from the prediction probabilities f_j(x_i); f_j(x_i) is the j-th dimension element of the prediction probability vector generated in step 8; λ_coarse and λ_fine are set weight values; loss_coarse and loss_fine are softmax-layer cross-entropy loss functions.
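A hedged sketch of the combined loss follows. The labeled part is the weighted sum of coarse and fine cross-entropy described in step 10; the exact unlabeled mutual-exclusivity term is not reproduced in this text, so the penalty used below, which drives each unlabeled prediction toward a one-hot vector, is an assumption consistent with the stated goal rather than the patented formula:

```python
import torch
import torch.nn.functional as F

def hierarchical_loss(p_coarse, p_fine, y_coarse, y_fine, labeled_mask,
                      lam_coarse=1.0, lam_fine=1.0, lam_unlabeled=0.1):
    """Weighted coarse + fine cross-entropy on labeled samples plus an assumed
    mutual-exclusivity penalty on unlabeled samples; weights are illustrative."""
    eps = 1e-8
    lab = labeled_mask.bool()
    loss_coarse = F.nll_loss(torch.log(p_coarse[lab] + eps), y_coarse[lab])
    loss_fine = F.nll_loss(torch.log(p_fine[lab] + eps), y_fine[lab])
    unlab = ~lab
    if unlab.any():
        probs = p_fine[unlab]
        # sum_j f_j * (1 - f_j) is 0 exactly when the prediction is one-hot,
        # so minimizing it sharpens the unlabeled predictions.
        loss_unlabeled = (probs * (1.0 - probs)).sum(dim=1).mean()
    else:
        loss_unlabeled = p_fine.new_zeros(())
    return lam_coarse * loss_coarse + lam_fine * loss_fine + lam_unlabeled * loss_unlabeled
```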
The invention will be described in connection with specific examples,
(1) a classification hierarchy and classification categories are set (as shown in fig. 3).
(1a) At the coarse-grained level, the categories are set as endowment insurance and general/others according to national policies and regulations.
(1b) At the fine-grained level, the categories are subdivided into 5 classes according to national policies and regulations: participation, payment, account, rights/treatment, and comprehensive/others.
(2) And preprocessing the open network question sentence, the question and answer text and the policy and regulation text data.
(2a) And loading an open network question sentence, an answer text and a policy and regulation text.
(2b) Based on the jieba module, a dictionary is established and the segmentation is performed on the read data set by using a hidden Markov model and a Viterbi algorithm.
(2c) And clearing invalid texts and symbols in the data set.
(2d) And performing word embedding pre-training on the processed text according to the CBOW algorithm to obtain a word vector in the field of endowment insurance.
(3) The classified samples are marked at each granularity.
(3a) A question samples are randomly extracted, and A/5 question samples are randomly extracted to serve as a test set.
(3b) And marking the A selected samples according to the classification of the hierarchical classification obtained in the previous step.
(4) The textual representation of the open network question data is converted.
(4a) The question text data set is converted into the form of a vector list through word vectors obtained by previous training.
(4b) The labeled 4/5 × A question samples in the question text data set, excluding the test set, are used as the training set and the verification set.
(4c) 80% of these 4/5 × A question samples are randomly drawn as the training set, leaving 20% as the validation set.
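The sample split of steps (3a), (4b) and (4c) can be expressed in a few lines of Python; the function and variable names and the fixed random seed below are assumptions:

```python
import random

def split_samples(labeled_samples, seed=42):
    """Split the A labeled question samples into a test set (A/5), then split the
    remaining 4/5 * A samples into 80% training and 20% validation."""
    rng = random.Random(seed)
    samples = list(labeled_samples)
    rng.shuffle(samples)
    n_test = len(samples) // 5
    test_set, remaining = samples[:n_test], samples[n_test:]
    n_train = int(0.8 * len(remaining))
    return remaining[:n_train], remaining[n_train:], test_set

# train_set, val_set, test_set = split_samples(labeled_question_samples)
```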
(5) Semi-supervised hierarchical text classification is performed.
(5a) A semi-supervised bayesian-combined convolutional neural network is set (as shown in figure 2). The model firstly takes the linguistic data in the form of word vectors as input, extracts features through the convolutional layer, and then obtains a classification result through a hierarchical classification structure. The hierarchical classification structure is characterized in that a conditional probability score of P (fine-grained | coarse-grained) is obtained after a layer is pooled in a fine-grained channel, and then the conditional probability score is spliced with a coarse-grained classification result to finally obtain a fine-grained classification result. The hierarchical classification structure can be used for classification of two granularity levels and can be easily expanded into scenes of more granularity classifications.
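To make the conditional-probability combination concrete, the numpy sketch below shows a closed-form Bayesian reading of the structure: the fine-grained distribution obtained from P(coarse) and P(fine | coarse) by the law of total probability. In the model itself the two quantities are concatenated and passed through a trained softmax layer, so this closed-form version is an illustration of the idea rather than the patented computation; the numeric values are arbitrary:

```python
import numpy as np

def combine_coarse_and_conditional(p_coarse, p_fine_given_coarse):
    """P(fine_j) = sum_c P(coarse_c) * P(fine_j | coarse_c)."""
    return p_fine_given_coarse.T @ p_coarse   # shape: (n_fine,)

p_coarse = np.array([0.9, 0.1])               # e.g. in-domain vs. outside the domain
p_fine_given_coarse = np.array([              # rows: coarse classes, cols: fine classes
    [0.5, 0.2, 0.1, 0.1, 0.1],
    [0.05, 0.05, 0.05, 0.05, 0.8],
])
print(combine_coarse_and_conditional(p_coarse, p_fine_given_coarse))
```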
(5b) And inputting a question training set in a word vector form into the convolutional neural network model to train the model.
(5c) And storing the convolutional neural network model obtained after training to obtain the open network question text classifier in the field of endowment insurance.
(5d) The text is classified using a text classifier.
The invention has the beneficial effects that:
1. Because the method does not adopt a generic pre-trained word vector, and does not train word vectors only on the question texts targeted by the task, but adds more knowledge-rich semi-spoken answer texts and written texts containing domain knowledge to the word vector training on top of the open network question corpus, a better effect is obtained when question texts in this field are represented by word vectors.
2. The hierarchical classification model designed by the method realizes classification at multiple granularity levels without significantly increasing the amount of computation, and can provide more information for further data processing. In addition, the hierarchical classification structure can be combined with any neural network to realize multi-level classification.
3. Because the semi-supervised learning method is combined with the hierarchical classification model, the training process is accelerated, the problems of few labeled samples and high labeling cost in some fields are alleviated, and the discrimination of each class in the feature space is ensured when the number of classes is large.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (2)

1. A text classification method for open network question in specific field is characterized by comprising the following steps:
step 1: inputting a text to be classified, and judging the field of the text;
step 2: setting classification levels and classification categories;
on a coarse-grained level, dividing a field into a+1 types according to the target of the classification task and the boundary of the field to which the text to be classified belongs, wherein a is the number of types required by the task, and the additional 1 type represents text outside the field;
on a fine-grained level, each coarse-grained category is subdivided into b+1 subclasses according to the target of the classification task and the knowledge system structure of the field to which the text to be classified belongs, and the extra 1 category represents the difference set between the parent category and the union of its b subclasses;
step 3: loading an open network question and answer text set and a written text set of the field according to the field to which the text to be classified belongs;
step 4: dividing the open network question-answer text set into question-answer pairs; dividing the written text set into written chapters;
step 5: based on syntactic characteristics, dividing all question-answer pairs and written chapters into sentences to obtain a sentence data set;
step 6: based on a jieba word segmentation module, establishing a dictionary and performing word segmentation on the sentence data set by utilizing a hidden Markov model and a Viterbi algorithm to obtain a word vector data set of the field to which the text to be classified belongs;
step 7: clearing invalid texts and symbols in the word vector data set of the field to which the text to be classified belongs;
step 8: pre-training the word vector data set of the field to which the text to be classified belongs according to a CBOW algorithm to obtain a question training set in word vector form;
step 9: constructing a semi-supervised convolutional neural network combined with Bayes;
step 10: inputting the question training set in word vector form into the semi-supervised Bayesian convolutional neural network for training to obtain an open network question text classifier for the field to which the text to be classified belongs;
step 11: inputting the text to be classified into the open network question text classifier to obtain a classification result.
2. The text classification method for the open network question in the specific field according to claim 1, characterized in that: in the step 10, inputting the question training set in the form of word vectors into a semi-supervised Bayesian convolutional neural network for training specifically comprises the following steps:
step 2.1: inputting a question training set in a word vector form into an input layer of a semi-supervised convolutional neural network combined with Bayes;
step 2.2: establishing two convolution hidden layers after the input layer of the semi-supervised Bayesian convolutional neural network; the two convolution hidden layers perform convolution operations, each having r randomly initialized convolution kernels of dimension P × K, where r is the number of convolution layer channels, the convolution window sizes P are P1 and P2 respectively so as to capture detail features and region features respectively, and K is the word vector dimension;
Step 2.3: establishing a pooling layer after each of the two convolution hidden layers, applying max pooling with the same window size P, and then splicing the results of the two pooling layers end to end;
Step 2.4: repeating the convolution and max-pooling structure s times to obtain a feature matrix M; inputting the result into a global max pooling layer to obtain a feature vector F;
Step 2.5: establishing a parallel convolutional layer C1 and a fully connected softmax layer D1 after the pooling layer;
Step 2.6: inputting the feature vector F into the fully connected softmax layer D1 to obtain a C_coarse-dimensional feature vector, namely the probability distribution of the coarse-grained classification, where C_coarse is the number of coarse-grained categories;
Step 2.7: inputting the feature matrix M into the convolutional layer C1 to obtain a weight matrix composed of (C_coarse × C_fine) feature vectors; inputting the weight matrix into a global max pooling layer to obtain a (C_coarse × C_fine)-dimensional vector representing the conditional probability distribution from coarse granularity to fine granularity, where C_fine is the number of fine-grained categories;
Step 2.8: splicing the C_coarse-dimensional feature vector and the (C_coarse × C_fine)-dimensional vector end to end to obtain a ((C_coarse × C_fine) + C_coarse)-dimensional vector V1;
Step 2.9: inputting the vector V1 into the fully connected softmax layer D1 to obtain the predicted probability distribution of the fine-grained classification;
the final loss of the convolutional neural network classification model with the fully connected softmax layer D1 is the weighted sum of the coarse-grained classification loss and the fine-grained classification loss plus a label-free mutual-exclusivity loss term, which pushes the prediction probability as close as possible to a form in which only one element is 1 and the remaining elements are 0 (code sketches illustrating this network and its loss follow the claims):
loss = λ_coarse · loss_coarse + λ_fine · loss_fine + loss_unlabeled, where loss_unlabeled is the mutual-exclusivity term given by the formula image FDA0002301337220000021 (not reproduced in text);
wherein f_j(x_i) denotes the j-th element of the prediction probability vector generated in the 8th step, λ_coarse and λ_fine are preset weight values, and loss_coarse and loss_fine are softmax-layer cross-entropy loss functions.
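As an illustration of steps 2.1 to 2.9, below is a minimal PyTorch sketch of the coarse-to-fine network. It is a sketch under assumptions, not the patent's reference implementation: the concrete values of r, P1, P2, P, s, C_coarse and C_fine, the same-length padding, the ReLU activations, and the choice to give the fine-grained head C_fine outputs are all illustrative, since the claim does not fix them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoarseToFineCNN(nn.Module):
    """Sketch of claim 2, steps 2.1-2.9; hyperparameter defaults are illustrative."""
    def __init__(self, k=128, r=64, p1=3, p2=5, p=2, s=2, c_coarse=5, c_fine=4):
        super().__init__()
        self.blocks = nn.ModuleList()
        in_ch = k
        for _ in range(s):  # step 2.4: repeat the convolution + max-pooling structure s times
            self.blocks.append(nn.ModuleDict({
                # step 2.2: two parallel convolutions with window sizes P1 and P2
                "conv_p1": nn.Conv1d(in_ch, r, kernel_size=p1, padding=p1 // 2),
                "conv_p2": nn.Conv1d(in_ch, r, kernel_size=p2, padding=p2 // 2),
                # step 2.3: max pooling with the same window size P on both branches
                "pool": nn.MaxPool1d(kernel_size=p),
            }))
            in_ch = r
        self.coarse_fc = nn.Linear(r, c_coarse)                          # D1, coarse head
        self.cond_conv = nn.Conv1d(r, c_coarse * c_fine, kernel_size=1)  # C1
        self.fine_fc = nn.Linear(c_coarse * c_fine + c_coarse, c_fine)   # fine head on V1

    def forward(self, x):  # x: (batch, K, seq_len) word vectors from the training set
        for blk in self.blocks:
            a = blk["pool"](F.relu(blk["conv_p1"](x)))
            b = blk["pool"](F.relu(blk["conv_p2"](x)))
            x = torch.cat([a, b], dim=2)         # step 2.3: splice the two branches end to end
        m = x                                     # step 2.4: feature matrix M
        f = m.max(dim=2).values                   # global max pooling -> feature vector F
        coarse_p = F.softmax(self.coarse_fc(f), dim=1)    # step 2.6: coarse distribution
        cond = self.cond_conv(m).max(dim=2).values        # step 2.7: coarse-to-fine weights
        v1 = torch.cat([coarse_p, cond], dim=1)           # step 2.8: vector V1
        fine_p = F.softmax(self.fine_fc(v1), dim=1)       # step 2.9: fine distribution
        return coarse_p, fine_p
```

A call such as CoarseToFineCNN()(torch.randn(8, 128, 40)) returns the coarse-grained and fine-grained probability distributions for a batch of eight questions padded to forty tokens.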
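And a sketch of the loss: a weighted sum of the coarse and fine cross-entropy losses plus the label-free term. Because the patent gives the unlabeled term only as a formula image, the mutual_exclusivity function below uses the common formulation -Σ_j f_j(x_i) Π_{k≠j}(1 - f_k(x_i)), which matches the verbal description (only one element close to 1, the rest close to 0) but is an assumption about the exact formula.

```python
import torch
import torch.nn.functional as F

def mutual_exclusivity(p, eps=1e-8):
    """Label-free term: -sum_j p_j * prod_{k != j} (1 - p_k), computed row-wise in log space.
    It is minimised when each row of p is (close to) a one-hot vector."""
    log_one_minus = torch.log(1.0 - p + eps)
    log_prod_all = log_one_minus.sum(dim=1, keepdim=True)
    prod_except_j = torch.exp(log_prod_all - log_one_minus)  # prod over k != j of (1 - p_k)
    return -(p * prod_except_j).sum(dim=1).mean()

def total_loss(coarse_p, fine_p, y_coarse, y_fine, unlabeled_fine_p,
               lam_coarse=1.0, lam_fine=1.0, eps=1e-8):
    """loss = lam_coarse * loss_coarse + lam_fine * loss_fine + loss_unlabeled."""
    loss_coarse = F.nll_loss(torch.log(coarse_p + eps), y_coarse)  # softmax cross entropy
    loss_fine = F.nll_loss(torch.log(fine_p + eps), y_fine)
    return lam_coarse * loss_coarse + lam_fine * loss_fine + mutual_exclusivity(unlabeled_fine_p)
```

Here y_coarse and y_fine are integer class labels for the labeled questions, unlabeled_fine_p is the fine-grained distribution predicted for unlabeled questions, and lam_coarse and lam_fine correspond to λ_coarse and λ_fine in the claim.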
CN201911222868.5A 2019-12-03 2019-12-03 Text classification method for open network question in specific field Active CN111046179B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911222868.5A CN111046179B (en) 2019-12-03 2019-12-03 Text classification method for open network question in specific field

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911222868.5A CN111046179B (en) 2019-12-03 2019-12-03 Text classification method for open network question in specific field

Publications (2)

Publication Number Publication Date
CN111046179A CN111046179A (en) 2020-04-21
CN111046179B (en) 2022-07-15

Family

ID=70233301

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911222868.5A Active CN111046179B (en) 2019-12-03 2019-12-03 Text classification method for open network question in specific field

Country Status (1)

Country Link
CN (1) CN111046179B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111221974B (en) * 2020-04-22 2020-08-14 成都索贝数码科技股份有限公司 Method for constructing news text classification model based on hierarchical structure multi-label system
CN111708882B (en) * 2020-05-29 2022-09-30 西安理工大学 Transformer-based Chinese text information missing completion method
CN111666378A (en) * 2020-06-11 2020-09-15 暨南大学 Chinese yearbook title classification method based on word vectors
CN112183602B (en) * 2020-09-22 2022-08-26 天津大学 Multi-layer feature fusion fine-grained image classification method with parallel rolling blocks
CN112199501B (en) * 2020-10-13 2024-03-19 华中科技大学 Scientific and technological information text classification method
CN112287931B (en) * 2020-12-30 2021-03-19 浙江万里学院 Scene text detection method and system
CN113407439B (en) * 2021-05-24 2024-02-27 西北工业大学 Detection method for software self-recognition type technical liabilities
CN113486147A (en) * 2021-07-07 2021-10-08 中国建设银行股份有限公司 Text processing method and device, electronic equipment and computer readable medium
CN113743095B (en) * 2021-07-19 2024-09-20 西安理工大学 Chinese problem generation unified pre-training method based on word lattice and relative position embedding
CN113688237B (en) * 2021-08-10 2024-03-05 北京小米移动软件有限公司 Text classification method, training method and device of text classification network
CN113822019B (en) * 2021-09-22 2024-07-12 科大讯飞股份有限公司 Text normalization method, related device and readable storage medium

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101377769A (en) * 2007-08-29 2009-03-04 中国科学院自动化研究所 Method for representing multiple graininess of text message
CN101876987A (en) * 2009-12-04 2010-11-03 中国人民解放军信息工程大学 Overlapped-between-clusters-oriented method for classifying two types of texts
CN103324708A (en) * 2013-06-18 2013-09-25 哈尔滨工程大学 Method of transfer learning from long text to short text
CN104636449A (en) * 2015-01-27 2015-05-20 厦门大学 Distributed type big data system risk recognition method based on LSA-GCC
CN105808721A (en) * 2016-03-07 2016-07-27 中国科学院声学研究所 Data mining based customer service content analysis method and system
CN107622333A (en) * 2017-11-02 2018-01-23 北京百分点信息科技有限公司 A kind of event prediction method, apparatus and system
CN108399158A (en) * 2018-02-05 2018-08-14 华南理工大学 Attribute sensibility classification method based on dependency tree and attention mechanism
CN109284374A (en) * 2018-09-07 2019-01-29 百度在线网络技术(北京)有限公司 For determining the method, apparatus, equipment and computer readable storage medium of entity class
CN110472115A (en) * 2019-08-08 2019-11-19 东北大学 A kind of social networks text emotion fine grit classification method based on deep learning
CN111078875A (en) * 2019-12-03 2020-04-28 哈尔滨工程大学 Method for extracting question-answer pairs from semi-structured document based on machine learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8280885B2 (en) * 2007-10-29 2012-10-02 Cornell University System and method for automatically summarizing fine-grained opinions in digital text

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
CNN with coarse-to-fine layer for hierarchical classification; Fu RL et al.; IET Computer Vision; 2018-05-31; 892-899 *
四险一金领域开放网络文本分类的方法及应用研究 [Research on the method and application of open-network text classification in the field of four insurances and one fund]; 余日昌; 《中国优秀博硕士学位论文全文数据库(硕士)信息科技辑》; 2021-05-15; I138-1480 *
小规模知识库指导下的细分领域实体关系发现研究 [Research on entity-relation discovery in subdivided domains guided by a small-scale knowledge base]; 陈果 et al.; 《情报学报》; 2019-11-24; 1200-1211 *

Also Published As

Publication number Publication date
CN111046179A (en) 2020-04-21

Similar Documents

Publication Publication Date Title
CN111046179B (en) Text classification method for open network question in specific field
Wang et al. Application of convolutional neural network in natural language processing
CN110866117B (en) Short text classification method based on semantic enhancement and multi-level label embedding
Lopez et al. Deep Learning applied to NLP
CN109800437B (en) Named entity recognition method based on feature fusion
CN112115238B (en) Question-answering method and system based on BERT and knowledge base
CN109858041B (en) Named entity recognition method combining semi-supervised learning with user-defined dictionary
CN110263325B (en) Chinese word segmentation system
CN110110122A (en) Image based on multilayer semanteme depth hash algorithm-text cross-module state retrieval
CN112818118B (en) Reverse translation-based Chinese humor classification model construction method
CN114239585B (en) Biomedical nested named entity recognition method
CN112541356A (en) Method and system for recognizing biomedical named entities
CN113515632A (en) Text classification method based on graph path knowledge extraction
CN111159345B (en) Chinese knowledge base answer acquisition method and device
CN117236338B (en) Named entity recognition model of dense entity text and training method thereof
CN111222318A (en) Trigger word recognition method based on two-channel bidirectional LSTM-CRF network
CN114781375A (en) Military equipment relation extraction method based on BERT and attention mechanism
Guo et al. Implicit discourse relation recognition via a BiLSTM-CNN architecture with dynamic chunk-based max pooling
Tao et al. News text classification based on an improved convolutional neural network
Pasad et al. On the contributions of visual and textual supervision in low-resource semantic speech retrieval
CN111191033A (en) Open set classification method based on classification utility
CN112765353B (en) Scientific research text-based biomedical subject classification method and device
CN116955579B (en) Chat reply generation method and device based on keyword knowledge retrieval
Sun et al. Multi-classification speech emotion recognition based on two-stage bottleneck features selection and MCJD algorithm
CN113901228A (en) Cross-border national text classification method and device fusing domain knowledge graph

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant