CN109885686A - A multilingual text classification method fusing topic information and BiLSTM-CNN - Google Patents

A multilingual text classification method fusing topic information and BiLSTM-CNN

Info

Publication number
CN109885686A
CN109885686A (application CN201910127535.8A)
Authority
CN
China
Prior art keywords
multilingual
text
languages
neural network
topic information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910127535.8A
Other languages
Chinese (zh)
Inventor
崔荣一
孟先艳
赵亚慧
易志伟
田明杰
徐凯斌
杨飞扬
王琪
黄政豪
金国哲
张振国
胡荣
王大千
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yanbian University
Original Assignee
Yanbian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yanbian University filed Critical Yanbian University
Priority to CN201910127535.8A priority Critical patent/CN109885686A/en
Publication of CN109885686A publication Critical patent/CN109885686A/en
Pending legal-status Critical Current

Landscapes

  • Machine Translation (AREA)

Abstract

The present invention relates to the field of text classification in natural language processing, and in particular to a multilingual text classification method that fuses topic information with a BiLSTM-CNN model. The method proceeds as follows: first, Chinese, English, and Korean parallel corpora are collected and a parallel corpus is constructed; the text of each language in the corpus is preprocessed; word vectors for each language are trained with word-embedding techniques; a topic vector is extracted for each language's text with a topic model; and a neural network model suited to multiple languages is built, topic information is fused in, and multilingual text representations are produced. This text classification method overcomes the language barrier, is highly adaptable, meets the needs of multilingual text classification, and is practical.

Description

A multilingual text classification method fusing topic information and BiLSTM-CNN
Technical field
The present invention relates to the field of text classification in natural language processing, and in particular to a multilingual text classification method that fuses topic information with a BiLSTM-CNN model.
Background technique
With the rapid development of the Internet, more and more Internet data exists in text form, and with ongoing internationalization, multilingual text data has become increasingly common. People are no longer content with text information in a single language; the demand for multilingual text information keeps growing, and users urgently want to find the information they need quickly and effectively in multilingual text data. Multilingual text classification, as a research direction of natural language processing, is an effective way to cope with the growth of multilingual text information.
The objective of multilingual text classification is to extend existing automatic text classification techniques from a single language to multiple languages without manual intervention. As globalization has progressed, research on multilingual text classification has received wide attention and development; at present there are four main approaches.
Dictionary-based methods. These use a bilingual dictionary and are simple to implement. For example, Olsson et al. translated English training documents into Czech documents by means of a probabilistic bilingual dictionary to perform cross-language text classification. However, this approach cannot handle polysemy.
Corpus-based methods. These divide into parallel corpora and comparable corpora: a parallel corpus describes the same information in different languages, while a comparable corpus describes information on the same subject in different languages, with documents aligned by the topic discussed. However, these methods require a highly developed corpus with comprehensive coverage, which severely constrains the experimental conditions and hinders extension.
Machine-translation-based methods. Documents in multiple languages are translated by machine translation tools into a single language model for classification. This approach is fairly simple but depends heavily on the accuracy of the machine translation, which reduces effectiveness.
Word-embedding-based methods. These build feature representation models based on deep learning and train multilingual word vectors. By exploiting context, they capture semantic information accurately and yield concrete feature representations.
A main difficulty of multilingual text classification is multilingual text representation. The invention therefore proposes a new multilingual text representation and neural network model that overcome the language problem.
Summary of the invention
In view of the problems identified in the background above, the invention discloses a multilingual text classification method that fuses topic information with BiLSTM-CNN and is able to overcome the language problem.
A multilingual text classification method fusing topic information and BiLSTM-CNN, comprising the following steps:
1) collect Chinese, English, and Korean parallel corpora and construct a parallel corpus;
2) preprocess the text of each language in the corpus;
3) train word vectors for each language using word-embedding techniques;
4) extract a topic vector for each language's text using a topic model;
5) build a neural network model suited to multiple languages, fuse in the topic information, and produce multilingual text representations.
Preferably, when constructing the multilingual parallel corpus in step 1), scientific and technical literature abstracts in 13 categories and in the three languages Chinese, English, and Korean are collected, and a content-aligned multilingual parallel corpus is constructed;
Preferably, the processing of each language's text in step 2) proceeds as follows:
S1: for the Chinese corpus, build a scientific and technical dictionary containing terms from biology, medicine, and physics, and add it to the segmentation dictionary as a segmentation preference to improve the Chinese word segmentation;
S2: for the English corpus, extract the stem of each English word, i.e., reduce each word to its stem representation;
S3: for the Korean corpus, remove terminal endings and conjunctions, leaving nouns and predicates;
Preferably, the word vectors of each language trained in step 3) are obtained with the CBOW model of Word2vec and have dimension 220;
Preferably, the topic vectors in step 4) are extracted by latent semantic analysis, applied separately to the text of each language;
Preferably, the neural network model suited to multiple languages built in step 5) is divided into three sub-models: a Chinese, an English, and a Korean neural network model. Each sub-model has the same neural network structure, while training on the text of a different language yields different model parameters; the three sub-models are finally cascaded into the complete neural network model, realizing multilingual text classification;
Preferably, the structure of the topic-information-fusing neural network in step 5) is divided into an input layer, a BiLSTM layer, a CNN layer, a fully connected layer, and an output layer.
Beneficial effects:
1) The invention effectively solves the language problem without relying on external resources: each language trains its own neural network model, the semantic information of each language is exploited accurately, effective feature representations are obtained, and the method has a degree of generality.
2) The invention represents text with the features produced by the model combination, capturing text information along both the temporal and the spatial dimension and thus expressing text semantics more accurately.
3) The invention makes full use of topic information: the topic vector of each language is extracted, and combining topic information with semantic information improves the accuracy of the text model.
Brief description of the drawings
To explain the embodiments of the invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1: overall flow diagram of the invention;
Fig. 2: text preprocessing flow chart of the invention;
Fig. 3: neural network model of the invention.
Specific embodiment
To make the objectives, technical solutions, and advantages of the embodiments of the invention clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the drawings. Obviously, the described embodiments are only a part of the embodiments of the invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the invention.
The environment of this example is configured as follows: Windows operating system, CPU frequency 3.30 GHz, 16 GB of memory, Python as the programming language, TensorFlow as the deep learning framework, and PyCharm as the integrated development environment.
As shown in Fig. 1, the specific implementation steps of the algorithm are as follows:
Step 1: first, collect Chinese, English, and Korean parallel corpora and construct a parallel corpus;
Step 2: preprocess the text of each language in the corpus;
Step 3: train word vectors for each language using word-embedding techniques;
Step 4: extract a topic vector for each language's text using a topic model;
Step 5: build a neural network model suited to multiple languages, fuse in the topic information, and produce multilingual text representations.
In step 1 above, Chinese, English, and Korean scientific and technical literature abstracts are compiled, 32,688 texts per language and 98,064 texts in total, divided into 13 categories, to construct the multilingual parallel corpus.
In step 2 above, the collected texts are preprocessed. Because the texts cover three languages, preprocessing is done per language; the specific steps are shown in Fig. 2:
Step 2.1: for the Chinese corpus, remove stop words and segment the text; build a scientific and technical dictionary containing professional terms from biology, medicine, physics, and similar fields, and add it to the segmentation dictionary as a segmentation preference to improve the Chinese word segmentation.
Step 2.2: for the English corpus, convert capital letters to lower case and extract the stem of each English word, i.e., reduce each word to its stem representation.
Step 2.3: for the Korean corpus, remove terminal endings, conjunctions, and the like, leaving nouns and predicates.
Through the above steps, the multilingual texts in the corpus are preprocessed and the experimental text set is constructed.
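As an illustration, the English branch of the preprocessing above can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions: the stop-word set and suffix list below are illustrative stand-ins (a real system would use a complete stop-word list and a full stemmer such as Porter's, and the Chinese and Korean branches would rely on a word segmenter and a morphological analyzer, which are not shown).

```python
import re

# Illustrative resources -- NOT the patent's actual dictionaries (assumptions).
EN_STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "are", "is"}
EN_SUFFIXES = ("ing", "edly", "ed", "es", "s")  # crude stand-in for a real stemmer


def preprocess_english(text):
    """Lowercase, tokenize, drop stop words, and crudely reduce words to stems."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stems = []
    for tok in tokens:
        if tok in EN_STOPWORDS:
            continue
        for suf in EN_SUFFIXES:
            # Only strip a suffix when enough of the word remains.
            if tok.endswith(suf) and len(tok) > len(suf) + 2:
                tok = tok[: -len(suf)]
                break
        stems.append(tok)
    return stems
```

For example, `preprocess_english("The cats are running in the gardens")` drops the stop words and reduces the remaining tokens to crude stems.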
In step 3, each language trains its own word vectors with the CBOW model of Word2vec, ignoring words whose frequency in the text is below 10. The CBOW model predicts the centre word from the surrounding words and consists of three layers:
Input layer: the word vectors of the 2c context words of the centre word;
Projection layer: their sum x_w = v(w_{t-c}) + ... + v(w_{t+c});
Output layer: a Huffman tree whose leaf nodes are the words of the corpus and whose non-leaf nodes are virtual nodes not assigned real words.
Its learning objective is to maximize the log-likelihood function L = Σ_{w∈C} log p(w | Context(w)), where w ranges over the words of the corpus C; when the objective function reaches its maximum, the corresponding word vectors are well trained.
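A minimal numpy sketch of CBOW training is given below. Note the hedges: it uses a full softmax over the vocabulary for clarity rather than the Huffman-tree (hierarchical) softmax described above, toy dimensions rather than 220, and illustrative function names and hyperparameters; it is not the patent's implementation.

```python
import numpy as np


def train_cbow(corpus, dim=8, window=2, lr=0.05, epochs=150, seed=0):
    """Minimal CBOW: predict the centre word from the average of its context
    vectors, maximizing the log-likelihood (full softmax for clarity)."""
    rng = np.random.default_rng(seed)
    vocab = sorted({w for sent in corpus for w in sent})
    idx = {w: i for i, w in enumerate(vocab)}
    W_in = rng.normal(0, 0.1, (len(vocab), dim))   # input (context) embeddings
    W_out = rng.normal(0, 0.1, (dim, len(vocab)))  # output weights
    losses = []
    for _ in range(epochs):
        total = 0.0
        for sent in corpus:
            for t, centre in enumerate(sent):
                lo, hi = max(0, t - window), min(len(sent), t + window + 1)
                ctx = [idx[sent[j]] for j in range(lo, hi) if j != t]
                if not ctx:
                    continue
                h = W_in[ctx].mean(axis=0)          # projection: context average
                scores = h @ W_out
                p = np.exp(scores - scores.max())
                p /= p.sum()
                total += -np.log(p[idx[centre]])    # negative log-likelihood
                p[idx[centre]] -= 1.0               # softmax gradient
                W_out -= lr * np.outer(h, p)
                W_in[ctx] -= lr * (W_out @ p) / len(ctx)
        losses.append(total)
    return {w: W_in[idx[w]] for w in vocab}, losses
```

On a toy corpus the recorded loss decreases over epochs, mirroring the maximization of the log-likelihood objective.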
In step 4, each language's text topic vectors are extracted separately by the latent semantic analysis method; the steps are as follows:
analyze the document collection and build the term-document matrix;
apply singular value decomposition to the term-document matrix;
from the decomposed matrices, extract the document-topic matrix, whose rows serve as the topic vectors of the individual texts.
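The three LSA steps above can be sketched with a plain SVD, assuming numpy; the helper name and the toy topic count are illustrative (the patent's topic vectors are 220-dimensional).

```python
import numpy as np


def lsa_topic_vectors(docs, k=2):
    """Latent semantic analysis: build a term-document count matrix, take its
    SVD, and read off a k-dimensional topic vector for each document."""
    vocab = sorted({w for d in docs for w in d})
    A = np.zeros((len(vocab), len(docs)))      # term-document matrix
    for j, d in enumerate(docs):
        for w in d:
            A[vocab.index(w), j] += 1.0
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Each column of Vt, scaled by its singular value, is a document's
    # representation in the k-dimensional latent topic space.
    return (np.diag(s[:k]) @ Vt[:k]).T         # shape: (n_docs, k)
```

Documents sharing vocabulary end up with similar topic vectors, while documents with disjoint vocabulary land in orthogonal latent directions.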
In step 5, the neural network model that fuses in topic information is built; the model is shown in Fig. 3.
As Fig. 3 shows, the model is divided into three sub-models: the Chinese, the English, and the Korean neural network model. Each sub-model has the same neural network structure, while training on the text of a different language yields different model parameters. The three sub-models are cascaded at the end to form the complete neural network model, which realizes multilingual text classification.
The neural network structure shown in Fig. 3 is divided into an input layer, a BiLSTM layer, a CNN layer, a fully connected layer, and an output layer. The meaning of each layer is as follows:
The input layer is formed by splicing the word vector and the topic vector: x = [w ; θ], where w is the 220-dimensional word vector obtained by Word2vec training and θ is the topic vector extracted by latent semantic analysis, of the same dimension as the word vector;
The BiLSTM layer is a bidirectional long short-term memory network comprising two LSTMs, one forward and one backward. The output of the BiLSTM layer at time t concatenates the output h_t^f of the forward LSTM with the output h_t^b of the backward LSTM, that is, O_t = [h_t^f ; h_t^b].
The number of hidden-layer neurons in the BiLSTM is set to 150; the function of the BiLSTM layer is to capture the word-order information of the text.
The CNN layer consists of a convolutional layer, a normalization layer, an activation layer, and a pooling layer.
The convolution kernels of the convolutional layer have sizes 3, 4, and 5, with 128 kernels of each size.
The normalization layer uses batch normalization; for a mini-batch {x_1, ..., x_m} the computation is μ = (1/m) Σ_i x_i, σ² = (1/m) Σ_i (x_i − μ)², x̂_i = (x_i − μ) / √(σ² + ε), y_i = γ x̂_i + β.
The activation layer uses the ReLU function, f(x) = max(0, x).
The pooling stage uses a max-pooling strategy, which reduces the estimated-mean shift caused by convolutional-layer parameter errors and preserves more local information.
The results obtained for the three languages after the neural network processing are cascaded and fed to a softmax function for class prediction.
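The data flow through these layers can be traced at the shape level, assuming numpy. The sketch below uses random weights and is not a trainable model (the patent's implementation uses TensorFlow); it only shows how the 440-dimensional spliced input, the 300-dimensional BiLSTM output per time step, the 3×128 pooled convolution features, and the cascaded softmax input fit together.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_word, d_topic, hidden, n_classes = 30, 220, 220, 150, 13

# Input layer: splice of word vector and topic vector at each time step
x = rng.normal(size=(T, d_word + d_topic))            # (30, 440)

# BiLSTM stand-in: forward and backward hidden sequences, concatenated per step
h_fwd = rng.normal(size=(T, hidden))
h_bwd = rng.normal(size=(T, hidden))
O = np.concatenate([h_fwd, h_bwd], axis=1)            # O_t = [h_f ; h_b] -> (30, 300)

# CNN layer: kernels of widths 3/4/5 with 128 filters each, ReLU, max-pool over time
pooled = []
for width in (3, 4, 5):
    W = rng.normal(size=(width * O.shape[1], 128))
    windows = np.stack([O[i:i + width].ravel() for i in range(T - width + 1)])
    feat = np.maximum(windows @ W, 0.0)               # convolution + ReLU
    pooled.append(feat.max(axis=0))                   # max-pooling over time
cnn_out = np.concatenate(pooled)                      # (384,)

# One such vector per language sub-model; the three branches are cascaded,
# then a fully connected layer maps to the 13 classes and softmax predicts.
z = np.concatenate([cnn_out, cnn_out, cnn_out])       # stand-in for 3 branches
W_fc = rng.normal(size=(z.size, n_classes))
logits = z @ W_fc
p = np.exp(logits - logits.max())
p /= p.sum()                                          # softmax over 13 classes
```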
The parallel corpus is divided into a training set and a test set by the method of ten-fold cross validation for experimental verification.
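A minimal sketch of the ten-fold split in plain Python (illustrative; in practice a library routine such as scikit-learn's KFold would normally be used):

```python
def ten_fold_splits(n_samples, k=10):
    """Partition sample indices into k folds; each fold serves once as the
    test set while the remaining folds form the training set."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test


splits = list(ten_fold_splits(100))  # e.g. 100 samples -> 10 train/test splits
```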
Through the above steps, the neural network is trained on the training set. To prevent overfitting, a dropout mechanism is applied in the fully connected layer, ignoring some neurons with a certain probability; the dropout value in this experiment is 0.5. An L2 regularization mechanism is also introduced, whose principle is as follows:
C_0 represents the original loss function; the L2 regularization term is the sum of the squares of all parameters w divided by the size n of the training set, and λ is the regularization coefficient, i.e., C = C_0 + (λ/2n) Σ_w w².
The other parameters are set as follows: batch size 128, 200 epochs, learning rate 1e-3.
The performance of the method of the invention is verified on the test set; accuracy and cross entropy are chosen as evaluation indices.
Accuracy is defined as the ratio, on a given test set, of the number of samples the classifier classifies correctly to the total number of samples.
Cross entropy, a common evaluation function in deep learning, reflects how close the probability distribution output by the model is to the true sample distribution. It is defined as H(y, p) = −Σ_i y_i log p_i, where y is the true sample value and p is the class probability predicted by the model.
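The two evaluation indices can be sketched directly, assuming numpy; the function names are illustrative.

```python
import numpy as np


def accuracy(y_true, y_pred):
    """Fraction of samples the classifier labels correctly."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))


def cross_entropy(y_onehot, p, eps=1e-12):
    """Mean cross entropy between true one-hot labels and predicted class
    probabilities: H = -(1/N) * sum_i sum_c y_ic * log(p_ic)."""
    p = np.clip(p, eps, 1.0)  # guard against log(0)
    return float(-(np.asarray(y_onehot) * np.log(p)).sum(axis=1).mean())
```

A perfect prediction yields a cross entropy of 0; a uniform guess over two classes yields ln 2 ≈ 0.693.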
Embodiment 1
This embodiment selects single-language text sets from the multilingual text classification corpus established in step 1 and experimentally verifies the effectiveness of the sub-models. In the parameter settings of this example, the embedding size is 220 dimensions, the number of hidden-layer neurons is likewise set to 150, the number of topics is set to 220, and the batch size is 64. The comparison model is TextCNN, consisting of a convolutional layer, an activation layer, a pooling layer, and a fully connected layer; the experiment verifies that the sub-model can improve text classification accuracy.
Embodiment 2
This embodiment is basically the same as Embodiment 1, with the following difference:
this embodiment selects the multilingual text corpus established in step 1 and performs multilingual text classification. The model is expanded to the three languages; the texts of all languages are trained simultaneously and cascaded at the final neural network layer. The method can classify multilingual text accurately.
In summary, the method of this patent realizes multilingual text classification. The multilingual neural network obtained by training with this method can also classify single-language text; it overcomes the language barrier, improves the accuracy of multilingual text classification, and is scalable.
The above embodiments are intended only to illustrate the technical solution of the invention, not to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that the technical solutions recorded in the foregoing embodiments may still be modified, or some of their technical features replaced by equivalents, and that such modifications or replacements do not remove the essence of the corresponding technical solutions from the spirit and scope of the technical solutions of the embodiments of the invention.

Claims (7)

1. A multilingual text classification method fusing topic information and BiLSTM-CNN, characterized by comprising the following steps:
1) collect Chinese, English, and Korean parallel corpora and construct a parallel corpus;
2) preprocess the text of each language in the corpus;
3) train word vectors for each language using word-embedding techniques;
4) extract a topic vector for each language's text using a topic model;
5) build a neural network model suited to multiple languages, fuse in the topic information, and produce multilingual text representations.
2. The multilingual text classification method fusing topic information and BiLSTM-CNN according to claim 1, characterized in that when the multilingual parallel corpus is constructed in step 1), scientific and technical literature abstracts in 13 categories and in the three languages Chinese, English, and Korean are collected, and a content-aligned multilingual parallel corpus is constructed.
3. The multilingual text classification method fusing topic information and BiLSTM-CNN according to claim 1, characterized in that the processing of each language's text in step 2) proceeds as follows:
S1: for the Chinese corpus, build a scientific and technical dictionary containing terms from biology, medicine, and physics, and add it to the segmentation dictionary as a segmentation preference to improve the Chinese word segmentation;
S2: for the English corpus, extract the stem of each English word, i.e., reduce each word to its stem representation;
S3: for the Korean corpus, remove terminal endings and conjunctions, leaving nouns and predicates.
4. The multilingual text classification method fusing topic information and BiLSTM-CNN according to claim 1, characterized in that the word vectors of each language trained in step 3) are obtained with the CBOW model of Word2vec and have dimension 220.
5. The multilingual text classification method fusing topic information and BiLSTM-CNN according to claim 1, characterized in that the topic vectors in step 4) are extracted by latent semantic analysis, applied separately to the text of each language.
6. The multilingual text classification method fusing topic information and BiLSTM-CNN according to claim 1, characterized in that the neural network model suited to multiple languages built in step 5) is divided into three sub-models: a Chinese, an English, and a Korean neural network model; each sub-model has the same neural network structure, while training on the text of a different language yields different model parameters, and the three sub-models are finally cascaded into the complete neural network model, realizing multilingual text classification.
7. The multilingual text classification method fusing topic information and BiLSTM-CNN according to claim 1, characterized in that the structure of the topic-information-fusing neural network in step 5) is divided into an input layer, a BiLSTM layer, a CNN layer, a fully connected layer, and an output layer.
CN201910127535.8A 2019-02-20 2019-02-20 A multilingual text classification method fusing topic information and BiLSTM-CNN Pending CN109885686A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910127535.8A CN109885686A (en) 2019-02-20 2019-02-20 A multilingual text classification method fusing topic information and BiLSTM-CNN


Publications (1)

Publication Number Publication Date
CN109885686A true CN109885686A (en) 2019-06-14

Family

ID=66928567

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910127535.8A Pending CN109885686A (en) 2019-02-20 2019-02-20 A multilingual text classification method fusing topic information and BiLSTM-CNN

Country Status (1)

Country Link
CN (1) CN109885686A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106528536A (en) * 2016-11-14 2017-03-22 北京赛思信安技术股份有限公司 Multilingual word segmentation method based on dictionaries and grammar analysis
CN107562729A (en) * 2017-09-14 2018-01-09 云南大学 The Party building document representation method strengthened based on neutral net and theme
CN107894980A (en) * 2017-12-06 2018-04-10 陈件 A kind of multiple statement is to corpus of text sorting technique and grader
CN109299253A (en) * 2018-09-03 2019-02-01 华南理工大学 A kind of social text Emotion identification model construction method of Chinese based on depth integration neural network

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
Liu Jiao, "Research on multilingual short-text classification methods based on deep learning", China Master's Theses Full-text Database, Information Science and Technology *
Zhang Qun et al., "A short-text classification method fusing word vectors and LDA", New Technology of Library and Information Service *
Li Yang et al., "Text sentiment analysis based on CNN and BiLSTM network feature fusion", Journal of Computer Applications *
Hu Chaoju et al., "Sentiment analysis based on word-vector technology and hybrid neural networks", Application Research of Computers *
Jin Baohua et al., "Classification of social network public opinion based on deep learning", Electronics World *
Chen Lei et al., "Research on a text representation model based on LF-LDA and Word2vec", Electronic Technology *

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110414009A (en) * 2019-07-09 2019-11-05 昆明理工大学 The remote bilingual parallel sentence pairs abstracting method of English based on BiLSTM-CNN and device
CN110717341A (en) * 2019-09-11 2020-01-21 昆明理工大学 Method and device for constructing old-Chinese bilingual corpus with Thai as pivot
CN110717341B (en) * 2019-09-11 2022-06-14 昆明理工大学 Method and device for constructing old-Chinese bilingual corpus with Thai as pivot
CN112685374A (en) * 2019-10-17 2021-04-20 中国移动通信集团浙江有限公司 Log classification method and device and electronic equipment
CN110781690A (en) * 2019-10-31 2020-02-11 北京理工大学 Fusion and compression method of multi-source neural machine translation model
CN111191028A (en) * 2019-12-16 2020-05-22 浙江大搜车软件技术有限公司 Sample labeling method and device, computer equipment and storage medium
CN111666754A (en) * 2020-05-28 2020-09-15 平安医疗健康管理股份有限公司 Entity identification method and system based on electronic disease text and computer equipment
CN111666754B (en) * 2020-05-28 2023-02-03 深圳平安医疗健康科技服务有限公司 Entity identification method and system based on electronic disease text and computer equipment
CN111797607A (en) * 2020-06-04 2020-10-20 语联网(武汉)信息技术有限公司 Sparse noun alignment method and system
CN111797607B (en) * 2020-06-04 2024-03-29 语联网(武汉)信息技术有限公司 Sparse noun alignment method and system
CN111984762B (en) * 2020-08-05 2022-12-13 中国科学院重庆绿色智能技术研究院 Text classification method sensitive to attack resistance
CN111984762A (en) * 2020-08-05 2020-11-24 中国科学院重庆绿色智能技术研究院 Text classification method sensitive to attack resistance
CN112052750A (en) * 2020-08-20 2020-12-08 南京信息工程大学 Arrhythmia classification method based on class imbalance sensing data and depth model
CN112612889A (en) * 2020-12-28 2021-04-06 中科院计算技术研究所大数据研究院 Multilingual document classification method and device and storage medium
CN112612889B (en) * 2020-12-28 2021-10-29 中科院计算技术研究所大数据研究院 Multilingual document classification method and device and storage medium
CN112765996A (en) * 2021-01-19 2021-05-07 延边大学 Middle-heading machine translation method based on reinforcement learning and machine translation quality evaluation
CN112765996B (en) * 2021-01-19 2021-08-31 延边大学 Middle-heading machine translation method based on reinforcement learning and machine translation quality evaluation
CN113076467A (en) * 2021-03-26 2021-07-06 昆明理工大学 Chinese-crossing news topic discovery method based on cross-language neural topic model
CN114492401A (en) * 2022-01-24 2022-05-13 重庆工业职业技术学院 Working method for extracting English vocabulary based on big data
CN114492401B (en) * 2022-01-24 2022-11-15 重庆工业职业技术学院 Working method for extracting English vocabulary based on big data
CN115017921A (en) * 2022-03-10 2022-09-06 延边大学 Chinese-oriented neural machine translation method based on multi-granularity characterization

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination