CN113761885A - Bayesian LSTM-based language identification method - Google Patents


Info

Publication number
CN113761885A
Authority
CN
China
Prior art keywords
lstm
bayesian
word vector
input
language identification
Prior art date
Legal status
Pending
Application number
CN202110283749.1A
Other languages
Chinese (zh)
Inventor
周少龙
陈欣洁
余智华
冯凯
李建广
Current Assignee
Golaxy Data Technology Co ltd
Original Assignee
Golaxy Data Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Golaxy Data Technology Co ltd filed Critical Golaxy Data Technology Co ltd
Priority to CN202110283749.1A
Publication of CN113761885A


Classifications

    • G06F40/263 — Handling natural language data; natural language analysis; language identification
    • G06F18/24155 — Pattern recognition; classification techniques; Bayesian classification
    • G06F40/211 — Natural language analysis; parsing; syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F40/289 — Recognition of textual entities; phrasal analysis, e.g. finite state techniques or chunking
    • G06N3/045 — Neural networks; architecture, e.g. interconnection topology; combinations of networks
    • G06N3/047 — Neural networks; architecture; probabilistic or stochastic networks
    • G06N3/049 — Neural networks; architecture; temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N3/08 — Neural networks; learning methods

Abstract

The invention discloses a Bayesian LSTM-based language identification method comprising the following steps: S1, constructing a word vector model; S2, feeding the word vectors into the LSTM as input; S3, sampling the weights from probability density distributions and optimizing the distribution parameters; S4, classifying the Bayesian-optimized feature vectors with a Softmax classifier; and S5, obtaining the final classification category label of the text from the classification probabilities predicted in step S4. Advantages: by estimating the uncertainty of the model parameters, the method improves the robustness of the model and the accuracy of language identification.

Description

Bayesian LSTM-based language identification method
Technical Field
The invention relates to the field of language identification, in particular to a Bayesian LSTM-based language identification method.
Background
Text language identification can be regarded as a text classification task based on certain special features. Current approaches are mainly based on N-gram models or on deep learning. The existing fully supervised classifier langid.py implements a scene-insensitive language identification model based on multinomial naive Bayes classification, judging the most probable language among a set of candidate languages by computing probabilities. It can identify 97 languages, with its features being N-gram items selected by mutual information. Methods based on N-gram models suit long texts: the longer the test document, the higher the identification accuracy. Their recognition of short texts is comparatively limited, and they have great difficulty distinguishing closely related Chinese variants such as Simplified Chinese, Traditional Chinese, and Cantonese.
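For illustration, a minimal sketch of this prior-art style of identifier — character N-gram features scored by a multinomial naive Bayes classifier, in the spirit of langid.py — where the toy corpus and labels are invented for the example; real systems are trained on large multilingual corpora:

```python
# Sketch of the prior-art approach: character n-gram features plus
# multinomial naive Bayes. Toy corpus and labels are invented examples.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = ["hello world", "bonjour le monde", "hola mundo"]
train_labels = ["en", "fr", "es"]

model = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(1, 4)),  # character n-grams
    MultinomialNB(),  # per-language n-gram probabilities
)
model.fit(train_texts, train_labels)
print(model.predict(["bonjour tout le monde"]))  # expected: ['fr']
```

Because such N-gram statistics accumulate evidence with document length, accuracy grows with longer test documents and drops sharply on short texts, as noted above.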
No effective solution to these problems in the related art has yet been proposed.
Disclosure of Invention
The invention aims to provide a Bayesian LSTM-based language identification method that solves the problems described in the background above.
To achieve this aim, the invention provides the following technical solution:
a Bayesian LSTM-based language identification method comprises the following steps:
s1, constructing a word vector model;
s2, inputting the word vector into the LSTM as input;
s3, sampling the weight through probability density distribution, and optimizing distribution parameters;
s4, carrying out prediction classification on the feature vectors subjected to Bayesian optimization through a Softmax classifier;
and S5, finally obtaining the classification category label of the text according to the predicted classification probability in the step S4.
Further, constructing the word vector model in step S1 comprises the following steps:
S11, preprocessing the collected corpus files of each language to form a corpus;
S12, representing each sentence of each language as word vectors and character vectors using a tokenizer;
S13, converting each input word into a vector and then decomposing the word into its characters;
and S14, converting all characters contained in the word into vectors using an LSTM model, and concatenating the word vector with the character vectors.
Further, inputting the word vectors into the LSTM in step S2 comprises the following steps:
S21, the word vectors from the first step are taken as input, which preserves the information between words in the sentence well;
S22, the updating and retention of information in the LSTM network are realized by an input gate, a forget gate, an output gate and a cell unit.
Further, the input gate determines how much of the network input at the current time is saved to the cell state;
the forget gate determines how much of the cell state at the previous time is retained at the current time;
and the output gate controls how much of the cell state is output to the current output value of the LSTM.
Compared with the prior art, the invention has the following beneficial effects. A language corpus is constructed from web-crawler data, and training-set data are obtained by string-processing texts in different languages. A language identification method based on a Bayesian-optimized LSTM model is constructed: a long short-term memory (LSTM) network learns the dependency relationships among words, and the weight parameters of the network are optimized through Bayesian probability density distributions. The training data are then trained iteratively in time sequence, updating the model parameters, and a language identification system is built for prediction. By estimating the uncertainty of the model parameters, the method improves the robustness of the model and the accuracy of language identification.
Drawings
To illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings required by the embodiments are briefly described below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a Bayesian LSTM-based language identification method according to an embodiment of the present invention.
Detailed Description
The invention is further described below with reference to the drawings and the detailed embodiments.
Referring to Fig. 1, a Bayesian LSTM-based language identification method according to an embodiment of the present invention comprises the following steps.
step S1: constructing word vector models
Preprocessing the collected language material files of the languages to form a language database, and expressing each sentence as a word vector and a word vector by adopting a token generator for each language. Converting the input word into vectors, then decomposing each character in the word, converting all characters contained in the word into vectors by using an LSTM model, and splicing the vectors converted by the word and the characters.
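A minimal sketch of this word-plus-character representation, assuming PyTorch; the vocabulary sizes and dimensions are illustrative, since the disclosure does not fix them, and taking the character LSTM's final hidden state is one reasonable reading of "converting all characters into vectors":

```python
import torch
import torch.nn as nn

class WordCharEmbedding(nn.Module):
    """Step S1 sketch: concatenate a word vector with the final state of a
    character-level LSTM run over the word's characters (sizes assumed)."""
    def __init__(self, word_vocab=5000, char_vocab=500,
                 word_dim=128, char_dim=32, char_hidden=64):
        super().__init__()
        self.word_emb = nn.Embedding(word_vocab, word_dim)
        self.char_emb = nn.Embedding(char_vocab, char_dim)
        self.char_lstm = nn.LSTM(char_dim, char_hidden, batch_first=True)

    def forward(self, word_ids, char_ids):
        # word_ids: (batch,)   char_ids: (batch, max_word_len)
        w = self.word_emb(word_ids)                     # (batch, word_dim)
        c = self.char_emb(char_ids)                     # (batch, len, char_dim)
        _, (h_n, _) = self.char_lstm(c)                 # h_n: (1, batch, char_hidden)
        return torch.cat([w, h_n.squeeze(0)], dim=-1)   # spliced representation
```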
Step S2: The word vectors are fed into the LSTM as input.
Taking the word vectors from the first step as input preserves the information between words in the sentence well. The updating and retention of information in the LSTM network are realized by an input gate $i_t$, a forget gate $f_t$, an output gate $o_t$ and a cell unit $c_t$.
The input gate determines how much of the network input $x_t$ is saved to the cell state $c_t$ at the current time, which prevents currently irrelevant content from entering the memory:

$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$

The forget gate determines how much of the previous cell state $c_{t-1}$ is retained in the current state $c_t$, which allows information from long ago to be kept:

$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$

The output gate controls how much of the cell state $c_t$ is output to the current output value $h_t$ of the LSTM, which controls the effect of long-term memory on the current output:

$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$

The updated information at the current time is represented by $c_t$:

$c_t = f_t \odot c_{t-1} + i_t \odot g_t$, where $g_t = \tanh(W_g x_t + U_g h_{t-1} + b_g)$

The final output information is:

$h_t = o_t \odot \tanh(c_t)$

where $W$ and $U$ denote the weight matrices of the neural network, $b$ the biases, $x_t$ the input word vector, $h_{t-1}$ the hidden-layer output of the LSTM layer at the previous time step, $c_{t-1}$ the history information at the previous time step, $g_t$ the candidate-state information of the current unit, and $\sigma$ and $\tanh$ the activation functions.
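The recurrence above transcribes directly into code. A sketch, assuming PyTorch; in practice torch.nn.LSTMCell implements the same equations and would normally be used instead:

```python
import torch
import torch.nn as nn

class LSTMCellFromEquations(nn.Module):
    """Direct transcription of the gate equations above; the biases b live
    in the first Linear layer of each (W, U) pair."""
    def __init__(self, d_in, d_h):
        super().__init__()
        self.Wi, self.Ui = nn.Linear(d_in, d_h), nn.Linear(d_h, d_h, bias=False)
        self.Wf, self.Uf = nn.Linear(d_in, d_h), nn.Linear(d_h, d_h, bias=False)
        self.Wo, self.Uo = nn.Linear(d_in, d_h), nn.Linear(d_h, d_h, bias=False)
        self.Wg, self.Ug = nn.Linear(d_in, d_h), nn.Linear(d_h, d_h, bias=False)

    def forward(self, x_t, h_prev, c_prev):
        i_t = torch.sigmoid(self.Wi(x_t) + self.Ui(h_prev))  # input gate
        f_t = torch.sigmoid(self.Wf(x_t) + self.Uf(h_prev))  # forget gate
        o_t = torch.sigmoid(self.Wo(x_t) + self.Uo(h_prev))  # output gate
        g_t = torch.tanh(self.Wg(x_t) + self.Ug(h_prev))     # candidate state
        c_t = f_t * c_prev + i_t * g_t                       # cell-state update
        h_t = o_t * torch.tanh(c_t)                          # final output
        return h_t, c_t
```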
Step S3: Because the LSTM cannot learn well how important different words are relative to the sentence, an LSTM method based on Bayesian optimization is proposed, combining the core idea of Bayesian neural networks: rather than setting fixed weights, the weights are sampled from probability density distributions and the distribution parameters are optimized. At the $i$-th step, the sampling of the weights at the $n$-th layer of the model is expressed as:
$W_i^{(n)} = \mu + \log\left(1 + \exp(\rho)\right) \circ \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, 1)$
At the $i$-th step, the sampling of the bias $b$ at the $n$-th layer of the model is expressed as:
$b_i^{(n)} = \mu + \log\left(1 + \exp(\rho)\right) \circ \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, 1)$
where $\rho$ and $\mu$ are the trainable parameters describing the respective weight distributions, and $\mathcal{N}(0, 1)$ denotes the standard normal distribution.
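A minimal sketch of this weight sampling in the style of Bayes by Backprop, assuming PyTorch; names and initializations are illustrative, and full Bayesian training would additionally add a KL term between the weight posterior and its prior to the loss, which the sketch omits:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BayesianLinear(nn.Module):
    """Weights and biases are sampled as mu + log(1 + exp(rho)) * eps with
    eps ~ N(0, 1); mu and rho are the trainable distribution parameters.
    F.softplus(rho) computes log(1 + exp(rho)) as in the equations above."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.w_mu = nn.Parameter(torch.zeros(d_out, d_in))
        self.w_rho = nn.Parameter(torch.full((d_out, d_in), -3.0))  # small init std
        self.b_mu = nn.Parameter(torch.zeros(d_out))
        self.b_rho = nn.Parameter(torch.full((d_out,), -3.0))

    def forward(self, x):
        w = self.w_mu + F.softplus(self.w_rho) * torch.randn_like(self.w_mu)
        b = self.b_mu + F.softplus(self.b_rho) * torch.randn_like(self.b_mu)
        return F.linear(x, w, b)  # a fresh weight sample on every call
```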
Step S4: The Bayesian-optimized feature vector $v_t$ is classified by a Softmax classifier, which is computationally simple and effective:

$y = \mathrm{softmax}(W_v v_t + b_v)$

where $W_v$ and $b_v$ denote the optimized weight and bias.
Step S5: The final classification category label of the text is obtained from the predicted classification probabilities.
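A minimal sketch of steps S4–S5, reusing the BayesianLinear layer sketched under step S3; the label set and feature size are invented for the example:

```python
import torch
import torch.nn.functional as F

labels = ["zh-Hans", "zh-Hant", "yue"]     # illustrative label set
v_t = torch.randn(1, 64)                   # stand-in Bayesian-optimized feature
head = BayesianLinear(64, len(labels))     # classifier from the step-S3 sketch
probs = F.softmax(head(v_t), dim=-1)       # predicted class probabilities (S4)
print(labels[int(probs.argmax(dim=-1))])   # final category label (S5)
```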
According to this scheme, a language corpus is constructed from web-crawler data, and training-set data are obtained by string-processing texts in different languages. A language identification method based on a Bayesian-optimized LSTM model is constructed: a long short-term memory (LSTM) network learns the dependency relationships among words, and the weight parameters of the network are optimized through Bayesian probability density distributions. The training data are then trained iteratively in time sequence, updating the model parameters, and a language identification system is built for prediction. By estimating the uncertainty of the model parameters, the method improves the robustness of the model and the accuracy of language identification.
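Putting the scheme together, a minimal end-to-end training sketch under the same assumptions — toy vocabulary and batch, with a plain linear head standing in for the Bayesian layer:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LangID(nn.Module):
    def __init__(self, vocab=5000, emb=128, hidden=64, n_langs=3):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_langs)  # BayesianLinear in the full scheme

    def forward(self, token_ids):
        _, (h_n, _) = self.lstm(self.emb(token_ids))
        return self.head(h_n.squeeze(0))         # logits; softmax at prediction

model = LangID()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randint(0, 5000, (8, 20))  # toy batch: 8 sentences, 20 tokens each
y = torch.randint(0, 3, (8,))        # toy language labels
for _ in range(5):                   # time-sequenced iterative training
    opt.zero_grad()
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    opt.step()                       # update model parameters
```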
To make the technical solution of the invention easier to understand, the working principle and operation of the invention in practice are described in detail below.
In practical applications, the Bayesian LSTM method is a fast and effective approach for tasks such as image generation and language modeling. A Bayesian neural network samples its weights from probability density distributions and then optimizes the distribution parameters. With this method, the confidence and uncertainty of the training data and of the prediction results can be measured, and the weighting of words relative to the sentence can be strengthened. The LSTM, in turn, aims to solve the information loss that occurs when long sequences are processed with a standard recurrent neural network (RNN). Since the corpus files to be processed here are serialized data, an RNN can handle the sequence problem well, but when the training texts are too long the vanishing-gradient problem easily arises. As a specific form of RNN, the long short-term memory (LSTM) network effectively solves the long-distance dependency problem that the RNN cannot handle.
For language identification tasks in the fine-grained Chinese domain, covering Simplified Chinese, Traditional Chinese, Cantonese and the like, different written forms of the same word must be distinguished, for example the simplified and traditional forms of "China" (中国 versus 中國). A Bayesian LSTM-based language identification method is therefore proposed, applying the Bayesian LSTM approach to fine-grained Chinese language identification so as to distinguish Simplified Chinese, Traditional Chinese and Cantonese text.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the appended claims and their equivalents.

Claims (4)

1. A Bayesian LSTM-based language identification method, characterized by comprising the following steps:
S1, constructing a word vector model;
S2, feeding the word vectors into the LSTM as input;
S3, sampling the weights from probability density distributions and optimizing the distribution parameters;
S4, classifying the Bayesian-optimized feature vectors with a Softmax classifier;
and S5, obtaining the final classification category label of the text from the classification probabilities predicted in step S4.
2. The Bayesian LSTM-based language identification method according to claim 1, wherein constructing the word vector model in step S1 comprises the following steps:
S11, preprocessing the collected corpus files of each language to form a corpus;
S12, representing each sentence of each language as word vectors and character vectors using a tokenizer;
S13, converting each input word into a vector and then decomposing the word into its characters;
and S14, converting all characters contained in the word into vectors using an LSTM model, and concatenating the word vector with the character vectors.
3. The Bayesian LSTM-based language identification method according to claim 1, wherein inputting the word vectors into the LSTM in step S2 comprises the following steps:
S21, the word vectors from the first step are taken as input, which preserves the information between words in the sentence well;
S22, the updating and retention of information in the LSTM network are realized by an input gate, a forget gate, an output gate and a cell unit.
4. The Bayesian LSTM-based language identification method according to claim 3, wherein the input gate determines how much of the network input at the current time is saved to the cell state;
the forget gate determines how much of the cell state at the previous time is retained at the current time;
and the output gate controls how much of the cell state is output to the current output value of the LSTM.
CN202110283749.1A 2021-03-17 2021-03-17 Bayesian LSTM-based language identification method Pending CN113761885A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110283749.1A CN113761885A (en) 2021-03-17 2021-03-17 Bayesian LSTM-based language identification method


Publications (1)

Publication Number Publication Date
CN113761885A (en) 2021-12-07

Family

ID=78786735

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110283749.1A Pending CN113761885A (en) 2021-03-17 2021-03-17 Bayesian LSTM-based language identification method

Country Status (1)

Country Link
CN (1) CN113761885A (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060167784A1 (en) * 2004-09-10 2006-07-27 Hoffberg Steven M Game theoretic prioritization scheme for mobile ad hoc networks permitting hierarchal deference
US20180189269A1 (en) * 2016-12-30 2018-07-05 Microsoft Technology Licensing, Llc Graph long short term memory for syntactic relationship discovery
CN108335693A (en) * 2017-01-17 2018-07-27 腾讯科技(深圳)有限公司 A kind of Language Identification and languages identification equipment
US20200218857A1 (en) * 2017-07-26 2020-07-09 Siuvo Inc. Semantic Classification of Numerical Data in Natural Language Context Based on Machine Learning
CN108829818A (en) * 2018-06-12 2018-11-16 中国科学院计算技术研究所 A kind of file classification method
CN111581943A (en) * 2020-04-02 2020-08-25 昆明理工大学 Chinese-over-bilingual multi-document news viewpoint sentence identification method based on sentence association graph

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
LI YANG et al.: "Naive Bayes and BiLSTM Ensemble for Discriminating between", Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects *
MADAN GOPAL JHANWAR et al.: "An Ensemble Model for Sentiment Analysis of Hindi-English Code-Mixed Data", arXiv *
侯丽仙 et al.: "A Survey of Research on Task-Oriented Spoken Language Understanding" (in Chinese), Computer Engineering and Applications *
张琳琳 et al.: "A Deep-Learning-Based Language Identification Method for Short Texts in Similar Languages" (in Chinese), Computer Applications and Software *
沙尔旦尔·帕尔哈提 et al.: "Uyghur Short Text Classification Based on Robust Morpheme Sequences and LSTM" (in Chinese), Journal of Chinese Information Processing *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115535482A (en) * 2022-11-23 2022-12-30 克拉玛依市科林恩能源科技有限责任公司 Crude oil storage tank sealing method and system
CN116702801A (en) * 2023-08-07 2023-09-05 深圳市微星智造科技有限公司 Translation method, device, equipment and storage medium
CN116702801B (en) * 2023-08-07 2024-04-05 深圳市微星智造科技有限公司 Translation method, device, equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20211207)