CN113761885A - Bayesian LSTM-based language identification method - Google Patents


Info

Publication number
CN113761885A
Authority
CN
China
Prior art keywords
lstm
bayesian
word vector
input
language identification
Prior art date
Legal status
Pending
Application number
CN202110283749.1A
Other languages
Chinese (zh)
Inventor
周少龙
陈欣洁
余智华
冯凯
李建广
Current Assignee
Golaxy Data Technology Co ltd
Original Assignee
Golaxy Data Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Golaxy Data Technology Co ltd filed Critical Golaxy Data Technology Co ltd
Priority to CN202110283749.1A
Publication of CN113761885A


Classifications

    • G06F40/263 — Handling natural language data; natural language analysis; language identification
    • G06F18/24155 — Pattern recognition; classification techniques; Bayesian classification
    • G06F40/211 — Natural language analysis; parsing; syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F40/289 — Recognition of textual entities; phrasal analysis, e.g. finite state techniques or chunking
    • G06N3/045 — Neural networks; architecture, e.g. interconnection topology; combinations of networks
    • G06N3/047 — Neural networks; architecture; probabilistic or stochastic networks
    • G06N3/049 — Neural networks; architecture; temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N3/08 — Neural networks; learning methods

Abstract

The invention discloses a Bayesian LSTM-based language identification method comprising the following steps: S1, constructing a word vector model; S2, feeding the word vectors into the LSTM as input; S3, sampling the weights from probability density distributions and optimizing the distribution parameters; S4, classifying the Bayesian-optimized feature vectors with a Softmax classifier; and S5, obtaining the final classification category label of the text from the classification probabilities predicted in step S4. Advantages: by estimating the uncertainty of the model parameters, the method improves the robustness of the model and the accuracy of language identification.

Description

Bayesian LSTM-based language identification method
Technical Field
The invention relates to the field of language identification, in particular to a Bayesian LSTM-based language identification method.
Background
Text language identification can be regarded as a text classification task based on certain special features. Current approaches are mainly based on N-gram models or on deep learning. The existing fully supervised classifier langid.py implements a scene-insensitive language identification model based on multinomial naive Bayes classification, judging the most probable language among a set of candidate languages by computing probabilities. It can identify 97 languages, with its features being N-gram items selected by mutual information. Methods based on N-gram models suit long texts: the longer the test document, the higher the identification accuracy. Their recognition of short texts is comparatively limited, and they have great difficulty distinguishing closely related Chinese variants such as Simplified Chinese, Traditional Chinese, and Cantonese.
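For illustration, a minimal sketch of this prior-art style of identifier — character N-gram features scored by a multinomial naive Bayes classifier, in the spirit of langid.py — where the toy corpus and labels are invented for the example; real systems are trained on large multilingual corpora:

```python
# Sketch of the prior-art approach: character n-gram features plus
# multinomial naive Bayes. Toy corpus and labels are invented examples.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = ["hello world", "bonjour le monde", "hola mundo"]
train_labels = ["en", "fr", "es"]

model = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(1, 4)),  # character n-grams
    MultinomialNB(),  # per-language n-gram probabilities
)
model.fit(train_texts, train_labels)
print(model.predict(["bonjour tout le monde"]))  # expected: ['fr']
```

Because such N-gram statistics accumulate evidence with document length, accuracy grows with longer test documents and drops sharply on short texts, as noted above.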
No effective solution to these problems in the related art has yet been proposed.
Disclosure of Invention
The invention aims to provide a Bayesian LSTM-based language identification method that solves the problems described in the background above.
To achieve this aim, the invention provides the following technical solution:
a Bayesian LSTM-based language identification method comprises the following steps:
s1, constructing a word vector model;
s2, inputting the word vector into the LSTM as input;
s3, sampling the weight through probability density distribution, and optimizing distribution parameters;
s4, carrying out prediction classification on the feature vectors subjected to Bayesian optimization through a Softmax classifier;
and S5, finally obtaining the classification category label of the text according to the predicted classification probability in the step S4.
Further, constructing the word vector model in step S1 comprises the following steps:
S11, preprocessing the collected corpus files of each language to form a corpus;
S12, representing each sentence of each language as word vectors and character vectors using a tokenizer;
S13, converting each input word into a vector and then decomposing the word into its characters;
and S14, converting all characters contained in the word into vectors using an LSTM model, and concatenating the word vector with the character vectors.
Further, inputting the word vectors into the LSTM in step S2 comprises the following steps:
S21, the word vectors from the first step are taken as input, which preserves the information between words in the sentence well;
S22, the updating and retention of information in the LSTM network are realized by an input gate, a forget gate, an output gate and a cell unit.
Further, the input gate determines how much of the network input at the current time is saved to the cell state;
the forget gate determines how much of the cell state at the previous time is retained at the current time;
and the output gate controls how much of the cell state is output to the current output value of the LSTM.
Compared with the prior art, the invention has the following beneficial effects. A language corpus is constructed from web-crawler data, and training-set data are obtained by string-processing texts in different languages. A language identification method based on a Bayesian-optimized LSTM model is constructed: a long short-term memory (LSTM) network learns the dependency relationships among words, and the weight parameters of the network are optimized through Bayesian probability density distributions. The training data are then trained iteratively in time sequence, updating the model parameters, and a language identification system is built for prediction. By estimating the uncertainty of the model parameters, the method improves the robustness of the model and the accuracy of language identification.
Drawings
To illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings required by the embodiments are briefly described below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a Bayesian LSTM-based language identification method according to an embodiment of the present invention.
Detailed Description
The invention is further described below with reference to the drawings and the detailed embodiments.
Referring to Fig. 1, a Bayesian LSTM-based language identification method according to an embodiment of the present invention comprises the following steps.
step S1: constructing word vector models
Preprocessing the collected language material files of the languages to form a language database, and expressing each sentence as a word vector and a word vector by adopting a token generator for each language. Converting the input word into vectors, then decomposing each character in the word, converting all characters contained in the word into vectors by using an LSTM model, and splicing the vectors converted by the word and the characters.
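A minimal sketch of this word-plus-character representation, assuming PyTorch; the vocabulary sizes and dimensions are illustrative, since the disclosure does not fix them, and taking the character LSTM's final hidden state is one reasonable reading of "converting all characters into vectors":

```python
import torch
import torch.nn as nn

class WordCharEmbedding(nn.Module):
    """Step S1 sketch: concatenate a word vector with the final state of a
    character-level LSTM run over the word's characters (sizes assumed)."""
    def __init__(self, word_vocab=5000, char_vocab=500,
                 word_dim=128, char_dim=32, char_hidden=64):
        super().__init__()
        self.word_emb = nn.Embedding(word_vocab, word_dim)
        self.char_emb = nn.Embedding(char_vocab, char_dim)
        self.char_lstm = nn.LSTM(char_dim, char_hidden, batch_first=True)

    def forward(self, word_ids, char_ids):
        # word_ids: (batch,)   char_ids: (batch, max_word_len)
        w = self.word_emb(word_ids)                     # (batch, word_dim)
        c = self.char_emb(char_ids)                     # (batch, len, char_dim)
        _, (h_n, _) = self.char_lstm(c)                 # h_n: (1, batch, char_hidden)
        return torch.cat([w, h_n.squeeze(0)], dim=-1)   # spliced representation
```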
Step S2: The word vectors are fed into the LSTM as input.
Taking the word vectors from the first step as input preserves the information between words in the sentence well. The updating and retention of information in the LSTM network are realized by an input gate $i_t$, a forget gate $f_t$, an output gate $o_t$ and a cell unit $c_t$.
The input gate determines how much of the network input $x_t$ is saved to the cell state $c_t$ at the current time, which prevents currently irrelevant content from entering the memory:

$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$

The forget gate determines how much of the previous cell state $c_{t-1}$ is retained in the current state $c_t$, which allows information from long ago to be kept:

$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$

The output gate controls how much of the cell state $c_t$ is output to the current output value $h_t$ of the LSTM, which controls the effect of long-term memory on the current output:

$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$

The updated information at the current time is represented by $c_t$:

$c_t = f_t \odot c_{t-1} + i_t \odot g_t$, where $g_t = \tanh(W_g x_t + U_g h_{t-1} + b_g)$

The final output information is:

$h_t = o_t \odot \tanh(c_t)$

where $W$ and $U$ denote the weight matrices of the neural network, $b$ the biases, $x_t$ the input word vector, $h_{t-1}$ the hidden-layer output of the LSTM layer at the previous time step, $c_{t-1}$ the history information at the previous time step, $g_t$ the candidate-state information of the current unit, and $\sigma$ and $\tanh$ the activation functions.
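The recurrence above transcribes directly into code. A sketch, assuming PyTorch; in practice torch.nn.LSTMCell implements the same equations and would normally be used instead:

```python
import torch
import torch.nn as nn

class LSTMCellFromEquations(nn.Module):
    """Direct transcription of the gate equations above; the biases b live
    in the first Linear layer of each (W, U) pair."""
    def __init__(self, d_in, d_h):
        super().__init__()
        self.Wi, self.Ui = nn.Linear(d_in, d_h), nn.Linear(d_h, d_h, bias=False)
        self.Wf, self.Uf = nn.Linear(d_in, d_h), nn.Linear(d_h, d_h, bias=False)
        self.Wo, self.Uo = nn.Linear(d_in, d_h), nn.Linear(d_h, d_h, bias=False)
        self.Wg, self.Ug = nn.Linear(d_in, d_h), nn.Linear(d_h, d_h, bias=False)

    def forward(self, x_t, h_prev, c_prev):
        i_t = torch.sigmoid(self.Wi(x_t) + self.Ui(h_prev))  # input gate
        f_t = torch.sigmoid(self.Wf(x_t) + self.Uf(h_prev))  # forget gate
        o_t = torch.sigmoid(self.Wo(x_t) + self.Uo(h_prev))  # output gate
        g_t = torch.tanh(self.Wg(x_t) + self.Ug(h_prev))     # candidate state
        c_t = f_t * c_prev + i_t * g_t                       # cell-state update
        h_t = o_t * torch.tanh(c_t)                          # final output
        return h_t, c_t
```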
Step S3: Because the LSTM cannot learn well how important different words are relative to the sentence, an LSTM method based on Bayesian optimization is proposed, combining the core idea of Bayesian neural networks: rather than setting fixed weights, the weights are sampled from probability density distributions and the distribution parameters are optimized. At the $i$-th step, the sampling of the weights at the $n$-th layer of the model is expressed as:
$W_i^{(n)} = \mu + \log\left(1 + \exp(\rho)\right) \circ \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, 1)$
At the $i$-th step, the sampling of the bias $b$ at the $n$-th layer of the model is expressed as:
$b_i^{(n)} = \mu + \log\left(1 + \exp(\rho)\right) \circ \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, 1)$
where $\rho$ and $\mu$ are the trainable parameters describing the respective weight distributions, and $\mathcal{N}(0, 1)$ denotes the standard normal distribution.
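A minimal sketch of this weight sampling in the style of Bayes by Backprop, assuming PyTorch; names and initializations are illustrative, and full Bayesian training would additionally add a KL term between the weight posterior and its prior to the loss, which the sketch omits:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BayesianLinear(nn.Module):
    """Weights and biases are sampled as mu + log(1 + exp(rho)) * eps with
    eps ~ N(0, 1); mu and rho are the trainable distribution parameters.
    F.softplus(rho) computes log(1 + exp(rho)) as in the equations above."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.w_mu = nn.Parameter(torch.zeros(d_out, d_in))
        self.w_rho = nn.Parameter(torch.full((d_out, d_in), -3.0))  # small init std
        self.b_mu = nn.Parameter(torch.zeros(d_out))
        self.b_rho = nn.Parameter(torch.full((d_out,), -3.0))

    def forward(self, x):
        w = self.w_mu + F.softplus(self.w_rho) * torch.randn_like(self.w_mu)
        b = self.b_mu + F.softplus(self.b_rho) * torch.randn_like(self.b_mu)
        return F.linear(x, w, b)  # a fresh weight sample on every call
```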
Step S4: The Bayesian-optimized feature vector $v_t$ is classified by a Softmax classifier, which is computationally simple and effective:

$y = \mathrm{softmax}(W_v v_t + b_v)$

where $W_v$ and $b_v$ denote the optimized weight and bias.
Step S5: The final classification category label of the text is obtained from the predicted classification probabilities.
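A minimal sketch of steps S4–S5, reusing the BayesianLinear layer sketched under step S3; the label set and feature size are invented for the example:

```python
import torch
import torch.nn.functional as F

labels = ["zh-Hans", "zh-Hant", "yue"]     # illustrative label set
v_t = torch.randn(1, 64)                   # stand-in Bayesian-optimized feature
head = BayesianLinear(64, len(labels))     # classifier from the step-S3 sketch
probs = F.softmax(head(v_t), dim=-1)       # predicted class probabilities (S4)
print(labels[int(probs.argmax(dim=-1))])   # final category label (S5)
```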
According to this scheme, a language corpus is constructed from web-crawler data, and training-set data are obtained by string-processing texts in different languages. A language identification method based on a Bayesian-optimized LSTM model is constructed: a long short-term memory (LSTM) network learns the dependency relationships among words, and the weight parameters of the network are optimized through Bayesian probability density distributions. The training data are then trained iteratively in time sequence, updating the model parameters, and a language identification system is built for prediction. By estimating the uncertainty of the model parameters, the method improves the robustness of the model and the accuracy of language identification.
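Putting the scheme together, a minimal end-to-end training sketch under the same assumptions — toy vocabulary and batch, with a plain linear head standing in for the Bayesian layer:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LangID(nn.Module):
    def __init__(self, vocab=5000, emb=128, hidden=64, n_langs=3):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_langs)  # BayesianLinear in the full scheme

    def forward(self, token_ids):
        _, (h_n, _) = self.lstm(self.emb(token_ids))
        return self.head(h_n.squeeze(0))         # logits; softmax at prediction

model = LangID()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randint(0, 5000, (8, 20))  # toy batch: 8 sentences, 20 tokens each
y = torch.randint(0, 3, (8,))        # toy language labels
for _ in range(5):                   # time-sequenced iterative training
    opt.zero_grad()
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    opt.step()                       # update model parameters
```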
To make the technical solution of the invention easier to understand, the working principle and operation of the invention in practice are described in detail below.
In practical applications, the Bayesian LSTM method is a fast and effective approach for tasks such as image generation and language modeling. A Bayesian neural network samples its weights from probability density distributions and then optimizes the distribution parameters. With this method, the confidence and uncertainty of the training data and of the prediction results can be measured, and the weighting of words relative to the sentence can be strengthened. The LSTM, in turn, aims to solve the information loss that occurs when long sequences are processed with a standard recurrent neural network (RNN). Since the corpus files to be processed here are serialized data, an RNN can handle the sequence problem well, but when the training texts are too long the vanishing-gradient problem easily arises. As a specific form of RNN, the long short-term memory (LSTM) network effectively solves the long-distance dependency problem that the RNN cannot handle.
For language identification tasks in the fine-grained Chinese domain, covering Simplified Chinese, Traditional Chinese, Cantonese and the like, different written forms of the same word must be distinguished, for example the simplified and traditional forms of "China" (中国 versus 中國). A Bayesian LSTM-based language identification method is therefore proposed, applying the Bayesian LSTM approach to fine-grained Chinese language identification so as to distinguish Simplified Chinese, Traditional Chinese and Cantonese text.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the appended claims and their equivalents.

Claims (4)

1. A Bayesian LSTM-based language identification method, characterized by comprising the following steps:
S1, constructing a word vector model;
S2, feeding the word vectors into the LSTM as input;
S3, sampling the weights from probability density distributions and optimizing the distribution parameters;
S4, classifying the Bayesian-optimized feature vectors with a Softmax classifier;
and S5, obtaining the final classification category label of the text from the classification probabilities predicted in step S4.
2. The Bayesian LSTM-based language identification method according to claim 1, wherein constructing the word vector model in step S1 comprises the following steps:
S11, preprocessing the collected corpus files of each language to form a corpus;
S12, representing each sentence of each language as word vectors and character vectors using a tokenizer;
S13, converting each input word into a vector and then decomposing the word into its characters;
and S14, converting all characters contained in the word into vectors using an LSTM model, and concatenating the word vector with the character vectors.
3. The Bayesian LSTM-based language identification method according to claim 1, wherein inputting the word vectors into the LSTM in step S2 comprises the following steps:
S21, the word vectors from the first step are taken as input, which preserves the information between words in the sentence well;
S22, the updating and retention of information in the LSTM network are realized by an input gate, a forget gate, an output gate and a cell unit.
4. The Bayesian LSTM-based language identification method according to claim 3, wherein the input gate determines how much of the network input at the current time is saved to the cell state;
the forget gate determines how much of the cell state at the previous time is retained at the current time;
and the output gate controls how much of the cell state is output to the current output value of the LSTM.
CN202110283749.1A 2021-03-17 2021-03-17 Bayesian LSTM-based language identification method Pending CN113761885A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110283749.1A CN113761885A (en) 2021-03-17 2021-03-17 Bayesian LSTM-based language identification method


Publications (1)

Publication Number Publication Date
CN113761885A (en) 2021-12-07

Family

ID=78786735

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110283749.1A Pending CN113761885A (en) 2021-03-17 2021-03-17 Bayesian LSTM-based language identification method

Country Status (1)

Country Link
CN (1) CN113761885A (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060167784A1 (en) * 2004-09-10 2006-07-27 Hoffberg Steven M Game theoretic prioritization scheme for mobile ad hoc networks permitting hierarchal deference
US20180189269A1 (en) * 2016-12-30 2018-07-05 Microsoft Technology Licensing, Llc Graph long short term memory for syntactic relationship discovery
CN108335693A (en) * 2017-01-17 2018-07-27 腾讯科技(深圳)有限公司 A kind of Language Identification and languages identification equipment
US20200218857A1 (en) * 2017-07-26 2020-07-09 Siuvo Inc. Semantic Classification of Numerical Data in Natural Language Context Based on Machine Learning
CN108829818A (en) * 2018-06-12 2018-11-16 中国科学院计算技术研究所 A kind of file classification method
CN111581943A (en) * 2020-04-02 2020-08-25 昆明理工大学 Chinese-over-bilingual multi-document news viewpoint sentence identification method based on sentence association graph

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
LI YANG et al.: "Naive Bayes and BiLSTM Ensemble for Discriminating between", Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects *
MADAN GOPAL JHANWAR et al.: "An Ensemble Model for Sentiment Analysis of Hindi-English Code-Mixed Data", arXiv *
侯丽仙 et al.: "A Survey of Research on Task-Oriented Spoken Language Understanding" (in Chinese), Computer Engineering and Applications *
张琳琳 et al.: "A Deep-Learning-Based Language Identification Method for Short Texts in Similar Languages" (in Chinese), Computer Applications and Software *
沙尔旦尔·帕尔哈提 et al.: "Uyghur Short Text Classification Based on Robust Morpheme Sequences and LSTM" (in Chinese), Journal of Chinese Information Processing *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115535482A (en) * 2022-11-23 2022-12-30 克拉玛依市科林恩能源科技有限责任公司 Crude oil storage tank sealing method and system
CN116702801A (en) * 2023-08-07 2023-09-05 深圳市微星智造科技有限公司 Translation method, device, equipment and storage medium
CN116702801B (en) * 2023-08-07 2024-04-05 深圳市微星智造科技有限公司 Translation method, device, equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20211207)