CN113761885A - Bayesian LSTM-based language identification method - Google Patents
- Publication number
- CN113761885A (application CN202110283749.1A)
- Authority
- CN
- China
- Prior art keywords
- lstm
- bayesian
- word vector
- input
- language identification
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F40/263 — Natural language analysis: language identification
- G06F18/24155 — Classification techniques: Bayesian classification
- G06F40/211 — Parsing: syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
- G06F40/289 — Recognition of textual entities: phrasal analysis, e.g. finite state techniques or chunking
- G06N3/045 — Neural network architectures: combinations of networks
- G06N3/047 — Probabilistic or stochastic networks
- G06N3/049 — Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
- G06N3/08 — Learning methods
Abstract
The invention discloses a Bayesian LSTM-based language identification method comprising the following steps: S1, constructing a word vector model; S2, feeding the word vectors into an LSTM as input; S3, sampling the weights from probability density distributions and optimizing the distribution parameters; S4, performing prediction and classification on the Bayesian-optimized feature vectors with a Softmax classifier; and S5, obtaining the classification label of the text from the class probabilities predicted in step S4. Beneficial effects: by estimating the uncertainty of the model parameters, the method improves the robustness of the model and the accuracy of language identification.
Description
Technical Field
The invention relates to the field of language identification, in particular to a Bayesian LSTM-based language identification method.
Background
Text language identification can be treated as a text classification task over particular features. Current approaches are mainly N-gram-model-based methods and deep-learning-based methods. An existing fully supervised classifier, langid.py, implements a scene-insensitive language identification model based on a multinomial naive Bayes classifier, judging the most probable language among a set of candidate languages by probabilistic calculation. It can identify 97 languages, with its feature N-gram terms selected by mutual information. N-gram-model-based methods suit long texts, and identification accuracy rises as the test document grows longer. However, such methods are limited on short texts, and identification is especially difficult for closely related written forms of Chinese such as Simplified Chinese, Cantonese written Chinese and Traditional Chinese.
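The N-gram approach described above can be illustrated with a minimal sketch: character-bigram profiles are built for each candidate language and the best-matching profile is chosen. All snippets, profiles and function names here are hypothetical; a production system such as the 97-language classifier mentioned above would combine mutual-information feature selection with a multinomial naive Bayes model rather than raw overlap counts.

```python
from collections import Counter

def char_ngrams(text, n=2):
    """Count overlapping character n-grams in a text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def score(text, profile, n=2):
    """Overlap between a text's n-gram counts and a language profile."""
    return sum(min(c, profile.get(g, 0))
               for g, c in char_ngrams(text, n).items())

# Toy language profiles built from tiny snippets (illustrative only).
profiles = {
    "en": char_ngrams("the quick brown fox jumps over the lazy dog"),
    "de": char_ngrams("der schnelle braune fuchs springt ueber den hund"),
}

def identify(text):
    """Return the candidate language whose profile best matches the text."""
    return max(profiles, key=lambda lang: score(text, profiles[lang]))

print(identify("the dog jumps"))  # en
```

Because overlap counts grow with document length, this style of method favors long texts, which matches the observation above that accuracy rises with document length.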
An effective solution to the problems in the related art has not been proposed yet.
Disclosure of Invention
The invention aims to provide a Bayesian LSTM-based language identification method to solve the problems in the background technology.
In order to achieve the purpose, the invention provides the following technical scheme:
a Bayesian LSTM-based language identification method comprises the following steps:
s1, constructing a word vector model;
s2, inputting the word vector into the LSTM as input;
s3, sampling the weight through probability density distribution, and optimizing distribution parameters;
s4, carrying out prediction classification on the feature vectors subjected to Bayesian optimization through a Softmax classifier;
and S5, finally obtaining the classification category label of the text according to the predicted classification probability in the step S4.
Further, the step S1 of constructing the word vector model includes the following steps:
s11, preprocessing the collected language material files of the languages to form a language material base;
s12, representing each sentence as a word vector and a word vector by adopting a token generator for each language;
s13, converting the input words into vectors, and then disassembling each character in the words;
and S14, converting all characters contained in the word into vectors by using an LSTM model, and splicing the vectors converted by the word and the characters.
Further, the step S2 of inputting the word vector into the LSTM includes the following steps:
s21, the word vector of the first step is used as input, and information among words in the sentence is well reserved;
s22, the update and the retention of LSTM network information are realized by an input gate, a forgetting gate, an output gate and a unit.
Further, the input gate determines how much of the network input at the current moment is saved into the cell state;
the forgetting gate determines how much of the cell state at the previous moment is retained at the current moment;
and the output gate controls how much of the cell state is output to the current output value of the LSTM.
Compared with the prior art, the invention has the following beneficial effects: a language corpus is constructed from web-crawler data, and training-set data are obtained by string-processing texts in different languages; a language identification method based on a Bayesian-optimized LSTM model is constructed, in which a long short-term memory network (LSTM) learns the dependency relationships between words and the weight parameters of the network are optimized through Bayesian probability density distributions; the training data are then trained by time-sequence iteration and the model parameters are updated; finally, a language identification system is constructed for prediction. By estimating the uncertainty of the model parameters, the method improves the robustness of the model and the accuracy of language identification.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings required to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a flowchart of a bayesian lstm based language identification method according to an embodiment of the present invention.
Detailed Description
The invention is further described with reference to the following drawings and detailed description:
referring to fig. 1, a language identification method based on bayesian lstm according to an embodiment of the present invention includes the following steps:
step S1: constructing word vector models
The collected corpus files of each language are preprocessed to form a corpus, and for each language a tokenizer represents each sentence as word vectors and character vectors. An input word is converted into a vector, each character in the word is then split out, all characters contained in the word are converted into vectors by an LSTM model, and the word vector is concatenated with the character vectors.
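The word vector construction of step S1 can be sketched as follows. This is an illustrative sketch only: the embedding tables are random stand-ins for a trained model, and the character vectors are mean-pooled here for brevity, whereas the patent converts the characters through an LSTM (step S14).

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # embedding size (illustrative)
word_table, char_table = {}, {}

def embed(table, key):
    """Look up, or lazily create, a random embedding: a stand-in for a trained model."""
    if key not in table:
        table[key] = rng.normal(size=DIM)
    return table[key]

def word_vector(word):
    """Concatenate a word-level vector with a pooled character-level vector.

    Step S14 of the patent runs an LSTM over the characters; mean pooling
    is substituted here purely to keep the sketch short.
    """
    w = embed(word_table, word)
    chars = np.mean([embed(char_table, c) for c in word], axis=0)
    return np.concatenate([w, chars])

print(word_vector("hello").shape)  # (16,)
```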
Step S2: the word vector is input into the LSTM as input.
The word vectors from step S1 are taken as input, so that information between words in a sentence is well preserved. The updating and retention of LSTM network information are realized by an input gate i_t, a forgetting gate f_t, an output gate o_t and a cell unit c_t.
The input gate determines how much of the network input x_t at the current moment is saved into the cell state c_t, which prevents currently irrelevant content from entering the memory:

i_t = σ(W_i x_t + U_i h_{t-1} + b_i)

The forgetting gate determines how much of the cell state c_{t-1} at the previous moment is retained into the current state c_t, which allows information from long before to be kept:

f_t = σ(W_f x_t + U_f h_{t-1} + b_f)

The output gate controls how much of the cell state c_t is output to the current output value h_t of the LSTM, which controls the effect of long-term memory on the current output:

o_t = σ(W_o x_t + U_o h_{t-1} + b_o)

The updated information at the current moment is represented by c_t:

c_t = f_t × c_{t-1} + i_t × g_t

where g_t = tanh(W_g x_t + U_g h_{t-1} + b_g)

The final output information is:

h_t = o_t × tanh(c_t)

where W and U denote the weight matrices of the neural network, b the biases, x_t the input word vector, h_{t-1} the hidden-layer output of the LSTM layer at the previous moment, c_{t-1} the historical information at the previous moment, g_t the information of the current unit in the candidate state, and σ and tanh the activation functions.
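A single forward step of the gate equations above can be written out directly. The sketch below is a minimal NumPy rendition under assumed small dimensions; the parameter names mirror the W, U and b of the formulas, and all sizes and values are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, P):
    """One LSTM step following the gate equations of step S2."""
    i_t = sigmoid(P["Wi"] @ x_t + P["Ui"] @ h_prev + P["bi"])  # input gate
    f_t = sigmoid(P["Wf"] @ x_t + P["Uf"] @ h_prev + P["bf"])  # forgetting gate
    o_t = sigmoid(P["Wo"] @ x_t + P["Uo"] @ h_prev + P["bo"])  # output gate
    g_t = np.tanh(P["Wg"] @ x_t + P["Ug"] @ h_prev + P["bg"])  # candidate state
    c_t = f_t * c_prev + i_t * g_t   # c_t = f_t x c_{t-1} + i_t x g_t
    h_t = o_t * np.tanh(c_t)         # h_t = o_t x tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(1)
n, d = 4, 3  # hidden and input sizes (illustrative)
P = {f"W{k}": rng.normal(scale=0.1, size=(n, d)) for k in "ifog"}
P.update({f"U{k}": rng.normal(scale=0.1, size=(n, n)) for k in "ifog"})
P.update({f"b{k}": np.zeros(n) for k in "ifog"})

h, c = lstm_step(rng.normal(size=d), np.zeros(n), np.zeros(n), P)
print(h.shape, c.shape)  # (4,) (4,)
```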
Step S3: because LSTM can not learn the importance degree of different words relative to sentences well, the LSTM method based on Bayesian optimization is provided by combining the core thought of Bayesian neural network. The weights are sampled by probability density distribution to optimize the distribution parameters, rather than setting a fixed weight. At the ith time, the sampling of the weights at the model nth layer is represented as:
at the ith time, the sampling of the bias b on the model nth layer is represented as:
where p, u are trainable parameters representing different weight distributions. N (0,1) represents the normal state distribution.
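The weight sampling of step S3 can be sketched with the standard reparameterization used in Bayesian neural networks. The softplus form log(1 + exp(ρ)) for the standard deviation is an assumed rendering, chosen to be consistent with the trainable parameters ρ and μ and the N(0,1) noise mentioned above; all shapes and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_weight(mu, rho):
    """Draw W = mu + log(1 + exp(rho)) * eps with eps ~ N(0, 1).

    mu and rho are the trainable distribution parameters; the softplus
    keeps the standard deviation positive (assumed form, see lead-in).
    """
    sigma = np.log1p(np.exp(rho))
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

mu = np.zeros((4, 3))
rho = np.full((4, 3), -3.0)  # softplus(-3) ~ 0.049, i.e. a small sigma
W1 = sample_weight(mu, rho)
W2 = sample_weight(mu, rho)
print(np.allclose(W1, W2))  # False: each forward pass draws fresh weights
```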
Step S4: the feature vector vt subjected to Bayesian optimization is subjected to prediction classification by a Softmax classifier with simple calculation and remarkable effect:
y=Soft max(Wvvt+bv)
wherein Wv, bV denote optimized weights and biases.
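Steps S4 and S5 can be sketched together: a Softmax layer turns the Bayesian-optimized feature vector v_t into class probabilities, and the label is the argmax. The three-class setup and all parameter values below are illustrative.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

rng = np.random.default_rng(3)
n_classes, dim = 3, 4  # e.g. Simplified / Traditional / Cantonese (illustrative)
W_v = rng.normal(scale=0.1, size=(n_classes, dim))
b_v = np.zeros(n_classes)

v_t = rng.normal(size=dim)        # Bayesian-optimized feature vector
y = softmax(W_v @ v_t + b_v)      # step S4: y = Softmax(W_v v_t + b_v)
label = int(np.argmax(y))         # step S5: class label from the probabilities
print(round(float(y.sum()), 6))   # 1.0
```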
Step S5: and finally obtaining the classification category label of the text according to the predicted classification probability.
According to the scheme, a language corpus is constructed from web-crawler data, and training-set data are obtained by string-processing texts in different languages; a language identification method based on a Bayesian-optimized LSTM model is constructed, in which a long short-term memory network (LSTM) learns the dependency relationships between words and the weight parameters of the network are optimized through Bayesian probability density distributions; the training data are then trained by time-sequence iteration and the model parameters are updated; finally, a language identification system is constructed for prediction. By estimating the uncertainty of the model parameters, the method improves the robustness of the model and the accuracy of language identification.
For the convenience of understanding the technical solutions of the present invention, the following detailed description will be made on the working principle or the operation mode of the present invention in the practical process.
In practical application, Bayesian LSTM methods have proved fast and effective for image generation and language modeling. A Bayesian neural network samples its weights from probability density distributions and then optimizes the distribution parameters. With this approach, the confidence and the uncertainty of the training data and of the prediction results can be measured, and the weight dependency of words relative to sentences can be strengthened. The LSTM, for its part, addresses the information loss that occurs when long sequence data are processed with a standard recurrent neural network (RNN). Since the corpus files processed here are a kind of serialized data, an RNN can handle the sequence problem, but when the training text is too long the vanishing-gradient problem easily arises. As a specific form of RNN, the long short-term memory network (LSTM) effectively solves the long-distance dependency problem that the RNN cannot handle.
For language identification tasks in the fine-grained Chinese domain, covering Simplified Chinese, Cantonese written Chinese, Traditional Chinese and the like, different written forms of Chinese must be distinguished, for example the simplified and traditional renderings of the same word. A Bayesian LSTM-based language identification method is therefore proposed and applied to the fine-grained Chinese domain, so as to distinguish the text language categories of Simplified, Traditional and Cantonese Chinese.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.
Claims (4)
1. A Bayesian LSTM-based language identification method is characterized by comprising the following steps:
s1, constructing a word vector model;
s2, inputting the word vector into the LSTM as input;
s3, sampling the weight through probability density distribution, and optimizing distribution parameters;
s4, carrying out prediction classification on the feature vectors subjected to Bayesian optimization through a Softmax classifier;
and S5, finally obtaining the classification category label of the text according to the predicted classification probability in the step S4.
2. The Bayesian LSTM-based language identification method according to claim 1, wherein the step S1 of constructing the word vector model comprises the following steps:
s11, preprocessing the collected language material files of the languages to form a language material base;
s12, representing each sentence as a word vector and a word vector by adopting a token generator for each language;
s13, converting the input words into vectors, and then disassembling each character in the words;
and S14, converting all characters contained in the word into vectors by using an LSTM model, and splicing the vectors converted by the word and the characters.
3. The Bayesian LSTM-based language identification method according to claim 1, wherein the step S2 of inputting the word vectors into the LSTM comprises the following steps:
s21, the word vectors from step S1 are used as input, so that information between words in the sentence is well preserved;
s22, the updating and retention of LSTM network information are realized by an input gate, a forgetting gate, an output gate and a cell unit.
4. The Bayesian LSTM-based language identification method according to claim 3, wherein the input gate determines how much of the network input at the current moment is saved into the cell state;
the forgetting gate determines how much of the cell state at the previous moment is retained at the current moment;
and the output gate controls how much of the cell state is output to the current output value of the LSTM.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110283749.1A CN113761885A (en) | 2021-03-17 | 2021-03-17 | Bayesian LSTM-based language identification method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113761885A true CN113761885A (en) | 2021-12-07 |
Family
ID=78786735
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110283749.1A Pending CN113761885A (en) | 2021-03-17 | 2021-03-17 | Bayesian LSTM-based language identification method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113761885A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060167784A1 (en) * | 2004-09-10 | 2006-07-27 | Hoffberg Steven M | Game theoretic prioritization scheme for mobile ad hoc networks permitting hierarchal deference |
US20180189269A1 (en) * | 2016-12-30 | 2018-07-05 | Microsoft Technology Licensing, Llc | Graph long short term memory for syntactic relationship discovery |
CN108335693A (en) * | 2017-01-17 | 2018-07-27 | 腾讯科技(深圳)有限公司 | A kind of Language Identification and languages identification equipment |
CN108829818A (en) * | 2018-06-12 | 2018-11-16 | 中国科学院计算技术研究所 | A kind of file classification method |
US20200218857A1 (en) * | 2017-07-26 | 2020-07-09 | Siuvo Inc. | Semantic Classification of Numerical Data in Natural Language Context Based on Machine Learning |
CN111581943A (en) * | 2020-04-02 | 2020-08-25 | 昆明理工大学 | Chinese-over-bilingual multi-document news viewpoint sentence identification method based on sentence association graph |
Non-Patent Citations (5)
Title |
---|
LI YANG等: "Naive Bayes and BiLSTM Ensemble for Discriminating between", 《PROCEEDINGS OF THE SIXTH WORKSHOP ON NLP FOR SIMILAR LANGUAGES, VARIETIES AND DIALECTS》 * |
MADAN GOPAL JHANWAR等: "An Ensemble Model for Sentiment Analysis of Hindi-English Code-Mixed Data", 《ARXIV》 * |
侯丽仙等: "面向任务口语理解研究现状综述", 《计算机工程与应用》 * |
张琳琳等: "基于深度学习的相似语言短文本的语种识别方法", 《计算机应用与软件》 * |
沙尔旦尔·帕尔哈提等: "基于稳健词素序列和LSTM的维吾尔语短文本分类", 《中文信息学报》 * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115535482A (en) * | 2022-11-23 | 2022-12-30 | 克拉玛依市科林恩能源科技有限责任公司 | Crude oil storage tank sealing method and system |
CN116702801A (en) * | 2023-08-07 | 2023-09-05 | 深圳市微星智造科技有限公司 | Translation method, device, equipment and storage medium |
CN116702801B (en) * | 2023-08-07 | 2024-04-05 | 深圳市微星智造科技有限公司 | Translation method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20211207 |