CN113887249B - Mongolian neural machine translation method based on dependency syntax information and Transformer model - Google Patents

Mongolian neural machine translation method based on dependency syntax information and Transformer model

Info

Publication number
CN113887249B
CN113887249B (application CN202111113538.XA)
Authority
CN
China
Prior art keywords
mongolian
matrix
chinese
model
dependency syntax
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111113538.XA
Other languages
Chinese (zh)
Other versions
CN113887249A (en
Inventor
仁庆道尔吉
程坤
庞蕊
刘馨远
麻泽蕊
尹玉娟
吉亚图
苏依拉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inner Mongolia University of Technology
Original Assignee
Inner Mongolia University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inner Mongolia University of Technology filed Critical Inner Mongolia University of Technology
Priority to CN202111113538.XA priority Critical patent/CN113887249B/en
Publication of CN113887249A publication Critical patent/CN113887249A/en
Application granted granted Critical
Publication of CN113887249B publication Critical patent/CN113887249B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/58Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a Mongolian neural machine translation method based on dependency syntax information and a Transformer model, comprising the following steps: extracting Chinese dependency syntax information from the Mongolian-Chinese parallel corpus and converting it into an adjacency matrix; adding an extra output path to the Transformer model, so that one path predicts the Chinese target sentence while the other path learns Chinese dependency syntax knowledge; and explicitly adding a module for learning Mongolian syntax information on the encoder side, using Chinese syntax knowledge to reinforce the learning of Mongolian syntax, whereas the original Transformer model by default only learns the syntax information of the source and target languages implicitly.

Description

Mongolian neural machine translation method based on dependency syntax information and Transformer model
Technical Field
The invention belongs to the technical field of artificial intelligence and natural language processing, relates to end-to-end translation from Mongolian to Chinese, and in particular relates to a Mongolian neural machine translation method based on dependency syntax information and a Transformer model.
Background
Different nations have different cultures, beliefs and customs, and strengthening communication among them is an important way to promote friendship. The Mongolian people are members of the large family of the Chinese nation, and Mongolian-Chinese translation can effectively promote exchange between Mongolian and Chinese cultures. Only when language is no longer an obstacle can technology and culture truly spread: through Mongolian-Chinese translation, the outstanding culture, art, knowledge, customs and beliefs of the Mongolian people can be propagated, and more people can come to know Mongolian culture. Mongolian-Chinese translation thus effectively promotes cultural exchange between the Mongolian and Chinese nations and plays a promoting role in the development of Mongolian culture and the comprehensive development of the Chinese nation.
Compared with machine translation for other languages, Mongolian-Chinese machine translation has fewer results and weaker frontier research. Its research methods have developed along with the innovation of machine translation technology worldwide, generally shifting from rule-based to statistical approaches, and from statistical to neural machine translation; in this process, researchers have also fused two or more methods to improve translation quality. At present, machine translation is mainly implemented with neural machine translation technology, and its application to Mongolian-Chinese translation still leaves much room for improvement.
Mongolian has many words with the same semantics but different word forms, which can prevent the dictionary used in neural machine translation from containing enough Mongolian words and prevent the translation model from learning good word vectors, thereby degrading the performance of neural machine translation.
Disclosure of Invention
(I) Technical problems solved
In order to overcome the defects of the prior art, the invention aims to provide a Mongolian neural machine translation method based on dependency syntax information and a Transformer model. A Mongolian syntax learning module is explicitly added to the encoder of the Transformer, and the adjacency matrix of the Chinese dependency syntax tree is added after the decoder to guide the model to learn Mongolian syntax; after training is finished, Mongolian syntax information can be recovered from the Mongolian syntax learning module.
(II) technical scheme
A Mongolian-Chinese machine translation method based on dependency syntax information and a Transformer model comprises the following steps. First, dependency syntax analysis is performed with Stanford CoreNLP on the Chinese corpus in the Mongolian-Chinese parallel corpus training set to obtain dependency syntax trees, which are then converted into adjacency matrices made symmetric along the main diagonal, namely the Chinese dependency syntax adjacency matrices. Second, the Transformer model is improved: a unidirectional LSTM recurrent unit is added after the output of the last decoder, the output matrix of the unidirectional LSTM recurrent unit and the Chinese dependency syntax adjacency matrix are subtracted element-wise to obtain a loss value L2, and L2 is added to the cross-entropy loss value L1 of the Transformer model's original output to form the total loss value of the model; in addition, the encoder of the Transformer model is improved by setting a matrix for learning Mongolian syntax in the encoder. Finally, the model is trained with the training set, fine-tuned with the validation set, and evaluated with the test set.
Preferably, the flow of extracting the Chinese dependency syntax adjacency matrix is as follows:
step 1: dividing the Mongolian parallel corpus data set into a training set, a verification set and a test set;
step 2: performing dependency syntax analysis on the Chinese corpus in the Mongolian parallel corpus training set by utilizing Stanford CoreNLP to obtain a Chinese dependency syntax tree;
Step 3: convert the Chinese dependency syntax tree obtained in step 2 into an adjacency matrix for storage and make it symmetric along the main diagonal, yielding the Chinese dependency syntax adjacency matrix.
Preferably, the process of improving the Transformer model, adding one output path and adding a matrix for learning Mongolian grammar on the encoder is as follows:
step 1: modify the Transformer model into dual outputs, adding a unidirectional LSTM recurrent unit after the last decoder;
Step 2: subtract the output matrix of the unidirectional LSTM recurrent unit and the Chinese dependency syntax adjacency matrix element-wise to obtain a loss value L2, and add L2 to the cross-entropy loss value L1 of the Transformer model's original output to form the total loss value of the model;
step 3: set a matrix for learning Mongolian grammar in the encoder of the Transformer model.
Preferably, for the LSTM recurrent unit in step 1: the input of the LSTM is the output of the decoder, and the size of its output matrix is [batch_size × seq_len × seq_len].
Preferably, in step 2, the loss value L2 obtained by element-wise subtraction of the output matrix of the unidirectional LSTM recurrent unit and the corresponding Chinese dependency syntax adjacency matrix is summed with the loss of the other path and used as the total loss of the improved Transformer model.
Preferably, the encoder in step 3 adds a matrix for learning Mongolian grammar. The Transformer encoder uses a self-attention mechanism, whose formula is:
Attention(Q, K, V) = softmax(QK^T / √d_k) V
To strengthen the model's learning of source-language grammar information, the original formula is changed to add a learnable matrix M to the attention scores:
Attention(Q, K, V) = softmax(QK^T / √d_k + M) V
The M matrix is used for learning Mongolian grammar; after being guided by the Chinese dependency syntax information, the Mongolian syntax information matrix forms different weights among the Mongolian word segments, thereby strengthening the connections among them.
(III) beneficial effects
The invention provides a Mongolian-Chinese neural machine translation method based on dependency syntax information and a Transformer model, which has the following effects: the invention explicitly uses the syntax information of the target-language side to guide the model to learn the syntax information of the source language, making translation between Mongolian and Chinese more convenient, facilitating communication between the Mongolian and Chinese peoples, and promoting the establishment of friendly relations between the two groups.
Drawings
FIG. 1 is a flow chart of the translation method system of the present invention;
fig. 2 is a system diagram of the Transformer modified to dual outputs.
Detailed Description
FIG. 1 shows an embodiment of a Mongolian-Chinese machine translation method based on dependency syntax information and a Transformer model. First, dependency syntax analysis is performed with Stanford CoreNLP on the training set divided from the Mongolian-Chinese parallel corpus to obtain dependency syntax trees, which are converted into adjacency matrices for storage; the resulting matrices are made symmetric along the main diagonal to obtain the Chinese dependency syntax adjacency matrices. Second, the Transformer model is modified: a Mongolian grammar learning matrix is added to the encoder, and a unidirectional LSTM recurrent unit is added after the last decoder; the element-wise difference between the output matrix of the LSTM and the corresponding Chinese dependency syntax adjacency matrix serves as the loss value of this path, and the sum of the two paths' loss values serves as the total loss value of the model.
Specific:
1. the process of constructing the Chinese dependency adjacency matrix:
step 1: dividing the Mongolian parallel corpus data set into a training set, a verification set and a test set;
step 2: performing dependency syntax analysis on the Chinese corpus in the Mongolian parallel corpus training set by utilizing Stanford CoreNLP to obtain a Chinese dependency syntax tree;
Step 3: convert the Chinese dependency syntax tree obtained in step 2 into an adjacency matrix for storage and make it symmetric along the main diagonal, yielding the Chinese dependency syntax adjacency matrix.
When Stanford CoreNLP performs dependency syntax analysis, it first segments the sentence into words and then analyzes the relations between words. Each relation is a triple: the first element is the relation between the two words, the second element is the index of the head (parent) word, and the third element is the index of the dependent (child) word. An example:
The sentence is: i want to eat the fried shredded potatoes.
Stanford CoreNLP segments it into the following words: 'I', 'want', 'eat', 'fried', 'shredded potatoes', '.'. The resulting dependency syntax analysis is: [('ROOT', 0, 2), ('nsubj', 2, 1), ('ccomp', 2, 3), ('conj', 3, 4), ('dobj', 4, 5), ('punct', 2, 6)].
It is converted into an adjacency matrix for storage according to the following rules: first, construct a square matrix whose length and width equal the number of segmented words; for every relation except the ROOT relation, use the second and third elements as coordinates and set a marker bit 1 at the corresponding position of the square matrix, filling unmarked positions with 0; after traversing all relations, make the square matrix symmetric along its main diagonal. The Chinese dependency syntax adjacency matrix of the sentence is:
0 1 0 0 0 0
1 0 1 0 0 1
0 1 0 1 0 0
0 0 1 0 1 0
0 0 0 1 0 0
0 1 0 0 0 0
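The storage rule above can be sketched in a few lines of Python. This is illustrative code, not taken from the patent; the function name is hypothetical, and NumPy is used for the matrix:

```python
import numpy as np

def dependency_adjacency(triples, num_words):
    """Build a symmetric adjacency matrix from dependency triples.

    Each triple is (relation, head_index, child_index) with 1-based word
    indices, as produced by Stanford CoreNLP; the ROOT relation (head
    index 0) is skipped, matching the rule described in the text above.
    """
    m = np.zeros((num_words, num_words), dtype=np.float32)
    for rel, head, child in triples:
        if rel == "ROOT":
            continue
        m[head - 1, child - 1] = 1.0   # marker bit at the (head, child) coordinate
    return np.maximum(m, m.T)          # symmetrize along the main diagonal

# The example parse from the text above ("I want to eat fried shredded potatoes."):
triples = [("ROOT", 0, 2), ("nsubj", 2, 1), ("ccomp", 2, 3),
           ("conj", 3, 4), ("dobj", 4, 5), ("punct", 2, 6)]
adj = dependency_adjacency(triples, 6)
```

Running this on the example sentence reproduces the 6 × 6 matrix shown above, with each of the five non-ROOT relations contributing a symmetric pair of 1s.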
2. FIG. 2 is a block diagram of the Transformer modified to dual outputs. The process of modifying the Transformer model, adding one output path and adding a matrix for learning Mongolian grammar to the encoder is:
step 1: modify the Transformer model into dual outputs, adding a unidirectional LSTM recurrent unit after the last decoder;
Step 2: subtract the output matrix of the unidirectional LSTM recurrent unit and the Chinese dependency syntax adjacency matrix element-wise to obtain a loss value L2, and add L2 to the cross-entropy loss value L1 of the Transformer model's original output to form the total loss value of the model;
step 3: set a matrix for learning Mongolian grammar in the encoder of the Transformer model.
In step 1, for the LSTM recurrent unit: the input of the LSTM is the output of the decoder, and the size of its output matrix is [batch_size × seq_len × seq_len].
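The dimension mapping described here, [batch_size × seq_len × d_model] in and [batch_size × seq_len × seq_len] out, can be sketched with PyTorch's `nn.LSTM` by setting `hidden_size` equal to the sentence length. This is an assumed implementation, not the patent's code, and the concrete dimensions are illustrative:

```python
import torch
import torch.nn as nn

# Illustrative dimensions (not from the patent).
batch_size, seq_len, d_model = 4, 10, 16

# Unidirectional LSTM head added after the last decoder: because
# hidden_size == seq_len, each time step emits a seq_len-sized row,
# so the whole output forms a [batch, seq_len, seq_len] matrix that
# can be compared with the dependency adjacency matrix.
syntax_head = nn.LSTM(
    input_size=d_model,    # word-vector embedding dimension
    hidden_size=seq_len,   # hidden neurons = training sentence length
    num_layers=1,
    bias=True,
    batch_first=True,      # batch is the first dimension of the output
    dropout=0.0,
    bidirectional=False,   # unidirectional, per the text
)

decoder_out = torch.randn(batch_size, seq_len, d_model)  # stand-in decoder output
pred_adj, _ = syntax_head(decoder_out)
print(pred_adj.shape)  # torch.Size([4, 10, 10])
```

Note that the `nonlinearity` hyperparameter mentioned later in claim 1 applies to PyTorch's plain `nn.RNN`, not to `nn.LSTM`, so it is omitted here.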
In step 2, the loss value L2 obtained by element-wise subtraction of the output matrix of the unidirectional LSTM recurrent unit and the corresponding Chinese dependency syntax adjacency matrix is summed with the loss of the other path and used as the total loss of the improved Transformer model.
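The two-path loss can be sketched as follows. The text only says the matrices are "subtracted element-wise", so the reduction of that difference to a scalar (here the mean of absolute differences) is an assumption, as are all tensor shapes:

```python
import torch
import torch.nn.functional as F

batch_size, seq_len, vocab = 2, 5, 100  # illustrative sizes

# Path 1: ordinary target-sentence prediction -> cross-entropy loss L1.
logits = torch.randn(batch_size, seq_len, vocab, requires_grad=True)
targets = torch.randint(0, vocab, (batch_size, seq_len))
l1 = F.cross_entropy(logits.reshape(-1, vocab), targets.reshape(-1))

# Path 2: LSTM output vs. Chinese dependency adjacency matrix -> loss L2.
pred_adj = torch.rand(batch_size, seq_len, seq_len, requires_grad=True)
gold_adj = torch.randint(0, 2, (batch_size, seq_len, seq_len)).float()
l2 = (pred_adj - gold_adj).abs().mean()  # element-wise subtraction, reduced to a scalar

total = l1 + l2       # total loss of the improved model
total.backward()      # both paths receive gradients from the single total loss
```

Summing the two losses lets one backward pass train the translation path and the syntax-learning path jointly.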
The encoder in step 3 adds a matrix for learning Mongolian grammar. A self-attention mechanism is used in the Transformer encoder, whose formula is:
Attention(Q, K, V) = softmax(QK^T / √d_k) V
To strengthen the model's learning of source-language grammar information, the original formula is changed to add a learnable matrix M to the attention scores:
Attention(Q, K, V) = softmax(QK^T / √d_k + M) V
The M matrix is used to strengthen the learning of Mongolian grammar.
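A minimal sketch of this modified self-attention follows. The exact placement of M is a reconstruction (the patent's formula figure is not reproduced in the text); here it is added to the scaled attention scores before the softmax, which matches the description of M forming weights among word segments:

```python
import math
import torch
import torch.nn.functional as F

def attention_with_syntax(q, k, v, m):
    """Scaled dot-product attention with a learnable syntax matrix M.

    q, k, v: [seq_len, d_k] tensors; m: [seq_len, seq_len] bias added
    to the attention scores, so M directly reweights token-to-token links.
    """
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    scores = scores + m                      # M biases attention between word segments
    return F.softmax(scores, dim=-1) @ v

seq_len, d_k = 6, 64
q = torch.randn(seq_len, d_k)
k = torch.randn(seq_len, d_k)
v = torch.randn(seq_len, d_k)
M = torch.zeros(seq_len, seq_len, requires_grad=True)  # learned during training
out = attention_with_syntax(q, k, v, M)
```

With M initialized to zero the formula reduces exactly to standard scaled dot-product attention; training then moves M away from zero under the guidance of the Chinese dependency syntax loss.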

Claims (2)

1. A Mongolian-Chinese neural machine translation method based on dependency syntax information and a Transformer model, characterized in that: firstly, dependency syntax analysis is performed with Stanford CoreNLP on the Chinese corpus in the Mongolian-Chinese parallel corpus, and the obtained Chinese dependency syntax trees are stored as adjacency matrices; secondly, the Transformer model is modified: the Transformer model is converted from single output to dual outputs, wherein one path is used for predicting the Chinese target sentence and the other path is used for learning Chinese syntax knowledge, and a matrix for learning Mongolian grammar is added to the encoder of the Transformer model; then the model is trained in combination with the Chinese dependency syntax knowledge, which is used to make the model learn Mongolian grammar and thereby improve the quality of Mongolian-Chinese machine translation; the process of improving the Transformer model, adding one output path and adding a matrix for learning Mongolian grammar on the encoder, is as follows: step 1: modify the Transformer model into dual outputs, adding a unidirectional LSTM recurrent unit after the last decoder; step 2: subtract the output matrix of the unidirectional LSTM recurrent unit and the Chinese dependency syntax adjacency matrix element-wise to obtain a loss value L2, and add L2 to the cross-entropy loss value L1 of the Transformer model's original output to form the total loss value of the model; step 3: set a matrix for learning Mongolian grammar in the encoder of the Transformer model;
In step 1, for the LSTM recurrent unit, the newly added recurrent layer: the output matrix size of the Transformer model decoder is [batch_size × seq_len × d_model], where batch_size is the number of training samples per batch, seq_len is the training sentence length, and d_model is the word-vector embedding dimension; a unidirectional LSTM recurrent unit is added after the output of the decoder, and the hyperparameters of the unidirectional LSTM unit are set as follows: input_size is the dimension of the input feature, i.e. the word-vector dimension, equal to d_model; hidden_size is the number of hidden-layer neurons, equal to seq_len; num_layers defines the number of layers of the network; nonlinearity defines the activation function; bias defines whether bias is used; batch_first defines whether the batch size is the first dimension of the output matrix; dropout defines the probability of randomly deactivating certain neurons; bidirectional defines whether a bidirectional LSTM is used; the input of the LSTM is the output of the decoder, and the size of its output matrix is [batch_size × seq_len × seq_len];
in step 2, the loss value L2 obtained by element-wise subtraction of the output matrix of the unidirectional LSTM recurrent unit and the corresponding Chinese dependency syntax adjacency matrix is summed with the loss of the other path and used as the total loss of the improved Transformer model;
the encoder in step 3 adds a matrix for learning Mongolian grammar: the Transformer encoder uses a self-attention mechanism, whose formula is:
Attention(Q, K, V) = softmax(QK^T / √d_k) V
wherein Q, K, V are matrices of the same dimensions; by analogy with translation, Q is the distribution of the current target language to be expressed, K represents the source-language syntax structure, and V is the source-language distribution; the correlation is calculated with the dot product of Q and K, and the calculated correlation is then multiplied by V to obtain the correspondence between the source and target languages; to strengthen the model's learning of source-language grammar information, the original formula is changed to add a learnable matrix M to the attention scores:
Attention(Q, K, V) = softmax(QK^T / √d_k + M) V
The M matrix is used for learning Mongolian grammar; after being guided by the Chinese dependency syntax information, the Mongolian syntax information matrix forms different weights among the Mongolian word segments, thereby strengthening the connections among them.
2. The Mongolian-Chinese machine translation method based on dependency syntax information and a Transformer model according to claim 1, wherein the flow of extracting the Chinese dependency syntax adjacency matrix is as follows:
step 1: dividing the Mongolian parallel corpus data set into a training set, a verification set and a test set;
step 2: performing dependency syntax analysis on the Chinese corpus in the Mongolian parallel corpus training set by utilizing Stanford CoreNLP to obtain a Chinese dependency syntax tree;
Step 3: convert the Chinese dependency syntax tree obtained in step 2 into an adjacency matrix for storage and make it symmetric along the main diagonal, yielding the Chinese dependency syntax adjacency matrix.
CN202111113538.XA 2021-09-23 2021-09-23 Mongolian neural machine translation method based on dependency syntax information and Transformer model Active CN113887249B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111113538.XA CN113887249B (en) 2021-09-23 2021-09-23 Mongolian neural machine translation method based on dependency syntax information and Transformer model


Publications (2)

Publication Number Publication Date
CN113887249A CN113887249A (en) 2022-01-04
CN113887249B true CN113887249B (en) 2024-07-12

Family

ID=79010202

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111113538.XA Active CN113887249B (en) 2021-09-23 2021-09-23 Mongolian neural machine translation method based on dependency syntax information and Transformer model

Country Status (1)

Country Link
CN (1) CN113887249B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116720531B (en) * 2023-06-20 2024-05-28 内蒙古工业大学 Mongolian neural machine translation method based on source language syntax dependency and quantization matrix

Citations (2)

Publication number Priority date Publication date Assignee Title
CN110688862A (en) * 2019-08-29 2020-01-14 内蒙古工业大学 Mongolian-Chinese inter-translation method based on transfer learning
CN113095092A (en) * 2021-04-19 2021-07-09 南京大学 Method for improving translation quality of non-autoregressive neural machine through modeling synergistic relationship

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
KR101762866B1 (en) * 2010-11-05 2017-08-16 에스케이플래닛 주식회사 Statistical translation apparatus by separating syntactic translation model from lexical translation model and statistical translation method
KR102203355B1 (en) * 2020-01-21 2021-01-18 김종호 System and method extracting experience information according to experience of product


Also Published As

Publication number Publication date
CN113887249A (en) 2022-01-04

Similar Documents

Publication Publication Date Title
CN111241294B (en) Relationship extraction method of graph convolution network based on dependency analysis and keywords
CN108280064B (en) Combined processing method for word segmentation, part of speech tagging, entity recognition and syntactic analysis
CN108009285B (en) Forest Ecology man-machine interaction method based on natural language processing
CN112559702B (en) Method for generating natural language problem in civil construction information field based on Transformer
CN110717334A (en) Text emotion analysis method based on BERT model and double-channel attention
CN110413986A (en) A kind of text cluster multi-document auto-abstracting method and system improving term vector model
CN112270193A (en) Chinese named entity identification method based on BERT-FLAT
CN108874878A (en) A kind of building system and method for knowledge mapping
CN110020438A (en) Enterprise or tissue Chinese entity disambiguation method and device based on recognition sequence
CN113505209A (en) Intelligent question-answering system for automobile field
CN113971394B (en) Text repetition rewriting system
CN108364066B (en) Artificial neural network chip and its application method based on N-GRAM and WFST model
CN117763363A (en) Cross-network academic community resource recommendation method based on knowledge graph and prompt learning
CN113887249B (en) Mongolian neural machine translation method based on dependency syntax information and transducer model
CN110297894A (en) A kind of Intelligent dialogue generation method based on auxiliary network
CN117350378A (en) Natural language understanding algorithm based on semantic matching and knowledge graph
CN117407615A (en) Web information extraction method and system based on reinforcement learning
CN112163069A (en) Text classification method based on graph neural network node feature propagation optimization
CN116340507A (en) Aspect-level emotion analysis method based on mixed weight and double-channel graph convolution
CN117235256A (en) Emotion analysis classification method under multi-class knowledge system
CN116304064A (en) Text classification method based on extraction
CN115858736A (en) Emotion text generation method based on emotion prompt fine adjustment
CN115481636A (en) Technical efficacy matrix construction method for technical literature
CN112464673B (en) Language meaning understanding method for fusing meaning original information
CN111581339B (en) Method for extracting gene events of biomedical literature based on tree-shaped LSTM

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant