CN116956944A - Endangered language translation model method integrating syntactic information - Google Patents


Info

Publication number
CN116956944A
Authority
CN
China
Prior art keywords
endangered
language
dependency
translation model
attention
Prior art date
Legal status
Pending
Application number
CN202310960646.3A
Other languages
Chinese (zh)
Inventor
钱兆鹏
于重重
徐小龙
秦汉忠
Current Assignee
Beijing Technology and Business University
Original Assignee
Beijing Technology and Business University
Priority date
Filing date
Publication date
Application filed by Beijing Technology and Business University filed Critical Beijing Technology and Business University
Priority to CN202310960646.3A priority Critical patent/CN116956944A/en
Publication of CN116956944A publication Critical patent/CN116956944A/en
Pending legal-status Critical Current

Classifications

    • G06F40/49 Data-driven translation using very large corpora, e.g. the web
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G06N3/045 Combinations of networks
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections


Abstract

The invention discloses a method for an endangered-language translation model that fuses syntactic information, comprising the following steps: constructing an endangered-language dependency treebank in the standard dependency-syntax format in a semi-automatic way; performing dependency parsing of the endangered language with a biaffine classifier, building an endangered-language dependency parsing model based on the biaffine classifier; and adding the word-order indexes, part-of-speech tags, head-word indexes and dependency-relation labels contained in the treebank as syntactic features to the encoder of a machine translation model, constructing an endangered-language-to-Chinese neural machine translation model. With the syntactic information, translation of the endangered language can be completed more accurately; the method overcomes the drawbacks that manual annotation of endangered-language corpora is time-consuming and labor-intensive, requires a great deal of expert knowledge, yields little data, and gives poor results with conventional neural machine translation, thereby greatly improving the effectiveness of endangered-language translation.

Description

Endangered language translation model method integrating syntactic information
Technical Field
The invention relates to the technical field of machine translation in artificial intelligence, and in particular to a method for an endangered-language translation model that fuses syntactic information.
Background
Because corpus resources of most endangered languages are scarce and the languages have no written record, collected recordings are currently transcribed with the International Phonetic Alphabet and then glossed in a common language (such as Chinese) by native speakers and linguists so that the public can understand them. This transcription method consumes a great deal of manpower and time, and the corpus resources do not scale.
Applying neural machine translation to endangered languages faces the following technical difficulties:
First, endangered-language corpus resources are scarce, native speakers are few, corpus annotation is difficult and slow, and no standard dataset exists at present;
Second, the grammar of the endangered language is inconsistent with that of Chinese, and because there is no writing system, only IPA transcriptions with tone marks are available, making the material hard to understand;
Third, manual annotation of endangered-language corpora is time-consuming and labor-intensive, requires a great deal of expert knowledge, and yields little data.
Disclosure of Invention
To overcome the defects of the prior art, the invention provides a method for an endangered-language translation model that fuses syntactic information, which can complete translation of the endangered language more accurately.
The invention constructs an endangered-language translation model fusing syntactic information, which comprises three aspects: semi-automatically constructing an endangered-language dependency treebank, realizing endangered-language dependency parsing based on a biaffine classifier, and establishing a syntax-fusing endangered-language-to-Chinese neural machine translation model. With the syntactic information, the invention translates the endangered language more accurately and solves the problems that manual annotation of endangered-language corpora is time-consuming and labor-intensive, requires a great deal of expert knowledge, yields little data, and gives poor results with conventional neural machine translation.
The invention comprises the following steps:
Constructing an endangered-language dependency treebank in the standard dependency-syntax CoNLL-U format in a semi-automatic way, advancing the construction of language resources for endangered languages; fusing several embedding schemes and encoding models on the basis of a graph-based method and a biaffine-classifier model, for better understanding of the structure of the endangered language; and combining the syntactic information of the endangered language with the machine translation model Transformer to assist in completing the annotation of the endangered language.
The treebank is applied to endangered-language dependency parsing and used to establish the endangered-language-to-Chinese neural machine translation model;
The endangered-language dependency parsing module takes the treebank as experimental data, performs dependency parsing of the endangered language, and supports the subsequent construction of the translation model; the parsing model comprises an embedding layer, an encoding layer, a feature dimension-reduction layer, a biaffine analysis layer and a decoding layer;
To establish the endangered-language-to-Chinese neural machine translation model, the word-order indexes, part-of-speech tags, head-word indexes and dependency-relation labels contained in the treebank are added as syntactic features to the encoder of a Transformer model.
The construction of the model of the invention comprises three steps: semi-automatic construction of the endangered-language dependency treebank; construction of a biaffine-classifier-based endangered-language dependency parsing model; and construction of a syntax-fusing endangered-language-to-Chinese neural machine translation model. Specifically:
1. Semi-automatic construction of the endangered-language dependency treebank: corpora are first collected and preprocessed, the processed corpora are annotated with parts of speech and dependency relations, and finally the treebank is constructed and manually verified;
2. Construction of the biaffine-classifier-based endangered-language dependency parsing model: a graph-based dependency parsing model TuParser is proposed for the endangered language, which completes the dependency parsing through an embedding layer, an encoding layer, a feature dimension-reduction layer, a biaffine analysis layer and a decoding layer;
3. Construction of the endangered-language-to-Chinese neural machine translation model: the Transformer is selected as the baseline model, and the syntax-fusing neural machine translation model TuSynTRM is designed on this basis. While keeping the Transformer structure unchanged, the input features of the source side are emphasized: the word-order indexes, part-of-speech tags, head-word indexes and dependency-relation labels contained in the treebank are combined with the encoder of the conventional Transformer model.
Specifically, the method comprises the following steps:
A, semi-automatically constructing the endangered-language dependency treebank to obtain a CoNLL-U-format treebank;
B, performing dependency parsing of the endangered language with the biaffine classifier to obtain the dependency structures of endangered-language sentences.
Specifically, a graph-based endangered-language dependency parsing model TuParser is adopted.
The TuParser model comprises an embedding layer, an encoding layer, a feature dimension-reduction layer, a biaffine analysis layer and a decoding layer;
C, establishing the syntax-fusing endangered-language-to-Chinese neural machine translation model: the Transformer is selected as the baseline model, and the syntax-fusing neural machine translation model TuSynTRM is designed on this basis.
C1, determining whether a dependency exists between the words of an endangered-language sentence, and the corresponding dependency-relation type, through the dependency parsing model TuParser of B;
C2, fusing the feature embeddings of the syntactic information (the endangered-language parts of speech and dependency relations) with the position encodings (the word-order index and head-word index of the endangered language);
C2.1, taking the two kinds of syntactic information extracted in A, the part-of-speech tags and the dependency labels, as part of the input feature embedding;
C2.2, treating the word-order index and head-word index of the endangered language as position information and position-encoding them, so that the TuSynTRM model can additionally learn the syntactic information of the endangered language.
C3, the TuSynTRM model uses the conventional attention mechanism to extract information from the endangered language and Chinese, models the relation between the two languages, and builds the syntax-fusing endangered-language translation model.
C31. The input endangered-language word-vector sequence is mapped into three matrices Q, K, V, and a correlation matrix is obtained after matrix multiplication, scaling and masking;
C32. The correlation matrix is normalized, the similarity of Q and K is compared to obtain weight coefficients, and the weighted sum with V gives the self-attention value;
C33. Each self-attention head performs the self-attention operation to obtain its single-head output;
C34. Finally, the outputs of the T self-attention heads are concatenated and linearly transformed to obtain the multi-head attention output.
D. Using the constructed model, endangered-language translation fusing syntactic information is realized.
Through the above steps, a standard database is established and the syntax-fusing endangered-language translation model is constructed. With syntactic information, the model translates the endangered language more accurately; it overcomes the drawbacks that manual annotation of endangered-language corpora is time-consuming and labor-intensive, requires a great deal of expert knowledge, yields little data, and gives poor results with conventional neural machine translation, and it greatly improves the effectiveness of endangered-language translation.
Drawings
Fig. 1 is a block diagram of the biaffine-classifier-based model TuParser constructed by the present invention.
Fig. 2 is a block diagram of a TuSynTRM model constructed in accordance with the present invention.
Detailed Description
The invention is further described by way of examples with reference to the accompanying drawings.
The method comprises the following steps: 1) semi-automatic construction of the endangered-language dependency treebank; 2) dependency parsing of the endangered language based on the biaffine classifier; 3) establishment of the syntax-fusing endangered-language-to-Chinese neural machine translation model.
Specifically, the method comprises the following steps:
A, semi-automatically construct the endangered-language dependency treebank to obtain a CoNLL-U-format treebank, in which each endangered-language sentence is composed of the international phonetic transcriptions of one or more words and each word is represented by several fields of information: word-order index, part-of-speech tag, head-word index and dependency-relation label;
A1, exporting the corpus: corpora of different categories such as folk legends, historical stories and dialogues are selected, and the sentences are exported with the ELAN annotation software as three lines of text: international phonetic transcription, Chinese gloss (word-by-word translation) and Chinese translation (whole-sentence translation).
A2, preprocessing the text exported in A1, comprising:
A2.1, removing duplicate sentences and replacing all punctuation marks with spaces to obtain endangered-language-to-Chinese-gloss parallel sentence pairs.
A2.2, automatically segmenting the parallel sentence pairs obtained in A2.1 by machine, using the space as separator, to obtain single words, and separately outputting the fully aligned sentence pairs (where the Chinese glosses correspond one-to-one with the endangered-language phonetic transcriptions) and the erroneous sentences (where they do not). The erroneous sentences are re-segmented and aligned by hand, and the manually adjusted sentences are merged with those obtained by automatic machine segmentation.
A2.3, function words are marked with English abbreviations in the Chinese gloss and replaced with the corresponding Chinese words according to context during Chinese translation.
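The word-segmentation and alignment check of A2.2 can be sketched in Python (a minimal illustration; the example sentence pairs are invented):

```python
def split_aligned(pairs):
    """Split gloss-annotated sentence pairs into fully aligned pairs and errors.

    Each pair is (ipa_line, gloss_line); tokens are separated by spaces.
    A pair counts as fully aligned when the IPA tokens and Chinese gloss
    tokens correspond one to one, i.e. their token counts match.
    """
    aligned, errors = [], []
    for ipa, gloss in pairs:
        if len(ipa.split()) == len(gloss.split()):
            aligned.append((ipa, gloss))
        else:
            errors.append((ipa, gloss))
    return aligned, errors

# Hypothetical data: the second pair is misaligned (3 vs 2 tokens)
# and would be sent for manual re-segmentation.
pairs = [("a b", "x y"), ("a b c", "x y")]
ok, bad = split_aligned(pairs)
```

The misaligned pairs would then be re-segmented by hand and merged back, as described above.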
A3, tagging the parts of speech of the words segmented in A2 to obtain the endangered-language part-of-speech tags.
A4, annotating the dependency relations of the sentences preprocessed in A2 to obtain the dependency labels;
A5, constructing the CoNLL-U-format endangered-language dependency treebank: each endangered-language sentence obtained in A2 is composed of the international phonetic transcriptions of one or more words, and each word is represented by 10 fields of information, namely:
(1) ID: the index of the endangered-language word; each new sentence is numbered starting from the integer 1.
(2) FORM: the international phonetic transcription of the endangered-language word.
(3) LEMMA: the root morpheme of the endangered language; replaced with "_" here.
(4) UPOSTAG: the part-of-speech tag of the endangered language, see A3.
(5) XPOSTAG: the language-specific part-of-speech tag; replaced with "_" here.
(6) FEATS: lexical or grammatical features of the endangered language; replaced with "_" here.
(7) HEAD: the index of the head word of the current endangered-language word, either a value of ID or 0 (the root node).
(8) DEPREL: the defined endangered-language dependency relation, see A4.
(9) DEPS: secondary dependencies; replaced with "_" here.
(10) MISC: the Chinese gloss corresponding to the word's international phonetic transcription.
A6, manual verification, including checking the Chinese glosses, the part-of-speech tags and the dependency labels.
B, perform dependency parsing of the endangered language with the biaffine classifier to obtain the dependency structures of endangered-language sentences.
Referring to the Chinese dependency parsing framework designed by Zhang et al., a graph-based endangered-language dependency parsing model TuParser is proposed. The TuParser model comprises an embedding layer, an encoding layer, a feature dimension-reduction layer, a biaffine analysis layer and a decoding layer;
the B1 embedding layer designs the endangered language input vector mode by using e i Representing the constructed input vector. For an endangered language sentence w= { W containing n words 1 ,...,w i ,...,w n -w is i Representing the ith endangered language word. The method enriches the input vector e by means of part-of-speech features i Representation information of (a)
Wherein the method comprises the steps ofRepresenting endangered language word embedding vectors, +.>Representing part-of-speech feature vectors,>representing vector concatenation operations.
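The input-vector construction of the embedding layer, concatenating a word embedding with a part-of-speech feature vector, can be sketched in a few lines (toy vectors; real embeddings would be learned parameters):

```python
def concat_embedding(word_vec, pos_vec):
    """Build the input vector e_i as word embedding ⊕ POS feature vector.

    Vectors are plain Python lists, so list concatenation plays the role
    of the vector concatenation operation.
    """
    return word_vec + pos_vec

# Toy 3-dim word vector joined with a toy 2-dim POS vector.
e_i = concat_embedding([0.1, 0.2, 0.3], [1.0, 0.0])
```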
B2, encoding layer: three stacked BiLSTMs context-encode the input vectors output by B1. For the i-th endangered-language word w_i, the embedding-layer output e_i is fed into the BiLSTM, and the output of the last layer is taken as the context feature vector r_i:

r_i = BiLSTM(e_i; θ_bilstm)

where θ_bilstm are the BiLSTM model parameters.
B3, feature dimension-reduction layer: the output vector r_i of B2 is nonlinearly transformed by MLP networks to remove redundant information irrelevant to the current decision, improving the overall training speed and parsing accuracy of the TuParser model. The four reduced endangered-language syntactic feature vectors are:

h_i^(arc-dep) = MLP^(arc-dep)(r_i)
h_i^(arc-head) = MLP^(arc-head)(r_i)
h_i^(rel-dep) = MLP^(rel-dep)(r_i)
h_i^(rel-head) = MLP^(rel-head)(r_i)

where each MLP^(*) is an independent multi-layer perceptron network, and the four vectors denote, respectively, the dependency-arc child-node features, dependency-arc parent-node features, dependency-relation-type child-node features and dependency-relation-type parent-node features of the endangered language (the relation types comprise: subject-predicate, verb-object, adverbial-head, verb-complement, attributive-head, locative, coordination, dual-object structure, clause structure, double-object structure, serial-verb structure, function-word structure, and core relation).
B4: the reduced syntactic feature vectors of B3 are input to the biaffine analysis layer, where biaffine attention is used in the dependency-arc classifier and the dependency-relation classifier to obtain the inter-node dependency-arc score S^arc and the inter-node relation-classification score S^rel. The scoring functions are:

S^arc = (H^(arc-head) ⊕ I) U^(arc) (H^(arc-dep))^T
S^rel = (H^(rel-head) ⊕ I) U^(rel) (H^(rel-dep))^T

where U^(*) are model weight matrices, H^(*) denote the endangered-language syntactic feature matrices for dependency-arc prediction and relation-type prediction, and I is an identity matrix.
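At its core, the biaffine arc scoring of B4 is a bilinear product between the head and dependent feature matrices. A minimal pure-Python sketch with toy dimensions (arbitrary weights; the identity/bias augmentation of the full formula is omitted for brevity):

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def biaffine_arc_scores(H_head, H_dep, U):
    """Score every (head, dependent) pair: S = H_head @ U @ H_dep^T.

    S[i][j] is the score of word i governing word j; decoding picks,
    for each dependent, the head with the highest score.
    """
    HU = matmul(H_head, U)
    H_dep_T = [list(col) for col in zip(*H_dep)]
    return matmul(HU, H_dep_T)

# Toy 2-word sentence with 2-dimensional features (identity features,
# so the score matrix simply reproduces the weight matrix U).
H = [[1.0, 0.0], [0.0, 1.0]]
U = [[2.0, 0.0], [0.0, 3.0]]
S = biaffine_arc_scores(H, H, U)
```

The relation classifier S^rel works the same way, with one such bilinear form per relation type.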
B5: having obtained the inter-node dependency-arc score S^arc and the relation-classification score S^rel from B4, the decoding layer adopts the first-order Eisner algorithm designed for the dependency parsing task. The Eisner algorithm essentially searches for a maximum spanning tree by dynamic programming, continually merging the analyses of adjacent substrings to finally obtain the dependency structure of the whole endangered-language sentence.
C, establish the syntax-fusing endangered-language-to-Chinese neural machine translation model: the Transformer is selected as the baseline model, and the syntax-fusing neural machine translation model TuSynTRM is designed on this basis.
C1: whether a dependency exists between the words of an endangered-language sentence, and the corresponding dependency-relation type, are determined by the dependency parsing model TuParser of B.
C2: the feature embeddings of the syntactic information (the endangered-language parts of speech and dependency relations) and the position encodings (the word-order index and head-word index of the endangered language) are fused;
C2.1: the two kinds of syntactic information, the part-of-speech tags extracted in A3 and the dependency labels of A4, serve as part of the input feature embedding;
for an endangered language sentence w= { W containing n words 1 ,...,w i ,...,w n -w is i Input features representing the i-th endangered language word, the fusion syntax information extracted from W are embedded into e i Can be represented by formula 5-1:
wherein the method comprises the steps ofRepresenting pre-trained endangered language word vectors (representing endangered language words), -a pre-trained endangered language word vector (representing endangered language words) representing endangered language words representing endangered words (representing endangered words) representing endanger>Part-of-speech feature vector representing random initialization, < +.>Representing endangered language dependency feature vectors, +.>Representing vector concatenation operations.
C2.2: the word-order index and head-word index of the endangered language are treated as position information and position-encoded, so that the TuSynTRM model can additionally learn the syntactic information of the endangered language. The position-encoded word-order index and head-word index are denoted PEO(·) and PEH(·) respectively and computed as:

PEO(pos_order, 2i) = sin(pos_order / 10000^(2i/d_model))
PEO(pos_order, 2i+1) = cos(pos_order / 10000^(2i/d_model))
PEH(pos_head, 2i) = sin(pos_head / 10000^(2i/d_model))
PEH(pos_head, 2i+1) = cos(pos_head / 10000^(2i/d_model))

where pos_order denotes the word-order index of the endangered-language word in the sentence, pos_head the head-word index of the current word, d_model the dimension of the position vector, and i a dimension index within the position vector.
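The sinusoidal position encodings PEO and PEH can be sketched in Python; the same function serves both, applied once to the word-order index and once to the head-word index (toy dimensions; the Transformer-style sine/cosine scheme is assumed):

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal encoding of one position index (word order or head index).

    Even dimensions use sine and odd dimensions cosine, with wavelengths
    forming a geometric progression governed by d_model.
    """
    pe = []
    for i in range(d_model):
        angle = pos / (10000 ** (2 * (i // 2) / d_model))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

peo = positional_encoding(3, 8)  # PEO: word-order index of the word
peh = positional_encoding(1, 8)  # PEH: index of its head word
```

Both vectors are then added to (or concatenated with) the input embedding so the encoder sees the syntactic positions.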
C3: the TuSynTRM model uses the conventional attention mechanism to extract information from the endangered language and Chinese and to model the relation between the two languages.
The input endangered-language word-vector sequence H is mapped into three matrices, denoted Q, K, V; a correlation matrix is obtained after matrix multiplication, scaling and masking; the correlation matrix is then normalized with the Softmax function; finally, the similarity of Q and K is used to take the weighted sum over V, giving the self-attention value. The relevant formulas are:

Q = W_q H
K = W_k H
V = W_v H
Attention(Q, K, V) = softmax(Q K^T / √d_k) V

where V is the matrix of input features and Q, K are the feature matrices used to compute the self-attention weights; K^T is the transpose of K; d_k is the number of columns of Q and K, i.e. the vector dimension; D_h denotes the length of the input endangered-language word vectors, and W_q, W_k, W_v are projection matrices; Q K^T / √d_k compares the similarity of Q and K.
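The scaled dot-product self-attention above can be sketched in pure Python (masking omitted for brevity; the toy matrices are invented):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    Q, K, V are matrices given as lists of row vectors.
    """
    d_k = len(K[0])
    scores = [[sum(q * k for q, k in zip(qrow, krow)) / math.sqrt(d_k)
               for krow in K] for qrow in Q]
    weights = [softmax(row) for row in scores]
    return [[sum(w * v for w, v in zip(wrow, col)) for col in zip(*V)]
            for wrow in weights]

# Toy example: one query attending over two key/value positions.
out = attention([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[1.0], [0.0]])
```

The query is more similar to the first key, so the output leans toward the first value.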
Multi-head attention consists of multiple self-attention heads. The computation first maps Q, K, V into T subsets by linear transformations:

Q_t = Q W_t^Q,  K_t = K W_t^K,  V_t = V W_t^V

where t denotes the t-th head and W_t^Q, W_t^K, W_t^V are parameter matrices.
Then each head performs self-attention separately, giving the single-head output head_t:

head_t = Attention(Q_t, K_t, V_t)

Finally, the outputs of the T heads are concatenated and linearly transformed to obtain the multi-head attention output:

MultiHead(Q, K, V) = Concat(head_1, ..., head_T) W_c

where W_c is a weight matrix, MultiHead(Q, K, V) denotes the output of multi-head attention, and Concat(head_1, ..., head_T) denotes the concatenation of the T head outputs.
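The multi-head computation can be sketched end-to-end in Python; as a simplification, the learned per-head projections W_t are replaced by slicing the feature dimension into T equal parts, and the final transform W_c is taken as the identity:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention on matrices given as lists of rows."""
    d_k = len(K[0])
    scores = [[sum(q * k for q, k in zip(qr, kr)) / math.sqrt(d_k)
               for kr in K] for qr in Q]
    weights = [softmax(row) for row in scores]
    return [[sum(w * v for w, v in zip(wr, col)) for col in zip(*V)]
            for wr in weights]

def multi_head(Q, K, V, T):
    """Split Q, K, V column-wise into T heads, attend per head, concatenate.

    The per-head projections are simulated by slicing; the output
    transform W_c is omitted (identity) in this sketch.
    """
    d = len(Q[0]) // T
    heads = []
    for t in range(T):
        sl = lambda M: [row[t * d:(t + 1) * d] for row in M]
        heads.append(attention(sl(Q), sl(K), sl(V)))
    # Concatenate the head outputs along the feature dimension.
    return [sum((h[i] for h in heads), []) for i in range(len(Q))]

# Toy single-position input with 4 features and T = 2 heads.
out = multi_head([[1.0, 0.0, 0.0, 1.0]],
                 [[1.0, 0.0, 0.0, 1.0]],
                 [[0.5, 0.5, 0.5, 0.5]], 2)
```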
The invention is further described below with reference to examples:
1. Taking the northern dialect of Tujia as the research object, the dependency treebank is built semi-automatically.
1.1 A total of 6438 Tujia sentences are exported with the ELAN annotation software in the form of three lines of text: international phonetic transcription, Chinese gloss and Chinese translation.
1.2 Data preprocessing
1.2.1 Duplicate sentences are removed, leaving 6023 sentences. Punctuation marks appearing in the IPA tier and the Chinese gloss tier are then replaced with spaces.
1.2.2 The de-duplicated, de-punctuated Tujia-Chinese parallel sentence pairs are automatically segmented by machine, using the space as separator, yielding 5102 fully aligned Tujia-Chinese parallel sentence pairs. Sentences that cannot be fully aligned automatically are re-segmented and aligned by hand, and the manually adjusted sentences are merged with those obtained by automatic segmentation, giving a total of 6023 Tujia-Chinese gloss parallel sentence pairs.
1.2.3 Tujia function words are marked with English abbreviations according to their grammatical meaning, and the abbreviations are replaced with the corresponding Chinese words according to context during Chinese translation.
1.3 To describe the part of speech of each word after Tujia sentence segmentation more accurately, the part-of-speech tag descriptions of the jieba tokenizer were consulted and, in combination with the existing Tujia corpus, 36 Tujia part-of-speech tags were designed;
1.4 A labeling-relation table for the Tujia dependency treebank was designed, totaling 14 Tujia dependency-relation types, with corresponding relation descriptions and annotated examples; in the examples, the dependencies between words are shown in bold text.
1.5 Each Tujia sentence is composed of the international phonetic transcriptions of one or more words, and each Tujia word is represented by 10 fields of information, namely:
(1) ID: the index of the Tujia word; each new sentence is numbered starting from the integer 1.
(2) FORM: the international phonetic transcription of the Tujia word.
(3) LEMMA: the root morpheme of Tujia; replaced with "_" here.
(4) UPOSTAG: the Tujia part-of-speech tag.
(5) XPOSTAG: the language-specific part-of-speech tag; replaced with "_" here.
(6) FEATS: lexical or grammatical features of Tujia; replaced with "_" here.
(7) HEAD: the index of the head word of the current Tujia word, either a value of ID or 0 (the root node).
(8) DEPREL: the Tujia dependency relation.
(9) DEPS: secondary dependencies; replaced with "_" here.
(10) MISC: the Chinese gloss corresponding to the Tujia word's international phonetic transcription.
The automatically annotated part-of-speech and dependency content is collated, and a Python script is written to generate the Tujia dependency treebank automatically in the CoNLL-U data format, containing 6023 sentences.
1.6 Manual verification, divided into three parts: checking the Chinese glosses, checking the part-of-speech tags and checking the dependency relations.
2. Constructing the biaffine-classifier-based Tujia dependency parsing model TuParser;
2.1 splicing the language expression vector and the characteristic vector by the embedding layer;
2.2 coding layer uses three consecutive bilstms to context-code the input vector output by the embedded layer.
2.3, in the characteristic dimension reduction layer, nonlinear transformation is carried out on the output vector of the coding layer through an MLP network.
2.4, inputting the reduced-dimension syntactic feature vector into a double affine analysis layer, and respectively using a double affine attention mechanism in the dependency arc classifier and the dependency relation classifier to obtain an inter-node dependency arc characterization score and an inter-node classification relation characterization score;
and 2.5, the decoding layer adopts a first-order Eisner algorithm designed for the task of dependency syntactic analysis, and finally the dependency structure of the whole Tujia sentence is obtained.
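Steps 2.1-2.5 can be sketched roughly as below. This is a minimal NumPy illustration of biaffine arc scoring, not the patented TuParser itself: a random matrix stands in for the three-BiLSTM encoder output, all dimensions are illustrative, and greedy argmax decoding replaces the first-order Eisner algorithm for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_enc, d_arc = 5, 16, 8            # tokens, encoder width, reduced width

# Stand-in for the three-BiLSTM encoder output (step 2.2): one row per
# token, plus a leading row for the artificial ROOT node.
H = rng.normal(size=(n + 1, d_enc))

# Step 2.3: feature dimension reduction via separate nonlinear MLPs that
# give each token a "dependent" view and a "head-candidate" view.
W_dep = rng.normal(size=(d_enc, d_arc))
W_head = rng.normal(size=(d_enc, d_arc))
H_dep = np.tanh(H @ W_dep)
H_head = np.tanh(H @ W_head)

# Step 2.4: biaffine arc scoring; S[i, j] scores token j heading token i.
U = rng.normal(size=(d_arc, d_arc))
b = rng.normal(size=(d_arc,))
S = H_dep @ U @ H_head.T + (H_dep @ b)[:, None]   # shape (n+1, n+1)

# Step 2.5: the patent decodes with the first-order Eisner algorithm;
# greedy argmax over head candidates is shown here for brevity.
S_tokens = S[1:, :]                    # score rows for the real tokens
np.fill_diagonal(S_tokens[:, 1:], -np.inf)        # forbid self-heads
heads = S_tokens.argmax(axis=1)        # predicted head per token (0 = ROOT)
```

A second biaffine classifier of the same shape would score the relation label for each selected arc.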
3. Constructing the Tujia-Chinese neural machine translation model fusing syntactic information
3.1 The dependency parsing model TuParser determines whether a dependency relation exists between words in an endangered-language sentence and, if so, its relation type.
3.2 Feature embedding and position coding of the syntactic information are fused: the two kinds of syntactic information, endangered-language part-of-speech tags and dependency-relation tags, are extracted and used as part of the input feature embedding.
3.3 The word-order index and head-word index of the endangered language are treated as position information, so that the TuSynTRM model can additionally learn the syntactic information of the endangered language.
3.4 The TuSynTRM model uses the conventional attention mechanism to extract information from the endangered language and Chinese and to model the relation between the two languages.
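Steps 3.2-3.3 can be sketched as follows: a minimal NumPy illustration of fusing POS and dependency-relation embeddings into the input features, and of encoding both the word-order index and the head-word index as sinusoidal position codes. All dimensions, vocabulary sizes, and the additive fusion scheme are illustrative assumptions, not the patented TuSynTRM layout.

```python
import numpy as np

def sinusoid(pos, d_model):
    """Standard Transformer sinusoidal code for one integer position."""
    i = np.arange(d_model)
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

d_model = 16
rng = np.random.default_rng(1)
# Hypothetical embedding tables for tokens, POS tags and dependency relations.
E_tok = rng.normal(size=(100, d_model))
E_pos = rng.normal(size=(10, d_model))
E_rel = rng.normal(size=(10, d_model))

def embed(token_ids, pos_ids, rel_ids, head_ids):
    """Fuse feature embeddings (3.2) with order and head position codes (3.3)."""
    feats = E_tok[token_ids] + E_pos[pos_ids] + E_rel[rel_ids]
    order = np.stack([sinusoid(k, d_model) for k in range(len(token_ids))])
    heads = np.stack([sinusoid(h, d_model) for h in head_ids])
    return feats + order + heads       # syntax-aware encoder input

# Toy three-word sentence: word 2 is the root (head index 0).
X = embed(token_ids=[3, 7, 2], pos_ids=[1, 2, 1],
          rel_ids=[4, 0, 4], head_ids=[2, 0, 2])
```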
Based on these operations, an endangered language translation model fusing syntactic information is constructed. This addresses the problems that manual annotation of endangered-language corpora is time-consuming, labor-intensive, and demands substantial expertise, and that the small data size makes conventional neural machine translation methods perform poorly.
Finally, it should be noted that the examples are disclosed for the purpose of aiding in the further understanding of the present invention, but those skilled in the art will appreciate that: various alternatives and modifications are possible without departing from the scope of the invention and the appended claims. Therefore, the invention should not be limited to the disclosed embodiments, but rather the scope of the invention is defined by the appended claims.

Claims (6)

1. An endangered language translation model method fusing syntactic information, characterized by constructing an endangered language translation model that fuses syntactic information: the word-order indexes, part-of-speech tags, head-word indexes and dependency-relation tags contained in an endangered language dependency treebank are added as syntactic features to the encoder side of a machine translation model, and an endangered language-Chinese neural machine translation model is constructed; the method comprises the following steps:
1) constructing an endangered language dependency treebank in a standard dependency-syntax format in a semi-automatic manner;
2) performing dependency parsing of the endangered language based on a biaffine classifier, and constructing the biaffine-classifier-based endangered-language dependency parsing model TuParser to obtain the dependency structure of endangered-language sentences;
wherein the graph-based method and the biaffine classifier model integrate multiple language embedding schemes and encoding models, and are used to capture the structure of endangered languages;
3) based on the Transformer machine translation model, establishing the endangered language-Chinese neural machine translation model TuSynTRM fusing syntactic information; comprising the following steps:
C1. determining, through the dependency parsing model, whether dependency relations exist between words in endangered-language sentences and their corresponding relation types;
C2. fusing feature embedding and position coding of the syntactic information;
C3. extracting information from the endangered language and Chinese with an attention mechanism, modeling the relation between the two languages, and constructing the endangered language translation model fusing syntactic information; comprising the following steps:
C31. mapping the input endangered language word vector sequence into matrices Q, K, V, and obtaining a correlation matrix after matrix multiplication, scaling and masking operations;
C32. normalizing the correlation matrix, comparing the similarity of Q and K to obtain weight coefficients, and computing a weighted sum over V to obtain the self-attention value;
C33. performing the self-attention operation in each self-attention head to obtain the single-head outputs;
C34. concatenating and linearly transforming the outputs of the multiple self-attention heads to obtain the multi-head attention output;
D. realizing endangered language translation fusing syntactic information by using the constructed endangered language translation model.
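Steps C31-C34 correspond to standard scaled dot-product and multi-head attention. A minimal NumPy sketch follows; the masking operation of C31 is omitted, and the head count and dimensions are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax normalization."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """C31-C32: scaled dot-product, softmax-normalized, weighted sum over V."""
    d_k = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d_k))    # correlation matrix -> weights
    return A @ V                            # self-attention value

def multi_head(H, Wq, Wk, Wv, Wc, T):
    """C33-C34: run T heads in parallel, concatenate, project with Wc."""
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    d_h = Q.shape[-1] // T                  # per-head width
    heads = [attention(Q[:, t*d_h:(t+1)*d_h],
                       K[:, t*d_h:(t+1)*d_h],
                       V[:, t*d_h:(t+1)*d_h]) for t in range(T)]
    return np.concatenate(heads, axis=-1) @ Wc

rng = np.random.default_rng(2)
n, d = 4, 8                                 # sentence length, model width
H = rng.normal(size=(n, d))                 # input word vector sequence
Wq, Wk, Wv, Wc = (rng.normal(size=(d, d)) for _ in range(4))
out = multi_head(H, Wq, Wk, Wv, Wc, T=2)
```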
2. The method of claim 1, wherein in step 1), the standard format is the CoNLL-U format.
3. The method of an endangered language translation model fusing syntactic information as claimed in claim 1, wherein in step C2, the syntactic information features comprise the endangered language's part-of-speech tags and dependency relations, and the position code comprises the endangered language's word-order indexes and head-word indexes.
4. The method of an endangered language translation model fusing syntactic information according to claim 3, wherein the specific process of step C2 comprises:
C2.1 embedding the extracted endangered-language part-of-speech tags and dependency-relation tags as part of the input features;
C2.2 treating the endangered language's word-order index and head-word index as position information and position-coding them, so that the model learns the endangered language's syntactic information.
5. The method of claim 1, wherein in step C3, the input endangered language word vector sequence H is mapped into three matrices, denoted Q, K, V; a correlation matrix is obtained after matrix multiplication, scaling and masking operations; the correlation matrix is then normalized with a Softmax function; finally, the similarity of Q and K is used to perform a weighted summation over V, yielding the self-attention value; expressed as:
Q = W_q H
K = W_k H
V = W_v H
Attention(Q, K, V) = Softmax(Q K^T / √d_k) V
where V is the matrix representing the input features and Q, K are the feature matrices used to compute the self-attention weights; d_k is the number of columns of the Q and K matrices, i.e., the vector dimension; D_h denotes the length of the input endangered language word vector; and W_q, W_k, W_v are projection matrices.
6. The method for an endangered language translation model fusing syntactic information according to claim 5, wherein the process of obtaining the multi-head attention is expressed as follows:
Q, K, V are mapped into T subsets by linear transformation:
Q_t = Q W_t^Q, K_t = K W_t^K, V_t = V W_t^V
where t denotes the t-th head, and W_t^Q, W_t^K, W_t^V are all parameter matrices;
self-attention is applied to each head separately to obtain the single-head output head_t, expressed as:
head_t = Attention(Q_t, K_t, V_t)
the outputs of the T heads are concatenated and linearly transformed to obtain the multi-head attention output, which can be expressed as:
MultiHead(Q, K, V) = Concat(head_1, …, head_T) W_c
where W_c is a weight matrix; MultiHead(Q, K, V) denotes the multi-head attention output; and Concat(head_1, …, head_T) denotes the concatenation of the T head outputs, which is then linearly transformed by W_c.
CN202310960646.3A 2023-08-01 2023-08-01 Endangered language translation model method integrating syntactic information Pending CN116956944A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310960646.3A CN116956944A (en) 2023-08-01 2023-08-01 Endangered language translation model method integrating syntactic information

Publications (1)

Publication Number Publication Date
CN116956944A true CN116956944A (en) 2023-10-27

Family

ID=88456436

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310960646.3A Pending CN116956944A (en) 2023-08-01 2023-08-01 Endangered language translation model method integrating syntactic information

Country Status (1)

Country Link
CN (1) CN116956944A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118378091A (en) * 2024-06-19 2024-07-23 之江实验室 Method and system for constructing standard data set and baseline model for astronomical light red shift measurement


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination