CN116956944A - An endangered language translation model method integrating syntactic information - Google Patents
- Publication number
- CN116956944A (application CN202310960646.3A)
- Authority
- CN
- China
- Prior art keywords
- endangered
- language
- dependency
- head
- syntactic information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/42—Data-driven translation
- G06F40/49—Data-driven translation using very large corpora, e.g. the web
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/58—Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/082—Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Machine Translation (AREA)
Abstract
Description
Technical Field
The present invention relates to the field of machine translation within artificial intelligence, and in particular to an endangered language translation model method that integrates syntactic information.
Background
Because most endangered languages have scarce corpus resources and no writing system, collected recordings currently must first be transcribed with the International Phonetic Alphabet (IPA), and can be understood by the general public only after native speakers and linguists annotate them in a common language such as Chinese. This transcription-and-annotation workflow consumes substantial manpower and time, and the corpus resources it produces do not scale.
Existing approaches that apply neural machine translation to endangered languages face the following technical difficulties:
First, endangered language corpus resources are scarce, native speakers are few, corpus annotation is difficult and time-consuming, and no standard dataset yet exists.
Second, the grammatical rules of endangered languages differ from those of Chinese, and because these languages have no script they can only be recorded in IPA, which makes them hard to understand.
Third, manually annotating endangered language corpora is time-consuming and labor-intensive, requires extensive expert knowledge, and yields only small amounts of data.
Summary of the Invention
To overcome the above shortcomings of the prior art, the present invention provides an endangered language translation model method that integrates syntactic information; the constructed model translates endangered languages more accurately.
The construction of the syntax-aware endangered language translation model comprises three aspects: semi-automatic construction of an endangered language dependency treebank; endangered language dependency parsing based on a biaffine classifier; and an endangered language-Chinese neural machine translation model that integrates syntactic information. By exploiting syntactic information, the invention translates endangered languages more accurately and overcomes the problems that manual annotation of endangered language corpora is time-consuming and labor-intensive, requires extensive expert knowledge, and yields little data, and that conventional neural machine translation methods perform poorly in this setting.
The invention includes:
A semi-automatic procedure constructs an endangered language dependency treebank in the CoNLL-U standard format, advancing language resource construction for endangered languages. A graph-based method with a biaffine classifier model, combining multiple embedding schemes and encoding models, yields a better structural understanding of the endangered language. The syntactic information of the endangered language is then combined with the Transformer machine translation model to assist the annotation of the endangered language.
The endangered language dependency treebank is used for endangered language dependency parsing and for building the endangered language-Chinese neural machine translation model.
The endangered language dependency parsing module takes the treebank as experimental data and performs dependency parsing of the endangered language to support the subsequent translation model; it comprises an embedding layer, an encoding layer, a feature dimensionality reduction layer, a biaffine parsing layer, and a decoding layer.
The endangered language-Chinese neural machine translation model adds the word-order indices, part-of-speech tags, head-word indices, and dependency relation labels contained in the treebank as syntactic features to the encoder side of the Transformer model.
The model of the invention is constructed in three steps: semi-automatic construction of the endangered language dependency treebank; construction of a biaffine-classifier-based dependency parsing model; and construction of the endangered language-Chinese neural machine translation model that integrates syntactic information. Specifically:
1. Semi-automatic construction of the endangered language dependency treebank: first collect the corpus, then preprocess it, annotate the processed corpus with part-of-speech tags and dependency relations, and finally build the endangered language dependency treebank and verify it manually.
2. Construction of the biaffine-classifier-based dependency parsing model: a graph-based dependency parsing model for the endangered language, TuParser, is proposed; its embedding layer, encoding layer, feature dimensionality reduction layer, biaffine parsing layer, and decoding layer together perform dependency parsing of the endangered language.
3. Construction of the endangered language-Chinese neural machine translation model: Transformer is chosen as the baseline, and on top of it a syntax-aware neural machine translation model, TuSynTRM, is designed. Keeping the Transformer architecture unchanged, the source-side input features are strengthened by combining the word-order indices, part-of-speech tags, head-word indices, and dependency relation labels from the treebank with the encoder side of the standard Transformer model.
The method specifically comprises the following steps:
A. Semi-automatically construct the endangered language dependency treebank, obtaining a treebank in CoNLL-U format.
B. Perform dependency parsing of the endangered language based on a biaffine classifier, obtaining the dependency structures of endangered language sentences.
Specifically, the graph-based endangered language dependency parsing model TuParser is used.
The TuParser model consists of five parts: an embedding layer, an encoding layer, a feature dimensionality reduction layer, a biaffine parsing layer, and a decoding layer.
C. Build the endangered language-Chinese neural machine translation model that integrates syntactic information: Transformer is selected as the baseline model, and on its basis the syntax-aware neural machine translation model TuSynTRM is designed.
C1. The dependency parsing model TuParser from step B determines whether dependency relations exist between the words of an endangered language sentence, and their corresponding types.
C2. Fuse the syntactic features (the endangered language part-of-speech tags and dependency relations) into the embeddings, and position-encode the syntactic positions (the word-order index and head-word index of the endangered language).
C2.1. The endangered language part-of-speech tags and dependency relation labels extracted in step A serve as part of the input feature embedding.
C2.2. The word-order index and head-word index of the endangered language are treated as positional information and position-encoded, so that the TuSynTRM model can additionally learn the syntactic information of the endangered language.
C3. The TuSynTRM model uses the conventional attention mechanism to extract information from the endangered language and Chinese and to model the relation between the two languages, yielding the endangered language translation model that integrates syntactic information.
C31. Map the input endangered language word vector sequence into three matrices Q, K, and V, and obtain the correlation matrix after matrix multiplication, scaling, and masking.
C32. Normalize the matrix, compare the similarity of Q and K to obtain the weight coefficients, and compute the weighted sum over V to obtain the self-attention values.
C33. Perform the self-attention operation on each attention head separately to obtain the single-head outputs.
C34. Finally, concatenate and linearly transform the outputs of the T self-attention heads to obtain the multi-head attention output.
D. Use the constructed endangered language translation model to perform endangered language translation with integrated syntactic information.
Through the above steps, the standard database is established and the endangered language translation model that integrates syntactic information is built. With syntactic information the model translates endangered languages more accurately, overcoming the drawbacks that manually annotating endangered language corpora is time-consuming and labor-intensive, requires extensive expert knowledge, and yields little data, and that conventional neural machine translation methods perform poorly in this setting, thereby greatly improving the effectiveness of endangered language translation.
Brief Description of the Drawings
Figure 1 is a structural block diagram of the biaffine-classifier-based TuParser model constructed by the present invention.
Figure 2 is a structural block diagram of the TuSynTRM model constructed by the present invention.
Detailed Description
The present invention is further described below through examples in conjunction with the accompanying drawings.
The method of the invention includes: 1) semi-automatic construction of the endangered language dependency treebank; 2) dependency parsing of the endangered language based on a biaffine classifier; 3) building the endangered language-Chinese neural machine translation model that integrates syntactic information.
It specifically comprises the following steps:
A. Semi-automatically construct the endangered language dependency treebank, obtaining a treebank in CoNLL-U format, in which each endangered language sentence consists of the IPA transcriptions of one or more words and each word is represented by several fields, including: word-order index, part-of-speech tag, head-word index, and dependency relation label.
A1. Export the corpus: first select sentences of different genres, including folk tales, historical stories, and conversations, and export them through the ELAN speech annotation software as three lines of text: IPA transcription, Chinese word-by-word gloss (per-word translation), and Chinese translation (whole-sentence translation).
A2. Preprocess the text exported in A1:
A2.1. Remove duplicate sentences and replace all punctuation marks with spaces, obtaining endangered language-Chinese parallel sentence pairs.
A2.2. Automatically segment the parallel sentence pairs obtained in A2.1 into single words, using the space character as delimiter. Output separately the parallel sentence pairs that align completely during segmentation (the Chinese glosses correspond one-to-one with the endangered language IPA tokens) and the sentences with errors (no one-to-one correspondence). Re-segment the erroneous sentences and align them manually, then merge the manually adjusted sentences with those obtained by automatic segmentation.
A2.3. Mark function words with English abbreviations in the Chinese word-by-word gloss, and replace them with the corresponding Chinese words according to context in the Chinese translation.
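The automatic alignment check of step A2.2 amounts to comparing token counts across the two tiers. A minimal sketch, assuming whitespace-delimited tiers; the function names and sample data are hypothetical illustrations, not the patent's implementation:

```python
def is_aligned(ipa_tier: str, gloss_tier: str) -> bool:
    """True when the IPA tokens and the Chinese glosses correspond
    one-to-one, i.e. both tiers split into the same number of
    space-delimited tokens."""
    return len(ipa_tier.split()) == len(gloss_tier.split())

def partition_pairs(pairs):
    """Split (IPA, gloss) sentence pairs into fully aligned pairs and
    pairs that need manual re-segmentation."""
    aligned, errors = [], []
    for ipa, gloss in pairs:
        (aligned if is_aligned(ipa, gloss) else errors).append((ipa, gloss))
    return aligned, errors

# Hypothetical sample: 2 tokens vs 2 glosses aligns; 2 vs 3 does not.
pairs = [("a b", "x y"), ("a b", "x y z")]
aligned, errors = partition_pairs(pairs)
```

The misaligned pairs would then be routed to manual re-segmentation, and the two sets merged afterwards, as the step describes.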
A3. Annotate the words segmented in A2 with part-of-speech tags, obtaining the endangered language part-of-speech annotation.
A4. Annotate the preprocessed sentences from A2 with dependency relations, obtaining the dependency relation annotation.
A5. Build the CoNLL-U format endangered language dependency treebank. Each endangered language sentence obtained in A2 consists of the IPA transcriptions of one or more words, and each word is represented by 10 fields:
(1) ID: the IPA token index of the endangered language word; each new sentence is numbered starting from the integer 1.
(2) FORM: the IPA transcription of the endangered language word.
(3) LEMMA: the root morpheme of the endangered language word; replaced by "-" here.
(4) UPOSTAG: the part-of-speech tag of the endangered language word; see A3.
(5) XPOSTAG: the language-specific part-of-speech tag; replaced by "-" here.
(6) FEATS: the morphological or grammatical features of the endangered language; replaced by "-" here.
(7) HEAD: the head-word index of the current endangered language word, either the ID value of its head or 0 (the root node).
(8) DEPREL: the defined endangered language dependency relation; see A4.
(9) DEPS: secondary dependency relations; replaced by "-" here.
(10) MISC: the Chinese gloss corresponding to the IPA transcription of the endangered language word.
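A token row under the 10-field scheme above can be emitted as a tab-separated CoNLL-U line. The following minimal sketch fills the unused fields with "-" as the treebank does; the IPA form, tag, and gloss values are hypothetical placeholders:

```python
def conllu_row(idx: int, form: str, upos: str, head: int, deprel: str, misc: str) -> str:
    """Build one CoNLL-U token line with the 10 tab-separated fields;
    LEMMA, XPOSTAG, FEATS and DEPS are filled with '-' as in this treebank."""
    fields = [str(idx), form, "-", upos, "-", "-", str(head), deprel, "-", misc]
    return "\t".join(fields)

# Hypothetical token: first word of a sentence, attached to the root.
row = conllu_row(1, "a55", "n", 0, "root", "example-gloss")
```

A full sentence would be one such line per token, followed by a blank line, in the usual CoNLL-U layout.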
A6. Manual verification, including verification of the Chinese glosses, part-of-speech tags, and dependency relation labels.
B. Perform dependency parsing of the endangered language based on a biaffine classifier, obtaining the dependency structures of endangered language sentences.
Building on the Chinese dependency parsing framework designed by Zhang et al., a graph-based endangered language dependency parsing model, TuParser, is proposed. The TuParser model consists of five parts: an embedding layer, an encoding layer, a feature dimensionality reduction layer, a biaffine parsing layer, and a decoding layer.
B1. The embedding layer constructs the endangered language input vectors, each denoted e_i. For an endangered language sentence of n words W = {w_1, ..., w_i, ..., w_n}, where w_i denotes the i-th endangered language word, the part-of-speech feature is used to enrich the representation carried by the input vector e_i:

e_i = e_i^word ⊕ e_i^pos

where e_i^word denotes the endangered language word embedding vector, e_i^pos the part-of-speech feature vector, and ⊕ the vector concatenation operation.
B2. The encoding layer uses three stacked BiLSTMs to contextually encode the input vectors output by B1. For the i-th endangered language word w_i, the embedding-layer output vector e_i is fed into the BiLSTM, and the output of the model's last layer is taken as the context feature vector r_i:

r_i = BiLSTM(e_i, θ_bilstm)

where θ_bilstm denotes the BiLSTM model parameters.
B3. In the feature dimensionality reduction layer, MLP networks apply nonlinear transformations to the B2 output vector r_i, removing redundant information irrelevant to the current decision and thereby improving TuParser's overall training speed and parsing accuracy. The four reduced endangered language syntactic feature vectors are:

h_i^(arc-dep) = MLP^(arc-dep)(r_i)
h_i^(arc-head) = MLP^(arc-head)(r_i)
h_i^(rel-dep) = MLP^(rel-dep)(r_i)
h_i^(rel-head) = MLP^(rel-head)(r_i)

where MLP^(*) denotes an independent multi-layer perceptron network, and h_i^(arc-dep), h_i^(arc-head), h_i^(rel-dep), h_i^(rel-head) denote, respectively, the dependent-side dependency arc feature, the head-side dependency arc feature, and the dependent-side and head-side dependency relation type features of the endangered language (the relation types include: subject-predicate, verb-object, preposition-object, adverbial-head, verb-complement, attributive-head, locative, coordination, double-object, clausal, pivotal, serial-verb, function-word, and root relations).
B4. The reduced syntactic feature vectors h_i^(arc-dep), h_i^(arc-head), h_i^(rel-dep), h_i^(rel-head) from B3 are input to the biaffine parsing layer, where the biaffine attention mechanism is applied in the dependency arc classifier and the dependency relation classifier, yielding the inter-node arc score S^arc and the inter-node relation score S^rel. The scoring functions are:

S^arc = (H^(arc-dep) ⊕ I) U^(arc) (H^(arc-head))^T
S^rel = (H^(rel-dep) ⊕ I) U^(rel) (H^(rel-head) ⊕ I)^T

where U^(*) is a model weight matrix, H^(*) denotes the endangered language syntactic feature matrix used in arc prediction or relation-type prediction, and I is the identity term: a constant column appended to the feature matrix that realizes the bias of the biaffine transformation.
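The biaffine arc scoring step can be illustrated with a plain-Python sketch. A real implementation would use a tensor library; the dimensions, values, and the bias-augmentation detail here follow the common biaffine attention formulation and are illustrative, not the patent's exact parameterization:

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def biaffine_arc_scores(H_dep, H_head, U):
    """S[i][j]: score of word j heading word i. Each dependent vector is
    augmented with a constant 1 (bias column), multiplied by the weight
    matrix U, then dotted with every candidate head vector, giving an
    n x n arc score matrix."""
    H_dep_aug = [row + [1.0] for row in H_dep]       # append bias column
    HU = matmul(H_dep_aug, U)                        # n x d_head
    H_head_T = [list(col) for col in zip(*H_head)]   # d_head x n
    return matmul(HU, H_head_T)                      # n x n scores

# Illustrative 2-token example with 1-dimensional features.
H_dep = [[1.0], [2.0]]
H_head = [[1.0], [3.0]]
U = [[1.0], [0.5]]  # (d_dep + 1) x d_head weight matrix
S = biaffine_arc_scores(H_dep, H_head, U)  # [[1.5, 4.5], [2.5, 7.5]]
```

The relation classifier applies the same bilinear form per relation type, with both sides bias-augmented.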
B5. After obtaining the inter-node arc scores S^arc and relation scores S^rel from B4, the decoding layer applies the first-order Eisner algorithm designed for the dependency parsing task. The Eisner algorithm is in essence a dynamic program that searches for the maximum spanning tree: by repeatedly merging the analyses of adjacent substrings, it finally yields the dependency structure of the entire endangered language sentence.
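The span-merging dynamic program of the first-order Eisner algorithm can be sketched as follows. This illustrative version returns only the score of the best projective tree (a full decoder would also keep backpointers to recover each word's head), and the score matrix is a hypothetical toy example:

```python
def eisner_best_score(score):
    """First-order Eisner algorithm over an arc score matrix where
    score[h][d] is the score of attaching word d to head h and index 0
    is the artificial ROOT; returns the score of the best projective
    dependency tree (backpointers for head recovery are omitted)."""
    N = len(score)
    NEG = float("-inf")
    # [s][t][direction]: 0 = head at right end t, 1 = head at left end s
    comp = [[[NEG, NEG] for _ in range(N)] for _ in range(N)]
    incomp = [[[NEG, NEG] for _ in range(N)] for _ in range(N)]
    for s in range(N):
        comp[s][s][0] = comp[s][s][1] = 0.0
    for k in range(1, N):                 # span width
        for s in range(N - k):
            t = s + k
            # incomplete spans: add the arc between the endpoints
            best = max(comp[s][r][1] + comp[r + 1][t][0] for r in range(s, t))
            incomp[s][t][0] = best + score[t][s]   # arc t -> s
            incomp[s][t][1] = best + score[s][t]   # arc s -> t
            # complete spans: merge an incomplete span with a complete one
            comp[s][t][0] = max(comp[s][r][0] + incomp[r][t][0] for r in range(s, t))
            comp[s][t][1] = max(incomp[s][r][1] + comp[r][t][1] for r in range(s + 1, t + 1))
    return comp[0][N - 1][1]

# Toy 2-word sentence: the best tree is ROOT -> w1 -> w2 with score 2 + 5 = 7.
S = [[0.0, 2.0, 1.0],
     [0.0, 0.0, 5.0],
     [0.0, 3.0, 0.0]]
best = eisner_best_score(S)  # 7.0
```

The merging of adjacent spans mirrors the "merging analyses of adjacent substrings" described in B5.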
C. Build the endangered language-Chinese neural machine translation model that integrates syntactic information: Transformer is selected as the baseline model, and on its basis the syntax-aware neural machine translation model TuSynTRM is designed.
C1. The dependency parsing model TuParser from step B determines whether dependency relations exist between the words of an endangered language sentence, and their corresponding types.
C2. Fuse the syntactic features (the endangered language part-of-speech tags and dependency relations) into the embeddings, and position-encode the syntactic positions (the word-order index and head-word index of the endangered language).
C2.1. The endangered language part-of-speech annotation from A3 and the dependency relation annotation from A4 serve as part of the input feature embedding.
For an endangered language sentence of n words W = {w_1, ..., w_i, ..., w_n}, where w_i denotes the i-th endangered language word, the syntax-fused input feature embedding e_i extracted from W can be expressed as Equation 5-1:

e_i = e_i^word ⊕ e_i^pos ⊕ e_i^dep        (5-1)

where e_i^word denotes the pretrained endangered language word vector, e_i^pos the randomly initialized part-of-speech feature vector, e_i^dep the endangered language dependency relation feature vector, and ⊕ the vector concatenation operation.
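The concatenation in Equation 5-1 can be sketched with hypothetical toy lookup tables; in practice the word vectors are pretrained and the part-of-speech and dependency vectors are learned, and all names and values below are illustrative:

```python
# Hypothetical toy lookup tables; real vectors would be pretrained/learned.
word_emb = {"a55": [0.1, 0.2]}   # pretrained word vector e_word
pos_emb = {"n": [1.0]}           # part-of-speech feature vector e_pos
dep_emb = {"root": [0.5]}        # dependency relation feature vector e_dep

def fused_embedding(word: str, pos: str, deprel: str):
    """e_i = e_word ⊕ e_pos ⊕ e_dep (vector concatenation, Eq. 5-1)."""
    return word_emb[word] + pos_emb[pos] + dep_emb[deprel]

e = fused_embedding("a55", "n", "root")  # [0.1, 0.2, 1.0, 0.5]
```

The resulting dimensionality is simply the sum of the three component dimensionalities.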
C2.2. The word-order index and head-word index of the endangered language are treated as a kind of positional information, enabling the TuSynTRM model to additionally learn the syntactic information of the endangered language. Let PEO(·) and PEH(·) denote the position encodings of the endangered language word-order index and head-word index, respectively, computed as:

PEO(pos_order, 2i) = sin(pos_order / 10000^(2i/d_model))
PEO(pos_order, 2i+1) = cos(pos_order / 10000^(2i/d_model))
PEH(pos_head, 2i) = sin(pos_head / 10000^(2i/d_model))
PEH(pos_head, 2i+1) = cos(pos_head / 10000^(2i/d_model))

where pos_order denotes the word-order index of the endangered language word within the sentence, pos_head the head-word index of the current endangered language word, d_model the dimensionality of the position vector, and i a dimension of the position vector.
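Assuming PEO and PEH take the standard Transformer sinusoidal position-encoding form (the usual choice for such encodings; an assumption here), a single index can be encoded as follows, with d_model and the sample indices purely illustrative:

```python
import math

def position_encoding(pos: int, d_model: int):
    """Sinusoidal encoding of one index (word-order or head-word index):
    even dimensions use sin, odd dimensions use cos, with wavelengths
    10000^(2i/d_model)."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe

peo = position_encoding(0, 4)  # word-order index 0 -> [0.0, 1.0, 0.0, 1.0]
peh = position_encoding(2, 4)  # head-word index 2
```

The same function serves both PEO and PEH; only the index fed in differs (word position versus head position).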
C3. The TuSynTRM model uses the conventional attention mechanism to extract information from the endangered language and Chinese and to model the relation between the two languages.
The input endangered language word vector sequence H is mapped into three matrices, denoted Q, K, and V; after matrix multiplication, scaling, and masking, the correlation matrix is obtained and then normalized with the Softmax function; finally the Q-K similarities serve as weights in a sum over V, giving the self-attention values. The relevant formulas are:

Q = W^q H
K = W^k H
V = W^v H

Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V

where V is the matrix of input features; Q and K are the feature matrices used to compute the self-attention weights; K^T is the transpose of K; d_k is the number of columns (i.e. the vector dimensionality) of Q and K; D_h denotes the length of the input endangered language word vectors; W^q, W^k, W^v are projection matrices; and Q K^T / sqrt(d_k) measures the similarity of Q and K.
多头注意力由多个自注意力组成,计算过程首先将Q、K、V通过线性变换映射为T个子集,如下公式所示:Multi-head attention consists of multiple self-attentions. The calculation process first maps Q, K, and V into T subsets through linear transformation, as shown in the following formula:
where t denotes the t-th head and W_t^Q, W_t^K, W_t^V are all parameter matrices.
Self-attention is then applied to each head separately, giving the single-head output head_t:
head_t = Attention(Q_t, K_t, V_t)
Finally, the outputs of the T heads are concatenated and linearly transformed, yielding the output of multi-head attention:
MultiHead(Q, K, V) = Concat(head_1, …, head_T) W_c
where W_c is a weight matrix, MultiHead(Q, K, V) is the multi-head attention output, and Concat(head_1, …, head_T) denotes the concatenation of the T head outputs, which W_c then linearly transforms.
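The multi-head computation above can be sketched as follows; the random matrices stand in for the learned parameters W_t^Q, W_t^K, W_t^V, and W_c, whose values the patent does not specify:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(Q, K, V, T):
    """Split d_model across T heads, attend per head, then Concat(...) @ Wc."""
    n, d_model = Q.shape
    d_k = d_model // T
    heads = []
    for t in range(T):
        # illustrative per-head projections W_t^Q, W_t^K, W_t^V
        Wq = rng.standard_normal((d_model, d_k))
        Wk = rng.standard_normal((d_model, d_k))
        Wv = rng.standard_normal((d_model, d_k))
        Qt, Kt, Vt = Q @ Wq, K @ Wk, V @ Wv
        scores = Qt @ Kt.T / np.sqrt(d_k)
        heads.append(softmax(scores) @ Vt)   # head_t = Attention(Qt, Kt, Vt)
    Wc = rng.standard_normal((T * d_k, d_model))
    return np.concatenate(heads, axis=-1) @ Wc  # Concat(head_1..head_T) Wc
```

In a trained model the projections are learned parameters shared across calls rather than freshly sampled; they are drawn randomly here only to keep the sketch self-contained.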
The invention is described further below, step by step, with a concrete example:
1. Taking the northern dialect of Tujia as the object of study, a dependency-structure treebank is built semi-automatically.
1.1 A total of 6,438 Tujia sentences are exported through the ELAN speech-annotation software as three lines of text per sentence: the International Phonetic Alphabet (IPA) transcription, a word-for-word Chinese gloss, and a free Chinese translation.
1.2 Data preprocessing
1.2.1 After duplicate sentences are removed, 6,023 sentences remain. All punctuation marks in the IPA tier and the Chinese-gloss tier are then replaced with spaces.
1.2.2 The deduplicated, punctuation-free Tujia-Chinese parallel sentence pairs are automatically segmented by machine using the space character as delimiter, yielding 5,102 pairs that are fully aligned after automatic segmentation. Sentences that cannot be fully aligned automatically are re-segmented and aligned by hand; merging the manually adjusted sentences with the automatically segmented ones gives a total of 6,023 Tujia-Chinese parallel sentence pairs.
1.2.3 Tujia function words are tagged with English abbreviation symbols according to their grammatical meaning; in the Chinese gloss, each function-word abbreviation is replaced with the corresponding Chinese word according to context.
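The preprocessing in steps 1.2.1-1.2.2 can be sketched as below; the punctuation set and the routing of misaligned pairs to manual review are assumptions, since the patent does not enumerate them:

```python
import re

# assumed punctuation set (full-width and ASCII)
PUNCT = r"[，。！？、,.!?]"

def preprocess(pairs):
    """pairs: list of (tujia_ipa_line, chinese_gloss_line).
    Dedupe, replace punctuation with spaces, space-tokenize, and keep
    only pairs whose token counts align; the rest go to manual review."""
    seen, aligned, to_review = set(), [], []
    for src, tgt in pairs:
        src = re.sub(PUNCT, " ", src)
        tgt = re.sub(PUNCT, " ", tgt)
        key = (src, tgt)
        if key in seen:            # drop duplicate corpus entries (step 1.2.1)
            continue
        seen.add(key)
        s_toks, t_toks = src.split(), tgt.split()
        if len(s_toks) == len(t_toks):   # fully aligned after auto-segmentation
            aligned.append((s_toks, t_toks))
        else:                            # needs re-segmentation by hand
            to_review.append((s_toks, t_toks))
    return aligned, to_review
```

After manual alignment, the reviewed pairs would be merged back with the automatically aligned ones, as described in step 1.2.2.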
1.3 To describe more accurately the part of speech of each word in a segmented Tujia sentence, 36 Tujia part-of-speech tags were designed with reference to the part-of-speech tagging guidelines of the jieba segmenter and the existing Tujia corpus.
1.4 An annotation table for the Tujia dependency-structure treebank was designed, covering 14 types of Tujia dependency relations, each with a description and an annotation example; in the Tujia examples, the dependency between words is indicated by setting the text in bold.
1.5 Each Tujia sentence consists of the IPA transcriptions of one or more words, and each Tujia word is represented by 10 fields:
(1) ID: index of the Tujia word's IPA transcription; each new sentence restarts numbering at 1.
(2) FORM: the IPA transcription of the Tujia word.
(3) LEMMA: the Tujia root morpheme, replaced here by "-".
(4) UPOSTAG: the Tujia part-of-speech tag.
(5) XPOSTAG: a language-specific part-of-speech tag, replaced here by "-".
(6) FEATS: Tujia lexical or grammatical features, replaced here by "-".
(7) HEAD: the index of the current Tujia word's head word, either an ID value or 0 (the root node).
(8) DEPREL: the Tujia dependency relation.
(9) DEPS: secondary dependency relations, replaced here by "-".
(10) MISC: the Chinese annotation corresponding to the Tujia word's IPA transcription.
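A sketch of how one 10-field entry could be serialized in the CoNLL-U layout above; the forms, tags, and glosses in the example are invented for illustration:

```python
def conllu_token(tid, form, upos, head, deprel, misc):
    """One 10-field CoNLL-U line; unused fields (LEMMA, XPOS, FEATS, DEPS)
    are written as "-", as in the treebank described above."""
    fields = [str(tid), form, "-", upos, "-", "-", str(head), deprel, "-", misc]
    return "\t".join(fields)

# hypothetical two-word sentence: word 2 is the root, word 1 depends on it
sent = [conllu_token(1, "a55", "n", 2, "subj", "example-gloss"),
        conllu_token(2, "la21", "v", 0, "root", "example-gloss")]
```

A script of this shape, iterating over the automatically tagged corpus, would produce the treebank file described in the next step.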
The automatically assigned part-of-speech and dependency annotations are collected, and a Python script generates the Tujia dependency-structure treebank in the CoNLL-U format described above, containing 6,023 corpus entries in total.
1.6 Manual verification, organized in three parts: verification of the Chinese annotations, of the part-of-speech tags, and of the dependency relations.
2. Construct TuParser, a Tujia dependency-parsing model based on a biaffine classifier.
2.1 The embedding layer concatenates the language-representation vectors with the feature vectors.
2.2 The encoding layer applies three stacked BiLSTMs to contextually encode the vectors produced by the embedding layer.
2.3 In the feature dimensionality-reduction layer, an MLP network applies a nonlinear transformation to the encoder output vectors.
2.4 The dimensionality-reduced syntactic feature vectors are fed into the biaffine parsing layer, where a biaffine attention mechanism is used in both the dependency-arc classifier and the dependency-relation classifier, producing arc-representation scores between nodes and relation-classification scores between nodes.
2.5 The decoding layer applies the first-order Eisner algorithm, designed for dependency parsing, to obtain the dependency structure of the complete Tujia sentence.
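The biaffine arc scoring of step 2.4 can be sketched as below, following the common bilinear-plus-bias form; the exact parameterization used by TuParser is not given in the text, so shapes and parameters here are assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def biaffine_scores(H_dep, H_head, U, W, b):
    """Biaffine arc scoring: s(i, j) = h_dep_i^T U h_head_j + w^T h_head_j + b,
    giving a score for attaching dependent i to candidate head j."""
    bilinear = H_dep @ U @ H_head.T            # bilinear term per (dep, head) pair
    head_bias = (H_head @ W).reshape(1, -1)    # bias depending only on the head
    return bilinear + head_bias + b            # (n_dep, n_head) score matrix

n, d = 4, 8
H = rng.standard_normal((n, d))  # MLP-reduced syntactic feature vectors
S = biaffine_scores(H, H, rng.standard_normal((d, d)),
                    rng.standard_normal(d), 0.0)
```

The Eisner decoder of step 2.5 would then search this score matrix for the highest-scoring projective dependency tree.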
3. Construct a Tujia-Chinese neural machine translation model that incorporates syntactic information.
3.1 The dependency-parsing model TuParser determines whether a dependency exists between words in an endangered-language sentence and, if so, its relation type.
3.2 Feature embedding and position encoding incorporating syntactic information: the part-of-speech tags and the dependency-relation tags of the endangered language are extracted as part of the input feature embedding.
3.3 The word-order index and the head-word index of the endangered language are treated as positional information, enabling the TuSynTRM model to additionally learn the syntax of the endangered language.
3.4 The TuSynTRM model uses the conventional attention mechanism to extract information from the endangered language and from Chinese, and to model the relationship between the two languages.
The operations above yield an endangered-language translation model that incorporates syntactic information. The model addresses the problems that manual annotation of endangered-language corpora is time-consuming, labor-intensive, and demands substantial expert knowledge; that the available data are scarce; and that conventional neural machine translation methods perform poorly in this setting.
Finally, it should be noted that the embodiments are published to aid further understanding of the invention. Those skilled in the art will appreciate that various substitutions and modifications are possible without departing from the scope of the invention and the appended claims. The invention is therefore not limited to what the embodiments disclose; the scope of protection is defined by the claims.
Claims (6)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310960646.3A CN116956944A (en) | 2023-08-01 | 2023-08-01 | An endangered language translation model method integrating syntactic information |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116956944A true CN116956944A (en) | 2023-10-27 |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118378091A (en) * | 2024-06-19 | 2024-07-23 | 之江实验室 | Method and system for constructing standard data set and baseline model for astronomical light red shift measurement |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||