WO2022057116A1 - Method for translating multilingual place name roots into Chinese based on a Transformer deep learning model - Google Patents

Method for translating multilingual place name roots into Chinese based on a Transformer deep learning model

Info

Publication number
WO2022057116A1
WO2022057116A1 PCT/CN2020/136009 CN2020136009W WO2022057116A1 WO 2022057116 A1 WO2022057116 A1 WO 2022057116A1 CN 2020136009 W CN2020136009 W CN 2020136009W WO 2022057116 A1 WO2022057116 A1 WO 2022057116A1
Authority
WO
WIPO (PCT)
Prior art keywords
place name
root
chinese
language
place
Prior art date
Application number
PCT/CN2020/136009
Other languages
English (en)
French (fr)
Inventor
张雪英
赵文强
吴恪涵
Original Assignee
南京文图景信息科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 南京文图景信息科技有限公司
Priority to JP2021528844A priority Critical patent/JP2022552029A/ja
Publication of WO2022057116A1 publication Critical patent/WO2022057116A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/58Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • G06F16/90335Query processing
    • G06F16/90344Query processing by using string matching techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Definitions

  • The invention relates to the field of machine translation, in particular to a method for translating the roots of English, French and German place names into Chinese based on a Transformer deep learning model.
  • Neural machine translation usually adopts an encoder-decoder framework to achieve end-to-end translation between natural languages, and the Transformer model is among the best of the many neural machine translation models.
  • The most significant difference between the Transformer model and other neural machine translation models is that it relies entirely on the attention mechanism, abandoning the recurrent and convolutional neural networks used in traditional neural machine translation models; this largely alleviates the problems of vanishing and exploding gradients, improves the model's capacity for parallel computation, and shortens training time.
  • The object of the present invention is to address the limitations and deficiencies of existing translation systems when translating foreign place names into Chinese, and to provide a Transformer-based method for translating multilingual place name roots, so as to obtain efficient and reasonable Chinese translations of English, French and German place names.
  • To solve the above problems, the present invention is implemented through the following steps:
  • Step 1: First, preprocess the original foreign-language place name corpus and the corresponding Chinese translation corpus;
  • Step 2: Then, identify the language of the input foreign place name on the basis of a place name language rule knowledge base, built from collected and organized rules on the place names and linguistic features of each language, combined with the linguistic features of the foreign place name;
  • Step 3: According to the identified language of the foreign place name, select the corresponding place name root extraction rule from the place name root extraction rule base and extract the root part of the foreign place name; use the Chinese place name root extraction rule to extract the root part of the corresponding Chinese translation;
  • Step 4: Convert the root text of the foreign place name and of the corresponding Chinese translation into character sets, and obtain the character vector of each foreign and Chinese character using one-hot encoding together with a character embedding model built from a shallow feedforward neural network;
  • Step 5: Train and fine-tune the Transformer model, adjusting the values of seven hyperparameters (word embedding layer output dimension, number of encoder layers, number of self-attention heads, feedforward network output dimension, batch size, number of pre-training iterations and dropout probability) on the basis of the BLEU (Bilingual Evaluation Understudy) score, so that the Transformer model achieves the highest BLEU score on the translations of the test set;
  • Step 6: Extract the root part of the place name to be translated according to steps 1, 2 and 3, convert the extraction result into character vectors, input them into the trained and fine-tuned Transformer model, and output the corresponding Chinese translation of the root.
  • The above-mentioned preprocessing includes removal of special characters from place names, expansion of abbreviated parts of foreign place names, unified lowercasing of foreign place names, and diacritic replacement.
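  • As an illustration of these four preprocessing operations, the following Python sketch applies them to a single place name; the special-character pattern and the abbreviation table are small hypothetical stand-ins for the libraries that the method actually builds.

```python
import re
import unicodedata

# Hypothetical example resources; the patent's actual special-character,
# abbreviation and diacritic libraries are not published.
ABBREVIATIONS = {"st": "saint", "mt": "mount", "ft": "fort"}
SPECIAL_CHARS = r"[#$./\-]"

def preprocess_place_name(name: str) -> str:
    """Strip special characters, expand abbreviations, lowercase, replace diacritics."""
    name = re.sub(SPECIAL_CHARS, " ", name)                      # remove special characters
    tokens = [ABBREVIATIONS.get(t.lower().rstrip("."), t) for t in name.split()]
    name = " ".join(tokens).lower()                              # unified lowercasing
    name = unicodedata.normalize("NFKD", name)                   # e.g. "café" -> "cafe"
    return "".join(c for c in name if not unicodedata.combining(c))

print(preprocess_place_name("Café du Mt.-Blanc"))                # -> "cafe du mount blanc"
```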
  • The present invention builds a basic knowledge base of place name language rules by collecting and summarizing words that occur frequently in English, French and German place names and that clearly distinguish the three languages.
  • The present invention can further expand this base with the common personal names and place names in English, French and German summarized in third-party knowledge bases, establishing a place name language rule knowledge base that assists the language identification of place names.
  • The above-mentioned place name root extraction includes elimination of generic place name terms and of words that play a transitional role in place names: a place name elimination lexicon is built to store the collected generic terms commonly used in foreign and Chinese place names together with the transitional words; after word segmentation of the preprocessed foreign and Chinese place names, each segment is compared against the elimination lexicon through an index and only the unmatched segments are retained, yielding the roots of the foreign and Chinese place names.
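  • A minimal sketch of this lexicon-based root extraction is shown below; the elimination lexicon here is a tiny hypothetical stand-in for the full place name elimination thesaurus described above.

```python
# Tokens matching the elimination lexicon (generic terms, directional and
# connective words) are dropped; the remaining tokens form the place name root.
ELIMINATION_LEXICON = {
    "fire", "department", "river", "lake", "county",
    "north", "south", "of", "on", "upon",
}

def extract_root(place_name: str) -> str:
    tokens = place_name.split()                                  # word segmentation
    kept = [t for t in tokens if t not in ELIMINATION_LEXICON]   # keep unmatched segments
    return " ".join(kept)

print(extract_root("hazardville fire department"))               # -> "hazardville"
```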
  • Converting the extraction result into character vectors means building a shallow feedforward neural network that converts the one-hot encoded place name root characters into character vectors.
  • The above-mentioned fine-tuning of the Transformer model determines, by the controlled variable method, the locally optimal values of the seven hyperparameters: word embedding layer output dimension, number of encoder layers, number of self-attention heads, feedforward network output dimension, batch size, number of pre-training iterations and dropout probability.
  • Keeping the other hyperparameters fixed, the value of one hyperparameter is varied; after training, the BLEU scores on the test set of the models with different values of that hyperparameter are evaluated, so as to determine its best value within the candidate range.
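  • The controlled-variable search described above might be organized as in the following sketch; train_transformer() and evaluate_bleu() are hypothetical stand-ins for the actual training and BLEU-scoring code, and the baseline values are illustrative only.

```python
import random

BASE_HPARAMS = {
    "embed_dim": 256, "num_layers": 6, "num_heads": 8,
    "ffn_dim": 1024, "batch_size": 64, "pretrain_steps": 0, "dropout": 0.1,
}

def train_transformer(train_set, **hparams):
    # Hypothetical stand-in for the real training routine; returns a dummy model.
    return {"hparams": hparams}

def evaluate_bleu(model, test_set) -> float:
    # Hypothetical stand-in for BLEU scoring; returns a random score for the demo.
    return random.uniform(0, 100)

def tune_one(name, candidates, train_set, test_set):
    """Vary a single hyperparameter while the other six stay fixed,
    and keep the value with the highest test BLEU."""
    best_value, best_bleu = None, -1.0
    for value in candidates:
        hparams = dict(BASE_HPARAMS, **{name: value})
        model = train_transformer(train_set, **hparams)
        bleu = evaluate_bleu(model, test_set)
        if bleu > best_bleu:
            best_value, best_bleu = value, bleu
    return best_value, best_bleu

# e.g. the sweep over the number of attention heads reported in the embodiment:
# tune_one("num_heads", [8, 32, 128, 256], train_set, test_set)
```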
  • The above-mentioned number of model training iterations is not less than 50,000.
  • the present invention has the following beneficial technical effects:
  • The present invention focuses on end-to-end translation between foreign place name roots and Chinese place name roots: root extraction from foreign and Chinese place names is achieved with a knowledge-base-driven method, and the extraction results are converted by a character embedding model into character sets.
  • These character sets, fed to the Transformer model in this particular form, expand the contextual dependencies of the place name sequence and thereby yield better translations of place name roots.
  • The present invention collects and organizes the features of the foreign languages involved, of the place names in the corresponding languages, and of personal names, converts these features into rules, and builds a knowledge base of place name language rules. The constructed knowledge base is used to identify the language of the input foreign place name, thereby reducing the dependence on manual work.
  • The present invention collects and organizes the components of the foreign place names involved, classifies each component, converts their occurrence patterns into rules, and constructs a place name root extraction rule base.
  • The constructed place name root extraction rule base is used to extract the root part of the input foreign place names, thereby significantly improving the translation efficiency of place name roots.
  • Fig. 1 is the flow chart of the method of the present invention for translating foreign place name roots into Chinese;
  • Fig. 2 is the flow chart of obtaining the root character vectors in the present invention;
  • Fig. 3 is the architecture diagram of the Transformer model involved in the present invention;
  • Fig. 4 is the flow chart of the multi-head attention computation in the Transformer model involved in the present invention.
  • The method for translating multilingual place name roots into Chinese based on the Transformer deep learning model includes the following steps:
  • The foreign place name corpus is uniformly lowercased and diacritics are replaced; for example, "New York" and "new york", or "cafe" and "café", point to the same place name.
  • Lowercasing and character replacement based on a diacritic replacement lexicon unify the format of the foreign place name corpus.
  • The Transformer model is trained and fine-tuned as the model for translating foreign place name roots into Chinese; sample training corpora are shown in Table 3.
  • The data required for model training consist of foreign place name roots and the corresponding Chinese place name roots, divided into training, validation and test sets at a ratio of 7:2:1.
  • The training set provides the data required for model training.
  • The validation set is used to judge the performance of the model after a fixed number of training iterations, and can effectively indicate whether the model is overfitting or underfitting.
  • The test set is used to judge whether the trained model meets the requirements.
  • The main body of the Transformer model is composed of an encoder and a decoder.
  • The inputs of the encoder and decoder are the character vectors of the foreign place name and of the corresponding Chinese place name respectively, and the dimension of the character vectors is controlled by the output dimension of the word embedding layer.
  • Before entering the encoder or decoder, each character vector first undergoes a positional encoding step in which a matrix M_pe of the same dimension is added to it.
  • The calculation formula is:
  • EncoderInput = V_ei (or V_ci) + M_pe
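  • The patent does not specify the form of the position matrix M_pe; the sketch below assumes the sinusoidal encoding of the original Transformer as one possible construction (dim is assumed even).

```python
import torch

def sinusoidal_position_encoding(seq_len: int, dim: int) -> torch.Tensor:
    """Build a position matrix M_pe that is added to the character vectors."""
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32)
                         * (-torch.log(torch.tensor(10000.0)) / dim))
    m_pe = torch.zeros(seq_len, dim)
    m_pe[:, 0::2] = torch.sin(position * div_term)
    m_pe[:, 1::2] = torch.cos(position * div_term)
    return m_pe

# encoder_input = char_vectors + sinusoidal_position_encoding(*char_vectors.shape)
```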
  • After the position-encoded character vectors enter the encoder, the self-attention mechanism is applied: the character vectors are multiplied by the matrices W_q, W_k and W_v to obtain the query matrix Q, the key matrix K and the value matrix V, and the output Z of the self-attention mechanism is calculated as Z = softmax(Q·K^T / √d_k)·V, where d_k is the dimension of the character vectors.
  • The output of the multi-head self-attention mechanism concatenates the outputs of all n attention heads (n is set by the attention-head hyperparameter) and multiplies the result by the matrix W_o: MultiHead(Z1, Z2, ..., Zn) = Concat(Z1, Z2, ..., Zn)·W_o
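  • As a minimal illustration of these two formulas, the following sketch computes Z = softmax(Q·K^T / √d_k)·V for a single head and concatenates several heads with the projection W_o; the tensor shapes and weights are illustrative assumptions, not the patent's actual implementation.

```python
import math
import torch

def self_attention(x, w_q, w_k, w_v):
    """Z = softmax(Q.K^T / sqrt(d_k)).V for one head.
    x: (seq_len, d_k) character vectors; w_q/w_k/w_v: (d_k, d_k) weight matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / math.sqrt(k.size(-1))
    return torch.softmax(scores, dim=-1) @ v

def multi_head(x, heads, w_o):
    """Concatenate the outputs Z1..Zn of n heads and project with W_o."""
    z = torch.cat([self_attention(x, *head) for head in heads], dim=-1)
    return z @ w_o

# Toy usage: 2 heads over a 5-character sequence of dimension 8.
x = torch.randn(5, 8)
heads = [tuple(torch.randn(8, 8) for _ in range(3)) for _ in range(2)]
out = multi_head(x, heads, torch.randn(16, 8))   # shape: (5, 8)
```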
  • Before the output of the multi-head self-attention mechanism enters the feedforward neural network, the model applies a residual connection to it, combining the input information of the encoder with the output of the multi-head self-attention mechanism.
  • The specific calculation formula is: Z1, Z2, ..., Zn = LayerNorm(MultiHead(Z1, Z2, ..., Zn) + EncoderInput)
  • LayerNorm is a normalization operation.
  • After the residual connection and normalization, Z1, Z2, ..., Zn are used as the input of the feedforward neural network, whose output dimension is controlled by the feedforward network output dimension hyperparameter.
  • The output of the feedforward neural network requires another residual connection and normalization before it can be passed to the next coding layer; in this second operation, the feedforward output is added to the Z1, Z2, ..., Zn obtained after the first residual connection and normalization.
  • Each subsequent coding layer performs the same operations, and the number of coding layers is controlled by the encoder/decoder layer count.
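  • One coding layer of this kind can be sketched with standard PyTorch modules as follows; the dimensions are placeholder values standing in for the tuned hyperparameters, not the values used in the patent.

```python
import torch
from torch import nn

class EncoderLayer(nn.Module):
    """Multi-head self-attention, residual connection + LayerNorm,
    feedforward network, second residual connection + LayerNorm."""
    def __init__(self, embed_dim=256, num_heads=8, ffn_dim=1024, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads,
                                          dropout=dropout, batch_first=True)
        self.norm1 = nn.LayerNorm(embed_dim)
        self.ffn = nn.Sequential(nn.Linear(embed_dim, ffn_dim), nn.ReLU(),
                                 nn.Linear(ffn_dim, embed_dim))
        self.norm2 = nn.LayerNorm(embed_dim)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)        # multi-head self-attention
        x = self.norm1(x + attn_out)            # residual connection + LayerNorm
        return self.norm2(x + self.ffn(x))      # feedforward, residual + LayerNorm

layer = EncoderLayer()
out = layer(torch.randn(1, 11, 256))            # e.g. the 11 characters of "hazardville"
```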
  • The operations in the encoder and decoder are roughly the same; the differences are that the input to the decoder is the character vectors of the Chinese place name root character set, and that each decoding layer adds an encoder-decoder attention mechanism compared with the encoding layer.
  • This mechanism combines the matrix output by the decoder with the multi-head attention output obtained in the encoding layer, fusing the latent features of the input and output.
  • The Transformer model builds a feedforward neural network layer and a softmax layer that operate on the output of the decoder.
  • The feedforward neural network layer maps the decoder output to a vector with the same dimension as the dictionary, the softmax layer converts the mapped vector into probabilities, the character with the maximum probability is taken as the output, and the final output of the model is composed of the output characters.
  • The batch size determines the amount of data in each batch after the training data are divided into batches.
  • The number of pre-training iterations determines how many times the model is pre-trained before formal training.
  • The dropout probability determines the proportion of neurons whose parameters are not updated during training.
  • As shown in Fig. 1, the method mainly consists of three parts: rule-based place name root extraction, character vector representation of place name roots, and training and fine-tuning of the Transformer model.
  • Place name root extraction module: the preprocessed source data "hazardville fire department" and the corresponding Chinese translation "哈扎德维尔消防局" are used as the input of the place name root extraction module.
  • The module first extracts the root part according to the place name splitting rules.
  • The roots extracted from the input place names are "hazardville" and "哈扎德维尔".
  • The splitting rules were summarized after analyzing the characteristics of English and Chinese place names. The English splitting rules, shown in Table 1, filter out place name prefix words, place name suffix words and place name special words; prefix words mainly comprise directional words.
  • Suffix words fall into three categories: generic terms for the natural environment, for administrative divisions and for points of interest; special words are the set of words that play a transitional or connective role in the word order of a place name.
  • The Chinese splitting rules, shown in Table 2, filter out Chinese prefix and suffix words, whose content is similar to that of the English prefix and suffix words.
  • The place name root data are first converted into a character set, and a shallow neural network built with the word-embedding layer in open-source PyTorch then converts the character-form place name data into a vector form that a computer can process.
  • The vectorization of "hazardville" through the shallow neural network is shown in Figure 2.
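  • The character-embedding step can be sketched as follows; the character vocabulary and embedding dimension are illustrative assumptions, and nn.Embedding plays the role of the shallow network built from the word-embedding layer (an index lookup that is equivalent to multiplying a one-hot vector by the embedding weights).

```python
import torch
from torch import nn

# Illustrative character vocabulary and embedding dimension.
char_vocab = {c: i for i, c in enumerate("abcdefghijklmnopqrstuvwxyz ")}
embedding = nn.Embedding(num_embeddings=len(char_vocab), embedding_dim=256)

indices = torch.tensor([char_vocab[c] for c in "hazardville"])
char_vectors = embedding(indices)        # one 256-dimensional vector per character
print(char_vectors.shape)                # torch.Size([11, 256])
```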
  • Table 3 Examples of corpora required for Transformer model training and fine-tuning
  • The locally optimal values of the other six parameters of the Transformer model, including the input dimension, the feedforward layer output dimension, the number of coding layers and the batch size, are obtained in the same way as above.
  • the input of the encoder and decoder in the Transformer model is the character vector of the English place name character set and the corresponding Chinese translation place name character set, respectively.
  • the specific architecture of the model is shown in Figure 3.
  • The character vectors are input to the encoder and decoder.
  • For positional encoding, a matrix M_pe of the same dimension is added to each character vector V_ci in the character set.
  • The calculation formula is: Input = V_ci + M_pe
  • After the position-encoded character vectors are input to the encoder and decoder, they are multiplied by the matrices W_q, W_k and W_v to obtain the query matrix Q, the key matrix K and the value matrix V, and the output Z of the self-attention mechanism is calculated as: Z = softmax(Q·K^T / √d_k)·V
  • d_k is the dimension of the character vectors.
  • The output of the multi-head self-attention mechanism concatenates the outputs of all self-attention heads and multiplies the result by the matrix W_o; the calculation formula is:
  • MultiHead(Z1, Z2, ..., Zn) = Concat(Z1, Z2, ..., Zn)·W_o
  • Before the output of the multi-head self-attention mechanism enters the feedforward neural network, the model applies a residual connection to it, combining the input information of the encoder or decoder with the output of the multi-head self-attention mechanism.
  • The calculation formula is: Z1, Z2, ..., Zn = LayerNorm(MultiHead(Z1, Z2, ..., Zn) + EncoderInput)
  • LayerNorm is a normalization operation; after the residual connection and normalization, Z1, Z2, ..., Zn are used as the input of the feedforward neural network, which thereby models the latent mapping between the source and the target language.
  • The output of the feedforward neural network requires another residual connection and normalization before it can be input to the next encoding or decoding layer.
  • In this second operation, the output of the feedforward neural network is added to the Z1, Z2, ..., Zn obtained after the first residual connection and normalization.
  • the specific calculation flow of the multi-head attention mechanism is shown in Figure 4.
  • The operations in the decoder are roughly the same as in the encoder, the difference being that each decoding layer adds an encoder-decoder attention mechanism, which combines the matrix output by the encoder with the multi-head attention output obtained in the decoding layer, fusing the latent features of the input and output.
  • The Transformer model builds a feedforward neural network layer and a softmax layer that operate on the output of the decoder.
  • The feedforward neural network layer maps the decoder output to a vector with the same dimension as the dictionary, the softmax layer converts the mapped vector into probabilities, and the character with the maximum probability is taken as the output.
  • The final output of the model is composed of the output characters. In this example, the final output of the model is "哈扎德维尔" (the Chinese transliteration of "Hazardville").
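  • A minimal sketch of this output head is given below; the vocabulary size and dimension are illustrative values, and a random tensor stands in for the decoder output purely for demonstration.

```python
import torch
from torch import nn

vocab_size, embed_dim = 4000, 256                      # illustrative values
output_head = nn.Linear(embed_dim, vocab_size)         # maps to dictionary dimension

decoder_output = torch.randn(5, embed_dim)             # 5 output positions (dummy data)
probs = torch.softmax(output_head(decoder_output), dim=-1)
predicted_chars = probs.argmax(dim=-1)                 # most probable character id per position
```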

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

A method for translating multilingual place names into Chinese based on a Transformer model, covering English, French and German: the language of an input place name to be translated is identified on the basis of a place name language knowledge base combined with the linguistic features of the place name, and the place name root extraction rule corresponding to that language is selected from the place name root extraction rule base to extract the root of the place name to be translated; the extracted root text is converted into character vectors by a character embedding model; the character vectors of the root to be translated are fed into a Transformer model obtained by training and fine-tuning on English, French and German place name roots paired with the corresponding Chinese root translations, and the final Chinese translation of the root is obtained. The resulting Chinese translations of English, French and German place name roots are highly readable, conform to Chinese reading habits, satisfy the need for Chinese translation of multilingual place name roots to a considerable extent, and offer good flexibility and general applicability.

Description

Method for translating multilingual place name roots into Chinese based on a Transformer deep learning model

Technical Field
The present invention relates to the field of machine translation, and in particular to a method for translating the roots of English, French and German place names into Chinese based on a Transformer deep learning model.
Background Art
[Corrected under Rule 91, 09.06.2021]
As an indispensable form of basic geographic information and public social information, place names are an important bridge linking all kinds of social information and play an important role in national and social administration, economic development, cultural construction, national defense and diplomacy. The large number of foreign place names encountered in economic exchange urgently calls for a method that can translate foreign place names reasonably.
In recent years, research on neural machine translation has developed rapidly and has achieved a marked improvement in translation quality over statistical machine translation. Neural machine translation usually adopts an encoder-decoder framework to achieve end-to-end translation between natural languages, and the Transformer model is among the best of the many neural machine translation models. The most significant difference between the Transformer model and other neural machine translation models is that it relies entirely on the attention mechanism, abandoning the recurrent and convolutional neural networks used in traditional neural machine translation models; this largely alleviates the problems of vanishing and exploding gradients, improves the model's capacity for parallel computation, and shortens training time.
At present, high-tech companies including Google, Microsoft and Baidu have all launched widely praised translation products, but when translating foreign place names these products misapply free translation and transliteration, so that a foreign place name may be rendered as an adjective or other special noun, and the word order of the Chinese result can be confused and inconsistent with Chinese usage. How to translate foreign place names reasonably and efficiently is therefore an urgent problem to be solved.
Summary of the Invention
The object of the present invention is to address the limitations and deficiencies of existing translation systems when translating foreign place names into Chinese, and to provide a Transformer-based method for translating multilingual place name roots into Chinese, so as to obtain efficient and reasonable Chinese translations of English, French and German place names.
To solve the above problems, the present invention is implemented through the following steps:
Step 1: preprocess the original foreign place name corpus and the corresponding Chinese translation corpus;
Step 2: identify the language of the input foreign place name on the basis of a place name language rule knowledge base, built from collected and organized rules on the place names and linguistic features of each language, combined with the linguistic features of the foreign place name;
Step 3: according to the identified language of the foreign place name, select the corresponding place name root extraction rule from the place name root extraction rule base and extract the root part of the foreign place name; use the Chinese place name root extraction rule to extract the root part of the corresponding Chinese translation;
Step 4: convert the root text of the foreign place name and of the corresponding Chinese translation into character sets, and obtain the character vector of each foreign and Chinese character using one-hot encoding together with a character embedding model built from a shallow feedforward neural network;
Step 5: train and fine-tune the Transformer model, adjusting the values of seven hyperparameters (word embedding layer output dimension, number of encoder layers, number of self-attention heads, feedforward network output dimension, batch size, number of pre-training iterations and dropout probability) on the basis of the BLEU (Bilingual Evaluation Understudy) score, so that the Transformer model achieves the highest BLEU score on the test set;
Step 6: extract the root part of the place name to be translated according to steps 1, 2 and 3, convert the extraction result into character vectors, input them into the trained and fine-tuned Transformer model, and output the corresponding Chinese translation of the root.
Preferably, the preprocessing includes removal of special characters from place names, expansion of abbreviated parts of foreign place names, unified lowercasing of foreign place names, and diacritic replacement.
To implement the removal of special characters, the expansion of abbreviations and the unified lowercasing of foreign place names, a special character library, an abbreviation-to-full-name mapping library and a diacritic replacement mapping library are built, and the place name strings are traversed on the basis of these knowledge bases.
Preferably, the present invention builds a basic place name language rule knowledge base by collecting and summarizing words that occur frequently in English, French and German place names and that clearly distinguish the three languages.
Preferably, on the basis of the basic place name language rule knowledge base, the present invention can be further expanded with the common personal names and place names in English, French and German summarized in third-party knowledge bases, establishing a place name language rule knowledge base that assists the language identification of place names.
The place name root extraction includes elimination of generic place name terms and of words that play a transitional role in place names: a place name elimination lexicon is built to store the collected generic terms commonly used in foreign and Chinese place names together with the transitional words; after word segmentation of the preprocessed foreign and Chinese place names, each segment is compared against the elimination lexicon through an index and only the unmatched segments are retained, yielding the roots of the foreign and Chinese place names.
In step 6, converting the extraction result into character vectors means building a shallow feedforward neural network that converts the one-hot encoded place name root characters into character vectors.
Fine-tuning the Transformer model means setting up controlled experiments, by the controlled variable method, to determine the locally optimal values of the seven hyperparameters: word embedding layer output dimension, number of encoder layers, number of self-attention heads, feedforward network output dimension, batch size, number of pre-training iterations and dropout probability.
Keeping the other hyperparameters fixed, the value of one of the seven hyperparameters is changed; after training, the BLEU scores on the test set of the models with different values of that hyperparameter are evaluated, so as to determine its optimal value within the candidate range.
Preferably, the number of model training iterations is not less than 50,000.
Compared with the prior art, the present invention has the following beneficial technical effects:
1. The present invention focuses on end-to-end translation between foreign place name roots and Chinese place name roots. Root extraction from foreign and Chinese place names is achieved with a knowledge-base-driven method, and the extraction results are further converted into character sets through a character embedding model; feeding the Transformer model with this particular character-set representation expands the contextual dependencies of the place name sequence and yields better translations of place name roots.
2. The present invention collects and organizes the features of the foreign languages involved, of the place names in the corresponding languages, and of personal names, converts these features into rules, and builds a place name language rule knowledge base. The constructed knowledge base is used to identify the language of an input foreign place name, reducing the dependence on manual work.
3. The present invention collects and organizes the components of the foreign place names involved, classifies each component, converts their occurrence patterns into rules, and builds a place name root extraction rule base. The constructed rule base is used to extract the root part of the input foreign place names, significantly improving the translation efficiency of place name roots.
Brief Description of the Drawings
Fig. 1 is the flow chart of the method of the present invention for translating foreign place name roots into Chinese;
Fig. 2 is the flow chart of obtaining the root character vectors in the present invention;
Fig. 3 is the architecture diagram of the Transformer model involved in the present invention;
Fig. 4 is the flow chart of the multi-head attention computation in the Transformer model involved in the present invention.
Detailed Description of the Embodiments
The specific implementation of the present invention is described in detail below with reference to the accompanying drawings. The method for translating multilingual place name roots into Chinese based on the Transformer deep learning model comprises the following steps:
(1) Preprocess the original foreign place name corpus and the corresponding Chinese translation corpus: remove special characters from the foreign place name corpus; expand the abbreviated parts of the foreign place names according to rules; and apply lowercasing and diacritic replacement to the expanded corpus.
1) By building a special character library and using string matching, special characters such as "#$./-" that remain in the foreign place name corpus because of encoding conversion or incomplete data cleaning are removed.
2) Abbreviated forms in the foreign place names are converted into their full forms through abbreviation mapping rules.
3) The foreign place name corpus is uniformly lowercased and diacritics are replaced; for example, "New York" and "new york", or "cafe" and "café", refer to the same place name, and lowercasing together with character replacement based on a diacritic replacement lexicon unifies the format of the corpus.
(2) Using the place name language identification knowledge base obtained by induction and summarization, the language of the input place name is identified according to the key-value associations established in the knowledge base between words and source languages.
(3) According to the language of the place name to be translated, the place name root extraction rules for that language and the Chinese place name root extraction rules are selected to extract the roots from the preprocessed foreign place name corpus and Chinese translation corpus. The root extraction rules comprise two parts, rules for extracting the specific part of the place name and rules for eliminating generic terms and transitional words, which determine which parts of the input place name are removed and which are retained.
(4) Based on the root extraction results, the foreign and Chinese place name roots are converted into the corresponding character sets, and a character vector is constructed for each foreign and Chinese place name root character; the corresponding character vectors are denoted V_ei and V_ci respectively.
(5) The Transformer model is trained and fine-tuned as the model for translating foreign place name roots into Chinese; sample training corpora are shown in Table 3. The data required for training consist of foreign place name roots and the corresponding Chinese place name roots, split into training, validation and test sets at a ratio of 7:2:1. The training set provides the data for model training; the validation set is used to assess model performance after a fixed number of training iterations and effectively indicates whether the model is overfitting or underfitting; the test set is used to judge whether the trained model meets the requirements. During formal training of the Transformer model, one parameter (for example, the number of attention heads) is varied while the other parameters are held fixed, and the training and test BLEU scores of the model on the same data set under different values are compared to determine the locally optimal value of that parameter. Seven hyperparameters are fine-tuned in this way: the number of encoder/decoder layers, the number of attention heads, the word embedding layer output dimension, the feedforward network output dimension, the batch size, the number of pre-training iterations and the dropout probability.
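A minimal sketch of the 7:2:1 split described above is given below; the pairing format and the random seed are illustrative assumptions rather than details fixed by the method.

```python
import random

def split_corpus(pairs, seed=42):
    """Split (foreign_root, chinese_root) pairs into training, validation and
    test sets at the 7:2:1 ratio used in this embodiment."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    n = len(pairs)
    train_end, val_end = int(0.7 * n), int(0.9 * n)
    return pairs[:train_end], pairs[train_end:val_end], pairs[val_end:]

train_set, val_set, test_set = split_corpus([("union", "尤宁"), ("pelham", "佩勒姆")] * 50)
```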
The main body of the Transformer model consists of an encoder and a decoder. During training, the inputs of the encoder and decoder are the character vectors of the foreign place name and of the corresponding Chinese place name respectively, and the dimension of the character vectors is controlled by the word embedding layer output dimension. Before being fed into the encoder or decoder, each character vector first undergoes a positional encoding step in which a matrix M_pe of the same dimension is added to it, calculated as:
EncoderInput = V_ei (or V_ci) + M_pe
The processing inside the encoder is now described in detail. After the position-encoded character vectors enter the encoder, the self-attention mechanism is applied: the character vectors are multiplied by the matrices W_q, W_k and W_v to obtain the query matrix Q, the key matrix K and the value matrix V. The output Z of the self-attention mechanism is computed as:
Z = softmax(Q·K^T / √d_k)·V
where d_k is the dimension of the character vectors. The output of the multi-head self-attention mechanism is obtained by concatenating the outputs of all attention heads and multiplying the result by the matrix W_o, where the number of heads n is determined by the attention-head hyperparameter:
MultiHead(Z1, Z2, ..., Zn) = Concat(Z1, Z2, ..., Zn)·W_o
Before the output of the multi-head self-attention mechanism enters the feedforward neural network, the model applies a residual connection that combines the encoder input with the multi-head self-attention output:
Z1, Z2, ..., Zn = LayerNorm(MultiHead(Z1, Z2, ..., Zn) + EncoderInput)
where LayerNorm is a normalization operation. After the residual connection and normalization, Z1, Z2, ..., Zn serve as the input of the feedforward neural network, whose output dimension is controlled by the feedforward network output dimension hyperparameter. The output of the feedforward network undergoes another residual connection and normalization before being passed to the next encoding layer; in this second operation, the feedforward output is added to the Z1, Z2, ..., Zn obtained after the first residual connection and normalization. Each subsequent encoding layer performs the same operations, and the number of encoding layers is controlled by the encoder/decoder layer count.
The operations in the encoder and decoder are roughly the same; the differences are that the decoder input is the character vectors of the Chinese place name root character set, and that each decoding layer adds an encoder-decoder attention mechanism compared with the encoding layer, which combines the matrix produced by the decoder with the multi-head attention output obtained in the encoding layer, fusing the latent features of input and output.
The Transformer model builds a feedforward neural network layer and a softmax layer that operate on the output of the decoder. The feedforward layer maps the decoder output to a vector with the same dimension as the dictionary, and the softmax layer converts the mapped vector into probabilities; the character with the highest probability is taken as the output, and the final output of the model is composed of the output characters.
In addition to the hyperparameters related to the internal structure of the Transformer, three further hyperparameters are considered during fine-tuning: batch size, number of pre-training iterations and dropout probability. The batch size determines the amount of data in each batch after the training data are divided into batches, the number of pre-training iterations determines how many times the model is pre-trained before formal training, and the dropout probability determines the proportion of neurons whose parameters are not updated during training.
As shown in Fig. 1, the method mainly consists of the following three parts:
1. Rule-based place name root extraction;
2. Character vector representation of place name roots;
3. Training and fine-tuning of the Transformer model.
The English place name "Hazardville Fire Department" and its Chinese translation "哈扎德维尔消防局" are used as an example to describe the translation workflow for foreign place name roots in detail.
(1) Preprocessing of the source place name data
First, the English place name "Hazardville Fire Department" and the Chinese translation "哈扎德维尔消防局" are combined into a translation pair. Since "Hazardville Fire Department" contains no special characters, lowercasing turns it into "hazardville fire department".
(2) Rule-based place name root extraction
The preprocessed source data "hazardville fire department" and the corresponding Chinese translation "哈扎德维尔消防局" are the input of the root extraction module, which first extracts the root part according to the place name splitting rules; in this example the extracted roots are "hazardville" and "哈扎德维尔". The splitting rules were summarized after analyzing the characteristics of English and Chinese place names. The English splitting rules, shown in Table 1, filter out place name prefix words, suffix words and special words: prefix words mainly comprise directional words; suffix words fall into three categories, namely generic terms for the natural environment, for administrative divisions and for points of interest; special words are the set of words that play a transitional or connective role in the word order of a place name. The Chinese splitting rules, shown in Table 2, filter out Chinese prefix and suffix words, whose content is similar to that of the English prefix and suffix words.
Table 1: English place name splitting rules
(The contents of Table 1 are provided as an image in the original publication.)
Table 2: Chinese place name splitting rules
(The contents of Table 2 are provided as an image in the original publication.)
(3) Vectorization of place name roots
Based on the root extraction results, the root data are first converted into a character set, and a shallow neural network built with the word-embedding layer in open-source PyTorch then converts the character-form place name data into a vector form that a computer can process; the vectorization of "hazardville" through the shallow neural network is shown in Fig. 2.
(4) Training and fine-tuning the Transformer model
Specific examples of the corpora required for training and fine-tuning the Transformer model are shown in Table 3:
Table 3: Sample corpora for Transformer model training and fine-tuning
English source text / Standard reference translation
Union 尤宁
Pelham 佩勒姆
Saul 萨于勒
Donhead 唐黑德
St Mary 圣玛丽
Tuttington 塔廷顿
Mayflower 梅弗劳尔
Macclesfield 麦克尔斯菲尔德
During the actual training and fine-tuning of the Transformer model, taking the number of attention heads as an example, the controlled variable method is followed strictly: the other parameters of the model are held fixed and the number of attention heads is set to 8, 32, 128 and 256 in turn. After 50,000 training iterations, the BLEU score of each model on the test set is evaluated, and 256 attention heads is taken to be the locally optimal value; the detailed experimental results are shown in Table 4.
Table 4: BLEU scores of the model under different numbers of attention heads, other conditions unchanged
(The contents of Table 4 are provided as an image in the original publication.)
The locally optimal values of the other six parameters of the Transformer model, including the input dimension, the feedforward layer output dimension, the number of encoding layers and the batch size, are obtained in the same way.
The inputs of the encoder and decoder in the Transformer model are the character vectors of the English place name character set and of the corresponding Chinese place name character set respectively; the specific architecture of the model is shown in Fig. 3. Before entering the encoder and decoder, each character vector V_ci in the character set undergoes a positional encoding step in which a matrix M_pe of the same dimension is added:
Input = V_ci + M_pe
After the position-encoded character vectors enter the encoder and decoder, they are multiplied by the matrices W_q, W_k and W_v to obtain the query matrix Q, the key matrix K and the value matrix V, and the output Z of the self-attention mechanism is computed as:
Z = softmax(Q·K^T / √d_k)·V
where d_k is the dimension of the character vectors. The output of the multi-head self-attention mechanism concatenates the outputs of all attention heads and multiplies the result by the matrix W_o:
MultiHead(Z1, Z2, ..., Zn) = Concat(Z1, Z2, ..., Zn)·W_o
Before the output of the multi-head self-attention mechanism enters the feedforward neural network, the model applies a residual connection that combines the encoder or decoder input with the multi-head self-attention output:
Z1, Z2, ..., Zn = LayerNorm(MultiHead(Z1, Z2, ..., Zn) + EncoderInput)
where LayerNorm is a normalization operation. After the residual connection and normalization, Z1, Z2, ..., Zn serve as the input of the feedforward neural network, which thereby models the latent mapping between the source and target languages. The feedforward output undergoes another residual connection and normalization before entering the next encoding or decoding layer; in this second operation, the feedforward output is added to the Z1, Z2, ..., Zn obtained after the first residual connection and normalization. The detailed computation flow of the multi-head attention mechanism is shown in Fig. 4.
The operations in the decoder are roughly the same as in the encoder, the difference being that each decoding layer adds an encoder-decoder attention mechanism, which combines the matrix output by the encoder with the multi-head attention output obtained in the decoding layer, fusing the latent features of input and output.
On top of the decoder, the Transformer model builds a feedforward neural network layer and a softmax layer. The feedforward layer maps the decoder output to a vector with the same dimension as the dictionary, and the softmax layer converts the mapped vector into probabilities; the character with the highest probability is taken as the output, and the final output of the model is composed of the output characters. In this example, the final output of the model is "哈扎德维尔".
The foregoing describes only specific embodiments of the present invention, but the scope of protection of the present invention is not limited thereto; any modification or substitution that a person skilled in the art could conceive within the technical scope disclosed by the present invention shall fall within the scope of the present invention.

Claims (10)

  1. A method for translating multilingual place name roots into Chinese based on a Transformer deep learning model, characterized by comprising the following steps:
    Step 1: preprocessing the original foreign place name corpus and the corresponding Chinese translation corpus;
    Step 2: identifying the language of the input foreign place name on the basis of a place name language rule knowledge base, composed of rules obtained from the collected and organized place names and linguistic features of each language, combined with the linguistic features of the foreign place name;
    Step 3: according to the identified language of the foreign place name, selecting the corresponding place name root extraction rule from the place name root extraction rule base, extracting the root part of the foreign place name, and using the Chinese place name root extraction rule to extract the root part of the corresponding Chinese translation;
    Step 4: converting the root text of the foreign place name and of the corresponding Chinese translation into character sets, and obtaining the character vector of each foreign and Chinese character using one-hot encoding together with a character embedding model built from a shallow feedforward neural network;
    Step 5: training and fine-tuning the Transformer model, adjusting the values of seven hyperparameters, namely word embedding layer output dimension, number of encoder layers, number of self-attention heads, feedforward network output dimension, batch size, number of pre-training iterations and dropout probability, on the basis of the BLEU score, so that the Transformer model achieves the highest BLEU score on the test set;
    Step 6: extracting the root part of the place name to be translated according to steps 1, 2 and 3, converting the extraction result into character vectors, inputting them into the trained and fine-tuned Transformer model, and outputting the corresponding Chinese translation of the root.
  2. The method according to claim 1, characterized in that the preprocessing comprises removal of special characters from place names, expansion of abbreviated parts of foreign place names, unified lowercasing of foreign place names, and diacritic replacement.
  3. The method according to claim 2, characterized in that a special character library, an abbreviation-to-full-name mapping library and a diacritic replacement mapping library are built, and on the basis of these knowledge bases the place name strings are traversed to implement the removal of special characters, the expansion of abbreviations and the unified lowercasing of foreign place names.
  4. The method according to claim 1, characterized in that a basic place name language rule knowledge base is built by collecting and summarizing words that occur frequently in English, French and German place names and that clearly distinguish the three languages.
  5. The method according to claim 4, characterized in that, on the basis of the basic place name language rule knowledge base, common personal names and place names in English, French and German summarized in third-party knowledge bases can be used for further expansion, establishing a place name language rule knowledge base that assists the language identification of place names.
  6. The method according to claim 1, characterized in that the place name root extraction comprises elimination of generic place name terms and of words that play a transitional role in place names: a place name elimination lexicon storing the collected generic terms commonly used in foreign and Chinese place names and the transitional words is built; after the preprocessed foreign and Chinese place names are segmented, each segmentation result is compared with the elimination lexicon through an index, and only the unmatched segments are retained, thereby obtaining the roots of the foreign and Chinese place names.
  7. The method according to claim 1, characterized in that converting the extraction result into character vectors in step 6 means building a shallow feedforward neural network that converts the one-hot encoded place name root characters into character vectors.
  8. The method according to claim 1, characterized in that fine-tuning the Transformer model means determining, through controlled experiments set up by the controlled variable method, the locally optimal values of the seven hyperparameters: word embedding layer output dimension, number of encoder layers, number of self-attention heads, feedforward network output dimension, batch size, number of pre-training iterations and dropout probability.
  9. The method according to claim 8, characterized in that the other hyperparameters are held fixed while the value of one of the seven hyperparameters is changed; after training, the BLEU scores on the test set of the models with different values of that hyperparameter are evaluated, so as to determine the optimal value of that hyperparameter within its value range.
  10. The method according to claim 9, characterized in that the number of model training iterations is not less than 50,000.
PCT/CN2020/136009 2020-09-15 2020-12-14 一种基于Transformer深度学习模型的多语种地名词根汉译方法 WO2022057116A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2021528844A JP2022552029A (ja) 2020-09-15 2020-12-14 Transformerのディープラーニングモデルに基づいて多言語による地名の語根を中国語に翻訳する方法

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010967634.XA CN112084796B (zh) 2020-09-15 2020-09-15 一种基于Transformer深度学习模型的多语种地名词根汉译方法
CN202010967634.X 2020-09-15

Publications (1)

Publication Number Publication Date
WO2022057116A1 true WO2022057116A1 (zh) 2022-03-24

Family

ID=73737117

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/136009 WO2022057116A1 (zh) 2020-09-15 2020-12-14 一种基于Transformer深度学习模型的多语种地名词根汉译方法

Country Status (3)

Country Link
JP (1) JP2022552029A (zh)
CN (1) CN112084796B (zh)
WO (1) WO2022057116A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113239707A (zh) * 2021-03-01 2021-08-10 北京小米移动软件有限公司 Text translation method, text translation apparatus and storage medium
CN113393445B (zh) * 2021-06-21 2022-08-23 上海交通大学医学院附属新华医院 Breast cancer image determination method and system
AU2021104429A4 (en) * 2021-07-22 2021-09-16 Chinese Academy Of Surveying And Mapping Machine Translation Method for French Geographical Names

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0754522B2 (ja) * 1987-02-27 1995-06-07 日本電信電話株式会社 Natural language compound noun analysis and conversion method
JP6144458B2 (ja) * 2012-03-06 2017-06-07 日本放送協会 Sign language translation device and sign language translation program
CN108563640A (zh) * 2018-04-24 2018-09-21 中译语通科技股份有限公司 Neural network machine translation method and system for multiple language pairs
CN109829173B (zh) * 2019-01-21 2023-09-29 中国测绘科学研究院 English place name translation method and device
CN109902312B (zh) * 2019-03-01 2023-07-11 北京金山数字娱乐科技有限公司 Translation method and device, and training method and device for a translation model
CN110457715B (zh) * 2019-07-15 2022-12-13 昆明理工大学 Method for handling out-of-vocabulary words in Chinese-Vietnamese neural machine translation incorporating a classification dictionary
CN111008517A (zh) * 2019-10-30 2020-04-14 天津大学 Neural language model compression method based on tensor decomposition
CN111178091B (zh) * 2019-12-20 2023-05-09 沈阳雅译网络技术有限公司 Multi-dimensional Chinese-English bilingual data cleaning method
CN111209749A (zh) * 2020-01-02 2020-05-29 湖北大学 Method for applying deep learning to Chinese word segmentation
CN111368035A (zh) * 2020-03-03 2020-07-03 新疆大学 Neural-network-based mining system for a Chinese-Uyghur and Uyghur-Chinese organization name dictionary
CN111444343B (zh) * 2020-03-24 2021-04-06 昆明理工大学 Knowledge-representation-based classification method for cross-border ethnic culture texts
CN111581988B (zh) * 2020-05-09 2022-04-29 浙江大学 Training method and training system for a non-autoregressive machine translation model based on task-level curriculum learning

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140163951A1 (en) * 2012-12-07 2014-06-12 Xerox Corporation Hybrid adaptation of named entity recognition
CN104331401A (zh) * 2014-11-25 2015-02-04 中国农业银行股份有限公司 Translation method and system
CN111310456A (zh) * 2020-02-13 2020-06-19 支付宝(杭州)信息技术有限公司 Entity name matching method, apparatus and device
CN111222342A (zh) * 2020-04-15 2020-06-02 北京金山数字娱乐科技有限公司 Translation method and apparatus

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
AN, SUYALA: "Chinese-Mongolian Organization Name Translation Based on Transformer", JOURNAL OF CHINESE INFORMATION PROCESSING, vol. 34, no. 1, 1 January 2020 (2020-01-01), pages 58 - 62, XP055912721, ISSN: 1003-0077 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112101348A (zh) * 2020-08-28 2020-12-18 广州探迹科技有限公司 Multilingual end-to-end OCR algorithm and system
CN114897004A (zh) * 2022-04-15 2022-08-12 成都理工大学 Trapezoidal pile-up nuclear pulse identification method based on a deep learning Transformer model
CN114821257A (zh) * 2022-04-26 2022-07-29 中国科学院大学 Intelligent processing method, apparatus and device for video streams and natural language in navigation
CN114821257B (zh) * 2022-04-26 2024-04-05 中国科学院大学 Intelligent processing method, apparatus and device for video streams and natural language in navigation
CN114626363A (zh) * 2022-05-16 2022-06-14 天津大学 Translation-based cross-lingual phrase structure analysis method and device
CN117592462A (zh) * 2024-01-18 2024-02-23 航天宏图信息技术股份有限公司 Correlation processing method and device for open-source place name data based on geographic feature groups
CN117592462B (zh) * 2024-01-18 2024-04-16 航天宏图信息技术股份有限公司 Correlation processing method and device for open-source place name data based on geographic feature groups

Also Published As

Publication number Publication date
CN112084796A (zh) 2020-12-15
JP2022552029A (ja) 2022-12-15
CN112084796B (zh) 2021-04-09

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2021528844

Country of ref document: JP

Kind code of ref document: A

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20953971

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20953971

Country of ref document: EP

Kind code of ref document: A1

122 Ep: pct application non-entry in european phase

Ref document number: 20953971

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 12.09.2023)