WO2021243706A1 - Method and device for generating questions across languages (一种跨语言生成提问的方法和装置) - Google Patents

Method and device for generating questions across languages (一种跨语言生成提问的方法和装置)

Info

Publication number
WO2021243706A1
Authority
WO
WIPO (PCT)
Prior art keywords
question generation
language
generation model
answer
cross
Prior art date
Application number
PCT/CN2020/094677
Other languages
English (en)
French (fr)
Inventor
余建兴
王世祺
印鉴
Original Assignee
中山大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中山大学 filed Critical 中山大学
Priority to PCT/CN2020/094677 priority Critical patent/WO2021243706A1/zh
Publication of WO2021243706A1 publication Critical patent/WO2021243706A1/zh

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 - Querying
    • G06F 16/332 - Query formulation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 - Machine learning

Definitions

  • the present invention relates to the field of artificial intelligence, and more specifically, to a method and device for generating questions across languages.
  • the first type of method is to use a grammar or syntax analyzer to convert the text into an intermediate form, such as a grammar or syntax tree, and then use templates or rules to extract questions and answers from the intermediate form. Since the templates and rules are manually designed and the construction and update costs are high, the scalability and coverage of the model are very limited.
  • another type of method uses a sequence-to-sequence neural model to convert the text directly into a question. This conversion relies on the alignment between text and question learned from the training data. The sequence-to-sequence method is described in detail in the paper "D. Bahdanau, K. Cho and Y. Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate".
  • the model is completely data-driven and does not require manual definition of a large number of rules or templates.
  • the neural network model requires a large amount of manually labeled data for training; the performance of the model is significantly affected by the size of the labeled data.
  • Recent research has turned to the neural network model, that is, the neural network automatically learns the mapping relationship between the text and the question from the labeled data, and then uses the sequence-to-sequence model to generate the question.
  • however, the neural network model relies heavily on a large amount of manually labeled data, and its performance is directly affected by the size of that data. Because annotation is expensive, these models are difficult to deploy quickly in low-resource languages, that is, languages that lack annotation resources.
  • Hundreds of languages are in use worldwide. Only a few have abundant annotation resources, and most of the others have little or no annotated data. As a result, existing methods cannot be applied directly to low-resource languages, and it is difficult to build an effective question generation model for them directly.
  • neural cross-language question generation has mainly followed two directions. One is translation-based, which translates source-language labeled samples into the target language (or translates target-language test examples into the source language, predicts with the source-language model, and translates the result back); because the translator is chained to the question generation model rather than fused end to end, the accumulated errors limit overall performance. The other method is based on direct transfer, which uses multilingual encoders to map texts in different languages into a common space, trains a cross-language model on labeled samples of the source language, and applies it directly to test samples of the target language.
  • this direct-transfer model is described in detail in the paper "S. Upadhyay, Y. Vyas, M. Carpuat, and D. Roth. Robust cross-lingual hypernymy detection using dependency context. In conference of the NAACL, 2018.". However, most of these methods fail to account for the diversity of samples, which limits the performance of the model.
  • to overcome these drawbacks, the present invention uses a multilingual encoder to represent texts of different languages in a common cross-language space and derives a basic question generation model in that space; it then uses meta-learning, based on a small number of samples similar to a given test case, to adaptively optimize the basic model, improving the model's ability to transfer across the many types of samples found in different languages. On this basis, a method and device for generating questions across languages are provided.
  • a method for generating questions across languages, including the following steps:
  • S1. obtain a source language annotated data set and a target language annotated data set, and establish a probability distribution for question generation;
  • S2. obtain the source language text and the target language text, extract answers and the sentences corresponding to the answers, and encode them to obtain answer vectors and sentence vectors;
  • S3. generate a context vector from the answer vector through the attention mechanism, and obtain a basic question generation model based on the context vector;
  • S4. compute the similarity between the source language text and the target language text, and obtain a cross-language question generation model from the similarity;
  • S5. obtain samples from the source language text and the target language text through the cross-language question generation model; for each sample, retrieve similar samples from the source language annotated data set and establish a pseudo task; on the pseudo tasks, perform meta-training and meta-testing of the cross-language question generation model based on the basic question generation model of step S3, and output the trained cross-language question generation model.
  • the scale of the source language annotation data set in the step S1 is larger than the scale of the target language annotation data set.
  • a pointer network is used to extract answers from the source language annotated data set, and the answer is masked with a special tag in the sentence that contains it.
  • the source language text and the target language text are mapped to a common space through multilingual BERT and then encoded.
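For illustration only, the sketch below shows how sentences in different languages can be embedded into one shared space with a publicly available multilingual BERT checkpoint; the checkpoint name and the mean-pooling step are assumptions of this sketch, not details fixed by the patent.

```python
# Sketch: embed source- and target-language sentences in one shared space with
# multilingual BERT. The checkpoint name and mean pooling are illustrative
# assumptions, not prescribed by the patent.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
mbert = AutoModel.from_pretrained("bert-base-multilingual-cased")

def encode(sentences):
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = mbert(**batch).last_hidden_state          # (B, T, 768) subword vectors
    mask = batch["attention_mask"].unsqueeze(-1).float()
    return (hidden * mask).sum(1) / mask.sum(1)            # mean-pooled sentence vectors

vectors = encode(["Where was the treaty signed?", "条约是在哪里签署的？"])
print(vectors.shape)   # torch.Size([2, 768]); both languages share one space
```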
  • a probability distribution based on the gated recurrent neural network, a probability distribution built from the attention scores, and a probability distribution based on a feedforward neural network are obtained and weighted, and the resulting averaged probability distribution is used as the basic question generation model.
  • after step S3 is completed, the basic question generation model is trained based on supervised metrics.
  • the supervised metrics include fluency, answerability, and semantic relevance.
  • in step S4, the source language text, the target language text, and the answers obtained in step S2 are first mapped to latent variables through the circular normal (von Mises-Fisher) distribution; the latent variables are concatenated to obtain the unit vector of the sample corresponding to the answer in the latent space, and the relative entropy between the unit vectors of the samples is then derived and used as the similarity.
  • each sample in the target language text is used as a test set, and similar samples obtained from the source language annotated data set and the target language annotated data set are used as the training set; the test set and the training set together constitute the data set of a pseudo task.
  • in step S5, the specific process of meta-training is as follows:
  • pseudo tasks are randomly sampled until all pseudo tasks have been traversed; the cross-language question generation model is trained with the self-critical policy gradient training algorithm, and its parameters are updated.
  • the specific process of meta-testing is as follows: after the parameters of the cross-language question generation model are updated, the loss error value of the parameters is evaluated, and the parameters of the cross-language question generation model are further updated based on that loss error value.
  • a device for generating questions across languages including: an input module, an encoder, an attention mechanism module, a decoder, a context-related retriever, and a meta-learning module that are executed in sequence;
  • the input module is used to obtain source language annotation data set, target language annotation data set, source language text and target language text;
  • the encoder is used to encode the answer and the sentence corresponding to the answer to obtain the answer vector and sentence vector;
  • the attention mechanism module is used to process the answer vector to generate the context vector;
  • the decoder is used to process the context vector to obtain the basic question generation model;
  • the context-related retriever is used to calculate the similarity between the source language text and the target language text, obtain a cross-language question generation model, and output samples;
  • the meta-learning module is used to establish a pseudo-task for each sample, perform meta-training and meta-testing on the cross-language question generation model on the pseudo-task based on the basic question generation model, and output the trained cross-language question generation model.
  • the device for generating questions across languages further includes an evaluation unit. After the basic question generation model is output, the evaluation unit scores it and further adjusts its weighting parameters; when the score no longer improves, the basic question generation model is input into the context-related retriever.
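As an architectural illustration only, the six modules listed above could be chained as in the following sketch; the class and method names are placeholders, not the patent's implementation.

```python
# Sketch of the device as six modules executed in sequence; class and method names
# are placeholders for illustration only.
class CrossLanguageQGDevice:
    def __init__(self, input_module, encoder, attention, decoder, retriever, meta_learner):
        self.input_module = input_module    # loads annotated data sets and texts
        self.encoder = encoder              # answer / sentence encoding
        self.attention = attention          # builds the context vector
        self.decoder = decoder              # yields the basic question generation model
        self.retriever = retriever          # context-related retriever (similarity)
        self.meta_learner = meta_learner    # meta-training / meta-testing on pseudo tasks

    def build(self):
        data = self.input_module.load()
        answer_vec, sentence_vec = self.encoder.encode(data)
        context_vec = self.attention.contextualize(answer_vec, sentence_vec)
        base_model = self.decoder.fit(context_vec)
        cross_model, samples = self.retriever.retrieve(data, base_model)
        return self.meta_learner.train(cross_model, base_model, samples)
```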
  • the advantage of the present invention is that it utilizes the abundant annotation resources in the source language to enrich the training data that is lacking in the target language, and then effectively trains the question generation model of the target language. Furthermore, the model introduces meta-learning methods to solve the problem of sample diversity in cross-language generation tasks.
  • the advantages of this method include:
  • This method can transfer the rich annotation data of the source language to the target language, so that a high-performance question generation model can still be trained from the limited annotation data of the target language; it also uses meta-learning to optimize the model while accounting for sample diversity.
  • This method accurately measures the similarity of the context structure between samples by developing a context-related retriever.
  • the retriever is computationally efficient and does not rely on hand-crafted heuristic measures.
  • Figure 1 is a schematic flow diagram of a method for generating questions across languages.
  • Fig. 2 is another flow diagram of the method for generating questions across languages.
  • Figure 3 is a schematic diagram of the process of generating a basic question generation model.
  • Fig. 4 is a schematic structural diagram of a device for generating questions across languages.
  • a method for generating questions across languages includes the following steps:
  • the scale of the source language annotation data set is larger than the scale of the target language annotation data set
  • each word q_t in the question is obtained by sampling from the probability distribution p(·); Q_<t denotes the 1st to (t-1)-th generated words of the question, and q_t denotes the t-th word.
  • the goal of cross-language question generation is, starting from a small amount of target-language annotation resources D_non, to use transfer learning to fuse the large amount of source-language annotation resources D_en and learn an effective question generator M for the target language.
  • based on the pointer network described in the paper "O. Vinyals, M. Fortunato, and N. Jaitly. Pointer networks. In conference of the NIPS. 2015.", the present invention uses a pointer network to extract the answer from the given text.
  • the pointer network treats answer extraction as a linear sequence labeling task.
  • to mark the start and end positions of the answer, the result sequence O for a given text is predicted according to the following probability distribution:
  • where W_e, W_d, and v_a are trainable parameters,
  • H is the distributed encoding vector of the input text,
  • and d_i is the decoder state vector corresponding to the i-th output word.
  • the present invention uses the start and end position indices of the answers in the annotated data to train the pointer network. If the answer words appear in the question, the rationality and answerability of the question decrease. Therefore, following the solution in the article "Y. Kim, H. Lee, J. Shin, and K. Jung. Improving neural question generation using answer separation. In conference of the AAAI, 2019.", after the answers are extracted, a special <UNK> tag is used to mask the answer in the input sentence, and the answer and the sentence are encoded separately so that the question does not contain the answer.
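As a rough sketch of the answer-extraction step just described, the snippet below scores each input position against a decoder state in pointer-network style and masks the selected span with an <UNK> tag; the additive scoring form and the dimensions are assumptions inferred from the parameter names W_e, W_d, v_a, not a verbatim reproduction of the patent's distribution.

```python
# Sketch of pointer-style answer extraction and <UNK> masking. The additive score
# v_a^T tanh(W_e h_j + W_d d_i) is an assumption based on the parameter names above.
import torch
import torch.nn as nn

class PointerScorer(nn.Module):
    def __init__(self, enc_dim, dec_dim, att_dim=256):
        super().__init__()
        self.W_e = nn.Linear(enc_dim, att_dim, bias=False)
        self.W_d = nn.Linear(dec_dim, att_dim, bias=False)
        self.v_a = nn.Linear(att_dim, 1, bias=False)

    def forward(self, H, d_i):
        # H: (T, enc_dim) encoded input text; d_i: (dec_dim,) decoder state at step i
        scores = self.v_a(torch.tanh(self.W_e(H) + self.W_d(d_i))).squeeze(-1)
        return torch.softmax(scores, dim=-1)     # distribution over input positions

def mask_answer(tokens, start, end, mask_token="<UNK>"):
    # Replace the answer span so the generated question cannot simply restate it.
    return tokens[:start] + [mask_token] * (end - start + 1) + tokens[end + 1:]

scorer = PointerScorer(enc_dim=600, dec_dim=300)
probs = scorer(torch.randn(8, 600), torch.randn(300))
tokens = ["the", "treaty", "was", "signed", "in", "Paris", "in", "1898"]
print(probs.shape, mask_answer(tokens, 5, 5))    # answer "Paris" masked with <UNK>
```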
  • each word's distributed representation is passed through a bidirectional gated recurrent neural network (GRU) to capture contextual information, producing two kinds of representations: (a) context-aware word vectors, where the j-th word of the sentence is expressed as the vector h_j = [h→_j ; h←_j], h→_j and h←_j respectively denote the hidden state vectors of the j-th word in the forward and backward GRU, e_j denotes the word's distributed embedding vector, and the symbol [·;·] denotes the concatenation of two vectors; and (b) an overall encoding, where the overall representation of the sentence is obtained by concatenating its start and end states, each (o-th) word of which can likewise be expressed as a vector. The answer obtained by extraction is represented in the same way as a vector h_a.
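A minimal sketch of these two encodings, assuming PyTorch and illustrative dimensions: per-word context vectors from a bidirectional GRU, and a sentence vector from the concatenated final forward/backward states.

```python
# Sketch: (a) per-word context vectors h_j = [fwd_j; bwd_j] from a bidirectional GRU
# over word embeddings, (b) a sentence vector from the concatenated final states.
import torch
import torch.nn as nn

emb_dim, hid = 768, 300
bigru = nn.GRU(emb_dim, hid, batch_first=True, bidirectional=True)

word_embs = torch.randn(1, 12, emb_dim)            # e.g. mBERT vectors for 12 words
word_states, final = bigru(word_embs)              # word_states: (1, 12, 2*hid)
sentence_vec = torch.cat([final[0, 0], final[1, 0]], dim=-1)   # (2*hid,) overall encoding
print(word_states.shape, sentence_vec.shape)
```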
  • the answer vector generates the context vector through the attention mechanism
  • the self-attention mechanism is used to further optimize the distributed representation of the sentence, namely:
  • the attention mechanism comes from the article ("Wenhui Wang, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. 2017.Gated self-matching networks for reading comprehension and question answering. In Proceedings of the 55th ACL").
  • given the sentence representation H, the mechanism uses gating variables to measure, via formula (1), the association between the words within the sentence,
  • where α_j denotes the association scores between the j-th word and the other words of the sentence H,
  • u_j denotes the context-association vector of the j-th word,
  • the word representation is updated to f_j according to u_j,
  • and the part to be updated is determined by the gating variable g_j.
  • to strengthen the association between the sentence and the answer, and to compensate for the information lost when the answer is masked in the sentence, an answer-aware interactive encoding is adopted: given the representation f_j of the j-th word in sentence S and the representation h_a of the answer and evidence, the function f_m(·) captures their interaction along multiple dimensions. Three dimensions are used: overall association, i.e., the relevance of f_j to the overall vector of the answer and evidence; cumulative association, i.e., the relevance of f_j to the cumulative vector over the individual answer and evidence words; and maximum association, i.e., the relevance of f_j to the maximum vector over the individual answer and evidence words. The interaction function is defined as f_m(μ, ν, W) = cos(W_k ∘ μ, W_k ∘ ν), where ∘ denotes the element-wise product and each column W_k of the weight matrix W corresponds to one association dimension.
  • concatenating the association vectors of these dimensions gives the answer-information-aware vector m_j = [m_1; m_2; m_3]; this vector is input into another GRU to obtain a vector with context information, and finally a new answer-aware vector for the j-th word of the sentence is obtained by concatenation.
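The following sketch illustrates the three matching dimensions with a weighted cosine of the form f_m(μ, ν, W) = cos(W_k ∘ μ, W_k ∘ ν); the weight layout (one weight vector per dimension) and the use of the mean as the "overall" answer vector are assumptions of this sketch.

```python
# Sketch of the three matching dimensions between a word vector f_j and the answer
# word vectors, each scored as cos(W_k * mu, W_k * nu) with element-wise products.
import torch
import torch.nn.functional as F

def f_m(mu, nu, w_k):
    return F.cosine_similarity(w_k * mu, w_k * nu, dim=-1)

def answer_aware_match(f_j, answer_vecs, W):
    m1 = f_m(f_j, answer_vecs.mean(dim=0), W[0])        # overall association
    m2 = f_m(f_j, answer_vecs.sum(dim=0), W[1])         # cumulative association
    m3 = f_m(f_j, answer_vecs.max(dim=0).values, W[2])  # maximum association
    return torch.stack([m1, m2, m3])                    # m_j = [m_1; m_2; m_3]

dim = 600
scores = answer_aware_match(torch.randn(dim), torch.randn(4, dim), torch.randn(3, dim))
print(scores)   # three matching scores, later fed into another GRU with their neighbours
```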
  • the distributed representation vectors above are fused by the weighting of formula (2) to obtain the context vector c_t, where α_tj is the normalized attention weight, a_tk denotes the alignment score between text words, s_t denotes the hidden variable corresponding to the t-th generated word, and v, b, W_s, and W_h are trainable parameters.
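A small sketch of the attention fusion of formula (2), assuming an additive alignment score built from the trainable parameters v, b, W_s, W_h listed above; the exact score form is an assumption consistent with those names.

```python
# Sketch of formula (2): alignment scores a_tk between the decoder state s_t and each
# encoder vector h_k, softmax weights alpha_tj, and their weighted sum as c_t.
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    def __init__(self, dec_dim, enc_dim, att_dim=256):
        super().__init__()
        self.W_s = nn.Linear(dec_dim, att_dim, bias=False)
        self.W_h = nn.Linear(enc_dim, att_dim)     # this layer's bias plays the role of b
        self.v = nn.Linear(att_dim, 1, bias=False)

    def forward(self, s_t, H):
        a_t = self.v(torch.tanh(self.W_s(s_t) + self.W_h(H))).squeeze(-1)   # a_tk
        alpha_t = torch.softmax(a_t, dim=-1)                                # alpha_tj
        c_t = (alpha_t.unsqueeze(-1) * H).sum(dim=0)                        # context c_t
        return c_t, alpha_t

attn = AdditiveAttention(dec_dim=300, enc_dim=600)
c_t, alpha = attn(torch.randn(300), torch.randn(12, 600))
print(c_t.shape, alpha.shape)   # torch.Size([600]) torch.Size([12])
```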
  • the basic question generation model is obtained through the gated recurrent neural network.
  • to mitigate the out-of-vocabulary problem (generated words that never appear in the training vocabulary), the present invention adopts the copy mechanism from the article "Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O.K. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In Proceedings of the 54th ACL": a probability distribution p_cp is constructed by fusing the attention scores of all words, and words of the input text are copied according to this distribution, which alleviates the out-of-vocabulary problem to a certain extent.
  • in addition, a distribution p_qw = Softmax(g(s_t, c_t, h_a)) is used to keep the question word consistent with the answer type, where g(·) is a two-layer feedforward network that uses maxout as its activation function.
  • finally, a gating mechanism selectively generates each word of the question from the above three distributions, e.g., a question word from the question-word distribution, a content word from the vocabulary distribution, or an out-of-vocabulary word copied from the input via the copy mechanism.
  • the gate is controlled by a discrete vector that is learned during the decoding of each generated word; specifically, the vector gives the probabilities of the following three dimensions: p_gv, p_gc, p_gq = Softmax(f(s_t, c_t, q_{t-1})),
  • where f(·) produces the probability values through a feedforward neural network,
  • and q_{t-1} is the word generated at step t-1 of decoding. By weighting and summing the three distributions, the t-th word q_t of the question is generated according to the basic question generation model of formula (3): p(q_t | S, A, Q_<t) = p_gv · p_voc + p_gc · p_cp + p_gq · p_qw.
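A compact sketch of formula (3): three word distributions combined with a learned three-way gate. The single linear gate over concatenated features is an illustrative simplification.

```python
# Sketch of formula (3): combine the generation, copy and question-word distributions
# with a learned three-way gate.
import torch
import torch.nn as nn

class GatedOutput(nn.Module):
    def __init__(self, feat_dim):
        super().__init__()
        self.gate = nn.Linear(feat_dim, 3)    # f(s_t, c_t, q_{t-1}) -> three gate logits

    def forward(self, gate_features, p_voc, p_cp, p_qw):
        p_gv, p_gc, p_gq = torch.softmax(self.gate(gate_features), dim=-1)
        return p_gv * p_voc + p_gc * p_cp + p_gq * p_qw      # formula (3)

vocab = 50
gated = GatedOutput(feat_dim=900)
p_voc = torch.softmax(torch.randn(vocab), -1)   # vocabulary (generation) distribution
p_cp = torch.softmax(torch.randn(vocab), -1)    # copy distribution from attention scores
p_qw = torch.softmax(torch.randn(vocab), -1)    # question-word distribution
p_t = gated(torch.randn(900), p_voc, p_cp, p_qw)
print(float(p_t.sum()))                         # ~1.0: a valid distribution over words
```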
  • the basic question generation model is first optimized through supervised training; it is then fine-tuned with reinforcement learning, whose reward is an average score obtained by scoring fluency, answerability, semantic relevance, and so on, and weighting these scores.
  • the score function r(Y) is obtained as a weighted average of the following three types of indicators and measures the difference between the question text Q output by the model and the annotated question Q*:
  • (a) fluency: the negative perplexity of the generated question under a language model;
  • (b) answerability: the present invention uses QBLEU_4(Q, Q*) to measure the answerability of the generated question; precision and recall are computed over relevant content words, entity words, question words, and function words, and the answerability function is the weighted combination QBLEU_4(·,·) = δ · Answerability + (1 - δ) · BLEU_4, where δ is a weight parameter and BLEU_4 is the matching function;
  • (c) semantic relevance: the word mover's distance (WMD) between Q and Q* in the distributed space, normalized by the question length, i.e. -WMD(Q, Q*)/Length(Q*).
  • in practice, considering that the model needs supervision to stay close to the annotated results and to avoid local optima, the present invention uses the mixed loss of formula (4), L = λ·L_rl + (1 - λ)·L_sl, and sets the reinforcement learning weight λ low, to 0.3.
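A toy sketch of this training signal: a weighted-average reward over fluency, answerability and semantic relevance, and the mixed loss of formula (4) with λ = 0.3. The three scorer functions are placeholders; the patent uses a language-model negative perplexity, QBLEU_4 and a WMD-based score for them.

```python
# Sketch of the reward and the mixed loss L = lambda*L_rl + (1 - lambda)*L_sl (formula (4)).
def reward(generated, reference, fluency_fn, answerability_fn, relevance_fn,
           weights=(1 / 3, 1 / 3, 1 / 3)):
    scores = (fluency_fn(generated),
              answerability_fn(generated, reference),
              relevance_fn(generated, reference))
    return sum(w * s for w, s in zip(weights, scores))

def mixed_loss(rl_loss, supervised_loss, lam=0.3):
    return lam * rl_loss + (1.0 - lam) * supervised_loss    # formula (4)

r = reward("where was it signed", "where was the treaty signed",
           fluency_fn=lambda q: 0.8,                 # placeholder scorers
           answerability_fn=lambda q, ref: 0.7,
           relevance_fn=lambda q, ref: 0.9)
print(r, mixed_loss(rl_loss=1.2, supervised_loss=0.6))
```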
  • the present invention first maps a given text sentence S to a latent variable z_s through the von Mises-Fisher distribution (vMF for short, also called the circular normal distribution).
  • the vMF distribution refers to the following formula (5):
  • where z_s and μ_s are unit vectors,
  • Z_κ is a normalization term that depends only on the constant concentration parameter κ and the dimensionality d,
  • h_s is the distributed representation of the sentence,
  • and W_p and b_p are trainable parameters.
  • this distribution makes similarity calculation easier and more robust.
  • similarly, the representation h_a of the extracted answer is also mapped to a latent variable z_a; by concatenation, the distributed representation z = [z_s; z_a] of each test example in the latent space is obtained.
  • given two evaluation samples (S_i, A_i) and (S_j, A_j), the present invention first maps them into the latent space and then computes, in that space, the relative entropy (KL divergence) between the distributions of their corresponding latent variables, KL(p(z_i | S_i, A_i) || p(z_j | S_j, A_j)), which measures the similarity between the samples. Since z follows a vMF distribution, the corresponding relative entropy can be derived as in formula (6),
  • where μ is the direction vector of the vMF distribution,
  • κ and d are constants,
  • C_κ = κ·I_{d/2}(κ) / (2·I_{d/2-1}(κ)),
  • and I_d denotes the modified Bessel function of order d.
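A sketch of the context-related retriever: project a sample representation onto the unit sphere (the vMF direction vector) and score a pair with the closed-form KL divergence between vMF distributions sharing the same concentration κ. The linear map and the exact constant in front of (1 - μ_i·μ_j) are assumptions of this sketch; the patent's own constant is the C_κ of its formula (6).

```python
# Sketch of vMF latent mapping and KL-based similarity scoring.
import torch
import torch.nn as nn
from scipy.special import iv      # modified Bessel function of the first kind

class VMFMapper(nn.Module):
    def __init__(self, in_dim, lat_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, lat_dim)     # plays the role of W_p, b_p

    def forward(self, h):
        mu = self.proj(h)
        return mu / mu.norm(dim=-1, keepdim=True)  # unit direction vector

def vmf_kl(mu_i, mu_j, kappa=20.0, d=64):
    a_d = iv(d / 2, kappa) / iv(d / 2 - 1, kappa)
    return kappa * a_d * (1.0 - float(mu_i @ mu_j))   # smaller = more similar

# In the patent, h_s and h_a are mapped separately and the latents concatenated;
# for brevity this sketch maps one concatenated feature vector per sample.
mapper = VMFMapper(in_dim=1200, lat_dim=64)
z_i, z_j = mapper(torch.randn(1200)), mapper(torch.randn(1200))
print(vmf_kl(z_i, z_j))
```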
  • the retriever automatically learns from the data to obtain the mapping function and calculates the corresponding similarity.
  • the training objective of the cross-language question generation model is: p(Q | S, A) = Σ p_r((S', A', Q') | S, A) · p_m(Q | S, A, (S', A', Q')),
  • where p_r(· | S, A) denotes retrieving similar samples (S', A', Q') from the annotated data sets D_en and D_non,
  • and p_m(·) denotes the meta-learner generating questions conditioned on the retrieved results. A simple training scheme, such as maximizing the marginal likelihood through joint learning, would be intractable, so the present invention trains the retriever separately.
  • specifically, it is assumed that a prior meta question generator provides the conditional probability distribution of the question Q for a given target input (S, A) and, based on the joint probability p_r((S', A', Q') | S, A)·p_data(S, A, Q), provides the corresponding retrieved samples; under this assumption, a lower bound of the generator's objective can be derived, as in formula (7): log p(Q | S, A) ≥ E_{Q~p(Q|S,A)} log p(Q | z) - 8·C_κ,
  • where p(Q | z) is a gated recurrent neural network (GRU) decoder that predicts and generates the question Q from the latent variable z. The lower bound is computed with the reparameterized-gradient method proposed in "T.R. Davidson, L. Falorsi, N. De Cao, T. Kipf, and J.M. Tomczak. Hyperspherical variational auto-encoders. In conference of the UAI, 2018.".
  • the source language text and the target language text yield samples through the cross-language question generation model; for each sample, similar samples are retrieved from the source language annotated data set, and a pseudo task is established for each sample.
  • Meta-learning includes two iterative steps: meta-training and meta-testing. By fine-tuning the model with a small number of similar samples, an optimized model can be obtained, which can effectively capture the diversity of samples, and output better results in new test tasks in a targeted and fast manner.
  • the cross-language question generation model is subjected to meta-training and meta-testing based on the basic question generation model in step S3 on the pseudo task, and the trained cross-language question generation model is output.
  • the present invention uses each test example in the target language data set D_non as the test set of a single meta task T_i,
  • and uses the retriever to obtain the top-K similar samples from the annotated data sets D_en and D_non as the pseudo training set of pseudo task T_i.
  • the present invention first randomly samples a pseudo task and uses it to train the basic cross-language question generation model M_θ described above, where θ denotes the model parameters. The new parameters θ' are obtained by gradient updates, θ' = U_m(θ; α), where U(·) denotes the gradient update operation, m the number of updates, and α the learning rate used to minimize the learning objective L_θ; a single update operation follows formula (8).
  • because the non-continuous loss in the above optimization objective is not differentiable, a self-critical policy gradient training algorithm is used to train the model.
  • the self-critical policy gradient algorithm was proposed in the article "S.J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel. 2017. Self-Critical Sequence Training for Image Captioning. In Proceedings of the CVPR" and is an efficient reinforcement learning method. Specifically, the algorithm converts the non-continuous reinforcement learning loss into a baselined form in which Q_b denotes the output sequence of the baseline method,
  • where the baseline generates in a locally optimal way, i.e., greedy decoding that picks the highest-probability word at each step;
  • Q_s is the sequence output by the generator M_θ, each word of which is sampled according to the probability values of formula (3).
  • by minimizing this loss, the model is optimized to generate sequences that score higher than the baseline.
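A minimal sketch of the self-critical loss: the reward of the greedy output provides the baseline, and the reward gap weights the negative log-likelihood of the sampled output.

```python
# Sketch of self-critical sequence training: greedy output Q_b is the baseline for
# the sampled output Q_s.
import torch

def self_critical_loss(sample_logprob, sample_reward, greedy_reward):
    # sample_logprob: sum of log p(q_t) over the sampled question Q_s (formula (3))
    # sample_reward, greedy_reward: r(Q_s) and r(Q_b) from the scoring function r(.)
    advantage = sample_reward - greedy_reward        # greedy decoding is the baseline
    return -advantage * sample_logprob               # minimise => beat the baseline

loss = self_critical_loss(sample_logprob=torch.tensor(-12.3, requires_grad=True),
                          sample_reward=0.62, greedy_reward=0.55)
loss.backward()    # pushes the generator toward sequences scoring above the baseline
```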
  • after meta-training, the updated parameters θ_i' are obtained, and the present invention uses the test set of the pseudo task to evaluate the loss error value of these parameters. Based on this error, the present invention further uses all pseudo tasks to train the question generation model M_θ, i.e., it minimizes the total loss error with a first-order gradient update at learning rate β.
  • to reduce the computational cost, the Jacobian can be simplified to the identity matrix through a first-order approximation, and the meta update is then performed with formula (9).
  • by iterating over all pseudo tasks, the optimal generation model is obtained. It is more sensitive to the changes between different pseudo tasks, which helps it learn internal representations common to the tasks rather than the characteristics of any individual task. Therefore, only one or a few fine-tuning steps on a small amount of data are needed to obtain a sample-specific model that achieves high performance without overfitting.
  • given a new test example in the target language, the present invention first constructs a pseudo task for it, i.e., treats it as the test set of a pseudo task T_j; the retriever is then used to obtain the top-K similar samples from the labeled data D_en and D_non to construct a pseudo training set. Next, the meta model is updated with one gradient step at learning rate γ, minimizing the loss function of formula (4), to obtain the optimal model, which is then used to generate the result for the given test example.
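A first-order sketch of this meta-training and meta-testing loop over pseudo-tasks, in the spirit of formulas (8) and (9); the task fields ("train"/"test"), the loss-function interface and the optimizer choices are illustrative assumptions.

```python
# Sketch of first-order meta-learning over pseudo-tasks: inner adaptation at rate
# alpha, outer update at rate beta with the Jacobian treated as identity, and a
# single adaptation step at rate gamma at meta-test time.
import copy
import torch

def inner_adapt(model, support_batch, loss_fn, lr, steps=1):
    adapted = copy.deepcopy(model)
    opt = torch.optim.SGD(adapted.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(adapted, support_batch).backward()
        opt.step()
    return adapted

def meta_train(model, pseudo_tasks, loss_fn, alpha=1e-3, beta=1e-3):
    outer_opt = torch.optim.SGD(model.parameters(), lr=beta)
    for task in pseudo_tasks:                             # iterate over all pseudo tasks
        adapted = inner_adapt(model, task["train"], loss_fn, lr=alpha)   # theta_i'
        query_loss = loss_fn(adapted, task["test"])       # loss error of theta_i'
        grads = torch.autograd.grad(query_loss, list(adapted.parameters()))
        outer_opt.zero_grad()
        for p, g in zip(model.parameters(), grads):
            p.grad = g.clone()                            # first-order approximation
        outer_opt.step()
    return model

def meta_test(meta_model, pseudo_task, loss_fn, gamma=1e-3):
    # One gradient step on the retrieved similar samples, then generate for the test example.
    return inner_adapt(meta_model, pseudo_task["train"], loss_fn, lr=gamma)
```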
  • a device for generating questions across languages includes: an input module, an encoder, an attention mechanism module, a decoder, a context-related retriever, and a meta-learning module that are executed in sequence;
  • the input module is used to obtain source language annotation data set, target language annotation data set, source language text and target language text;
  • the encoder is used to encode the answer and the sentence corresponding to the answer to obtain the answer vector and sentence vector;
  • the attention mechanism module is used to process the answer vector to generate the context vector
  • the decoder is used to process the context vector to obtain the basic question generation model
  • the context-related retriever is used to calculate the similarity between the source language text and the target language text, obtain a cross-language question generation model, and output samples;
  • the meta-learning module is used to establish pseudo-tasks for each sample, perform meta-training and meta-testing based on the basic question-generation model on the cross-language question generation model on the pseudo-task, and output the trained cross-language question generation model.
  • the device for generating questions across languages further includes an evaluation unit.
  • after the basic question generation model is output, the evaluation unit scores it and further adjusts its weighting parameters; when the score no longer improves, the basic question generation model is input into the context-related retriever.
  • the CMRC data set is proposed by the article ("Y. Cui, T. Liu, W. Che, L. Xiao, Z. Chen, and et al. A span-extraction dataset for chinese machine reading comprehension. In conference of the EMNLP-IJCNLP, 2019."); the DRCD data set is proposed by the article ("C. Chieh Shao, T. Liu, Y. Lai, Y. Tseng, and S. Tsai. DRCD: a chinese machine reading comprehension dataset. In arXiv prePrint: 1806.00920, 2018.").

Abstract

A method and a device for generating questions across languages. The method includes: S1. obtaining annotated data sets and establishing a probability distribution for question generation; S2. extracting answers and the sentences corresponding to the answers, and encoding them to obtain answer vectors and sentence vectors; S3. generating a context vector from the answer vector through an attention mechanism to obtain a basic question generation model; S4. computing the similarity between texts to obtain a cross-language question generation model; S5. obtaining samples through the cross-language question generation model, establishing a pseudo task for each sample, performing meta-learning of the cross-language question generation model based on the basic question generation model, and outputting the final cross-language question generation model. The advantage is that the annotation resources of the source language are used to enrich the training data that the target language lacks, so that an effective question generation model for the target language can be trained; meta-learning is also introduced to address the problem of sample diversity in cross-language generation tasks.

Description

一种跨语言生成提问的方法和装置 技术领域
本发明涉及人工智能领域,更具体地,涉及一种跨语言生成提问的方法和装置。
背景技术
机器阅读理解是人工智能和自然语言处理领域的研究热点,作为与之对偶的研究课题,如美国专利申请(US6959417B2,Question and answer generator)所述,提问生成(QG)能够基于文本生成提问和与之对应的答案,应用到非常多的产业当中,包括提供训练数据来支撑问答模型的构建、生成用于教学的考题或习题、通过提问的方式来获得对话反馈等。传统提问生成方法主要通过启发式的规则或手工模板把文本转换为提问,但这些人工方法的通用性和可扩展性较低。
针对提问生成的课题,在学术领域目前主流的方法可归纳为两类。第一类方法是利用语法或者句法分析器把文本转换为中间形式,如语法或句法树,然后利用模板或者规则把该中间形式提取出提问和答案。由于模板和规则是人工设计的,构建和更新成本都高,因此模型的可扩展性和覆盖度都很有限。为了解决以上问题,另一类方法使用基于序列到序列的神经模型直接把文本转换成提问,这个转换过程依靠从训练数据中学习到的文本和提问之间的对齐关系来实现。序列到序列的方法在论文“D.Bahdanau,K.Cho and Y.Bengio.2015.Neural Machine Translation by Jointly Learning to Align and Translate”中有详细介绍。该模型完全是数据驱动的,不需要人工定义大量的规则或者模板。但神经网络模型需要大量人工标注的数据来训练;模型的性能受标注数据规模的显著影响。
最近的研究转向了神经网络模型,即通过神经网络从标注数据中自动学习出文本和提问之间的映射关系,进而使用基于序列到序列的模型来生成提问。但神经网络模型很大程度上依赖于大量人工标注的数据;模型的性能直接受数据规模大小的影响。这导致这些模型由于昂贵的标注成本很难快速部署到低资源语言中,即那些缺乏标注资源的语言。当前全球使用数百种语言,只有少量的语言有丰富的标注资源,其他大多数只有少量甚至没有标注数据,导致现有方法无法直接应用于低资源语言,也难以直接构建出有效的提问生成模型。
对于基于神经网络模型的跨语言提问生成的任务主要有两个方向。一种是基于翻译的方法,即把源语言的标注样本翻译成目标语言,来作为目标语言模型的 训练数据;或者把目标语言的测试样例翻译成源语言,然后根据源语言模型预测结果并把该结果翻译回目标语言。这种基于翻译的模型在论文“S.Schuster,S.Gupta,R.Shah,and M.Lewis.Cross-lingual transfer learning for multilingual task-oriented dialog.In NAACL,2019.”中有详细介绍。但翻译器通常需要串联到提问生成模型中,而非端到端融合的统一模型。这种拼接的模型会导致误差积累而造成模型整体性能较差。另一种方法是基于直接迁移,通过利用多语言的编码器把不同语言的文本映射到共同空间中,利用源语言的标注样本训练跨语言的模型,并直接应用于目标语言的测试样本。这种基于直接迁移的模型在论文“S.Upadhyay,Y.Vyas,M.Carpuat,and D.Roth.Robust cross-lingual hypernymy detection using dependency context.In conference of the NAACL,2018.”中有详细介绍。但这些方法大多忽略考虑样本的多样性,从而限制了模型的性能。
发明内容
本发明为克服上述现有技术所述的缺陷,利用多语言编码器将不同语言的文本表示到跨语言的共同空间中,然后在空间上得出基础提问生成模型;随后,利用元学习基于给定测试用例的少量相似样本对基础模型进行适配性的优化,以提高模型在不同语言中对多种类型样本的迁移能力,提供一种跨语言生成提问的方法和装置。
为解决上述技术问题,本发明的技术方案如下:
一种跨语言生成提问的方法,包括以下步骤:
S1.获取源语言标注数据集和目标语言标注数据集,建立用于提问生成的概率分布;
S2.获取源语言文本和目标语言文本,抽取答案和答案对应的句子,将答案和答案对应的句子进行编码,得到答案向量和句子向量;
S3.答案向量通过注意力机制生成上下文向量,基于上下文向量得到基础提问生成模型;
S4.计算源语言文本和目标语言文本的相似度,通过相似度得到跨语言提问生成模型;
S5.源语言文本和目标语言文本通过跨语言提问生成模型得到样本,所述样本能够从源语言标注数据集得出相似样本,对每个样本建立伪任务,在伪任务上对跨语言提问生成模型进行基于所述步骤S3的基础提问生成模型的元训练和元 测试,输出经过训练的跨语言提问生成模型。
进一步地,所述步骤S1的源语言标注数据集的规模大于目标语言标注数据集的规模。
进一步地,在所述步骤S2中,使用指针网络从所述源语言标注数据集抽取答案,并使用标记屏蔽答案对应的句子中的答案。
进一步地,在所述步骤S2中,通过多语言BERT将源语言文本和目标语言文本映射到共同空间后编码。
进一步地,在所述步骤S3中,得到并对基于门控循环神经网络的概率分布、基于注意力分值构建的概率分布和基于前馈式神经网络的概率分布分别进行加权,得到平均的概率分布作为基础提问生成模型。
进一步地,在所述步骤S3完成以后,基于有监督指标训练基础提问生成模型。
进一步地,所述有监督指标包括流畅度、可解答和语义关联。
进一步地,在所述步骤S4中,首先通过循环正态分布将源语言文本、目标语言文本和所述步骤S2获得的答案映射到潜在变量,通过拼接潜在变量获得答案对应的样本在潜在空间的单元向量,随后经过推导得出样本对应的单元向量的相对熵作为相似度。
进一步地,在所述步骤S5中,将目标语言文本中每个样本作为测试集,通过从源语言标注数据集和目标语言标注数据集中获得相似样本作为训练集,测试集和训练集共同构成伪任务的数据集。
进一步地,在所述步骤S5中,元训练的具体过程如下:
随机抽取伪任务直至遍历所有伪任务,通过自临界策略梯度训练算法训练跨语言提问生成模型并更新跨语言提问生成模型的参数。
进一步地,元测试的具体过程如下:
在更新跨语言提问生成模型的参数后,评估参数的损失误差值,基于损失误差值进一步更新跨语言提问生成模型的参数。
一种跨语言生成提问的装置,包括:依次执行的输入模块、编码器、注意力机制模块、解码器、上下文关联检索器和元学习模块;
输入模块用于获取源语言标注数据集、目标语言标注数据集、源语言文本和目标语言文本;
编码器用于将答案和答案对应的句子进行编码,得到答案向量和句子向量;
注意力机制模块用于处理答案向量生成上下文向量;
解码器用于处理上下文向量得到基础提问生成模型;
上下文关联检索器用于计算源语言文本和目标语言文本的相似度,得到跨语言提问生成模型并输出样本;
元学习模块用于对每个样本建立伪任务,在伪任务上对跨语言提问生成模型进行基于基础提问生成模型的元训练和元测试,输出经过训练的跨语言提问生成模型。
进一步地,跨语言生成提问的装置还包括评估单元,在评估单元输出基础提问生成模型后,由评估单元对基础提问生成模型进行评分并进一步调整基础提问生成模型的加权参数,当评分不再提高时,将基础提问生成模型输入到上下文关联检索器中。
与现有技术相比,本发明技术方案的有益效果是:
本发明的优点在于,利用源语言中丰富的标注资源来丰富目标语言短缺的训练数据,进而有效地训练出目标语言的提问生成模型。进一步地,模型引入元学习方法来解决跨语言生成任务中样本的多样性难题。本方法的优点包括:
(1)该方法能够把源语言中丰富的标注数据迁移到目标语言中,让在目标语言有限的标注数据依然能训练出性能优越的提问生成模型;而且使用元学习考虑样本多样性来优化模型。
(2)该方法通过开发上下文关联的检索器来精确地度量样本间上下文结构的相似度,该检索器计算效率高,不需要依赖人工启发式度量。
附图说明
为了更清楚地说明本发明实施例或现有技术中的技术方案,下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本发明的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1是跨语言生成提问的方法的流程示意图。
图2是跨语言生成提问的方法的另一流程示意图。
图3是生成基础提问生成模型的流程示意图。
图4是跨语言生成提问的装置的结构示意图。
具体实施方式
附图仅用于示例性说明,不能理解为对本专利的限制;
对于本领域技术人员来说,附图中某些公知结构及其说明可能省略是可以理解的。
下面结合附图和实施例对本发明的技术方案做进一步的说明。
一种跨语言生成提问的方法,如图1和图2所示,包括以下步骤:
S1.获取源语言标注数据集和目标语言标注数据集,建立用于提问生成的概率分布;
具体的,源语言标注数据集的规模大于目标语言标注数据集的规模;
获取大规模的源语言标注数据集的模型
Figure PCTCN2020094677-appb-000001
以及小规模的目标语言标注数据集为
Figure PCTCN2020094677-appb-000002
其中S为文档句子,A为答案,Q为提问,并且□>n。
通过最大化以下用于提问生成的概率分布来生成最佳的提问:
Figure PCTCN2020094677-appb-000003
其中,提问中的每个词q t通过从概率分布p(·)中采样获得,Q <t代表提问中第1 th到(t-1) th个生成的词,q t表示第t th个词。跨语言的提问生成目标是在少量的目标语言标注资源D non的基础上,利用迁移学习把源语言中大量的标注资源D en来融合一起学习有效的目标语言提问生成器M。
S2.根据图3所示的生成基础提问生成模型的流程图,首先获取源语言文本和目标语言文本,建立编码器,抽取答案和答案对应的句子;
具体的,基于在论文“O.Vinyals,M.Fortunato,and N.Jaitly.Pointer networks.In conference of the NIPS.2015.”所述的指针网络,本发明采用指针网络来抽取给定文本中的答案。指针网络将抽取答案看成线性序列标注任务,为了标记答案的开始和结束位置,根据以下概率分布来预测给定文本的结果序列O:
Figure PCTCN2020094677-appb-000004
其中,W e、W d、v a是可训练的参数,H是输入文本的分布式编码向量,d i是第i th个输出词对应的解码状态向量。本发明利用标注数据中答案的开始和结束位置索引来训练指针网络。如果提问中包含了答案词,会导致提问的合理性和可解答性下降。因此,基于文章“Y.Kim,H.Lee,J.Shin,and K.Jung.Improving neural  question generation using answer separation.In conference of the AAAI,2019.”中的方案,在抽取答案后,使用特殊的标记来<UNK>屏蔽输入句子中的答案,并分别对它们进行编码以避免答案包含问题。
然后,将答案和答案对应的句子进行编码,得到答案向量和句子向量;
具体的,对于给定输入的句子和抽取的答案,首先利用基于文章“J.Devlin,M.W.Chang,K.Lee,and et al.BERT:Pre-training of deep bidirectional transformers for language understanding.In conference of the NAACL,2019.”的多语言BERT(mBERT)把这些文本映射到跨语言的共同空间中,用于表示成分布式向量,其中mBERT在104种语言上进行了预训练获得的分布式向量。每个词利用文章“Y.Wu,M.Schuster,Z.Chen,and et al.Google’s neural machine translation system:Bridging the gap between human and machine translation.2016.”所述的WordPiece模型来分词,该模型中有110k个跨语言共享词汇表,其中每个词的分布式表示通过双向门控循环神经网络(GRU)来捕捉上下文信息的分布式向量。GRU编码器来源于文章“K.Cho,B.Merrienboer,C.Gulcehre,D.Bahdanau,F.Bougares,H.Schwenk,and et al.Learning phrase representations using rnn encoder-decoder for statistical machine translation.In conference of the EMNLP,2014.”,能够捕捉语言序列前后关联信息。给定句子每个词的分布式向量,经过GRU的处理后能生成两类表示,包括(a)带上下文信息的词向量,对于句子中第j th个词,表示成向量
Figure PCTCN2020094677-appb-000005
其中
Figure PCTCN2020094677-appb-000006
Figure PCTCN2020094677-appb-000007
分别表示前向和后向GRU中第j th个词对应的潜在状态向量,
Figure PCTCN2020094677-appb-000008
表示这个词的分布式向量,符号[·;·]表示两个向量的拼接操作;(b)整体的编码,通过拼接开始和终止状态获得句子的整体表示
Figure PCTCN2020094677-appb-000009
其中它们第o th个词可表示成
Figure PCTCN2020094677-appb-000010
向量。因此,抽取获得的答案表示成
Figure PCTCN2020094677-appb-000011
S3.答案向量通过注意力机制生成上下文向量;
具体的,为了能有效刻画句子中单词在语义上的长关联依赖,使用自身注意力机制来进一步优化句子的分布式表示方式,即:
Figure PCTCN2020094677-appb-000012
注意力机制来源于文章(“Wenhui Wang,Nan Yang,Furu Wei,Baobao Chang, and Ming Zhou.2017.Gated self-matching networks for reading comprehension and question answering.In Proceedings of the 55th ACL”)。具体地,给定句子的表示H,该机制使用控制变量通过公式(1)来衡量句子内部各个单词之间的关联关系。其中α j表示第j个单词
Figure PCTCN2020094677-appb-000013
与句子H中其他单词的关联分数,u j表示第j个单词的上下文关联向量,
Figure PCTCN2020094677-appb-000014
根据u j来更新为f j,由控制变量g j来确定更新的部分。
Figure PCTCN2020094677-appb-000015
为了增强句子和答案之间的关联信息,以及考虑到句子屏蔽答案后的信息损失需要补充答案的信息,采用答案感知的交互编码方式,即
Figure PCTCN2020094677-appb-000016
给定句子S中第j th个词的表示
Figure PCTCN2020094677-appb-000017
以及答案和证据点的表示
Figure PCTCN2020094677-appb-000018
通过函数f m(·)来从多个维度捕捉它们的交互关联。本方法采用三个维度,包括整体关联,即计算
Figure PCTCN2020094677-appb-000019
和答案和证据点整体的关联
Figure PCTCN2020094677-appb-000020
累计关联,即计算
Figure PCTCN2020094677-appb-000021
和答案和证据点各个词累计向量的关联
Figure PCTCN2020094677-appb-000022
最大关联,计算
Figure PCTCN2020094677-appb-000023
和答案和证据点各个词最大向量的关联
Figure PCTCN2020094677-appb-000024
总体交互关联的函数被定义为f m(μ,ν,W)=cos(W k□μ,W k□ν),其中□表示向量间的点乘数学符号,W表示权重矩阵,该矩阵的每列W k表示对应关联维度的权重。
通过拼接上述维度对应的关联向量,获得答案信息感知的向量m j=[m 1;m 2;m 3],把该向量输入另一个GRU中来获得带上下文信息的向量
Figure PCTCN2020094677-appb-000025
最后通过拼接获得针对句子第j th个词的带答案信息感知的新向量
Figure PCTCN2020094677-appb-000026
通过公式(2)加权来融合以上的分布式表示向量,获得向量c t,其中αt j是归一化后的注意力权重,a tk表示文本单词之间的对齐分数,s t表示生成出的第t th 个词对应的隐含变量,v,b,W s,W h是可训练的参数。
Figure PCTCN2020094677-appb-000027
然后,基于上下文向量通过门控循环神经网络得到基础提问生成模型。
具体的,基于上下文向量c t,使用另一个GRU来生成提问;提问的每个单词根据p voc=Softmax(W os t+b o)的概率分布来生成,其中s t=GRU(s t-1,c t),s t和s t-1表示第t th和(t-1) th个生成词对应的解码潜在向量,W o和b o表示可训练的参数。
为了解决无登录词的问题(即生成的词未在训练数据的词集合中出现),本发明采用来源于文章“Jiatao Gu,Zhengdong Lu,Hang Li,and Victor O.K.Li.2016.Incorporating copying mechanism in sequence-to-sequence learning.In Proceedings of the 54th ACL”的复制机制,通过融合所有词的注意力分值来构建概率分布
Figure PCTCN2020094677-appb-000028
并按该分布来复制输入文本的词,在一定程度解决未登录词的问题。
另外,采用概率分布p qw=Softmax(g(s t,c t,h a))来保证提问词和答案类型之间的一致性,其中g(·)是两层的前馈式神经网络,该网络以最大输出(maxout)作为激活函数。最后,使用门控机制来从以上三种分布中选择性地生成问题,譬如从提问词分布中采用生成提问词、从词分布中采用生成提问内容的词、或者利用复制机制从输入的未登录词分布中生成词。门控开关由离散向量来控制,该向量在每一个生成词的解码过程中学习获得。具体地,该向量是以下三维度的概率:
p gv,p gc,p gq=Softmax(f(s t,c t,q t-1))
其中,f(·)通过前反馈神经网络来生成概率值,q t-1是在解码过程中生成的t-1个词。通过对以上三种分布加权求和,根据公式(3)的基础提问生成模型来生成提问的第t个词q t
p(q t|S,A,Q <t)=p gv·p voc+p gc·p cp+p gq·p qw--公式(3);
在生成基础提问生成模型的基础上,通过有监督的训练优化基础提问生成模型,通过在流畅度、可解答和语义关联等方面进行评分并通过加权求出平均评分。
具体的,为了提升训练的收敛速度,首先使用有监督的方法基于多种语言的 标注数据通过最小化负交叉熵
Figure PCTCN2020094677-appb-000029
来训练基础的跨语言提问生成模型,其中Q表示模型的预测结果,Q *表示标注数据的真实结果,T表示提问对应的单词个数。
根据文章“R.Paulus,C.Xiong,and R.Socher.A deep reinforced model for abstractive summarization.In conference of the ICLR,2018.”所提到的问题,考虑到传统的有监督学习存在硬匹配偏差和训练和测试之间的评估差异等不足,导致单纯优化有监督的离散目标函数并不能在连续的评估函数中获得最优解。为了解决该问题,本方法借助于强化学习来微调模型,让模型更容易获得最优解。强化学习是用于优化非连续函数的目标。具体地,目标是找出最佳的生成单词策略π θ来最小化所生成提问对应的损失函数:
Figure PCTCN2020094677-appb-000030
其中,分值函数r(Y)通过以下三类指标做加权平均和获得,用于衡量模型输出的提问文本Q和标注提问Q *之间的差异,包括:
(a)流畅度:本发明采用基于语言模型计算负困惑度的方式来衡量所生成的提问文本的流畅度。根据文章("X.Zhang and M.Lapata.2017.Sentence Simplification with Deep Reinforcement Learning.In Proceedings of EMNLP")所述的计算方式,在实际应用中能有效衡量生成文本的质量,具体如下:
Figure PCTCN2020094677-appb-000031
(b)可解答:本发明采用QBLEU 4(Q,Q *)来衡量生成的提问的可解答性。具体地,准确率的计算公式为:
Figure PCTCN2020094677-appb-000032
召回率的计算公式为:
Figure PCTCN2020094677-appb-000033
其中i∈{r,n,q,f},∑ iw i=1,|l i|,|r i|分别表示属于i th种类型的生成提问和标注提问单词数,r,n,q,f分别代表相关内容词、实体词、提问词和功能词。
通过以下公式加权获可解答函数
QBLEU 4(·,·)=δAnswerability+(1-δ)BLEU 4
其中,
Figure PCTCN2020094677-appb-000034
δ是权重参数;BLEU n=4是匹配度函数,来源于文章("K.Papineni,S.Roukos,T.Ward,and W.J.Zhu.2019.BLEU:A Method for Automatic Evaluation of Machine Translation.In Proceedings of the 40th ACL"),该函数通过计算文本对应子串的重叠度来衡量翻译文本和真实文本的匹配状况,即越多子串能匹配,分值越高。
(c)语义关联:考虑到问题表达方式的多样性,本发明奖励地提升那些与真实问题Q *在分布式空间中高度相似的提问Q的分值。为了计算相似度,本方法采用由文章"H.Gong,S.Bhat,L.Wu,J.Xiong,and W.Hwu.2019.2019.Reinforcement Learning Based Text Style Transfer without Parallel Training Corpus.In Proceedings of the 57th NAACL"提出的词步长距离(WMD),是一种非常高效和鲁棒性很强的方法,该方法用于计算两个文本在分布式空间中的语义相似度。通过生成文本的词语长度来正则化,就能获得语义关联指标的分值-WMD(Q,Q *)/Length(Q *),其中WMD(.)函数计算公式如下:
Figure PCTCN2020094677-appb-000035
Figure PCTCN2020094677-appb-000036
考虑到使用单一的损失函数有可能导致生成提问的可读性不强,为了解决该问题,本发明采用混合目标的损失函数来提升可读性,参考公式(4),其中λ是权重参数,公式(4)具体如下:
L=λL rl+(1-λ)L sl--公式(4)
在实践中,考虑到模型需要约束来逼近标注结果,来避免各类局部最优的可能,本发明把强化学习的权重λ设置较低,为0.3。
S4.通过上下文关联检索器计算源语言文本和目标语言文本的相似度;
具体的,本发明首先通过冯·米塞斯分布,或称循环正态分布(von Mises-Fisher,简称vMF分布)将给定的文本句子S映射到潜在变量z s。vMF分布参考以下公式(5):
Figure PCTCN2020094677-appb-000037
其中,z s和μ s为单元向量,Z κ是仅依赖于常数的集中度参数κ和d维数的正则化项,h s是句子对应的分布式表示,W p和b p是可训练参数。如文章“J.Xu and G.Durrett.Spherical latent spaces for stable variational autoencoders.In conference of the EMNLP,2018.”所述,该分布使得相似度计算变得更容易和更健壮。类似地,抽取的答案h a也被映射到潜在变量z a。通过拼接获得每个测试样例在潜在空间上的分布式表示z=[z s;z a]。
S8.在潜在空间中的相似度计算:
具体的,给定两个评测样本(S i,A i)和(S j,A j),本发明首先把评测样本映射到潜在空间中,然后在该空间中计算样本对应的潜在变量分布的相对熵(KL divergence),进而衡量样本之间的相似度,即:
KL(p(z i|S i,A i)||p(z j|S j,A j))
考虑到z是vMF分布,它对应的相对熵通过进一步用“T.B.Hashimoto,K.Guu,Y.Oren,and P.S.Liang.A retrieve-and-edit framework for predicting structured outputs.In conference of the NIPS,2018.”所述的数学推导获得公式(6),具体为:
Figure PCTCN2020094677-appb-000038
其中,μ是vMF分布的方向向量,κ和d是常量,C κ=κI d/2(κ)/(2I d/2-1(κ)),I d表示d阶的修正贝塞尔函数(Bessel function)。
然后,通过相似度得到跨语言提问生成模型;
检索器从数据中自动学习获得映射函数并计算对应的相似度,训练的目标跨语言提问生成模型为:
p(Q|S,A)=∑p r((S',A',Q')|S,A)p m(Q|S,A,(S',A',Q'))
其中,p r(·|S,A)表示从D en和D non标注数据集中检索出相似的样本(S',A',Q');p m(·)表示是指元学习者根据检索到的结果来生成提问。如果采用例如通过联合学习最大化边际似然概率的简单训练方法,会导致难以计算,因此本发明单独训练检索器。
具体地,假设有先验的元提问生成器在给定的目标输入(S,A)上提供了提问Q的条件概率分布,并基于联合分布概率:p r((S',A',Q')|S,A)p data(S,A,Q)提供了对应的检索样本;基于该假设,利用数学推导得到这个元提问生成器的优化函数下界,参考以下的公式(7):
log p(Q|S,A)≥E Q~p(Q|S,A)log p(Q|z)-8C κ--公式(7)
其中,p(Q|z)是门控循环神经网络(GRU)解码器,用于基于潜在变量z来预测生成提问Q。优化函数下界E Q~p(Q|S,A)log p(Q|z)通过文章“T.R.Davidson,L.Falorsi,N.De Cao,T.Kipf,and J.M.Tomczak.Hyperspherical variational auto-encoders.In conference of the UAI,2018.”提出的重参数梯度优化的数学方法来计算。
S5.源语言文本和目标语言文本通过跨语言提问生成模型得到样本,样本能够从源语言标注数据集得出相似样本,对每个样本建立伪任务。
首先通过检索器为每个目标语言的测试样例建立伪任务,然后通过元学习基于所有伪任务来训练跨语言的提问生成模型,其中元学习包括元训练和元测试两个迭代步骤。通过少量的几个相似样本对模型进行微调,就能够获得优化后的模型,能有效捕捉样本的多样性,有针对性且快速地在新的测试任务中输出较好的结果。
然后,在伪任务上对跨语言提问生成模型进行基于所述步骤S3的基础提问生成模型的元训练和元测试,输出经过训练的跨语言提问生成模型。
具体的,本发明将目标语言数据集D non中每个测试样例作为单个元任务T i的测试集
Figure PCTCN2020094677-appb-000039
通过从标注数据集D en和D non中利用检索其获得前K个相似的样本作为伪任务T i的伪训练集。即伪任务可以记做
Figure PCTCN2020094677-appb-000040
基于以上伪任务集
Figure PCTCN2020094677-appb-000041
本发明首先随机抽取一个伪任务,并用于训练以上所述的基础跨语言提问生成模型M θ,其中θ表示模型参数。新的模型参数θ'可以通过梯度更新获得,即θ'=U m(θ;α),其中U(·)表示梯度更新操作,m表示更新次数,α表示用于最小化模型学习目标损失函数L θ的学习率。单次的更新操作可以参考公式(8)。
Figure PCTCN2020094677-appb-000042
由于以上模型优化目标函数中的非连续损失函数是不可微不可导,因此使用了自临界策略梯度训练算法来训练模型。自临界策略梯度训练算法在文章”S.J.Rennie,E.Marcheret,Y.Mroueh,J.Ross,and V.Goel 2017.Self-Critical Sequence Training for Image Captioning.In Proceedings of the CVPR"中提出,是一种业界高效的强化学习方法。具体地,该算法把非连续的强化学习损失函数转换成
Figure PCTCN2020094677-appb-000043
其中Q b表示基准方法的输出序列结 果,该基准方法通过一种局部最优的方式生成训练,即使用贪婪算法每次生成概率最大的词;Q s是生成器M θ所输出的序列结果,每个词
Figure PCTCN2020094677-appb-000044
通过采用公式(3)的概率值来获得。通过最小化该损失函数就能优化模型,让其生成比基准方法分值更高的序列。
经过元训练后获得更新后的参数θ i',本发明利用伪任务
Figure PCTCN2020094677-appb-000045
来评估该参数的损失误差值。基于该误差,本发明进一步地利用所有的伪任务来训练提问生成模型M θ,即最小化的损失误差
Figure PCTCN2020094677-appb-000046
通过以β的学习率进行一阶的梯度更新,能获得
Figure PCTCN2020094677-appb-000047
为了减少计算成本,可以通过一阶近似简化了单位矩阵
Figure PCTCN2020094677-appb-000048
进而以公式(9)来做元更新操作。
Figure PCTCN2020094677-appb-000049
通过对所有的伪任务进行迭代学习,能够获得最优的生成模型
Figure PCTCN2020094677-appb-000050
它对不同伪任务之间的变化更为敏感,这有助于学习出任务共同的内部表征,而不是单个任务的特征。因此,只需在较少数据上进行一个或少量几个微调操作即可获得具有样本针对性的模型,从而既不过度拟合又能获得较高的性能。
给出了一个目标语言的新的测试样例,本发明先为其构建伪任务,即视其为一个伪任务T j的测试集
Figure PCTCN2020094677-appb-000051
然后利用检索器从标注数据D en和D non获得前K个相似样本来构造一个伪训练集
Figure PCTCN2020094677-appb-000052
随后,通过最小化公式(4)的损失函数,并以γ的学习率对元模型
Figure PCTCN2020094677-appb-000053
进行一次梯度更新,从而获得最优的模型,然后使用该模型对给定测试样例生成结果。
一种跨语言生成提问的装置,如图4所示,包括:依次执行的输入模块、编码器、注意力机制模块、解码器、上下文关联检索器和元学习模块;
输入模块用于获取源语言标注数据集、目标语言标注数据集、源语言文本和目标语言文本;
编码器用于将答案和答案对应的句子进行编码,得到答案向量和句子向量;
注意力机制模块用于处理答案向量生成上下文向量;
解码器用于处理上下文向量得到基础提问生成模型;
上下文关联检索器用于计算源语言文本和目标语言文本的相似度,得到跨语言提问生成模型并输出样本;
元学习模块用于对每个样本建立伪任务,在伪任务上对跨语言提问生成模型进行基于基础提问生成模型的元训练和元测试,输出经过训练的跨语言提问生成 模型。
在本实施例中,跨语言生成提问的装置还包括评估单元,在评估单元输出基础提问生成模型后,由评估单元对基础提问生成模型进行评分并进一步调整基础提问生成模型的加权参数,当评分不再提高时,将基础提问生成模型输入到上下文关联检索器中。
为了衡量模型的性能,申请人使用当前主流的三种数据集进行了实验,包括简体中文的CMRC数据集、繁体中文的DRCD数据集和韩国语的KorQuAD数据集。其中CMRC数据集由文章("Y.Cui,T.Liu,W.Che,L.Xiao,Z.Chen,and et al.A span-extraction dataset for chinese machine reading comprehension.In conference of the EMNLP-IJCNLP,2019.")提出;DRCD数据集由文章("C.Chieh Shao,T.Liu,Y.Lai,Y.Tseng,and S.Tsai.DRCD:a chinese machine reading comprehension dataset.In arXiv prePrint:1806.00920,2018.")提出;KorQuAD数据集由文章("S.Lim,M.Kim,and J.Lee.Korquad1.0:Korean qa dataset for machine reading comprehension.In arXiv prePrint:1909.07005,2019.")提出。这三个数据集分别被切分为训练/验证集,样本数量分别为10k/3.3k、27k/3.5k和60k/5.7k;在dev集上测试了所有的评估。另外,英语作为源语言,对应的数据集是Squad1.1。该数据集由文章("P.Rajpurkar,J.Zhang,K.Lopyrev,and P.Liang.SQuAD:100,000+questions for machine comprehension of text.In conference of the EMNLP,2016.")提出,包含90k标注样本。以上所有的数据集都属于同一领域,即由维基百科领域的众包构建的。本发明使用三种传统指标方法来衡量生成的提问的质量,包括BLEU-4、METEOR和ROUGE-L。其中指标BLEU-4由论文提出(“Kishore Papineni,Salim Roukos,Todd Ward,and Wei-Jing Zhu.2002.Bleu:a method for automatic evaluation of machine translation.In Proceedings of the 40th ACL”);METEOR由论文提出(“Kishore Papineni,Salim Roukos,Todd Ward,and Wei-Jing Zhu.2002.Bleu:a method for automatic evaluation of machine translation.In Proceedings of the 40th ACL”);ROUGE-L由论文提出(“Chin-Yew Lin.2004.ROUGE:A package for automatic evaluation of summaries.In Text Summarization Branches Out”)。实验结果表明,本发明的方法明显地优于传统方法。
显然,本发明的上述实施例仅仅是为清楚地说明本发明所作的举例,而并非是对本发明的实施方式的限定。对于所属领域的普通技术人员来说,在上述说明 的基础上还可以做出其它不同形式的变化或变动。这里无需也无法对所有的实施方式予以穷举。凡在本发明的精神和原则之内所作的任何修改、等同替换和改进等,均应包含在本发明权利要求的保护范围之内。

Claims (13)

  1. 一种跨语言生成提问的方法,其特征在于,包括以下步骤:
    S1.获取源语言标注数据集和目标语言标注数据集,建立用于提问生成的概率分布;
    S2.获取源语言文本和目标语言文本,抽取答案和答案对应的句子,将答案和答案对应的句子进行编码,得到答案向量和句子向量;
    S3.答案向量通过注意力机制生成上下文向量,基于上下文向量得到基础提问生成模型;
    S4.计算源语言文本和目标语言文本的相似度,通过相似度得到跨语言提问生成模型;
    S5.源语言文本和目标语言文本通过跨语言提问生成模型得到样本,所述样本能够从源语言标注数据集得出相似样本,对每个样本建立伪任务,在伪任务上对跨语言提问生成模型进行基于所述步骤S3的基础提问生成模型的元训练和元测试,输出经过训练的跨语言提问生成模型。
  2. 根据权利要求1所述的跨语言生成提问的方法,其特征在于,所述步骤S1的源语言标注数据集的规模大于目标语言标注数据集的规模。
  3. 根据权利要求1所述的跨语言生成提问的方法,其特征在于,在所述步骤S2中,使用指针网络从所述源语言标注数据集抽取答案,并使用标记屏蔽答案对应的句子中的答案。
  4. 根据权利要求1所述的跨语言生成提问的方法,其特征在于,在所述步骤S2中,通过多语言BERT将源语言文本和目标语言文本映射到共同空间后编码。
  5. 根据权利要求1所述的跨语言生成提问的方法,其特征在于,在所述步骤S3中,得到并对基于门控循环神经网络的概率分布、基于注意力分值构建的概率分布和基于前馈式神经网络的概率分布分别进行加权,得到平均的概率分布作为基础提问生成模型。
  6. 根据权利要求1所述的跨语言生成提问的方法,其特征在于,在所述步骤S3完成以后,基于有监督指标训练基础提问生成模型。
  7. 根据权利要求6所述的跨语言生成提问的方法,其特征在于,所述有监督指标包括流畅度、可解答和语义关联。
  8. 根据权利要求1所述的跨语言生成提问的方法,其特征在于,在所述步 骤S4中,首先通过循环正态分布将源语言文本、目标语言文本和所述步骤S2获得的答案映射到潜在变量,通过拼接潜在变量获得答案对应的样本在潜在空间的单元向量,随后经过推导得出样本对应的单元向量的相对熵作为相似度。
  9. 根据权利要求1所述的跨语言生成提问的方法,其特征在于,在所述步骤S5中,将目标语言文本中每个样本作为测试集,通过从源语言标注数据集和目标语言标注数据集中获得相似样本作为训练集,测试集和训练集共同构成伪任务的数据集。
  10. 根据权利要求1所述的跨语言生成提问的方法,其特征在于,在所述步骤S5中,元训练的具体过程如下:
    随机抽取伪任务直至遍历所有伪任务,通过自临界策略梯度训练算法训练跨语言提问生成模型并更新跨语言提问生成模型的参数。
  11. 根据权利要求10所述的跨语言生成提问的方法,其特征在于,元测试的具体过程如下:
    在更新跨语言提问生成模型的参数后,评估参数的损失误差值,基于损失误差值进一步更新跨语言提问生成模型的参数。
  12. 一种基于权利要求1所述的跨语言生成提问的方法的装置,其特征在于,包括:依次执行的输入模块、编码器、注意力机制模块、解码器、上下文关联检索器和元学习模块;
    输入模块用于获取源语言标注数据集、目标语言标注数据集、源语言文本和目标语言文本;
    编码器用于将答案和答案对应的句子进行编码,得到答案向量和句子向量;
    注意力机制模块用于处理答案向量生成上下文向量;
    解码器用于处理上下文向量得到基础提问生成模型;
    上下文关联检索器用于计算源语言文本和目标语言文本的相似度,得到跨语言提问生成模型并输出样本;
    元学习模块用于对每个样本建立伪任务,在伪任务上对跨语言提问生成模型进行基于基础提问生成模型的元训练和元测试,输出经过训练的跨语言提问生成模型。
  13. 根据权利要求12所述的装置,其特征在于,所述装置还包括评估单元,在评估单元输出基础提问生成模型后,由评估单元对基础提问生成模型进行评分 并进一步调整基础提问生成模型的加权参数,当评分不再提高时,将基础提问生成模型输入到上下文关联检索器中。
PCT/CN2020/094677 2020-06-05 2020-06-05 一种跨语言生成提问的方法和装置 WO2021243706A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/094677 WO2021243706A1 (zh) 2020-06-05 2020-06-05 一种跨语言生成提问的方法和装置

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/094677 WO2021243706A1 (zh) 2020-06-05 2020-06-05 一种跨语言生成提问的方法和装置

Publications (1)

Publication Number Publication Date
WO2021243706A1 true WO2021243706A1 (zh) 2021-12-09

Family

ID=78830047

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/094677 WO2021243706A1 (zh) 2020-06-05 2020-06-05 一种跨语言生成提问的方法和装置

Country Status (1)

Country Link
WO (1) WO2021243706A1 (zh)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116089589A (zh) * 2023-02-10 2023-05-09 阿里巴巴达摩院(杭州)科技有限公司 问句生成方法及装置
CN116303974A (zh) * 2023-05-04 2023-06-23 之江实验室 基于目标生成式回应语言模型的回应方法和装置
CN116432752A (zh) * 2023-04-27 2023-07-14 华中科技大学 一种隐式篇章关系识别模型的构建方法及其应用
CN117235243A (zh) * 2023-11-16 2023-12-15 青岛民航凯亚系统集成有限公司 民用机场大语言模型训练优化方法及综合服务平台
CN117271751A (zh) * 2023-11-16 2023-12-22 北京百悟科技有限公司 交互方法、装置、设备和存储介质
CN117389541A (zh) * 2023-12-13 2024-01-12 中国人民解放军国防科技大学 基于对话检索生成模板的配置系统及设备

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106776583A (zh) * 2015-11-24 2017-05-31 株式会社Ntt都科摩 机器翻译评价方法和设备及机器翻译方法和设备
US9798653B1 (en) * 2010-05-05 2017-10-24 Nuance Communications, Inc. Methods, apparatus and data structure for cross-language speech adaptation
CN110134771A (zh) * 2019-04-09 2019-08-16 广东工业大学 一种基于多注意力机制融合网络问答系统的实现方法
CN111078853A (zh) * 2019-12-13 2020-04-28 上海智臻智能网络科技股份有限公司 问答模型的优化方法、装置、计算机设备和存储介质

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9798653B1 (en) * 2010-05-05 2017-10-24 Nuance Communications, Inc. Methods, apparatus and data structure for cross-language speech adaptation
CN106776583A (zh) * 2015-11-24 2017-05-31 株式会社Ntt都科摩 机器翻译评价方法和设备及机器翻译方法和设备
CN110134771A (zh) * 2019-04-09 2019-08-16 广东工业大学 一种基于多注意力机制融合网络问答系统的实现方法
CN111078853A (zh) * 2019-12-13 2020-04-28 上海智臻智能网络科技股份有限公司 问答模型的优化方法、装置、计算机设备和存储介质

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
KONG, LINGYU: "Overview of Cross-Language Question Answering System", JOURNAL OF MODERN INFORMATION, no. 10, 31 October 2008 (2008-10-31), pages 53 - 56, XP055878077 *
QUAN ZHE; WANG ZHI-JIE; LE YUQUAN; YAO BIN; LI KENLI; YIN JIAN: "An Efficient Framework for Sentence Similarity Modeling", IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, vol. 27, no. 4, 1 April 2019 (2019-04-01), USA, pages 853 - 865, XP011714650, ISSN: 2329-9290, DOI: 10.1109/TASLP.2019.2899494 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116089589A (zh) * 2023-02-10 2023-05-09 阿里巴巴达摩院(杭州)科技有限公司 问句生成方法及装置
CN116089589B (zh) * 2023-02-10 2023-08-29 阿里巴巴达摩院(杭州)科技有限公司 问句生成方法及装置
CN116432752A (zh) * 2023-04-27 2023-07-14 华中科技大学 一种隐式篇章关系识别模型的构建方法及其应用
CN116432752B (zh) * 2023-04-27 2024-02-02 华中科技大学 一种隐式篇章关系识别模型的构建方法及其应用
CN116303974A (zh) * 2023-05-04 2023-06-23 之江实验室 基于目标生成式回应语言模型的回应方法和装置
CN117235243A (zh) * 2023-11-16 2023-12-15 青岛民航凯亚系统集成有限公司 民用机场大语言模型训练优化方法及综合服务平台
CN117271751A (zh) * 2023-11-16 2023-12-22 北京百悟科技有限公司 交互方法、装置、设备和存储介质
CN117271751B (zh) * 2023-11-16 2024-02-13 北京百悟科技有限公司 交互方法、装置、设备和存储介质
CN117389541A (zh) * 2023-12-13 2024-01-12 中国人民解放军国防科技大学 基于对话检索生成模板的配置系统及设备
CN117389541B (zh) * 2023-12-13 2024-02-23 中国人民解放军国防科技大学 基于对话检索生成模板的配置系统及设备

Similar Documents

Publication Publication Date Title
WO2021243706A1 (zh) 一种跨语言生成提问的方法和装置
CN112214995B (zh) 用于同义词预测的分层多任务术语嵌入学习
Tan et al. Neural machine translation: A review of methods, resources, and tools
WO2022036616A1 (zh) 一种基于低标注资源生成可推理问题的方法和装置
Gao et al. A review on cyber security named entity recognition
Qing-dao-er-ji et al. Research on the LSTM Mongolian and Chinese machine translation based on morpheme encoding
Peng et al. Sequence-to-sequence models for cache transition systems
Zhang et al. I know what you want: Semantic learning for text comprehension
van der Heijden et al. A comparison of architectures and pretraining methods for contextualized multilingual word embeddings
Sharath et al. Question answering over knowledge base using language model embeddings
Song et al. A method for identifying local drug names in xinjiang based on BERT-BiLSTM-CRF
Li et al. Unifying model explainability and robustness for joint text classification and rationale extraction
Jia et al. Span-based semantic role labeling with argument pruning and second-order inference
Ma et al. Multi-teacher knowledge distillation for end-to-end text image machine translation
Xue et al. A method of chinese tourism named entity recognition based on bblc model
Li et al. Incorporating translation quality estimation into chinese-korean neural machine translation
Gao et al. ERGM: A multi-stage joint entity and relation extraction with global entity match
Yang et al. Bidirectional relation-guided attention network with semantics and knowledge for relational triple extraction
Acharjee et al. Sequence-to-sequence learning-based conversion of pseudo-code to source code using neural translation approach
Xu Multi-region English translation synchronization mechanism driven by big data
CN113986251A (zh) 基于卷积和循环神经网络的gui原型图转代码方法
Feng et al. Improved neural machine translation with pos-tagging through joint decoding
Biswas et al. Is aligning embedding spaces a challenging task? a study on heterogeneous embedding alignment methods
Cao et al. Predict, pretrained, select and answer: Interpretable and scalable complex question answering over knowledge bases
Wang et al. Chinese Text Implication Recognition Method based on ERNIE-Gram and CNN

Legal Events

Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 20939034; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established (Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 02.05.2023))
122 Ep: pct application non-entry in european phase (Ref document number: 20939034; Country of ref document: EP; Kind code of ref document: A1)