WO2024036840A1 - Open-domain dialogue reply method and system based on topic enhancement - Google Patents

Open-domain dialogue reply method and system based on topic enhancement

Info

Publication number
WO2024036840A1
Authority
WO
WIPO (PCT)
Prior art keywords: sentence, dialogue, topic, reply, enhancement
Prior art date
Application number
PCT/CN2022/139320
Other languages
English (en)
French (fr)
Inventor
李太豪
黄剑韬
Original Assignee
之江实验室
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 之江实验室
Priority to US18/297,610 (US12019989B2)
Publication of WO2024036840A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G06F40/35 Discourse or dialogue representation
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F40/268 Morphological analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Definitions

  • This application belongs to the field of artificial intelligence and relates to an open domain dialogue reply method and system based on topic enhancement.
  • Open-domain dialogue reply generation is a challenging task in natural language processing. Open-domain dialogue refers to general chatting without limiting the domain. Currently, artificial intelligence has made major breakthroughs in task-based dialogue reply tasks. However, in open-domain dialogue replies, it is impossible to control changes in user intentions, so the model needs to have stronger generalization capabilities and robustness.
  • Systems for generating dialogue replies currently fall into two modes.
  • One is based on a retrieval model that finds replies with similar content in a specific database or corpus.
  • Many knowledge question-answering or task-oriented dialogue systems use this retrieval model.
  • In open-domain chat, however, there is no specific corpus to query, so the effect of this retrieval model often falls short of expectations.
  • GPT: Generative Pre-trained Transformer
  • T5: Text-to-Text Transfer Transformer
  • BART: Bidirectional and Auto-Regressive Transformers
  • the present application provides an open domain conversation reply method and system based on topic enhancement.
  • An open domain conversation reply method based on topic enhancement including the following steps:
  • Enhance the semantic and topic information of each sentence of dialogue and then use the pre-trained sentence representation model to learn the vector representation of the original sentence and the enhanced sentence;
  • the sentence vectors enhanced by topic aggregation are input into the pre-trained generative model GPT, and the decoding strategy of beam search is used to generate a candidate set of dialogue replies. Finally, the contrastive learning method is used to train the reply ranking selection model to select the most suitable reply.
  • Collecting open-source Chinese open-domain dialogue text corpora and preprocessing them to obtain a Chinese dialogue corpus data set includes: collecting the open-source corpora through a web crawler, then filtering and cleaning the data to obtain the Chinese dialogue data set.
  • Using the public natural language processing toolkit HanLP to perform sentence segmentation, word segmentation and part-of-speech tagging of the dialogue, and using regular expressions to extract noun words, includes: using HanLP to split each dialogue in the Chinese dialogue corpus data set into sentences, obtaining m dialogue sentences {S_1, S_2, S_3, ..., S_m}.
  • Each sentence is then segmented into words, obtaining n words {t_1, t_2, t_3, ..., t_n}. Each word t_x is assigned a part-of-speech tag through part-of-speech classification, and regular expressions are used to extract all words of a noun nature, i.e. selecting from the part-of-speech categories adjectives with noun function, nouns, person names, place names, organization names, and proper nouns.
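As a concrete illustration of this step, the tag filtering can be sketched as follows. The POS tagger itself (e.g. HanLP) is assumed to have already produced (word, tag) pairs; the example sentence and its tags are hypothetical:

```python
import re

# PKU-style part-of-speech tags treated as noun-natured in the text:
# an (adjective with noun function), n (noun), nr (person name),
# ns (place name), nt (organization name), nz (proper noun).
NOUN_TAG_PATTERN = re.compile(r"^(an|n|nr|ns|nt|nz)$")

def extract_noun_words(tagged_words):
    """Keep the words whose POS tag matches one of the noun-natured tags.

    `tagged_words` is a list of (word, tag) pairs, standing in for the
    output of a POS tagger such as HanLP.
    """
    return [word for word, tag in tagged_words if NOUN_TAG_PATTERN.match(tag)]

# Hypothetical tagged sentence: "浙江 的 博物馆 很 有趣"
tagged = [("浙江", "ns"), ("的", "u"), ("博物馆", "n"), ("很", "d"), ("有趣", "a")]
print(extract_noun_words(tagged))  # ['浙江', '博物馆']
```

The anchored regex mirrors the patent's use of regular expressions over tag strings; in practice a set-membership test over the six tags would work equally well.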
  • each sentence of dialogue is enhanced with semantic and topic information, and then a pre-trained sentence representation model is used to learn the vector representation of the original sentence and the enhanced sentence, including:
  • The semantic enhancement methods include: 1) using a Chinese synonym dictionary to randomly replace phrases in the dialogue text with synonyms; 2) randomly swapping the positions of adjacent phrases in the dialogue text; 3) randomly repeating or deleting non-noun phrases in the dialogue text; or 4) using the SimBERT model to rewrite the dialogue text.
  • The topic-information enhancement methods include: 1) using a large-scale word vector model to obtain words similar to the nouns or noun phrases, and replacing the nouns or noun phrases in the original dialogue text with these similar words; or 2) randomly repeating a noun phrase in the dialogue text multiple times.
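The augmentation operations above can be sketched as simple token-level transformations. This is a minimal illustration under stated assumptions, not the application's implementation: the synonym dictionary and noun markings are hypothetical stand-ins, and probabilities are arbitrary:

```python
import random

def random_synonym_replace(tokens, synonyms, p=0.2):
    """Semantic enhancement 1): replace tokens with a random synonym.
    `synonyms` is a stand-in for the Chinese synonym dictionary."""
    return [random.choice(synonyms[t]) if t in synonyms and random.random() < p else t
            for t in tokens]

def random_adjacent_swap(tokens):
    """Semantic enhancement 2): swap one random pair of adjacent tokens."""
    out = list(tokens)
    if len(out) > 1:
        i = random.randrange(len(out) - 1)
        out[i], out[i + 1] = out[i + 1], out[i]
    return out

def random_repeat_or_delete(tokens, noun_flags, p=0.2):
    """Semantic enhancement 3): randomly repeat or delete non-noun tokens.
    `noun_flags` marks which positions are nouns (left untouched)."""
    out = []
    for tok, is_noun in zip(tokens, noun_flags):
        if is_noun or random.random() >= p:
            out.append(tok)
        elif random.random() < 0.5:
            out.extend([tok, tok])  # repeat the non-noun token
        # else: delete it (append nothing)
    return out

def topic_repeat_nouns(tokens, noun_flags, times=2):
    """Topic enhancement 2): repeat noun tokens (here a fixed number of times)."""
    out = []
    for tok, is_noun in zip(tokens, noun_flags):
        out.extend([tok] * times if is_noun else [tok])
    return out
```

Each function produces one augmented view of a sentence; in the method these views become the enhanced-sentence nodes that are later aggregated by the graph network.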
  • the graph convolutional neural network is used to extract semantic and topic information of dialogue sentences, and topic aggregation enhancement processing is performed to obtain a sentence vector after topic aggregation enhancement, including:
  • a directed graph is constructed using the original dialogue text and the enhanced dialogue text.
  • The aggregation enhancement processing is as follows: two kinds of relations exist in the directed graph.
  • First-order direct adjacency, represented by the adjacency matrix A, indicates that two nodes are connected by an edge, i.e. an original sentence and a directly adjacent enhanced sentence.
  • Second-order indirect adjacency, represented by the adjacency matrix A′, indicates that two nodes are not directly connected by an edge but share a common adjacent node.
  • After normalizing A and A′ through their degree matrices to obtain N and N′, each aggregated sentence vector is computed as H_{l+1} = σ(W(αN + βN′)H_l + b),
  • where H_l represents the original sentence vector before topic enhancement,
  • W and b represent the weights of a linear transformation,
  • and α and β are learnable parameters.
  • The topic-aggregation-enhanced sentence vectors are input into the pre-trained generative model GPT, a beam search decoding strategy is used to generate a dialogue reply candidate set, and finally a contrastive learning method is used to train the reply ranking selection model to select the most appropriate reply, including:
  • the obtained topic aggregation enhanced sentence vector and the original sentence vector are spliced together and input into the pre-trained generative model GPT.
  • Beam Search is used to generate a candidate set of dialogue replies;
  • the contrastive learning method is used to train the reply ranking selection model to obtain the most suitable reply for the original sentence.
  • Using contrastive learning to train the reply ranking selection model to obtain the most suitable reply for the original sentence includes: constructing positive and negative examples from the open-domain Chinese dialogue data collected through web crawlers, taking the preceding and following text of the same conversation as a positive example and the preceding text of the conversation paired with the replies of other conversations as negative examples, and training the reply ranking selection model to determine whether a reply is suitable, including: splicing the preceding and following texts together in pairs, inputting them into the pre-trained BERT model, and then taking out the vector S_i corresponding to the [CLS] token in the BERT model output for classification.
  • The loss function for training the reply ranking selection model is a contrastive loss of the form
  • L_i = -log [ exp(sim(S_i^1, S_i^2)) / ( exp(sim(S_i^1, S_i^2)) + Σ_{j=1}^{N} exp(sim(S_i^1, S_j^2)) ) ],
  • where S_i^1 represents the preceding sentence of conversation i, S_i^2 represents the following sentence of conversation i that replies to S_i^1, S_j^2 represents the reply (following sentence) of another conversation j, and N represents the number of other conversation sentences.
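The contrastive ranking objective described above can be illustrated numerically. Since the exact loss appears only as an embedded image in the source, this InfoNCE-style sketch is an assumption rather than the patent's literal formula:

```python
import math

def contrastive_rank_loss(pos_score, neg_scores):
    """Contrastive loss over matching scores.

    `pos_score` is the model's score for (context, true reply);
    `neg_scores` are scores for (context, replies from other dialogues).
    The loss is low when the positive pair scores far above the negatives.
    """
    denom = math.exp(pos_score) + sum(math.exp(s) for s in neg_scores)
    return -math.log(math.exp(pos_score) / denom)

# A confident model (positive scored well above the negatives) gets lower loss.
confident = contrastive_rank_loss(5.0, [0.0, 0.1])
unsure = contrastive_rank_loss(0.0, [0.0, 0.1])
print(confident < unsure)  # True
```

Minimizing this quantity pulls the positive (context, reply) pair together and pushes the negative pairs apart, which matches the stated training goal.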
  • An open domain conversation reply system based on topic enhancement including:
  • the text collection module based on web crawlers, is used to collect Chinese open domain conversation text corpus and filter and clean the data;
  • The word segmentation and part-of-speech tagging module is used for sentence and word segmentation; based on the role each word or phrase plays in the syntactic structure or language morphology, it assigns a part-of-speech tag to each word through part-of-speech classification, and then extracts words of a noun nature through regular expressions;
  • The semantic and topic enhancement module is used to let the model better learn sentence semantic representations by performing semantic and topic data enhancement on the original sentences, including: 1) random synonym replacement, 2) random swapping of adjacent words, 3) random deletion or repetition of non-noun phrases, 4) sentence rewriting with SimBERT, 5) synonym replacement of nouns using a word vector model, or 6) random repetition of noun phrases;
  • the text encoding module uses the pre-trained sentence representation model to obtain the vector representation of the original sentence and the enhanced sentence, and then uses the graph convolutional neural network to obtain the topic-enhanced sentence vector representation by aggregating the data-enhanced sentence vector representation;
  • The sentence ranking module based on contrastive learning takes the preceding and following text of the same conversation as positive examples and the preceding text of this conversation paired with the reply of another conversation as negative examples, and trains the reply ranking selection model to screen out the most suitable reply text;
  • The reply generation module inputs the topic-enhanced sentence vector representation obtained by the graph convolutional neural network as a prompt into the pre-trained generative model GPT, uses beam search to generate a topic-related reply candidate set, and then uses the previously trained sentence ranking module to rank and filter the candidates to find the most suitable reply.
  • Figure 1 is a block diagram of a topic-enhanced open domain dialogue reply system according to an embodiment of the present application
  • Figure 2 is a schematic flow chart of an open domain dialogue reply method based on topic enhancement according to an embodiment of the present application
  • Figure 3 is a schematic structural diagram of an open domain dialogue reply device based on topic enhancement according to an embodiment of the present application.
  • Reference to an embodiment means that a particular feature, structure or characteristic described in connection with the embodiment may be included in at least one embodiment of the application.
  • The appearances of this phrase in various places in the specification do not necessarily all refer to the same embodiment, nor to separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by those of ordinary skill in the art that the embodiments described in this application may be combined with other embodiments where there is no conflict.
  • the topic-enhanced open domain dialogue reply system of this application includes:
  • the text collection module based on web crawlers, is used to collect Chinese open domain conversation text corpus and filter and clean the data;
  • The word segmentation and part-of-speech tagging module is used for sentence and word segmentation; based on the role each word or phrase plays in the syntactic structure or language morphology, it assigns a part-of-speech tag to each word through part-of-speech classification, and then extracts words of a noun nature through regular expressions;
  • The semantic and topic enhancement module is used to let the model better learn sentence semantic representations by enhancing the semantic and topic data of the original sentences in the following ways: 1) random synonym replacement, 2) random swapping of adjacent words, 3) random deletion or repetition of non-noun phrases, 4) sentence rewriting with SimBERT, 5) synonym replacement of nouns using a word vector model, or 6) random repetition of noun phrases;
  • the text encoding module uses the pre-trained sentence representation model to obtain the vector representation of the original sentence and the enhanced sentence, and then uses the graph convolutional neural network to obtain the topic-enhanced sentence vector representation by aggregating the data-enhanced sentence vector representation;
  • The sentence ranking module based on contrastive learning takes the preceding and following text of the same conversation as positive examples and the preceding text of this conversation paired with the reply of another conversation as negative examples, and trains the reply ranking selection model to screen out the most suitable reply text;
  • The reply generation module inputs the topic-enhanced sentence vector representation obtained by the graph convolutional neural network as a prompt into the pre-trained generative model GPT, uses beam search to generate a topic-related reply candidate set, and then uses the previously trained sentence ranking module to rank and filter the candidates to find the most suitable reply.
  • The open-domain dialogue reply system of this application, based on contrastive learning, graph convolutional neural networks and topic enhancement, uses semantic and topic enhancement and aggregates it through the graph convolutional network to generate a reply candidate set with topic consistency.
  • Contrastive learning is used to optimize the reply ranking selection model to ensure the generation of reply content that has both topic consistency and semantic fluency.
  • the topic-enhanced open domain dialogue reply method of this application includes the following steps:
  • Step 1 Collect open source Chinese open-domain dialogue text corpus and preprocess it to obtain a dialogue corpus data set.
  • Chinese open-domain conversation text corpora are collected through web crawlers, including the Weibo corpus, the Douban conversation corpus, and the Baidu Tieba conversation corpus. The corpora are filtered and cleaned, finally yielding nearly 3 million dialogue records.
  • Step 2: Use the public natural language processing toolkit HanLP to perform sentence segmentation, word segmentation and part-of-speech tagging of the dialogue, and use regular expressions to extract noun words.
  • Each word t_x (1 ≤ x ≤ n) is classified by part of speech; the categories an (adjective with noun function), n (noun), nr (person name), ns (place name), nt (organization name) and nz (proper noun) are selected, and regular expressions are used to extract all words matching these parts of speech.
  • Step 3 Enhance the semantic and topic information of each sentence of dialogue, and then use the pre-trained sentence representation model to learn the vector representation of the original sentence and the enhanced sentence, including the following sub-steps:
  • Step 3.1: To better let the network model learn sentence semantic representations, perform semantic data enhancement on each dialogue sentence S_y (1 ≤ y ≤ m), including: 1) using a Chinese synonym dictionary to randomly replace phrases in the dialogue text (the sentences of the dialogue) with synonyms; 2) randomly swapping the positions of adjacent phrases in the dialogue text; 3) randomly repeating or deleting non-noun phrases in the dialogue text; or 4) using the SimBERT model to rewrite the dialogue text.
  • Step 3.2: If nouns or noun phrases are found in the dialogue sentences by the part-of-speech tagging model, then in addition to the sentence semantic enhancement of step 3.1, the extracted nouns are also used to enhance the topic information, including: 1) using a large-scale word vector model to obtain words similar to these nouns or noun phrases, and replacing the nouns or noun phrases in the original dialogue text (original sentence) with these similar words; or 2) randomly repeating noun phrases in the dialogue text multiple times.
  • Step 3.3: After obtaining the dialogue text with enhanced semantic and topic information, apply the methods of steps 2 to 3 above to the enhanced dialogue text once more, to ensure the richness of the enhanced semantics and topics.
  • Step 3.4: Then use the pre-trained sentence representation model RoBERTa to learn the vector representations of the original sentence and the enhanced sentences: input a sentence into the RoBERTa model and extract the vector corresponding to the [CLS] token in the model output as the vector representation of that sentence.
  • Step 4 Use the graph convolutional neural network to extract the semantic and topic information of the dialogue sentences, and perform topic aggregation enhancement processing to obtain the topic aggregation enhanced sentence vector, including the following sub-steps:
  • Step 4.1: First, construct a directed graph from the original dialogue text and the enhanced dialogue text: node ν_O on the graph represents the encoded original sentence, and ν_A represents the set of enhanced sentences; each enhanced sentence has an edge ε pointing to the original sentence, finally yielding a directed graph G = (ν, ε).
  • Step 4.2 After constructing the directed graph G, the next step is to use the graph convolutional neural network to perform semantic and topic aggregation enhancement processing on the original sentence along the edge direction.
  • The specific operations are as follows: two kinds of relations exist in the directed graph G. First-order direct adjacency, represented by the adjacency matrix A, indicates that two nodes are connected by an edge, i.e. an original sentence and a directly adjacent enhanced sentence; second-order indirect adjacency, represented by the adjacency matrix A′, indicates that two nodes are not directly connected by an edge but share a common adjacent node.
  • After normalizing A and A′ through their degree matrices to obtain N and N′, each sentence vector enhanced through first-order and second-order adjacency is computed as H_{l+1} = σ(W(αN + βN′)H_l + b),
  • where H_l represents the original sentence vector before topic enhancement,
  • W and b represent the weights of a linear transformation,
  • and α and β are learnable parameters that control the influence of the first-order and second-order adjacent enhanced sentences on the topic enhancement.
  • Step 5 Input the sentence vector enhanced by topic aggregation into the pre-trained generative model GPT, use the decoding strategy of beam search to generate a candidate set of dialogue replies, and finally use the contrastive learning method to train the reply ranking selection model to select the most suitable reply.
  • Step 5.1: Concatenate the obtained topic-aggregation-enhanced sentence vector, as a topic prompt, with the original sentence vector, input it into the pre-trained generative model GPT, and use beam search during decoding to generate a candidate set of dialogue replies.
  • Unlike greedy search, which keeps only the token with the highest probability at each time step, beam search retains the beam-size candidate tokens with the highest probability at each step when generating a reply.
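The greedy-versus-beam distinction can be sketched as follows. The per-step distributions here are a hypothetical stand-in for the GPT decoder's outputs (a simplification: a real decoder conditions each step's distribution on the prefix generated so far):

```python
import math
import heapq

def beam_search(step_logprobs, beam_size=2):
    """Minimal beam search over a fixed table of per-step token log-probs.

    `step_logprobs` is a list of dicts {token: logprob}. At each step,
    every current beam is extended by every token, and only the
    `beam_size` highest-scoring prefixes are kept; greedy search is the
    special case beam_size=1.
    """
    beams = [((), 0.0)]  # (token sequence, cumulative log-probability)
    for dist in step_logprobs:
        candidates = [(seq + (tok,), score + lp)
                      for seq, score in beams
                      for tok, lp in dist.items()]
        beams = heapq.nlargest(beam_size, candidates, key=lambda c: c[1])
    return beams

# Hypothetical two-step decoder distributions.
steps = [{"你好": math.log(0.6), "嗨": math.log(0.4)},
         {"！": math.log(0.7), "。": math.log(0.3)}]
for seq, score in beam_search(steps, beam_size=2):
    print(seq, round(score, 3))
```

With beam_size=2 the search returns the two highest-probability reply candidates, which is exactly the candidate set the ranking module then scores.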
  • Step 5.2: After using beam search to generate a candidate set of multiple dialogue replies, use the contrastive learning method to train a reply ranking selection model to select the most suitable reply.
  • Positive and negative examples are constructed from the open-domain Chinese dialogue data collected through web crawlers.
  • the context of the same dialogue is used as positive examples.
  • the previous text of this dialogue and the replies of other dialogues are used as negative examples.
  • The trained model determines whether a reply is suitable, including: splicing the preceding and following texts together, inputting them into the pre-trained BERT model, and then taking out the vector S_i corresponding to the [CLS] token in the output for classification.
  • The loss function for training the reply ranking selection model is a contrastive loss of the form
  • L_i = -log [ exp(sim(S_i^1, S_i^2)) / ( exp(sim(S_i^1, S_i^2)) + Σ_{j=1}^{N} exp(sim(S_i^1, S_j^2)) ) ],
  • where S_i^1 represents the preceding sentence of conversation i, S_i^2 represents the following sentence of conversation i that replies to S_i^1, S_j^2 represents the reply of another conversation j, and N represents the number of other conversation sentences;
  • The contrastive learning method pulls positive examples closer together while pushing negative examples farther apart.
  • the method provided in this embodiment can realize controllable reply generation in open domain topics through graph convolutional neural network, contrastive learning and topic enhancement.
  • The open-domain dialogue reply generation method of this application combines data enhancement, using part-of-speech tagging and large-scale word vector models to enhance the semantic and topic information of sentences in limited dialogue data;
  • the graph convolutional neural network uses the semantically and topically enhanced sentences to perform topic fusion and enhancement on the original sentence;
  • contrastive learning, using constructed positive and negative examples, shortens the distance between relevant replies during model learning, allowing the model to select appropriate responses from the generated candidate set;
  • this application solves the problems encountered in open domain dialogue response generation such as general responses and lack of topic consistency, and improves the effect of response generation.
  • this application also provides an embodiment of a topic-enhanced open-domain conversation reply device.
  • an open domain dialogue reply device based on topic enhancement provided by an embodiment of the present application includes one or more processors for implementing an open domain dialogue reply method based on topic enhancement in the above embodiment.
  • The embodiment of the topic-enhanced open-domain dialogue reply device of the present application can be applied to any device with data processing capabilities, such as a computer.
  • The device embodiments may be implemented by software, by hardware, or by a combination of software and hardware. Taking software implementation as an example, as a logical device it is formed by the processor of any device with data processing capabilities reading the corresponding computer program instructions from non-volatile memory into memory and running them. At the hardware level, Figure 3 shows a hardware structure diagram of any device with data processing capabilities where the topic-enhanced open-domain dialogue reply device of the present application is located; in addition to the processor, memory, network interface and non-volatile memory shown in Figure 3, the device in the embodiment may also include other hardware according to its actual functions, which will not be described further here.
  • Since the device embodiment basically corresponds to the method embodiment, please refer to the partial description of the method embodiment for relevant details.
  • the device embodiments described above are only illustrative.
  • the units described as separate components may or may not be physically separated.
  • The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this application. Persons of ordinary skill in the art can understand and implement this without creative effort.
  • Embodiments of the present application also provide a computer-readable storage medium on which a program is stored.
  • the program is executed by a processor, the topic-enhanced open domain dialogue reply method in the above embodiments is implemented.
  • the computer-readable storage medium may be an internal storage unit of any device with data processing capabilities as described in any of the foregoing embodiments, such as a hard disk or a memory.
  • the computer-readable storage medium can also be an external storage device, such as a plug-in hard disk, a smart memory card (SMC), an SD card, a flash card (Flash Card), etc. equipped on the device.
  • the computer-readable storage medium may also include both an internal storage unit and an external storage device of any device with data processing capabilities.
  • the computer-readable storage medium is used to store the computer program and other programs and data required by any device with data processing capabilities, and can also be used to temporarily store data that has been output or is to be output.


Abstract

A topic-enhancement-based open-domain dialogue reply method and system. The method comprises: collecting and preprocessing data to obtain a Chinese dialogue corpus data set; performing sentence segmentation, word segmentation and part-of-speech tagging of the dialogues and extracting noun words; performing semantic and topic information enhancement on each dialogue sentence, and using a pre-trained sentence representation model to learn vector representations of the original and enhanced sentences; using a graph convolutional neural network to perform topic aggregation enhancement; inputting the topic-aggregation-enhanced sentence vectors into a pre-trained generative model to generate a candidate set of dialogue replies; and using a contrastive learning method to train a reply ranking selection model to select the most suitable reply.

Description

Open-Domain Dialogue Reply Method and System Based on Topic Enhancement
Related Applications
This application claims priority to Chinese patent application No. 202210981384.4, filed on August 16, 2022 and entitled "Open-domain dialogue reply method and system based on topic enhancement", the entire contents of which are incorporated herein by reference.
Technical Field
This application belongs to the field of artificial intelligence and relates to an open-domain dialogue reply method and system based on topic enhancement.
Background Art
Open-domain dialogue reply generation is a challenging task in natural language processing; open-domain dialogue refers to general chatting without restricting the domain. Artificial intelligence has made major breakthroughs in task-oriented dialogue reply tasks, but in open-domain dialogue replies the changes in user intent cannot be controlled, so the model needs stronger generalization ability and robustness.
Current dialogue reply generation systems mainly fall into two modes. One is based on a retrieval model that finds replies with similar content in a specific database or corpus; many knowledge question-answering or task-oriented dialogue systems adopt this retrieval model. In open-domain chat, however, there is no specific corpus to query, so the effect of such retrieval models often falls short of expectations. With the rise of deep learning, especially the introduction of large-scale pre-trained generative models such as GPT (Generative Pre-trained Transformer), T5 (Text-to-Text Transfer Transformer) and BART (Bidirectional and Auto-Regressive Transformers), generative dialogue systems based on deep learning have received increasing attention. Although pre-training on large-scale dialogue corpora followed by fine-tuning can generate semantically fluent replies, in open-domain dialogue the generated replies are often rather generic and lack topic consistency.
In open-domain dialogue reply, current techniques usually consider only the coherence between the generated reply and the preceding text while ignoring consistency between topics, so models typically give very generic replies. Moreover, since there is no fixed topic in the open domain, it is difficult to detect the topic in time and give a consistent reply.
Summary of the Invention
According to various embodiments, this application provides an open-domain dialogue reply method and system based on topic enhancement.
An open-domain dialogue reply method based on topic enhancement includes the following steps:
collecting open-source Chinese open-domain dialogue text corpora and preprocessing them to obtain a Chinese dialogue corpus data set;
using the public natural language processing toolkit HanLP to perform sentence segmentation, word segmentation and part-of-speech tagging of the dialogues, and using regular expressions to extract noun words;
performing semantic and topic information enhancement on each dialogue sentence, then using a pre-trained sentence representation model to learn vector representations of the original sentence and the enhanced sentences;
using a graph convolutional neural network to extract the semantic and topic information of the dialogue sentences and performing topic aggregation enhancement to obtain topic-aggregation-enhanced sentence vectors;
inputting the topic-aggregation-enhanced sentence vectors into the pre-trained generative model GPT, using a beam search decoding strategy to generate a candidate set of dialogue replies, and finally using a contrastive learning method to train a reply ranking selection model to select the most suitable reply.
In some embodiments, collecting open-source Chinese open-domain dialogue text corpora and preprocessing them to obtain a Chinese dialogue corpus data set includes: collecting the open-source corpora through a web crawler, then filtering and cleaning the data to obtain the Chinese dialogue corpus data set.
In some embodiments, using the public natural language processing toolkit HanLP to perform sentence segmentation, word segmentation and part-of-speech tagging and using regular expressions to extract noun words includes: using HanLP to split each dialogue in the Chinese dialogue corpus data set into sentences, obtaining m dialogue sentences {S_1, S_2, S_3, ..., S_m}; segmenting each sentence into words, obtaining n words {t_1, t_2, t_3, ..., t_n}; classifying each word t_x (1 ≤ x ≤ n) by part of speech according to the Modern Chinese Corpus processing standard, assigning each word a part-of-speech tag based on the role it plays in the syntactic structure or language morphology; and using regular expressions to extract all words of a noun nature, i.e. selecting from the part-of-speech categories adjectives with noun function, nouns, person names, place names, organization names and proper nouns.
In some embodiments, performing semantic and topic information enhancement on each dialogue sentence and then using a pre-trained sentence representation model to learn vector representations of the original sentence and the enhanced sentences includes:
performing semantic data enhancement on each dialogue sentence S_y (1 ≤ y ≤ m);
performing topic information enhancement with the extracted nouns;
performing data enhancement once more on the enhanced dialogue text;
and then using the pre-trained sentence representation model RoBERTa (Robustly Optimized BERT Pretraining Approach) to learn vector representations of the original sentence and the enhanced sentences.
In some embodiments, the semantic enhancement methods include: 1) using a Chinese synonym dictionary to randomly replace phrases in the dialogue text with synonyms; 2) randomly swapping the positions of adjacent phrases in the dialogue text; 3) randomly repeating or deleting non-noun phrases in the dialogue text; or 4) using the SimBERT model to rewrite the dialogue text.
The topic information enhancement methods include: 1) using a large-scale word vector model to obtain words similar to the nouns or noun phrases and replacing the nouns or noun phrases in the original dialogue text with these similar words; or 2) randomly repeating noun phrases in the dialogue text multiple times.
In some embodiments, using a graph convolutional neural network to extract the semantic and topic information of dialogue sentences and performing topic aggregation enhancement to obtain topic-aggregation-enhanced sentence vectors includes:
first constructing a directed graph from the original dialogue text and the enhanced dialogue text, where node ν_O on the graph represents the encoded original sentence and ν_A represents the set of enhanced sentences; each enhanced sentence has an edge ε pointing to the original sentence, finally yielding a directed graph G = (ν, ε);
after the directed graph G is constructed, using the graph convolutional neural network to perform semantic and topic aggregation enhancement on the original sentence along the edge directions.
The aggregation enhancement is performed as follows:
Two kinds of relations exist in the directed graph G:
first-order direct adjacency, represented by the adjacency matrix A, which indicates that two nodes are connected by an edge, i.e. an original sentence and a directly adjacent enhanced sentence;
and second-order indirect adjacency, represented by the adjacency matrix A′, which indicates that two nodes are not directly connected by an edge but share a common adjacent node.
The degree matrices corresponding to the adjacency matrices A and A′ are computed as:
D_ii = Σ_j A_ij, D′_ii = Σ_j A′_ij,
and A and A′ are each normalized through their corresponding degree matrices:
N = D^{-1/2} A D^{-1/2}, N′ = D′^{-1/2} A′ D′^{-1/2}.
Then a linear transformation and a Sigmoid activation function are used to compute each sentence vector H_{l+1} enhanced through first-order and second-order adjacency, with the formula:
H_{l+1} = σ(W(αN + βN′)H_l + b),
where H_l represents the original sentence vector before topic enhancement, W and b represent the weights of the linear transformation, and α and β are learnable parameters.
In some embodiments, inputting the topic-aggregation-enhanced sentence vectors into the pre-trained generative model GPT, using a beam search decoding strategy to generate a candidate set of dialogue replies, and finally using a contrastive learning method to train the reply ranking selection model to select the most suitable reply includes:
concatenating the obtained topic-aggregation-enhanced sentence vector with the original sentence vector and inputting them into the pre-trained generative model GPT, using beam search during decoding to generate the candidate set of dialogue replies;
and using the contrastive learning method to train the reply ranking selection model to obtain the most suitable reply to the original sentence.
In some embodiments, using contrastive learning to train the reply ranking selection model to obtain the most suitable reply to the original sentence includes: constructing positive and negative examples from the open-domain Chinese dialogue corpora collected through web crawlers, taking the preceding and following text of the same conversation as a positive example and the preceding text of that conversation paired with the replies of other conversations as negative examples, and training the reply ranking selection model to determine whether a reply is suitable, including: splicing the preceding and following texts together in pairs, inputting them into the pre-trained BERT model, and taking out the vector S_i corresponding to the [CLS] token in the BERT model output for classification.
In some embodiments, the loss function for training the reply ranking selection model is a contrastive loss of the form
L_i = -log [ exp(sim(S_i^1, S_i^2)) / ( exp(sim(S_i^1, S_i^2)) + Σ_{j=1}^{N} exp(sim(S_i^1, S_j^2)) ) ],
where S_i^1 represents the preceding sentence of conversation i, S_i^2 represents the following sentence of conversation i that replies to S_i^1, S_j^2 represents the reply (following sentence) of another conversation j, and N represents the number of other conversation sentences.
An open-domain dialogue reply system based on topic enhancement includes:
a text collection module, based on web crawlers, for collecting Chinese open-domain dialogue text corpora and filtering and cleaning the data;
a word segmentation and part-of-speech tagging module for sentence and word segmentation, which assigns a part-of-speech tag to each word through part-of-speech classification based on the role each word or phrase plays in the syntactic structure or language morphology, and then extracts words of a noun nature through regular expressions;
a semantic and topic enhancement module for letting the model better learn sentence semantic representations by performing semantic and topic data enhancement on the original sentences, including: 1) random synonym replacement, 2) random swapping of adjacent words, 3) random deletion or repetition of non-noun phrases, 4) sentence rewriting with SimBERT, 5) synonym replacement of nouns using a word vector model, or 6) random repetition of noun phrases;
a text encoding module that uses the pre-trained sentence representation model to obtain vector representations of the original sentence and the enhanced sentences, and then uses the graph convolutional neural network to aggregate the data-enhanced sentence vector representations to obtain the topic-enhanced sentence vector representation;
a sentence ranking module based on contrastive learning, which takes the preceding and following text of the same conversation as positive examples and the preceding text of that conversation paired with the reply of another conversation as negative examples, and trains the reply ranking selection model for screening out the most suitable reply text;
and a reply generation module that inputs the topic-enhanced sentence vector representation obtained by the graph convolutional neural network as a prompt into the pre-trained generative model GPT, uses beam search to generate a topic-related reply candidate set, and then ranks and filters through the previously trained sentence ranking module to find the most suitable reply.
The details of one or more embodiments of this application are set forth in the following drawings and description, to make the other features, objects and advantages of this application clearer and easier to understand.
Brief Description of the Drawings
To better describe and illustrate the embodiments and/or examples of the application disclosed here, reference may be made to one or more drawings. The additional details or examples used to describe the drawings should not be considered as limiting the scope of any of the disclosed application, the presently described embodiments and/or examples, or the presently understood best mode of these applications.
Figure 1 is a block diagram of a topic-enhanced open-domain dialogue reply system according to an embodiment of this application;
Figure 2 is a schematic flow chart of an open-domain dialogue reply method based on topic enhancement according to an embodiment of this application;
Figure 3 is a schematic structural diagram of an open-domain dialogue reply device based on topic enhancement according to an embodiment of this application.
Detailed Description
To make the objects, technical solutions and advantages of this application clearer, this application is described and illustrated below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain this application and are not intended to limit it. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments provided in this application without creative effort fall within the protection scope of this application. Furthermore, although the effort made in such development may be complex and lengthy, for a person of ordinary skill in the art related to the content disclosed in this application, some design, manufacturing or production changes made on the basis of the disclosed technical content are only conventional technical means and should not be understood as meaning that the disclosure of this application is insufficient.
Reference to an "embodiment" in this application means that a particular feature, structure or characteristic described in connection with the embodiment may be included in at least one embodiment of this application. The appearance of this phrase in various places in the specification does not necessarily refer to the same embodiment, nor to independent or alternative embodiments mutually exclusive of other embodiments. A person of ordinary skill in the art understands, explicitly and implicitly, that the embodiments described in this application may be combined with other embodiments where there is no conflict.
Unless otherwise defined, the technical or scientific terms used in this application shall have the ordinary meaning understood by a person of ordinary skill in the technical field to which this application belongs. Words such as "a", "an", "one" and "the" in this application do not denote a limitation of quantity and may denote the singular or the plural. "Multiple" in this application means greater than or equal to two. The terms "include", "comprise", "have" and any variations thereof in this application are intended to cover a non-exclusive inclusion.
As shown in Fig. 1, a topic-enhancement-based open-domain dialogue reply system of the present application comprises:
a text collection module, based on a web crawler, configured to collect Chinese open-domain dialogue text corpora and to filter and clean the data;
a word segmentation and POS tagging module, configured to split sentences and segment words, assign each word a POS tag through POS classification according to the role each word or phrase plays in syntactic structure or morphology, and then extract nominal words with regular expressions;
a semantic and topic enhancement module, configured to perform semantic and topic data augmentation on the original sentences so that the model learns better sentence semantic representations, in the following ways: 1) random synonym replacement, 2) random swapping of adjacent words, 3) random deletion or repetition of non-noun phrases, 4) sentence rewriting with SimBERT, 5) synonym replacement of nouns with a word-vector model, or 6) random repetition of noun phrases;
a text encoding module, configured to obtain vector representations of the original and augmented sentences with a pretrained sentence representation model, and then aggregate the augmented sentence vector representations with a graph convolutional network to obtain topic-enhanced sentence vector representations;
a contrastive-learning-based sentence ranking module, configured to train a reply ranking and selection model by taking the context and reply of the same dialogue as positive examples and the context of that dialogue paired with the reply of another dialogue as negative examples, for selecting the most suitable reply text;
a reply generation module, configured to feed the topic-enhanced sentence vector representation obtained by the graph convolutional network into the pretrained generative model GPT as a prompt, generate a topic-relevant candidate reply set with beam search, and then rank and filter the candidates with the previously trained sentence ranking module to find the most suitable reply.
The open-domain dialogue reply system of the present application, based on contrastive learning, graph convolutional networks, and topic enhancement, uses semantic and topic augmentation aggregated through a graph convolutional network to generate a candidate reply set with topic consistency, and at the same time optimizes the reply ranking and selection model with the idea of contrastive learning, ensuring that the generated reply content is both topic-consistent and semantically fluent.
As shown in Fig. 2, a topic-enhancement-based open-domain dialogue reply method of the present application includes the following steps:
Step 1: collect open-source Chinese open-domain dialogue text corpora and preprocess them to obtain a dialogue corpus dataset.
Chinese open-domain dialogue text corpora, including Weibo corpora, Douban conversation corpora, and Baidu Tieba dialogue corpora, are collected with a web crawler; these corpora are then filtered and cleaned, finally yielding nearly 3 million dialogue records.
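The patent does not spell out the filtering and cleaning rules; a minimal cleaning pass in this spirit might look like the sketch below, where the specific rules (stripping URLs and @-mentions, dropping empty turns and exact duplicates) are illustrative assumptions rather than the patented procedure:

```python
import re

def clean_corpus(dialogs):
    """Filter and clean raw crawled (context, reply) pairs.
    Illustrative rules: strip URLs and @-mentions, then drop
    empty turns and exact duplicate pairs."""
    url_re = re.compile(r"https?://\S+")
    mention_re = re.compile(r"@\S+")
    seen, cleaned = set(), []
    for context, reply in dialogs:
        context = mention_re.sub("", url_re.sub("", context)).strip()
        reply = mention_re.sub("", url_re.sub("", reply)).strip()
        if not context or not reply:
            continue                      # drop empty turns
        key = (context, reply)
        if key in seen:
            continue                      # drop exact duplicates
        seen.add(key)
        cleaned.append(key)
    return cleaned

pairs = [
    ("今天天气真好 https://t.cn/abc", "是啊，适合出去玩"),
    ("今天天气真好", "是啊，适合出去玩"),
    ("@user 你好", ""),
]
print(clean_corpus(pairs))
```

After URL stripping, the first two pairs collide and are deduplicated, and the third is dropped for its empty reply, leaving a single clean pair.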
Step 2: use the open natural language processing toolkit HanLP to split the dialogues into sentences, segment words, and tag parts of speech, and extract nominal words with regular expressions.
Using the Chinese word segmentation and POS tagging suite provided by the open NLP toolkit HanLP, each dialogue is split into m sentences {S_1, S_2, S_3, ..., S_m}, and each sentence is segmented into n words {t_1, t_2, t_3, ..., t_n}; each segmented word is then assigned a POS tag according to the role it plays in syntactic structure or morphology.
Each word t_x (1 ≤ x ≤ n) is classified into one of 43 categories according to the PKU standard (the processing standard for the Modern Chinese corpus). To find topic-related phrases, the categories an (adjectives with nominal function), n (nouns), nr (person names), ns (place names), nt (organization or group names), and nz (proper nouns) are selected from these 43 POS categories, and all words matching these POS tags are extracted with a regular expression.
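As an illustration of this extraction step, the filter below runs a regular expression over (word, tag) pairs in the PKU tagset; the example sentence and its tags are assumed inputs for demonstration, not actual HanLP output:

```python
import re

# PKU-standard POS tags treated as topic-bearing nominal categories:
# an = adjective with nominal function, n = noun, nr = person name,
# ns = place name, nt = organization name, nz = proper noun.
NOMINAL_TAG = re.compile(r"^(an|n|nr|ns|nt|nz)$")

def extract_nouns(tagged_sentence):
    """Keep only words whose POS tag is one of the nominal categories.
    `tagged_sentence` is a list of (word, tag) pairs, such as a
    HanLP-style segmenter + tagger might return (assumed format)."""
    return [w for w, t in tagged_sentence if NOMINAL_TAG.match(t)]

tagged = [("我", "r"), ("喜欢", "v"), ("北京", "ns"),
          ("的", "u"), ("烤鸭", "n"), ("非常", "d"), ("美味", "an")]
print(extract_nouns(tagged))  # → ['北京', '烤鸭', '美味']
```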
Step 3: apply semantic and topic information augmentation to each dialogue sentence, then use a pretrained sentence representation model to learn vector representations of the original and augmented sentences, including the following sub-steps:
Step 3.1: to help the network model learn better sentence semantic representations, each dialogue sentence S_y (1 ≤ y ≤ m) undergoes semantic data augmentation, including: 1) randomly replacing phrases in the dialogue text (the dialogue's sentences) with synonyms from a Chinese synonym dictionary; 2) randomly swapping the positions of adjacent phrases in the dialogue text; 3) randomly repeating multiple times or deleting non-noun phrases in the dialogue text; or 4) rewriting the dialogue text with the SimBERT model.
Step 3.2: if nouns or nominal phrases have been found in a dialogue sentence by the POS tagging model, then in addition to the semantic augmentation of step 3.1, the extracted nouns are used for topic information augmentation, including: 1) obtaining words similar to these nouns or nominal phrases with a large-scale word-vector model and replacing the nouns or noun phrases in the original dialogue text (the original sentence) with these similar words; or 2) randomly repeating noun phrases in the dialogue text multiple times.
Step 3.3: after the semantically and topically augmented dialogue text is obtained, the methods of steps 2 to 3 above are applied to the augmented dialogue text for a further round of data augmentation, ensuring the semantic and topical richness of the augmentation.
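Steps 3.1 to 3.3 can be sketched as follows. The toy synonym table stands in for both the Chinese synonym dictionary and the large-scale word-vector model, and the operations shown (synonym replacement, adjacent swap, noun repetition) are a subset of the listed augmentations; all specifics here are illustrative assumptions:

```python
import random

# Toy synonym table standing in for a Chinese synonym dictionary /
# word-vector similarity lookup (an assumption for illustration).
SYNONYMS = {"喜欢": ["喜爱"], "北京": ["北平"]}

def synonym_replace(tokens, rng):
    """Randomly replace one token that has an entry in the synonym table."""
    candidates = [i for i, t in enumerate(tokens) if t in SYNONYMS]
    if not candidates:
        return list(tokens)
    i = rng.choice(candidates)
    out = list(tokens)
    out[i] = rng.choice(SYNONYMS[out[i]])
    return out

def swap_adjacent(tokens, rng):
    """Randomly swap one pair of neighbouring tokens."""
    if len(tokens) < 2:
        return list(tokens)
    i = rng.randrange(len(tokens) - 1)
    out = list(tokens)
    out[i], out[i + 1] = out[i + 1], out[i]
    return out

def repeat_nouns(tokens, noun_set, rng):
    """Topic augmentation: randomly repeat one extracted noun token."""
    candidates = [i for i, t in enumerate(tokens) if t in noun_set]
    if not candidates:
        return list(tokens)
    i = rng.choice(candidates)
    return tokens[:i + 1] + [tokens[i]] + tokens[i + 1:]

rng = random.Random(0)
sent = ["我", "喜欢", "北京", "烤鸭"]
print(synonym_replace(sent, rng))
print(swap_adjacent(sent, rng))
print(repeat_nouns(sent, {"北京", "烤鸭"}, rng))
```

Applying these operators again to an augmented sentence, as step 3.3 describes, simply means feeding their output back in as the input token list.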
Step 3.4: the pretrained sentence representation model RoBERTa is then used to learn vector representations of the original and augmented sentences: a sentence is fed into the RoBERTa model, and the vector corresponding to the [CLS] token in the model output is taken as the vector representation of that sentence.
Step 4: use a graph convolutional network to extract the semantic and topic information of the dialogue sentences and perform topic aggregation enhancement to obtain topic-aggregation-enhanced sentence vectors, including the following sub-steps:
Step 4.1: first, a directed graph is constructed from the original dialogue text and the augmented dialogue text. A node ν_O on the graph represents an encoded original sentence, and ν_A represents the set of augmented sentences; each augmented sentence has an edge ε pointing to the original sentence, finally yielding a directed graph G = (ν, ε);
Step 4.2: after the directed graph G is constructed, a graph convolutional network is used to perform semantic and topic aggregation enhancement on the original sentence along the edge directions, as follows:
Two kinds of relations exist in the directed graph G:
a first-order direct adjacency A^(1), indicating that two nodes are connected by an edge, i.e. the original sentence and a directly adjacent augmented sentence;
a second-order indirect adjacency A^(2), indicating that two nodes are not directly connected by an edge but share a common neighboring node, i.e. the relation between augmented sentences. In the constructed directed graph network, nodes that are not directly connected may still share certain topical relations, and through this second-order indirect adjacency more topic-related textual features can be extracted.
From the adjacency matrices A^(1) and A^(2), the corresponding degree matrices are computed as:

D^(1)_ii = Σ_j A^(1)_ij,  D^(2)_ii = Σ_j A^(2)_ij

Each adjacency matrix A^(1) and A^(2) is then normalized by its corresponding degree matrix, preventing a node from gaining excessive influence merely because it has many connected edges:

N = (D^(1))^(−1) A^(1),  N′ = (D^(2))^(−1) A^(2)

A linear transformation followed by a Sigmoid activation is then used to compute each sentence vector H^(l+1) enhanced through the first-order and second-order adjacencies:

H^(l+1) = σ(W(αN + βN′)H^l + b),

where H^l denotes the original sentence vector before topic enhancement, W and b denote the weights of the linear transformation, and α and β are learnable parameters that control the influence of the first-order and second-order adjacent augmented sentences on the topic enhancement.
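A minimal sketch of this aggregation step is shown below. The adjacency convention (rows index the receiving node), the degree-based row normalization, and the way A^(2) is derived from A^(1) are assumptions made for illustration; the patent fixes only the update rule H^(l+1) = σ(W(αN + βN′)H^l + b):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def topic_aggregate(H, A1, alpha=0.6, beta=0.4, W=None, b=0.0):
    """One aggregation step H^(l+1) = sigmoid(W(alpha*N + beta*N')H^l + b).

    Convention (an assumption): A1[i, j] = 1 means node i receives the
    edge from augmented sentence j, so the original sentence's row lists
    its augmentations.  A2 links nodes that feed a common neighbour but
    share no direct edge (the second-order indirect adjacency).  Row
    normalisation by the degree keeps high-degree nodes from dominating."""
    reach = (A1.T @ A1 > 0).astype(float)       # feed a common receiver
    A2 = reach * (A1 == 0) * (A1.T == 0) * (1 - np.eye(len(A1)))

    def row_norm(A):
        deg = A.sum(axis=1, keepdims=True)
        return np.divide(A, deg, out=np.zeros_like(A), where=deg > 0)

    N, N2 = row_norm(A1), row_norm(A2)
    W = np.eye(H.shape[1]) if W is None else W
    return sigmoid((alpha * N + beta * N2) @ H @ W.T + b)

# One original sentence (node 0) and two augmented sentences (nodes 1, 2).
H = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
A1 = np.array([[0.0, 1.0, 1.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
print(topic_aggregate(H, A1))
```

In this toy graph the original sentence aggregates its two augmentations through N, while the two augmented sentences reach each other only through the second-order matrix N′.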
Step 5: feed the topic-aggregation-enhanced sentence vectors into the pretrained generative model GPT, generate a candidate set of dialogue replies with a beam-search decoding strategy, and finally train a reply ranking and selection model with contrastive learning to select the most suitable reply.
Step 5.1: the obtained topic-aggregation-enhanced sentence vector is concatenated with the original sentence vector as a topic prompt and fed into the pretrained generative model GPT. During decoding, beam search is used to generate the candidate set of dialogue replies. Unlike greedy search, which produces only the single highest-probability token at each time step, beam search keeps the beam-size highest-probability candidate tokens at every step of reply generation.
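The decoding strategy can be illustrated with a toy next-token table standing in for GPT's output distribution; the table, tokens, and scoring by summed log-probability are invented for demonstration:

```python
import math

def beam_search(step_fn, start, beam_size=2, max_len=3):
    """Minimal beam search: keep the `beam_size` highest log-probability
    partial sequences at every time step (greedy search is the special
    case beam_size=1).  `step_fn(seq)` returns {token: prob} for the
    next token, with "<eos>" terminating a sequence."""
    beams = [([start], 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == "<eos>":
                candidates.append((seq, score))   # finished beam carried over
                continue
            for tok, p in step_fn(seq).items():
                candidates.append((seq + [tok], score + math.log(p)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams

# Toy next-token distributions standing in for the GPT decoder output.
TABLE = {
    ("<bos>",): {"天气": 0.6, "电影": 0.4},
    ("<bos>", "天气"): {"很好": 0.9, "<eos>": 0.1},
    ("<bos>", "电影"): {"好看": 0.8, "<eos>": 0.2},
}
step = lambda seq: TABLE.get(tuple(seq), {"<eos>": 1.0})
for seq, score in beam_search(step, "<bos>", beam_size=2, max_len=3):
    print(seq, round(score, 3))
```

With beam_size=2 both candidate replies survive decoding, which is exactly what feeds the ranking module of step 5.2; greedy search would have kept only the first.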
Step 5.2: after multiple candidate dialogue replies are generated with beam search, a reply ranking and selection model is trained with contrastive learning to select the most suitable reply.
Positive and negative examples are constructed from the open-domain Chinese dialogue corpora collected by the web crawler: the context and reply of the same dialogue form a positive example, and the context of that dialogue paired with the reply of another dialogue forms a negative example. The model is trained to judge whether a reply is suitable, as follows: each context-reply pair is concatenated and fed into a pretrained BERT model, and the vector S_i corresponding to the [CLS] token in the output is taken for classification. The loss function for training the reply ranking and selection model is:
L = −log [ exp(sim(S_1^i, S_2^i)) / ( exp(sim(S_1^i, S_2^i)) + Σ_{j=1}^{N} exp(sim(S_1^i, S_2^j)) ) ]

where S_1^i denotes the preceding sentence in a dialogue i, S_2^i denotes the reply in the same dialogue to the sentence S_1^i, S_2^j denotes the reply sentence from another dialogue, i.e. another dialogue j, and N denotes the number of other dialogues;
the contrastive learning approach pulls the positive pair (S_1^i, S_2^i) closer together while increasing the distance between the negative pairs.
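The loss above can be computed directly as shown below; cosine similarity is assumed for sim, which the text leaves unspecified, and the example vectors are invented for demonstration:

```python
import numpy as np

def contrastive_loss(s1, s2_pos, s2_negs):
    """L = -log( e^{sim(s1,s2_pos)} /
                 ( e^{sim(s1,s2_pos)} + sum_j e^{sim(s1,s2_negs[j])} ) )
    s1: context vector of dialogue i; s2_pos: its in-dialogue reply;
    s2_negs: N replies taken from other dialogues.  Cosine similarity
    is assumed for sim (the patent only names sim)."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    pos = np.exp(cos(s1, s2_pos))
    neg = sum(np.exp(cos(s1, n)) for n in s2_negs)
    return float(-np.log(pos / (pos + neg)))

s1 = np.array([1.0, 0.0])                       # context sentence vector
good = contrastive_loss(s1, np.array([1.0, 0.1]), [np.array([-1.0, 0.0])])
bad = contrastive_loss(s1, np.array([-1.0, 0.0]), [np.array([1.0, 0.1])])
print(good, bad)
```

Minimising this loss rewards a high-similarity positive pair relative to the negatives, so the well-matched reply yields the smaller loss value.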
In summary, the method provided by this embodiment achieves topic-controllable open-domain reply generation through graph convolutional networks, contrastive learning, and topic enhancement.
The open-domain dialogue reply generation method of the present application combines: data augmentation, which enhances the semantic and topic information of sentences in a limited dialogue corpus through POS tagging and a large-scale word-vector model; a graph convolutional network, which fuses and enhances the topic of the original sentence using the semantically and topically augmented sentences; and contrastive learning, which, by constructing positive and negative examples, pulls related replies closer during model learning so that the model can rank suitable replies out of the generated candidate set. The present application solves problems encountered in open-domain dialogue reply generation, such as overly generic replies and lack of topic consistency, and improves the quality of reply generation.
Corresponding to the foregoing embodiments of the topic-enhancement-based open-domain dialogue reply method, the present application further provides embodiments of a topic-enhancement-based open-domain dialogue reply apparatus.
Referring to Fig. 3, the topic-enhancement-based open-domain dialogue reply apparatus provided by an embodiment of the present application includes one or more processors configured to implement the topic-enhancement-based open-domain dialogue reply method of the above embodiments.
The embodiments of the topic-enhancement-based open-domain dialogue reply apparatus of the present application may be applied to any device with data processing capability, which may be a device or apparatus such as a computer. The apparatus embodiments may be implemented by software, or by hardware, or by a combination of hardware and software. Taking software implementation as an example, the apparatus in a logical sense is formed by the processor of the device with data processing capability on which it resides reading the corresponding computer program instructions from a non-volatile memory into memory and running them. In terms of hardware, Fig. 3 shows a hardware structure diagram of a device with data processing capability on which the topic-enhancement-based open-domain dialogue reply apparatus of the present application resides; besides the processor, memory, network interface, and non-volatile memory shown in Fig. 3, the device on which the apparatus of the embodiment resides may, depending on its actual function, also include other hardware, which will not be described further here.
The implementation of the functions and roles of each unit in the above apparatus is detailed in the implementation of the corresponding steps in the above method and will not be repeated here.
Since the apparatus embodiments basically correspond to the method embodiments, the relevant parts may refer to the description of the method embodiments. The apparatus embodiments described above are merely illustrative: the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, i.e. they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the objectives of the solution of the present application, and a person of ordinary skill in the art can understand and implement this without creative effort.
An embodiment of the present application further provides a computer-readable storage medium on which a program is stored; when the program is executed by a processor, the topic-enhancement-based open-domain dialogue reply method of the above embodiments is implemented.
The computer-readable storage medium may be an internal storage unit of any device with data processing capability described in any of the foregoing embodiments, such as a hard disk or memory. The computer-readable storage medium may also be an external storage device, such as a plug-in hard disk, a Smart Media Card (SMC), an SD card, or a Flash Card provided on the device. Furthermore, the computer-readable storage medium may include both the internal storage unit of a device with data processing capability and an external storage device. The computer-readable storage medium is used to store the computer program and the other programs and data required by the device with data processing capability, and may also be used to temporarily store data that has been output or is to be output.
The technical features of the above embodiments may be combined arbitrarily. For brevity of description, not all possible combinations of the technical features in the above embodiments are described; however, as long as no contradiction exists in a combination of these technical features, it should be considered within the scope of this specification.
The above embodiments merely express several implementations of the present application, and their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the patent application. It should be noted that a person of ordinary skill in the art may make several variations and improvements without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of this patent application shall be subject to the appended claims.

Claims (10)

  1. A topic-enhancement-based open-domain dialogue reply method, characterized by comprising:
    collecting open-source Chinese open-domain dialogue text corpora and preprocessing them to obtain a Chinese dialogue corpus dataset;
    using the open natural language processing toolkit HanLP to split the dialogues into sentences, segment words, and tag parts of speech, and extracting nominal words with regular expressions;
    performing semantic and topic information augmentation on each dialogue sentence, and then using a pretrained sentence representation model to learn vector representations of the original and augmented sentences;
    using a graph convolutional network to extract the semantic and topic information of the dialogue sentences and performing topic aggregation enhancement to obtain topic-aggregation-enhanced sentence vectors;
    feeding the topic-aggregation-enhanced sentence vectors into the pretrained generative model GPT, generating a candidate set of dialogue replies with a beam-search decoding strategy, and finally training a reply ranking and selection model with contrastive learning to select the most suitable reply.
  2. The topic-enhancement-based open-domain dialogue reply method according to claim 1, wherein collecting the open-source Chinese open-domain dialogue text corpora and preprocessing them to obtain the Chinese dialogue corpus dataset comprises: collecting open-source Chinese open-domain dialogue text corpora by means of a web crawler and filtering and cleaning the data to obtain the Chinese dialogue corpus dataset.
  3. The topic-enhancement-based open-domain dialogue reply method according to claim 1, wherein using the open natural language processing toolkit HanLP to split the dialogues into sentences, segment words, and tag parts of speech and extracting nominal words with regular expressions comprises: using the open NLP toolkit HanLP to split each dialogue in the Chinese dialogue corpus dataset into m sentences {S_1, S_2, S_3, ..., S_m} and to segment each sentence into n words {t_1, t_2, t_3, ..., t_n}; classifying each word t_x (1 ≤ x ≤ n) by part of speech according to the processing standard for the Modern Chinese corpus, assigning each word a POS tag through POS classification according to the role the word plays in syntactic structure or morphology; and extracting all nominal words with a regular expression, namely selecting from the POS categories the adjectives with nominal function, nouns, person names, place names, organization or group names, and proper nouns.
  4. The topic-enhancement-based open-domain dialogue reply method according to claim 3, wherein performing semantic and topic information augmentation on each dialogue sentence and then using a pretrained sentence representation model to learn vector representations of the original and augmented sentences comprises:
    performing semantic data augmentation on each dialogue sentence S_y (1 ≤ y ≤ m);
    performing topic information augmentation with the extracted nouns;
    applying a further round of data augmentation to the augmented dialogue text;
    and then using the pretrained sentence representation model RoBERTa to learn vector representations of the original and augmented sentences.
  5. The topic-enhancement-based open-domain dialogue reply method according to claim 4, wherein the semantic augmentation comprises: 1) randomly replacing phrases in the dialogue text with synonyms from a Chinese synonym dictionary; 2) randomly swapping the positions of adjacent phrases in the dialogue text; 3) randomly repeating multiple times or deleting non-noun phrases in the dialogue text; or 4) rewriting the dialogue text with the SimBERT model;
    and the topic information augmentation comprises: 1) obtaining words similar to the nouns or nominal phrases with a large-scale word-vector model and replacing the nouns or noun phrases in the original dialogue text with the similar words; or 2) randomly repeating noun phrases in the dialogue text multiple times.
  6. The topic-enhancement-based open-domain dialogue reply method according to claim 4, wherein using the graph convolutional network to extract the semantic and topic information of the dialogue sentences and performing topic aggregation enhancement to obtain the topic-aggregation-enhanced sentence vectors comprises:
    constructing a directed graph from the original dialogue text and the augmented dialogue text, where a node ν_O on the graph represents an encoded original sentence and ν_A represents the set of augmented sentences; each augmented sentence has an edge ε pointing to the original sentence, finally yielding a directed graph G = (ν, ε);
    after the directed graph G is constructed, using a graph convolutional network to perform semantic and topic aggregation enhancement on the original sentence along the edge directions;
    wherein the aggregation enhancement comprises:
    two kinds of relations exist in the directed graph G:
    a first-order direct adjacency A^(1), indicating that two nodes are connected by an edge, i.e. the original sentence and a directly adjacent augmented sentence;
    a second-order indirect adjacency A^(2), indicating that two nodes are not directly connected by an edge but share a common neighboring node;
    computing from the adjacency matrices A^(1) and A^(2) the corresponding degree matrices:
    D^(1)_ii = Σ_j A^(1)_ij,  D^(2)_ii = Σ_j A^(2)_ij;
    normalizing each adjacency matrix A^(1) and A^(2) by its corresponding degree matrix:
    N = (D^(1))^(−1) A^(1),  N′ = (D^(2))^(−1) A^(2);
    and then using a linear transformation and a Sigmoid activation to compute each sentence vector H^(l+1) enhanced through the first-order and second-order adjacencies:
    H^(l+1) = σ(W(αN + βN′)H^l + b),
    where H^l denotes the original sentence vector before topic enhancement, W and b denote the weights of the linear transformation, and α and β are learnable parameters.
  7. The topic-enhancement-based open-domain dialogue reply method according to claim 6, wherein feeding the topic-aggregation-enhanced sentence vectors into the pretrained generative model GPT, generating the candidate set of dialogue replies with the beam-search decoding strategy, and finally training the reply ranking and selection model with contrastive learning to select the most suitable reply comprises:
    concatenating the obtained topic-aggregation-enhanced sentence vector with the original sentence vector and feeding them into the pretrained generative model GPT, using beam search during decoding to generate the candidate set of dialogue replies;
    training the reply ranking and selection model with contrastive learning to obtain the most suitable reply to the original sentence.
  8. The topic-enhancement-based open-domain dialogue reply method according to claim 7, wherein training the reply ranking and selection model with contrastive learning to obtain the most suitable reply to the original sentence comprises: constructing positive and negative examples from the open-domain Chinese dialogue corpora collected with the web crawler, taking the context and reply of the same dialogue as positive examples and the context of that dialogue paired with the replies of other dialogues as negative examples, and training the reply ranking and selection model to judge whether a reply is suitable, comprising: concatenating each context-reply pair, feeding it into a pretrained BERT model, and taking the vector S_i corresponding to the [CLS] token in the BERT model output for classification.
  9. The topic-enhancement-based open-domain dialogue reply method according to claim 8, wherein the loss function for training the reply ranking and selection model is:
    L = −log [ exp(sim(S_1^i, S_2^i)) / ( exp(sim(S_1^i, S_2^i)) + Σ_{j=1}^{N} exp(sim(S_1^i, S_2^j)) ) ],
    where S_1^i denotes the preceding sentence in a dialogue i, S_2^i denotes the reply in the same dialogue to the sentence S_1^i, S_2^j denotes the reply sentence from another dialogue, i.e. another dialogue j, and N denotes the number of other dialogues.
  10. A topic-enhancement-based open-domain dialogue reply system, characterized by comprising:
    a text collection module, based on a web crawler, configured to collect Chinese open-domain dialogue text corpora and to filter and clean the data;
    a word segmentation and POS tagging module, configured to split sentences and segment words, assign each word a POS tag through POS classification according to the role each word or phrase plays in syntactic structure or morphology, and then extract nominal words with regular expressions;
    a semantic and topic enhancement module, configured to perform semantic and topic data augmentation on the original sentences so that the model learns better sentence semantic representations, including: 1) random synonym replacement, 2) random swapping of adjacent words, 3) random deletion or repetition of non-noun phrases, 4) sentence rewriting with the SimBERT model, 5) synonym replacement of nouns with a word-vector model, or 6) random repetition of noun phrases;
    a text encoding module, configured to obtain vector representations of the original and augmented sentences with a pretrained sentence representation model, and then aggregate the augmented sentence vector representations with a graph convolutional network to obtain topic-enhanced sentence vector representations;
    a contrastive-learning-based sentence ranking module, configured to train a reply ranking and selection model by taking the context and reply of the same dialogue as positive examples and the context of that dialogue paired with the reply of another dialogue as negative examples, for selecting the most suitable reply text;
    a reply generation module, configured to feed the topic-enhanced sentence vector representation obtained by the graph convolutional network into the pretrained generative model GPT as a prompt, generate a topic-relevant candidate reply set with beam search, and then rank and filter the candidates with the previously trained sentence ranking module to find the most suitable reply.
PCT/CN2022/139320 2022-08-16 2022-12-15 Topic-enhancement-based open-domain dialogue reply method and system WO2024036840A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US18/297,610 US12019989B2 (en) 2022-08-16 2023-04-08 Open domain dialog reply method and system based on thematic enhancement

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210981384.4 2022-08-16
CN202210981384.4A CN115048944B (zh) 2022-08-16 Topic-enhancement-based open-domain dialogue reply method and system


Publications (1)

Publication Number Publication Date
WO2024036840A1 true WO2024036840A1 (zh) 2024-02-22



Also Published As

Publication number Publication date
CN115048944A (zh) 2022-09-13
CN115048944B (zh) 2022-12-20


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22955607

Country of ref document: EP

Kind code of ref document: A1