CN111553157A

CN111553157A - Entity replacement-based dialog intention identification method

Info

Publication number: CN111553157A
Application number: CN202010271707.1A
Authority: CN
Inventors: 张堃; 王天宇; 周波; 李文俊
Original assignee: Hangzhou Borazhe Technology Co ltd; Nantong University
Current assignee: Hangzhou Borazhe Technology Co ltd; Nantong University
Priority date: 2020-04-08
Filing date: 2020-04-08
Publication date: 2020-08-18

Abstract

The invention discloses a dialogue intent recognition method based on entity replacement, comprising the following steps: step 1, text segmentation; step 2, text filtering; step 3, text named entity recognition; step 4, text named entity replacement; step 5, text feature extraction; step 6, text intent recognition. The method uses the named entity recognition results to replace entity names in the text with their entity types, which reduces the magnitude and imbalance of the dialogue system's corpus data and thereby improves the accuracy of intent recognition during dialogue.

Description

A Dialogue Intent Recognition Method Based on Entity Replacement

Technical Field

The present invention relates to a dialogue intent recognition method, and in particular to a dialogue intent recognition method based on entity replacement.

Background

In recent years, driven by the rapid development of artificial intelligence and semiconductor chip technology and by the growing demand for voice interaction, application products built on dialogue systems, such as smart speakers, smart furniture, and intelligent voice customer service, have flourished in the market.

Such dialogue systems generally consist of five modules: automatic speech recognition (ASR), natural language understanding (NLU), dialogue management (DM), natural language generation (NLG), and text-to-speech synthesis (TTS). At present, deep learning already provides good solutions for the speech recognition module, and the natural language generation and speech synthesis modules are relatively easy to control, so the main difficulty in dialogue system design lies in the natural language understanding and dialogue management modules. The goal of the NLU module is to convert the text produced by the speech recognition module into a semantic representation, giving the machine a human-like ability to understand language. The accuracy of the NLU module is therefore the prerequisite and guarantee for the normal operation of a dialogue system.

With the continuous improvement of deep learning algorithms, computing power, and big data technology, the intent recognition accuracy of simple dialogue systems, such as voice meal-ordering and voice song-request systems, has essentially reached a commercial level. However, complex dialogue systems have far larger corpora and far more complex intents than these simple systems, and problems such as imbalanced corpus data and a large number of intent types make dialogue intent recognition harder. For example, the invention patent "Method and Device for Understanding Natural Language Intent in Human-Computer Interaction" (CN201710219326) takes the word vectors of the text as input and uses an intent recognition model to obtain the intent type of the text. Once the class distribution of the training samples is unbalanced, its intent recognition model is prone to severe overfitting and underfitting, which limits the approach. As another example, the invention patent "An Intent Recognition Method and Device" (CN201811368503) feeds the text into at least one intent recognition model, generates a prediction for each model, and then determines the text intent. As the number of intent types grows, the cost and difficulty of training the models in that method increase substantially, so it is not suitable for intent recognition in complex dialogue systems.

Summary of the Invention

Purpose of the invention: the present invention aims to remedy the shortcomings of the prior art by providing a dialogue intent recognition method based on entity replacement. The method uses the named entity recognition results to replace entity names in the text with their entity types, which reduces the magnitude and imbalance of the dialogue system's corpus data and thereby improves the accuracy of intent recognition during dialogue.

Technical solution: to achieve the above purpose, the present invention adopts the following technical solution:

A dialogue intent recognition method based on entity replacement, comprising the following steps:

Step 1. Text segmentation:

A word segmentation tool is used to segment the text produced by the speech recognition module, giving the segmentation result set Token, where the segmentation result is represented as the set {W} and W denotes a segmented word;

Step 2. Text filtering:

A stop-word lexicon is built according to the needs of the dialogue system and is used to filter the segmentation result set Token obtained in step 1, giving the cleaned result Token^*;

Step 3. Text named entity recognition:

Named entity recognition yields the result {E:T}, where E denotes the entity name and T denotes the entity type;

Step 4. Text named entity replacement:

Each named entity type involved in the dialogue system is mapped one-to-one to a specific character, denoted {T:C}, and the text is recombined to obtain a new corpus, where T denotes the entity type and C denotes the specific character; the chosen characters must not occur in the dialogue system's corpus;

Step 5. Text feature extraction:

Starting from pre-trained models of different types, the pre-trained model is fine-tuned on the new corpus obtained in step 4 to obtain a fine-tuned feature extraction model, which is then used to obtain the word vectors Vec of the dialogue system corpus;

Step 6. Text intent recognition:

A network combining bidirectional long short-term memory (Bi-LSTM) with an attention mechanism (Attention) is used to perform text intent recognition.

Further, the named entity recognition of step 3 is carried out as follows:

1) Rule matching

Regular expressions are designed according to the requirements of the dialogue system and used to extract named entities by matching the fields that meet the requirements;

2) Entity dictionary

A named entity dictionary is built for the dialogue system, and the segmentation results obtained in step 1 are matched against it;

3) Model

The original corpus Sentence is obtained by collecting the dialogue system's historical corpus or by corpus generation, and each position in Sentence is labeled manually or automatically to complete the sequence labeling task. Labeling yields the labeled sentence Sentence^*, which consists of the tags B-T, I-T, O, E-T, and S-T; a named entity recognition model is then trained on it to realize model-based named entity recognition.

Further, in the model-based approach of step 3, sequence labeling may use the BIO scheme or the BIOES scheme. In the BIOES scheme, B (Begin) marks the beginning of an entity, I (Intermediate) marks the middle of an entity, O (Other) marks an irrelevant non-entity character, E (End) marks the end of an entity, and S (Single) marks an entity consisting of a single character.

Further, the named entity replacement of step 4 is carried out as follows: in the named entity recognition result {E:T} obtained in step 3, the entity type T is replaced with its specific character C, giving the replaced result set {E:C}. This set is applied to the cleaned segmentation result Token^* obtained in step 2: each word W contained in an entity name E is replaced with the specific character C, and the words are recombined to obtain the new corpus Sentence′.

Further, the network used for text intent recognition in step 6 consists mainly of four parts:

1) Input layer: the word vectors Vec of the dialogue system corpus obtained in step 5 are taken as the input V;

2) Bidirectional LSTM layer: a bidirectional long short-term memory network processes the input word vectors, with the forward pass giving the vector V_L and the backward pass giving the vector V_R; the two are concatenated to give the LSTM layer output vector V_C, where V_C = [V_L, V_R];

3) Attention layer: the LSTM layer output V_C is weighted by attention to give the output V_A, computed as follows:

V_m = tanh(V_C)

α = softmax(w^T V_m)

V_A = V_C α^T

where w is the weight matrix of the Attention layer.

4) Output layer: a Softmax classifier is applied to the Attention layer output V_A to predict the intent of the sentence and obtain the intent prediction result, where W_S and b_S are the weight matrix and bias of the output layer, respectively.
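The expression for the intent prediction is not reproduced in the text above. Given that a Softmax classifier with weight matrix W_S and bias b_S is applied to V_A, one plausible reading is the standard affine-plus-softmax form below. This is an assumption, and \hat{y} is a hypothetical symbol for the prediction result rather than notation taken from the source.

```latex
\hat{y} = \operatorname{softmax}\left(W_S V_A + b_S\right)
```

Under this reading, if V_A has dimension d and there are K intent classes, W_S is a K × d matrix, b_S is a K-dimensional vector, and the predicted intent is the class with the largest entry of \hat{y}.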

Beneficial effects: compared with the prior art, the method uses the named entity recognition results to replace entity names in the text with their entity types, which reduces the magnitude and imbalance of the dialogue system's corpus data and thereby improves the accuracy of intent recognition during dialogue.

Brief Description of the Drawings

Fig. 1 is a schematic flow chart of the dialogue intent recognition method based on entity replacement of the present invention;

Fig. 2 is an example of the text named entity replacement process of the present invention;

Fig. 3 shows a corpus sequence labeling scheme of the present invention;

Fig. 4 shows a network structure for text intent recognition of the present invention.

Detailed Description

The present invention is further described below with reference to specific embodiments, which are not intended to limit the invention.

A dialogue intent recognition method based on entity replacement, as shown in Fig. 1, comprises the following steps:

Step 1: Text segmentation

A word segmentation tool is used to segment the text produced by the speech recognition module, giving the segmentation result set Token, where the segmentation result can be represented as the set {W} and W denotes a segmented word.

Step 2: Text filtering

A stop-word lexicon is built according to the needs of the dialogue system; stop words typically include, but are not limited to, auxiliary words, modal particles, and conjunctions. The lexicon is used to filter the segmentation result set Token obtained in step 1, giving the cleaned result Token^*.
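Steps 1 and 2 can be sketched as follows. This is a minimal illustration, not the patented implementation: the source does not name a segmentation tool, so `segment` is a hypothetical stand-in that lowercases and splits on whitespace, and `STOP_WORDS` is an invented lexicon for an English example.

```python
# Hypothetical stop-word lexicon; a real dialogue system would build its own.
STOP_WORDS = {"um", "uh", "please", "the", "a", "and"}


def segment(text: str) -> list[str]:
    """Stand-in for a word segmentation tool: lowercase, then split on whitespace."""
    return text.lower().split()


def filter_stop_words(tokens: list[str], stop_words: set[str]) -> list[str]:
    """Drop every token found in the stop-word lexicon, preserving order."""
    return [w for w in tokens if w not in stop_words]


token = segment("Um please check the weather in Hangzhou tomorrow")  # Token = {W}
token_star = filter_stop_words(token, STOP_WORDS)                    # Token^*
print(token_star)
```

For Chinese text a dedicated segmenter would replace `segment`, since whitespace splitting does not apply; the filtering step stays the same.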

Step 3: Text named entity recognition

Named entity recognition includes, but is not limited to, the following three approaches, which may also be combined. The recognition result is {E:T}, where E denotes the entity name and T denotes the entity type.

1) Rule matching

Regular expressions are designed according to the requirements of the dialogue system and used to extract named entities of types such as telephone numbers, e-mail addresses, and ID card numbers, matching the fields that meet the requirements.

2) Entity dictionary

A named entity dictionary is built for the dialogue system, and the segmentation results obtained in step 1 are matched against it. Matching methods include, but are not limited to, multi-pattern string matching and segmentation-based matching.

3) Model

The original corpus Sentence is obtained by collecting the dialogue system's historical corpus or by corpus generation, and each position in Sentence is labeled manually or automatically to complete the sequence labeling task. Sequence labeling typically uses the BIO scheme or the BIOES scheme. Taking BIOES as an example, B (Begin) marks the beginning of an entity, I (Intermediate) marks the middle of an entity, O (Other) marks an irrelevant non-entity character, E (End) marks the end of an entity, and S (Single) marks an entity consisting of a single character. Labeling yields the labeled sentence Sentence^*, which consists of the tags B-T, I-T, O, E-T, and S-T; a named entity recognition model is then trained on it to realize model-based named entity recognition. Fig. 3 shows the sequence labeling result for the corpus data of a meal-ordering system. Named entity recognition can generally use models such as HMM or CRF; preferably, the present invention uses a bidirectional long short-term memory (BiLSTM) + conditional random field (CRF) model, which achieves better results.
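The three recognition approaches, and the way their results can be merged into a single {E:T} mapping, can be sketched as below. Everything specific here is an assumption made for illustration: the phone-number pattern, the `ENTITY_DICT` contents, the type names, and the hand-written `tags` list, which stands in for the output of a trained tagger such as BiLSTM+CRF. The sketch also tags word tokens rather than single characters.

```python
import re

# Hypothetical rule set and dictionary for a meal-ordering assistant.
PATTERNS = {"PHONE": re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)")}
ENTITY_DICT = {"rice": "DISH", "cola": "DRINK"}


def rule_match(text):
    """1) Rule matching: run each regular expression over the raw text."""
    return {m.group(): etype for etype, pat in PATTERNS.items() for m in pat.finditer(text)}


def dictionary_match(tokens):
    """2) Entity dictionary: exact lookup of each segmented word."""
    return {w: ENTITY_DICT[w] for w in tokens if w in ENTITY_DICT}


def decode_bioes(tokens, tags):
    """3) Model output: convert BIOES tags into {entity name: entity type}."""
    entities, buffer, current = {}, [], None
    for tok, tag in zip(tokens, tags):
        prefix, _, etype = tag.partition("-")
        if prefix == "S":                      # single-token entity
            entities[tok] = etype
            buffer, current = [], None
        elif prefix == "B":                    # open a new span
            buffer, current = [tok], etype
        elif prefix in ("I", "E") and buffer and etype == current:
            buffer.append(tok)
            if prefix == "E":                  # close the span
                entities[" ".join(buffer)] = current
                buffer, current = [], None
        else:                                  # "O" or a malformed sequence
            buffer, current = [], None
    return entities


text = "order kung pao chicken and rice call 13812345678"
tokens = text.split()
tags = ["O", "B-DISH", "I-DISH", "E-DISH", "O", "O", "O", "O"]  # stand-in for tagger output

# Mixed use of the three approaches; later sources win on conflicting names.
recognized = {**rule_match(text), **dictionary_match(tokens), **decode_bioes(tokens, tags)}
print(recognized)
```

Merging by dictionary union is only one possible policy; a production system might instead rank the sources or resolve overlapping spans explicitly.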

Step 4: Text named entity replacement

Each named entity type involved in the dialogue system is mapped one-to-one to a specific character, denoted {T:C}, where T denotes the entity type and C denotes the specific character. The chosen characters, which include but are not limited to English letters, Roman numerals, and Greek letters, must not occur in the dialogue system's corpus.

In the named entity recognition result {E:T} obtained in step 3, the entity type T is replaced with its specific character C, giving the replaced result set {E:C}. This set is applied to the cleaned segmentation result Token^* obtained in step 2: each word W contained in an entity name E is replaced with the specific character C, and the words are recombined to obtain the new corpus Sentence′.

For example, suppose the corpus contains three sentences S1, S2, and S3, which after segmentation are S1 = abc₁d, S2 = abc₂d, and S3 = abc₃d, where a, b, c₁, c₂, c₃, and d are different words in the segmentation result Token, and c₁, c₂, and c₃ are different entity names of the same named entity type. Replacing c₁, c₂, and c₃ with the specific character c₀ gives three entity-replaced sentences S1′, S2′, and S3′, where S1′ = abc₀d, S2′ = abc₀d, and S3′ = abc₀d. This reduces the diversity of the corpus seen by the intent recognition model and lowers the imbalance of the text data. Fig. 2 shows an example of named entity replacement on the corpus data of a weather query system.
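The replacement step can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the type names, the Greek placeholder characters in `TYPE_TO_CHAR`, and the sample sentences are invented, and each entity name is assumed to be a single token of Token^*.

```python
# Hypothetical one-to-one mapping {T: C}; the characters must not occur in the corpus.
TYPE_TO_CHAR = {"CITY": "α", "DATE": "β"}


def replace_entities(tokens, entities, type_to_char):
    """Replace every token that is an entity name E with the character C of its type T."""
    entity_to_char = {name: type_to_char[etype] for name, etype in entities.items()}  # {E: C}
    return " ".join(entity_to_char.get(tok, tok) for tok in tokens)


# {E: T} as produced by named entity recognition.
entities = {"hangzhou": "CITY", "nantong": "CITY", "beijing": "CITY", "tomorrow": "DATE"}

# Three cleaned sentences that differ only in the city name (c1, c2, c3 in the text).
corpus = [
    ["weather", "hangzhou", "tomorrow"],
    ["weather", "nantong", "tomorrow"],
    ["weather", "beijing", "tomorrow"],
]
new_corpus = [replace_entities(sentence, entities, TYPE_TO_CHAR) for sentence in corpus]
print(new_corpus)
```

All three sentences collapse to the same string, which is the effect described above: the intent model sees one pattern instead of one variant per city.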

Step 5: Text feature extraction

Starting from a pre-trained model such as BERT, GPT, XLNet, or XLM, the model is fine-tuned on the corpus Sentence′ obtained in step 4 to obtain a fine-tuned feature extraction model. This model is then used to obtain the word vectors Vec of the dialogue system corpus.

Step 6: Text intent recognition

The present invention uses a network combining bidirectional long short-term memory (Bi-LSTM) with an attention mechanism (Attention) to perform text intent recognition. As shown in Fig. 4, the network consists mainly of four parts:

1) Input layer: the word vectors Vec of the dialogue system corpus obtained in step 5 are taken as the input V;

2) Bidirectional LSTM layer: a bidirectional long short-term memory network processes the input word vectors, with the forward pass giving the vector V_L and the backward pass giving the vector V_R. The two are concatenated to give the LSTM layer output vector V_C, where V_C = [V_L, V_R];

3) Attention layer: the LSTM layer output V_C is weighted by attention to give the output V_A, computed as follows:

V_m = tanh(V_C)

α = softmax(w^T V_m)

V_A = V_C α^T

where w is the weight matrix of the Attention layer.

4) Output layer: a Softmax classifier is applied to the Attention layer output V_A to predict the intent of the sentence and obtain the intent prediction result, where W_S and b_S are the weight matrix and bias of the output layer, respectively.
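The attention pooling equations can be checked numerically with a small pure-Python sketch. Several things here are assumptions: V_C is treated as a d × T matrix with one column per token, w as a length-d vector, and the output layer as softmax(W_S V_A + b_S), a form the source implies but does not spell out. The Bi-LSTM itself is not implemented; `V_C` is filled with made-up numbers.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(V_C, w):
    """V_C: d x T matrix (one column per token); w: length-d weight vector."""
    d, T = len(V_C), len(V_C[0])
    V_m = [[math.tanh(v) for v in row] for row in V_C]                      # V_m = tanh(V_C)
    scores = [sum(w[i] * V_m[i][t] for i in range(d)) for t in range(T)]   # w^T V_m
    alpha = softmax(scores)                                                 # one weight per token
    V_A = [sum(V_C[i][t] * alpha[t] for t in range(T)) for i in range(d)]  # V_A = V_C alpha^T
    return V_A, alpha


def predict_intent(V_A, W_S, b_S):
    """Assumed output layer: softmax(W_S V_A + b_S), one probability per intent."""
    logits = [sum(W_S[k][i] * V_A[i] for i in range(len(V_A))) + b_S[k]
              for k in range(len(W_S))]
    return softmax(logits)


# Made-up Bi-LSTM output with d = 2 features and T = 3 tokens.
V_C = [[0.5, -1.0, 2.0],
       [1.5, 0.0, -0.5]]
w = [1.0, -1.0]
V_A, alpha = attention_pool(V_C, w)

# Made-up output layer for 3 intent classes.
W_S = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b_S = [0.0, 0.0, 0.0]
probs = predict_intent(V_A, W_S, b_S)
print(alpha, V_A, probs)
```

With w set to all zeros the attention weights become uniform and V_A reduces to the mean of the token columns, which is a convenient sanity check for the shapes.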

The method uses the named entity recognition results to replace entity names in the text with their entity types, which reduces the magnitude and imbalance of the dialogue system's corpus data and thereby improves the accuracy of intent recognition during dialogue.

The above are only preferred embodiments of the present invention and do not limit the invention in any form. Although the invention has been disclosed above by way of preferred embodiments, these are not intended to limit it. Any person skilled in the art may, without departing from the scope of the technical solution of the present invention, use the technical content disclosed above to make minor changes or modifications that amount to equivalent embodiments. Any simple modification, equivalent change, or refinement made to the above embodiments according to the technical essence of the present invention, without departing from the content of its technical solution, still falls within the scope of the technical solution of the present invention.

Claims

1. A dialogue intent recognition method based on entity replacement, characterized by comprising the following steps:

Step 1. Text segmentation:

A word segmentation tool is used to segment the text information obtained by the speech recognition module, obtaining the word segmentation result set Token;

Step 2. Text filtering:

A stop-word lexicon is built according to the needs of the dialogue system and is used to filter the word segmentation result set Token obtained in step 1, obtaining the cleaned result Token^*;

Step 3. Text Named Entity Recognition:

Perform named entity recognition on the text cleaning results obtained in step 2 through a deep learning model;

Step 4. Text named entity replacement:

Each named entity type involved in the dialogue system is mapped one-to-one to a specific character, denoted {T:C}, and the text is recombined to obtain a new corpus, where T represents the entity type and C represents the specific character; the chosen specific characters must not occur in the corpus of the dialogue system;

Step 5. Text feature extraction:

Based on different types of pre-training models, using the new corpus obtained in step 4, fine-tune the above-mentioned pre-training model to obtain a fine-tuned feature extraction model; use the fine-tuned feature extraction model to obtain the word vector Vec of the dialogue system corpus;

Step 6. Text Intent Recognition:

The network structure of bi-directional long short-term memory Bi-LSTM + attention mechanism Attention is used to realize text intent recognition.

2. The dialogue intent recognition method based on entity replacement according to claim 1, characterized in that: the word segmentation result set Token of step 1 is represented as the set {W}, where W represents a segmented word.

3. The dialogue intent recognition method based on entity replacement according to claim 1, characterized in that: in step 3, named entity recognition yields the named entity recognition result {E:T}, where E represents the entity name and T represents the entity type.

4. The dialogue intent recognition method based on entity replacement according to claim 1, characterized in that the text named entity recognition of step 3 is carried out as follows:

1) Based on rule matching,

Design corresponding regular expressions according to the requirements of the dialogue system, extract named entities based on regular expressions, and match fields that meet the requirements;

2) Based on entity dictionary

Build a corresponding named entity dictionary according to the dialogue system, and match the word segmentation results obtained in step 1 based on the named entity dictionary;

3) Model based

The original corpus Sentence is obtained by collecting the historical corpus of the dialogue system or by corpus generation, and each position in Sentence is labeled manually or automatically to complete the sequence labeling task; after labeling, the labeled sentence Sentence^* is obtained, and model-based named entity recognition is then realized by training a named entity recognition model.

5. The dialogue intent recognition method based on entity replacement according to claim 4, wherein the labeled sentence Sentence^* is composed of B-T, I-T, O, E-T, and S-T.

6. The dialogue intent recognition method based on entity replacement according to claim 4, wherein in the model-based approach, sequence labeling may use the BIO scheme or the BIOES scheme.

7. The dialogue intent recognition method based on entity replacement according to claim 6, characterized in that: in the BIOES scheme, B is Begin, representing the beginning of an entity; I is Intermediate, representing the middle of an entity; O is Other, representing an irrelevant non-entity character; E is End, representing the end of an entity; and S is Single, representing an entity that consists of a single character.

8. The dialogue intent recognition method based on entity replacement according to claim 1, characterized in that the text named entity replacement of step 4 is carried out as follows: in the named entity recognition result {E:T} obtained in step 3, the entity type T is replaced with the specific character C, obtaining the replaced result set {E:C}; this set is applied to the word segmentation result Token^* obtained in step 2, each word W contained in an entity name E is replaced with the specific character C, and the words are recombined to obtain the new corpus Sentence′.

9. The dialogue intent recognition method based on entity replacement according to claim 1, wherein the network structure used for text intent recognition in step 6 is mainly composed of 4 parts, specifically:

1) Input layer: the word vectors Vec of the dialogue system corpus obtained in step 5 are taken as the input V;

2) Bidirectional LSTM layer: a bidirectional long short-term memory network processes the word vectors of the input layer, with the forward calculation giving the vector V_L and the backward calculation giving the vector V_R; the forward and backward vectors are concatenated to obtain the LSTM layer output vector V_C, where V_C = [V_L, V_R];

3) Attention layer: Attention weighting is applied to the output vector V_C of the LSTM layer to obtain the output result V_A, calculated as follows:

V_m = tanh(V_C)

α = softmax(w^T V_m)

V_A = V_C α^T

where w is the weight matrix of the Attention layer.

4) Output layer: a Softmax classifier is applied to the output result V_A of the Attention layer to predict the sentence intent and obtain the intent prediction result, where W_S and b_S are the weight matrix and bias value of the output layer, respectively.

10. The dialogue intent recognition method based on entity replacement according to claim 1, wherein in step 3 a bidirectional long short-term memory (BiLSTM) + conditional random field (CRF) model is used to realize named entity recognition, which can achieve better results.