CN111553157A - Entity replacement-based dialog intention identification method - Google Patents

Info

Publication number
CN111553157A
Authority
CN
China
Prior art keywords
entity
text
recognition
named entity
corpus
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010271707.1A
Other languages
Chinese (zh)
Inventor
张堃
王天宇
周波
李文俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Borazhe Technology Co ltd
Nantong University
Original Assignee
Hangzhou Borazhe Technology Co ltd
Nantong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Borazhe Technology Co ltd, Nantong University filed Critical Hangzhou Borazhe Technology Co ltd
Priority to CN202010271707.1A priority Critical patent/CN111553157A/en
Publication of CN111553157A publication Critical patent/CN111553157A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Creation or modification of classes or clusters
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a dialogue intent recognition method based on entity replacement, comprising the following steps: step 1, text segmentation; step 2, text filtering; step 3, named entity recognition; step 4, named entity replacement; step 5, text feature extraction; step 6, text intent recognition. The method uses the named entity recognition result to replace the entity names in the text with their entity types, which reduces the volume and the imbalance of the dialogue system's corpus data and thereby improves the accuracy of intent recognition during dialogue.

Description

A Dialogue Intent Recognition Method Based on Entity Replacement

Technical Field

The present invention relates to dialogue intent recognition methods, and in particular to a dialogue intent recognition method based on entity replacement.

Background

In recent years, driven by rapid progress in artificial intelligence and semiconductor chip technology and by growing demand for voice interaction, application products built on dialogue systems, such as smart speakers, smart furniture, and intelligent voice customer service, have proliferated in the market.

Such dialogue systems generally consist of five modules: speech recognition (ASR), natural language understanding (NLU), dialogue management (DM), natural language generation (NLG), and speech synthesis (TTS). Deep learning already provides good solutions for the speech recognition module, and the natural language generation and speech synthesis modules are comparatively easy to control, so the difficulty in designing a dialogue system lies mainly in the natural language understanding and dialogue management modules. The goal of the natural language understanding module is to convert the text produced by the speech recognition module into a semantic representation, giving the machine a human-like ability to understand language. The accuracy of the language understanding module is therefore a precondition for the normal operation of the dialogue system.

With the continuing improvement of deep learning algorithms, computing power, and big data technology, the intent recognition accuracy of simple dialogue systems, such as voice meal-ordering and voice song-request systems, has largely reached a commercially usable level. In complex dialogue systems, however, the volume of corpus data and the complexity of the intents are significantly higher, and problems such as imbalanced corpus data and a large number of intent types make dialogue intent recognition harder. For example, the invention patent "Method and Device for Understanding Natural Language Intent in Human-Computer Interaction" (CN201710219326) takes the word vectors of the text as input and uses an intent recognition model to obtain the intent type of the text. When the class distribution of the training samples is unbalanced, such an intent recognition model is prone to severe overfitting and underfitting, which limits its applicability. As another example, the invention patent "An Intent Recognition Method and Device" (CN201811368503) feeds the text into at least one intent recognition model, generates a prediction result for each model, and then determines the intent of the text. As the number of intent types grows, the cost and difficulty of training the models in that method rise sharply, so it is not suitable for intent recognition in complex dialogue systems.

Summary of the Invention

Purpose of the invention: the present invention aims to remedy the shortcomings of the prior art by providing a dialogue intent recognition method based on entity replacement. The method uses the named entity recognition result to replace the entity names in the text with their entity types, which reduces the volume and the imbalance of the dialogue system's corpus data and thereby improves the accuracy of intent recognition during dialogue.

Technical solution: to achieve the above purpose, the present invention adopts the following technical solution.

A dialogue intent recognition method based on entity replacement, comprising the following steps:

Step 1. Text segmentation:

A word segmentation tool segments the text produced by the speech recognition module, giving the segmentation result set Token. The segmentation result is represented as the set {W}, where W denotes a segmented word.

Step 2. Text filtering:

A stop-word lexicon is built according to the needs of the dialogue system and used to filter the segmentation result set Token obtained in step 1, giving the cleaned result Token*.

Step 3. Named entity recognition:

Named entity recognition yields the result {E:T}, where E denotes an entity name and T denotes an entity type.

Step 4. Named entity replacement:

The named entity types involved in the dialogue system are mapped one-to-one to specific characters, denoted {T:C}, and the text is recombined into a new corpus. Here T denotes an entity type and C denotes a specific character. The chosen characters must not occur anywhere in the dialogue system's corpus.

Step 5. Text feature extraction:

Starting from one of several types of pre-trained model, the pre-trained model is fine-tuned on the new corpus obtained in step 4, giving a fine-tuned feature extraction model. The fine-tuned model is then used to obtain the word vectors Vec of the dialogue system corpus.

Step 6. Text intent recognition:

A network combining bidirectional long short-term memory (Bi-LSTM) with an attention mechanism (Attention) performs text intent recognition.

Further, the named entity recognition of step 3 is carried out as follows:

1) Rule-based matching

Regular expressions are designed according to the requirements of the dialogue system, and named entities are extracted by matching the fields that satisfy those expressions.

2) Entity dictionary

A named entity dictionary is built for the dialogue system, and the segmentation result obtained in step 1 is matched against it.

3) Model-based

The original corpus Sentence is obtained by collecting the dialogue system's historical corpus or by generating corpus data. Each position in Sentence is labeled manually or automatically, which completes the sequence labeling task. The labeled sentence Sentence*, composed of the tags B-T, I-T, O, E-T, and S-T, is then used to train a named entity recognition model, which performs the model-based named entity recognition.

Further, in the model-based approach of step 3, sequence labeling may use either the BIO scheme or the BIOES scheme. In the BIOES scheme, B stands for Begin and marks the start of an entity; I stands for Intermediate and marks the middle of an entity; O stands for Other and marks characters that are not part of any entity; E stands for End and marks the end of an entity; S stands for Single and marks an entity consisting of a single character.

Further, the named entity replacement of step 4 is carried out as follows: each entity type T in the named entity recognition result {E:T} obtained in step 3 is replaced with its specific character C, giving the replaced result set {E:C}. This set is applied to the segmentation result Token* obtained in step 2: every word W contained in an entity name E is replaced with the specific character C, and the words are recombined into the new corpus Sentence′.

Further, the network used for text intent recognition in step 6 consists mainly of four parts:

1) Input layer: the word vectors Vec of the dialogue system corpus obtained in step 5 are taken as the input V.

2) Bidirectional LSTM layer: a bidirectional long short-term memory network computes a forward vector V_L and a backward vector V_R from the word vectors of the input layer. The two are concatenated to give the LSTM layer output V_C, where V_C = [V_L, V_R].

3) Attention layer: the LSTM layer output V_C is weighted by attention to give the output V_A, computed as follows:

\[ V_m = \tanh(V_C) \]

\[ \alpha = \mathrm{softmax}(w^{T} V_m) \]

\[ V_A = V_C \alpha^{T} \]

where w is the weight matrix of the Attention layer.

4) Output layer: a Softmax classifier is applied to the Attention layer output V_A to predict the intent of the sentence, giving the intent prediction result.

[Formula images BDA0002441736800000051 and BDA0002441736800000052 are not reproduced in this text.]

where W_S and b_S are the weight matrix and the bias of the output layer, respectively.

Beneficial effect: compared with the prior art, the method uses the named entity recognition result to replace the entity names in the text with their entity types, which reduces the volume and the imbalance of the dialogue system's corpus data and thereby improves the accuracy of intent recognition during dialogue.

Brief Description of the Drawings

Figure 1 is a schematic flow chart of the dialogue intent recognition method based on entity replacement according to the present invention;

Figure 2 is an example of the named entity replacement process according to the present invention;

Figure 3 shows a corpus sequence labeling scheme according to the present invention;

Figure 4 shows a network structure for text intent recognition according to the present invention.

Detailed Description

The present invention is further described below with reference to specific embodiments, which are not intended to limit the invention.

A dialogue intent recognition method based on entity replacement, as shown in Figure 1, comprises the following steps:

Step 1: Text segmentation

A word segmentation tool segments the text produced by the speech recognition module, giving the segmentation result set Token. The segmentation result can be represented as the set {W}, where W denotes a segmented word.

Step 2: Text filtering

A stop-word lexicon is built according to the needs of the dialogue system. Stop words typically include, but are not limited to, auxiliary words, modal particles, and conjunctions. The lexicon is used to filter the segmentation result set Token obtained in step 1, giving the cleaned result Token*.
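Steps 1 and 2 can be illustrated with a minimal Python sketch. It assumes the utterance has already been segmented into a token list (the source does not name a specific segmentation tool), and the function name `filter_tokens`, the sample tokens, and the stop-word set are all hypothetical.

```python
def filter_tokens(tokens, stop_words):
    """Return Token*: the words of Token that are not stop words, in order."""
    return [w for w in tokens if w not in stop_words]


# Hypothetical segmented utterance standing in for the set Token.
# A real system would obtain it from a word segmentation tool.
token = ["please", "check", "the", "weather", "in", "Nantong", "ah"]

# Hypothetical stop-word lexicon built for this dialogue system.
stop_words = {"please", "the", "ah"}

token_star = filter_tokens(token, stop_words)
print(token_star)
```

A list is used instead of a set for Token* so that word order survives for the recombination performed in step 4.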

Step 3: Named entity recognition

Named entity recognition includes, but is not limited to, the following three approaches, which may also be combined. The recognition result is {E:T}, where E denotes an entity name and T denotes an entity type.

1) Rule-based matching

Regular expressions are designed according to the requirements of the dialogue system and used to extract named entities such as telephone numbers, e-mail addresses, and identity card numbers by matching the fields that satisfy those expressions.

2) Entity dictionary

A named entity dictionary is built for the dialogue system, and the segmentation result obtained in step 1 is matched against it. Matching methods include, but are not limited to, multi-pattern string matching and segmentation-based matching.

3) Model-based

The original corpus Sentence is obtained by collecting the dialogue system's historical corpus or by generating corpus data. Each position in Sentence is labeled manually or automatically, which completes the sequence labeling task. Sequence labeling usually uses either the BIO scheme or the BIOES scheme. Taking BIOES as an example, B stands for Begin and marks the start of an entity; I stands for Intermediate and marks the middle of an entity; O stands for Other and marks characters that are not part of any entity; E stands for End and marks the end of an entity; S stands for Single and marks an entity consisting of a single character. The labeled sentence Sentence*, composed of the tags B-T, I-T, O, E-T, and S-T, is then used to train a named entity recognition model, which performs the model-based named entity recognition. Figure 3 shows the sequence labeling result for the corpus data of a meal-ordering system. Named entity recognition can generally use models such as HMM or CRF. Preferably, the present invention uses a bidirectional long short-term memory (BiLSTM) + conditional random field (CRF) model, which achieves better results.
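The three recognition approaches and their combination into a single {E:T} result can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: the regular expressions (an 11-digit number starting with 1, an ASCII e-mail address), the dictionary contents, the entity type names, and the function names are hypothetical, and the BIOES tags are supplied by hand where a trained model such as BiLSTM+CRF would normally produce them.

```python
import re

# Hypothetical rules; a real system would design these for its own data.
RULES = {
    "PHONE": re.compile(r"(?<!\d)1\d{10}(?!\d)"),
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+"),
}

# Hypothetical entity dictionary mapping entity name E to entity type T.
ENTITY_DICT = {"Nantong": "CITY", "Hangzhou": "CITY"}


def rule_entities(text):
    """Approach 1: collect {E: T} from regular-expression matches on raw text."""
    found = {}
    for etype, pattern in RULES.items():
        for match in pattern.finditer(text):
            found[match.group()] = etype
    return found


def dict_entities(tokens):
    """Approach 2: look each segmented word up in the entity dictionary."""
    return {w: ENTITY_DICT[w] for w in tokens if w in ENTITY_DICT}


def decode_bioes(units, tags, sep=""):
    """Approach 3 (decoding only): turn BIOES tags into {E: T}.

    An E tag closes the span opened by the most recent B tag. This sketch
    does not check that the B and E tags carry the same entity type.
    """
    found, start = {}, None
    for i, tag in enumerate(tags):
        if tag == "O":
            start = None
            continue
        prefix, etype = tag.split("-", 1)
        if prefix == "S":
            found[units[i]] = etype
            start = None
        elif prefix == "B":
            start = i
        elif prefix == "E" and start is not None:
            found[sep.join(units[start:i + 1])] = etype
            start = None
        # An I tag needs no action: it only continues the open span.
    return found


text = "book Hangzhou hotel, call 13812345678"
tokens = text.replace(",", " ").split()

entities = {}
entities.update(rule_entities(text))
entities.update(dict_entities(tokens))
entities.update(
    decode_bioes(["order", "kung", "pao", "chicken"],
                 ["O", "B-DISH", "I-DISH", "E-DISH"], sep=" ")
)
print(entities)
```

When several approaches report the same entity name, the later `update` call wins in this sketch. A production system would need an explicit priority rule, which the source does not specify.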

Step 4: Named entity replacement

The named entity types involved in the dialogue system are mapped one-to-one to specific characters, denoted {T:C}, where T denotes an entity type and C denotes a specific character. The chosen characters must not occur anywhere in the dialogue system's corpus. Candidates include, but are not limited to, English characters, Roman numerals, and Greek letters.
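One way to build the {T:C} mapping while enforcing the requirement that no chosen character occurs in the corpus is sketched below. The function name `build_type_map`, the entity type names, and the candidate pool of Greek letters are assumptions made for illustration.

```python
def build_type_map(entity_types, corpus, candidates):
    """Assign each entity type T a distinct character C absent from the corpus."""
    used = set("".join(corpus))
    # dict.fromkeys removes duplicate candidates while keeping their order,
    # so the resulting mapping stays one-to-one.
    free = [c for c in dict.fromkeys(candidates) if c not in used]
    if len(free) < len(entity_types):
        raise ValueError("not enough unused placeholder characters")
    return dict(zip(entity_types, free))


# Hypothetical corpus in which the letter alpha already occurs.
corpus = ["weather in Nantong α", "call 13812345678"]
type_map = build_type_map(["CITY", "PHONE"], corpus, "αβγδ")
print(type_map)
```

Raising an error when the pool is exhausted makes a collision visible at setup time, before any sentence has been rewritten.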

Each entity type T in the named entity recognition result {E:T} obtained in step 3 is replaced with its specific character C, giving the replaced result set {E:C}. This set is applied to the segmentation result Token* obtained in step 2: every word W contained in an entity name E is replaced with the specific character C, and the words are recombined into the new corpus Sentence′.

For example, suppose the corpus contains three sentences S1, S2, and S3, which after word segmentation become S1 = a b c_1 d, S2 = a b c_2 d, and S3 = a b c_3 d. Here a, b, c_1, c_2, c_3, and d denote different words in the segmentation result Token, and c_1, c_2, and c_3 are different entity names of the same named entity type. Replacing c_1, c_2, and c_3 with the specific character c_0 gives three replaced sentences S1′, S2′, and S3′ with S1′ = S2′ = S3′ = a b c_0 d. This narrows the diversity of the corpus seen by the intent recognition model and lowers the imbalance of the text data. Figure 2 shows an example of named entity replacement on the corpus data of a weather query system.
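The S1, S2, S3 example can be reproduced with a short sketch. It assumes every entity name is a single token of Token*; the function name `replace_entities` and the type label `T0` are hypothetical.

```python
def replace_entities(tokens, entities, type_map):
    """Replace every token that is an entity name E with the character C of its type.

    tokens:   cleaned segmentation result Token*
    entities: recognition result {E: T}
    type_map: one-to-one mapping {T: C}
    """
    entity_to_char = {e: type_map[t] for e, t in entities.items()}  # {E: C}
    return [entity_to_char.get(w, w) for w in tokens]


# c1, c2, c3 are different entity names of the same (hypothetical) type T0.
entities = {"c1": "T0", "c2": "T0", "c3": "T0"}
type_map = {"T0": "c0"}

s1 = replace_entities(["a", "b", "c1", "d"], entities, type_map)
s2 = replace_entities(["a", "b", "c2", "d"], entities, type_map)
s3 = replace_entities(["a", "b", "c3", "d"], entities, type_map)
print(s1, s2, s3)
```

All three sentences collapse to the same token sequence, which is the reduction in corpus diversity that the method relies on.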

Step 5: Text feature extraction

Starting from a pre-trained model such as BERT, GPT, XLNet, or XLM, the pre-trained model is fine-tuned on the corpus Sentence′ obtained in step 4, giving a fine-tuned feature extraction model. The fine-tuned model is then used to obtain the word vectors Vec of the dialogue system corpus.

Step 6: Text intent recognition

The present invention uses a network combining bidirectional long short-term memory (Bi-LSTM) with an attention mechanism (Attention) for text intent recognition. As shown in Figure 4, the network consists mainly of four parts:

1) Input layer: the word vectors Vec of the dialogue system corpus obtained in step 5 are taken as the input V.

2) Bidirectional LSTM layer: a bidirectional long short-term memory network computes a forward vector V_L and a backward vector V_R from the word vectors of the input layer. The two are concatenated to give the LSTM layer output V_C, where V_C = [V_L, V_R].

3) Attention layer: the LSTM layer output V_C is weighted by attention to give the output V_A, computed as follows:

\[ V_m = \tanh(V_C) \]

\[ \alpha = \mathrm{softmax}(w^{T} V_m) \]

\[ V_A = V_C \alpha^{T} \]

where w is the weight matrix of the Attention layer.

4) Output layer: a Softmax classifier is applied to the Attention layer output V_A to predict the intent of the sentence, giving the intent prediction result.

[Formula images BDA0002441736800000101 and BDA0002441736800000102 are not reproduced in this text.]

where W_S and b_S are the weight matrix and the bias of the output layer, respectively.
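The attention and output layers can be sketched in plain Python to make the shapes concrete. Several points here are assumptions rather than statements from the source: V_C is taken to be a d x T matrix with one column per time step, w is treated as a length-d vector so that w^T V_m yields one score per time step, and the output layer is assumed to have the common form softmax(W_S V_A + b_S), since the source formula images are unavailable. The Bi-LSTM itself is not implemented; a small hand-written matrix stands in for its output.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(v_c, w):
    """Compute alpha = softmax(w^T tanh(V_C)) and V_A = V_C alpha^T.

    v_c: d x T matrix as a list of d rows (assumed layout)
    w:   length-d weight vector (assumed shape)
    """
    d, t = len(v_c), len(v_c[0])
    v_m = [[math.tanh(x) for x in row] for row in v_c]
    scores = [sum(w[i] * v_m[i][j] for i in range(d)) for j in range(t)]
    alpha = softmax(scores)
    v_a = [sum(v_c[i][j] * alpha[j] for j in range(t)) for i in range(d)]
    return alpha, v_a


def predict_intent(v_a, w_s, b_s):
    """Assumed output layer: softmax(W_S V_A + b_S)."""
    logits = [sum(wk * x for wk, x in zip(row, v_a)) + b
              for row, b in zip(w_s, b_s)]
    return softmax(logits)


# Hand-written stand-in for the Bi-LSTM output: d = 2 features, T = 3 steps.
v_c = [[1.0, 2.0, 3.0],
       [0.0, 1.0, 0.0]]
alpha, v_a = attention_pool(v_c, w=[0.0, 0.0])
probs = predict_intent(v_a, w_s=[[1.0, 0.0], [0.0, 1.0]], b_s=[0.0, 0.0])
print(alpha, v_a, probs)
```

With an all-zero w every time step receives the same score, so the attention weights are uniform and V_A reduces to the mean of the columns of V_C, which gives a quick sanity check on the implementation.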

The method uses the named entity recognition result to replace the entity names in the text with their entity types, which reduces the volume and the imbalance of the dialogue system's corpus data and thereby improves the accuracy of intent recognition during dialogue.

The above describes only preferred embodiments of the present invention and does not limit the invention in any form. Although the invention has been disclosed above through preferred embodiments, they are not intended to limit it. A person skilled in the art may, without departing from the scope of the technical solution of the present invention, use the technical content disclosed above to make minor changes or modifications that amount to equivalent embodiments. Any simple modification, equivalent change, or refinement made to the above embodiments according to the technical substance of the present invention, without departing from the content of its technical solution, still falls within the scope of the technical solution of the present invention.

Claims (10)

1.一种基于实体替换的对话意图识别方法,其特征在于,包括以下步骤:1. a dialogue intention recognition method based on entity replacement, is characterized in that, comprises the following steps: 步骤一、文本分词:Step 1. Text segmentation: 利用分词工具对语音识别模块所得到的文本信息进行分词,得到分词结果集合Token;Use the word segmentation tool to segment the text information obtained by the speech recognition module, and obtain the token result set of word segmentation; 步骤二、文本过滤:Step 2. Text filtering: 根据对话系统建立所需的停用词词库,利用停用词词库对步骤一所得的分词结果集合Token进行文本信息过滤,得到文本清洗后的结果Token*Establish the required stop word thesaurus according to the dialogue system, use the stop word thesaurus to filter the text information of the word segmentation result set Token obtained in step 1, and obtain the result Token * after the text cleaning; 步骤三、文本命名实体识别:Step 3. Text Named Entity Recognition: 通过深度学习模型对步骤二所得的文本清洗结果进行命名实体识别;Perform named entity recognition on the text cleaning results obtained in step 2 through a deep learning model; 步骤四、文本命名实体替换:Step 4. Text named entity replacement: 用特定字符将对话系统中所涉及到的命名实体类型做一一映射,记为{T∶C},重新组合得到新的语料,其中T代表实体类型,C代表特定字符;所选特定字符需确保不存在于对话系统的语料中;Use specific characters to map the named entity types involved in the dialogue system one-to-one, denoted as {T:C}, and recombine to obtain a new corpus, where T represents the entity type, and C represents a specific character; Make sure that it does not exist in the corpus of the dialogue system; 步骤五、文本特征提取:Step 5. Text feature extraction: 基于不同类型的预训练模型,利用步骤四中得到的新的语料,对上述预训练模型进行微调,得到微调后的特征提取模型;利用微调后的特征提取模型得到对话系统语料的词向量Vec;Based on different types of pre-training models, using the new corpus obtained in step 4, fine-tune the above-mentioned pre-training model to obtain a fine-tuned feature extraction model; use the fine-tuned feature extraction model to obtain the word vector Vec of the dialogue system corpus; 步骤六、文本意图识别:Step 6. 
Text Intent Recognition: text intent recognition is implemented with a network structure that combines a bidirectional long short-term memory network (Bi-LSTM) with an attention mechanism.

2. The entity replacement-based dialog intention identification method according to claim 1, characterized in that: the word segmentation result Token of step 1 is expressed as a set {W}, where W denotes a segmented word.

3. The entity replacement-based dialog intention identification method according to claim 1, characterized in that: in step 3, named entity recognition yields a recognition result {E:T}, where E denotes the entity name and T the entity type.

4. The entity replacement-based dialog intention identification method according to claim 1, characterized in that: the text named entity recognition of step 3 proceeds as follows:
1) Rule-based matching: regular expressions are designed according to the requirements of the dialogue system, and named entities are extracted with them by matching the fields that meet the requirements.
2) Entity dictionary: a named entity dictionary is built for the dialogue system, and the word segmentation result obtained in step 1 is matched against it.
3) Model-based: the original corpus Sentence is obtained by collecting the historical corpus of the dialogue system or by corpus generation; each position in Sentence is labeled manually or automatically, which completes the sequence labeling task; labeling yields the labeled sentence Sentence*, on which a named entity recognition model is trained to implement model-based named entity recognition.

5. The entity replacement-based dialog intention identification method according to claim 4, characterized in that: the labeled sentence Sentence* is composed of the tags B-T, I-T, O, E-T and S-T.

6. The entity replacement-based dialog intention identification method according to claim 4, characterized in that: in the model-based approach, sequence labeling may use the BIO labeling scheme or the BIOES labeling scheme.

7. The entity replacement-based dialog intention identification method according to claim 6, characterized in that: in the BIOES labeling scheme, B (Begin) marks the start of an entity, I (Intermediate) marks the middle of an entity, O (Other) marks an irrelevant non-entity character, E (End) marks the end of an entity, and S (Single) marks an entity consisting of a single character.

8. The entity replacement-based dialog intention identification method according to claim 1, characterized in that: the text named entity replacement of step 4 proceeds as follows: the T in the named entity recognition result {E:T} obtained in step 3 is replaced with a specific character C, giving the replaced result set {E:C}; this set is applied to the word segmentation result Token* obtained in step 2, where every word W contained in an entity name E is replaced with the specific character C, and the words are then recombined into the new corpus Sentence′.

9. The entity replacement-based dialog intention identification method according to claim 1, characterized in that: the network structure used for text intent recognition in step 6 consists mainly of four parts:
1) Input layer: the word vectors Vec of the dialogue system corpus obtained in step 5 serve as the input V.
2) Bidirectional LSTM layer: a bidirectional long short-term memory network runs over the word vectors of the input layer; the forward computation gives the vector V_L and the backward computation gives the vector V_R; the two are concatenated into the LSTM layer output vector V_C, where V_C = [V_L, V_R].
3) Attention layer: attention weighting is applied to the LSTM layer output V_C to obtain the output V_A, computed as follows:
$$V_m = \tanh(V_C)$$
$$\alpha = \mathrm{softmax}(w^{T} V_m)$$
$$V_A = V_C \alpha^{T}$$
where w is the weight matrix of the Attention layer.
4) Output layer: a Softmax classifier is applied to the Attention layer output V_A to predict the sentence intent, giving the intent prediction result (formula image FDA0002441736790000041), which is computed by the formula in image FDA0002441736790000042, where W_S and b_S are the weight matrix and bias value of the output layer, respectively.
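The Attention and output layers of claim 9 can be sketched in plain Python as follows. This is only an illustrative sketch under stated assumptions: V_C is taken to be a d × n matrix with one column per token, w is treated as a length-d vector, and the output formula softmax(W_S V_A + b_S) is an assumed form, because the source gives that formula only as an image. The function names `attention_pool` and `predict_intent` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(V_C, w):
    """Apply the three Attention-layer equations of claim 9.

    V_C: d rows, each holding n values (one column per token).  Assumed layout.
    w:   attention weights, assumed here to be a vector of length d.
    Returns (alpha, V_A): n attention weights and the pooled d-dimensional vector.
    """
    d, n = len(V_C), len(V_C[0])
    # V_m = tanh(V_C), element-wise
    V_m = [[math.tanh(v) for v in row] for row in V_C]
    # alpha = softmax(w^T V_m): one score per token
    scores = [sum(w[i] * V_m[i][t] for i in range(d)) for t in range(n)]
    alpha = softmax(scores)
    # V_A = V_C alpha^T: attention-weighted sum of the token columns
    V_A = [sum(V_C[i][t] * alpha[t] for t in range(n)) for i in range(d)]
    return alpha, V_A


def predict_intent(V_A, W_S, b_S):
    """Assumed output layer: softmax(W_S V_A + b_S), one probability per intent."""
    logits = [
        sum(W_S[k][i] * V_A[i] for i in range(len(V_A))) + b_S[k]
        for k in range(len(W_S))
    ]
    return softmax(logits)


# Toy example: d = 2 hidden dimensions, n = 3 tokens, 2 intent classes.
V_C = [[0.5, -1.0, 2.0],
       [1.0, 0.0, -0.5]]
alpha, V_A = attention_pool(V_C, w=[1.0, 0.5])
probs = predict_intent(V_A, W_S=[[1.0, 0.0], [0.0, 1.0]], b_S=[0.0, 0.0])
```

With all-zero attention weights every token receives the same weight, so V_A reduces to the mean of the token columns; this is a convenient sanity check for the pooling step.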
10. The entity replacement-based dialog intention identification method according to claim 1, characterized in that: in step 3, named entity recognition is implemented with a bidirectional long short-term memory (BiLSTM) + conditional random field (CRF) model, which can achieve better results.
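The path from a BIOES-labeled sentence (claims 5 to 7) to the recognition result {E:T} (claim 3) and on to the replaced corpus Sentence′ (claim 8) can be sketched as below. The example sentence, the entity types `DATE` and `LOC`, the placeholder strings, and the function names are all hypothetical. Collapsing adjacent identical placeholders is an added simplification for multi-token entities, not something the claims specify.

```python
def decode_bioes(chars, tags):
    """Collect {entity_name: entity_type} from character-level BIOES tags."""
    entities, buf, buf_type = {}, [], None
    for ch, tag in zip(chars, tags):
        prefix, _, etype = tag.partition("-")
        if prefix == "S":                      # single-character entity
            entities[ch] = etype
            buf, buf_type = [], None
        elif prefix == "B":                    # entity starts
            buf, buf_type = [ch], etype
        elif prefix == "I" and buf and etype == buf_type:
            buf.append(ch)                     # entity continues
        elif prefix == "E" and buf and etype == buf_type:
            buf.append(ch)                     # entity ends: record it
            entities["".join(buf)] = etype
            buf, buf_type = [], None
        else:                                  # "O" or malformed sequence
            buf, buf_type = [], None
    return entities


def replace_entities(tokens, entities, placeholders):
    """Replace every token contained in an entity name E with the character C
    assigned to that entity's type, then recombine the tokens (claim 8)."""
    out = []
    for tok in tokens:
        sub = tok
        for name, etype in entities.items():
            if tok in name:
                sub = placeholders[etype]
                break
        # Assumption: an entity split over several tokens yields one placeholder.
        if sub != tok and out and out[-1] == sub:
            continue
        out.append(sub)
    return "".join(out)


# Hypothetical sentence: "book a ticket to Shanghai for tomorrow".
chars = list("订明天去上海的票")
tags = ["O", "B-DATE", "E-DATE", "O", "B-LOC", "E-LOC", "O", "O"]
entities = decode_bioes(chars, tags)           # {E: T}

tokens = ["订", "明天", "去", "上海", "的", "票"]  # word segmentation result
placeholders = {"DATE": "<DATE>", "LOC": "<LOC>"}
sentence_new = replace_entities(tokens, entities, placeholders)
print(sentence_new)  # 订<DATE>去<LOC>的票
```

Replacing concrete entity names with one symbol per type means that sentences differing only in their entities map to the same text before vectorization and intent classification.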
CN202010271707.1A 2020-04-08 2020-04-08 Entity replacement-based dialog intention identification method Pending CN111553157A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010271707.1A CN111553157A (en) 2020-04-08 2020-04-08 Entity replacement-based dialog intention identification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010271707.1A CN111553157A (en) 2020-04-08 2020-04-08 Entity replacement-based dialog intention identification method

Publications (1)

Publication Number Publication Date
CN111553157A true CN111553157A (en) 2020-08-18

Family

ID=72002342

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010271707.1A Pending CN111553157A (en) 2020-04-08 2020-04-08 Entity replacement-based dialog intention identification method

Country Status (1)

Country Link
CN (1) CN111553157A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112905774A (en) * 2021-02-22 2021-06-04 武汉市聚联科软件有限公司 Human-computer conversation deep intention understanding method based on affair map
CN113779229A (en) * 2021-08-31 2021-12-10 康键信息技术(深圳)有限公司 User requirement identification matching method, device, equipment and readable storage medium
CN114491027A (en) * 2022-01-13 2022-05-13 天津车之家软件有限公司 A text intent recognition method, device and computing device
CN115064170A (en) * 2022-08-17 2022-09-16 广州小鹏汽车科技有限公司 Voice interaction method, server and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101598495A (en) * 2009-06-29 2009-12-09 上海泽玛克敏达机械设备有限公司 Condensate water recovery device
CN108874774A (en) * 2018-06-05 2018-11-23 浪潮软件股份有限公司 A kind of service calling method and system based on intention understanding
CN109359293A (en) * 2018-09-13 2019-02-19 内蒙古大学 Mongolian Named Entity Recognition Method and Recognition System Based on Neural Network
CN110287479A (en) * 2019-05-20 2019-09-27 平安科技(深圳)有限公司 Name entity recognition method, electronic device and storage medium
CN110298044A (en) * 2019-07-09 2019-10-01 广东工业大学 A kind of entity-relationship recognition method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
徐啸;朱艳辉;冀相冰;: "基于自注意力深度学习的微博实体识别研究" *
王子岳;邵曦;: "基于S-LSTM模型利用‘槽值门’机制的说话人意图识别" *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112905774A (en) * 2021-02-22 2021-06-04 武汉市聚联科软件有限公司 Human-computer conversation deep intention understanding method based on affair map
CN113779229A (en) * 2021-08-31 2021-12-10 康键信息技术(深圳)有限公司 User requirement identification matching method, device, equipment and readable storage medium
CN114491027A (en) * 2022-01-13 2022-05-13 天津车之家软件有限公司 A text intent recognition method, device and computing device
CN115064170A (en) * 2022-08-17 2022-09-16 广州小鹏汽车科技有限公司 Voice interaction method, server and storage medium
CN115064170B (en) * 2022-08-17 2022-12-13 广州小鹏汽车科技有限公司 Voice interaction method, server and storage medium

Similar Documents

Publication Publication Date Title
Zhu et al. CAN-NER: convolutional attention network for Chinese named entity recognition
CN110232439B (en) An Intent Recognition Method Based on Deep Learning Network
CN109359293B (en) Mongolian name entity recognition method neural network based and its identifying system
CN108460013A (en) A kind of sequence labelling model based on fine granularity vocabulary representation model
CN111553157A (en) Entity replacement-based dialog intention identification method
CN107832400A (en) A kind of method that location-based LSTM and CNN conjunctive models carry out relation classification
CN113312453B (en) A model pre-training system for cross-language dialogue understanding
CN112182191B (en) Structured memory map network model for multi-round-mouth linguistic understanding
CN108628823A (en) In conjunction with the name entity recognition method of attention mechanism and multitask coordinated training
CN110717341B (en) Method and device for constructing old-Chinese bilingual corpus with Thai as pivot
CN110070855B (en) A speech recognition system and method based on transfer neural network acoustic model
CN107729309A (en) A kind of method and device of the Chinese semantic analysis based on deep learning
CN115392259B (en) Microblog text sentiment analysis method and system based on confrontation training fusion BERT
CN111966812A (en) Automatic question answering method based on dynamic word vector and storage medium
CN112905736B (en) An unsupervised text sentiment analysis method based on quantum theory
CN111552801B (en) A Neural Network Automatic Summarization Model Based on Semantic Alignment
CN112613316B (en) A method and system for generating an ancient Chinese annotation model
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN114386417A (en) A Chinese Nested Named Entity Recognition Method Incorporating Word Boundary Information
CN115169349A (en) ALBERT-based Named Entity Recognition Method for Chinese Electronic Resume
CN115510230A (en) Mongolian emotion analysis method based on multi-dimensional feature fusion and comparative reinforcement learning mechanism
CN111651973A (en) Text matching method based on syntax perception
CN114548117A (en) Cause-and-effect relation extraction method based on BERT semantic enhancement
CN114564953A (en) Emotion target extraction model based on multiple word embedding fusion and attention mechanism
CN116595181A (en) Personalized dialogue method and system combining emotion analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200818
