WO2020010930A1 - Method for detecting ambiguity of customer service robot knowledge base, storage medium, and computer device - Google Patents


Publication number
WO2020010930A1
Authority
WO
WIPO (PCT)
Prior art keywords
category
question
knowledge base
deep learning
ambiguity
Prior art date
Application number
PCT/CN2019/088473
Other languages
French (fr)
Chinese (zh)
Inventor
欧泽彬
徐易楠
潘晟锋
刘云峰
吴悦
胡晓
汶林丁
Original Assignee
深圳追一科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN201810749561.XA (CN109033270A)
Priority claimed from CN201810801678.8A (CN109101579B)
Application filed by 深圳追一科技有限公司
Publication of WO2020010930A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor

Abstract

The present application relates to a method for detecting ambiguity in a customer service robot knowledge base. The method comprises: constructing a knowledge base that is divided by FAQ, wherein each FAQ has at least one similar question and each FAQ is a category; dividing the knowledge base into a test set and a training set for a deep learning model; training the deep learning model on the training set and using the trained model to perform ambiguity detection; updating the knowledge base according to the ambiguity detection result; and repeating these steps until the learning effect no longer improves. Updating the knowledge base according to the ambiguity detection result and repeating the training steps until the learning effect reaches the expected standard helps human reviewers find and correct ambiguity in the knowledge base, yielding a disambiguated knowledge base. Data drawn from the disambiguated knowledge base then serves as the training set and test set of the deep learning model, which further improves the model's learning effect.

Description

Customer service robot knowledge base ambiguity detection method, storage medium, and computer device
Cross-reference to related applications
This application claims priority to the Chinese patent application No. 201810749561.X, entitled "A Method for Automatically Constructing a Customer Service Knowledge Base Based on Human Customer Service Logs", filed with the Chinese Patent Office on July 9, 2018, and to the Chinese patent application No. 201810801678.8, entitled "Customer Service Robot Knowledge Base Ambiguity Detection Method", filed with the Chinese Patent Office on July 19, 2018, the entire contents of which are incorporated herein by reference.
Technical field
The present application relates to the field of artificial intelligence, and in particular to a method for detecting ambiguity in a customer service robot knowledge base, a storage medium, and a computer device.
Background
As the number of Internet users grows, the service load on enterprise customer service departments keeps increasing. Most of the problems users encounter recur, and such recurring questions can usually be answered with a fixed template. To reduce the labor cost of a customer service center, a customer service robot can be introduced: a program determines the type of the user's question, and if the question is an FAQ (Frequently Asked Question), the robot returns a standard answer directly; otherwise, the conversation is transferred to a human agent for special handling.
In the related art, customer service robots use machine learning to identify user intent, turning intent recognition into a question classification problem. Each FAQ corresponds to one category, and each category has one or more similar questions. All FAQs and their corresponding similar questions make up the robot's knowledge base.
The effect of a machine learning model usually depends on the training data selected from the knowledge base; in particular, the labeling accuracy of the training data has a large impact on model performance. Because of limits on time and human effort, however, a knowledge base often contains a large amount of ambiguity, such as questions assigned to the wrong category or categories whose semantics overlap. Such ambiguity causes the model to learn incorrect knowledge and lowers its accuracy. Since machine learning training requires a large amount of labeled data, finding and resolving these ambiguities cannot rely on manual work alone. How to detect ambiguity in the knowledge base and help human reviewers eliminate it has therefore become an urgent problem for practitioners in the field.
Summary of the invention
According to various embodiments of the present application, a method for detecting ambiguity in a customer service robot knowledge base, a storage medium, and a computer device are provided.
A method for detecting ambiguity in a customer service robot knowledge base includes:
constructing a knowledge base, wherein the knowledge base is divided by FAQ, each FAQ has at least one similar question, and each FAQ is a category;
dividing the knowledge base into a test set and a training set for a deep learning model;
training the deep learning model on the training set, and using the trained deep learning model to perform ambiguity detection;
updating the knowledge base according to the ambiguity detection result; and
repeating the above steps until the learning effect no longer improves, to obtain a disambiguated knowledge base.
A method for automatically constructing a customer service knowledge base based on human customer service logs includes:
preprocessing the human customer service log data;
building a representation model from the processed human customer service log data;
obtaining, through the representation model, question representation information for the user questions to be organized;
aggregating the question representation information to obtain clusters of user questions; and
organizing the clusters of user questions to obtain a knowledge base.
One or more non-volatile computer-readable storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:
constructing a knowledge base, wherein the knowledge base is divided by FAQ, each FAQ has at least one similar question, and each FAQ is a category;
dividing the knowledge base into a test set and a training set for a deep learning model;
training the deep learning model on the training set, and using the trained deep learning model to perform ambiguity detection;
updating the knowledge base according to the ambiguity detection result; and
repeating the above steps until the learning effect no longer improves, to obtain a disambiguated knowledge base.
One or more non-volatile computer-readable storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:
preprocessing the human customer service log data;
building a representation model from the processed human customer service log data;
obtaining, through the representation model, question representation information for the user questions to be organized;
clustering the question representation information; and
aggregating similar user questions into the same class, and classifying and organizing them to obtain a knowledge base.
A computer device includes a memory and one or more processors, the memory storing computer-readable instructions which, when executed by the one or more processors, cause the one or more processors to perform the following steps:
constructing a knowledge base, wherein the knowledge base is divided by FAQ, each FAQ has at least one similar question, and each FAQ is a category;
dividing the knowledge base into a test set and a training set for a deep learning model;
training the deep learning model on the training set, and using the trained deep learning model to perform ambiguity detection;
updating the knowledge base according to the ambiguity detection result; and
repeating the above steps until the learning effect no longer improves, to obtain a disambiguated knowledge base.
A computer device includes a memory and one or more processors, the memory storing computer-readable instructions which, when executed by the one or more processors, cause the one or more processors to perform the following steps:
preprocessing the human customer service log data;
building a representation model from the processed human customer service log data;
obtaining, through the representation model, question representation information for the user questions to be organized;
clustering the question representation information; and
aggregating similar user questions into the same class, and classifying and organizing them to obtain a knowledge base.
Details of one or more embodiments of the present application are set forth in the accompanying drawings and the description below. Other features, objects, and advantages of the present application will become apparent from the description, the drawings, and the claims.
Brief description of the drawings
To explain the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are merely some embodiments of the present application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is an application environment diagram of a method for detecting ambiguity in a customer service robot knowledge base in one embodiment;
FIG. 2 is a schematic diagram of the internal structure of a computer device in one embodiment;
FIG. 3 is a schematic flowchart of a method for detecting ambiguity in a customer service robot knowledge base in one embodiment;
FIG. 4 is a schematic flowchart of a method for detecting ambiguity in a customer service robot knowledge base in another embodiment;
FIG. 5 is a schematic flowchart of constructing a customer service knowledge base in one embodiment;
FIG. 6 is a schematic diagram of the working principle of constructing a customer service knowledge base in one embodiment.
Detailed description
To make the purpose, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present application and are not intended to limit it.
FIG. 1 is an application environment diagram of a method for detecting ambiguity in a customer service robot knowledge base in one embodiment. Referring to FIG. 1, the application environment includes a terminal 110 and a server 120, and the terminal 110 can communicate with the server 120 over a network. The server 120 constructs a knowledge base, wherein the knowledge base is divided by FAQ, each FAQ has at least one similar question, and each FAQ is a category; the server 120 divides the knowledge base into a test set and a training set for a deep learning model; the server 120 trains the deep learning model on the training set and uses the trained model to perform ambiguity detection; the knowledge base is updated according to the ambiguity detection result; and the server 120 repeats the above steps until the learning effect no longer improves, obtaining a disambiguated knowledge base. The server 120 may obtain from the terminal 110 the data that a user enters for constructing the knowledge base. The terminal 110 may be, but is not limited to, a personal computer, notebook computer, smartphone, tablet computer, or portable wearable device; the server 120 may be implemented as a standalone server or as a server cluster composed of multiple servers.
FIG. 2 is a schematic diagram of the internal structure of a computer device in one embodiment. The computer device may specifically be the server 120 or the user terminal 110 shown in FIG. 1. Referring to FIG. 2, the computer device includes a processor, a memory, and a network interface connected through a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and computer-readable instructions which, when executed, cause the processor to perform a method for detecting ambiguity in a customer service robot knowledge base. The internal memory of the computer device may also store computer-readable instructions which, when executed by the processor, cause the processor to perform a method for detecting ambiguity in a customer service robot knowledge base. The network interface of the computer device is used to communicate over a network, for example to obtain the data from which the knowledge base is to be constructed.
FIG. 3 is a schematic flowchart of a method for detecting ambiguity in a customer service robot knowledge base in one embodiment of the present application. This embodiment is described mainly by taking the application of the method to the computer device 120 in FIG. 1 above as an example.
As shown in FIG. 3, the method of this embodiment includes:
S11: Construct a knowledge base. The knowledge base is divided by FAQ, each FAQ has a variable number of similar questions, and each FAQ is a category.
The knowledge base is an industry-oriented application developed on the basis of large-scale knowledge processing, and is applicable to technical fields such as large-scale knowledge processing, natural language understanding, knowledge management, automatic question answering systems, and reasoning. Intelligent customer service not only provides enterprises with fine-grained knowledge management technology, but also establishes a fast and effective natural-language-based means of communication between an enterprise and its large number of users. Taking the customer service robot knowledge base of an e-commerce enterprise as an example, the knowledge base contains multiple FAQs, such as "return process" and "refund process". The "return process" FAQ, for example, may contain the following similar questions: "How do I return what I bought yesterday?" and "I want to return an item; what should I do?".
S12: Divide the knowledge base into a test set and a training set for the deep learning model.
Select from the knowledge base the N FAQs in which ambiguity is to be detected, as N categories. For each FAQ, randomly draw a preset number of similar questions as the test data of that category, and use the remaining similar questions as the training data of that category. The test data of all categories form the test set, and the training data of all categories form the training set.
For example, suppose the knowledge base contains 10 FAQs, each with 20 similar questions. If a preset number of, say, 3 similar questions is randomly drawn from each category as the test set of the deep learning model, the test set contains 30 similar questions, and the remaining 170 similar questions form the training set of the deep learning model.
It should be noted that the number of categories in the knowledge base and the number of similar questions in each category are not limited to the examples given in the embodiments, and are not enumerated further here.
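The per-category split of S12 can be sketched as follows. This is a minimal illustration rather than the applicant's implementation: the function name `split_knowledge_base`, the dictionary layout of the knowledge base, and the fixed random seed are all assumptions introduced here. It reproduces the 10-FAQ, 20-question, 3-per-category example above.

```python
import random


def split_knowledge_base(kb, n_test_per_faq, seed=0):
    """Split a knowledge base into (train, test) lists of (question, faq) pairs.

    kb maps each FAQ (category) name to its list of similar questions.
    This layout and the seed argument are illustrative assumptions.
    """
    rng = random.Random(seed)
    train, test = [], []
    for faq, questions in kb.items():
        if len(questions) <= n_test_per_faq:
            raise ValueError(f"FAQ {faq!r} has too few similar questions to split")
        shuffled = list(questions)
        rng.shuffle(shuffled)
        # The first n_test_per_faq shuffled questions become test data,
        # the rest become training data for this category.
        test.extend((q, faq) for q in shuffled[:n_test_per_faq])
        train.extend((q, faq) for q in shuffled[n_test_per_faq:])
    return train, test


# 10 FAQs with 20 similar questions each, 3 test questions per FAQ.
kb = {f"faq_{i}": [f"faq_{i} question {j}" for j in range(20)] for i in range(10)}
train_set, test_set = split_knowledge_base(kb, n_test_per_faq=3)
print(len(train_set), len(test_set))
```

Drawing the test questions separately inside each category keeps every FAQ represented in both sets, which a single global random split would not guarantee.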
S13: Train the deep learning model on the training set, and use the trained deep learning model to perform ambiguity detection.
The deep learning model includes a feature extractor and a shallow classifier.
The ambiguity detection includes category ambiguity detection, labeling error detection, and labeling ambiguity detection.
The ambiguities include:
Category ambiguity: two categories have very similar meanings. For example, if category 1 is "order issues" and category 2 is "product change and cancellation issues", the semantics of category 1 and category 2 overlap, because category 1 largely covers category 2;
Labeling ambiguity: a question can be labeled with several categories at once. For example, if category 1 is "product return issues" and category 2 is "product price issues", the question "This is too expensive, I want to return it" has labeling ambiguity, because it carries the meaning of both categories;
Labeling error: a question is assigned to the wrong category. For example, if category 1 is "product return issues" and category 2 is "product price issues", and the question "I don't want it anymore" is labeled as category 2, a labeling error results.
The ambiguity detection is performed on the test set and/or the training set.
Using the trained deep learning model to perform ambiguity detection includes:
detecting ambiguity with the feature extractor of the deep learning model;
detecting ambiguity with the shallow classifier of the deep learning model.
Detecting ambiguity with the feature extractor of the deep learning model includes:
using the feature extractor of the deep learning model to convert the similar questions in a data set into feature vectors, the data set including the training set and/or the test set;
combining the feature vectors of the questions into question feature vector pairs (x, y), wherein the question corresponding to feature vector x and the question corresponding to feature vector y come from different categories;
calculating the vector similarity cos(x, y) of each question feature vector pair, where
$\cos(x, y) = \dfrac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}$;
sorting all question feature vector pairs by vector similarity from high to low, selecting the top-ranked question feature vector pairs, and judging from these top-ranked pairs whether ambiguity exists.
Judging from the top-ranked question feature vector pairs whether ambiguity exists includes:
judging whether labeling ambiguity or labeling errors exist: extract a first preset number, for example 30, of the top-ranked question feature vector pairs, and manually check whether the corresponding question pairs contain labeling ambiguity or labeling errors;
judging whether category ambiguity exists: for the first preset number of question feature vector pairs, count how many times each corresponding category pair occurs, sort the category pairs by occurrence count from high to low, take a second preset number, for example 20, of category pairs, and manually check whether category ambiguity exists.
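The feature-extractor check can be illustrated with a small standard-library sketch. It assumes the standard cosine similarity, and it assumes the feature vectors have already been produced by some extractor; the helper names and the two-dimensional toy vectors are invented for illustration. The ranked output is only a shortlist for manual review, as the text describes.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(x, y):
    """Standard cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))


def top_cross_category_pairs(items, top_n):
    """Rank question pairs from different categories by similarity.

    items is a list of (question, category, feature_vector) tuples.
    Returns the top_n tuples (similarity, question_1, question_2, category_pair).
    """
    pairs = []
    for (q1, c1, v1), (q2, c2, v2) in combinations(items, 2):
        if c1 != c2:  # only pairs whose questions come from different categories
            pairs.append((cosine(v1, v2), q1, q2, tuple(sorted((c1, c2)))))
    pairs.sort(key=lambda p: p[0], reverse=True)
    return pairs[:top_n]


def frequent_category_pairs(top_pairs, top_m):
    """Count how often each category pair occurs among the top question pairs."""
    return Counter(p[3] for p in top_pairs).most_common(top_m)


# Toy vectors standing in for feature-extractor output.
items = [
    ("How do I return an item?", "return", [1.0, 0.0]),
    ("This is too expensive, I want to return it", "price", [0.9, 0.1]),
    ("What does it cost?", "price", [0.0, 1.0]),
]
top = top_cross_category_pairs(items, top_n=2)
for sim, q1, q2, cats in top:
    print(f"{sim:.3f}  {cats}  {q1!r} / {q2!r}")
print(frequent_category_pairs(top, top_m=1))
```

Comparing every pair is quadratic in the number of questions; for a large knowledge base an approximate nearest-neighbour search would likely be needed, which the source does not discuss.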
Detecting ambiguity with the shallow classifier of the deep learning model includes:
tallying the classification results of the deep learning model into a confusion matrix, wherein each row i of the confusion matrix corresponds to a labeled category and each column j corresponds to a category predicted by the deep learning model; the element $x_{ij}$ is the number of questions labeled as category i and predicted by the model as category j, and the element $x_{ji}$ is the number of questions labeled as category j and predicted by the model as category i;
calculating the number of samples in the data set labeled as category i, which is $\sum_{k} x_{ik}$, where k ranges over all categories;
calculating the number of samples in the data set labeled as category j, which is $\sum_{k} x_{jk}$, where k ranges over all categories;
calculating the proportion $P_{ij}$ of samples in the data set labeled as category i that the deep learning model predicts as category j, and the proportion $P_{ji}$ of samples labeled as category j that the model predicts as category i, where
$P_{ij} = \dfrac{x_{ij}}{\sum_{k} x_{ik}}, \qquad P_{ji} = \dfrac{x_{ji}}{\sum_{k} x_{jk}}$,
category i and category j are different categories, and the data set includes the training set and/or the test set;
calculating the confusion degree of the category pair (category i, category j), the confusion degree being the harmonic mean $S_{ij}$ of $P_{ij}$ and $P_{ji}$, where
$S_{ij} = \dfrac{2 \, P_{ij} \, P_{ji}}{P_{ij} + P_{ji}}$;
judging from the confusion degree whether ambiguity exists between category i and category j.
Judging from the confusion degree whether ambiguity exists between category i and category j includes:
sorting the calculated confusion degrees;
extracting a third preset number, for example 5, of the category pairs with the highest confusion degree, and manually checking whether category ambiguity exists.
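The confusion-degree ranking can be sketched as below, assuming the row-normalised proportions and the harmonic mean described above. Integer category indices, the function names, and the convention of scoring a pair as 0 when both proportions are 0 are assumptions made for this illustration.

```python
def confusion_matrix(labels, preds, n_classes):
    """m[i][j] = number of samples labeled i that the model predicts as j."""
    m = [[0] * n_classes for _ in range(n_classes)]
    for y, p in zip(labels, preds):
        m[y][p] += 1
    return m


def confusion_degrees(m):
    """Return (S_ij, i, j) for every category pair i < j, highest first."""
    n = len(m)
    row_totals = [sum(row) for row in m]  # samples labeled as each category
    ranked = []
    for i in range(n):
        for j in range(i + 1, n):
            p_ij = m[i][j] / row_totals[i] if row_totals[i] else 0.0
            p_ji = m[j][i] / row_totals[j] if row_totals[j] else 0.0
            # Harmonic mean; defined as 0 here when both proportions are 0.
            s_ij = 2 * p_ij * p_ji / (p_ij + p_ji) if p_ij + p_ji > 0 else 0.0
            ranked.append((s_ij, i, j))
    ranked.sort(reverse=True)
    return ranked


# Categories 0 and 1 are mutually confused; category 2 is predicted cleanly.
labels = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
preds = [0, 0, 1, 1, 0, 0, 1, 1, 2, 2, 2, 2]
matrix = confusion_matrix(labels, preds, n_classes=3)
ranked = confusion_degrees(matrix)
print(ranked[0])  # the most confused category pair
```

Because the harmonic mean is dominated by the smaller of the two proportions, a pair scores high only when the confusion runs in both directions, which fits the definition of category ambiguity as two categories with overlapping meaning.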
Detecting ambiguity with the shallow classifier of the deep learning model further includes: finding the data in the data set whose labeled category differs from the category predicted by the deep learning model, and manually checking whether a labeling error exists, the data set including the training set and/or the test set.
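Collecting the label/prediction mismatches for manual review takes only a few lines. The sketch below uses a hypothetical `find_mismatches` helper and hand-written predictions in place of real model output.

```python
def find_mismatches(questions, labels, preds):
    """Return (question, labeled_category, predicted_category) where they differ."""
    return [(q, y, p) for q, y, p in zip(questions, labels, preds) if y != p]


questions = ["I don't want it anymore", "How much is shipping?", "Can I send it back?"]
labels = ["price", "price", "return"]   # categories as labeled in the knowledge base
preds = ["return", "price", "return"]   # hypothetical model predictions

to_review = find_mismatches(questions, labels, preds)
print(to_review)
```

A mismatch is only a candidate: the label may be wrong, or the model may simply have misclassified a correctly labeled question, which is why the text leaves the final decision to a human reviewer.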
S14: Update the knowledge base according to the ambiguity detection result, including:
manually rewriting and manually relabeling the detected ambiguous questions, and deleting the original labels;
regrouping and reassigning the similar questions of the detected ambiguous categories, and deleting the original ambiguous categories.
S15: Repeat the above steps until the learning effect no longer improves, to obtain a disambiguated knowledge base.
The learning effect is the agreement rate between the model's predictions and the labeled categories of the questions in the test set. The agreement rate is, for example, the prediction accuracy, that is, the number of questions whose prediction matches the label divided by the total number of questions. The learning effect no longer improving means, for example, that the prediction accuracy improves by less than 0.5%.
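The stopping rule of S15 can be sketched as a loop that ends once the accuracy gain between consecutive rounds falls below a threshold. Here `evaluate_round` is a hypothetical callable standing in for one full pass of splitting, training, detection, knowledge base update, and evaluation; the 0.5% threshold is read as 0.005 in absolute accuracy, and the `max_rounds` cap is an added safeguard, both being assumptions.

```python
def refine_until_plateau(evaluate_round, min_gain=0.005, max_rounds=20):
    """Run refinement rounds until accuracy improves by less than min_gain.

    evaluate_round() performs one round and returns the test accuracy.
    Returns the list of accuracies observed, including the final round.
    """
    history = []
    for _ in range(max_rounds):
        accuracy = evaluate_round()
        plateaued = bool(history) and accuracy - history[-1] < min_gain
        history.append(accuracy)
        if plateaued:
            break
    return history


# Simulated accuracies for successive rounds (made-up numbers).
simulated = iter([0.80, 0.86, 0.89, 0.892, 0.95])
history = refine_until_plateau(lambda: next(simulated))
print(history)  # stops once the gain drops below 0.005
```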
When the model's learning effect no longer improves, the drop in model performance caused by knowledge base ambiguity has been eliminated, and the knowledge base can be used to train a model for deployment in a production environment.
In this embodiment, the knowledge base is updated according to the ambiguity detection result and the training steps are repeated until the learning effect reaches the expected standard. This helps human reviewers find and correct ambiguity in the knowledge base, yielding a disambiguated knowledge base. Data drawn from the disambiguated knowledge base serves as the training set and test set of the deep learning model, further improving the model's learning effect.
FIG. 4 is a schematic flowchart of a method for detecting ambiguity in a customer service robot knowledge base in another embodiment of the present application.
As shown in FIG. 4, training the deep learning model on the training set includes:
The concept of deep learning originates from research on artificial neural networks; a multilayer perceptron with multiple hidden layers is one deep learning structure. Deep learning combines low-level features into more abstract high-level representations (attribute categories or features) in order to discover distributed feature representations of data. The deep learning model includes a feature extractor and a shallow classifier.
S21: Input the questions in the training set into the deep learning model as the input part;
S22: Use the feature extractor of the deep learning model to convert the questions in the input part into feature vectors;
The feature extractor is, for example, a recurrent neural network. This model reads each word of the question in order and outputs a feature vector of fixed dimension. It should be noted that the feature extractor is not limited to the recurrent neural network given as an example; any method that can convert a question into a feature vector of fixed dimension can serve as the feature extractor.
S23: Use the shallow classifier of the deep learning model to compute a prediction result from the feature vector, the prediction result being the category corresponding to the question in the input part;
The shallow classifier is, for example, a linear classifier. This classifier reads a feature vector of fixed dimension, computes linear combinations of the vector elements to obtain a score for each category, and takes the category with the highest score as the prediction result. It should be noted that the shallow classifier is not limited to the linear classifier given as an example; any method that can convert a feature vector of fixed dimension into a score for each category can serve as the shallow classifier.
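The linear shallow classifier described above can be sketched as one weight row per category followed by an argmax. The bias term and the example weights are assumptions made for illustration; the source only specifies a linear combination of the vector elements.

```python
def linear_scores(weights, biases, x):
    """Score each category as a linear combination of the feature vector x.

    weights holds one row per category. The bias term is an assumption
    added here; the source mentions only a linear combination.
    """
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def predict(weights, biases, x):
    """Return the index of the category with the highest score."""
    scores = linear_scores(weights, biases, x)
    return max(range(len(scores)), key=scores.__getitem__)


# Three categories over a 2-dimensional feature vector (made-up weights).
weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
biases = [0.0, 0.0, 0.1]
x = [2.0, 0.5]
print(linear_scores(weights, biases, x))
print(predict(weights, biases, x))
```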
S24: Use an optimizer to optimize the model being trained, minimizing the average difference between the actual categories labeled for the questions in the training set and the prediction results of the deep learning model;
The average difference is measured, for example, by a loss function. The loss function is, for example, cross-entropy.
The optimizer is, for example, gradient descent. Gradient descent is an iterative method and one of the most commonly used methods for solving for the model parameters of a machine learning algorithm, that is, for solving an unconstrained optimization problem. To find the minimum of the loss function, gradient descent iterates step by step to obtain the minimized loss function and the corresponding model parameter values.
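As a minimal sketch of step S24, assuming a softmax over the scores of a linear classifier, a cross-entropy loss, and a fixed learning rate (all three are illustrative choices, as are the function names), a single gradient descent update could look like this:

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Convert raw category scores into probabilities."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs: List[float], label: int) -> float:
    """Loss for one example: negative log-probability of the labeled category."""
    return -math.log(probs[label])


def gradient_step(weights: List[List[float]], features: List[float],
                  label: int, learning_rate: float = 0.1) -> float:
    """Apply one gradient descent update in place and return the loss before it."""
    scores = [sum(w * f for w, f in zip(row, features)) for row in weights]
    probs = softmax(scores)
    loss = cross_entropy(probs, label)
    for k, row in enumerate(weights):
        # Gradient of the loss with respect to score k is p_k - 1[k == label].
        grad_k = probs[k] - (1.0 if k == label else 0.0)
        for d in range(len(row)):
            row[d] -= learning_rate * grad_k * features[d]
    return loss


# Hypothetical single training example with two categories.
weights = [[0.0, 0.0], [0.0, 0.0]]
features, label = [1.0, 2.0], 1
losses = [gradient_step(weights, features, label) for _ in range(20)]
```

Repeating the step lowers the loss on this example; in practice the update would be averaged over many training questions and would also adjust the feature extractor.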
S25: Evaluate the trained model on the test set by calculating the agreement rate between the model's prediction results and the actual categories labeled for the questions in the test set, which serves as the evaluation of the model's learning effect. The agreement rate is, for example, the prediction accuracy, that is, the number of questions whose predictions agree with their labels divided by the total number of questions.
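The prediction accuracy of step S25 can be illustrated with a short sketch; the category names and the two lists below are made-up data for the example only.

```python
from typing import Sequence


def accuracy(predicted: Sequence[str], actual: Sequence[str]) -> float:
    """Number of questions whose prediction matches the label, divided by the total."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("predicted and actual must be non-empty and of equal length")
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)


# Hypothetical predictions and labels for four test questions.
predicted = ["refund", "refund", "card", "fee"]
actual = ["refund", "card", "card", "fee"]
rate = accuracy(predicted, actual)
```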
In this embodiment, the deep learning model is trained on the FAQs in the training set, and the optimizer continuously optimizes the model during training, so that each iteration improves the learning effect of the deep learning model and the accuracy of ambiguity detection.
In one embodiment, a method for detecting ambiguity in a customer service robot knowledge base may include the following steps:
S102: The server obtains manual customer service log data from a terminal and preprocesses the manual customer service log data.
S104: Build an expression model from the processed manual customer service log data.
S106: Use the expression model to obtain question expression information for the user questions to be organized.
S108: Aggregate the question expression information to obtain user question clusters.
S110: Organize the user question clusters into FAQs to obtain a knowledge base.
S112: Divide the obtained knowledge base into a test set and a training set for a deep learning model.
S114: Train the deep learning model on the training set, and use the trained deep learning model to perform ambiguity detection.
S116: Update the knowledge base according to the ambiguity detection result.
S118: Repeat steps S112 to S116 until the learning effect no longer improves, obtaining a knowledge base with ambiguity eliminated.
In this embodiment, an expression model is first trained on the manual customer service log data and used to obtain question expression information for the user questions to be organized; the question expression information is aggregated, and the resulting user question clusters are organized into FAQs to obtain a knowledge base. This reduces the human resources consumed, and because information is extracted from a large volume of manual customer service logs, it also lowers the demands placed on the business expertise of customer service staff while the knowledge base is being built, making construction less difficult. A deep learning model is then trained on the training set drawn from the knowledge base, ambiguity detection is performed on the knowledge base, and the knowledge base is updated, finally yielding a knowledge base with ambiguity eliminated.
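The loop formed by steps S112 to S118 might be sketched as below, purely as an illustration. The split holds out a preset number of randomly chosen similar questions per FAQ as test data. The callables `train_and_evaluate` and `detect_and_update`, the `max_rounds` limit, and the sample knowledge base are hypothetical stand-ins; training and ambiguity detection themselves are not implemented here.

```python
import random
from typing import Any, Callable, Dict, List, Tuple

KnowledgeBase = Dict[str, List[str]]   # FAQ category -> similar questions
Labeled = List[Tuple[str, str]]        # (question, category) pairs


def split_per_category(kb: KnowledgeBase, n_test: int,
                       seed: int = 0) -> Tuple[Labeled, Labeled]:
    """Hold out n_test randomly chosen questions of every category as test data."""
    rng = random.Random(seed)
    train: Labeled = []
    test: Labeled = []
    for category, questions in kb.items():
        shuffled = list(questions)
        rng.shuffle(shuffled)
        test += [(q, category) for q in shuffled[:n_test]]
        train += [(q, category) for q in shuffled[n_test:]]
    return train, test


def refine_knowledge_base(
    kb: KnowledgeBase,
    train_and_evaluate: Callable[[Labeled, Labeled], Tuple[Any, float]],
    detect_and_update: Callable[[KnowledgeBase, Any], KnowledgeBase],
    n_test: int = 1,
    max_rounds: int = 10,
) -> Tuple[KnowledgeBase, float]:
    """Repeat split, train, detect, update until the score stops improving."""
    best_score = float("-inf")
    for _ in range(max_rounds):
        train, test = split_per_category(kb, n_test)       # S112
        model, score = train_and_evaluate(train, test)     # S114 (training part)
        if score <= best_score:                            # S118 stop condition
            break
        best_score = score
        kb = detect_and_update(kb, model)                  # S114 detection + S116
    return kb, best_score


# Stand-in callables so the loop can be exercised without a real model.
scores = iter([0.70, 0.80, 0.80])
kb = {"refund": ["How do I apply for a refund?", "Can I get my money back?"],
      "card": ["How do I open a card?", "Where can I apply for a card?"]}
final_kb, final_score = refine_knowledge_base(
    kb,
    train_and_evaluate=lambda train, test: (None, next(scores)),
    detect_and_update=lambda current, model: current,
)
```

In the method as described, the update in S116 relies on manual review of the detected ambiguities, so `detect_and_update` would in practice wrap a human-in-the-loop step.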
FIG. 5 is a schematic flowchart of a method for automatically constructing a customer service knowledge base from manual customer service logs, provided in one embodiment of the present application.
As shown in FIG. 5, the method of this embodiment may include:
S1: Preprocess the manual customer service log data;
Further, the manual customer service log data includes:
a single user question and the corresponding customer service reply; and
all questions in a user's entire conversation and the corresponding customer service replies.
Preprocessing the manual customer service log data includes:
processing the manual customer service log data with a machine learning algorithm or a natural language processing algorithm to remove user questions and replies that are unrelated to the business content.
S2: Build an expression model from the processed manual customer service log data;
Further, the expression model is obtained by training on the processed manual customer service log data with a training algorithm.
The training algorithm includes a machine learning algorithm (such as a machine translation algorithm) or a search algorithm.
S3: Use the expression model to obtain question expression information for the user questions to be organized;
Specifically, the question expression information includes a vector representation and/or a text feature representation of the sentence.
S4: Aggregate the question expression information to obtain user question clusters;
Further, aggregating the question expression information includes:
processing the question expression information with a clustering algorithm or by synonym integration.
Specifically, the clustering algorithm is the K-Means clustering algorithm or an improved algorithm derived from it.
S5: Organize the user question clusters to obtain a knowledge base.
In the above process, the expression model mainly serves to obtain a mapping from user questions to customer service replies. The expression model is trained on multiple groups of user questions and their corresponding customer service replies in the manual customer service log data; the training algorithm may be a machine learning algorithm, a search-based technique, or another algorithm.
After the expression model has been trained, the user questions for which a knowledge base is to be organized can be input into the expression model to obtain the expression information of that batch of user questions. The question expression information may take the form of vectors or text features, although it is understood that its form is not limited to these. The question expression information is then clustered so that similar user questions are aggregated into one class, and the clusters are organized into FAQs to obtain the knowledge base.
The word vector mentioned above refers to the words in the user question, and the text features refer to the parts of speech of those words and the grammatical relations of the phrases (verb-object, subject-predicate, and so on). For example, for the user question "How to apply for a refund?" (如何申请退款?), the word vectors can be divided into "how" (如何), "apply" (申请), and "refund" (退款); their text features are "how" (pronoun), "apply" (verb), and "refund" (verb); "how" and "apply" form an adverbial-head structure, and "apply" and "refund" form a verb-object structure. Synonym integration or a clustering algorithm can be applied to these word vectors and text features to obtain similar user questions of the same class, that is, the user question clusters. Finally, the user question clusters are recommended to manual customer service staff, who classify and organize them into FAQs to obtain the knowledge base.
FIG. 6 is a schematic diagram of the working principle of the method for constructing a customer service knowledge base in one embodiment of the present application.
As FIG. 6 shows, the expression model is trained on manual customer service log data (including user questions and customer service replies), and the user questions to be organized are input into the trained expression model to obtain their question expression information. The question expression information is then clustered so that similar user questions are aggregated into the same class, yielding user question clusters. Finally, the user question clusters are recommended to manual customer service staff, who classify and organize them into FAQs to obtain the knowledge base.
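The clustering of question expression vectors can be illustrated with a small K-Means sketch. The two-dimensional vectors below are made-up stand-ins for the output of an expression model, and the function names, iteration count, and seed are assumptions for the example; a production system would more likely rely on an established library implementation.

```python
import random
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(vectors: List[Vector], k: int,
            iterations: int = 20, seed: int = 0) -> List[int]:
    """Return a cluster index for every vector using plain K-Means."""
    rng = random.Random(seed)
    # Initialise the centroids with k distinct input vectors.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


# Hypothetical expression vectors for four user questions.
question_vectors = [[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0]]
labels = k_means(question_vectors, k=2)
```

Questions that receive the same label form one user question cluster, which would then be handed to customer service staff for FAQ organization.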
For ease of understanding, this embodiment is described using a machine translation algorithm as the training algorithm of the expression model and the K-Means clustering algorithm as the clustering algorithm, but implementation of this solution is not limited to this form. The input of the expression model is manual customer service log data (for example, a single user question and the corresponding customer service reply, or all questions in a user's entire conversation and the corresponding replies). Taking a single user question and its corresponding reply as an example, the question is parsed to obtain its part-of-speech information, named entity information, and so on. This solution is described with word vectors as the expression form of the user question, although text features may also be used. The processing may proceed as follows:
First, clean the manual customer service log data by removing user questions and replies that are not closely related to the business (such as "hello" and "thank you"; the specific filtering depends on the business situation). This may be done with a machine learning algorithm (such as language model scoring) or a natural language processing algorithm (such as template matching or syntactic analysis);
Then, train the expression model on the cleaned manual customer service log data, the main purpose being to learn a mapping from user questions to customer service replies;
Then, use the expression model to obtain the question expression information of the user questions to be organized, such as the word vectors of the questions;
Next, apply a clustering algorithm or synonym replacement and integration to the question expression information to obtain user question clusters;
Finally, recommend the user question clusters to customer service staff for FAQ organization, obtaining the corresponding FAQ knowledge base.
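The first of the steps above, cleaning the log by template matching, might look like the following sketch. The regular expressions, the function names, and the sample log are hypothetical and would need to be chosen according to the actual business.

```python
import re
from typing import List, Tuple

# Hypothetical templates for turns unrelated to business content.
CHITCHAT_PATTERNS = [
    re.compile(r"^\s*(hi|hello|hey)\b[\s!.,]*$", re.IGNORECASE),
    re.compile(r"^\s*(thanks|thank you)\b[\s!.,]*$", re.IGNORECASE),
]


def is_chitchat(text: str) -> bool:
    """True if the whole utterance matches one of the chit-chat templates."""
    return any(pattern.match(text) for pattern in CHITCHAT_PATTERNS)


def clean_log(pairs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep only (user question, agent reply) pairs whose question is not chit-chat."""
    return [(question, reply) for question, reply in pairs
            if not is_chitchat(question)]


# Made-up log entries for illustration.
log = [
    ("Hello!", "Hello, how can I help you?"),
    ("How do I apply for a refund?", "Open the order page and choose Refund."),
    ("Thank you", "You're welcome."),
]
cleaned = clean_log(log)
```

Only the refund question survives the filter here; language model scoring or syntactic analysis, as mentioned in the text, could replace the templates.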
It is understood that the training of the expression model described in this embodiment is not limited to one form: a deep learning algorithm, a machine learning model, or a search-based technique may be used. Nor is the input limited to user questions and manual customer service replies; the form of the input may be determined by the actual circumstances of the customer service robot. For example, a business may place particular weight on sentiment and construct some inputs according to whether the user question contains sentiment words and which ones.
The method described in this embodiment does not restrict the granularity at which the knowledge base is organized. The granularity can be determined by the actual circumstances of the business, that is, a coarse-grained or fine-grained FAQ division can be designed according to specific business needs. For a business that does not yet have a customer service knowledge base, the division is mainly reflected in the number of clusters. For example, if the business is a bank's card application service and users mainly ask about debit cards and credit cards, a coarse-grained knowledge base can be built with two classes. The credit card class can itself contain many topics, such as card activation and annual fees; a finer division is obtained simply by increasing the number of clusters.
Customer service is usually a main channel through which an enterprise obtains user feedback and resolves users' questions about its products. Traditional customer service is handled mainly by professional human agents, so the enterprise's spending on customer service grows rapidly as the volume of customer service work increases and becomes an expense that cannot be ignored.
A relatively advanced solution to this problem is to introduce intelligent customer service robots, which can significantly reduce the volume handled by human agents and save substantial customer service costs. Customer service robots offer clear advantages in customer service work. First, they improve the user experience: they provide unified, intelligent self-service support for online customer service, new media customer service, and similar channels, making it less difficult and less complicated for users to get their problems solved. Second, they improve service efficiency: they shorten the time needed to handle inquiries, relieve the pressure on traditional human agents, and save service costs. Third, they quickly collect data on user demands and behavior to support iterative product optimization.
Although customer service robots have all of these advantages, one problem must be considered: how to extract from the manual customer service logs the popular questions that users ask frequently and with clear intent, analyze them, and abstract them into a number of standard questions (Frequently Asked Questions, FAQ). For each FAQ, professional business staff configure a standard answer; then, for each future user question, technical means are used to determine whether it can be classified into an existing FAQ, and if so, the pre-configured answer is returned to the user, resolving the user's question efficiently.
When switching directly from traditional human customer service to an intelligent customer service robot, the approach most common on the market today is for senior customer service staff to classify and summarize the questions users ask most often, thereby forming a knowledge base. This approach depends heavily on the senior staff's understanding of the overall business and their ability to summarize it. A business usually has a large accumulation of user logs, and these logs contain most of the information needed for the knowledge base.
Most current knowledge base construction methods use machine learning algorithms (for example, topic model algorithms such as LSA and LDA, and deep learning algorithms such as Seq2Seq) or natural language processing algorithms (such as rule matching or template matching) to aggregate user questions; each cluster is then manually screened and summarized into a standard FAQ question, thereby building the intelligent customer service knowledge base. However, existing methods require considerable manual intervention and a large investment of labor, and the quality of the resulting knowledge base depends heavily on the business expertise of the human agents.
With the above technical solution, the present application trains an expression model on manual customer service log data, uses the expression model to obtain question expression information for the user questions to be organized, aggregates the question expression information to obtain user question clusters, and finally organizes the user question clusters to obtain a knowledge base. The method makes full use of the information contained in existing manual customer service log data, can rapidly and iteratively optimize the robot's knowledge base using massive volumes of such data, reduces the dependence of knowledge base construction on the business expertise of human staff, and makes construction less difficult.
It should be understood that the steps in the embodiments of the present application are not necessarily performed in the order indicated by the step numbers. Unless explicitly stated herein, there is no strict restriction on the order of these steps, and they may be performed in other orders. Moreover, at least some of the steps in each embodiment may include multiple sub-steps or stages. These sub-steps or stages are not necessarily completed at the same moment and may be performed at different times; nor are they necessarily performed one after another, and they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
One or more non-volatile computer-readable storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to implement the steps of the method for detecting ambiguity in a customer service robot knowledge base provided in any embodiment of the present application.
One or more non-volatile computer-readable storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to implement the steps of the method for automatically constructing a customer service knowledge base from manual customer service logs provided in any embodiment of the present application.
A computer device including a memory and one or more processors, the memory storing computer-readable instructions which, when executed by the one or more processors, cause the one or more processors to implement the steps of the method for detecting ambiguity in a customer service robot knowledge base provided in any embodiment of the present application.
A computer device including a memory and one or more processors, the memory storing computer-readable instructions which, when executed by the one or more processors, cause the one or more processors to implement the steps of the method for automatically constructing a customer service knowledge base from manual customer service logs provided in any embodiment of the present application.
It is understood that identical or similar parts of the above embodiments may be referred to one another; for content not described in detail in one embodiment, refer to the identical or similar content in other embodiments.
Any process or method described in a flowchart or otherwise described herein may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing a specific logical function or step of the process. The scope of the preferred embodiments of the present application includes additional implementations in which functions may be performed out of the order shown or discussed, including substantially simultaneously or in the reverse order depending on the functions involved, as will be understood by those skilled in the art to which the embodiments of the present application belong.
It should be understood that each part of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods may be implemented in software or firmware that is stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or a combination of the following techniques known in the art may be used: discrete logic circuits having logic gate circuits for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gate circuits, programmable gate arrays (PGA), field-programmable gate arrays (FPGA), and so on.
Those of ordinary skill in the art will understand that all or some of the steps of the methods in the above embodiments may be completed by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, performs one or a combination of the steps of the method embodiments.
In addition, the functional units in the embodiments of the present application may be integrated into one processing module, each unit may exist physically on its own, or two or more units may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented as a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disk, or the like.
In the description of this specification, reference to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present application. In this specification, such references do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the present application have been shown and described above, it is understood that these embodiments are exemplary and are not to be construed as limiting the present application; those of ordinary skill in the art may change, modify, replace, and vary the above embodiments within the scope of the present application.

Claims (23)

  1. A method for detecting ambiguity in a customer service robot knowledge base, characterized by comprising:
    constructing a knowledge base, the knowledge base being divided by FAQ, each FAQ having at least one similar question, and each FAQ being one category;
    dividing the knowledge base into a test set and a training set for a deep learning model;
    training the deep learning model on the training set, and performing ambiguity detection using the trained deep learning model;
    updating the knowledge base according to the ambiguity detection result; and
    repeating the above steps until the learning effect no longer improves, to obtain a knowledge base with ambiguity eliminated.
  2. The method according to claim 1, characterized in that dividing the knowledge base into a test set and a training set for a deep learning model comprises: randomly selecting a preset number of the similar questions of each FAQ as test data for the category corresponding to that FAQ, the remaining similar questions serving as training data for that category; the test data of all categories constituting the test set, and the training data of all categories constituting the training set.
  3. The method according to claim 1, characterized in that the deep learning model comprises a feature extractor and a shallow classifier, and training the deep learning model on the training set comprises:
    inputting the questions in the training set into the deep learning model as an input part;
    using the feature extractor in the deep learning model to convert each question in the input part into a feature vector;
    using the shallow classifier in the deep learning model to calculate a prediction result from the feature vector, the prediction result being the category corresponding to the question in the input part;
    using an optimizer to optimize the model being trained, minimizing the average difference between the actual categories labeled for the questions in the training set and the prediction results of the deep learning model; and
    evaluating the trained model on the test set by calculating the agreement rate between the model's prediction results and the actual categories labeled for the questions in the test set, as the evaluation of the model's learning effect.
  4. The method according to claim 1, characterized in that the ambiguity detection comprises category ambiguity detection, labeling error detection, and labeling ambiguity detection, and performing ambiguity detection using the trained deep learning model comprises:
    detecting ambiguity using the feature extractor in the deep learning model; and
    detecting ambiguity using the shallow classifier in the deep learning model.
  5. The method according to claim 4, characterized in that detecting ambiguity using the feature extractor in the deep learning model comprises:
    using the feature extractor in the deep learning model to convert the similar questions in a data set into feature vectors, the data set comprising the training set and/or the test set;
    combining the feature vectors of the questions into question feature vector pairs (x, y), wherein the question corresponding to feature vector x and the question corresponding to feature vector y come from different categories;
    calculating the vector similarity cos(x, y) of each question feature vector pair, where
    $$\cos(x, y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}$$
    and
    将所有问句特征向量对按所述向量相似度从高到低排序,选择所述向量相似度排名靠前的问句特征向量对,并根据所述向量相似度排名靠前的问句特征向量对判断是否存在歧义。Sort all question feature vector pairs according to the similarity of the vector from high to low, select the question feature vector pair ranked highest in the vector similarity, and rank the question feature vector ranked highest according to the vector similarity Ambiguous judgment.
  6. The method according to claim 5, wherein determining whether ambiguity exists from the top-ranked question feature vector pairs comprises:
    determining whether labeling ambiguity or a labeling error exists: extracting a first preset number of the question feature vector pairs ranked highest by similarity, and manually checking whether the corresponding question pairs contain labeling ambiguity or labeling errors; and
    determining whether category ambiguity exists: for the first preset number of question feature vector pairs, counting how many times each corresponding category pair occurs, sorting the category pairs by occurrence count from high to low, taking a second preset number of category pairs, and manually checking whether category ambiguity exists.
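The category-pair counting step of claim 6 can be sketched as follows. This is an illustrative sketch only: the helper name `most_frequent_category_pairs`, the choice to treat (A, B) and (B, A) as the same pair, and the sample data are assumptions, not part of the claim.

```python
from collections import Counter

def most_frequent_category_pairs(category_pairs, second_n):
    """category_pairs: (category_a, category_b) for each top-ranked question pair.
    Returns the second_n most frequent unordered category pairs with their counts."""
    # Sort each pair so that (A, B) and (B, A) are counted together (an assumption).
    counts = Counter(tuple(sorted(pair)) for pair in category_pairs)
    return counts.most_common(second_n)

# Hypothetical category pairs taken from the first preset number of similar question pairs.
category_pairs = [
    ("refund", "return"),
    ("return", "refund"),
    ("refund", "return"),
    ("login", "password"),
]
suspects = most_frequent_category_pairs(category_pairs, second_n=1)
```

The resulting list is the shortlist handed to a human reviewer; the code only ranks candidates and does not decide whether two categories are actually ambiguous.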
  7. The method according to claim 4, wherein detecting ambiguity by using the shallow classifier of the deep learning model comprises:
    collecting statistics on the classification results of the deep learning model to form a confusion matrix, wherein each row i of the confusion matrix corresponds to a labeled category, each column j corresponds to a category predicted by the deep learning model, element $x_{ij}$ is the number of questions labeled as category i and predicted by the model as category j, and element $x_{ji}$ is the number of questions labeled as category j and predicted by the model as category i;
    calculating the number of samples labeled as category i in the data set, the number of samples of category i being
    $$\sum_{k} x_{ik},$$
    where k is any category;
    calculating the number of samples labeled as category j in the data set, the number of samples of category j being
    $$\sum_{k} x_{jk},$$
    where k is any category;
    calculating the proportion $P_{ij}$ of samples in the data set labeled as category i that the deep learning model predicts as category j, and the proportion $P_{ji}$ of samples labeled as category j that the model predicts as category i, $P_{ij}$ and $P_{ji}$ being given respectively by
    $$P_{ij} = \frac{x_{ij}}{\sum_{k} x_{ik}}, \qquad P_{ji} = \frac{x_{ji}}{\sum_{k} x_{jk}},$$
    wherein category i and category j are different categories, and the data set comprises a training set and/or a test set;
    calculating the confusion degree of the category pair (category i, category j), the confusion degree being the harmonic mean $S_{ij}$ of $P_{ij}$ and $P_{ji}$, where
    $$S_{ij} = \frac{2 P_{ij} P_{ji}}{P_{ij} + P_{ji}};$$
    and
    determining, according to the confusion degree, whether ambiguity exists between category i and category j.
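As a worked illustration of claim 7, the sketch below computes the confusion degree for every category pair from a small confusion matrix. The function name `confusion_scores`, the zero-division guards, and the example matrix are assumptions added for illustration; the claim defines only the quantities themselves.

```python
def confusion_scores(matrix):
    """matrix[i][j]: number of questions labeled as category i and predicted as category j.
    Returns {(i, j): S_ij} for every pair i < j, where S_ij is the harmonic mean
    of the two misprediction proportions P_ij and P_ji."""
    n = len(matrix)
    row_totals = [sum(row) for row in matrix]  # samples labeled as each category
    scores = {}
    for i in range(n):
        for j in range(i + 1, n):
            p_ij = matrix[i][j] / row_totals[i] if row_totals[i] else 0.0
            p_ji = matrix[j][i] / row_totals[j] if row_totals[j] else 0.0
            total = p_ij + p_ji
            # Guard against 0/0 when the two categories are never confused (an assumption).
            scores[(i, j)] = 2 * p_ij * p_ji / total if total else 0.0
    return scores

# Hypothetical 3-category confusion matrix: categories 0 and 1 are often mixed up.
matrix = [
    [6, 4, 0],
    [2, 8, 0],
    [0, 0, 10],
]
scores = confusion_scores(matrix)
```

For this matrix, the proportions for categories 0 and 1 are 4/10 and 2/10, so their confusion degree is 2(0.4)(0.2)/(0.4 + 0.2), roughly 0.267, while every pair involving category 2 scores 0. Because the harmonic mean is dominated by the smaller proportion, a pair scores high only when the confusion runs in both directions.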
  8. The method according to claim 7, wherein determining, according to the confusion degree, whether ambiguity exists between category i and category j comprises:
    sorting the calculated confusion degrees; and
    extracting a third preset number of the category pairs ranked highest by confusion degree, and manually checking whether category ambiguity exists.
  9. The method according to claim 7, wherein detecting ambiguity by using the shallow classifier of the deep learning model further comprises: finding the data in the data set for which the labeled actual category is inconsistent with the category predicted by the deep learning model, and manually checking whether a labeling error exists, wherein the data set comprises a training set and/or a test set.
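The mismatch search of claim 9 amounts to a simple filter, sketched below. The function name `find_mismatches` and the sample questions, labels, and predictions are hypothetical; in practice the predictions would come from the trained model.

```python
def find_mismatches(questions, labels, predictions):
    """Return (question, labeled_category, predicted_category) wherever the two disagree."""
    return [
        (q, y, p)
        for q, y, p in zip(questions, labels, predictions)
        if y != p
    ]

# Hypothetical data; each mismatch is a candidate labeling error for manual review.
questions = ["how to get a refund", "track my order", "change delivery address"]
labels = ["refund", "shipping", "account"]
predictions = ["refund", "shipping", "shipping"]
to_review = find_mismatches(questions, labels, predictions)
```

A mismatch does not by itself prove a labeling error, since the model may simply be wrong, which is why the claim leaves the final check to a human reviewer.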
  10. The method according to claim 1, wherein updating the knowledge base according to the ambiguity detection result comprises:
    manually rewriting and manually re-labeling the detected ambiguous questions, and deleting the original labels; and
    regrouping and reassigning the similar questions of the detected ambiguous categories, and deleting the original ambiguous categories.
  11. The method according to claim 1, wherein constructing the knowledge base comprises:
    preprocessing manual customer service log data;
    building an expression model from the processed manual customer service log data;
    obtaining, through the expression model, question expression information of the user questions to be organized;
    aggregating the question expression information to obtain user question clusters; and
    organizing the user question clusters to obtain the knowledge base.
  12. A method for automatically constructing a customer service knowledge base based on manual customer service logs, comprising:
    preprocessing manual customer service log data;
    building an expression model from the processed manual customer service log data;
    obtaining, through the expression model, question expression information of the user questions to be organized;
    aggregating the question expression information to obtain user question clusters; and
    organizing the user question clusters to obtain a knowledge base.
  13. The method according to claim 1, wherein the manual customer service log data comprises:
    a single user question and the corresponding customer service reply; and
    all questions asked by the user during an entire session and the corresponding customer service replies.
  14. The method according to claim 13, wherein preprocessing the manual customer service log data comprises:
    processing the manual customer service log data with a machine learning algorithm or a natural language processing algorithm to remove user questions and replies that are unrelated to the business content.
  15. The method according to claim 13, wherein the expression model is obtained by training on the processed manual customer service log data with a training algorithm.
  16. The method according to claim 15, wherein the training algorithm comprises:
    a machine learning algorithm or a search algorithm.
  17. The method according to any one of claims 13 to 16, wherein aggregating the question expression information comprises:
    processing the question expression information by means of a clustering algorithm or synonym integration.
  18. The method according to claim 17, wherein the clustering algorithm is the K-Means clustering algorithm and related improved algorithms.
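To make the clustering step of claims 17 and 18 concrete, the following is a minimal pure-Python K-Means sketch. It is an illustration under stated assumptions, not the claimed method: the farthest-first initialization is one deterministic choice among many (the claims do not specify initialization), and the names `sq_dist` and `kmeans` and the toy question vectors are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iterations=20):
    """Cluster vectors into k groups; returns one cluster index per vector."""
    # Farthest-first initialization (an assumption): start from the first vector,
    # then repeatedly add the vector farthest from all chosen centroids.
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        centroids.append(list(farthest))
    assignments = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: sq_dist(v, centroids[c])) for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments

# Hypothetical sentence vectors forming two obvious groups of user questions.
vectors = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]]
cluster_ids = kmeans(vectors, k=2)
```

Each resulting cluster would correspond to one candidate FAQ category whose member questions become its similar questions. Plain K-Means requires k to be chosen in advance, which may be one reason the claim also covers improved variants.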
  19. The method according to any one of claims 12 to 16, wherein the question expression information comprises a vector representation of a sentence and/or a text feature representation.
  20. One or more non-volatile computer-readable storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:
    constructing a knowledge base, wherein the knowledge base is divided by FAQ, each FAQ has at least one similar question, and each FAQ is one category;
    dividing the knowledge base into a test set and a training set for a deep learning model;
    training the deep learning model on the training set, and performing ambiguity detection by using the learned deep learning model;
    updating the knowledge base according to the ambiguity detection result; and
    repeating the above steps until the learning effect no longer improves, to obtain a disambiguated knowledge base.
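The iterate-until-no-improvement loop recited in claim 20 can be sketched as a small driver function. Everything here is hypothetical scaffolding: `refine_knowledge_base` and its callables `split`, `train`, `evaluate`, `detect`, and `update` are placeholder names, the stopping rule (stop when the test score fails to increase) is one possible reading of "the learning effect no longer improves", and the stubs replace real model training and manual review.

```python
def refine_knowledge_base(kb, split, train, evaluate, detect, update, max_rounds=10):
    """Repeat split -> train -> detect -> update until the score stops improving."""
    best_score = float("-inf")
    for _ in range(max_rounds):
        train_set, test_set = split(kb)
        model = train(train_set)
        score = evaluate(model, test_set)
        if score <= best_score:  # learning effect no longer improves
            break
        best_score = score
        findings = detect(model, train_set, test_set)
        kb = update(kb, findings)  # in practice this step involves manual correction
    return kb, best_score

# Stub callables (assumptions): the "knowledge base" only tracks how many
# ambiguous entries remain, and each round fixes at most one of them.
final_kb, best = refine_knowledge_base(
    {"ambiguous": 2},
    split=lambda kb: (kb, kb),
    train=lambda train_set: None,
    evaluate=lambda model, test_set: 1.0 - 0.1 * test_set["ambiguous"],
    detect=lambda model, train_set, test_set: 1 if train_set["ambiguous"] > 0 else 0,
    update=lambda kb, findings: {"ambiguous": kb["ambiguous"] - findings},
)
```

With these stubs the score rises from 0.8 to 0.9 to 1.0 and the loop stops on the fourth round, when the score no longer increases. The `max_rounds` cap is an added safety bound that does not appear in the claim.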
  21. One or more non-volatile computer-readable storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:
    preprocessing manual customer service log data;
    building an expression model from the processed manual customer service log data;
    obtaining, through the expression model, question expression information of the user questions to be organized;
    clustering the question expression information; and
    aggregating similar user questions into the same category, and classifying and organizing them to obtain a knowledge base.
  22. A computer device, comprising a memory and one or more processors, the memory storing computer-readable instructions which, when executed by the one or more processors, cause the one or more processors to perform the following steps:
    constructing a knowledge base, wherein the knowledge base is divided by FAQ, each FAQ has at least one similar question, and each FAQ is one category;
    dividing the knowledge base into a test set and a training set for a deep learning model;
    training the deep learning model on the training set, and performing ambiguity detection by using the learned deep learning model;
    updating the knowledge base according to the ambiguity detection result; and
    repeating the above steps until the learning effect no longer improves, to obtain a disambiguated knowledge base.
  23. A computer device, comprising a memory and one or more processors, the memory storing computer-readable instructions which, when executed by the one or more processors, cause the one or more processors to perform the following steps:
    preprocessing manual customer service log data;
    building an expression model from the processed manual customer service log data;
    obtaining, through the expression model, question expression information of the user questions to be organized;
    clustering the question expression information; and
    aggregating similar user questions into the same category, and classifying and organizing them to obtain a knowledge base.
PCT/CN2019/088473 2018-07-09 2019-05-27 Method for detecting ambiguity of customer service robot knowledge base, storage medium, and computer device WO2020010930A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN201810749561.XA CN109033270A (en) 2018-07-09 2018-07-09 A method of service knowledge base is constructed based on artificial customer service log automatically
CN201810749561.X 2018-07-09
CN201810801678.8A CN109101579B (en) 2018-07-19 2018-07-19 Customer service robot knowledge base ambiguity detection method
CN201810801678.8 2018-07-19

Publications (1)

Publication Number Publication Date
WO2020010930A1 true WO2020010930A1 (en) 2020-01-16

Family

ID=69143296

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/088473 WO2020010930A1 (en) 2018-07-09 2019-05-27 Method for detecting ambiguity of customer service robot knowledge base, storage medium, and computer device

Country Status (1)

Country Link
WO (1) WO2020010930A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114153993A (en) * 2022-02-07 2022-03-08 杭州远传新业科技有限公司 Automatic knowledge graph construction method and system for intelligent question answering

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107608959A (en) * 2017-09-08 2018-01-19 电子科技大学 A kind of English social media short text place name identification method
CN108227932A (en) * 2018-01-26 2018-06-29 上海智臻智能网络科技股份有限公司 Interaction is intended to determine method and device, computer equipment and storage medium
CN109033270A (en) * 2018-07-09 2018-12-18 深圳追一科技有限公司 A method of service knowledge base is constructed based on artificial customer service log automatically
CN109101579A (en) * 2018-07-19 2018-12-28 深圳追一科技有限公司 customer service robot knowledge base ambiguity detection method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YAO, ZHANLEI: "Knowledge Base Construction and Its Application in Chinese Q&A System", CHINA MASTER'S THESES FULL-TEXT DATABASE, 15 January 2014 (2014-01-15) *

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19834237

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 14.05.2021)

122 Ep: pct application non-entry in european phase

Ref document number: 19834237

Country of ref document: EP

Kind code of ref document: A1