WO2020133470A1 - Method, apparatus, computer device, and storage medium for cleaning chat corpus - Google Patents

Method, apparatus, computer device, and storage medium for cleaning chat corpus

Info

Publication number
WO2020133470A1
WO2020133470A1 PCT/CN2018/125768
Authority
WO
WIPO (PCT)
Prior art keywords
corpus
chat
question
matching
answer
Prior art date
Application number
PCT/CN2018/125768
Other languages
English (en)
French (fr)
Inventor
熊友军
熊为星
廖洪涛
Original Assignee
深圳市优必选科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市优必选科技有限公司
Priority to PCT/CN2018/125768
Publication of WO2020133470A1

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L51/00 User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L51/02 User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail using automatic reactions or user delegation, e.g. automatic replies or chatbot-generated messages

Definitions

  • The invention relates to the fields of computer technology and deep learning technology, and in particular to a method, apparatus, computer device, and storage medium for cleaning chat corpus.
  • Intelligent chatbots have long been a major research direction in the artificial intelligence field: how to use deep learning and similar methods to make an intelligent chatbot converse as naturally as a human, for example serving as an intelligent customer-service agent in a product's after-sales department. In the training of current intelligent chatbots, whether retrieval-based or generative, a chat corpus is required to train the bot.
  • A large amount of chat corpus is required for question-and-answer training of intelligent chatbots. At present, much of it comes from open-source material on the Internet, but such corpora are generally of low quality and need to be cleaned.
  • Manual screening requires professional annotators to tag the chat corpus, which is labor-intensive and inefficient, and differences in the annotators' skill and understanding may make the results insufficiently accurate, lowering the quality of the final training corpus.
  • a method for cleaning chat corpus includes:
  • chat corpus includes question corpus and answer corpus
  • a device for cleaning chat corpus including:
  • the chat corpus acquisition module is used to acquire chat corpus, the chat corpus includes question corpus and answer corpus;
  • the chat corpus processing module is used to perform word segmentation processing on the chat corpus and convert the word segmentation results into word vectors
  • a model calculation module used to input the word vector into a preset deep search matching ranking model to obtain a matching score corresponding to the chat corpus;
  • the corpus cleaning module is used for cleaning the chat corpus according to the matching score.
  • a computer device which includes a memory and a processor.
  • the memory stores a computer program.
  • the processor is caused to perform the following steps:
  • chat corpus includes question corpus and answer corpus
  • a computer-readable storage medium storing a computer program, which when executed by a processor causes the processor to perform the following steps:
  • chat corpus includes question corpus and answer corpus
  • the invention proposes a method, device, computer equipment and storage medium for cleaning chat corpus.
  • First, the chat corpus to be cleaned is obtained; each entry contains a corresponding question and reply.
  • The question and reply corpora are segmented into words and converted into word vectors, and the trained deep retrieval matching ranking model then computes the matching score between the question's and the reply's word vectors, determining whether the current entry matches and whether it needs to be cleaned.
  • In other words, the originally obtained chat corpus can be cleaned automatically by the deep retrieval matching ranking model; entries no longer need to be tagged manually one by one, which saves a great deal of manual effort and reduces cost to a certain extent.
  • The above cleaning method also avoids human error and improves the accuracy of the cleaning to a certain extent. Further, in this embodiment, the degree of match between question and answer is judged with the deep retrieval matching ranking model, improving the accuracy of the matching judgment, that is, the accuracy of the cleaning.
  • FIG. 1 is a schematic diagram of an implementation process of a method for cleaning chat corpus in an embodiment
  • FIG. 2 is a schematic diagram of an implementation process of a method for cleaning chat corpus in an embodiment
  • FIG. 3 is a schematic diagram of an implementation process of DRMR model training in an embodiment
  • FIG. 4 is a schematic diagram of question and answer pair corpus construction in one embodiment
  • FIG. 5 is a schematic diagram of a DRMR model in an embodiment
  • FIG. 6 is a schematic diagram of an implementation process of a method for cleaning chat corpus in an embodiment
  • FIG. 7 is a structural block diagram of a device for cleaning chat corpus in an embodiment
  • FIG. 8 is a structural block diagram of a computer device in an embodiment.
  • a method for cleaning chat corpus is provided.
  • The executing subject of the method for cleaning chat corpus according to the embodiment of the present invention may be a server.
  • It may also be another terminal device, for example a robotic device.
  • the method for cleaning the chat corpus includes the following steps:
  • Step S102 Obtain chat corpus, the chat corpus includes question corpus and answer corpus.
  • the chat corpus is an uncleaned chat corpus obtained from the Internet or other sources, where each chat corpus includes a question (question corpus) and an answer sentence (answer corpus).
  • the corresponding chat corpus is a number of question and answer pairs, such as (question 1, reply 1), (question 2, reply 2),...
  • Before the actual cleaning, the chat corpus needs to be preprocessed, mainly to handle irregularities that may exist in the raw corpus: removing repeated punctuation marks (for example, when a long run of question marks follows a question, only one is kept), removing entries that contain emoji or sticker images, removing spaces contained in the corpus, and filtering out sensitive information (such as politically sensitive or pornographic and violent terms). After this preprocessing, part of the low-quality corpus is already removed, which improves the efficiency and accuracy of the subsequent cleaning.
  • The corpus then needs further normalization, for example removing punctuation, removing spaces, converting English case, and removing stop words, to strip characters irrelevant to semantic understanding and avoid affecting the accuracy of the subsequent cleaning.
  • Step S104 Perform word segmentation processing on the chat corpus, and convert the word segmentation results into word vectors.
  • Word segmentation of the question or answer corpus may cut by character; English text may be cut by letter. For the segmented corpus, each character or word is then converted into a corresponding word vector (because segmentation is by character, these may also be called character vectors).
  • Specifically, 300-dimensional vectors are randomly initialized from a normal distribution, and each character or word in the segmentation result is converted into the corresponding vector.
  • Because of Chinese/English differences or varying sentence lengths, the word-vector sequences may have inconsistent lengths.
  • The sequences are therefore rewritten according to a preset length threshold: that is, every word-vector sequence is truncated or padded to the preset length.
  • Step S106 input the word vector into a preset deep search matching ranking model to obtain a matching score corresponding to the chat corpus;
  • Step S108 Clean the chat corpus according to the matching score.
  • The Deep Retrieval Match Ranking (DRMR) model (hereinafter the DRMR model) is built on a deep learning model to evaluate and predict whether the entries of a chat corpus match; its input is the word vectors of the question corpus and of the answer corpus, and its output is the matching score between the question corpus and the answer corpus.
  • After the matching score corresponding to a chat corpus entry is obtained, cleaning proceeds according to it. For example, as shown in FIG. 2, when the matching score is greater than or equal to the preset matching threshold, the entry is retained; conversely, when the score is below the threshold, the entry is discarded.
  • the process of obtaining the matching score corresponding to the chat corpus through the DRMR model is as follows:
  • Let q1 = (x1, x2, x3, ..., xm) be the word-vector sequence of the question corpus in the chat corpus and q2 = (y1, y2, y3, ..., yn) that of the answer corpus, where m is the length of the segmented question corpus, n the length of the segmented answer corpus, xi the word vector of the i-th word of the question corpus, and yi that of the i-th word of the answer corpus.
  • relu is the relu activation function
  • W(l) is the weight matrix of the l-th layer
  • b(l) is the bias matrix of the l-th layer
  • L is the total number of layers of the neural network
  • Wp is the weight matrix of the preset question training text
  • bp is its bias
  • s is the mapped output value of the preset question training text.
  • Different matching thresholds can be set according to the degree or number of filtering passes.
  • For example, the matching threshold can be set to 0.5 for the first filtering pass; after several cleanings it can be raised step by step, with the last or final cleaning pass using a threshold of 0.9.
  • the method for cleaning the chat corpus further includes the following steps:
  • Step S202 Obtain a training corpus and construct a question and answer pair corpus according to the training corpus;
  • Step S204: perform word segmentation on the question-answer pair corpus, and convert the segmentation results into word vectors
  • Step S206 Train the preset deep search matching ranking model according to the question and answer corpus to obtain the trained deep search matching ranking model.
  • The training corpus is chat corpus that has been preprocessed after acquisition and may be the same corpus as above.
  • A corresponding question-answer pair corpus is constructed from it, in the form required by DRMR training data.
  • Positive and negative sample pairs are constructed as follows.
  • From the three entries question 1-reply 1, question 2-reply 2, and question 3-reply 3, six question-answer triples can be formed, as shown in FIG. 4: (question 1, reply 1, reply 2), (question 1, reply 1, reply 3), (question 2, reply 2, reply 1), (question 2, reply 2, reply 3), (question 3, reply 3, reply 1), and (question 3, reply 3, reply 2), where the triple (question 1, reply 1, reply 2) indicates that question 1 matches reply 1 better than it matches reply 2.
  • When the training corpus is constructed, training, validation, and test samples are built at an 8:1:1 ratio, thereby completing the training of the whole DRMR model.
  • The above question-answer pair corpus includes a training question corpus, a first answer corpus, and a second answer corpus; the conversion into word vectors is the same as when scoring chat corpus with the DRMR model, so the resulting word vectors include a question-corpus word vector, a first-answer word vector, and a second-answer word vector.
  • Let q1 be the word vector corresponding to the question corpus,
  • q2 be the word vector corresponding to the first answer corpus, and
  • q3 be the word vector corresponding to the second answer corpus.
  • Matching scores are computed for (q1, q2) and for (q1, q3) respectively, as follows:
  • relu is the relu activation function
  • W(l) is the weight matrix of the l-th layer
  • b(l) is the bias matrix of the l-th layer
  • L is the total number of layers of the neural network
  • Wp is the weight matrix of the preset question training text
  • bp is its bias
  • s is the mapped output value of the preset question training text. That is, s(q1, q2) is obtained through the above steps.
  • The matching score s(q1, q3) between the question corpus q1 and the second answer corpus q3 can be calculated in the same way.
  • The loss value is then computed with a preset loss function; specifically, the hinge loss can be used:
  • margin is the similarity margin between positive and negative samples (in this embodiment, margin can be set to 1)
  • s(q1, q2) is the value obtained by feeding q1, q2 into the DRMR model
  • s(q1, q3) is the value obtained by feeding q1, q3 into the DRMR model
  • Θ is the current set of given parameters.
  • Updating the gradients according to the loss value completes the model training.
  • In the first round of training, the parameter values can be initialized, for example randomly from a normal distribution; after each round of model training, the parameters in the model are updated and iterated for the next cleaning of the chat corpus, i.e. performing the foregoing steps S102-S108.
  • After the DRMR model is trained, the chat corpus can be cleaned with it, and the corpus remaining after the first round of cleaning can in turn serve as training corpus for another round of DRMR training, after which the corpus is cleaned again accordingly.
  • In this loop, the matching threshold is raised continuously: for example, in the first cleaning pass the threshold is 0.5, and filtering proceeds step by step at thresholds of 0.5, 0.6, 0.7, 0.8, and 0.9 to complete the final cleaning; the matching scores output by the DRMR model in the last round are compared with 0.9, entries scoring at least 0.9 are retained, and the rest are filtered out. Refer to FIG. 6 for details.
  • Through this repetition, the chat corpus is cleaned without supervision, which saves a great deal of manual cleaning time, guarantees the quality of the cleaned corpus, and improves the accuracy of the subsequent training of intelligent chatbots.
  • a cleaning device for chat corpus which specifically includes:
  • the chat corpus acquisition module 102 is used to acquire chat corpus, the chat corpus includes question corpus and answer corpus;
  • the chat corpus processing module 104 is used to perform word segmentation processing on the chat corpus and convert the word segmentation results into word vectors,
  • the model calculation module 106 is used to input the word vector into a preset deep search matching ranking model to obtain a matching score corresponding to the chat corpus;
  • the corpus cleaning module 108 is used for cleaning the chat corpus according to the matching score.
  • The above chat corpus cleaning device first obtains the chat corpus to be cleaned; each entry contains a corresponding question and reply.
  • The question and reply corpora are segmented into words and converted into word vectors, and the trained deep retrieval matching ranking model then computes the matching score between the question's and the reply's word vectors to determine whether the current entry matches and whether it needs to be cleaned.
  • In other words, the originally obtained chat corpus can be cleaned automatically by the deep retrieval matching ranking model; entries no longer need to be tagged manually one by one, which saves a great deal of manual effort and reduces cost to a certain extent.
  • The above cleaning method also avoids human error and improves the accuracy of the cleaning to a certain extent. Further, in this embodiment, the degree of match between question and answer is judged with the deep retrieval matching ranking model, improving the accuracy of the matching judgment, that is, the accuracy of the cleaning.
  • The model calculation module 106 is further configured to: cross-multiply the first word vector corresponding to the question corpus with the second word vector corresponding to the answer corpus, obtain a preset number of mapping values of the cross-product result according to a preset mapping function, and obtain the matching score corresponding to those mapping values according to a preset activation function and a preset projection function.
  • The above device further includes a word vector rewriting module 110, configured to rewrite the length of the word vectors according to a preset length threshold.
  • The corpus cleaning module 108 is further used to determine whether the matching score is greater than or equal to a preset matching threshold, and to clean out the chat corpus when the score is below the threshold.
  • the device for cleaning chat corpus further includes a model training module 112, which is used to:
  • the question-answer pair corpus includes a training question corpus, a first answer corpus, and a second answer corpus;
  • the word vector includes a question corpus word vector, a first answer corpus word vector, and a second answer corpus word vector
  • the model training module 112 is also used to evaluate and predict the question and answer corpus converted into the word vector according to the preset depth search matching ranking model, and obtain the first corresponding to the question corpus word vector and the first answer corpus word vector A matching score and a second matching score matching the question corpus word vector and the second answer corpus word vector; according to a preset loss function, the first matching score and the second matching score are input, and the output corresponds Loss value; update and iterate on the loss value according to a preset iteration algorithm.
  • the model training module 112 is further configured to use the cleaned chat corpus as a training corpus to train the deep search matching ranking model to obtain the trained deep search matching ranking model.
  • FIG. 8 shows an internal structure diagram of a computer device in an embodiment.
  • the computer device may be a server or a robot.
  • the computer device includes a processor, memory, and network interface connected by a system bus.
  • the memory includes a non-volatile storage medium and an internal memory.
  • The non-volatile storage medium of the computer device stores an operating system and may also store a computer program.
  • When that computer program is executed by the processor, it causes the processor to implement the method for cleaning chat corpus.
  • A computer program may also be stored in the internal memory.
  • When it is executed by the processor, the processor is caused to perform the method for cleaning chat corpus.
  • The specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different component arrangement.
  • the method for cleaning chat corpus provided in this application may be implemented in the form of a computer program, and the computer program may run on the computer device shown in FIG. 8.
  • The memory of the computer device can store the program templates that make up the chat corpus cleaning device.
  • a computer device includes a memory and a processor.
  • the memory stores a computer program.
  • the processor is caused to perform the following steps:
  • chat corpus includes question corpus and answer corpus
  • a computer-readable storage medium which stores a computer program, and when the computer program is executed by a processor, the processor is caused to perform the following steps:
  • chat corpus includes question corpus and answer corpus
  • The cleaning method for chat corpus, the cleaning device for chat corpus, the computer device, and the computer-readable storage medium belong to one general inventive concept.
  • The contents of their embodiments are mutually applicable.
  • Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory can include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as:
  • SRAM: static RAM
  • DRAM: dynamic RAM
  • SDRAM: synchronous DRAM
  • DDR SDRAM: double data rate SDRAM
  • ESDRAM: enhanced SDRAM
  • SLDRAM: synchronous link (Synchlink) DRAM
  • RDRAM: Rambus direct RAM
  • DRDRAM: direct Rambus dynamic RAM

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

A method, apparatus, computer device, and storage medium for cleaning chat corpus, comprising: obtaining chat corpus, the chat corpus including question corpus and answer corpus (S102); performing word segmentation on the chat corpus and converting the segmentation results into word vectors (S104); inputting the word vectors into a preset deep retrieval matching ranking model to obtain matching scores corresponding to the chat corpus (S106); and cleaning the chat corpus according to the matching scores (S108). In this way, the chat corpus can be cleaned automatically, improving its quality and hence the accuracy of subsequent model training.

Description

Method, apparatus, computer device, and storage medium for cleaning chat corpus. Technical field
The present invention relates to the fields of computer technology and deep learning technology, and in particular to a method, apparatus, computer device, and storage medium for cleaning chat corpus.
Background
Intelligent chatbots have long been a major research direction in the artificial intelligence field: how to use deep learning and similar methods to make an intelligent chatbot converse as freely as a human, for example serving as an intelligent customer-service agent in a product's after-sales department. In the training of current intelligent chatbots, whether retrieval-based or generative, chit-chat corpus is needed to train the bot.
Question-and-answer training of an intelligent chatbot requires a large amount of chit-chat corpus. At present, much of it comes from open-source material on the Internet, but such corpora are generally of low quality and need to be cleaned. Manual screening requires professional annotators to tag the corpus, which is labor-intensive and inefficient, and differences in the annotators' skill and understanding may make the results insufficiently accurate, lowering the quality of the final training corpus.
Summary of the invention
In view of the above problems, it is necessary to propose a method, apparatus, computer device, and storage medium for cleaning chat corpus with high cleaning efficiency.
In a first aspect of the present invention, a method for cleaning chat corpus is provided, the method comprising:
obtaining chat corpus, the chat corpus including question corpus and answer corpus;
performing word segmentation on the chat corpus and converting the segmentation results into word vectors;
inputting the word vectors into a preset deep retrieval matching ranking model to obtain matching scores corresponding to the chat corpus; and
cleaning the chat corpus according to the matching scores.
In a second aspect of the present invention, an apparatus for cleaning chat corpus is further provided, comprising:
a chat corpus acquisition module for obtaining chat corpus, the chat corpus including question corpus and answer corpus;
a chat corpus processing module for performing word segmentation on the chat corpus and converting the segmentation results into word vectors;
a model calculation module for inputting the word vectors into a preset deep retrieval matching ranking model to obtain matching scores corresponding to the chat corpus; and
a corpus cleaning module for cleaning the chat corpus according to the matching scores.
In a third aspect of the present invention, a computer device is provided, comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the following steps:
obtaining chat corpus, the chat corpus including question corpus and answer corpus;
performing word segmentation on the chat corpus and converting the segmentation results into word vectors;
inputting the word vectors into a preset deep retrieval matching ranking model to obtain matching scores corresponding to the chat corpus; and
cleaning the chat corpus according to the matching scores.
In a fourth aspect of the present invention, a computer-readable storage medium is provided, storing a computer program which, when executed by a processor, causes the processor to perform the following steps:
obtaining chat corpus, the chat corpus including question corpus and answer corpus;
performing word segmentation on the chat corpus and converting the segmentation results into word vectors;
inputting the word vectors into a preset deep retrieval matching ranking model to obtain matching scores corresponding to the chat corpus; and
cleaning the chat corpus according to the matching scores.
Implementing the embodiments of the present invention yields the following beneficial effects:
The present invention proposes a method, apparatus, computer device, and storage medium for cleaning chat corpus. First, the chat corpus to be cleaned is obtained; each entry contains a corresponding question and reply. The question and reply corpora are segmented into words and converted into word vectors, and the trained deep retrieval matching ranking model then computes the matching score between the question's and the reply's word vectors to judge whether the current entry matches and whether it needs to be cleaned. In other words, the originally obtained chat corpus can, in this embodiment, be cleaned automatically by the deep retrieval matching ranking model; entries no longer need to be tagged manually one by one, which saves a great deal of manual effort and reduces cost to a certain extent. Moreover, the above cleaning method avoids human error and improves the accuracy of the cleaning to a certain extent. Further, in this embodiment, the degree of match between question and reply is judged with the deep retrieval matching ranking model, which improves the accuracy of the matching judgment, that is, the accuracy of the cleaning.
Brief description of the drawings
To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention; those of ordinary skill in the art can derive other drawings from them without creative effort.
In the drawings:
FIG. 1 is a schematic flowchart of a method for cleaning chat corpus in one embodiment;
FIG. 2 is a schematic flowchart of a method for cleaning chat corpus in one embodiment;
FIG. 3 is a schematic flowchart of DRMR model training in one embodiment;
FIG. 4 is a schematic diagram of question-answer pair corpus construction in one embodiment;
FIG. 5 is a schematic diagram of the DRMR model in one embodiment;
FIG. 6 is a schematic flowchart of a method for cleaning chat corpus in one embodiment;
FIG. 7 is a structural block diagram of an apparatus for cleaning chat corpus in one embodiment;
FIG. 8 is a structural block diagram of a computer device in one embodiment.
Detailed description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
As shown in FIG. 1, in one embodiment a method for cleaning chat corpus is provided. The executing subject of the method according to the embodiment of the present invention may be a server; it may also be another terminal device, for example a robot.
Specifically, as shown in FIG. 1, the method comprises the following steps:
Step S102: obtain chat corpus, the chat corpus including question corpus and answer corpus.
The chat corpus is uncleaned chit-chat corpus obtained from the Internet or other sources, where each entry includes a question sentence (question corpus) and an answer sentence (answer corpus). For example, the corpus consists of a number of question-answer pairs such as (question 1, reply 1), (question 2, reply 2), and so on.
It should be noted that in this embodiment, before the actual cleaning, the chat corpus needs to be preprocessed, mainly to handle irregularities that may exist in the raw corpus, for example removing repeated punctuation marks (e.g., when a long run of question marks follows a question, only one is kept), removing entries that contain emoji or sticker images, removing spaces contained in the corpus, and filtering out sensitive information (such as politically sensitive or pornographic and violent terms). That is, after this preprocessing, part of the low-quality corpus is already removed, improving the efficiency and accuracy of the subsequent cleaning.
Further, in this embodiment the corpus also needs further normalization, for example removing punctuation, removing spaces, converting English case, and removing stop words, to strip characters irrelevant to semantic understanding and avoid affecting the accuracy of the subsequent cleaning.
Step S104: perform word segmentation on the chat corpus and convert the segmentation results into word vectors.
In this embodiment, segmentation of the question or answer corpus may cut by character; English text may be cut by letter. For the segmented corpus, each character or word is then converted into the corresponding word vector (because segmentation is by character, these may also be called character vectors).
Specifically, 300-dimensional character vectors are randomly initialized from a normal distribution, and each word or character in the segmentation result is converted into the corresponding word or character vector.
In this embodiment, because of Chinese/English differences or varying lengths, the word-vector sequences may be of inconsistent length. For convenient subsequent vector and matrix computation, the sequences are also rewritten to a preset length threshold: that is, all word-vector sequences are truncated or padded to the preset length.
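The truncate-or-pad step described above can be sketched as follows. This is a minimal illustration: the 300 dimensions and the normal-distribution initialization come from the description, while the function name and the length threshold of 20 are chosen here for the example.

```python
import numpy as np

def pad_or_truncate(vectors, max_len, dim=300):
    """Force a sequence of word vectors to a fixed length.

    Sequences longer than max_len are truncated; shorter ones are
    padded with zero vectors so that later matrix operations line up.
    """
    out = np.zeros((max_len, dim), dtype=np.float32)
    n = min(len(vectors), max_len)
    if n:
        out[:n] = np.asarray(vectors[:n], dtype=np.float32)
    return out

# 300-dimensional vectors randomly initialized from a normal
# distribution, one per segmented character/word.
words = [np.random.normal(size=300) for _ in range(7)]
fixed = pad_or_truncate(words, max_len=20)
print(fixed.shape)  # (20, 300)
```

A 30-word sequence fed through the same call would come back cut to the first 20 vectors, so every entry reaches the model with an identical shape.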
Step S106: input the word vectors into the preset deep retrieval matching ranking model to obtain the matching score corresponding to the chat corpus.
Step S108: clean the chat corpus according to the matching score.
The Deep Retrieval Match Ranking (DRMR) model (hereinafter the DRMR model) is a model, built on a deep learning model, that evaluates and predicts whether the entries of the chat corpus match. Its input is the word vectors of the question corpus and of the answer corpus in an entry; its output is the matching score between that question corpus and answer corpus.
In this step, once the matching score for an entry is obtained, cleaning proceeds according to it. For example, as shown in FIG. 2, when the matching score is greater than or equal to the preset matching threshold, the entry is retained; conversely, when the score is below the threshold, the entry is discarded.
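Steps S106-S108 amount to a threshold filter over scored question-answer pairs; a minimal sketch (the function and variable names are ours, and the scores are invented for the demonstration):

```python
def clean_corpus(scored_pairs, threshold=0.5):
    """Keep a (question, answer) pair only if its matching score meets
    the preset matching threshold; discard it otherwise (FIG. 2)."""
    kept, dropped = [], []
    for question, answer, score in scored_pairs:
        (kept if score >= threshold else dropped).append((question, answer))
    return kept, dropped

pairs = [("question 1", "reply 1", 0.83),
         ("question 2", "reply 2", 0.31),
         ("question 3", "reply 3", 0.50)]
kept, dropped = clean_corpus(pairs, threshold=0.5)
print(len(kept), len(dropped))  # 2 1
```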
Specifically, in one particular embodiment, the process of obtaining the matching score corresponding to the chat corpus through the DRMR model is as follows:
Let q1 be the word-vector sequence of the question corpus and q2 that of the answer corpus, i.e. q1 = (x1, x2, x3, ..., xm), q2 = (y1, y2, y3, ..., yn), where m is the length of the segmented question corpus, n the length of the segmented answer corpus, xi the word vector of the i-th word of the question corpus, and yi the word vector of the i-th word of the answer corpus.
First, a cross product of q1 and q2 is taken:
z(0) = f(q1 ⊗ q2)    (1)
where ⊗ denotes multiplication of corresponding elements and f is the mapping function, which here selects the top K values after the cross product (for example, K = 10 or K = 30). Taking the cross product of the word vectors thus extracts the key top-K values, and the input question or reply is also converted to a fixed length.
Further, after processing according to formula (1), a preset activation function is applied (formula (2)) and a projection is taken (formula (3)) to obtain the final matching score (formula (4)):
z(l) = relu(W(l) z(l-1) + b(l)),  l = 1, 2, ..., L    (2)
h = relu(Wp q1 + bp)    (3)
[Formula (4), which maps z(L) and h to the final matching score s, is rendered as an image in the original.]
Here relu is the relu activation function, W(l) is the weight matrix of the l-th layer, b(l) is the bias matrix of the l-th layer, L is the total number of layers of the neural network, Wp is the weight matrix of the preset question training text, bp is its bias, and s is the mapped output value of the preset question training text.
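A rough NumPy sketch of this scoring pass. Formulas (1)-(3) are followed as described (interaction cross product, top-K mapping function, relu layers, projection of q1); because formula (4) is available only as an image here, the final combination of z(L) and h into a score, a gated sum through a sigmoid, is our assumption rather than the patent's exact formula, and all weight shapes are illustrative.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def drmr_score(q1, q2, W_layers, b_layers, wp, bp, k=10):
    """Sketch of DRMR scoring. q1: (m, d) question vectors, q2: (n, d)
    answer vectors. The mapping f keeps the top-K interaction values per
    question term (formula (1)); relu layers reduce them to one value
    per term (formula (2)); h gates the terms (formula (3))."""
    inter = q1 @ q2.T                        # (m, n) interaction matrix
    z = -np.sort(-inter, axis=1)[:, :k]      # top-K values per row
    for W, b in zip(W_layers, b_layers):     # formula (2)
        z = relu(z @ W + b)                  # ends as (m, 1)
    h = relu(q1 @ wp + bp)                   # formula (3): per-term gate (m,)
    # Assumed stand-in for formula (4): gated sum through a sigmoid.
    return float(1.0 / (1.0 + np.exp(-(h * z[:, 0]).sum())))

rng = np.random.default_rng(0)
m, n, d, k = 5, 6, 8, 4
q1 = rng.normal(size=(m, d))
q2 = rng.normal(size=(n, d))
W_layers = [rng.normal(size=(k, 3)) * 0.1, rng.normal(size=(3, 1)) * 0.1]
b_layers = [np.zeros(3), np.zeros(1)]
wp = rng.normal(size=d) * 0.1
bp = 0.0
score = drmr_score(q1, q2, W_layers, b_layers, wp, bp, k=k)
print(0.0 < score < 1.0)  # True
```

The sigmoid keeps the output in (0, 1), which is what lets the cleaning step compare it against thresholds such as 0.5 or 0.9.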
In this embodiment, different matching thresholds can be set according to the degree or number of filtering passes. For example, for the first pass the matching threshold can be set to 0.5; after several cleanings the threshold can be raised step by step, with the last or final cleaning pass using a threshold of 0.9.
Further, in this embodiment, the above DRMR model also needs to be trained and validated before the actual corpus cleaning.
In one particular embodiment, as shown in FIG. 3, the above method for cleaning chat corpus further comprises the following steps:
Step S202: obtain training corpus and construct a question-answer pair corpus from it;
Step S204: perform word segmentation on the question-answer pair corpus and convert the segmentation results into word vectors;
Step S206: train the preset deep retrieval matching ranking model with the question-answer pair corpus to obtain the trained deep retrieval matching ranking model.
The training corpus is chit-chat corpus that has been preprocessed after acquisition and may be the same corpus as above. In this embodiment, after the training corpus is obtained, a corresponding question-answer pair corpus is constructed in the form required by DRMR training data.
Specifically, positive and negative sample pairs are constructed as follows. From the three chit-chat entries question 1-reply 1, question 2-reply 2, and question 3-reply 3, six question-answer triples can be formed, as shown in FIG. 4: (question 1, reply 1, reply 2), (question 1, reply 1, reply 3), (question 2, reply 2, reply 1), (question 2, reply 2, reply 3), (question 3, reply 3, reply 1), and (question 3, reply 3, reply 2), where the triple (question 1, reply 1, reply 2) indicates that question 1 matches reply 1 better than it matches reply 2.
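The triple construction of FIG. 4 can be sketched directly: each matched pair supplies the positive reply for its question, and every other pair's reply serves as a negative (the function name is ours):

```python
from itertools import permutations

def build_triples(qa_pairs):
    """From N matched (question, answer) pairs, build training triples
    (q_i, a_i, a_j) with i != j: a_i is the better-matching (positive)
    reply for q_i and a_j a negative one, as in FIG. 4."""
    return [(qi, ai, aj) for (qi, ai), (_, aj) in permutations(qa_pairs, 2)]

pairs = [("question 1", "reply 1"),
         ("question 2", "reply 2"),
         ("question 3", "reply 3")]
triples = build_triples(pairs)
print(len(triples))  # 6
```

As the description notes, such triples would then be split into training, validation, and test samples at an 8:1:1 ratio.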
Further, in this embodiment, when the training corpus is constructed, training, validation, and test samples are built at an 8:1:1 ratio, thereby completing the training of the whole DRMR model.
That is, the above question-answer pair corpus includes a training question corpus, a first answer corpus, and a second answer corpus; the conversion into word vectors is the same as when scoring chat corpus with the DRMR model, so the resulting word vectors include a question-corpus word vector, a first-answer word vector, and a second-answer word vector.
The question-answer pair corpus built from the training corpus is fed into the DRMR model shown in FIG. 5, yielding a first matching score for the question-corpus vector and the first-answer vector and a second matching score for the question-corpus vector and the second-answer vector. The first and second matching scores are then compared with the ground truth (the first matching score should exceed the second), completing the training of the model.
Specifically, the DRMR computation is as follows:
Let q1 be the word vector corresponding to the question corpus, q2 the word vector corresponding to the first answer corpus, and q3 the word vector corresponding to the second answer corpus.
Matching scores are computed for (q1, q2) and for (q1, q3) respectively, as follows:
First, a cross product of q1 and q2 is taken:
z(0) = f(q1 ⊗ q2)    (5)
where ⊗ denotes multiplication of corresponding elements and f is the mapping function, which here selects the top K values after the cross product (for example, K = 10 or K = 30). Taking the cross product of the word vectors thus extracts the key top-K values, and the input question or reply is also converted to a fixed length.
Further, after processing according to formula (5), a preset activation function is applied (formula (6)) and a projection is taken (formula (7)) to obtain the final matching score (formula (8)):
z(l) = relu(W(l) z(l-1) + b(l)),  l = 1, 2, ..., L    (6)
h = relu(Wp q1 + bp)    (7)
[Formula (8), which maps z(L) and h to the final matching score s, is rendered as an image in the original.]
Here relu is the relu activation function, W(l) is the weight matrix of the l-th layer, b(l) is the bias matrix of the l-th layer, L is the total number of layers of the neural network, Wp is the weight matrix of the preset question training text, bp is its bias, and s is the mapped output value of the preset question training text. That is, the above steps yield s(q1, q2).
The matching score s(q1, q3) between the question corpus q1 and the second answer corpus q3 can be computed in the same way.
The loss value L is then computed with a preset loss function; specifically, the hinge loss can be used:
L(q1, q2, q3; Θ) = max(0, margin - s(q1, q2) + s(q1, q3))    (9)
where margin is the similarity margin between positive and negative samples (in this embodiment, margin can be set to 1), s(q1, q2) is the value obtained by feeding q1 and q2 into the DRMR model, s(q1, q3) is the value obtained by feeding q1 and q3 into the DRMR model, and Θ is the current set of given parameters.
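Formula (9) is the standard pairwise hinge loss; a minimal sketch with margin set to 1 as in this embodiment (the scores below are invented for the demonstration):

```python
def hinge_loss(s_pos, s_neg, margin=1.0):
    """Formula (9): pairwise hinge loss pushing the positive pair's
    score s(q1, q2) above the negative pair's s(q1, q3) by `margin`."""
    return max(0.0, margin - s_pos + s_neg)

print(round(hinge_loss(0.9, 0.2), 6))  # 0.3  (gap 0.7 is below margin 1)
print(hinge_loss(1.5, 0.2))            # 0.0  (already separated by margin)
```

The loss is zero exactly when the positive score beats the negative one by at least the margin, which is the ground-truth ordering the triples encode.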
Updating the gradients according to the loss value completes the training of the model; to speed up training, the Adam algorithm is chosen for the gradient updates. Finally, the model and its parameters are saved.
It should be noted that for the first round of model training the parameters can take initial values, for example computed from parameters randomly initialized from a normal distribution; after each round of training, the parameters in the model are updated and iterated for the next step of chat corpus cleaning, i.e. performing the foregoing steps S102-S108.
In one particular embodiment, after the DRMR model has been trained through steps S202-S206, the chat corpus can be cleaned with it; moreover, the corpus remaining after the first round of cleaning can further serve as training corpus for the DRMR model, another round of model training can be run, and the corpus can then be cleaned again accordingly. In this loop, the matching threshold is raised continuously: for example, in the first cleaning pass the threshold is 0.5, and filtering proceeds step by step at thresholds of 0.5, 0.6, 0.7, 0.8, and 0.9 to complete the final cleaning; the matching scores output by the DRMR model in the last round are compared with 0.9, and when a score is greater than or equal to 0.9 the corresponding entry is retained, otherwise it is filtered out. See FIG. 6 for details.
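The clean-retrain loop of FIG. 6 can be sketched as follows. `train_model` and `score_pair` are hypothetical stand-ins for DRMR training and scoring, and the fixed toy scores exist purely to show the filtering behavior:

```python
def iterative_clean(corpus, train_model, score_pair,
                    thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """Alternate model training and corpus cleaning, raising the
    matching threshold each round (0.5 up to 0.9) as described."""
    for threshold in thresholds:
        model = train_model(corpus)          # retrain on the surviving corpus
        corpus = [(q, a) for q, a in corpus
                  if score_pair(model, q, a) >= threshold]
    return corpus

# Toy stand-ins with fixed scores, purely to demonstrate the filtering.
fixed_scores = {"question 1": 0.95, "question 2": 0.55, "question 3": 0.75}
train_model = lambda corpus: None
score_pair = lambda model, q, a: fixed_scores[q]
cleaned = iterative_clean([("question 1", "reply 1"),
                           ("question 2", "reply 2"),
                           ("question 3", "reply 3")],
                          train_model, score_pair)
print([q for q, _ in cleaned])  # ['question 1']
```

With these scores, question 2 falls out at the 0.6 pass and question 3 at the 0.8 pass, so only the pair scoring at least 0.9 survives the final round, mirroring the description.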
In this embodiment, through the continuous cycle of DRMR training and corpus cleaning, the chat corpus is cleaned without supervision by repeated operation, which saves a great deal of the time of cleaning the corpus manually, guarantees the quality of the cleaned corpus, and improves the accuracy of the subsequent training of intelligent chatbots.
As shown in FIG. 7, an apparatus for cleaning chat corpus is provided, specifically comprising:
a chat corpus acquisition module 102 for obtaining chat corpus, the chat corpus including question corpus and answer corpus;
a chat corpus processing module 104 for performing word segmentation on the chat corpus and converting the segmentation results into word vectors;
a model calculation module 106 for inputting the word vectors into the preset deep retrieval matching ranking model to obtain matching scores corresponding to the chat corpus; and
a corpus cleaning module 108 for cleaning the chat corpus according to the matching scores.
上述聊天语料的清洗装置,首先获取待清洗的聊天语料,每一条聊天语料包含了对应的问题和回复,问问题、回复对应的语料进行分词处理,并转换成词向量,然后根据训练好的深度检索匹配排序模型计算问句、回复对应的词向量之间的匹配分值,从而来判断当前聊天语料是否是匹配的,是否需要进行清洗。也就是说,对于原始获取的聊天语料,在本实施例中,可以根据深度检索匹配排序模型进行自动的清洗,不再需要人工逐条聊天语料进行标注,省去了大量的人工操作时间,在一定程度上减少了成本花销。并且,采用上述聊天语料的清洗方法,避免了人工操作的认为错误,也在一定程度上提高了聊天语料清洗的准确性。进一步的,在本实施例中,通过深度检索匹配排序模型对问题和答复之间的匹配程度进行判断,提高了聊天语料的匹配性判断的准确性,也即提高了聊天语料清洗的准确度。
In one embodiment, the model computation module 106 is further configured to: cross-multiply the first word vector corresponding to the question corpus with the second word vector corresponding to the answer corpus, obtain a preset number of mapped values from the cross-product result according to a preset mapping function, and obtain a matching score corresponding to the mapped values according to a preset activation function and a preset projection function.
In one embodiment, as shown in FIG. 7, the apparatus further includes a word-vector rewriting module 110, configured to rewrite the word vectors to a preset length threshold.
In one embodiment, the corpus cleaning module 108 is further configured to judge whether the matching score is greater than or equal to a preset matching threshold, and to clean the chat corpus when the matching score is less than the matching threshold.
In one embodiment, as shown in FIG. 7, the chat-corpus cleaning apparatus further includes a model training module 112, configured to:
acquire a training corpus and construct a question-answer pair corpus from it;
perform word segmentation on the question-answer pair corpus and convert the segmentation result into word vectors; and
train a preset deep retrieval matching ranking model with the question-answer pair corpus to obtain a trained deep retrieval matching ranking model.
In one embodiment, the question-answer pair corpus includes a training question corpus, a first answer corpus, and a second answer corpus, and the word vectors include question-corpus word vectors, first-answer-corpus word vectors, and second-answer-corpus word vectors. The model training module 112 is further configured to: perform evaluation prediction on the question-answer pair corpus converted into word vectors with the preset deep retrieval matching ranking model, obtaining a first matching score corresponding to the question-corpus word vectors and the first-answer-corpus word vectors, and a second matching score corresponding to the question-corpus word vectors and the second-answer-corpus word vectors; output a corresponding loss value with a preset loss function, taking the first matching score and the second matching score as inputs; and update and iterate the loss value according to a preset iteration algorithm.
In one embodiment, the model training module 112 is further configured to train the deep retrieval matching ranking model with the cleaned chat corpus as training corpus, obtaining the trained deep retrieval matching ranking model.
FIG. 8 shows the internal structure of the computer device in one embodiment. The computer device may be a server or a robot. As shown in FIG. 8, the computer device includes a processor, a memory, and a network interface connected by a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and may also store a computer program which, when executed by the processor, causes the processor to implement the chat-corpus cleaning method. The internal memory may also store a computer program which, when executed by the processor, causes the processor to perform the chat-corpus cleaning method. Those skilled in the art will understand that the structure shown in FIG. 8 is merely a block diagram of the part of the structure relevant to the solution of this application and does not limit the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown, combine certain components, or use a different arrangement of components.
In one embodiment, the chat-corpus cleaning method provided in this application can be implemented in the form of a computer program, which can run on a computer device such as that shown in FIG. 8. The memory of the computer device can store the program modules that make up the chat-corpus cleaning apparatus.
A computer device includes a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the following steps:
acquiring a chat corpus, the chat corpus including a question corpus and an answer corpus;
performing word segmentation on the chat corpus, and converting the segmentation result into word vectors;
inputting the word vectors into a preset deep retrieval matching ranking model to obtain a matching score corresponding to the chat corpus; and
cleaning the chat corpus according to the matching score.
In one embodiment, a computer-readable storage medium is provided, storing a computer program which, when executed by a processor, causes the processor to perform the following steps:
acquiring a chat corpus, the chat corpus including a question corpus and an answer corpus;
performing word segmentation on the chat corpus, and converting the segmentation result into word vectors;
inputting the word vectors into a preset deep retrieval matching ranking model to obtain a matching score corresponding to the chat corpus; and
cleaning the chat corpus according to the matching score.
It should be noted that the above chat-corpus cleaning method, chat-corpus cleaning apparatus, computer device, and computer-readable storage medium belong to one general inventive concept, and the content of their respective embodiments is mutually applicable.
Those of ordinary skill in the art will understand that all or part of the processes in the above embodiment methods can be implemented by a computer program instructing the relevant hardware. The program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the above method embodiments. Any reference to memory, storage, a database, or other media in the embodiments provided in this application can include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of these technical features have been described; however, as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several implementations of this application, and their description is relatively specific and detailed, but they should not therefore be understood as limiting the scope of this patent application. It should be noted that those of ordinary skill in the art can make several variations and improvements without departing from the concept of this application, all of which fall within its protection scope. Therefore, the protection scope of this patent application shall be subject to the appended claims.

Claims (10)

  1. A chat-corpus cleaning method, characterized in that the method comprises:
    acquiring a chat corpus, the chat corpus comprising a question corpus and an answer corpus;
    performing word segmentation on the chat corpus, and converting the segmentation result into word vectors;
    inputting the word vectors into a preset deep retrieval matching ranking model to obtain a matching score corresponding to the chat corpus; and
    cleaning the chat corpus according to the matching score.
  2. The chat-corpus cleaning method according to claim 1, characterized in that inputting the word vectors into the preset deep retrieval matching ranking model to obtain the matching score corresponding to the chat corpus further comprises:
    cross-multiplying the word vector corresponding to the question corpus with the word vector corresponding to the answer corpus, obtaining a preset number of mapped values from the cross-product result according to a preset mapping function, and obtaining a matching score corresponding to the mapped values according to a preset activation function and a preset projection function.
  3. The chat-corpus cleaning method according to claim 1, characterized in that, after converting the segmentation result into word vectors, the method further comprises:
    rewriting the word vectors to a preset length threshold.
  4. The chat-corpus cleaning method according to claim 1, characterized in that cleaning the chat corpus according to the matching score further comprises:
    judging whether the matching score is greater than or equal to a preset matching threshold; and
    cleaning the chat corpus when the matching score is less than the matching threshold.
  5. The chat-corpus cleaning method according to claim 1, characterized in that the method further comprises:
    acquiring a training corpus, and constructing a question-answer pair corpus from the training corpus;
    performing word segmentation on the question-answer pair corpus, and converting the segmentation result into word vectors; and
    training a preset deep retrieval matching ranking model with the question-answer pair corpus to obtain a trained deep retrieval matching ranking model.
  6. The chat-corpus cleaning method according to claim 5, characterized in that the question-answer pair corpus comprises a training question corpus, a first answer corpus, and a second answer corpus;
    the word vectors comprise question-corpus word vectors, first-answer-corpus word vectors, and second-answer-corpus word vectors; and
    training the preset deep retrieval matching ranking model with the question-answer pair corpus further comprises:
    performing evaluation prediction on the question-answer pair corpus converted into word vectors with the preset deep retrieval matching ranking model, obtaining a first matching score corresponding to the question-corpus word vectors and the first-answer-corpus word vectors, and a second matching score corresponding to the question-corpus word vectors and the second-answer-corpus word vectors;
    outputting a corresponding loss value with a preset loss function, taking the first matching score and the second matching score as inputs; and
    updating and iterating the loss value according to a preset iteration algorithm.
  7. The chat-corpus cleaning method according to claim 5, characterized in that, after cleaning the chat corpus according to the matching score, the method further comprises:
    training the deep retrieval matching ranking model with the cleaned chat corpus as training corpus, to obtain the trained deep retrieval matching ranking model.
  8. A chat-corpus cleaning apparatus, characterized in that the apparatus comprises:
    a chat-corpus acquisition module, configured to acquire a chat corpus, the chat corpus comprising a question corpus and an answer corpus;
    a chat-corpus processing module, configured to perform word segmentation on the chat corpus and convert the segmentation result into word vectors;
    a model computation module, configured to input the word vectors into a preset deep retrieval matching ranking model and obtain a matching score corresponding to the chat corpus; and
    a corpus cleaning module, configured to clean the chat corpus according to the matching score.
  9. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the chat-corpus cleaning method according to any one of claims 1 to 7.
  10. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the chat-corpus cleaning method according to any one of claims 1 to 7.
PCT/CN2018/125768 2018-12-29 2018-12-29 Chat corpus cleaning method and apparatus, computer device and storage medium WO2020133470A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/125768 WO2020133470A1 (zh) 2018-12-29 2018-12-29 Chat corpus cleaning method and apparatus, computer device and storage medium

Publications (1)

Publication Number Publication Date
WO2020133470A1 true WO2020133470A1 (zh) 2020-07-02

Family

ID=71126044

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/125768 WO2020133470A1 (zh) 2018-12-29 2018-12-29 Chat corpus cleaning method and apparatus, computer device and storage medium

Country Status (1)

Country Link
WO (1) WO2020133470A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114124860A (zh) * 2021-11-26 2022-03-01 China United Network Communications Group Co., Ltd. Session management method, apparatus, device and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104753765A (zh) * 2013-12-31 2015-07-01 Huawei Technologies Co., Ltd. Method and apparatus for automatically replying to messages
CN108170853A (zh) * 2018-01-19 2018-06-15 Guangdong Huihe Technology Development Co., Ltd. Chat corpus self-cleaning method, apparatus and user terminal
US20180293978A1 (en) * 2017-04-07 2018-10-11 Conduent Business Services, Llc Performing semantic analyses of user-generated textual and voice content
CN108920603A (zh) * 2018-06-28 2018-11-30 Xiamen Kuaishangtong Information Technology Co., Ltd. Customer service guidance method based on a customer service machine model
CN108920460A (zh) * 2018-06-26 2018-11-30 Wuda Geoinformatics Co., Ltd. Training method and apparatus for a multi-task deep learning model for multi-type entity recognition
CN108984655A (zh) * 2018-06-28 2018-12-11 Xiamen Kuaishangtong Information Technology Co., Ltd. Intelligent customer service guidance method for a customer service robot



Legal Events

Date Code Title Description
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 18945064; Country of ref document: EP; Kind code of ref document: A1)