WO2020133358A1 - Method, Apparatus, Computer Device and Storage Medium for Cleaning Chat Corpus - Google Patents

Method, Apparatus, Computer Device and Storage Medium for Cleaning Chat Corpus

Info

Publication number
WO2020133358A1
Authority
WO
WIPO (PCT)
Prior art keywords
corpus
chat
chat corpus
word vector
preset
Prior art date
Application number
PCT/CN2018/125358
Other languages
English (en)
French (fr)
Inventor
熊友军
熊为星
廖洪涛
Original Assignee
深圳市优必选科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市优必选科技有限公司 filed Critical 深圳市优必选科技有限公司
Priority to PCT/CN2018/125358 priority Critical patent/WO2020133358A1/zh
Publication of WO2020133358A1 publication Critical patent/WO2020133358A1/zh

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/279: Recognition of textual entities
    • G06F 40/284: Lexical analysis, e.g. tokenisation or collocates
    • G06F 40/30: Semantic analysis
    • G06F 40/35: Discourse or dialogue representation
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/004: Artificial life, i.e. computing arrangements simulating life
    • G06N 3/006: Artificial life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]

Definitions

  • The invention relates to the fields of computer technology and deep learning, and in particular to a method, apparatus, computer device and storage medium for cleaning chat corpus.
  • Intelligent chatbots have long been a major research direction in artificial intelligence: how to make a chatbot converse as naturally as a human through deep learning and other methods, for example serving as an intelligent customer-service agent in a product's after-sales department. In training current intelligent chatbots, whether retrieval-based or generative, chat corpus is required to train the robot.
  • Question-and-answer training of intelligent chatbots requires a large amount of chat corpus. At present, much of this corpus comes from open-source material on the Internet, but it is generally of low quality and needs to be cleaned.
  • Manual screening requires professional annotators to tag the chat corpus, which is labor-intensive and inefficient; moreover, differences in the annotators' skill and understanding may make the results insufficiently accurate, leaving the final training corpus of low quality.
  • a method for cleaning chat corpus includes:
  • chat corpus includes question corpus and answer corpus
  • a device for cleaning chat corpus including:
  • the chat corpus acquisition module, used to acquire chat corpus, the chat corpus including question corpus and answer corpus;
  • a chat corpus processing module, configured to perform word segmentation on the chat corpus, obtain word vectors converted from the segmentation result, and obtain character vectors corresponding to the chat corpus;
  • a model calculation module, configured to input the word vectors and the character vectors into a preset chat corpus matching model to obtain a target matching score corresponding to the chat corpus;
  • the corpus cleaning module, used to clean the chat corpus according to the target matching score.
  • a computer device which includes a memory and a processor.
  • the memory stores a computer program.
  • the processor is caused to perform the following steps:
  • chat corpus includes question corpus and answer corpus
  • a computer-readable storage medium storing a computer program, which when executed by a processor causes the processor to perform the following steps:
  • chat corpus includes question corpus and answer corpus
  • The present invention proposes a method, apparatus, computer device and storage medium for cleaning chat corpus.
  • First, the chat corpus to be cleaned is obtained; each entry contains a corresponding question and reply. The question and reply corpora are processed and converted into corresponding word vectors and character vectors, and the trained chat corpus matching model then computes a target matching score for the question and reply, which determines whether the current entry matches and whether it needs to be cleaned out.
  • The originally acquired chat corpus can thus be cleaned automatically according to the chat corpus matching model.
  • There is no need to manually tag each entry, which saves a great deal of manual effort and reduces cost to a certain extent.
  • The above method for cleaning chat corpus also avoids human error and improves the accuracy of corpus cleaning to a certain extent.
  • Further, both the word vectors and the character vectors of the chat corpus are considered at the same time, retaining the features of both to the greatest extent, which improves the effectiveness of feature extraction in the chat corpus matching model and thereby the accuracy of corpus cleaning.
  • FIG. 1 is a schematic diagram of an implementation process of a method for cleaning chat corpus in an embodiment
  • FIG. 2 is a schematic diagram of an implementation process of a method for cleaning chat corpus in an embodiment
  • FIG. 3 is a schematic diagram of an implementation process of chat corpus matching model training in an embodiment
  • FIG. 4 is a schematic diagram of question and answer pair corpus construction in one embodiment
  • FIG. 5 is a schematic diagram of a chat corpus matching model in an embodiment
  • FIG. 6 is a schematic diagram of an implementation process of a method for cleaning chat corpus in an embodiment
  • FIG. 7 is a structural block diagram of a device for cleaning chat corpus in an embodiment
  • FIG. 8 is a structural block diagram of a computer device in an embodiment.
  • a method for cleaning chat corpus is provided.
  • the execution subject of the method for cleaning chat corpus according to the embodiment of the present invention may be a server.
  • the execution subject may also be another terminal device, such as a robot device.
  • the method for cleaning the chat corpus includes the following steps:
  • Step S102 Obtain chat corpus, the chat corpus includes question corpus and answer corpus.
  • the chat corpus is an uncleaned chat corpus obtained from the Internet or other sources, where each chat corpus includes a question (question corpus) and an answer sentence (answer corpus).
  • the corresponding chat corpus is a number of question and answer pairs, such as (question 1, reply 1), (question 2, reply 2),...
  • Before the chat corpus is actually cleaned, it needs to be preprocessed, mainly to handle irregularities that may exist in the raw corpus: removing repeated punctuation marks (for example, when a question is followed by a long run of question marks, only one is retained), removing entries that contain emoticons/stickers, removing spaces contained in the corpus, and filtering out sensitive information (such as politically sensitive words and pornographic or violent words). After this preprocessing, part of the low-quality chat corpus is removed, improving the efficiency and accuracy of subsequent cleaning.
  • The chat corpus also needs further normalization, for example removing punctuation, removing spaces, converting English case, and removing stop words, to strip characters irrelevant to semantic understanding and avoid affecting the accuracy of the subsequent cleaning process.
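The preprocessing rules above can be sketched as follows; the regular expressions and the sensitive-word list are illustrative assumptions, not taken from the patent:

```python
import re

SENSITIVE = {"badword"}  # placeholder sensitive-word list (hypothetical)

def preprocess(text):
    """Collapse repeated punctuation, strip spaces, lowercase English."""
    text = re.sub(r"([?!。？！])\1+", r"\1", text)  # keep one of repeated marks
    text = re.sub(r"\s+", "", text)                 # remove spaces
    return text.lower()                             # English case conversion

def keep(text):
    """Drop corpus entries that contain sensitive words."""
    return not any(w in text for w in SENSITIVE)
```

For example, `preprocess("Really??? OK !!")` yields `"really?ok!"`.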
  • Step S104: Perform word segmentation on the chat corpus, obtain word vectors converted from the segmentation result, and obtain character vectors corresponding to the chat corpus.
  • The word vectors corresponding to the chat corpus are obtained as follows: the question corpus or answer corpus is segmented into words, and each word of the segmented corpus is then converted into a corresponding word vector.
  • Using random initialization from a normal distribution, word vectors are initialized as 300-dimensional vectors, and each segmented word is converted into its corresponding 300-dimensional word vector according to the segmentation result.
  • The character vectors corresponding to the chat corpus are obtained as follows: the question corpus and answer corpus are cut character by character (for English, by character as well), and each character is then converted into a corresponding character vector.
  • Again using random initialization from a normal distribution into 300-dimensional vectors, each cut character is converted into a corresponding 300-dimensional character vector.
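A minimal sketch of the randomly initialized 300-dimensional vectors described above; the per-token CRC32 seeding is an implementation assumption added for reproducibility, not part of the patent:

```python
import random
import zlib

DIM = 300
_table = {}  # token -> 300-dimensional vector

def vector(token):
    """Return the token's vector, drawn from N(0, 1) on first use."""
    if token not in _table:
        seed = zlib.crc32(token.encode("utf-8"))  # deterministic per token
        rng = random.Random(seed)
        _table[token] = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
    return _table[token]

def word_vectors(words):
    """Vectors for a word-segmented sentence."""
    return [vector(w) for w in words]

def char_vectors(text):
    """Vectors for a character-by-character cut."""
    return [vector(c) for c in text]
```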
  • Because of differences between Chinese and English, or simply of sentence length, the lengths of different word-vector or character-vector sequences may be inconsistent.
  • For the convenience of subsequent vector and matrix computation, the sequences are therefore rewritten to a preset length threshold: all word-vector and character-vector sequences are truncated or padded. For example, a maximum length for a user's question is set, and the word-vector and character-vector sequences of the chat corpus are truncated or padded to this length, normalizing the corpus into a normalized question corpus (with its word vectors and character vectors) and answer corpus (with its word vectors and character vectors).
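The truncate-or-pad normalization can be sketched as below, assuming zero vectors for padding (the patent does not specify the padding value):

```python
DIM = 300
PAD = [0.0] * DIM  # padding vector (assumption: zero vector)

def normalize_length(vectors, max_len):
    """Truncate or pad a sequence of vectors to exactly max_len entries."""
    vectors = vectors[:max_len]                  # truncate if too long
    vectors = vectors + [PAD] * (max_len - len(vectors))  # pad if too short
    return vectors
```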
  • Let q1w and q2w be the word vectors corresponding to the question corpus and the answer corpus in the chat corpus,
  • and let q1c and q2c be the character vectors corresponding to the question corpus and the answer corpus, namely:
  • Step S106: Input the word vectors and the character vectors into a preset chat corpus matching model to obtain a target matching score corresponding to the chat corpus.
  • Step S108: Clean the chat corpus according to the target matching score.
  • The chat corpus matching model is a model constructed on a deep learning model to evaluate and predict whether a chat corpus entry matches: the input is the word/character vectors of the question corpus and of the answer corpus, and the output is the target matching score between the question corpus and the answer corpus.
  • In this step, after the target matching score is obtained, the corpus can be cleaned accordingly. For example, as shown in FIG. 2, when the target matching score is greater than or equal to a preset matching threshold, the chat corpus entry is retained; conversely, when the score is below the threshold, the entry is discarded.
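The retain-or-discard rule can be sketched as below; `toy_score` is a simple character-overlap stand-in for the matching model, not the model described in the patent:

```python
def clean(corpus, score_fn, threshold=0.5):
    """Keep (question, answer) pairs whose match score reaches the threshold."""
    kept, dropped = [], []
    for q, a in corpus:
        (kept if score_fn(q, a) >= threshold else dropped).append((q, a))
    return kept, dropped

def toy_score(q, a):
    """Toy stand-in for the matching model: Jaccard overlap of characters."""
    qs, sa = set(q), set(a)
    return len(qs & sa) / max(len(qs | sa), 1)
```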
  • the process of obtaining the target matching score corresponding to the chat corpus through the chat corpus matching model is as follows:
  • Step S1062: Perform cross-product processing on the word vectors and character vectors of the chat corpus according to a preset cross-product function, and obtain a preset number of mapping vectors of the cross-product result according to a preset mapping function, the mapping vectors including mapping word vectors and mapping character vectors; fuse the mapping word vectors and mapping character vectors according to a preset fusion algorithm, and perform feature extraction on the fused result to obtain a first target matching score corresponding to the chat corpus;
  • Step S1064: Perform feature extraction on the word vectors and character vectors of the chat corpus respectively, fuse the extracted word vectors and character vectors according to a preset fusion algorithm, and input the fusion result into a preset projection layer to obtain a second target matching score corresponding to the chat corpus;
  • Step S1066: According to a preset matching-splicing algorithm, calculate the target matching score corresponding to the chat corpus from the first target matching score and the second target matching score.
  • Specifically, in step S1062:
  • W(l-1) and b(l-1) are the weight matrix and bias vector of the corresponding projection layer, obtained through model training; after activation by the preset activation function, the output is the first target matching score corresponding to the chat corpus.
  • In step S1064:
  • U and V are the weight matrices of the word vectors and character vectors respectively;
  • bw and bv are the bias vectors of the word vectors and character vectors;
  • activation is performed with the activation function relu;
  • W(h) and b(h) are the weight matrix and bias vector of the projection layer;
  • the fusion result is input to a projection layer (project layer), and the output is the second target matching score corresponding to the chat corpus.
  • Through steps S1062 and S1064, the first target matching score z(l) and the second target matching score h corresponding to the chat corpus are obtained; the overall target matching score e can then be calculated from them.
  • In step S1066, the target matching score corresponding to the chat corpus is calculated from the first target matching score and the second target matching score according to the preset formula,
  • yielding the target matching score e for the chat corpus obtained in step S102.
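A rough sketch of one relu projection step and of splicing the two branch scores into e; the patent's exact splicing formula is shown only as an image in the source, so simple averaging is assumed here as one plausible reading:

```python
def relu(v):
    """Element-wise rectified linear unit."""
    return [max(0.0, x) for x in v]

def project(vec, weights, bias):
    """One projection layer: relu(W . vec + b), with W as a list of rows."""
    return relu([sum(w * x for w, x in zip(row, vec)) + b
                 for row, b in zip(weights, bias)])

def splice_scores(score_first, score_second):
    """Combine the two branch scores into the target matching score e.
    Averaging is an assumption; the patent's formula is not reproduced in text."""
    return 0.5 * (score_first + score_second)
```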
  • Different matching thresholds can be set according to the degree or round of filtering.
  • For example, the matching threshold for the first round of filtering can be set to 0.5; over multiple rounds of cleaning the threshold is gradually raised, with the last round of cleaning using a matching threshold of 0.9.
  • the method for cleaning the chat corpus further includes the following steps:
  • Step S202: Obtain a training corpus and construct a question-answer pair corpus from the training corpus;
  • Step S204: Train the preset chat corpus matching model on the question-answer pair corpus to obtain the trained chat corpus matching model.
  • The training corpus is chat corpus that has been preprocessed after acquisition, and may be the same corpus as the aforementioned chat corpus.
  • From it, a corresponding question-answer pair corpus is constructed, in a form that matches the training-data format of the chat corpus matching model.
  • Specifically, positive and negative sample pairs are constructed as follows.
  • Given the three chat corpus entries (Question 1, Reply 1), (Question 2, Reply 2), and (Question 3, Reply 3), six question-answer triples can be formed, as shown in FIG. 4: (Question 1, Reply 1, Reply 2), (Question 1, Reply 1, Reply 3), (Question 2, Reply 2, Reply 1), (Question 2, Reply 2, Reply 3), (Question 3, Reply 3, Reply 1), and (Question 3, Reply 3, Reply 2), where the triple (Question 1, Reply 1, Reply 2) indicates that Question 1 matches Reply 1 better than it matches Reply 2.
  • Training, validation, and test samples are then constructed in a ratio of 8:1:1, completing the data preparation for training the entire chat corpus matching model.
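The triple construction and the 8:1:1 split above can be sketched as follows, drawing each negative reply from the other question-answer pairs as in the FIG. 4 example:

```python
import random

def build_triples(pairs):
    """From (question, reply) pairs build (question, pos_reply, neg_reply)
    triples, using every other pair's reply as the negative."""
    triples = []
    for i, (q, pos) in enumerate(pairs):
        for j, (_, neg) in enumerate(pairs):
            if i != j:
                triples.append((q, pos, neg))
    return triples

def split_8_1_1(samples, seed=0):
    """Shuffle and split samples into train/validation/test at 8:1:1."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    n = len(samples)
    a, b = int(n * 0.8), int(n * 0.9)
    return samples[:a], samples[a:b], samples[b:]
```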
  • The question-answer pair corpus includes a training question corpus, a first answer corpus, and a second answer corpus; these are converted into word vectors and character vectors as in steps S102-S108, so the word vectors include the question corpus word vectors, first answer corpus word vectors, and second answer corpus word vectors, and the character vectors include the question corpus character vectors, first answer corpus character vectors, and second answer corpus character vectors.
  • Let q1w be the word vectors of the question corpus,
  • q2w the word vectors of the first answer corpus,
  • q3w the word vectors of the second answer corpus,
  • q1c the character vectors of the question corpus,
  • q2c the character vectors of the first answer corpus,
  • and q3c the character vectors of the second answer corpus.
  • The first evaluation score e(q1, q2), for the match between the question corpus and the first answer corpus, and the second evaluation score e(q1, q3), for the match between the question corpus and the second answer corpus, are calculated respectively.
  • For training, the hinge-loss loss function can be used, where:
  • margin is the similarity interval between positive and negative samples (in this embodiment, margin can be set to 1);
  • e(q1, q2) is the value computed by the chat corpus matching model for the input q1, q2;
  • e(q1, q3) is the value computed by the chat corpus matching model for the input q1, q3;
  • the remaining parameters are the currently given model parameters.
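The hinge loss itself appears only as an image in the source; a standard ranking hinge loss consistent with the surrounding description (margin 1, positive pair (q1, q2), negative pair (q1, q3)) would be:

```python
def hinge_loss(score_pos, score_neg, margin=1.0):
    """Ranking hinge loss: zero once the positive pair's score beats the
    negative pair's score by at least the margin."""
    return max(0.0, margin - score_pos + score_neg)
```

For example, a well-separated pair such as `hinge_loss(2.0, 0.5)` incurs no loss, while `hinge_loss(0.5, 0.3)` still pays a penalty because the gap is below the margin.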
  • Model training is then completed through gradient updates.
  • The parameter values can be initialized by, for example, random initialization from a normal distribution; after each round of model training, the parameters in the model are updated and iterated, and the next batch of chat corpus is cleaned, i.e. the foregoing steps S102-S108 are performed.
  • After the training of the chat corpus matching model is completed through steps S202-S204, the chat corpus can be cleaned according to the matching model.
  • The chat corpus remaining after the first round of cleaning can in turn be used as training corpus for the matching model, which undergoes another round of model training, after which the corpus is cleaned again, and so on.
  • During this process the matching threshold is continuously raised: for example, in the first cleaning round the threshold is taken as 0.5, and filtering proceeds gradually with thresholds of 0.5, 0.6, 0.7, 0.8, and 0.9 to complete the final cleaning.
  • Through these repeated iterations the chat corpus is cleaned without supervision, which saves a great deal of manual cleaning time, ensures the quality of the cleaned corpus, and improves the accuracy of subsequently trained intelligent chatbots.
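The alternating train-and-clean loop with rising thresholds can be sketched as below; `train_fn` and `score_fn` are placeholders standing in for the model-training and scoring routines, which the patent does not give in code:

```python
def iterative_clean(corpus, train_fn, score_fn,
                    thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """Alternate retraining and cleaning, raising the threshold each round."""
    for t in thresholds:
        model = train_fn(corpus)  # retrain on the surviving corpus
        corpus = [(q, a) for q, a in corpus if score_fn(model, q, a) >= t]
    return corpus
```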
  • In one embodiment, a cleaning device for chat corpus is provided, which specifically includes:
  • the chat corpus acquisition module 102, used to acquire chat corpus, the chat corpus including question corpus and answer corpus;
  • the chat corpus processing module 104, configured to perform word segmentation on the chat corpus, obtain word vectors converted from the segmentation result, and obtain character vectors corresponding to the chat corpus;
  • the model calculation module 106, configured to input the word vectors and the character vectors into a preset chat corpus matching model to obtain a target matching score corresponding to the chat corpus;
  • the corpus cleaning module 108, used to clean the chat corpus according to the target matching score.
  • The above cleaning device first obtains the chat corpus to be cleaned; each entry contains a corresponding question and reply. The question and reply corpora are processed and converted into corresponding word vectors and character vectors, and
  • the trained chat corpus matching model then calculates the target matching score for the question and reply, so as to judge whether the current entry matches and whether it needs to be cleaned out.
  • The originally acquired chat corpus can thus be cleaned automatically according to the matching model; there is no need to manually tag each entry, which saves a great deal of manual effort and reduces cost to a certain extent.
  • The above method for cleaning chat corpus also avoids human error and improves the accuracy of corpus cleaning to a certain extent.
  • Both the word vectors and the character vectors of the chat corpus are considered at the same time, retaining the features of both to the maximum extent,
  • which improves the effectiveness of feature extraction in the chat corpus matching model and thereby the accuracy of corpus cleaning.
  • In one embodiment, the model calculation module 106 is further configured to: perform cross-product processing on the word vectors and character vectors of the chat corpus according to a preset cross-product function, and obtain a preset number of mapping vectors of the cross-product result according to a preset mapping function, the mapping vectors including mapping word vectors and mapping character vectors; fuse the mapping word vectors and mapping character vectors according to a preset fusion algorithm and perform feature extraction on the fused result to obtain the first target matching score corresponding to the chat corpus; perform feature extraction on the word vectors and character vectors of the chat corpus respectively, fuse the extracted vectors according to a preset fusion algorithm, and input the fusion result into a preset projection layer to obtain a second target matching score corresponding to the chat corpus; and, according to a preset matching-splicing algorithm, calculate the target matching score corresponding to the chat corpus from the first target matching score and the second target matching score.
  • In one embodiment, the above device further includes a vector rewriting module 110, used to rewrite the length of the word vectors according to a preset first length threshold, and to rewrite the length of the character vectors according to a preset second length threshold.
  • In one embodiment, the corpus cleaning module 108 is further used to determine whether the target matching score is greater than or equal to a preset matching threshold; when the target matching score is less than the matching threshold, the chat corpus entry is cleaned out.
  • In one embodiment, the device for cleaning chat corpus further includes a model training module 112, which is used to:
  • train the preset chat corpus matching model on the question-answer pair corpus to obtain the trained chat corpus matching model.
  • In one embodiment, the question-answer pair corpus includes a training question corpus, a first answer corpus, and a second answer corpus;
  • the word vectors include question corpus word vectors, first answer corpus word vectors, and second answer corpus word vectors, and the character vectors likewise.
  • The model training module 112 is also used to evaluate the question-answer pair corpus according to the preset chat corpus matching model, obtaining a first evaluation score for the match between the training question corpus and the first training answer corpus
  • and a second evaluation score for the match between the training question corpus and the second training answer corpus; the first and second evaluation scores are input to a preset loss function, which outputs the corresponding loss value; the loss value is then updated and iterated according to a preset iterative algorithm to update the chat corpus matching model.
  • In one embodiment, the model training module 112 is further configured to use the cleaned chat corpus as training corpus to train the chat corpus matching model, obtaining the trained chat corpus matching model.
  • FIG. 8 shows an internal structure diagram of a computer device in an embodiment.
  • the computer device may be a server or a robot.
  • the computer device includes a processor, memory, and network interface connected by a system bus.
  • the memory includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium of the computer device stores an operating system and may also store a computer program.
  • when that computer program is executed by the processor, it may cause the processor to implement the method for cleaning chat corpus.
  • a computer program may also be stored in the internal memory.
  • when this computer program is executed by the processor, it may cause the processor to execute the method for cleaning chat corpus.
  • a specific computer device may include more or fewer components than shown in the figure, combine some components, or arrange the components differently.
  • the method for cleaning chat corpus provided in this application may be implemented in the form of a computer program, and the computer program may run on the computer device shown in FIG. 8.
  • the memory of the computer device can store the program modules that constitute the cleaning device for chat corpus.
  • a computer device includes a memory and a processor.
  • the memory stores a computer program.
  • the processor is caused to perform the following steps:
  • chat corpus includes question corpus and answer corpus
  • a computer-readable storage medium which stores a computer program, and when the computer program is executed by a processor, the processor is caused to perform the following steps:
  • chat corpus includes question corpus and answer corpus
  • The cleaning method for chat corpus, the cleaning device for chat corpus, the computer device, and the computer-readable storage medium belong to one general inventive concept,
  • and the contents of their respective embodiments are applicable to one another.
  • Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory can include random access memory (RAM) or external cache memory.
  • By way of illustration and not limitation, RAM is available in many forms, such as dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM).


Abstract

A method, apparatus, computer device, and storage medium for cleaning chat corpus, including: obtaining chat corpus, the chat corpus including question corpus and answer corpus (S102); performing word segmentation on the chat corpus, obtaining word vectors converted from the segmentation result, and obtaining character vectors corresponding to the chat corpus (S104); inputting the word vectors and the character vectors into a preset chat corpus matching model to obtain a target matching score corresponding to the chat corpus (S106); and cleaning the chat corpus according to the target matching score. In this way, chat corpus can be cleaned automatically, improving its quality and thereby the accuracy of subsequent model training.

Description

Method, Apparatus, Computer Device and Storage Medium for Cleaning Chat Corpus
Technical Field
The present invention relates to the fields of computer technology and deep learning, and in particular to a method, apparatus, computer device and storage medium for cleaning chat corpus.
Background
Intelligent chatbots have long been a major research direction in artificial intelligence: how to make a chatbot converse as naturally as a human through deep learning and other methods, for example serving as an intelligent customer-service agent in a product's after-sales department. In training current intelligent chatbots, whether retrieval-based or generative, chat corpus is needed to train the robot.
Question-and-answer training of intelligent chatbots requires a large amount of chat corpus. At present, much of it comes from open-source material on the Internet, but this corpus is generally of low quality and needs to be cleaned. Manual screening requires professional annotators to tag the corpus, which is labor-intensive and inefficient, and differences in the annotators' skill and understanding may make the results insufficiently accurate, leaving the final training corpus of low quality.
Summary of the Invention
Based on this, in view of the above problems, it is necessary to provide a method, apparatus, computer device and storage medium for cleaning chat corpus with high cleaning efficiency.
In a first aspect of the present invention, a method for cleaning chat corpus is provided, the method including:
obtaining chat corpus, the chat corpus including question corpus and answer corpus;
performing word segmentation on the chat corpus, obtaining word vectors converted from the segmentation result, and obtaining character vectors corresponding to the chat corpus;
inputting the word vectors and the character vectors into a preset chat corpus matching model to obtain a target matching score corresponding to the chat corpus;
cleaning the chat corpus according to the target matching score.
In a second aspect of the present invention, an apparatus for cleaning chat corpus is provided, including:
a chat corpus acquisition module, used to obtain chat corpus, the chat corpus including question corpus and answer corpus;
a chat corpus processing module, used to perform word segmentation on the chat corpus, obtain word vectors converted from the segmentation result, and obtain character vectors corresponding to the chat corpus;
a model calculation module, used to input the word vectors and the character vectors into a preset chat corpus matching model to obtain a target matching score corresponding to the chat corpus;
a corpus cleaning module, used to clean the chat corpus according to the target matching score.
In a third aspect of the present invention, a computer device is provided, including a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the following steps:
obtaining chat corpus, the chat corpus including question corpus and answer corpus;
performing word segmentation on the chat corpus, obtaining word vectors converted from the segmentation result, and obtaining character vectors corresponding to the chat corpus;
inputting the word vectors and the character vectors into a preset chat corpus matching model to obtain a target matching score corresponding to the chat corpus;
cleaning the chat corpus according to the target matching score.
In a fourth aspect of the present invention, a computer-readable storage medium is provided, storing a computer program which, when executed by a processor, causes the processor to perform the following steps:
obtaining chat corpus, the chat corpus including question corpus and answer corpus;
performing word segmentation on the chat corpus, obtaining word vectors converted from the segmentation result, and obtaining character vectors corresponding to the chat corpus;
inputting the word vectors and the character vectors into a preset chat corpus matching model to obtain a target matching score corresponding to the chat corpus;
cleaning the chat corpus according to the target matching score.
Implementing the embodiments of the present invention has the following beneficial effects:
The present invention proposes a method, apparatus, computer device and storage medium for cleaning chat corpus. First, the chat corpus to be cleaned is obtained; each entry contains a corresponding question and reply. The question and reply corpora are processed and converted into corresponding word vectors and character vectors, and the trained chat corpus matching model then computes a target matching score for the question and reply, which determines whether the current entry matches and whether it needs to be cleaned out. That is, the originally acquired chat corpus can be cleaned automatically according to the matching model; there is no need to manually tag every entry, which saves a great deal of manual effort and reduces cost to a certain extent. Moreover, the above cleaning method avoids human error and improves the accuracy of corpus cleaning to a certain extent.
Further, in this embodiment, both the word vectors and the character vectors of the chat corpus are considered while training the matching model and computing the target matching scores, retaining the features of both to the greatest extent, which improves the effectiveness of feature extraction in the matching model and thereby the accuracy of corpus cleaning.
Brief Description of the Drawings
To explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from them without creative effort.
Among them:
FIG. 1 is a schematic flowchart of a method for cleaning chat corpus in an embodiment;
FIG. 2 is a schematic flowchart of a method for cleaning chat corpus in an embodiment;
FIG. 3 is a schematic flowchart of training a chat corpus matching model in an embodiment;
FIG. 4 is a schematic diagram of constructing a question-answer pair corpus in an embodiment;
FIG. 5 is a schematic diagram of a chat corpus matching model in an embodiment;
FIG. 6 is a schematic flowchart of a method for cleaning chat corpus in an embodiment;
FIG. 7 is a structural block diagram of an apparatus for cleaning chat corpus in an embodiment;
FIG. 8 is a structural block diagram of a computer device in an embodiment.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art without creative effort based on the embodiments of the present invention fall within the scope of protection of the present invention.
As shown in FIG. 1, in one embodiment, a method for cleaning chat corpus is provided. The execution subject of the method in this embodiment of the present invention may be a server; it may also be another terminal device, for example a robot device.
Specifically, as shown in FIG. 1, the method includes the following steps:
Step S102: Obtain chat corpus, the chat corpus including question corpus and answer corpus.
The chat corpus is uncleaned chat corpus obtained from the Internet or other sources, where each entry includes a question sentence (question corpus) and an answer sentence (answer corpus). For example, the chat corpus consists of a number of question-answer pairs, such as (Question 1, Reply 1), (Question 2, Reply 2), and so on.
It should be noted that in this embodiment, before the chat corpus is actually cleaned, it needs to be preprocessed, mainly to handle irregularities that may exist in the raw corpus, such as removing repeated punctuation marks (for example, when a question is followed by a long run of question marks, only one is retained), removing entries that contain emoticons/stickers, removing spaces contained in the corpus, and filtering out sensitive information (such as politically sensitive words and pornographic or violent words). After such preprocessing, part of the low-quality chat corpus is removed, improving the efficiency and accuracy of subsequent cleaning.
Further, in this embodiment, the chat corpus also needs further normalization, for example removing punctuation, removing spaces, converting English case, and removing stop words, to strip characters irrelevant to semantic understanding and avoid affecting the accuracy of the subsequent cleaning process.
Step S104: performing word segmentation on the chat corpus, obtaining word vectors converted from the segmentation result, and obtaining character vectors corresponding to the chat corpus.
In this embodiment, when processing the chat corpus into vectors, both the word vectors and the character vectors corresponding to the corpus are obtained; in the subsequent model matching computation, the characteristics of both the word vectors and the character vectors are taken into account, improving the accuracy of corpus cleaning.
Specifically, the word vectors are obtained as follows: word segmentation is performed on the question corpus or answer corpus in the chat corpus, and each word in the segmented corpus is then converted into a corresponding word vector.
Here, word vectors are randomly initialized from a normal distribution as 300-dimensional vectors, and each segmented word is converted into a corresponding 300-dimensional word vector according to the segmentation result.
The character vectors are obtained as follows: the question corpus and answer corpus in the chat corpus are split by character (for English, split character by character), and each character is then converted into a corresponding character vector.
Here, character vectors are likewise randomly initialized from a normal distribution as 300-dimensional vectors, and each split character is converted into a corresponding 300-dimensional character vector.
In this embodiment, because of differences between Chinese and English or differences in length, the resulting word-vector and character-vector sequences may have inconsistent lengths. For the convenience of subsequent vector and matrix computation, the word-vector and character-vector sequences are rewritten to a preset length threshold, i.e., all word-vector or character-vector sequences are truncated or padded to that length. For example, a maximum length for a user question is set, and the word-vector and character-vector sequences of the chat corpus are truncated or padded to this length, normalizing the chat corpus and yielding normalized question corpora (corresponding word vectors and character vectors) and answer corpora (corresponding word vectors and character vectors).
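The truncation/padding step can be sketched as follows (a minimal sketch; `max_len` and the pad value are illustrative choices, not values fixed by the embodiment):

```python
def pad_or_truncate(seq, max_len, pad_value=0):
    # Truncate sequences longer than max_len; pad shorter ones with pad_value
    if len(seq) >= max_len:
        return seq[:max_len]
    return seq + [pad_value] * (max_len - len(seq))
```
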
In a specific embodiment, let q_1w and q_2w be the word-vector sequences of the question corpus and the answer corpus, and q_1c and q_2c be the character-vector sequences of the question corpus and the answer corpus, i.e.:
q_1w = (x_1w, x_2w, x_3w, ..., x_mw)
q_2w = (y_1w, y_2w, y_3w, ..., y_nw)
q_1c = (x_1c, x_2c, x_3c, ..., x_pc)
q_2c = (y_1c, y_2c, y_3c, ..., y_qc)
where m and n are the word-vector lengths of the question corpus and the answer corpus (here m = n), and p and q are the character-vector lengths of the question corpus and the answer corpus (here p = q).
Step S106: inputting the word vectors and the character vectors into a preset chat-corpus matching model to obtain a target match score corresponding to the chat corpus.
Step S108: cleaning the chat corpus according to the target match score.
In this embodiment, the chat-corpus matching model is a model built on a deep learning model to evaluate and predict whether the entries of a chat corpus match. Its input is the word/character vectors of the question corpus and the word/character vectors of the answer corpus, and its output is the target match score between the question corpus and the answer corpus.
In this step, once the target match score of the chat corpus is obtained, cleaning can be performed according to it. For example, as shown in FIG. 2, when the target match score is greater than or equal to a preset match threshold, the entry is retained; conversely, when the target match score is less than the preset match threshold, the entry is discarded.
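The retain-or-discard decision can be sketched as a simple filter (the 0.5 default mirrors the first-pass threshold mentioned later in the text; the scores would come from the trained matching model):

```python
def clean_corpus(pairs, scores, threshold=0.5):
    # Keep a pair only when its target match score reaches the threshold
    return [pair for pair, score in zip(pairs, scores) if score >= threshold]
```
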
Specifically, in a specific embodiment, the process of obtaining the target match score of the chat corpus through the chat-corpus matching model is as follows:
Step S1062: performing cross-product processing on the word vectors and the character vectors of the chat corpus according to a preset cross-product function, and obtaining a preset number of mapped vectors from the cross-product results according to a preset mapping function, the mapped vectors including mapped word vectors and mapped character vectors; fusing the mapped word vectors and mapped character vectors according to a preset fusion algorithm, performing feature extraction on the fused result, and obtaining a first target match score corresponding to the chat corpus;
Step S1064: performing feature extraction on the word vectors and the character vectors of the chat corpus respectively, fusing the feature-extracted word vectors and character vectors according to a preset fusion algorithm, inputting the fused result into a preset projection layer, and obtaining a second target match score corresponding to the chat corpus;
Step S1066: computing the target match score corresponding to the chat corpus from the first target match score and the second target match score according to a preset match-splicing algorithm.
Specifically, in step S1062 (the original equations are rendered as images; the notation below is reconstructed from the surrounding description):
u_w = f(relu(q_1w ⊗ q_2w))
u_c = f(relu(q_1c ⊗ q_2c))
where q_1w ⊗ q_2w denotes the cross product of q_1w and q_2w, relu is the preset activation function used for activation, and f denotes the mapping (Mapping) function; here, the Mapping function selects the top K values of the cross-product result (TopK, e.g., K = 10 or K = 30). In this way, the key top-K values of the cross product of the corresponding word/character vectors are extracted, and the input question or reply is converted into a fixed length.
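A minimal NumPy sketch of this cross-product-then-TopK mapping (the function name and the descending output ordering are assumptions made for illustration):

```python
import numpy as np

def topk_cross(a, b, k):
    # Outer (pairwise) product of the two vectors, then relu
    activated = np.maximum(np.outer(a, b), 0.0)
    # Mapping function f: keep the k largest activations -> fixed-length output
    return np.sort(activated.ravel())[::-1][:k]
```

Whatever the input lengths, the output always has k entries, which is what makes the downstream layers operate on fixed-size features.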
The TopK word-vector and character-vector features are then merged and fused, i.e. (reconstructed notation):
z^(0) = u_w ⊕ u_c
After the word-vector and character-vector features are fused, feature extraction is performed on the fused result:
z^(l) = relu(W^(l-1) z^(l-1) + b^(l-1)),  l = 1, 2, ..., L
where W^(l-1) and b^(l-1) are the weight parameter matrix and bias vector of the corresponding projection layer, obtained through model training; activation is applied through the preset activation function, and the output is the first target match score corresponding to the chat corpus.
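The layer recursion z^(l) = relu(W^(l-1) z^(l-1) + b^(l-1)) can be sketched as follows (a minimal sketch; the weight and bias lists stand in for the trained parameters W, b):

```python
import numpy as np

def mlp_forward(z, weights, biases):
    # Apply z = relu(W z + b) for each layer l = 1..L
    for W, b in zip(weights, biases):
        z = np.maximum(W @ z + b, 0.0)
    return z
```
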
In step S1064:
Feature extraction is first performed on the word vectors and character vectors of the chat corpus:
g_w = relu(U q_1w + b_w)
g_v = relu(V q_1c + b_v)
where U and V are the weight matrices for the word vectors and character vectors respectively, b_w and b_v are the bias vectors for the word vectors and character vectors, and the activation function relu is applied.
After activation, the feature-extracted word vectors and character vectors are fused according to the preset fusion algorithm (reconstructed notation):
g = g_w ⊕ g_v
After fusion, the fused result is projected:
h = relu(W^(h) g + b^(h))
where W^(h) and b^(h) are the weight matrix and bias vector of the projection layer.
That is, the fused result is input into a projection (project) layer, and the output is the second target match score corresponding to the chat corpus.
In steps S1062 and S1064, the first target match score z^(L) and the second target match score h corresponding to the chat corpus are obtained, and the overall target match score e can then be computed from the first and second target match scores.
Specifically, in step S1066, the target match score corresponding to the chat corpus is computed from the first target match score and the second target match score according to the preset match-splicing algorithm, i.e., according to the formula
[equation rendered as an image in the original: the preset match-splicing formula combining z^(L) and h]
the target match score e corresponding to the chat corpus in step S102 is computed.
In this embodiment, different match thresholds can be set according to the degree or round of filtering. For example, the threshold may be set to 0.5 for the first filtering pass and raised step by step over multiple cleaning rounds, with the threshold in the last or final cleaning pass set to 0.9.
Further, in this embodiment, the chat-corpus matching model also needs to be trained and validated before the actual corpus cleaning is performed.
In a specific embodiment, as shown in FIG. 3, the method for cleaning a chat corpus further includes the following steps:
Step S202: obtaining a training corpus, and constructing question-answer pair corpora from the training corpus;
Step S204: training the preset chat-corpus matching model on the question-answer pair corpora to obtain a trained chat-corpus matching model.
The training corpus is the chat corpus obtained after preprocessing the collected casual-chat corpus, and may be the same corpus as described above. In this embodiment, after the training corpus is obtained, question-answer pair corpora are constructed in the form required by the matching model's training data.
Specifically, positive and negative sample pairs are constructed as follows. Given the three chat entries question 1-reply 1, question 2-reply 2, and question 3-reply 3, six question-answer triples can be formed, as shown in FIG. 4: (question 1, reply 1, reply 2), (question 1, reply 1, reply 3), (question 2, reply 2, reply 1), (question 2, reply 2, reply 3), (question 3, reply 3, reply 1), and (question 3, reply 3, reply 2), where the triple (question 1, reply 1, reply 2) expresses that the match between question 1 and reply 1 is higher than the match between question 1 and reply 2.
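The triple construction above can be sketched as a straightforward enumeration, pairing every question's own reply (positive) with every other pair's reply (negative):

```python
def build_triplets(qa_pairs):
    # For each (question, reply), its own reply is the positive sample and
    # every other pair's reply serves as a negative sample
    triplets = []
    for i, (question, positive) in enumerate(qa_pairs):
        for j, (_, negative) in enumerate(qa_pairs):
            if i != j:
                triplets.append((question, positive, negative))
    return triplets
```

Three QA pairs yield exactly the six triples listed above.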
Further, in this embodiment, when constructing the training corpus, training, validation, and test samples are built in an 8:1:1 ratio, so as to complete the training of the chat-corpus matching model.
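The 8:1:1 split can be sketched as follows (slicing an already-shuffled sample list; the shuffling itself is assumed to be done beforehand):

```python
def split_811(samples):
    # Split samples into train/validation/test at an 8:1:1 ratio
    n_train = int(len(samples) * 0.8)
    n_val = int(len(samples) * 0.1)
    train = samples[:n_train]
    val = samples[n_train:n_train + n_val]
    test = samples[n_train + n_val:]
    return train, val, test
```
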
That is, the question-answer pair corpora include a training question corpus, a first answer corpus, and a second answer corpus; they are converted into word vectors and character vectors in the same way as in the foregoing steps S102-S108, so the word vectors include question-corpus word vectors, first-answer-corpus word vectors, and second-answer-corpus word vectors, and the character vectors include question-corpus character vectors, first-answer-corpus character vectors, and second-answer-corpus character vectors.
The question-answer pair corpora constructed from the training corpus are input into the chat-corpus matching model shown in FIG. 5 to obtain a first evaluation score corresponding to the question corpus and the first answer corpus and a second evaluation score matching the question corpus and the second answer corpus. The first and second evaluation scores are then compared against the ground truth (the first evaluation score should be greater than the second), completing the training of the model.
Specifically, the computation of the chat-corpus matching model during training is as follows:
Let q_1w be the word vector of the question corpus, q_2w the word vector of the first answer corpus, and q_3w the word vector of the second answer corpus; let q_1c be the character vector of the question corpus, q_2c the character vector of the first answer corpus, and q_3c the character vector of the second answer corpus.
During training of the chat-corpus matching model, the first evaluation score e(q_1, q_2) corresponding to the question corpus and the first answer corpus and the second evaluation score e(q_1, q_3) matching the question corpus and the second answer corpus are computed respectively.
A loss value L is then computed according to a preset loss function; specifically, the hinge loss can be used:
L(q_1, q_2, q_3; Θ) = max(0, margin − e(q_1, q_2) + e(q_1, q_3))
where margin is the similarity gap between positive and negative samples (set to 1 in this embodiment), e(q_1, q_2) is the value computed by the chat-corpus matching model with q_1 and q_2 as input, e(q_1, q_3) is the value computed with q_1 and q_3 as input, and Θ denotes the current model parameters.
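The hinge loss above is straightforward to express in code (margin = 1 per this embodiment):

```python
def hinge_loss(score_pos, score_neg, margin=1.0):
    # Zero loss once the positive pair outscores the negative by >= margin
    return max(0.0, margin - score_pos + score_neg)
```
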
The model training is completed by performing gradient updates according to the loss value; to speed up training, the Adam algorithm is used for the gradient updates. Finally, the model and its parameters are saved, and the model is updated.
It should be noted that, for the first round of model training, the parameters may take initial values, for example parameters randomly initialized from a normal distribution. After each training round, the model parameters are updated and iterated for the next round of corpus cleaning, i.e., the foregoing steps S102-S108 are executed.
In a specific embodiment, after the chat-corpus matching model is trained through steps S202-S204, the chat corpus can be cleaned with it. Furthermore, the corpus surviving the first cleaning round can in turn serve as training corpus for another round of model training, after which the corpus is cleaned again. During this loop, the match threshold is progressively raised: for example, the threshold is 0.5 in the first cleaning round and is stepped through 0.5, 0.6, 0.7, 0.8, 0.9 to complete the final cleaning. In the last round, the target match score output by the matching model is compared against 0.9, the entry being retained if the score is greater than or equal to 0.9 and filtered out otherwise. See FIG. 6 for details.
In this embodiment, through repeated cycles of model training and corpus cleaning, the chat corpus is cleaned in an unsupervised manner, greatly saving the time of manual cleaning while guaranteeing the quality of the cleaned corpus and improving the accuracy of subsequent training of intelligent chatbots.
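The clean-retrain loop with a rising threshold can be sketched as follows; `score_fn` stands in for the (re)trained matching model, and in the full pipeline the model would be retrained on the survivors between rounds:

```python
def iterative_clean(pairs, score_fn, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    # Filter the corpus once per round, raising the match threshold each time;
    # entries surviving the final 0.9 round are kept, the rest are filtered out
    for threshold in thresholds:
        pairs = [pair for pair in pairs if score_fn(pair) >= threshold]
    return pairs
```
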
As shown in FIG. 7, an apparatus for cleaning a chat corpus is provided, specifically including:
a chat-corpus obtaining module 102, configured to obtain a chat corpus, the chat corpus including a question corpus and an answer corpus;
a chat-corpus processing module 104, configured to perform word segmentation on the chat corpus, obtain word vectors converted from the segmentation result, and obtain character vectors corresponding to the chat corpus;
a model computation module 106, configured to input the word vectors and the character vectors into a preset chat-corpus matching model and obtain a target match score corresponding to the chat corpus;
a corpus cleaning module 108, configured to clean the chat corpus according to the target match score.
The above apparatus for cleaning a chat corpus first obtains the chat corpus to be cleaned, each entry of which contains a corresponding question and reply; the corpora corresponding to the question and the reply are processed and converted into corresponding word vectors and character vectors, and the trained chat-corpus matching model then computes a target match score for the question and reply, which is used to judge whether the current entry matches and whether it needs to be cleaned out. In other words, the raw chat corpus can be cleaned automatically by the chat-corpus matching model, without manually labeling every entry, saving a large amount of manual effort and reducing cost to some extent. Moreover, this cleaning method avoids human error in manual operation and also improves the accuracy of corpus cleaning to some extent.
Further, in this embodiment, both the word vectors and the character vectors of the chat corpus are considered when training the chat-corpus matching model and when computing the target match score between corpora, preserving the features of both representations to the greatest extent, improving the effectiveness of feature extraction in the matching model and thereby the accuracy of corpus cleaning.
In one embodiment, the model computation module 106 is further configured to: perform cross-product processing on the word vectors and character vectors of the chat corpus according to a preset cross-product function, and obtain a preset number of mapped vectors from the cross-product results according to a preset mapping function, the mapped vectors including mapped word vectors and mapped character vectors; fuse the mapped word vectors and mapped character vectors according to a preset fusion algorithm, perform feature extraction on the fused result, and obtain a first target match score corresponding to the chat corpus; perform feature extraction on the word vectors and character vectors of the chat corpus respectively, fuse the feature-extracted word vectors and character vectors according to the preset fusion algorithm, input the fused result into a preset projection layer, and obtain a second target match score corresponding to the chat corpus; and compute the target match score corresponding to the chat corpus from the first target match score and the second target match score according to a preset match-splicing algorithm.
In one embodiment, as shown in FIG. 7, the apparatus further includes a vector rewriting module 110, configured to rewrite the length of the word vectors according to a preset first length threshold, and rewrite the length of the character vectors according to a preset second length threshold.
In one embodiment, the corpus cleaning module 108 is further configured to judge whether the target match score is greater than or equal to a preset match threshold, and to clean out the chat corpus when the target match score is less than the match threshold.
In one embodiment, as shown in FIG. 7, the apparatus for cleaning a chat corpus further includes a model training module 112, configured to:
obtain a training corpus, and construct question-answer pair corpora from the training corpus;
train the preset chat-corpus matching model on the question-answer pair corpora to obtain a trained chat-corpus matching model.
In one embodiment, the question-answer pair corpora include a training question corpus, a first answer corpus, and a second answer corpus; the word vectors include question-corpus word vectors, first-answer-corpus word vectors, and second-answer-corpus word vectors. The model training module 112 is further configured to: evaluate and predict the question-answer pair corpora with the preset chat-corpus matching model to obtain a first evaluation score corresponding to the training question corpus and the first training answer corpus and a second evaluation score matching the training question corpus and the second training answer corpus; output a corresponding loss value from a preset loss function taking the first evaluation score and the second evaluation score as input; and iteratively update the loss value according to a preset iteration algorithm and update the chat-corpus matching model.
In one embodiment, the model training module 112 is further configured to use the cleaned chat corpus as training corpus to train the chat-corpus matching model and obtain the trained chat-corpus matching model.
FIG. 8 shows an internal structure diagram of a computer device in one embodiment. The computer device may be a server or a robot. As shown in FIG. 8, the computer device includes a processor, a memory, and a network interface connected through a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and may also store a computer program which, when executed by the processor, causes the processor to implement the method for cleaning a chat corpus. The internal memory may also store a computer program which, when executed by the processor, causes the processor to perform the method for cleaning a chat corpus. A person skilled in the art will understand that the structure shown in FIG. 8 is merely a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, the method for cleaning a chat corpus provided by the present application may be implemented in the form of a computer program, and the computer program may run on the computer device shown in FIG. 8. The memory of the computer device may store the program modules constituting the apparatus for cleaning a chat corpus.
A computer device includes a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the following steps:
obtaining a chat corpus, the chat corpus including a question corpus and an answer corpus;
performing word segmentation on the chat corpus, obtaining word vectors converted from the segmentation result, and obtaining character vectors corresponding to the chat corpus;
inputting the word vectors and the character vectors into a preset chat-corpus matching model to obtain a target match score corresponding to the chat corpus;
cleaning the chat corpus according to the target match score.
In one embodiment, a computer-readable storage medium is provided, storing a computer program which, when executed by a processor, causes the processor to perform the following steps:
obtaining a chat corpus, the chat corpus including a question corpus and an answer corpus;
performing word segmentation on the chat corpus, obtaining word vectors converted from the segmentation result, and obtaining character vectors corresponding to the chat corpus;
inputting the word vectors and the character vectors into a preset chat-corpus matching model to obtain a target match score corresponding to the chat corpus;
cleaning the chat corpus according to the target match score.
It should be noted that the above method for cleaning a chat corpus, apparatus for cleaning a chat corpus, computer device, and computer-readable storage medium belong to one general inventive concept, and the contents of their respective embodiments are mutually applicable.
A person of ordinary skill in the art will understand that all or part of the processes in the above method embodiments can be implemented by a computer program instructing relevant hardware. The program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the above method embodiments. Any reference to memory, storage, database, or other media used in the embodiments provided in the present application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity of description, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it shall be considered within the scope of this specification.
The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they shall not be construed as limiting the patent scope of the present application. It should be noted that a person of ordinary skill in the art may further make several variations and improvements without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

  1. A method for cleaning a chat corpus, characterized in that the method comprises:
    obtaining a chat corpus, the chat corpus comprising a question corpus and an answer corpus;
    performing word segmentation on the chat corpus, obtaining word vectors converted from the segmentation result, and obtaining character vectors corresponding to the chat corpus;
    inputting the word vectors and the character vectors into a preset chat-corpus matching model to obtain a target match score corresponding to the chat corpus;
    cleaning the chat corpus according to the target match score.
  2. The method for cleaning a chat corpus according to claim 1, characterized in that the inputting the word vectors and the character vectors into the preset chat-corpus matching model to obtain the target match score corresponding to the chat corpus further comprises:
    performing cross-product processing on the word vectors and the character vectors of the chat corpus according to a preset cross-product function, and obtaining a preset number of mapped vectors from the cross-product results according to a preset mapping function, the mapped vectors comprising mapped word vectors and mapped character vectors; fusing the mapped word vectors and mapped character vectors according to a preset fusion algorithm, performing feature extraction on the fused result, and obtaining a first target match score corresponding to the chat corpus;
    performing feature extraction on the word vectors and the character vectors of the chat corpus respectively, fusing the feature-extracted word vectors and character vectors according to a preset fusion algorithm, inputting the fused result into a preset projection layer, and obtaining a second target match score corresponding to the chat corpus;
    computing the target match score corresponding to the chat corpus from the first target match score and the second target match score according to a preset match-splicing algorithm.
  3. The method for cleaning a chat corpus according to claim 1, characterized in that after the performing word segmentation on the chat corpus, obtaining word vectors converted from the segmentation result, and obtaining character vectors corresponding to the chat corpus, the method further comprises:
    rewriting the length of the word vectors according to a preset first length threshold;
    rewriting the length of the character vectors according to a preset second length threshold.
  4. The method for cleaning a chat corpus according to claim 1, characterized in that the cleaning the chat corpus according to the target match score further comprises:
    judging whether the target match score is greater than or equal to a preset match threshold;
    cleaning out the chat corpus when the target match score is less than the match threshold.
  5. The method for cleaning a chat corpus according to claim 1, characterized in that the method further comprises:
    obtaining a training corpus, and constructing question-answer pair corpora from the training corpus;
    training the preset chat-corpus matching model on the question-answer pair corpora to obtain a trained chat-corpus matching model.
  6. The method for cleaning a chat corpus according to claim 5, characterized in that the question-answer pair corpora comprise a training question corpus, a first training answer corpus, and a second training answer corpus;
    the training the preset chat-corpus matching model on the question-answer pair corpora further comprises:
    evaluating and predicting the question-answer pair corpora with the preset chat-corpus matching model to obtain a first evaluation score corresponding to the training question corpus and the first training answer corpus and a second evaluation score matching the training question corpus and the second training answer corpus;
    outputting a corresponding loss value from a preset loss function taking the first evaluation score and the second evaluation score as input;
    iteratively updating the loss value according to a preset iteration algorithm, and updating the chat-corpus matching model.
  7. The method for cleaning a chat corpus according to claim 5, characterized in that after the cleaning the chat corpus according to the target match score, the method further comprises:
    using the cleaned chat corpus as training corpus, training the chat-corpus matching model to obtain the trained chat-corpus matching model.
  8. An apparatus for cleaning a chat corpus, characterized in that the apparatus comprises:
    a chat-corpus obtaining module, configured to obtain a chat corpus, the chat corpus comprising a question corpus and an answer corpus;
    a chat-corpus processing module, configured to perform word segmentation on the chat corpus, obtain word vectors converted from the segmentation result, and obtain character vectors corresponding to the chat corpus;
    a model computation module, configured to input the word vectors and the character vectors into a preset chat-corpus matching model and obtain a target match score corresponding to the chat corpus;
    a corpus cleaning module, configured to clean the chat corpus according to the target match score.
  9. A computer device, characterized by comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor, wherein the processor, when executing the computer program, implements the steps of the method for cleaning a chat corpus according to any one of claims 1 to 7.
  10. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method for cleaning a chat corpus according to any one of claims 1 to 7.
PCT/CN2018/125358 (filed 2018-12-29) — WO2020133358A1: Method and apparatus for cleaning chat corpus, computer device, and storage medium
