WO2020151318A1 - Corpus construction method and apparatus based on a crawler model, and computer device (基于爬虫模型的语料构建方法、装置及计算机设备) - Google Patents

Corpus construction method and apparatus based on a crawler model, and computer device

Info

Publication number
WO2020151318A1
Authority
WO
WIPO (PCT)
Prior art keywords
question
data
model
response
corpus
Prior art date
Application number
PCT/CN2019/117698
Other languages
English (en)
French (fr)
Inventor
吴壮伟
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2020151318A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval of unstructured textual data
    • G06F 16/33 - Querying
    • G06F 16/332 - Query formulation
    • G06F 16/35 - Clustering; Classification
    • G06F 16/90 - Details of database functions independent of the retrieved data types
    • G06F 16/95 - Retrieval from the web
    • G06F 16/951 - Indexing; Web crawling techniques
    • G06Q - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 30/00 - Commerce
    • G06Q 30/02 - Marketing; Price estimation or determination; Fundraising
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • This application relates to the field of intelligent customer service, and in particular to a corpus construction method, apparatus, computer device, and storage medium based on a crawler model.
  • Intelligent customer service not only establishes a convenient natural-language communication platform between enterprises and a large number of users, which effectively improves the efficiency of customer service work, but also provides information drawn directly from customers that enterprises can use for refined management.
  • Intelligent customer service usually provides its functions on the basis of an existing question-and-answer database.
  • When building that database, existing intelligent customer service systems require knowledge points to be sorted manually and users' likely question points to be expanded manually before the question-and-answer data in the database is finally generated.
  • This application provides a corpus construction method, apparatus, computer device, and storage medium based on a crawler model, to solve the problem that constructing a question-and-answer corpus for intelligent customer service is time-consuming and laborious.
  • To this end, this application proposes a crawler model-based corpus construction method, which includes the following steps: acquiring the topic words of the question-and-answer corpus data to be constructed; inputting the topic words into a preset question generation model, and obtaining the question list output by the question generation model in response to the topic words; inputting the question list into a preset first web crawler model, and obtaining the response data output by the first web crawler model in response to the question list;
  • the response data is used as the answer data of the question list, and the answer data is associated with the question list to form the question-and-answer corpus data of the topic words.
  • this application also provides a question and answer corpus data construction device based on a crawler model, including:
  • the acquisition module is used to acquire the topic words of the question-and-answer corpus data to be constructed;
  • the generation module is configured to input the topic words into a preset question generation model and to obtain the question list output by the question generation model in response to the topic words;
  • the processing module is configured to input the question list into a preset first web crawler model and to obtain the response data output by the first web crawler model in response to the question list, wherein the first web crawler model crawls target data with the question list as a constraint condition;
  • the execution module is configured to use the response data as the answer data of the question list, the answer data being associated with the question list to form the question-and-answer corpus data of the topic words.
  • An embodiment of the present application further provides a computer device including a memory and a processor;
  • the memory stores computer-readable instructions;
  • when executed by the processor, the computer-readable instructions cause the processor to execute the steps of the above crawler model-based corpus construction method.
  • Embodiments of the present application further provide one or more non-volatile readable storage media.
  • The non-volatile readable storage media store computer-readable instructions which, when executed by a processor, cause the processor to execute the steps of the aforementioned crawler model-based corpus construction method.
  • The beneficial effects of the embodiments of the present application are as follows: the topic words of the question-and-answer corpus data to be constructed are acquired; the topic words are input into a preset question generation model, and the question list output by the question generation model in response to the topic words is obtained; the question list is input into a preset first web crawler model, and the response data output by the first web crawler model in response to the question list is obtained; the response data is used as the answer data of the question list,
  • and the answer data is associated with the question list to form the question-and-answer corpus data of the topic words.
  • For question generation about the topic words, the users' real questions are obtained automatically through a web crawler, or the question list is generated by learning the users' real intentions through artificial intelligence; the corresponding answer data is likewise obtained through a web crawler, so that real customer service responses are collected.
  • This application improves the efficiency and quality of question-and-answer data construction, and also improves the question hit rate of intelligent customer service.
  • FIG. 1 is a schematic flowchart of a corpus construction method based on a crawler model according to an embodiment of this application;
  • FIG. 2 is a schematic flowchart of generating a question list based on the second web crawler model according to an embodiment of this application;
  • FIG. 3 is a schematic flowchart of generating a question list based on the Seq2Seq model according to an embodiment of this application;
  • FIG. 4 is a schematic flowchart of obtaining answer data based on filtering rules according to an embodiment of this application;
  • FIG. 5 is a schematic flowchart of obtaining answer data based on a deep neural network model according to an embodiment of this application;
  • FIG. 6 is a schematic flowchart of the deep neural network model training process according to an embodiment of this application;
  • FIG. 7 is a basic structural block diagram of an apparatus for constructing question-and-answer corpus data based on a crawler model according to an embodiment of this application;
  • FIG. 8 is a basic structural block diagram of the computer device according to an embodiment of this application.
  • The terms "terminal" and "terminal device" used herein cover both devices that have only a wireless signal receiver without transmitting capability, and devices that include receiving and transmitting hardware capable of two-way communication over a bidirectional communication link.
  • Such devices may include: cellular or other communication devices with a single-line display, a multi-line display, or no multi-line display; PCS (Personal Communications Service) devices, which may combine voice, data processing, fax, and/or data communication capabilities; PDAs (Personal Digital Assistants), which may include a radio frequency receiver, a pager, Internet/intranet access, a web browser, a notepad, a calendar, and/or a GPS (Global Positioning System) receiver; and conventional laptop and/or palmtop computers or other devices that have and/or include a radio frequency receiver.
  • The "terminal" and "terminal device" used here may be portable, transportable, installed in a vehicle (air, sea, and/or land), or suitable and/or configured to operate locally and/or, in distributed form, at any location on earth and/or in space.
  • The "terminal" and "terminal device" used here may also be a communication terminal, an Internet terminal, or a music/video playback terminal, for example a PDA, an MID (Mobile Internet Device), and/or a mobile phone with music/video playback functions, or a device such as a smart TV or a set-top box.
  • the terminal in this embodiment is the aforementioned terminal.
  • FIG. 1 is a schematic diagram of the basic flow of a corpus construction method based on a crawler model in this embodiment.
  • a corpus construction method based on a crawler model includes the following steps:
  • The topic words define the subject of the question-and-answer corpus data to be constructed; the topic words entered by the user are obtained through an interactive page on the terminal.
  • To make the constructed corpus data more focused, the scope described by the input topic word should be appropriately narrow. For example, "mobile phone" covers a wide range, and the constructed question-and-answer corpus may be rather divergent;
  • the topic word can instead be limited to "xx-model mobile phone".
  • The question generation model can be a series of fixed, preset questions, with the topic word as a parameter. For example, a series of preset questions may be:
  • [topic word] When was it released?
  • [topic word] What is the sales price?
  • [topic word] What are the sales channels?
  • [topic word] Does it support fingerprint recognition?
  • [topic word] Does it support multi-user login?
  • Different types of topic words can be preset to correspond to different question lists.
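A template-based question generation model of this kind can be sketched in a few lines of Python. The template wording follows the examples in the text; the function and variable names are illustrative, not from the application:

```python
# Preset question templates with the topic word as a parameter.
TEMPLATES = [
    "When was {topic} released?",
    "How much does {topic} sell for?",
    "What are the sales channels for {topic}?",
    "Does {topic} support fingerprint recognition?",
    "Does {topic} support multi-user login?",
]

def generate_question_list(topic):
    """Instantiate every preset template with the given topic word."""
    return [t.format(topic=topic) for t in TEMPLATES]
```

A mapping from topic-word type to template set would extend this to the "different question lists for different topic types" variant mentioned above.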
  • In the embodiments of this application, a web crawler model is used to obtain real questions from online users, or a pre-trained Seq2Seq model is used to generate the question list.
  • A web crawler is a program that automatically extracts web pages. Specifically, a python program simulates a browser and sends a request to a target site; the target site's server responds to the request and returns resources such as HTML, pictures, and videos.
  • The first web crawler model uses the question list as its search condition and retrieves data related to the question list from the target site, that is, the response data output by the first web crawler model in response to the question list.
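In practice the crawler would issue HTTP requests (the text mentions a python program simulating a browser) and then parse the returned HTML. The parsing-and-filtering stage can be sketched offline with Python's standard-library HTML parser; the class and function names are illustrative assumptions, not part of the application:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text content of <p> tags from a crawled page."""
    def __init__(self):
        super().__init__()
        self.in_p = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.paragraphs[-1] += data

def crawl_responses(html, question_keywords):
    """Keep only paragraphs that mention a keyword from the question list,
    i.e. use the question list as the crawler's constraint condition."""
    parser = TextExtractor()
    parser.feed(html)
    return [p for p in parser.paragraphs
            if any(k in p for k in question_keywords)]
```

A real deployment would feed `crawl_responses` with the HTML returned by the target site rather than a local string.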
  • S104: Use the response data as the answer data of the question list, and associate the answer data with the question list to form the question-and-answer corpus data of the topic words.
  • The response data is used as the answer data of the question list, and the question list and the answer data are associated in a one-question-one-answer form.
  • In the database, one record contains two parts: one part is the question, and the other part is the answer to that question.
  • When the intelligent customer service receives a user's question, it can retrieve, by keyword search, a question in the question-and-answer database whose keywords match the user's question, and return the answer that has a mapping relationship with that question.
  • In some embodiments, the answer corresponding to the question is obtained by computing the similarity between the user's question and the questions in the question-and-answer database.
  • Similarity can be computed with an edit-distance algorithm. For example, if the question stored in the database is "How much does the phone sell for" and the received user question is "How much is the phone", the edit distance between the two is 1: in the original Chinese, "手机多少钱" becomes "手机卖多少钱" by a single insertion of "卖" ("sell"). The question in the database most similar to the user's question is retrieved, and the answer corresponding to that question is returned.
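The edit-distance retrieval described above can be sketched as follows. This is a standard Levenshtein-distance implementation; `best_match` and the toy question-and-answer database are illustrative, not part of the application:

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def best_match(user_question, qa_db):
    """Return the stored answer whose question is most similar
    (smallest edit distance) to the user's question."""
    question = min(qa_db, key=lambda q: edit_distance(user_question, q))
    return qa_db[question]
```

The document's own example holds here: the distance between "手机多少钱" and "手机卖多少钱" is 1.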
  • step S102 specifically includes the following steps:
  • In the embodiments of this application, a web crawler model is used to obtain questions related to the input topic words.
  • To distinguish it from the aforementioned web crawler model, it is called the second web crawler model here; this model uses the topic words as its search condition to obtain from the target site the content related to the topic words, which is called interrogative candidate data here.
  • The obtained interrogative candidate data includes both non-interrogative corpus data and interrogative corpus data.
  • To extract the interrogative corpus, a matching rule is preset, and the interrogative candidate data is processed through the preset matching rule to obtain interrogative matching data.
  • The matching rule is to contain "?", "what", "how much", "where", "how", or other words and particles that express a question.
  • In the embodiments of this application, a regular-expression matching algorithm is adopted.
  • A regular expression is a logical formula for operating on character strings: predefined specific characters, and combinations of those characters, form a "rule string" that expresses a filtering logic over strings.
  • A regular expression is thus a text pattern describing one or more strings to match when searching text. For example, the regular expression "*topic word*what*" can be used to find any string containing both "topic word" and "what".
  • Because the second web crawler model obtains the real questions of users recorded on the target site, the question list obtained in this way is closer to reality, and question-and-answer corpus data constructed on this basis has a higher hit rate against the questions users actually ask.
  • After the interrogative candidate data passes the preset matching rule, the resulting interrogative matching data is the question list related to the topic words.
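The interrogative matching rule can be sketched with Python's `re` module. The marker list follows the examples in the text (the source lists the Chinese markers ？ 什么 多少 哪里 怎么); the function name is illustrative:

```python
import re

# Interrogative markers named in the text: "?", "what" (什么),
# "how much" (多少), "where" (哪里), "how" (怎么).
QUESTION_PATTERN = re.compile(r"[?？]|什么|多少|哪里|怎么")

def match_questions(candidates, topic):
    """Keep candidate sentences that mention the topic word and
    contain an interrogative marker."""
    return [c for c in candidates
            if topic in c and QUESTION_PATTERN.search(c)]
```

The survivors of this filter are the interrogative matching data, i.e. the question list of the topic word.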
  • step S102 specifically further includes the following steps:
  • In some embodiments, the question list is obtained by inputting the topic words into a pre-trained Seq2Seq model.
  • The Seq2Seq model is a network with an Encoder-Decoder structure: its input is a sequence, and its output is also a sequence.
  • The Encoder transforms a variable-length input sequence into a fixed-length vector representation, and the Decoder turns this fixed-length vector into a variable-length target sequence.
  • The Encoder layer is a multi-layer network of neurons using a bidirectional LSTM layer or an RNN (recurrent neural network) as the basic neuron unit, generating a final_state state layer and a final_output state vector;
  • the Decoder layer likewise uses a bidirectional LSTM layer or an RNN as the basic neuron unit in multi-layer neuron layers.
  • The output result is a list of basic questions based on the input topic words.
  • The Seq2Seq model needs to be trained before it can output a question list.
  • The specific training process is to prepare the training corpus, that is, the input sequences and their corresponding output sequences; input each input sequence into the Seq2Seq model; compute the probability of the output sequence; and adjust the parameters of the Seq2Seq model so that, over the whole sample set, the probability that each input sequence yields its corresponding output sequence through the Seq2Seq model is maximized.
  • In some embodiments, after step S103 the following steps are further included:
  • The filtering rules include at least a rule for filtering out interrogative corpus data.
  • Further processing is performed on the acquired response data: since what is needed here is answer data, the corpus entries that represent questions must first be filtered out.
  • Again, the regular-expression matching algorithm can be used to filter out all corpus entries containing "what", "how", "how much", or other interrogative semantics.
  • The filtering rules may also include filtering of sensitive words: according to a configured sensitive vocabulary, corpus entries containing sensitive words are filtered out.
  • The filtered data is the answer data of the question list.
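A minimal sketch of such a filtering rule follows. The interrogative markers follow the examples in the text; the sensitive vocabulary is a placeholder assumption, since the application does not list one:

```python
import re

# Interrogative markers named in the text (什么 / 怎么 / 多少).
INTERROGATIVE = re.compile(r"什么|怎么|多少")
# Hypothetical sensitive vocabulary; a real system would configure its own.
SENSITIVE_WORDS = {"badword"}

def filter_responses(responses):
    """Drop corpus entries that are themselves questions or contain a
    sensitive word; what remains is the answer data of the question list."""
    kept = []
    for r in responses:
        if INTERROGATIVE.search(r):
            continue  # entry expresses a question, not an answer
        if any(w in r for w in SENSITIVE_WORDS):
            continue  # entry contains a sensitive word
        kept.append(r)
    return kept
```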
  • In other embodiments, after step S103 the following steps are further included:
  • S141: Input the response data into a pre-trained deep neural network model, and obtain the classification information of the response data output by the deep neural network model, wherein the classification information at least divides the response data into interrogative corpus data and non-interrogative corpus data;
  • the response data is classified by a pre-trained deep neural network model that can at least distinguish interrogative corpus from non-interrogative corpus. Please refer to FIG. 6 for the specific training process of the deep neural network.
  • The non-interrogative corpus data identified by the deep neural network is used as the answer data corresponding to the question list.
  • the deep neural network model used in step S141 is trained as follows:
  • The training goal of the deep neural network model is to be able to distinguish interrogative corpus from non-interrogative corpus. Therefore, the training samples contain both types of corpus, and each sample is labeled with its corpus category.
  • The loss function is used to determine whether the reference corpus category output by the deep neural network agrees with the corpus category labeled on the sample.
  • The loss function adopted is the softmax cross-entropy loss.
  • Let the input feature of the i-th sample at the last layer of the network be Xi,
  • let the corresponding label Yi be the final classification result (that is, whether sample i is an interrogative sentence or a non-interrogative sentence),
  • and let h = (h1, h2, ..., hC) be the final output of the network, that is, the prediction result for sample i,
  • where C is the total number of categories.
  • The weights are updated by gradient descent, an optimization algorithm commonly used in machine learning and artificial intelligence to recursively approximate the model of minimum deviation.
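The softmax cross-entropy loss described above can be written out for a single sample i with final network output h = (h1, ..., hC) and true class index Yi. This is a plain-Python sketch of the formula; in practice the deep learning framework computes it:

```python
import math

def softmax(h):
    """Turn the network's final outputs h = (h1, ..., hC) for one
    sample into a probability distribution over the C categories."""
    m = max(h)
    exps = [math.exp(v - m) for v in h]  # shift by max for stability
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_loss(h, y):
    """Softmax cross entropy for one sample with true class index y
    (e.g. 0 = interrogative, 1 = non-interrogative, so C = 2 here)."""
    return -math.log(softmax(h)[y])
```

Gradient descent then updates the weights in the direction that reduces this loss until the predicted category agrees with the labeled category.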
  • FIG. 7 is a basic structural block diagram of a question and answer corpus data construction device based on a crawler model in this embodiment.
  • a question and answer corpus data construction device based on a crawler model includes: an acquisition module 210, a generation module 220, a processing module 230, and an execution module 240.
  • The acquisition module 210 is used to obtain the topic words of the question-and-answer corpus data to be constructed;
  • the generation module 220 is used to input the topic words into a preset question generation model and to obtain the question list output by the question generation model in response to the topic words;
  • the processing module 230 is configured to input the question list into a preset first web crawler model and to obtain the response data output by the first web crawler model in response to the question list,
  • wherein the first web crawler model crawls target data with the question list as a constraint condition;
  • the execution module 240 is configured to use the response data as the answer data of the question list, the answer data being associated with the question list to form the question-and-answer corpus data of the topic words.
  • The topic words of the question-and-answer corpus data to be constructed are obtained; the topic words are input into a preset question generation model, and the question list output by the question generation model in response to the topic words is obtained;
  • the question list is input into a preset first web crawler model, and the response data output by the first web crawler model in response to the question list is obtained; the response data is used as the answer data of the question list, and the answer data is associated with the question list to form the question-and-answer corpus data of the topic words.
  • For question generation about the topic words, the users' real questions are obtained automatically through a web crawler, or the question list is generated by learning the users' real intentions through artificial intelligence; the corresponding answer data is likewise obtained through a web crawler, so that real customer service responses are collected.
  • This application improves the efficiency and quality of question-and-answer data construction, and also improves the question hit rate of intelligent customer service.
  • The generation module 220 further includes: a first processing submodule, a first matching submodule, and a first execution submodule.
  • The first processing submodule is configured to input the topic words into the second web crawler model and to obtain the interrogative candidate data output by the second web crawler model in response to the topic words;
  • the first matching submodule is used to match the interrogative candidate data according to preset matching rules to obtain interrogative matching data, where the matching rules include at least interrogative corpus matching rules;
  • the first execution submodule is used to take the interrogative matching data as the question list of the topic words.
  • A regular-expression matching algorithm is used to obtain the interrogative matching data.
  • The generation module 220 further includes: a second processing submodule and a first acquisition submodule.
  • The second processing submodule is used to input the topic words into a pre-trained Seq2Seq model; the first acquisition submodule is used to obtain the question list output by the Seq2Seq model in response to the topic words.
  • The apparatus for constructing question-and-answer corpus data based on the crawler model further includes: a first filtering submodule and a second execution submodule.
  • The first filtering submodule is configured to filter the response data according to preset filtering rules to obtain filtered data, wherein the filtering rules include at least interrogative corpus data filtering rules;
  • the second execution submodule is configured to use the filtered data as the answer data of the question list.
  • The apparatus for constructing question-and-answer corpus data based on the crawler model further includes: a first classification submodule and a third execution submodule.
  • The first classification submodule is configured to input the response data into a pre-trained deep neural network model and to obtain the classification information of the response data output by the deep neural network model, wherein the classification information at least divides the response data into interrogative corpus data and non-interrogative corpus data; the third execution submodule is used to take the non-interrogative corpus data as the answer data of the question list.
  • The apparatus for constructing question-and-answer corpus data based on the crawler model further includes: a second acquisition submodule, a third processing submodule, a first comparison submodule, and a first update submodule.
  • The second acquisition submodule is used to obtain training samples labeled with corpus categories, where the corpus categories include at least interrogative corpus and non-interrogative corpus;
  • the third processing submodule is used to input the training samples into the deep neural network model to obtain the reference corpus category of each training sample;
  • the first comparison submodule is used to compare whether the reference corpus category of each sample in the training set is consistent with its labeled corpus category;
  • the first update submodule is used to, when the reference corpus category is inconsistent with the labeled corpus category, repeatedly and iteratively update the weights in the deep neural network model until the reference corpus category is consistent with the labeled corpus category.
  • FIG. 8 is a block diagram of the basic structure of the computer device in this embodiment.
  • the computer device includes a processor, a non-volatile storage medium, a memory, and a network interface connected through a system bus.
  • the non-volatile storage medium of the computer device stores an operating system, a database, and computer-readable instructions.
  • The database may store sequences of control information.
  • When the computer-readable instructions are executed by the processor, the processor can implement a method of constructing question-and-answer corpus data based on a crawler model.
  • The processor of the computer device is used to provide computing and control capabilities and supports the operation of the entire computer device.
  • Computer-readable instructions may be stored in the memory of the computer device;
  • when the computer-readable instructions are executed by the processor, the processor may execute a method for constructing question-and-answer corpus data based on a crawler model.
  • the network interface of the computer device is used to connect and communicate with the terminal.
  • The processor is used to execute the specific functions of the acquisition module 210, the generation module 220, the processing module 230, and the execution module 240 in FIG. 7, and the memory stores the readable instruction code and the various data required to execute the above modules.
  • The network interface is used for data transmission between user terminals or servers.
  • The memory in this embodiment stores the readable instruction code and the data required to execute all submodules of the crawler model-based corpus construction method, and the server can call the server's readable instruction code and data to execute the functions of all the submodules.
  • The computer device obtains the topic words of the question-and-answer corpus data to be constructed; inputs the topic words into a preset question generation model, and obtains the question list output by the question generation model in response to the topic words;
  • inputs the question list into a preset first web crawler model, and obtains the response data output by the first web crawler model in response to the question list; and uses the response data as the answer data of the question list, the answer data being associated with the question list to form the question-and-answer corpus data of the topic words.
  • For question generation about the topic words, the users' real questions are obtained automatically through a web crawler, or the question list is generated by learning the users' real intentions through artificial intelligence; the corresponding answer data is likewise obtained through a web crawler, so that real customer service responses are collected.
  • This application improves the efficiency and quality of question-and-answer data construction, and also improves the question hit rate of intelligent customer service.
  • The present application also provides one or more non-volatile storage media storing computer-readable instructions.
  • When the computer-readable instructions are executed by one or more processors, the one or more processors perform the steps of the crawler model-based corpus construction method described in any of the foregoing embodiments.
  • The aforementioned storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disc, a read-only memory (ROM), or a random access memory (RAM).

Abstract

A corpus construction method based on a crawler model, an apparatus, a computer device, and a storage medium, the method comprising the following steps: acquiring the topic words of the question-and-answer corpus data to be constructed (S101); inputting the topic words into a preset question generation model, and obtaining the question list output by the question generation model in response to the topic words (S102); inputting the question list into a preset first web crawler model, and obtaining the response data output by the first web crawler model in response to the question list (S103); and using the response data as the answer data of the question list, the answer data being associated with the question list to form the question-and-answer corpus data of the topic words (S104). The answer data consists of real customer service responses obtained by a web crawler. The method improves the efficiency and quality of corpus construction, improves the question hit rate of intelligent customer service, and makes customer service intelligent.

Description

Corpus construction method and apparatus based on a crawler model, and computer device
This application is based on, and claims priority from, Chinese invention patent application No. 201910065779.8, filed on January 24, 2019 and entitled "Corpus construction method and apparatus for intelligent customer service, computer device, and storage medium" (智能客服的语料构建方法、装置、计算机设备及存储介质).
Technical Field
This application relates to the field of intelligent customer service, and in particular to a corpus construction method, apparatus, computer device, and storage medium based on a crawler model.
Background
With the development of artificial intelligence technology, intelligent customer service systems have gradually emerged. Intelligent customer service not only establishes a convenient natural-language communication platform between enterprises and a large number of users, which effectively improves the efficiency of customer service work, but also provides information drawn directly from customers that enterprises can use for refined management.
Intelligent customer service usually provides its functions on the basis of an existing question-and-answer database. When building that database, existing intelligent customer service systems require knowledge points to be sorted manually and users' likely question points to be expanded manually before the one-question-one-answer data in the database is finally generated.
However, the inventor realized that manually sorting knowledge points and manually expanding users' question points is time-consuming and laborious, and often fails to reflect the users' real hot questions. As a result, during use, the hit rate of user questions against the question points in the question-and-answer database is low, so the intelligent customer service cannot effectively answer users' questions, which affects the user experience.
Summary
This application provides a corpus construction method, apparatus, computer device, and storage medium based on a crawler model, to solve the problem that constructing a question-and-answer corpus for intelligent customer service is time-consuming and laborious.
To solve the above technical problem, this application proposes a crawler model-based corpus construction method, which includes the following steps:
acquiring the topic words of the question-and-answer corpus data to be constructed;
inputting the topic words into a preset question generation model, and obtaining the question list output by the question generation model in response to the topic words;
inputting the question list into a preset first web crawler model, and obtaining the response data output by the first web crawler model in response to the question list;
using the response data as the answer data of the question list, the answer data being associated with the question list to form the question-and-answer corpus data of the topic words.
To solve the above problem, this application also provides an apparatus for constructing question-and-answer corpus data based on a crawler model, including:
an acquisition module, used to acquire the topic words of the question-and-answer corpus data to be constructed;
a generation module, used to input the topic words into a preset question generation model and to obtain the question list output by the question generation model in response to the topic words;
a processing module, used to input the question list into a preset first web crawler model and to obtain the response data output by the first web crawler model in response to the question list, wherein the first web crawler model crawls target data with the question list as a constraint condition;
an execution module, used to take the response data as the answer data of the question list, the answer data being associated with the question list to form the question-and-answer corpus data of the topic words.
To solve the above technical problem, an embodiment of this application further provides a computer device including a memory and a processor, the memory storing computer-readable instructions which, when executed by the processor, cause the processor to execute the steps of the above crawler model-based corpus construction method.
To solve the above technical problem, embodiments of this application further provide one or more non-volatile readable storage media storing computer-readable instructions which, when executed by a processor, cause the processor to execute the steps of the above crawler model-based corpus construction method.
The beneficial effects of the embodiments of this application are as follows: the topic words of the question-and-answer corpus data to be constructed are acquired; the topic words are input into a preset question generation model, and the question list output by the question generation model in response to the topic words is obtained; the question list is input into a preset first web crawler model, and the response data output by the first web crawler model in response to the question list is obtained; the response data is used as the answer data of the question list, and the answer data is associated with the question list to form the question-and-answer corpus data of the topic words. For question generation about the topic words, the users' real questions are obtained automatically through a web crawler, or the question list is generated by learning the users' real intentions through artificial intelligence; the corresponding answer data is likewise obtained through a web crawler, so that real customer service responses are collected. This application improves the efficiency and quality of question-and-answer data construction, and also improves the question hit rate of intelligent customer service.
Brief Description of the Drawings
To explain the technical solutions in the embodiments of this application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of this application; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a schematic flowchart of a corpus construction method based on a crawler model according to an embodiment of this application;
FIG. 2 is a schematic flowchart of generating a question list based on the second web crawler model according to an embodiment of this application;
FIG. 3 is a schematic flowchart of generating a question list based on the Seq2Seq model according to an embodiment of this application;
FIG. 4 is a schematic flowchart of obtaining answer data based on filtering rules according to an embodiment of this application;
FIG. 5 is a schematic flowchart of obtaining answer data based on a deep neural network model according to an embodiment of this application;
FIG. 6 is a schematic flowchart of the deep neural network model training process according to an embodiment of this application;
FIG. 7 is a basic structural block diagram of an apparatus for constructing question-and-answer corpus data based on a crawler model according to an embodiment of this application;
FIG. 8 is a basic structural block diagram of the computer device according to an embodiment of this application.
Detailed Description
To enable those skilled in the art to better understand the solutions of this application, the technical solutions in the embodiments of this application will be described clearly and completely below with reference to the drawings in the embodiments of this application.
Some of the flows described in the specification and claims of this application and in the above drawings contain multiple operations appearing in a specific order, but it should be clearly understood that these operations may be executed out of the order in which they appear herein, or in parallel. The sequence numbers of the operations, such as 101 and 102, are merely used to distinguish the different operations; the numbers themselves do not represent any order of execution. In addition, these flows may include more or fewer operations, and the operations may be executed sequentially or in parallel. It should be noted that descriptions such as "first" and "second" herein are used to distinguish different messages, devices, modules, and so on; they do not represent an order, nor do they require that the "first" and "second" be of different types.
The technical solutions in the embodiments of this application will be described clearly and completely below with reference to the drawings in the embodiments of this application. Obviously, the described embodiments are only part of the embodiments of this application, not all of them. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort fall within the protection scope of this application.
Embodiment
Those skilled in the art can understand that the "terminal" and "terminal device" used herein cover both devices that have only a wireless signal receiver without transmitting capability, and devices that include receiving and transmitting hardware capable of two-way communication over a bidirectional communication link. Such devices may include: cellular or other communication devices with a single-line display, a multi-line display, or no multi-line display; PCS (Personal Communications Service) devices, which may combine voice, data processing, fax, and/or data communication capabilities; PDAs (Personal Digital Assistants), which may include a radio frequency receiver, a pager, Internet/intranet access, a web browser, a notepad, a calendar, and/or a GPS (Global Positioning System) receiver; and conventional laptop and/or palmtop computers or other devices that have and/or include a radio frequency receiver. The "terminal" and "terminal device" used here may be portable, transportable, installed in a vehicle (air, sea, and/or land), or suitable and/or configured to operate locally and/or, in distributed form, at any location on earth and/or in space. The "terminal" and "terminal device" used here may also be a communication terminal, an Internet terminal, or a music/video playback terminal, for example a PDA, an MID (Mobile Internet Device), and/or a mobile phone with music/video playback functions, or a device such as a smart TV or a set-top box.
The terminal in this embodiment is the aforementioned terminal.
具体地,请参阅图1,图1为本实施例一种基于爬虫模型的语料构建方法的基本流程示意图。
如图1所示,一种基于爬虫模型的语料构建方法,包括下述步骤:
S101、获取待构建问答语料数据的主题词;
主题词限定了待构建问答语料数据的主题,可通过终端上可交互的页面获取用户输入的主题词。为使构建的问答语料数据更聚焦,建议输入的主题词描述范围适当小。例如“手机”涵盖范围较广,构建的问答语料可能较发散,为使问答语料数据更聚焦,可以将主题词限定为“xx型号手机”。
S102、将所述主题词输入到预先设定的问题生成模型中,获取所述问题生成模型响应所述主题词而输出的问题列表;
将获取的主题词输入到预先设定的问题生成模型中,生成基于主题词的问题列表。问题生成模型可以是预先设置的固定的一系列问题,以主题词为参数变化。例如,预先设置的一系列问题为:
主题词 发布时间是什么时候
主题词 销售价格是多少
主题词 销售渠道有哪些
主题词 支持指纹识别吗
主题词 支持多用户登录吗
可以预先设定不同类型的主题词对应不同的问题列表。
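上述“以主题词为参数的固定问题模板”可以用如下示意性的Python片段表达,其中模板内容与“电子产品”这一类型划分均为举例假设,并非本申请限定的实现:

```python
# 示意性示例:以主题词为参数,按预设模板生成问题列表
# 模板内容与“电子产品”这一类型划分均为举例假设
QUESTION_TEMPLATES = {
    "电子产品": [
        "{topic}发布时间是什么时候",
        "{topic}销售价格是多少",
        "{topic}销售渠道有哪些",
        "{topic}支持指纹识别吗",
        "{topic}支持多用户登录吗",
    ],
}

def generate_questions(topic: str, topic_type: str = "电子产品") -> list:
    """将主题词代入对应类型的模板,生成该主题词的问题列表。"""
    templates = QUESTION_TEMPLATES.get(topic_type, [])
    return [t.format(topic=topic) for t in templates]
```

例如输入主题词“xx型号手机”,即可得到“xx型号手机发布时间是什么时候”等问题;不同类型的主题词只需配置不同的模板组。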
本申请实施例中采用网络爬虫模型,获取线上用户的真实提问;或者通过预先训练的Seq2Seq模型生成问题列表。具体地,请参阅以下图2和图3描述。
S103、将所述问题列表输入到预先设定的第一网络爬虫模型中,获取所述第一网络爬虫模型响应所述问题列表而输出的响应数据;
获取到有关主题词的问题列表后,将其输入到预先设定的网络爬虫模型中,这里称为第一网络爬虫模型。网络爬虫是一个自动提取网页的程序,具体地,利用python程序模拟浏览器,向目标站点发送请求,目标站点服务器响应请求,返回html、图片、视频等资源。第一网络爬虫模型以问题列表为检索条件,检索目标站点中与问题列表相关的数据,即得到第一网络爬虫模型响应问题列表而输出的响应数据。
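以问题为检索条件、模拟浏览器向目标站点发起请求的过程,可以用如下示意性的Python片段勾勒。其中目标站点URL与查询参数名“q”均为假设,实际需按具体目标站点的接口调整:

```python
# 示意性示例:以问题为检索条件、模拟浏览器向目标站点发送请求的简化爬虫
# base_url 与查询参数名 "q" 均为假设,实际需按目标站点的接口调整
import urllib.parse
import urllib.request

def build_request_url(question: str,
                      base_url: str = "https://example.com/search") -> str:
    """将问题编码为检索条件,拼接成请求URL。"""
    return base_url + "?" + urllib.parse.urlencode({"q": question})

def crawl_question(question: str) -> str:
    """模拟浏览器发送请求,返回目标站点响应的HTML文本。"""
    req = urllib.request.Request(
        build_request_url(question),
        headers={"User-Agent": "Mozilla/5.0"},  # 模拟浏览器标识
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="ignore")
```

实际实现中通常还需处理反爬、分页与资源解析,此处仅示意“问题→检索请求→响应数据”的基本流程。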
S104、将所述响应数据作为所述问题列表的应答数据,所述应答数据与所述问题列表关联构成所述主题词的问答语料数据。
将响应数据作为问题列表的应答数据,并将问题列表和应答数据按照一问一答形式关联。表现在数据库中,为一条数据包含两部分,一部分为问题,另一部分则为该问题的应答。
当智能客服接收到用户的问题时,可以通过关键词检索的方式,检索问答数据库中与用户问题关键词一致的问题,并返回与该问题具有映射关系的应答。在一些实施方式中,通过计算用户问题与问答数据库中问题相似度的方式,获取该问题对应的应答。相似度可以采用编辑距离算法计算,例如:问答数据库中保存的问题为“手机卖多少钱”,接收到的用户问题为“手机多少钱”,两者的编辑距离为1,即“手机多少钱”变为“手机卖多少钱”只需插入“卖”。检索数据库中与用户提出的问题相似度最大的问题,返回该问题对应的应答。
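上述“编辑距离+最相似问题检索”的逻辑可以用如下示意性的Python片段实现,其中问答库内容为假设数据:

```python
# 示意性示例:基于编辑距离的相似问题检索(问答库内容为假设数据)
def edit_distance(a: str, b: str) -> int:
    """经典动态规划计算两个字符串的编辑距离(插入/删除/替换)。"""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # 删除
                           dp[i][j - 1] + 1,          # 插入
                           dp[i - 1][j - 1] + cost)   # 替换
    return dp[m][n]

def best_match(user_question: str, qa_db: dict) -> str:
    """返回问答库中与用户问题编辑距离最小(相似度最大)的问题的应答。"""
    question = min(qa_db, key=lambda q: edit_distance(q, user_question))
    return qa_db[question]
```

按文中例子,“手机多少钱”与“手机卖多少钱”的编辑距离为1,检索时即命中该问题并返回其应答。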
如图2所示,步骤S102具体还包括下述步骤:
S111、将所述主题词输入到第二网络爬虫模型中,获取所述第二网络爬虫模型响应所述主题词而输出的疑问候选数据;
本申请实施例中采用网络爬虫模型获取与输入主题词相关的问题,为与前述网络爬虫模型相区别,这里称为第二网络爬虫模型,该模型以主题词为检索条件,获取目标站点上与主题词相关的内容,这里称为疑问候选数据。
S112、按照预设的匹配规则对所述疑问候选数据进行匹配,获取疑问匹配数据,其中所述匹配规则至少包含疑问语料匹配规则;
获取的疑问候选数据中包含非疑问语料数据和疑问语料数据,为了得到疑问语料,预设了匹配规则,通过预设的匹配规则对疑问候选数据进行处理,得到疑问匹配数据。匹配规则为包含“?”“什么”“多少”“哪里”“怎么”等表示疑问的词语或符号。本申请实施例中采用正则匹配的算法,正则表达式是对字符串操作的一种逻辑公式,用事先定义好的一些特定字符及这些特定字符的组合,组成一个“规则字符串”,这个“规则字符串”用来表达对字符串的一种过滤逻辑。正则表达式是一种文本模式,描述在搜索文本时要匹配的一个或多个字符串。例如,可以用“*主题词*什么*”这一正则表达式查找包含“主题词”和“什么”的任意字符串。
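按上述匹配规则用正则表达式筛选疑问语料,可以写成如下示意性的Python片段,其中疑问词表为简化假设,实际可按业务扩充:

```python
# 示意性示例:用正则匹配从疑问候选数据中筛出含主题词的疑问语料
# 疑问词表(什么/多少/哪里/怎么及全角/半角问号)为简化假设
import re

def match_questions(topic: str, candidates: list) -> list:
    """返回候选数据中包含主题词且带有疑问词或问号的句子。"""
    pattern = re.compile(re.escape(topic) + r".*(什么|多少|哪里|怎么|[??])")
    return [s for s in candidates if pattern.search(s)]
```

例如对候选数据["xx手机多少钱?", "xx手机真不错", "xx手机怎么截屏"],以“xx手机”为主题词可筛出第一、三句作为疑问匹配数据。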
由于通过第二网络爬虫模型获取的是目标站点记录的用户的真实问题,所以通过这种方式获取的问题列表与现实更接近,基于此构建的问答语料数据命中用户实际提问的命中率更高。
S113、将所述疑问匹配数据作为所述主题词的问题列表。
疑问候选数据经过预设的匹配规则后,得到的疑问匹配数据就是与主题词相关的问题列表。
如图3所示,在一些实施方式中,步骤S102具体还包括下述步骤:
S121、将所述主题词输入到预先训练的Seq2Seq模型中;
在一些实施方式中,通过将主题词输入到预先训练的Seq2Seq模型中获取问题列表。Seq2Seq模型是一个Encoder(编码器)–Decoder(解码器)结构的网络,它的输入是一个序列,输出也是一个序列,Encoder将一个可变长度的信号序列变为固定长度的向量表达,Decoder将这个固定长度的向量变成可变长度的目标信号序列。具体地,先将接收的主题词进行one-hot词汇编码,然后通过word2vec词向量模型转化为对应的词向量,将词向量输入到Encoder层,其中,Encoder层是以双向LSTM层或RNN(循环神经网络)作为基本的神经元单位的多层神经元层,生成final_state状态层和final_output状态向量;
然后将以上步骤输出的encoder的final output状态向量,输入到全局信息层,生成全局状态层context;
最后将以上步骤得到的final_state和final_output状态向量和全局信息层生成的context向量,输入到Decoder层中,输出decoder层的final_state向量和output向量,其中Decoder层也是以双向LSTM层或RNN作为基本的神经元单位的多层神经元层。输出结果即为以输入主题词为主题的基本问题列表。
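上述Encoder–Decoder的数据流可以用如下numpy片段作极简示意。需要说明的是,其中的维度与参数初始化均为假设,且用简化的线性变换代替了双向LSTM/RNN,模型未经训练,仅用于说明“变长序列→定长状态向量→变长输出序列”的结构:

```python
# 示意性示例:用 numpy 勾勒 Seq2Seq 编码-解码的数据流
# 维度与参数均为假设;以简化线性变换示意,实际为双向LSTM/RNN多层网络
import numpy as np

rng = np.random.default_rng(0)
VOCAB, EMB, HID = 100, 16, 32            # 词表大小 / 词向量维度 / 隐状态维度

W_emb = rng.normal(size=(VOCAB, EMB))    # 相当于 word2vec 词向量表
W_enc = rng.normal(size=(EMB, HID))      # 简化的 Encoder 参数
W_dec = rng.normal(size=(HID, VOCAB))    # 简化的 Decoder 输出投影

def encode(token_ids):
    """one-hot 查表得到词向量序列,再压缩为固定长度的状态向量。"""
    vecs = W_emb[token_ids]              # (T, EMB),等价于 one-hot 乘词向量矩阵
    hidden = np.tanh(vecs @ W_enc)       # (T, HID)
    return hidden.mean(axis=0)           # 固定长度向量,示意 final_state/全局信息

def decode(state, max_len=5):
    """由固定长度向量逐步生成输出 token 序列(贪心取最大概率)。"""
    out = []
    for _ in range(max_len):
        logits = state @ W_dec           # (VOCAB,)
        out.append(int(logits.argmax()))
        state = np.tanh(W_emb[out[-1]] @ W_enc)  # 将上一步输出反馈为新状态
    return out
```

该片段只示意结构:变长输入序列被压缩为定长向量,再由Decoder逐步展开为变长输出序列。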
S122、获取所述Seq2Seq模型响应所述主题词而输出的问题列表。
将Seq2Seq模型输出的关于主题词的响应作为问题列表。Seq2Seq模型需要经过训练才具备输出问题列表的功能。具体的训练过程为:准备训练语料,即准备输入序列和对应的输出序列,将输入序列输入到Seq2Seq模型,计算得到输出序列的概率,调整Seq2Seq模型的参数,使所有输入序列经过Seq2Seq模型输出对应输出序列的概率最大。
如图4所示,在步骤S103之后,还包括下述步骤:
S131、按照预设的过滤规则对所述响应数据进行过滤,获取过滤数据,其中,所述过滤规则至少包含疑问语料数据过滤规则;
在一些实施方式中,对获取的响应数据进行进一步的处理,由于这里需要获取的是应答数据,所以首先需要过滤掉表示疑问的语料。同样可以通过正则匹配算法,将所有包含“什么”“怎么”“多少”等表示疑问语义的语料过滤掉。另外,过滤规则还可以包括敏感词的过滤,根据设定的敏感词表,过滤掉包含敏感词的语料。
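上述过滤规则(剔除疑问语料与含敏感词的语料)可以写成如下示意性的Python片段,其中疑问词与敏感词表均为假设:

```python
# 示意性示例:按过滤规则剔除疑问语料与含敏感词的语料
# 疑问词表与敏感词表均为假设,实际应替换为业务配置
import re

INTERROGATIVES = re.compile(r"什么|怎么|多少|哪里|[??]")
SENSITIVE_WORDS = {"敏感词A", "敏感词B"}

def filter_responses(responses: list) -> list:
    """过滤响应数据,保留可作为应答的非疑问、无敏感词语料。"""
    filtered = []
    for text in responses:
        if INTERROGATIVES.search(text):              # 过滤疑问语料
            continue
        if any(w in text for w in SENSITIVE_WORDS):  # 过滤含敏感词语料
            continue
        filtered.append(text)
    return filtered
```

过滤后剩余的语料即可作为问题列表的应答数据。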
S132、将所述过滤数据作为所述问题列表的应答数据。
响应数据经过滤后得到过滤数据即为问题列表的应答数据。
如图5所示,在一些实施方式中在步骤S103之后,还包括下述步骤:
S141、将所述响应数据输入到预先训练的深度神经网络模型中,获取所述深度神经网络模型输出的对所述响应数据的分类信息,其中,所述分类信息至少将所述响应数据区分为疑问语料数据和非疑问语料数据;
在一些实施方式中,通过预先训练的深度神经网络模型对响应数据进行分类,其中预先训练的深度神经网络模型至少可以识别出疑问语料和非疑问语料。深度神经网络的具体训练过程请参阅图6。
S142、将所述非疑问语料数据作为所述问题列表的应答数据。
经深度神经网络识别后的非疑问语料数据即为问题列表对应的应答数据。
如图6所示,在步骤S141中使用的深度神经网络模型按下述步骤训练:
S151、获取标记有语料类别的训练样本,其中所述语料类别至少包含疑问语料和非疑问语料;
本申请实施例中,深度神经网络模型的训练目标是可以识别出疑问语料和非疑问语料。所以训练样本中包含两类语料,每一个样本均标记有语料类别。
S152、将所述训练样本输入深度卷积神经网络模型获取所述训练样本的参照语料类别;
将样本输入到深度卷积神经网络模型,输出训练样本的参照语料类别,即输出样本语料是疑问语料还是非疑问语料。
S153、比对所述训练样本内不同样本的参照语料类别与所述语料类别是否一致;
通过损失函数判断深度卷积神经网络输出的参照语料类别与样本标注的语料类别是否一致。本申请实施例中,损失函数采用Softmax交叉熵损失函数。在训练过程中,调整深度卷积神经网络模型中的权重,使Softmax交叉熵损失函数尽可能收敛,也就是说,当继续调整权重时,得到的损失函数的值不再缩小、反而增大,则认为深度卷积神经网络训练可以结束。
假设共有N个训练样本,针对网络最后一层,第i个样本的输入特征为X_i,其对应的标记Y_i是最终的分类结果(即样本i是疑问句还是非疑问句),h=(h_1,h_2,...,h_C)为网络的最终输出,即样本i的预测结果,其中C是分类的总数。
Softmax交叉熵损失函数为
$$L=-\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{h_{Y_i}}}{\sum_{j=1}^{C}e^{h_j}}$$
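上文所述的Softmax交叉熵损失可以用如下numpy片段示意性地实现(以疑问句/非疑问句二分类为例,h为网络最终输出):

```python
# 示意性示例:Softmax交叉熵损失的numpy实现
# 对每个样本先做Softmax归一化,再取其标记类别概率的负对数,对N个样本取平均
import numpy as np

def softmax_cross_entropy(h: np.ndarray, y: np.ndarray) -> float:
    """h: (N, C) 网络最终输出;y: (N,) 样本标记(如0=非疑问句,1=疑问句)。"""
    h = h - h.max(axis=1, keepdims=True)  # 数值稳定处理,不改变Softmax结果
    log_prob = h - np.log(np.exp(h).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(len(y)), y].mean())
```

当两类输出完全相同(模型无判别力)时,二分类损失为log 2;训练的目标即是通过调整权重使该损失收敛到更小的值。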
S154、当参照语料类别与所述语料类别不一致时,反复循环迭代地更新所述深度神经网络模型中的权重,至所述参照语料类别与所述语料类别一致时结束。
如前所述,当损失函数没有收敛时,更新深度卷积神经网络模型中各节点的权重。本申请实施例中采用梯度下降法,梯度下降法是一种最优化算法,在机器学习和人工智能中用来递归性地逼近最小偏差模型。
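梯度下降的权重更新过程可以用一个极简的Python示例说明,其中损失函数与学习率均为假设的演示用例:

```python
# 示意性示例:梯度下降法沿负梯度方向反复更新权重,逼近损失最小点
def gradient_descent(grad, w0: float, lr: float = 0.1, steps: int = 100) -> float:
    """grad: 损失函数的梯度;w0: 初始权重;lr: 学习率;steps: 迭代次数。"""
    w = w0
    for _ in range(steps):
        w -= lr * grad(w)   # 负梯度方向更新,损失逐步减小
    return w

# 例:损失 L(w) = (w - 3)^2,其梯度为 2(w - 3),最小点在 w = 3
w_star = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
```

网络训练中的权重更新即按同样的原理进行,只是梯度由反向传播在各节点上计算得到。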
为解决上述技术问题,本申请实施例还提供一种基于爬虫模型的问答语料数据构建装置。具体请参阅图7,图7为本实施例基于爬虫模型的问答语料数据构建装置的基本结构框图。
如图7所示,一种基于爬虫模型的问答语料数据构建装置,包括:获取模块210、生成模块220、处理模块230和执行模块240。其中,获取模块210,用于获取待构建问答语料数据的主题词;生成模块220,用于将所述主题词输入到预先设定的问题生成模型中,获取所述问题生成模型响应所述主题词而输出的问题列表;处理模块230,用于将所述问题列表输入到预先设定的第一网络爬虫模型中,获取所述第一网络爬虫模型响应所述问题列表而输出的响应数据,其中所述第一网络爬虫模型以所述问题列表为约束条件抓取目标数据;执行模块240,用于将所述响应数据作为所述问题列表的应答数据,所述应答数据与所述问题列表关联构成所述主题词的问答语料数据。
本申请实施例通过获取待构建问答语料数据的主题词;将所述主题词输入到预先设定的问题生成模型中,获取所述问题生成模型响应所述主题词而输出的问题列表;将所述问题列表输入到预先设定的第一网络爬虫模型中,获取所述第一网络爬虫模型响应所述问题列表而输出的响应数据;将所述响应数据作为所述问题列表的应答数据,所述应答数据与所述问题列表关联构成所述主题词的问答语料数据。其中,关于主题词的问题生成通过网络爬虫自动获取用户真实问题,或通过人工智能学习用户的真实意图生成问题列表,相应的应答数据也通过网络爬虫获取真实的客服应答。本申请提高了问答数据构建的效率和质量,也提高了智能客服的问题命中率。
在一些实施方式中,所述生成模块220还包括:第一处理子模块、第一匹配子模块、第一执行子模块。其中,第一处理子模块,用于将所述主题词输入到第二网络爬虫模型中,获取所述第二网络爬虫模型响应所述主题词而输出的疑问候选数据;第一匹配子模块,用于按照预设的匹配规则对所述疑问候选数据进行匹配,获取疑问匹配数据,其中所述匹配规则至少包含疑问语料匹配规则;第一执行子模块,用于将所述疑问匹配数据作为所述主题词的问题列表。
在一些实施方式中,所述第一匹配子模块中,采用正则匹配算法获取疑问匹配数据。
在一些实施方式中,所述生成模块220还包括:第二处理子模块和第一获取子模块。其中,第二处理子模块,用于将所述主题词输入到预先训练的Seq2Seq模型中;第一获取子模块,用于获取所述Seq2Seq模型响应所述主题词而输出的问题列表。
在一些实施方式中,所述基于爬虫模型的问答语料数据构建装置还包括:第一过滤子模块和第二执行子模块。其中,第一过滤子模块,用于按照预设的过滤规则对所述响应数据进行过滤,获取过滤数据,其中,所述过滤规则至少包含疑问语料数据过滤规则;第二执行子模块,用于将所述过滤数据作为所述问题列表的应答数据。
在一些实施方式中,所述基于爬虫模型的问答语料数据构建装置还包括:第一分类子模块和第三执行子模块。其中,第一分类子模块,用于将所述响应数据输入到预先训练的深度神经网络模型中,获取所述深度神经网络模型输出的对所述响应数据的分类信息,其中,所述分类信息至少将所述响应数据区分为疑问语料数据和非疑问语料数据;第三执行子模块,用于将所述非疑问语料数据作为所述问题列表的应答数据。
在一些实施方式中,所述基于爬虫模型的问答语料数据构建装置还包括:第二获取子模块、第三处理子模块、第一比对子模块和第一更新子模块。其中,第二获取子模块,用于获取标记有语料类别的训练样本,其中所述语料类别至少包含疑问语料和非疑问语料;第三处理子模块,用于将所述训练样本输入深度卷积神经网络模型获取所述训练样本的参照语料类别;第一比对子模块,用于比对所述训练样本内不同样本的参照语料类别与所述语料类别是否一致;第一更新子模块,用于当参照语料类别与所述语料类别不一致时,反复循环迭代的更新所述深度神经网络模型中的权重,至所述参照语料类别与所述语料类别一致时结束。
为解决上述技术问题,本申请实施例还提供计算机设备。具体请参阅图8,图8为本实施例计算机设备基本结构框图。
图8为计算机设备的内部结构示意图。如图8所示,该计算机设备包括通过系统总线连接的处理器、非易失性存储介质、存储器和网络接口。其中,该计算机设备的非易失性存储介质存储有操作系统、数据库和计算机可读指令,数据库中可存储有控件信息序列,该计算机可读指令被处理器执行时,可使得处理器实现一种基于爬虫模型的问答语料数据构建的方法。该计算机设备的处理器用于提供计算和控制能力,支撑整个计算机设备的运行。该计算机设备的存储器中可存储有计算机可读指令,该计算机可读指令被处理器执行时,可使得处理器执行一种基于爬虫模型的问答语料数据构建的方法。该计算机设备的网络接口用于与终端连接通信。本领域技术人员可以理解,图8中示出的结构,仅仅是与本申请方案相关的部分结构的框图,并不构成对本申请方案所应用于其上的计算机设备的限定,具体的计算机设备可以包括比图中所示更多或更少的部件,或者组合某些部件,或者具有不同的部件布置。
本实施方式中处理器用于执行图7中获取模块210、生成模块220、处理模块230和执行模块240的具体内容,存储器存储有执行上述模块所需的可读指令代码和各类数据。网络接口用于与用户终端或服务器之间的数据传输。本实施方式中的存储器存储有基于爬虫模型的语料构建方法中执行所有子模块所需的可读指令代码及数据,服务器能够调用该可读指令代码及数据执行所有子模块的功能。
计算机设备通过获取待构建问答语料数据的主题词;将所述主题词输入到预先设定的问题生成模型中,获取所述问题生成模型响应所述主题词而输出的问题列表;将所述问题列表输入到预先设定的第一网络爬虫模型中,获取所述第一网络爬虫模型响应所述问题列表而输出的响应数据;将所述响应数据作为所述问题列表的应答数据,所述应答数据与所述问题列表关联构成所述主题词的问答语料数据。其中,关于主题词的问题生成通过网络爬虫自动获取用户真实问题,或通过人工智能学习用户的真实意图生成问题列表,相应的应答数据也通过网络爬虫获取真实的客服应答。本申请提高了问答数据构建的效率和质量,也提高了智能客服的问题命中率。
本申请还提供一个或多个存储有计算机可读指令的非易失性存储介质,所述计算机可读指令被一个或多个处理器执行时,使得一个或多个处理器执行上述任一实施例所述基于爬虫模型的语料构建方法的步骤。
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程,是可以通过计算机可读指令来指令相关的硬件来完成,该计算机可读指令可存储于非易失性可读存储介质中,该可读指令在执行时,可包括如上述各方法的实施例的流程。其中,前述的存储介质可为磁碟、光盘、只读存储记忆体(Read-Only Memory,ROM)等非易失性存储介质,或随机存储记忆体(Random Access Memory,RAM)等。
应该理解的是,虽然附图的流程图中的各个步骤按照箭头的指示依次显示,但是这些步骤并不是必然按照箭头指示的顺序依次执行。除非本文中有明确的说明,这些步骤的执行并没有严格的顺序限制,其可以以其他的顺序执行。而且,附图的流程图中的至少一部分步骤可以包括多个子步骤或者多个阶段,这些子步骤或者阶段并不必然是在同一时刻执行完成,而是可以在不同的时刻执行,其执行顺序也不必然是依次进行,而是可以与其他步骤或者其他步骤的子步骤或者阶段的至少一部分轮流或者交替地执行。
以上所述仅是本申请的部分实施方式,应当指出,对于本技术领域的普通技术人员来说,在不脱离本申请原理的前提下,还可以做出若干改进和润饰,这些改进和润饰也应视为本申请的保护范围。

Claims (20)

  1. 一种基于爬虫模型的语料构建方法,其特征在于,包括下述步骤:
    获取待构建问答语料数据的主题词;
    将所述主题词输入到预先设定的问题生成模型中,获取所述问题生成模型响应所述主题词而输出的问题列表;
    将所述问题列表输入到预先设定的第一网络爬虫模型中,获取所述第一网络爬虫模型响应所述问题列表而输出的响应数据;
    将所述响应数据作为所述问题列表的应答数据,所述应答数据与所述问题列表关联构成所述主题词的问答语料数据。
  2. 根据权利要求1所述的基于爬虫模型的语料构建方法,其特征在于,在所述将所述主题词输入到预先设定的问题生成模型中,获取所述问题生成模型响应所述主题词而输出的问题列表的步骤中,具体包括下述步骤:
    将所述主题词输入到第二网络爬虫模型中,获取所述第二网络爬虫模型响应所述主题词而输出的疑问候选数据;
    按照预设的匹配规则对所述疑问候选数据进行匹配,获取疑问匹配数据,其中所述匹配规则至少包含疑问语料匹配规则;
    将所述疑问匹配数据作为所述主题词的问题列表。
  3. 根据权利要求2所述的基于爬虫模型的语料构建方法,其特征在于,在所述按照预设的匹配规则对所述疑问候选数据进行匹配的步骤中,所述匹配的步骤采用正则匹配算法获取疑问匹配数据。
  4. 根据权利要求1所述的基于爬虫模型的语料构建方法,其特征在于,在所述将所述主题词输入到预先设定的问题生成模型中,获取所述问题生成模型响应所述主题词而输出的问题列表的步骤中,具体包括下述步骤:
    将所述主题词输入到预先训练的Seq2Seq模型中;
    获取所述Seq2Seq模型响应所述主题词而输出的问题列表。
  5. 根据权利要求1所述的基于爬虫模型的语料构建方法,其特征在于,在所述将所述问题列表输入到预先设定的第一网络爬虫模型中,获取所述第一网络爬虫模型响应所述问题列表而输出的响应数据的步骤之后,还包括下述步骤:
    按照预设的过滤规则对所述响应数据进行过滤,获取过滤数据,其中,所述过滤规则至少包含疑问语料数据过滤规则;
    将所述过滤数据作为所述问题列表的应答数据。
  6. 根据权利要求1所述的基于爬虫模型的语料构建方法,其特征在于,在所述将所述问题列表输入到预先设定的第一网络爬虫模型中,获取所述第一网络爬虫模型响应所述问题列表而输出的响应数据的步骤之后,还包括下述步骤:
    将所述响应数据输入到预先训练的深度神经网络模型中,获取所述深度神经网络模型输出的对所述响应数据的分类信息,其中,所述分类信息至少将所述响应数据区分为疑问语料数据和非疑问语料数据;
    将所述非疑问语料数据作为所述问题列表的应答数据。
  7. 根据权利要求6所述的基于爬虫模型的语料构建方法,其特征在于,所述预先训练的深度神经网络模型是通过以下步骤进行训练:
    获取标记有语料类别的训练样本,其中所述语料类别至少包含疑问语料和非疑问语料;
    将所述训练样本输入深度卷积神经网络模型获取所述训练样本的参照语料类别;
    比对所述训练样本内不同样本的参照语料类别与所述语料类别是否一致;
    当参照语料类别与所述语料类别不一致时,反复循环迭代的更新所述深度神经网络模型中的权重,至所述参照语料类别与所述语料类别一致时结束。
  8. 一种基于爬虫模型的问答语料数据构建装置,其特征在于,包括:
    获取模块,用于获取待构建问答语料数据的主题词;
    生成模块,用于将所述主题词输入到预先设定的问题生成模型中,获取所述问题生成模型响应所述主题词而输出的问题列表;
    处理模块,用于将所述问题列表输入到预先设定的第一网络爬虫模型中,获取所述第一网络爬虫模型响应所述问题列表而输出的响应数据,其中所述第一网络爬虫模型以所述问题列表为约束条件抓取目标数据;
    执行模块,用于将所述响应数据作为所述问题列表的应答数据,所述应答数据与所述问题列表关联构成所述主题词的问答语料数据。
  9. 根据权利要求8所述的基于爬虫模型的问答语料数据构建装置,其特征在于,所述生成模块还包括:
    第一处理子模块,用于将所述主题词输入到第二网络爬虫模型中,获取所述第二网络爬虫模型响应所述主题词而输出的疑问候选数据;
    第一匹配子模块,用于按照预设的匹配规则对所述疑问候选数据进行匹配,获取疑问匹配数据,其中所述匹配规则至少包含疑问语料匹配规则;
    第一执行子模块,用于将所述疑问匹配数据作为所述主题词的问题列表。
  10. 根据权利要求8所述的基于爬虫模型的问答语料数据构建装置,其特征在于,所述生成模块还包括:
    第二处理子模块,用于将所述主题词输入到预先训练的Seq2Seq模型中;
    第一获取子模块,用于获取所述Seq2Seq模型响应所述主题词而输出的问题列表。
  11. 一种计算机设备,包括存储器和处理器,所述存储器中存储有计算机可读指令,所述计算机可读指令被所述处理器执行时,使得所述处理器执行如下步骤:
    获取待构建问答语料数据的主题词;
    将所述主题词输入到预先设定的问题生成模型中,获取所述问题生成模型响应所述主题词而输出的问题列表;
    将所述问题列表输入到预先设定的第一网络爬虫模型中,获取所述第一网络爬虫模型响应所述问题列表而输出的响应数据;
    将所述响应数据作为所述问题列表的应答数据,所述应答数据与所述问题列表关联构成所述主题词的问答语料数据。
  12. 根据权利要求11所述的计算机设备,其特征在于,在所述将所述主题词输入到预先设定的问题生成模型中,获取所述问题生成模型响应所述主题词而输出的问题列表的步骤中,具体包括下述步骤:
    将所述主题词输入到第二网络爬虫模型中,获取所述第二网络爬虫模型响应所述主题词而输出的疑问候选数据;
    按照预设的匹配规则对所述疑问候选数据进行匹配,获取疑问匹配数据,其中所述匹配规则至少包含疑问语料匹配规则;
    将所述疑问匹配数据作为所述主题词的问题列表。
  13. 根据权利要求12所述的计算机设备,其特征在于,在所述按照预设的匹配规则对所述疑问候选数据进行匹配的步骤中,所述匹配的步骤采用正则匹配算法获取疑问匹配数据。
  14. 根据权利要求11所述的计算机设备,其特征在于,在所述将所述主题词输入到预先设定的问题生成模型中,获取所述问题生成模型响应所述主题词而输出的问题列表的步骤中,具体包括下述步骤:
    将所述主题词输入到预先训练的Seq2Seq模型中;
    获取所述Seq2Seq模型响应所述主题词而输出的问题列表。
  15. 根据权利要求11所述的计算机设备,其特征在于,在所述将所述问题列表输入到预先设定的第一网络爬虫模型中,获取所述第一网络爬虫模型响应所述问题列表而输出的响应数据的步骤之后,还包括下述步骤:
    按照预设的过滤规则对所述响应数据进行过滤,获取过滤数据,其中,所述过滤规则至少包含疑问语料数据过滤规则;
    将所述过滤数据作为所述问题列表的应答数据。
  16. 一个或多个非易失性可读存储介质,所述非易失性可读存储介质上存储有计算机可读指令,所述计算机可读指令被处理器执行时实现如下步骤:
    获取待构建问答语料数据的主题词;
    将所述主题词输入到预先设定的问题生成模型中,获取所述问题生成模型响应所述主题词而输出的问题列表;
    将所述问题列表输入到预先设定的第一网络爬虫模型中,获取所述第一网络爬虫模型响应所述问题列表而输出的响应数据;
    将所述响应数据作为所述问题列表的应答数据,所述应答数据与所述问题列表关联构成所述主题词的问答语料数据。
  17. 根据权利要求16所述的非易失性可读存储介质,其特征在于,在所述将所述主题词输入到预先设定的问题生成模型中,获取所述问题生成模型响应所述主题词而输出的问题列表的步骤中,具体包括下述步骤:
    将所述主题词输入到第二网络爬虫模型中,获取所述第二网络爬虫模型响应所述主题词而输出的疑问候选数据;
    按照预设的匹配规则对所述疑问候选数据进行匹配,获取疑问匹配数据,其中所述匹配规则至少包含疑问语料匹配规则;
    将所述疑问匹配数据作为所述主题词的问题列表。
  18. 根据权利要求17所述的非易失性可读存储介质,其特征在于,在所述按照预设的匹配规则对所述疑问候选数据进行匹配的步骤中,所述匹配的步骤采用正则匹配算法获取疑问匹配数据。
  19. 根据权利要求16所述的非易失性可读存储介质,其特征在于,在所述将所述主题词输入到预先设定的问题生成模型中,获取所述问题生成模型响应所述主题词而输出的问题列表的步骤中,具体包括下述步骤:
    将所述主题词输入到预先训练的Seq2Seq模型中;
    获取所述Seq2Seq模型响应所述主题词而输出的问题列表。
  20. 根据权利要求16所述的非易失性可读存储介质,其特征在于,在所述将所述问题列表输入到预先设定的第一网络爬虫模型中,获取所述第一网络爬虫模型响应所述问题列表而输出的响应数据的步骤之后,还包括下述步骤:
    按照预设的过滤规则对所述响应数据进行过滤,获取过滤数据,其中,所述过滤规则至少包含疑问语料数据过滤规则;
    将所述过滤数据作为所述问题列表的应答数据。
PCT/CN2019/117698 2019-01-24 2019-11-12 基于爬虫模型的语料构建方法、装置及计算机设备 WO2020151318A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910065779.8A CN109918486B (zh) 2019-01-24 2019-01-24 智能客服的语料构建方法、装置、计算机设备及存储介质
CN201910065779.8 2019-01-24

Publications (1)

Publication Number Publication Date
WO2020151318A1 true WO2020151318A1 (zh) 2020-07-30

Family

ID=66960656

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/117698 WO2020151318A1 (zh) 2019-01-24 2019-11-12 基于爬虫模型的语料构建方法、装置及计算机设备

Country Status (2)

Country Link
CN (1) CN109918486B (zh)
WO (1) WO2020151318A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109918486B (zh) * 2019-01-24 2024-03-19 平安科技(深圳)有限公司 智能客服的语料构建方法、装置、计算机设备及存储介质

Citations (5)

Publication number Priority date Publication date Assignee Title
US20170308531A1 (en) * 2015-01-14 2017-10-26 Baidu Online Network Technology (Beijing) Co., Ltd. Method, system and storage medium for implementing intelligent question answering
CN108345640A (zh) * 2018-01-12 2018-07-31 上海大学 一种基于神经网络语义分析的问答语料库构建方法
CN108959559A (zh) * 2018-06-29 2018-12-07 北京百度网讯科技有限公司 问答对生成方法和装置
CN109190062A (zh) * 2018-08-03 2019-01-11 平安科技(深圳)有限公司 目标语料数据的爬取方法、装置及存储介质
CN109918486A (zh) * 2019-01-24 2019-06-21 平安科技(深圳)有限公司 智能客服的语料构建方法、装置、计算机设备及存储介质

Family Cites Families (5)

Publication number Priority date Publication date Assignee Title
CN103699590B (zh) * 2013-12-09 2017-04-05 北京奇立软件技术有限公司 提供图文教程类问题解决方案的方法和服务器
JP6520513B2 (ja) * 2015-07-17 2019-05-29 富士ゼロックス株式会社 問答情報提供システム、情報処理装置及びプログラム
US10275515B2 (en) * 2017-02-21 2019-04-30 International Business Machines Corporation Question-answer pair generation
CN108549710B (zh) * 2018-04-20 2023-06-27 腾讯科技(深圳)有限公司 智能问答方法、装置、存储介质及设备
CN108717433A (zh) * 2018-05-14 2018-10-30 南京邮电大学 一种面向程序设计领域问答系统的知识库构建方法及装置

Patent Citations (5)

Publication number Priority date Publication date Assignee Title
US20170308531A1 (en) * 2015-01-14 2017-10-26 Baidu Online Network Technology (Beijing) Co., Ltd. Method, system and storage medium for implementing intelligent question answering
CN108345640A (zh) * 2018-01-12 2018-07-31 上海大学 一种基于神经网络语义分析的问答语料库构建方法
CN108959559A (zh) * 2018-06-29 2018-12-07 北京百度网讯科技有限公司 问答对生成方法和装置
CN109190062A (zh) * 2018-08-03 2019-01-11 平安科技(深圳)有限公司 目标语料数据的爬取方法、装置及存储介质
CN109918486A (zh) * 2019-01-24 2019-06-21 平安科技(深圳)有限公司 智能客服的语料构建方法、装置、计算机设备及存储介质

Also Published As

Publication number Publication date
CN109918486A (zh) 2019-06-21
CN109918486B (zh) 2024-03-19


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19911561

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19911561

Country of ref document: EP

Kind code of ref document: A1