CN111046152B - Automatic FAQ question-answer pair construction method and device, computer equipment and storage medium - Google Patents


Info

Publication number
CN111046152B
CN111046152B
Authority
CN
China
Prior art keywords
question
answered
target
document
text vector
Prior art date
Legal status (assumed; not a legal conclusion)
Active
Application number
CN201910969443.4A
Other languages
Chinese (zh)
Other versions
CN111046152A (en)
Inventor
杨凤鑫 (Yang Fengxin)
徐国强 (Xu Guoqiang)
Current Assignee (as listed; may be inaccurate)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (assumed; not a legal conclusion)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN201910969443.4A
Priority to PCT/CN2019/118442 (published as WO2021068352A1)
Publication of CN111046152A
Application granted
Publication of CN111046152B


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/332 Query formulation
    • G06F 16/3329 Natural language query formulation or dialogue systems
    • G06F 16/80 Information retrieval of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G06F 16/84 Mapping; Conversion
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods
    • G06N 3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The embodiments of this application disclose a method and device for automatically constructing FAQ question-answer pairs, together with a computer device and a storage medium. The method belongs to the technical field of artificial intelligence and natural language processing and comprises the following steps: acquiring a document to be read; parsing the document to be read and segmenting the parsed document to obtain a segmented document, which serves as the target document; according to a question to be answered and a preset screening model, selecting from the target document the paragraph that matches the question, which serves as the target paragraph; and generating an FAQ question-answer pair from the target paragraph and the question to be answered using a preset reading-comprehension model. In the embodiments of this application, only the target paragraph matched to the question to be answered is used to generate the FAQ question-answer pair, so non-target paragraphs need not be processed; this reduces, to a certain extent, the interference that non-target paragraphs would introduce into FAQ generation, and the generated FAQ question-answer pairs therefore match with higher accuracy.

Description

Automatic FAQ question-answer pair construction method and device, computer equipment and storage medium
Technical Field
The invention relates to the technical field of artificial intelligence and natural language processing, in particular to an automatic FAQ question-answer pair construction method, an automatic FAQ question-answer pair construction device, computer equipment and a storage medium.
Background
FAQ is an abbreviation of the English "Frequently Asked Questions"; in Chinese it means "frequently asked questions" or, more colloquially, "solutions to common questions". FAQ is a common form of online customer service: a good FAQ system should answer at least 80% of users' general and common questions. Such a system is convenient for users, greatly relieves the pressure on website staff, saves a great deal of customer-service cost, and increases customer satisfaction. How to construct an FAQ database effectively is therefore particularly important.
At present, automatic FAQ construction in the question-answering field mainly uses the following three methods: (1) Segment the article to be read and the question to be answered into words, obtain the corresponding word strings, and input them into an automatic reading-comprehension model that outputs the answer to the question. (2) Based on the similarity between the question posed by the user and the existing question records in a question-answer database, find the question in the database that matches the user's question and return its answer to the user, completing the FAQ response. (3) For an established FAQ, manually enter sentence-pattern templates corresponding to the standard questions; match the user's question against these templates, and map from the matched template to the FAQ. Although these three methods can match questions to answers with some success and thereby construct FAQ question-answer pairs automatically, the matching accuracy of the resulting pairs is still low.
Disclosure of Invention
The embodiments of the invention provide a method and device for automatically constructing FAQ question-answer pairs, a computer device, and a storage medium, aiming to solve the problem that existing automatic FAQ construction suffers from low matching accuracy.
In a first aspect, an embodiment of the present invention provides a method for automatically constructing a FAQ question-answer pair, including:
acquiring a document to be read;
analyzing the document to be read and segmenting the analyzed document to obtain a segmented document serving as a target document;
according to the questions to be answered and a preset screening model, screening paragraphs matched with the questions to be answered from the target document to serve as target paragraphs;
and generating FAQ question-answer pairs based on a preset reading understanding model according to the target paragraphs and the questions to be answered.
In a second aspect, an embodiment of the present invention further provides an apparatus for automatically constructing a FAQ question-answer pair, including:
the acquisition unit is used for acquiring the document to be read;
the analysis segmentation unit is used for analyzing the document to be read and segmenting the analyzed document to obtain a segmented document serving as a target document;
the screening unit is used for screening paragraphs matched with the questions to be answered from the target document to serve as target paragraphs according to the questions to be answered and a preset screening model;
And the generating unit is used for generating FAQ question-answer pairs based on a preset reading understanding model according to the target paragraphs and the questions to be answered.
In a third aspect, an embodiment of the present invention further provides a computer device, which includes a memory and a processor, where the memory stores a computer program, and the processor implements the method when executing the computer program.
In a fourth aspect, embodiments of the present invention also provide a computer readable storage medium storing a computer program which, when executed by a processor, implements the above method.
The embodiments of the invention provide a method and device for automatically constructing FAQ question-answer pairs, a computer device, and a storage medium. The method comprises the following steps: acquiring a document to be read; parsing the document to be read and segmenting the parsed document to obtain a segmented document, which serves as the target document; screening out of the target document, according to a question to be answered and a preset screening model, the paragraph that matches the question, which serves as the target paragraph; and generating an FAQ question-answer pair from the target paragraph and the question to be answered using a preset reading-comprehension model. In this technical scheme, only the target paragraph matched to the question to be answered is used to generate the FAQ question-answer pair, so non-target paragraphs need not be processed; this reduces, to a certain extent, the interference that non-target paragraphs would introduce into FAQ generation, and the generated FAQ question-answer pairs therefore match with higher accuracy.
Drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed to describe the embodiments are briefly introduced below. Evidently, the drawings described below show only some embodiments of the invention; a person skilled in the art could obtain other drawings from them without inventive effort.
Fig. 1 is a schematic view of a scenario of an automatic FAQ question-answer pair construction method provided by an embodiment of the present invention;
fig. 2 is a schematic flow chart of an automatic FAQ question-answer pair construction method according to an embodiment of the present invention;
fig. 3 is a schematic sub-flowchart of an automatic FAQ question-answer pair construction method according to an embodiment of the present invention;
fig. 4 is a schematic sub-flowchart of an automatic FAQ question-answer pair construction method according to an embodiment of the present invention;
fig. 5 is a schematic sub-flowchart of an automatic FAQ question-answer pair construction method according to an embodiment of the present invention;
fig. 6 is a schematic flow chart of a method for automatically constructing FAQ question-answer pairs according to another embodiment of the present invention;
fig. 7 is a schematic block diagram of an automatic FAQ question-answer pair construction device according to an embodiment of the present invention;
FIG. 8 is a schematic block diagram of an analysis segmentation unit of an automatic FAQ question-answer pair construction device provided by an embodiment of the present invention;
Fig. 9 is a schematic block diagram of a screening unit of the FAQ question-answer pair automatic construction device provided by the embodiment of the present invention;
fig. 10 is a schematic block diagram of a generating unit of the FAQ question-answer pair automatic construction device provided by the embodiment of the present invention;
FIG. 11 is a schematic block diagram of an apparatus for automatically constructing FAQ question-answer pairs according to another embodiment of the present invention; and
fig. 12 is a schematic block diagram of a computer device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. Evidently, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person skilled in the art based on these embodiments without inventive effort fall within the scope of the invention.
It should be understood that the terms "comprises" and "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As used in this specification and the appended claims, the term "if" may be interpreted, depending on the context, as "when", "once", "in response to determining", or "in response to detecting". Similarly, the phrase "if it is determined" or "if [a described condition or event] is detected" may be interpreted, depending on the context, as "upon determining", "in response to determining", "upon detecting [the described condition or event]", or "in response to detecting [the described condition or event]".
Referring to fig. 1, fig. 1 is a schematic view of a scenario for the automatic FAQ question-answer pair construction method according to an embodiment of the present invention. The method can be applied to a server and implemented through a software program configured on the server. The server communicates with a terminal: it retrieves the document to be read uploaded by the terminal and, after a series of processing steps based on the questions to be answered and that document, obtains FAQ question-answer pairs, thereby constructing them automatically. The terminal may be a desktop computer, a portable computer, a tablet computer, etc., and is not particularly limited here. Although fig. 1 shows one terminal and one user, it should be understood that in practical applications there may be several of each; fig. 1 is merely illustrative.
Referring to fig. 2, fig. 2 is a flow chart of an automatic FAQ question-answer pair construction method according to an embodiment of the present invention. As shown in fig. 2, the method includes the following steps S100 to S130.
S100, acquiring a document to be read.
Specifically, to construct FAQ question-answer pairs automatically, the server must first obtain a document to be read; the pairs can then be generated after a series of processing steps based on that document. In the embodiment of the invention, the document to be read is uploaded by the user through the user terminal, in particular through the FAQ web page of the terminal, and sent to the server. In the embodiment of the invention, the document to be read is a PDF document.
It should be noted that, in other embodiments, the document to be read may be another type of document, such as a Word document.
S110, analyzing the document to be read and segmenting the analyzed document to obtain a segmented document serving as a target document.
Specifically, after the server obtains the document to be read, it must parse the document into the required format and then segment the document's content, finally obtaining a document with a preset document structure.
Referring to fig. 3, in some embodiments, such as the embodiment of the present invention, step S110 includes the following steps S111-S112.
S111, analyzing the document to be read by adopting a stacked CRF model to obtain an XML document.
S112, segmenting the XML document in a preset segmentation mode to obtain a document with a preset document structure as a target document.
In the embodiment of the invention, the document to be read is parsed with a stacked CRF model to obtain an XML document. CRF is an abbreviation of Conditional Random Field; a stacked CRF model is used in this embodiment because it parses the document to be read relatively quickly and with relatively good results. XML is an abbreviation of Extensible Markup Language. After the document to be read has been parsed into an XML document, the XML document is segmented, and the segmented document serves as the target document. Specifically, the XML document may be segmented by a preset segmentation method, of which there are several; the method selected in this embodiment treats each secondary (second-level) heading as the start of a segment. In other embodiments, other segmentation methods, such as segmenting by primary headings or by article paragraphs, can be adopted according to actual requirements.
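The segmentation step above can be sketched as follows. This is an illustrative example, not the patent's implementation: it assumes the parsed XML uses `h2` elements for secondary headings and `p` elements for paragraphs (tag names are an assumption), and groups everything under each secondary heading into one segment.

```python
# Hypothetical sketch: segmenting a parsed XML document by secondary headings.
# Tag names <h2> and <p> are assumptions about the parser's output format.
import xml.etree.ElementTree as ET

def segment_by_h2(xml_text):
    """Group the text under each <h2> heading into one segment."""
    root = ET.fromstring(xml_text)
    segments, current = [], None
    for node in root:
        if node.tag == "h2":            # a secondary heading opens a new segment
            if current:
                segments.append(current)
            current = {"title": node.text, "paragraphs": []}
        elif node.tag == "p" and current is not None:
            current["paragraphs"].append(node.text)
    if current:
        segments.append(current)
    return segments

doc = "<doc><h2>Coverage</h2><p>a</p><p>b</p><h2>Claims</h2><p>c</p></doc>"
print([s["title"] for s in segment_by_h2(doc)])  # ['Coverage', 'Claims']
```

Segmenting by primary headings instead would only require matching `h1` in place of `h2`.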
It should be noted that in other embodiments, other models may be used to parse the document to be read; for example, a hidden Markov model (HMM, Hidden Markov Model) may be used.
S120, according to the questions to be answered and a preset screening model, the paragraphs matched with the questions to be answered are screened out from the target document to serve as target paragraphs.
Specifically, after the server parses the document to be read and segments the parsed document into the target document, it must also screen the target document for the paragraph that matches the question to be answered, which serves as the target paragraph. In the embodiment of the invention, the questions to be answered are the questions stored in question templates in a preset database. After uploading a document to be read on the terminal's FAQ web page, the user may select the corresponding question template according to the content or name of the uploaded document. For example, if the uploaded document concerns life insurance or accident insurance, a question template related to life insurance or accident insurance is selected; this template contains a number of common questions on that subject. The server retrieves the corresponding question template according to the user's selection and, using the questions in the template and a preset screening model, screens out of the target document the paragraphs that match those questions, which serve as target paragraphs.
In some embodiments, such as the embodiment of the present invention, as shown in fig. 4, the step S120 may include the following steps S121-S123.
S121, coding the target document according to the questions to be answered and the preset screening model to obtain a first paragraph text vector.
In the embodiment of the invention, in order to screen the target document for paragraphs matching the question to be answered, the target document must first be encoded according to the question and the screening model to obtain first paragraph text vectors. The preset screening model is, for example, a Bert (Bidirectional Encoder Representations from Transformers) model. Bert is a Transformer-based bidirectional language model; it can extract the grammatical and semantic information of a document, and can do so in combination with the document's context information. Specifically, the server generates first paragraph text vectors for the target document according to the question to be answered and the Bert model. Each first paragraph text vector is a three-dimensional vector and is a text-vector representation of the target document. The Bert model is used to generate the first paragraph text vectors because it can extract the grammatical and semantic information of the target document together with its context information, thereby improving extraction accuracy.
It should be noted that in other embodiments, other models may be used to screen the target document for the target paragraph according to actual requirements, for example a Word2vec (word to vector) model.
S122, calculating the probability that each first paragraph text vector is matched with the question to be answered according to the question to be answered.
S123, determining the paragraph corresponding to the first paragraph text vector with the highest probability as the paragraph matched with the to-be-answered question, and taking the paragraph as the target paragraph.
In the embodiment of the invention, after the server generates the first paragraph text vectors for the target document according to the question to be answered and the Bert model, it calculates the probability that each first paragraph text vector matches the question, sorts the vectors by the calculated probabilities, and takes the paragraph corresponding to the vector with the highest probability as the target paragraph. Specifically, a Softmax function is used to calculate the probability that each first paragraph text vector matches the question to be answered; the probabilities are then sorted, and the paragraph corresponding to the highest-probability first paragraph text vector is taken as the target paragraph.
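The Softmax-based selection in steps S122-S123 can be sketched as follows. This is a toy illustration, not the patent's model: the paragraph and question vectors are made-up numbers standing in for Bert outputs, and a dot product stands in for whatever match score the screening model actually produces.

```python
# Hypothetical sketch of steps S122-S123: score each paragraph vector against
# the question vector, normalize the scores with Softmax, keep the argmax.
import numpy as np

def pick_target_paragraph(paragraph_vecs, question_vec):
    scores = paragraph_vecs @ question_vec   # one match score per paragraph
    exp = np.exp(scores - scores.max())      # numerically stable Softmax
    probs = exp / exp.sum()
    return int(np.argmax(probs)), probs      # index of the target paragraph

paras = np.array([[0.1, 0.2],    # paragraph 0 (toy vectors)
                  [0.9, 0.8],    # paragraph 1
                  [0.3, 0.1]])   # paragraph 2
question = np.array([1.0, 1.0])
best, probs = pick_target_paragraph(paras, question)
print(best)  # 1
```

Subtracting the maximum score before exponentiating does not change the Softmax result but avoids overflow for large scores.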
S130, generating FAQ question-answer pairs based on a preset reading understanding model according to the target paragraphs and the questions to be answered.
Specifically, after the server screens out of the target document the paragraph matching the question to be answered, it generates an FAQ question-answer pair from the screened paragraph and the question. The pair can be generated automatically by a preset reading-comprehension model, which predicts, from the target paragraph and the question to be answered, the start and end positions within the target paragraph of the answer corresponding to the question; the answer is thereby determined and the FAQ question-answer pair generated. In the embodiment of the invention, because the paragraph matching the question to be answered is screened out of the target document as the target paragraph, the server can generate the FAQ question-answer pair from the target paragraph and the question alone and need not process non-target paragraphs. This reduces, to a certain extent, the interference that non-target paragraphs would introduce into FAQ generation and makes the generated pairs more accurate, thereby weakening the influence of cross-domain content and producing FAQ question-answer pairs with higher matching accuracy even for cross-domain questions.
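The start/end prediction described above can be sketched as follows. The patent does not specify the decoding rule; greedy argmax with the end position constrained to follow the start position is one common choice and is assumed here, with toy logits in place of real model outputs.

```python
# Hypothetical sketch: turning per-token start/end logits from a
# reading-comprehension model into an answer span (greedy decoding).
import numpy as np

def extract_answer(tokens, start_logits, end_logits):
    start = int(np.argmax(start_logits))
    # restrict the end position to come at or after the start position
    end = start + int(np.argmax(end_logits[start:]))
    return tokens[start:end + 1]

tokens  = ["the", "grace", "period", "is", "30", "days"]
start_l = np.array([0.1, 0.2, 0.1, 0.3, 2.5, 0.4])
end_l   = np.array([0.0, 0.1, 0.2, 0.1, 0.3, 2.1])
print(extract_answer(tokens, start_l, end_l))  # ['30', 'days']
```

The extracted span, paired with the question, forms one FAQ question-answer pair.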
In some embodiments, such as the present embodiment, as shown in fig. 5, the step S130 may include the following steps S131-S134.
S131, encoding the target paragraph and the question to be answered respectively to obtain a second paragraph text vector and a question text vector.
In the embodiment of the invention, after the server screens the paragraph matching the question to be answered out of the target document, the screened target paragraph and the question are each encoded. Specifically, a preset model such as Bert first encodes the target paragraph and the question to be answered; a second preset model, an Encoder Block, then re-encodes them to obtain the second paragraph text vector and the question text vector. Both are three-dimensional vectors. The first component of each is batch_size, a batch-processing parameter whose upper limit is the total number of training-set samples; in this embodiment batch_size is 32, meaning the preset model processes the target paragraph and the question to be answered in small batches. In other embodiments batch_size may take other values, provided the second paragraph text vector and the question text vector are still obtained after encoding. The second component of the three-dimensional vector is the sentence length, and the third component is the dimension corresponding to each word. The Encoder Block comprises a convolutional neural network, a self-attention mechanism, and a feedforward neural network. A convolutional neural network (Convolutional Neural Networks, CNN) is a feedforward neural network (Feedforward Neural Networks) with a deep structure that includes convolution calculations, and is one of the representative algorithms of deep learning. The self-attention mechanism (Self-Attention) uses attention to characterize context information.
Specifically, the Bert model encodes the target paragraph and the question to be answered into a first temporary paragraph text vector and a first temporary question text vector; both are three-dimensional and may be called the first three-dimensional vectors. The convolutional neural network then encodes these into a second temporary paragraph text vector and a second temporary question text vector, the second three-dimensional vectors, in which the dimension of each word is reduced relative to the first three-dimensional vectors. Next, the self-attention mechanism computes, for each word in the second temporary paragraph and question text vectors, a weight for every other word, and the weighted sum of these values yields a third temporary paragraph text vector and a third temporary question text vector, the third three-dimensional vectors. Each component of the third three-dimensional vectors has the same meaning as in the second; the third is simply a further extraction of the second, with the dimension of each word again reduced.
Finally, the feedforward neural network further extracts the third temporary paragraph and question text vectors into the finally required paragraph text vector and question text vector; the former is defined as the second paragraph text vector, and the dimension of each word in both is reduced relative to the third temporary vectors. It will be appreciated that although only one Encoder Block is stacked in this step, a single Encoder Block contains a multi-layer network, and the more layers a network has, the more it suffers from vanishing gradients during backpropagation. To alleviate this problem, residual connections are added in this embodiment when encoding with the convolutional neural network, the self-attention mechanism, and the feedforward neural network.
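The residual connections mentioned above can be sketched as follows. This is an illustration of the general technique, not the patent's network: the sublayer is a stand-in transformation, not the actual convolution, self-attention, or feedforward layers.

```python
# Hypothetical sketch of a residual (skip) connection: each sublayer's output
# is added back to its input, which eases gradient flow through a deep stack.
import numpy as np

def sublayer(x, w):
    """A toy transformation standing in for convolution / self-attention."""
    return np.tanh(x @ w)

def encoder_block_step(x, w):
    return x + sublayer(x, w)   # residual: output = input + F(input)

x = np.ones((2, 3))
w = np.zeros((3, 3))            # a zeroed sublayer passes the input through
out = encoder_block_step(x, w)
print(np.allclose(out, x))      # True: the residual path preserves the input
```

Because the identity path always contributes, gradients reach lower layers even when the sublayer's own gradients are small, which is why residuals mitigate vanishing gradients.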
It should be noted that in other embodiments, other models may be used to encode the target paragraph and the question to be answered according to actual requirements, provided the second paragraph text vector and the question text vector are obtained; for example, an RNN (Recurrent Neural Network) model may replace self-attention (Self-Attention).
S132, encoding the second paragraph text vector and the question text vector to obtain a new text vector.
In the embodiment of the invention, after the target paragraph and the question to be answered are respectively encoded to obtain the second paragraph text vector and the question text vector, the second paragraph text vector and the question text vector are further encoded to obtain a new text vector. Specifically, the second paragraph text vector and the question text vector are subjected to an Attention encoding operation in a Context-Query Attention layer. The Attention encoding operation comprises Context-to-Query and Query-to-Context. The Context-to-Query Attention encoding operation means that the Context of length N and the Query of length M form a similarity matrix N×M; a Softmax calculation is then performed on each row of the similarity matrix N×M to obtain Attention scores, and finally the Attention scores are weighted and summed with the text vector of the original Query to obtain a text vector containing Attention information. The Query-to-Context Attention encoding operation means that the Query of length M and the Context of length N form a similarity matrix M×N; a Softmax calculation is then performed on each row of the similarity matrix M×N to obtain Attention scores, and finally the Attention scores are weighted and summed with the text vector of the original Context to obtain a text vector containing Attention information. The new text vector is obtained by performing the Context-to-Query and Query-to-Context encoding operations on the second paragraph text vector and the question text vector in the Context-Query Attention layer.
It should be noted that, in the embodiment of the present invention, the new text vector is also a three-dimensional vector, and the meaning represented by each component is the same as in the second paragraph text vector and the question text vector; the encoding merely realizes the interaction between the second paragraph text vector and the question text vector, and the third component, that is, the dimension of each word, is increased. For conciseness of description, this is not repeated here.
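The Context-to-Query and Query-to-Context operations of step S132 can be sketched with plain matrix arithmetic. The sketch below assumes a simple dot-product similarity matrix between context and query word vectors; the actual similarity function, names and shapes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def context_query_attention(context, query):
    # context: (N, d) paragraph word vectors; query: (M, d) question word vectors
    S = context @ query.T                  # N x M similarity matrix
    c2q = softmax(S, axis=1) @ query       # Context-to-Query: (N, d), each context
                                           # word as a weighted sum of query words
    q2c = softmax(S.T, axis=1) @ context   # Query-to-Context: (M, d), each query
                                           # word as a weighted sum of context words
    return c2q, q2c
```

The two attended outputs would then be concatenated with the original vectors along the word-dimension axis, which is why the third component (the dimension of each word) of the new text vector is increased.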
S133, coding the new text vector according to a preset extraction model to obtain a target text vector.
In the embodiment of the invention, after the second paragraph text vector and the question text vector are encoded to obtain the new text vector, the new text vector is encoded according to a preset extraction model to obtain the target text vector. The preset extraction model is, for example, an Encoder Block; the number of Encoder Blocks used to encode the new text vector differs from the number used in step S131, but the preset extraction model likewise comprises a convolutional neural network, a self-attention mechanism and a forward neural network, and residual connections are added when the convolutional neural network, the self-attention mechanism and the forward neural network encode the new text vector. In this step, three Encoder Blocks are stacked to encode the new text vector, so as to further extract the target text vector from the new text vector; the dimension of each word in the target text vector is reduced compared with the dimension of each word in the new text vector, so that the matching accuracy of the generated FAQ question-answer pair is higher.
S134, calculating the target text vector to obtain positions of starting and ending of the answers of the questions to be answered, so that the FAQ question-answer pair is generated.
In the embodiment of the invention, after the new text vector is encoded by the Encoder Blocks to obtain the target text vector, the target text vector is further calculated to obtain the starting and ending positions of the answer to the question to be answered, so that the FAQ question-answer pair is generated. Specifically, the text vector obtained by encoding with the first Encoder Block in step S133 and the text vector obtained by encoding with the second Encoder Block are spliced together to predict the starting position of the answer to the question to be answered, and the text vector obtained by encoding with the first Encoder Block and the text vector obtained by encoding with the third Encoder Block are spliced together to predict the ending position of the answer. A Softmax operation is then performed to obtain, for each position, the probability of being the start or the end of the answer, and the positions with the maximum probabilities are taken as the starting and ending positions of the answer to the question to be answered, so that an FAQ question-answer pair is generated and the automatic construction of the FAQ question-answer pair is realized.
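The start/end computation of S134 can be sketched as follows: splice the first Encoder Block output with the second (for the start) and with the third (for the end), project each position to a score, apply Softmax over positions, and take the argmax. The projection vectors `w_start` and `w_end` are assumed learnable parameters introduced for illustration; this is a sketch, not the patented model.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def answer_span(m0, m1, m2, w_start, w_end):
    # m0, m1, m2: (N, d) outputs of the three stacked Encoder Blocks,
    # one row per paragraph position
    p_start = softmax(np.concatenate([m0, m1], axis=1) @ w_start)  # (N,) probs
    p_end = softmax(np.concatenate([m0, m2], axis=1) @ w_end)      # (N,) probs
    # positions with the maximum probability become the answer span
    return int(p_start.argmax()), int(p_end.argmax())
```

The answer text would then be the paragraph tokens between the returned start and end indices, paired with the question to form the FAQ question-answer pair.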
Fig. 6 is a flow chart of an automatic FAQ question-answer pair construction method according to another embodiment of the present invention, as shown in fig. 6, in this embodiment, the method includes steps S100-S190. That is, in this embodiment, the method further includes steps S140 to S190 after step S130 of the above embodiment.
S140, acquiring the FAQ question-answer pair and feeding back the acquired FAQ question-answer pair to a user.
In the embodiment of the invention, a user uploads the content of a document to be read on the page of the FAQ webpage end and selects a related question template according to the content or the name of the document to be read. The server invokes the related question template according to the selection of the user, screens out the paragraphs matching the questions in the question template from the target document according to the questions in the question template and the Bert screening model, then generates FAQ question-answer pairs according to the target paragraphs and the questions to be answered, and feeds the generated FAQ question-answer pairs back to the user. Specifically, the server obtains the generated FAQ question-answer pair and displays it on the page of the FAQ webpage end, and the user can perform subsequent operations as needed. For example, if the user is satisfied with the generated FAQ question-answer pair, the FAQ question-answer pair may be exported directly; if not satisfied, the user may modify it through the modification interface of the FAQ webpage end.
S150, judging whether a modification instruction sent by a user is received.
In the embodiment of the invention, after the server acquires the FAQ question-answer pair and feeds back the acquired FAQ question-answer pair to the user, the server judges whether a modification instruction sent by the user is received or not.
S160, if a modification instruction sent by the user is received, taking the question input by the user in the modification instruction as the question to be answered, and returning to execute step S120 of screening out paragraphs matching the question to be answered from the target document as target paragraphs according to the question to be answered and the preset screening model.
In the embodiment of the invention, if the server receives the modification instruction sent by the user, it indicates that the user is not satisfied with the FAQ question-answer pair. The user may input the question he or she wants to ask on the modification page of the FAQ webpage end, and the server then takes the question input by the user in the modification instruction as the question to be answered and returns to execute step S120. That is, after the question to be answered is redetermined, the paragraphs matching the question to be answered are screened out from the target document as target paragraphs according to the question to be answered and the preset screening model, the subsequent steps are then executed in sequence, and the finally obtained FAQ question-answer pair is fed back to the user.
S170, if a modification instruction sent by a user is not received, judging whether the to-be-answered question is a question in a preset database question template.
S180, if the to-be-answered question is not a question in a preset database question template, updating the question in the preset database question template according to the to-be-answered question.
And S190, if the to-be-answered question is a question in a preset database question template, not updating the question in the preset database question template.
In the embodiment of the invention, if the server does not receive the modification instruction sent by the user, it indicates that the user is satisfied with the generated FAQ question-answer pair, and the server judges whether the question to be answered is a question in the preset database question template. If the question to be answered is a question in the preset database question template, it indicates that the question comes from the question template and the user is satisfied, and the generated FAQ question-answer pair may be exported. If the question to be answered is not a question in the preset database question template, it indicates that the question was input by the user and its answer accuracy is high, so the question input by the user needs to be supplemented into the preset database question template to update and expand the questions therein; in this way, the questions in the preset database question template become richer and can better meet user requirements when the next automatic FAQ question-answer pair generation operation is performed.
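The feedback logic of steps S150–S190 amounts to a small control loop, sketched below. The names `handle_feedback` and `regenerate`, and the representation of the question template as a plain list, are hypothetical conveniences introduced for illustration only.

```python
def handle_feedback(question, modification, template_questions, regenerate):
    """modification: the user's replacement question, or None if satisfied.

    Returns (regenerated FAQ pair or None, updated template question list).
    """
    if modification is not None:
        # S160: user not satisfied; re-run screening/generation (S120 onward)
        # with the user's question as the new question to be answered
        return regenerate(modification), template_questions
    if question not in template_questions:
        # S180: a user-supplied question the user accepted; add it to the
        # preset database question template to enrich future generations
        template_questions = template_questions + [question]
    # S190: a template question needs no update
    return None, template_questions
```

Recording the user's modifications, as the embodiment notes, also yields labeled (question, answer) data that could later be used to fine-tune the reading-comprehension model.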
It should be noted that, in the embodiment of the present invention, the operations performed by the user on the modification interface of the FAQ webpage end are recorded, and the modified results are kept as historical records of the questions to be answered; these historical records can serve as a large amount of labeled data to optimize the FAQ question-answering model.
Fig. 7 is a schematic block diagram of an FAQ question-answer pair automatic construction device 200 according to an embodiment of the present invention. As shown in fig. 7, the present invention also provides an automatic FAQ question-answer pair construction apparatus 200 corresponding to the above automatic FAQ question-answer pair construction method. The FAQ question-answer pair automatic construction apparatus 200 includes a unit for performing the above-described FAQ question-answer pair automatic construction method, and may be configured in a server. Specifically, referring to fig. 7, the FAQ question-answer pair automatic construction device 200 includes an obtaining unit 201, a parsing and segmenting unit 202, a screening unit 203, and a generating unit 204.
Wherein, the obtaining unit 201 is used for obtaining a document to be read; the parsing and segmenting unit 202 is configured to parse the document to be read and segment the parsed document to obtain a segmented document as a target document; the screening unit 203 is configured to screen, according to a question to be answered and a preset screening model, a paragraph matching with the question to be answered from the target document as a target paragraph; the generating unit 204 is configured to generate a FAQ question-answer pair based on a preset reading understanding model according to the target paragraph and the question to be answered.
In some embodiments, for example, in the present embodiment, as shown in fig. 8, the parsing and segmenting unit 202 includes a parsing unit 2021 and a segmenting unit 2022.
The parsing unit 2021 is configured to parse the document to be read by using a stacked CRF model to obtain an XML document; the segmentation unit 2022 is configured to segment the XML document by a preset segmentation manner, so as to obtain a document with a preset document structure as a target document.
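As a rough illustration of segmenting the parsed XML document by primary title, secondary title and article paragraph, the sketch below assumes a hypothetical `<h1>`/`<h2>`/`<p>` element structure with `title` attributes; the actual tag names produced by the stacked CRF model are not specified in this document.

```python
import xml.etree.ElementTree as ET

def segment_xml(xml_text):
    # Walk the assumed structure and emit
    # (primary title, secondary title, paragraph text) triples,
    # i.e. one segment per article paragraph under its heading path.
    root = ET.fromstring(xml_text)
    segments = []
    for h1 in root.findall("h1"):
        for h2 in h1.findall("h2"):
            for p in h2.findall("p"):
                segments.append((h1.get("title"), h2.get("title"), p.text))
    return segments
```

The resulting list of heading-scoped paragraphs would serve as the target document from which the screening unit selects candidate paragraphs.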
In some embodiments, for example, in the present embodiment, as shown in fig. 9, the filtering unit 203 includes a first encoding unit 2031, a calculating unit 2032, and a determining unit 2033.
The first encoding unit 2031 is configured to encode the target document according to the question to be answered and the preset screening model to obtain first paragraph text vectors; the calculating unit 2032 is configured to calculate, according to the question to be answered, the probability that each first paragraph text vector matches the question to be answered; the determining unit 2033 is configured to determine the paragraph corresponding to the first paragraph text vector with the highest probability as the paragraph matching the question to be answered, to serve as the target paragraph.
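The work of the calculating unit 2032 and the determining unit 2033 can be sketched as follows, assuming each paragraph has already been encoded to a vector. The dot-product scoring and Softmax normalization here are illustrative stand-ins for the Bert screening model's matching probability, not the model itself.

```python
import numpy as np

def select_target_paragraph(paragraph_vectors, question_vector, paragraphs):
    # Score each first paragraph text vector against the question vector,
    # normalize the scores into probabilities, and return the paragraph
    # with the highest match probability as the target paragraph.
    scores = np.array([p @ question_vector for p in paragraph_vectors])
    probs = np.exp(scores - scores.max())
    probs = probs / probs.sum()
    best = int(probs.argmax())
    return paragraphs[best], float(probs[best])
```

Only the winning paragraph is passed on to the reading-comprehension model, which keeps the answer-extraction step focused on a single candidate span of text.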
In some embodiments, for example, in the present embodiment, as shown in fig. 10, the generating unit 204 includes a second encoding unit 2041, a third encoding unit 2042, a fourth encoding unit 2043, and a generating subunit 2044.
The second encoding unit 2041 is configured to encode the target paragraph and the question to be answered respectively to obtain a second paragraph text vector and a question text vector; the third encoding unit 2042 is configured to encode the second paragraph text vector and the question text vector to obtain a new text vector; the fourth encoding unit 2043 is configured to encode the new text vector according to a preset extraction model to obtain a target text vector; the generating subunit 2044 is configured to calculate the target text vector to obtain the starting and ending positions of the answer to the question to be answered, so as to generate the FAQ question-answer pair.
In some embodiments, for example, in the present embodiment, as shown in fig. 11, the apparatus 200 further includes a feedback unit 205, a first determining unit 206, a modifying unit 207, a second determining unit 208, and an updating unit 209.
The feedback unit 205 is configured to obtain the FAQ question-answer pair and feed back the obtained FAQ question-answer pair to a user; the first judging unit 206 is configured to judge whether a modification instruction sent by a user is received; the modifying unit 207 is configured to, if the modifying instruction sent by the user is received, take a question input by the user in the modifying instruction as the question to be answered; the second determining unit 208 is configured to determine whether the to-be-answered question is a question in a preset database question template if the modification instruction sent by the user is not received; the updating unit 209 is configured to update the questions in the preset database question template according to the questions to be answered if the questions to be answered are not questions in the preset database question template.
The FAQ question-answer pair automatic construction means described above may be implemented in the form of a computer program which is executable on a computer device as shown in fig. 12.
Referring to fig. 12, fig. 12 is a schematic block diagram of a computer device according to an embodiment of the present application. The computer device 300 is a server, and specifically, the server may be an independent server or a server cluster formed by a plurality of servers.
With reference to FIG. 12, the computer device 300 includes a processor 302, a memory, and a network interface 305, which are connected by a system bus 301, wherein the memory may include a non-volatile storage medium 303 and an internal memory 304.
The non-volatile storage medium 303 may store an operating system 3031 and a computer program 3032. The computer program 3032, when executed, may cause the processor 302 to perform a FAQ question-answer pair automatic construction method.
The processor 302 is used to provide computing and control capabilities to support the operation of the overall computer device 300.
The internal memory 304 provides an environment for the execution of a computer program 3032 in the non-volatile storage medium 303, which computer program 3032, when executed by the processor 302, causes the processor 302 to perform a FAQ question-answer pair automatic construction method.
The network interface 305 is used for network communication with other devices. It will be appreciated by those skilled in the art that the structure shown in FIG. 12 is merely a block diagram of some of the structures associated with the present inventive arrangements and does not constitute a limitation of the computer device 300 to which the present inventive arrangements may be applied, and that a particular computer device 300 may include more or fewer components than shown, or may combine certain components, or may have a different arrangement of components.
Wherein the processor 302 is configured to execute a computer program 3032 stored in a memory to implement the following steps: acquiring a document to be read; analyzing the document to be read and segmenting the analyzed document to obtain a segmented document serving as a target document; according to the questions to be answered and a preset screening model, screening paragraphs matched with the questions to be answered from the target document to serve as target paragraphs; and generating FAQ question-answer pairs based on a preset reading understanding model according to the target paragraphs and the questions to be answered.
In some embodiments, for example, in this embodiment, when implementing the steps of parsing the document to be read and segmenting the parsed document to obtain a segmented document as the target document, the processor 302 specifically implements the following steps: analyzing the document to be read by adopting a stacked CRF model to obtain an XML document; and segmenting the XML document in a preset segmentation mode to obtain a document with a preset document structure as a target document.
In some embodiments, for example, in this embodiment, when the step of screening paragraphs matching the questions to be answered from the target document according to the questions to be answered and the preset screening model is implemented by the processor 302, the following steps are specifically implemented: coding the target document according to the questions to be answered and the preset screening model to obtain a first paragraph text vector; calculating the probability of matching each first paragraph text vector with the question to be answered according to the question to be answered; and determining the paragraph corresponding to the first paragraph text vector with the highest probability as the paragraph matched with the to-be-answered question, and taking the paragraph as the target paragraph.
In some embodiments, for example, in this embodiment, when the processor 302 implements the step of generating the FAQ question-answer pair according to the target paragraph and the question to be answered based on a preset reading understanding model, the following steps are specifically implemented: encoding the target paragraph and the question to be answered respectively to obtain a second paragraph text vector and a question text vector; encoding the second paragraph text vector and the question text vector to obtain a new text vector; coding the new text vector according to a preset extraction model to obtain a target text vector; and calculating the target text vector to obtain the starting and ending positions of the answers of the questions to be answered, thereby generating the FAQ question-answer pair.
In some embodiments, for example, in this embodiment, after the step of implementing the target paragraph and the question to be answered and generating the FAQ question-answer pair based on the preset generation model, the specific implementation further includes the following steps: acquiring the FAQ question-answer pair and feeding back the acquired FAQ question-answer pair to a user; judging whether a modification instruction sent by a user is received or not; if the modification instruction sent by the user is received, taking the question input by the user in the modification instruction as the question to be answered and returning to execute the step of screening paragraphs matched with the question to be answered from the target document as target paragraphs according to the question to be answered and a preset screening model; if the modification instruction sent by the user is not received, judging whether the to-be-answered question is a question in a preset database question template or not; if the to-be-answered question is not a question in a preset database question template, updating the question in the preset database question template according to the to-be-answered question; if the to-be-answered questions are questions in a preset database question template, the questions in the preset database question template are not updated.
It should be appreciated that in embodiments of the present application, the processor 302 may be a central processing unit (Central Processing Unit, CPU), the processor 302 may also be other general purpose processors, digital signal processors (Digital Signal Processor, DSPs), application specific integrated circuits (Application Specific Integrated Circuit, ASICs), off-the-shelf programmable gate arrays (Field-Programmable Gate Array, FPGAs) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. Wherein the general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
Those skilled in the art will appreciate that all or part of the flow in a method embodying the above described embodiments may be accomplished by computer programs instructing the relevant hardware. The computer program may be stored in a storage medium that is a computer readable storage medium. The computer program is executed by at least one processor in the computer system to implement the flow steps of the embodiments of the method described above.
Accordingly, the present application also provides a storage medium. The storage medium may be a computer readable storage medium. The storage medium stores a computer program. The computer program, when executed by a processor, causes the processor to perform the steps of: acquiring a document to be read; analyzing the document to be read and segmenting the analyzed document to obtain a segmented document serving as a target document; according to the questions to be answered and a preset screening model, screening paragraphs matched with the questions to be answered from the target document to serve as target paragraphs; and generating FAQ question-answer pairs based on a preset reading understanding model according to the target paragraphs and the questions to be answered.
In some embodiments, for example, in this embodiment, when the processor executes the computer program to implement the steps of parsing the document to be read and segmenting the parsed document to obtain a segmented document as a target document, the steps are specifically implemented as follows: analyzing the document to be read by adopting a stacked CRF model to obtain an XML document; and segmenting the XML document in a preset segmentation mode to obtain a document with a preset document structure as a target document.
In some embodiments, for example, in this embodiment, when the processor executes the computer program to implement the step of screening, from the target document, a paragraph matching the question to be answered according to the question to be answered and a preset screening model, the specific implementation steps include: coding the target document according to the questions to be answered and the preset screening model to obtain a first paragraph text vector; calculating the probability of matching each first paragraph text vector with the question to be answered according to the question to be answered; and determining the paragraph corresponding to the first paragraph text vector with the highest probability as the paragraph matched with the to-be-answered question, and taking the paragraph as the target paragraph.
In some embodiments, for example, in this embodiment, when the processor executes the computer program to implement the step of generating the FAQ question-answer pair based on a preset reading understanding model according to the target paragraph and the question to be answered, the method specifically includes the following steps: encoding the target paragraph and the question to be answered respectively to obtain a second paragraph text vector and a question text vector; encoding the second paragraph text vector and the question text vector to obtain a new text vector; coding the new text vector according to a preset extraction model to obtain a target text vector; and calculating the target text vector to obtain the starting and ending positions of the answers of the questions to be answered, thereby generating the FAQ question-answer pair.
In some embodiments, for example, in this embodiment, after the step of executing the computer program to implement the target paragraph and the question to be answered, the step of generating the FAQ question-answer pair based on a preset generation model, the specific implementation further includes the following steps: acquiring the FAQ question-answer pair and feeding back the acquired FAQ question-answer pair to a user; judging whether a modification instruction sent by a user is received or not; if the modification instruction sent by the user is received, taking the question input by the user in the modification instruction as the question to be answered and returning to execute the step of screening paragraphs matched with the question to be answered from the target document as target paragraphs according to the question to be answered and a preset screening model; if the modification instruction sent by the user is not received, judging whether the to-be-answered question is a question in a preset database question template or not; if the to-be-answered question is not a question in a preset database question template, updating the question in the preset database question template according to the to-be-answered question; if the to-be-answered questions are questions in a preset database question template, the questions in the preset database question template are not updated.
The storage medium may be a U-disk, a removable hard disk, a Read-Only Memory (ROM), a magnetic disk, or an optical disk, or other various computer-readable storage media that can store program codes.
Those of ordinary skill in the art will appreciate that the elements and algorithm steps described in connection with the embodiments disclosed herein may be embodied in electronic hardware, in computer software, or in a combination of the two, and that the elements and steps of the examples have been generally described in terms of function in the foregoing description to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the device embodiments described above are merely illustrative. For example, the division of each unit is only one logic function division, and there may be another division manner in actual implementation. For example, multiple units or components may be combined or may be integrated into another system, or some features may be omitted, or not performed.
The steps in the method of the embodiment of the invention can be sequentially adjusted, combined and deleted according to actual needs. The units in the device of the embodiment of the invention can be combined, divided and deleted according to actual needs. In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The integrated unit may be stored in a storage medium if implemented in the form of a software functional unit and sold or used as a stand-alone product. Based on such understanding, the technical solution of the present invention is essentially or a part contributing to the prior art, or all or part of the technical solution may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a terminal, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention.
In the foregoing embodiments, the descriptions of the embodiments are focused on, and for those portions of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.
While the invention has been described with reference to certain preferred embodiments, it will be understood by those skilled in the art that various changes and substitutions of equivalents may be made and equivalents will be apparent to those skilled in the art without departing from the scope of the invention. Therefore, the protection scope of the invention is subject to the protection scope of the claims.

Claims (8)

1. An automatic construction method of a FAQ question-answer pair is characterized by comprising the following steps:
acquiring a document to be read;
analyzing the document to be read by adopting a stacked CRF model to obtain an XML document;
segmenting the XML document by a preset segmentation mode to obtain a document with a preset document structure as a target document, wherein the preset segmentation mode comprises a primary title segment, a secondary title segment and an article paragraph segment;
According to the questions to be answered and a preset screening model, screening paragraphs matched with the questions to be answered from the target document to serve as target paragraphs, wherein the preset screening model is a Bert model;
generating a FAQ question-answer pair based on a preset reading understanding model according to the target paragraph and the questions to be answered;
wherein the generating the FAQ question-answer pair based on the preset reading understanding model according to the target paragraph and the question to be answered includes:
coding the target paragraph and the question to be answered by adopting a preset model to obtain a second paragraph text vector and a question text vector, wherein the preset model is a Bert model and an Encoder Block model, and the second paragraph text vector and the question text vector are three-dimensional vectors;
encoding the second paragraph text vector and the question text vector to obtain a new text vector, wherein the new text vector is the three-dimensional vector, and the first component, the second component and the third component in the three-dimensional vector are batch_size, sentence length and corresponding dimension of each word respectively;
coding the new text vector according to a preset extraction model to obtain a target text vector;
Calculating the target text vector to obtain positions of starting and ending of the answers of the questions to be answered, thereby generating the FAQ question-answer pair;
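The final span-extraction step of claim 1 can be sketched as follows. This is a minimal illustration only: the toy dimensions, the random projection weights, and the simple linear start/end heads are assumed stand-ins for the trained Bert/EncoderBlock parameters, not the patented implementation.

```python
import numpy as np

def extract_answer_span(target_text_vector, w_start, w_end):
    """Compute answer start/end positions from the target text vector.

    target_text_vector: (batch_size, sentence_length, hidden_dim) -- the
    three-dimensional encoding described in the claim.
    w_start, w_end: (hidden_dim,) projection weights (random stand-ins
    here for trained parameters).
    """
    start_logits = target_text_vector @ w_start   # (batch, sentence_length)
    end_logits = target_text_vector @ w_end
    start = start_logits.argmax(axis=1)           # most likely start token
    end = np.maximum(start, end_logits.argmax(axis=1))  # keep end >= start
    return start, end

rng = np.random.default_rng(0)
batch, seq_len, dim = 2, 16, 8  # toy batch_size, sentence length, per-word dimension
vec = rng.normal(size=(batch, seq_len, dim))
start, end = extract_answer_span(vec, rng.normal(size=dim), rng.normal(size=dim))
```

The answer text would then be recovered by slicing the paragraph tokens between `start` and `end` and pairing it with the question.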
2. The method according to claim 1, wherein the screening of a paragraph matched with the question to be answered from the target document as a target paragraph according to the question to be answered and a preset screening model comprises:
coding the target document according to the question to be answered and the preset screening model to obtain first paragraph text vectors;
calculating, for each first paragraph text vector, the probability that it matches the question to be answered;
determining the paragraph corresponding to the first paragraph text vector with the highest probability as the paragraph matched with the question to be answered, and taking it as the target paragraph.
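The paragraph screening of claim 2 can be sketched as follows, using cosine similarity over toy vectors as a stand-in for the Bert matching model; the function name and the softmax over per-paragraph scores are illustrative assumptions.

```python
import numpy as np

def screen_target_paragraph(question_vec, paragraph_vecs):
    """Pick the paragraph whose encoding best matches the question.

    question_vec: (dim,) encoding of the question to be answered.
    paragraph_vecs: (num_paragraphs, dim) first paragraph text vectors.
    Returns the index of the highest-probability paragraph and the
    per-paragraph matching probabilities.
    """
    q = question_vec / np.linalg.norm(question_vec)
    p = paragraph_vecs / np.linalg.norm(paragraph_vecs, axis=1, keepdims=True)
    sims = p @ q                                  # cosine similarity per paragraph
    probs = np.exp(sims) / np.exp(sims).sum()     # softmax -> matching probability
    return int(probs.argmax()), probs

q = np.array([1.0, 0.0, 0.0])
paras = np.array([[0.0, 1.0, 0.0],
                  [0.9, 0.1, 0.0],   # closest in direction to the question
                  [0.0, 0.0, 1.0]])
idx, probs = screen_target_paragraph(q, paras)
```

Here the second paragraph is selected as the target paragraph because its vector points nearly the same way as the question vector.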
3. The method according to claim 1, wherein after the generating of the FAQ question-answer pair based on a preset reading understanding model according to the target paragraph and the question to be answered, the method further comprises:
acquiring the FAQ question-answer pair and feeding back the acquired FAQ question-answer pair to a user.
4. The method according to claim 3, wherein after the acquiring of the FAQ question-answer pair and feeding back the acquired FAQ question-answer pair to a user, the method further comprises:
judging whether a modification instruction sent by the user is received;
if the modification instruction sent by the user is received, taking the question input by the user in the modification instruction as the question to be answered;
returning to execute the step of screening a paragraph matched with the question to be answered from the target document as a target paragraph according to the question to be answered and a preset screening model.
5. The method according to claim 4, wherein after the judging whether the modification instruction sent by the user is received, the method further comprises:
if the modification instruction sent by the user is not received, judging whether the question to be answered is a question in a preset database question template;
if the question to be answered is not a question in the preset database question template, updating the questions in the preset database question template according to the question to be answered.
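The template-update check of claim 5 reduces to a membership test followed by an insert; the function name and the use of a plain Python set for the preset database question template are illustrative assumptions, not the patented storage scheme.

```python
def update_question_templates(question, templates):
    """Add the question to the template set if it is not already present.

    Returns True if the template set was updated, False otherwise.
    """
    if question in templates:
        return False      # already a known template question; nothing to do
    templates.add(question)
    return True

templates = {"How do I reset my password?"}
updated = update_question_templates("What is the claim deadline?", templates)
```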
6. An automatic FAQ question-answer pair construction device, characterized by comprising:
the acquisition unit is used for acquiring the document to be read;
the analysis unit is used for analyzing the document to be read by adopting a stacked CRF model to obtain an XML document;
the segmentation unit is used for segmenting the XML document in a preset segmentation mode to obtain a document with a preset document structure as a target document, wherein the preset segmentation mode comprises segmentation by primary title, secondary title and article paragraph;
the screening unit is used for screening a paragraph matched with the question to be answered from the target document as a target paragraph according to the question to be answered and a preset screening model, wherein the preset screening model is a Bert model;
the generating unit is used for generating a FAQ question-answer pair based on a preset reading understanding model according to the target paragraph and the question to be answered;
wherein the generating unit comprises:
the second coding unit is used for respectively coding the target paragraph and the question to be answered by adopting a preset model to obtain a second paragraph text vector and a question text vector, wherein the preset model comprises a Bert model and an EncoderBlock model, and the second paragraph text vector and the question text vector are three-dimensional vectors;
the third coding unit is used for coding the second paragraph text vector and the question text vector to obtain a new text vector, wherein the new text vector is a three-dimensional vector whose first, second and third components are the batch_size, the sentence length and the dimension of each word, respectively;
the fourth coding unit is used for coding the new text vector according to a preset extraction model to obtain a target text vector;
the generating subunit is used for calculating the target text vector to obtain the start and end positions of the answer to the question to be answered, thereby generating the FAQ question-answer pair.
7. A computer device, characterized in that it comprises a memory and a processor, the memory having a computer program stored thereon, wherein the processor implements the method according to any one of claims 1-5 when executing the computer program.
8. A computer-readable storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the method according to any one of claims 1-5.
CN201910969443.4A 2019-10-12 2019-10-12 Automatic FAQ question-answer pair construction method and device, computer equipment and storage medium Active CN111046152B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910969443.4A CN111046152B (en) 2019-10-12 2019-10-12 Automatic FAQ question-answer pair construction method and device, computer equipment and storage medium
PCT/CN2019/118442 WO2021068352A1 (en) 2019-10-12 2019-11-14 Automatic construction method and apparatus for faq question-answer pair, and computer device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910969443.4A CN111046152B (en) 2019-10-12 2019-10-12 Automatic FAQ question-answer pair construction method and device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111046152A CN111046152A (en) 2020-04-21
CN111046152B true CN111046152B (en) 2023-09-29

Family

ID=70231764

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910969443.4A Active CN111046152B (en) 2019-10-12 2019-10-12 Automatic FAQ question-answer pair construction method and device, computer equipment and storage medium

Country Status (2)

Country Link
CN (1) CN111046152B (en)
WO (1) WO2021068352A1 (en)

Families Citing this family (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111597821B (en) * 2020-05-13 2021-03-19 北京嘀嘀无限科技发展有限公司 Method and device for determining response probability
CN113779203A (en) * 2020-06-09 2021-12-10 北京金山数字娱乐科技有限公司 Method and device for generating paragraph set and inference method and device
CN111858879B (en) * 2020-06-18 2024-04-05 达观数据有限公司 Question and answer method and system based on machine reading understanding, storage medium and computer equipment
CN111858878B (en) * 2020-06-18 2023-12-22 达观数据有限公司 Method, system and storage medium for automatically extracting answer from natural language text
CN112434149B (en) * 2020-06-24 2023-09-19 北京金山数字娱乐科技有限公司 Information extraction method, information extraction device, information extraction equipment and storage medium
CN114238787B (en) * 2020-08-31 2024-03-29 腾讯科技(深圳)有限公司 Answer processing method and device
CN112149838A (en) * 2020-09-03 2020-12-29 第四范式(北京)技术有限公司 Method, device, electronic equipment and storage medium for realizing automatic model building
CN112464641B (en) * 2020-10-29 2023-01-03 平安科技(深圳)有限公司 BERT-based machine reading understanding method, device, equipment and storage medium
CN112347226B (en) * 2020-11-06 2023-05-26 平安科技(深圳)有限公司 Document knowledge extraction method, device, computer equipment and readable storage medium
CN112634889B (en) * 2020-12-15 2023-08-08 深圳平安智慧医健科技有限公司 Electronic case input method, device, terminal and medium based on artificial intelligence
CN112507096A (en) * 2020-12-16 2021-03-16 平安银行股份有限公司 Document question-answer pair splitting method and device, electronic equipment and storage medium
CN112818093B (en) * 2021-01-18 2023-04-18 平安国际智慧城市科技股份有限公司 Evidence document retrieval method, system and storage medium based on semantic matching
CN112800202A (en) * 2021-02-05 2021-05-14 北京金山数字娱乐科技有限公司 Document processing method and device
CN113010679A (en) * 2021-03-18 2021-06-22 平安科技(深圳)有限公司 Question and answer pair generation method, device and equipment and computer readable storage medium
CN113065332B (en) * 2021-04-22 2023-05-12 深圳壹账通智能科技有限公司 Text processing method, device, equipment and storage medium based on reading model
CN114330718B (en) * 2021-12-23 2023-03-24 北京百度网讯科技有限公司 Method and device for extracting causal relationship and electronic equipment
CN114117022B (en) * 2022-01-26 2022-05-06 杭州远传新业科技有限公司 FAQ similarity problem generation method and system
CN114528821B (en) * 2022-04-25 2022-09-06 中国科学技术大学 Understanding-assisted dialog system manual evaluation method and device and storage medium
CN117290483A (en) * 2023-10-09 2023-12-26 成都明途科技有限公司 Answer determination method, model training method, device and electronic equipment
CN117556906B (en) * 2024-01-11 2024-04-05 卓世智星(天津)科技有限公司 Question-answer data set generation method and device, electronic equipment and storage medium

Citations (5)

Publication number Priority date Publication date Assignee Title
CN101377777A (en) * 2007-09-03 2009-03-04 北京百问百答网络技术有限公司 Automatic inquiring and answering method and system
CN107220296A (en) * 2017-04-28 2017-09-29 北京拓尔思信息技术股份有限公司 The generation method of question and answer knowledge base, the training method of neutral net and equipment
WO2019118257A1 (en) * 2017-12-15 2019-06-20 Microsoft Technology Licensing, Llc Assertion-based question answering
CN109918487A (en) * 2019-01-28 2019-06-21 平安科技(深圳)有限公司 Intelligent answer method and system based on network encyclopedia
CN110083692A (en) * 2019-04-22 2019-08-02 齐鲁工业大学 A kind of the text interaction matching process and device of finance knowledge question

Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
US20110125734A1 (en) * 2009-11-23 2011-05-26 International Business Machines Corporation Questions and answers generation
US10339453B2 (en) * 2013-12-23 2019-07-02 International Business Machines Corporation Automatically generating test/training questions and answers through pattern based analysis and natural language processing techniques on the given corpus for quick domain adaptation
US10621218B2 (en) * 2015-03-30 2020-04-14 Avaya Inc. Systems and methods for compiling and dynamically updating a collection of frequently asked questions
CN110196901B (en) * 2019-06-28 2022-02-11 北京百度网讯科技有限公司 Method and device for constructing dialog system, computer equipment and storage medium


Non-Patent Citations (1)

Title
S2SA-BiLSTM: a deep learning model for an intelligent legal-dispute question-answering system; Tu Hai, Peng Dunlu, Chen Zhang, Liu Cong; Journal of Chinese Computer Systems (05); full text *

Also Published As

Publication number Publication date
CN111046152A (en) 2020-04-21
WO2021068352A1 (en) 2021-04-15

Similar Documents

Publication Publication Date Title
CN111046152B (en) Automatic FAQ question-answer pair construction method and device, computer equipment and storage medium
US11816439B2 (en) Multi-turn dialogue response generation with template generation
CN108829822B (en) Media content recommendation method and device, storage medium and electronic device
US9892414B1 (en) Method, medium, and system for responding to customer requests with state tracking
EP3567498A1 (en) Method and device for question response
CN109376222B (en) Question-answer matching degree calculation method, question-answer automatic matching method and device
CN109815459A (en) Generate the target summary for being adjusted to the content of text of target audience's vocabulary
US11610064B2 (en) Clarification of natural language requests using neural networks
CN111444320A (en) Text retrieval method and device, computer equipment and storage medium
CN106528845A (en) Artificial intelligence-based searching error correction method and apparatus
EP2885755B1 (en) Agent system, agent control method and agent control program with ability of natural conversation with users
CN112016553B (en) Optical Character Recognition (OCR) system, automatic OCR correction system, method
CN111310440B (en) Text error correction method, device and system
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
CN109522397B (en) Information processing method and device
CN115495553A (en) Query text ordering method and device, computer equipment and storage medium
JP6824795B2 (en) Correction device, correction method and correction program
CN113705207A (en) Grammar error recognition method and device
CN108304366B (en) Hypernym detection method and device
CN112016281B (en) Method and device for generating wrong medical text and storage medium
CN110852074B (en) Method and device for generating correction statement, storage medium and electronic equipment
CN110442759B (en) Knowledge retrieval method and system, computer equipment and readable storage medium
CN112131363A (en) Automatic question answering method, device, equipment and storage medium
US20220101094A1 (en) System and method of configuring a smart assistant for domain specific self-service smart faq agents
CN116932898A (en) News recommendation method and device, storage medium and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant