WO2021068352A1 - Automatic construction method and apparatus for FAQ question-answer pair, and computer device and storage medium

Info

Publication number
WO2021068352A1
Authority
WO
WIPO (PCT)
Prior art keywords
question
answered
paragraph
document
target
Prior art date
Application number
PCT/CN2019/118442
Other languages
French (fr)
Chinese (zh)
Inventor
杨凤鑫
徐国强
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021068352A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80 Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G06F16/84 Mapping; Conversion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • This application relates to the technical fields of artificial intelligence and natural language processing, and in particular to a method, device, computer equipment, and storage medium for automatically constructing FAQ question and answer pairs.
  • FAQ is the abbreviation of Frequently Asked Questions. An FAQ is considered a common online customer service method.
  • a good FAQ system should be able to answer at least 80% of users' general and frequently asked questions. This not only makes it convenient for users, but also greatly reduces the pressure on website staff, saves a lot of customer service costs, and increases customer satisfaction. Therefore, how to effectively implement the construction of the FAQ database is particularly important.
  • The embodiments of the present application provide a method, device, computer equipment, and storage medium for automatically constructing FAQ question and answer pairs, aiming to solve the problem that existing automatically constructed FAQ question and answer pairs have low matching accuracy.
  • an embodiment of the present application provides a method for automatically constructing FAQ question and answer pairs, which includes:
  • an embodiment of the present application also provides a device for automatically constructing FAQ question and answer pairs, which includes:
  • An obtaining unit for obtaining the document to be read; a parsing and segmentation unit for parsing the document to be read and segmenting the parsed document to obtain the segmented document as a target document; a filtering unit for selecting, according to the question to be answered and a preset screening model, a paragraph that matches the question to be answered from the target document as a target paragraph; and a generating unit for generating FAQ question and answer pairs, according to the target paragraph and the question to be answered, based on a preset reading comprehension model.
  • an embodiment of the present application also provides a computer device, which includes a memory and a processor, the memory stores a computer program, and the processor implements the above method when the computer program is executed.
  • the embodiments of the present application also provide a computer-readable storage medium, the storage medium stores a computer program, and the computer program can implement the foregoing method when executed by a processor.
  • the embodiments of the present application provide a method, device, computer equipment, and storage medium for automatically constructing FAQ question and answer pairs.
  • Because the target paragraph that matches the question to be answered is selected first, and the FAQ question and answer pair is then generated from the target paragraph and the question to be answered, non-target paragraphs do not need to be processed. This reduces, to a certain extent, the interference that non-target paragraphs introduce when generating FAQ question and answer pairs, and can make the generated pairs more accurate.
  • FIG. 1 is a schematic diagram of a scenario of a method for automatically constructing a FAQ question and answer pair provided by an embodiment of the application;
  • FIG. 2 is a schematic flowchart of a method for automatically constructing FAQ question and answer pairs provided by an embodiment of the application
  • FIG. 3 is a schematic diagram of a sub-flow of a method for automatically constructing a FAQ question and answer pair provided by an embodiment of the application;
  • FIG. 4 is a schematic diagram of a sub-flow of a method for automatically constructing a FAQ question and answer pair provided by an embodiment of the application;
  • FIG. 5 is a schematic diagram of a sub-flow of a method for automatically constructing a FAQ question and answer pair provided by an embodiment of the application;
  • FIG. 6 is a schematic flowchart of a method for automatically constructing FAQ question and answer pairs provided by another embodiment of the application.
  • FIG. 7 is a schematic block diagram of a device for automatically constructing FAQ question and answer pairs provided by an embodiment of the application.
  • FIG. 8 is a schematic block diagram of the analysis and segmentation unit of the FAQ question and answer automatic construction device provided by an embodiment of the application;
  • FIG. 9 is a schematic block diagram of the screening unit of the FAQ question and answer pair automatic construction device provided by an embodiment of the application.
  • FIG. 10 is a schematic block diagram of the generating unit of the FAQ question and answer pair automatic construction device provided by an embodiment of the application.
  • FIG. 11 is a schematic block diagram of a device for automatically constructing FAQ question and answer pairs provided by another embodiment of the application.
  • FIG. 12 is a schematic block diagram of a computer device provided by an embodiment of the application.
  • The term “if” can be interpreted as “when”, “once”, “in response to determination”, or “in response to detection”, depending on the context.
  • Similarly, the phrase “if determined” or “if detected [described condition or event]” can be interpreted, depending on the context, as “once determined”, “in response to determination”, “once detected [described condition or event]”, or “in response to detection of [described condition or event]”.
  • FIG. 1 is a schematic diagram of a scene of a method for automatically constructing FAQ question and answer pairs provided by an embodiment of the present application.
  • the method for automatically constructing FAQ question and answer pairs in the embodiment of the present application can be applied to a server.
  • the method for automatically constructing FAQ question and answer pairs can be realized by a software program configured on the server.
  • the server communicates with the terminal, so that the server calls the document to be read uploaded by the user through the terminal and performs a series of processing according to the question to be answered and the document to be read to obtain the FAQ question and answer pair, so as to realize the automatic construction of the FAQ question and answer pair.
  • the terminal can be a desktop computer, a laptop computer, a tablet computer, etc., and there is no specific restriction here.
  • In FIG. 1, there is one terminal and one user. It can be understood that, in actual application, there may be multiple terminals and users; FIG. 1 serves only as a schematic illustration.
  • FIG. 2 is a schematic flowchart of a method for automatically constructing FAQ question and answer pairs provided by an embodiment of the present application. As shown in Figure 2, the method includes the following steps S100-S130.
  • For the server to automatically construct FAQ question and answer pairs, it first needs to obtain the document to be read, and then perform a series of processing steps on that document before generating the FAQ question and answer pair.
  • the user can upload the document to be read through the user terminal.
  • the user can upload the document to be read through the FAQ webpage terminal of the user terminal to send the document to be read to the server.
  • the document to be read is a PDF document.
  • the document to be read may also be other types of documents, such as Word documents.
  • the server needs to parse the document to be read to obtain a document in a required format, and then segment the content in the document to finally obtain a document with a preset document structure.
  • the step S110 includes the following steps S111-S112.
  • S112 Segment the XML document in a preset segmentation manner to obtain a document with a preset document structure as a target document.
  • the cascaded CRF model is used to parse the document to be read to obtain the XML document.
  • CRF is the abbreviation of Conditional Random Field.
  • The cascaded CRF model is used in this embodiment because it parses the document to be read in a relatively short processing time and with a relatively good result.
  • XML is the abbreviation of Extensible Markup Language. After the document to be read is parsed to obtain the XML document, the XML document needs to be segmented to obtain the segmented document as the target document.
  • the XML document can be segmented through a preset segmentation method.
  • Multiple segmentation methods are available.
  • The segmentation method selected in this embodiment is segmentation by second-level heading.
  • Other segmentation methods, such as segmentation by first-level heading or by article paragraph, can also be selected according to actual needs.
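  • As a minimal sketch of segmentation by second-level heading, the following Python groups paragraph text under each heading of a chosen level. The XML layout (`document`, `h1`, `h2`, `p` tags) and the function name `segment_by_heading` are assumptions made for illustration; the source does not specify the schema produced by the parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout; the actual schema produced by the parser is not given.
SAMPLE_XML = """
<document>
  <h1>Accident insurance</h1>
  <h2>Coverage</h2>
  <p>Covers accidental injury.</p>
  <p>Covers hospital stays.</p>
  <h2>Exclusions</h2>
  <p>Excludes pre-existing conditions.</p>
</document>
"""


def segment_by_heading(xml_text, heading_tag="h2"):
    """Group paragraph text under each heading of the chosen level."""
    root = ET.fromstring(xml_text.strip())
    segments = []
    current = None
    for element in root:
        if element.tag == heading_tag:
            # A new heading of the chosen level starts a new segment.
            current = {"heading": (element.text or "").strip(), "paragraphs": []}
            segments.append(current)
        elif element.tag == "p" and current is not None:
            # Paragraphs belong to the most recent heading of that level.
            current["paragraphs"].append((element.text or "").strip())
    return segments


segments = segment_by_heading(SAMPLE_XML)
for segment in segments:
    print(segment["heading"], "->", " ".join(segment["paragraphs"]))
```

  • Passing `heading_tag="h1"` would segment by first-level heading instead, under the same assumed layout.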
  • HMM Hidden Markov Model
  • After the server parses the document to be read and segments the parsed document to obtain the target document, it needs to filter the target document to obtain a paragraph matching the question to be answered as the target paragraph.
  • the question to be answered is a question stored in a question template in a preset database.
  • The user can select a corresponding question template according to the content or name of the uploaded document. For example, if the document to be read that the user uploads on the FAQ web page concerns life insurance or accident insurance, the user selects the question template related to life insurance or accident insurance, which includes multiple common questions related to that topic.
  • the server calls the corresponding question template according to the user's selection and selects paragraphs matching the questions in the question template from the target document according to the questions in the question template and a preset screening model as the target paragraph.
  • the step S120 may include the following steps S121-S123.
  • the preset screening model is, for example, a Bert (Bidirectional Encoder Representations From Transformers) model.
  • The Bert model is a bidirectional language model based on the Transformer. It can extract the syntactic and semantic information of a document, as well as its context information.
  • the server generates the first paragraph text vector for the target document according to the question to be answered and the Bert model.
  • The first paragraph text vector is a three-dimensional vector that serves as the text vector representation of the target document.
  • The Bert model is used to generate the first paragraph text vector because it can extract the syntactic and semantic information of the target document as well as its context information, which improves the accuracy of the extraction.
  • S123 Determine the paragraph corresponding to the first paragraph text vector with the highest probability as a paragraph matching the question to be answered, and use it as a target paragraph.
  • After the server generates the first paragraph text vectors for the target document according to the question to be answered and the Bert model, it also needs to calculate the probability that each first paragraph text vector matches the question to be answered, sort the calculated probabilities, and take the paragraph corresponding to the first paragraph text vector with the highest probability as the target paragraph.
  • Specifically, the Softmax function is used to calculate the probability that each first paragraph text vector matches the question to be answered. The calculated probabilities are then sorted, and the paragraph corresponding to the first paragraph text vector with the highest probability is taken as the target paragraph.
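  • The Softmax-and-sort step can be sketched as follows. The raw match scores are hypothetical placeholders for whatever the screening model outputs per paragraph, and the function names are illustrative; only the Softmax, the sorting, and the choice of the highest-probability paragraph come from the description.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def select_target_paragraph(paragraphs, scores):
    """Rank paragraphs by match probability and return the best one."""
    probabilities = softmax(scores)
    ranked = sorted(
        zip(paragraphs, probabilities), key=lambda pair: pair[1], reverse=True
    )
    return ranked[0][0], ranked


paragraphs = [
    "Premiums are paid annually.",
    "The waiting period is 90 days.",
    "Claims are filed online.",
]
scores = [0.2, 2.5, -1.0]  # hypothetical scores for one question to be answered
target, ranked = select_target_paragraph(paragraphs, scores)
print(target)
```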
  • a FAQ question and answer pair is generated based on a preset reading comprehension model.
  • After the server filters out the paragraph matching the question to be answered from the target document, it generates a FAQ question and answer pair based on the selected paragraph and the question to be answered.
  • FAQ question and answer pairs can be automatically generated through a preset reading comprehension model.
  • the reading comprehension model is used to predict the start and end positions of the answer corresponding to the question to be answered in the target paragraph according to the target paragraph and the question to be answered, thereby determining the answer and generating a FAQ question and answer pair.
  • Because the paragraph matching the question to be answered is selected from the target document as the target paragraph, and the server then generates the FAQ question and answer pair from the target paragraph and the question to be answered based on the preset generation model, non-target paragraphs do not need to be processed. This reduces, to a certain extent, the interference that non-target paragraphs introduce when generating FAQ question and answer pairs, so the accuracy of the generated pairs is relatively high. The method can therefore weaken the influence of cross-domain content and generate FAQ question and answer pairs with relatively high matching accuracy for cross-domain questions.
  • the step S130 may include the following steps S131-S134.
  • After the server selects the paragraph matching the question to be answered from the target document, it encodes the selected target paragraph and the question to be answered separately. Specifically, a preset model such as a Bert model first encodes the target paragraph and the question to be answered, and a preset model such as Encoder Block then re-encodes the results, so as to obtain the second paragraph text vector and the question text vector.
  • The second paragraph text vector and the question text vector are both three-dimensional vectors.
  • The first component of the three-dimensional vector is Batch_Size, a batch processing parameter whose upper limit is the total number of training set samples.
  • The Batch_Size value is 32, which indicates that the preset model processes the target paragraph and the question to be answered in small batches.
  • Batch_Size can also be set to other values, as long as the target paragraph and the question to be answered are encoded to obtain the second paragraph text vector and question text vector.
  • the second component in the three-dimensional vector is the length of the sentence.
  • the third component in the three-dimensional vector is the dimension corresponding to each word.
  • Encoder Block includes convolutional neural network, self-attention mechanism and forward neural network.
  • the convolutional neural network (Convolutional Neural Networks, CNN) is a type of feedforward neural network (Feedforward Neural Networks) that includes convolution calculations and has a deep structure, and is one of the representative algorithms of deep learning (Deep learning).
  • The self-attention mechanism uses the Attention mechanism, which can take the representation of contextual information into account. Specifically, the Bert model is first used to encode the target paragraph and the question to be answered to obtain the first temporary paragraph text vector and the first temporary question text vector.
  • The first temporary paragraph text vector and the first temporary question text vector are both three-dimensional vectors, which can be called the first three-dimensional vector.
  • The second temporary paragraph text vector and the second temporary question text vector are also three-dimensional vectors, which can be called the second three-dimensional vector; the dimension of each word in the second three-dimensional vector is reduced compared to the dimension of each word in the first three-dimensional vector.
  • Using the self-attention mechanism, weight values are calculated in the second temporary paragraph text vector and the second temporary question text vector for every word other than the current word, and the weighted sum yields the third temporary paragraph text vector and the third temporary question text vector, which can be called the third three-dimensional vector.
  • Each component of the third three-dimensional vector has the same meaning as in the second three-dimensional vector.
  • the third three-dimensional vector is only a further extraction of the second three-dimensional vector, and the dimension of each word in the third three-dimensional vector is reduced compared to the dimension of each word in the second three-dimensional vector.
  • The third temporary paragraph text vector and the third temporary question text vector are passed through a forward neural network to obtain the final required paragraph text vector and question text vector, where the final required paragraph text vector is defined as the second paragraph text vector. The dimension of each word in the second paragraph text vector and the question text vector is reduced compared to the dimension of each word in the third temporary paragraph text vector and the third temporary question text vector.
  • the residual error is added when using the convolutional neural network, the self-attention mechanism, and the forward neural network for encoding, and adding the residual error can alleviate this problem.
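  • To illustrate the weighted-sum and residual ideas, here is a minimal sketch of one self-attention sub-layer with a residual connection. It is a generic scaled dot-product formulation and not necessarily the one used in the application: it has no learned projections, it attends over every word including the current one, and it keeps the per-word dimension fixed so that the residual can be added directly. All names are illustrative.

```python
import math


def softmax(row):
    """Turn raw scores into weights that sum to 1."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [value / total for value in exps]


def self_attention_with_residual(words):
    """words: one sentence as a list of per-word vectors; output has the same shape."""
    dim = len(words[0])
    outputs = []
    for query in words:
        # Score the current word against every word in the sentence.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in words
        ]
        weights = softmax(scores)
        # The weighted sum of all word vectors carries the contextual information.
        attended = [
            sum(weight * word[i] for weight, word in zip(weights, words))
            for i in range(dim)
        ]
        # Residual connection: add the sub-layer input back to its output.
        outputs.append([a + x for a, x in zip(attended, query)])
    return outputs


sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
encoded = self_attention_with_residual(sentence)
print(encoded)
```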
  • RNN Recurrent Neural Network model replaces the self-attention mechanism (Self-attention).
  • the second paragraph text vector and the question text vector need to be encoded to obtain a new text vector.
  • the Attention encoding operation is performed on the second paragraph text vector and the question text vector in the Context-Query Attention layer.
  • Attention coding operations include Context-to-Query and Query-to-Context Attention coding operations.
  • The Context-to-Query Attention encoding operation forms a correlation matrix of size N*M from the Context of length N and the Query of length M, applies Softmax to each row of the N*M correlation matrix to obtain the Attention scores, and finally computes the weighted sum of the original Query text vectors using the Attention scores to obtain a text vector containing Attention information.
  • The Query-to-Context Attention encoding operation forms a correlation matrix of size M*N from the Query of length M and the Context of length N, applies Softmax to each row of the M*N correlation matrix to obtain the Attention scores, and finally computes the weighted sum of the original Context text vectors using the Attention scores to obtain a text vector containing Attention information.
  • the new text vector can be obtained by performing Context-to-Query and Query-to-Context encoding operations on the second paragraph text vector and question text vector at the Context-Query Attention layer.
  • The new text vector in the embodiment of this application is also a three-dimensional vector, and each component has the same meaning as in the second paragraph text vector and the question text vector. It fuses the second paragraph text vector with the question text vector, and its third component, that is, the dimension of each word, has increased. For brevity, the details are not repeated here.
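  • The two Attention encoding operations can be sketched for a single paragraph and question as follows. The dot product used to fill the correlation matrix is an assumption, since the description does not name the similarity function, and the helper names are illustrative.

```python
import math


def softmax(row):
    """Turn one row of the correlation matrix into Attention scores."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [value / total for value in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def weighted_sum(weights, vectors):
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]


def context_query_attention(context, query):
    """context: N word vectors; query: M word vectors of the same dimension."""
    # N x M correlation matrix (dot-product similarity is an assumption).
    similarity = [[dot(c, q) for q in query] for c in context]
    # Context-to-Query: Softmax over each row, then weight the Query vectors.
    c2q = [weighted_sum(softmax(row), query) for row in similarity]
    # Query-to-Context: M x N matrix, Softmax per row, weight the Context vectors.
    transposed = [list(column) for column in zip(*similarity)]
    q2c = [weighted_sum(softmax(row), context) for row in transposed]
    return c2q, q2c


context = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # N = 3
query = [[1.0, 0.0], [0.0, 1.0]]  # M = 2
c2q, q2c = context_query_attention(context, query)
print(len(c2q), len(q2c))
```

  • The Context-to-Query result has one vector per Context word and the Query-to-Context result has one vector per Query word; how the two are combined into the new text vector is not shown here.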
  • S133 Encode the new text vector according to a preset extraction model to obtain a target text vector.
  • the new text vector needs to be encoded according to the preset extraction model to obtain the target text vector.
  • the preset extraction model is, for example, Encoder Block.
  • This Encoder Block differs from the Encoder Block in step S131 in the number of blocks, but both include a convolutional neural network, a self-attention mechanism, and a forward neural network. The convolutional neural network, the self-attention mechanism, and the forward neural network each add residuals when encoding the new text vector.
  • Three Encoder Blocks are stacked to encode the new text vector and further extract the target text vector from it. The dimension of each word in the target text vector is reduced compared to the dimension of each word in the new text vector, which makes the matching accuracy of the generated FAQ question and answer pairs higher.
  • After the new text vector is encoded by Encoder Block to obtain the target text vector, the target text vector needs to be processed to obtain the start and end positions of the answer to the question to be answered, thereby generating a FAQ question and answer pair.
  • Specifically, the text vector obtained by the first Encoder Block in step S133 and the text vector obtained by the second Encoder Block are concatenated to represent the start position of the answer to the question to be answered, and the text vector obtained by the first Encoder Block and the text vector obtained by the third Encoder Block are concatenated to represent the end position. A softmax operation is then performed on each to obtain the probabilities of the start and end positions of the answer, and the start and end positions with the highest probability are taken as the start and end positions of the answer to the question to be answered, so as to generate the FAQ question and answer pair and thereby realize the automatic construction of the FAQ question and answer pair.
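  • A minimal sketch of the start and end position prediction follows. The projection weights `w_start` and `w_end` that turn each concatenated vector into a score are hypothetical, as the description does not say how scores are derived, and the constraint that the end position is not before the start position is an added assumption. The toy encoder outputs are invented so the example is easy to follow.

```python
import math


def softmax(row):
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [value / total for value in exps]


def position_probabilities(block_a, block_b, weights):
    """Concatenate two encoder outputs per word, score them, softmax over positions."""
    scores = []
    for vec_a, vec_b in zip(block_a, block_b):
        joined = vec_a + vec_b  # list concatenation of the two word vectors
        scores.append(sum(w * x for w, x in zip(weights, joined)))
    return softmax(scores)


def predict_span(m1, m2, m3, w_start, w_end):
    """Return the (start, end) pair with the highest joint probability."""
    start_probs = position_probabilities(m1, m2, w_start)
    end_probs = position_probabilities(m1, m3, w_end)
    candidates = (
        (s, e) for s in range(len(start_probs)) for e in range(s, len(end_probs))
    )
    return max(candidates, key=lambda span: start_probs[span[0]] * end_probs[span[1]])


words = "the policy covers accidental injury only".split()
m1 = [[0.0] for _ in words]  # toy output of the first Encoder Block
m2 = [[0.0], [0.0], [3.0], [0.0], [0.0], [0.0]]  # toy output of the second
m3 = [[0.0], [0.0], [0.0], [0.0], [3.0], [0.0]]  # toy output of the third
start, end = predict_span(m1, m2, m3, w_start=[1.0, 1.0], w_end=[1.0, 1.0])
answer = " ".join(words[start : end + 1])
print(start, end, answer)
```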
  • FIG. 6 is a schematic flowchart of a method for automatically constructing FAQ question and answer pairs provided by another embodiment of the application. As shown in FIG. 6, in this embodiment, the method includes steps S100-S190. That is, in this embodiment, the method further includes steps S140-S190 after step S130 in the foregoing embodiment.
  • The user uploads the document to be read on the FAQ web page and selects the relevant question template according to the content or name of the document to be read.
  • The server calls the relevant question template according to the user's selection and, according to the questions in the question template and the Bert screening model, filters the paragraphs matching those questions from the target document as the target paragraphs. It then generates FAQ question and answer pairs based on the target paragraphs and the questions to be answered, and feeds the generated FAQ question and answer pairs back to the user.
  • the server obtains the generated FAQ question and answer pair and displays the obtained FAQ question and answer pair on the page of the FAQ web page, and the user can perform follow-up operations as needed. For example, if the user is satisfied with the generated FAQ question and answer, he can directly export the FAQ question and answer pair. If he is not satisfied, he can modify it on the FAQ page modification interface.
  • After the server obtains the FAQ question and answer pair and feeds it back to the user, it determines whether a modification instruction sent by the user is received.
  • S160 If the modification instruction sent by the user is received, use the question input by the user in the modification instruction as the question to be answered, and return to step S120 of selecting, according to the question to be answered and the preset screening model, a paragraph matching the question to be answered from the target document as the target paragraph.
  • If the server receives the modification instruction sent by the user, this indicates that the user is not satisfied with the FAQ question and answer pair. The user can enter the question they want to ask on the modification page of the FAQ web page; the server then takes the question input by the user in the modification instruction as the question to be answered and returns to step S120. That is, after the question to be answered is re-determined, the paragraph matching it is selected from the target document as the target paragraph according to the question to be answered and the preset screening model, the subsequent steps are executed in sequence, and the resulting FAQ question and answer pair is finally fed back to the user.
  • If the server does not receive a modification instruction sent by the user, this indicates that the user is satisfied with the generated FAQ question and answer pair. The server then judges whether the question to be answered is a question in the question template of the preset database. If it is, the question to be answered is a question from the question template with which the user is satisfied, and the generated FAQ question and answer pair can be exported.
  • If the question to be answered is not a question in the preset database, this indicates that it is a question entered by the user and that its answer rate is high. In that case, the question entered by the user needs to be added to the question template of the preset database to update and expand the questions in that template, so that the next time FAQ question and answer pairs are automatically generated, the questions in the question template are more abundant and can better meet users' needs.
  • The user's modification operations on the modification interface of the FAQ web page are recorded, and the modified result is treated as the historical record for the question to be answered. These historical records can be used as a large amount of annotated data to optimize the model for FAQ question and answer pairs.
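  • The feedback branch described above can be summarized as a small decision function. The function name, the action labels, and the representation of the question template as a list of strings are all assumptions for illustration; the sketch only mirrors the three outcomes in the text.

```python
def handle_feedback(question, template_questions, modified_question=None):
    """Decide the next step after a FAQ question and answer pair is fed back.

    Returns (action, question_to_answer, template_questions).
    """
    if modified_question is not None:
        # A modification instruction was received: the user's question becomes
        # the question to be answered and screening (step S120) runs again.
        return "rerun_screening", modified_question, template_questions
    if question in template_questions:
        # No modification, and the question came from the template: export.
        return "export", question, template_questions
    # No modification, and the question was entered by the user: add it to
    # the question template so later runs have more questions to draw on.
    return "update_template", question, template_questions + [question]


template = ["What is the waiting period?"]
print(handle_feedback("What is the waiting period?", template))
print(handle_feedback("What is the waiting period?", template, "Who is covered?"))
print(handle_feedback("Who is covered?", template))
```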
  • FIG. 7 is a schematic block diagram of a device 200 for automatically constructing FAQ question and answer pairs provided by an embodiment of the present application.
  • the present application also provides a device 200 for automatically constructing FAQ question and answer pairs.
  • the FAQ question and answer pair automatic construction device 200 includes a unit for executing the above FAQ question and answer pair automatic construction method, and the device may be configured in a server.
  • the apparatus 200 for automatically constructing FAQ question and answer pairs includes an acquiring unit 201, a parsing and segmenting unit 202, a screening unit 203 and a generating unit 204.
  • the obtaining unit 201 is used to obtain the document to be read;
  • the parsing and segmenting unit 202 is used to parse the document to be read and segment the parsed document to obtain the segmented document as the target document;
  • the screening unit 203 is configured to filter out paragraphs matching the question to be answered from the target document as a target paragraph according to the question to be answered and a preset screening model;
  • the generating unit 204 is configured to generate FAQ question and answer pairs, according to the target paragraph and the question to be answered, based on the preset reading comprehension model.
  • the analysis and segmentation unit 202 includes an analysis unit 2021 and a segmentation unit 2022.
  • the parsing unit 2021 is used to parse the document to be read using a cascaded CRF model to obtain an XML document;
  • the segmentation unit 2022 is used to segment the XML document by a preset segmentation method to obtain a document with a preset document structure as the target document.
  • the screening unit 203 includes a first encoding unit 2031, a calculation unit 2032, and a determination unit 2033.
  • the first encoding unit 2031 is configured to encode the target document according to the question to be answered and the preset screening model to obtain first paragraph text vectors;
  • the calculation unit 2032 is configured to calculate, according to the question to be answered, the probability that each first paragraph text vector matches the question to be answered;
  • the determination unit 2033 is configured to determine the paragraph corresponding to the first paragraph text vector with the highest probability as the paragraph that matches the question to be answered, and to use it as the target paragraph.
  • the generating unit 204 includes a second encoding unit 2041, a third encoding unit 2042, a fourth encoding unit 2043, and a generating subunit 2044.
  • the second encoding unit 2041 is configured to encode the target paragraph and the question to be answered respectively to obtain a second paragraph text vector and a question text vector;
  • the third encoding unit 2042 is configured to encode the second paragraph text vector and the question text vector to obtain a new text vector;
  • the fourth encoding unit 2043 is used to encode the new text vector according to a preset extraction model to obtain a target text vector;
  • the generating subunit 2044 is used to perform a calculation on the target text vector to obtain the start and end positions of the answer to the question to be answered, thereby generating the FAQ question and answer pair.
  • the device 200 further includes a feedback unit 205, a first judgment unit 206, a modification unit 207, a second judgment unit 208 and an update unit 209.
  • the feedback unit 205 is used to obtain the FAQ question and answer pair and feed the obtained FAQ question and answer pair back to the user;
  • the first judgment unit 206 is used to judge whether a modification instruction sent by the user is received;
  • the modification unit 207 is used to take the question input by the user in the modification instruction as the question to be answered when the modification instruction sent by the user is received;
  • the second judgment unit 208 is configured to determine, if the modification instruction sent by the user is not received, whether the question to be answered is a question in a preset database question template;
  • the updating unit 209 is configured to update the questions in the preset database question template according to the question to be answered if the question to be answered is not a question in the preset database question template.
  • the above device for automatically constructing FAQ question and answer pairs can be implemented in the form of a computer program, and the computer program can run on the computer device shown in FIG. 12.
  • FIG. 12 is a schematic block diagram of a computer device according to an embodiment of the present application.
  • the computer device 300 is a server.
  • the server may be an independent server or a server cluster composed of multiple servers.
  • the computer device 300 includes a processor 302, a memory, and a network interface 305 connected through a system bus 301, where the memory may include a non-volatile storage medium 303 and an internal memory 304.
  • the non-volatile storage medium 303 can store an operating system 3031 and a computer program 3032.
  • the processor 302 can execute a method for automatically constructing FAQ question and answer pairs.
  • the processor 302 is used to provide calculation and control capabilities to support the operation of the entire computer device 300.
  • the internal memory 304 provides an environment for the operation of the computer program 3032 in the non-volatile storage medium 303.
  • when the computer program 3032 is executed by the processor 302, it implements the method for automatically constructing FAQ question and answer pairs in the embodiment of the present application.
  • the network interface 305 is used for network communication with other devices.
  • the structure shown in FIG. 12 is only a block diagram of a part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device 300 to which the solution of the present application is applied.
  • the specific computer device 300 may include more or fewer components than shown in the figure, or combine certain components, or have a different component arrangement.
  • the processor 302 may be a central processing unit (Central Processing Unit, CPU), and the processor 302 may also be other general-purpose processors, digital signal processors (Digital Signal Processors, DSPs), Application Specific Integrated Circuit (ASIC), Field-Programmable Gate Array (FPGA) or other programmable logic devices, discrete gates or transistor logic devices, discrete hardware components, etc.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor.
  • the computer program can be stored in a storage medium, and the storage medium is a computer-readable storage medium.
  • the computer program is executed by at least one processor in the computer system to implement the process steps of the foregoing method embodiment.
  • the storage medium may be a computer-readable storage medium.
  • the storage medium stores a computer program.
  • when the computer program is executed by a processor, it implements the method for automatically constructing FAQ question and answer pairs in the embodiment of the present application.
  • when the processor executes the computer program to realize the step of parsing the document to be read and segmenting the parsed document to obtain the segmented document as the target document, the following steps are specifically implemented:
  • the document to be read is parsed using a cascaded CRF model to obtain an XML document; the XML document is segmented by a preset segmentation method to obtain a document with a preset document structure as the target document.
  • the storage medium may be a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a magnetic disk, an optical disk, or any other computer-readable storage medium that can store program code.
  • the steps in the method in the embodiment of the present application can be reordered, merged, and deleted according to actual needs.
  • the units in the device of the embodiment of the present application may be combined, divided, and deleted according to actual needs.
  • the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • if the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a storage medium.
  • the technical solution of this application, in essence, or the part that contributes to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a terminal, or a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present application.

Abstract

Disclosed are an automatic construction method and apparatus for an FAQ question-answer pair, and a computer device and a storage medium. The method belongs to the technical field of artificial intelligence and natural language processing, and comprises: acquiring a document to be read; parsing the document to be read, and paragraphing a parsed document to obtain a paragraphed document as a target document; screening out, according to a question to be answered and a preset screening model, a paragraph, matching said question, from the target document to serve as a target paragraph; and according to the target paragraph and said question, generating an FAQ question-answer pair on the basis of a preset reading comprehension model.

Description

Method and device for automatically constructing FAQ question and answer pairs, computer device, and storage medium
This application claims priority to Chinese patent application No. CN201910969443.4, filed with the Chinese Patent Office on October 12, 2019 and entitled "Method and device for automatically constructing FAQ question and answer pairs, computer device, and storage medium", the entire content of which is incorporated into this application by reference.
Technical field
This application relates to the technical fields of artificial intelligence and natural language processing, and in particular to a method, device, computer device, and storage medium for automatically constructing FAQ question and answer pairs.
Background
FAQ is the abbreviation of "Frequently Asked Questions", more colloquially "answers to common questions". An FAQ is regarded as a common means of online customer service. A good FAQ system should be able to answer at least 80% of users' general and common questions. This is convenient for users, greatly reduces the pressure on website staff, saves considerable customer service costs, and increases customer satisfaction. How to build an FAQ database effectively is therefore particularly important.
At present, there are three main methods for automatic FAQ construction in the question answering field: (1) The article to be read and the question to be answered are segmented into words, the resulting word strings are input into an automatic reading comprehension model, and the model outputs the answer corresponding to the question. (2) Based on the similarity between the question raised by the user and the existing question records in the question-and-answer database, a question matching the user's question is found in the existing "question-answer" pair database, and its corresponding answer is returned to the user, completing the FAQ answer. (3) For an already established FAQ, sentence-pattern templates corresponding to the standard questions are created by manual entry. The user's question is matched against the sentence-pattern templates and then matched to the FAQ through the mapping between the templates and the FAQ. Although these three methods can match successfully to a certain extent and realize the automatic construction of FAQ question and answer pairs, their matching accuracy is still relatively low.
Summary of the invention
The embodiments of the present application provide a method, device, computer device, and storage medium for automatically constructing FAQ question and answer pairs, aiming to solve the problem that existing automatic construction of FAQ question and answer pairs has relatively low matching accuracy.
In a first aspect, an embodiment of the present application provides a method for automatically constructing FAQ question and answer pairs, which includes:
obtaining a document to be read; parsing the document to be read and segmenting the parsed document to obtain the segmented document as a target document; screening out, from the target document, a paragraph that matches a question to be answered as a target paragraph, according to the question to be answered and a preset screening model; and generating a FAQ question and answer pair based on a preset reading comprehension model, according to the target paragraph and the question to be answered.
In a second aspect, an embodiment of the present application also provides a device for automatically constructing FAQ question and answer pairs, which includes:
an acquiring unit for obtaining a document to be read; a parsing and segmenting unit for parsing the document to be read and segmenting the parsed document to obtain the segmented document as a target document; a screening unit for screening out, from the target document, a paragraph that matches a question to be answered as a target paragraph, according to the question to be answered and a preset screening model; and a generating unit for generating a FAQ question and answer pair based on a preset reading comprehension model, according to the target paragraph and the question to be answered.
In a third aspect, an embodiment of the present application also provides a computer device, which includes a memory and a processor. The memory stores a computer program, and the processor implements the above method when executing the computer program.
In a fourth aspect, an embodiment of the present application also provides a computer-readable storage medium. The storage medium stores a computer program, and the computer program implements the above method when executed by a processor.
The embodiments of the present application provide a method, device, computer device, and storage medium for automatically constructing FAQ question and answer pairs. In the technical solution of the embodiments, the target paragraph that matches the question to be answered is screened out first, and the FAQ question and answer pair is then generated from the target paragraph and the question to be answered. Non-target paragraphs need not be processed, which reduces to a certain extent the interfering information that non-target paragraphs introduce when generating FAQ question and answer pairs, so the generated pairs can match more accurately.
Description of the drawings
To explain the technical solutions of the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from them without creative work.
FIG. 1 is a schematic diagram of a scenario of a method for automatically constructing FAQ question and answer pairs provided by an embodiment of the application;
FIG. 2 is a schematic flowchart of a method for automatically constructing FAQ question and answer pairs provided by an embodiment of the application;
FIG. 3 is a schematic diagram of a sub-flow of a method for automatically constructing FAQ question and answer pairs provided by an embodiment of the application;
FIG. 4 is a schematic diagram of a sub-flow of a method for automatically constructing FAQ question and answer pairs provided by an embodiment of the application;
FIG. 5 is a schematic diagram of a sub-flow of a method for automatically constructing FAQ question and answer pairs provided by an embodiment of the application;
FIG. 6 is a schematic flowchart of a method for automatically constructing FAQ question and answer pairs provided by another embodiment of the application;
FIG. 7 is a schematic block diagram of a device for automatically constructing FAQ question and answer pairs provided by an embodiment of the application;
FIG. 8 is a schematic block diagram of the parsing and segmenting unit of the device for automatically constructing FAQ question and answer pairs provided by an embodiment of the application;
FIG. 9 is a schematic block diagram of the screening unit of the device for automatically constructing FAQ question and answer pairs provided by an embodiment of the application;
FIG. 10 is a schematic block diagram of the generating unit of the device for automatically constructing FAQ question and answer pairs provided by an embodiment of the application;
FIG. 11 is a schematic block diagram of a device for automatically constructing FAQ question and answer pairs provided by another embodiment of the application; and
FIG. 12 is a schematic block diagram of a computer device provided by an embodiment of the application.
Detailed description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some of the embodiments of the present application, not all of them. All other embodiments obtained by those of ordinary skill in the art based on the embodiments in this application without creative work fall within the protection scope of this application.
It should be understood that, when used in this specification and the appended claims, the terms "comprising" and "including" indicate the presence of the described features, wholes, steps, operations, elements, and/or components, but do not exclude the presence or addition of one or more other features, wholes, steps, operations, elements, components, and/or collections thereof.
It should also be understood that the terms used in the specification of this application are only for the purpose of describing specific embodiments and are not intended to limit the application. As used in the specification of this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" used in the specification of this application and the appended claims refers to any combination and all possible combinations of one or more of the associated listed items, and includes these combinations.
As used in this specification and the appended claims, the term "if" can be interpreted, depending on the context, as "when" or "once" or "in response to determining" or "in response to detecting". Similarly, the phrase "if it is determined" or "if [the described condition or event] is detected" can be interpreted, depending on the context, as meaning "once it is determined" or "in response to determining" or "once [the described condition or event] is detected" or "in response to detecting [the described condition or event]".
Please refer to FIG. 1, which is a schematic diagram of a scenario of a method for automatically constructing FAQ question and answer pairs provided by an embodiment of the present application. The method of the embodiment can be applied in a server, for example through a software program configured on the server. The server communicates with a terminal so that the server can retrieve the document to be read that the user uploads through the terminal, and then performs a series of processing steps on the question to be answered and the document to be read to obtain FAQ question and answer pairs, realizing their automatic construction. The terminal can be a desktop computer, a laptop computer, a tablet computer, etc., and is not specifically limited here. In FIG. 1 there is one terminal and one user. It can be understood that, in actual applications, there may be multiple terminals and users; FIG. 1 serves only as a schematic illustration.
Please refer to FIG. 2, which is a schematic flowchart of a method for automatically constructing FAQ question and answer pairs provided by an embodiment of the present application. As shown in FIG. 2, the method includes the following steps S100-S130.
S100. Obtain a document to be read.
Specifically, to construct FAQ question and answer pairs automatically, the server first needs to obtain the document to be read, and can generate FAQ question and answer pairs only after performing a series of processing steps based on that document. In the embodiment of the present application, the user can upload the document to be read through the FAQ web page on the user terminal, which sends the document to the server. In the embodiment of the present application, the document to be read is a PDF document.
It should be noted that, in other embodiments, the document to be read may also be another type of document, such as a Word document.
S110. Parse the document to be read and segment the parsed document to obtain the segmented document as a target document.
Specifically, after obtaining the document to be read, the server needs to parse it to obtain a document in the required format, and then segment the content of that document to finally obtain a document with a preset document structure.
Please refer to FIG. 3. In an embodiment, for example in the embodiment of the present application, step S110 includes the following steps S111-S112.
S111. Parse the document to be read using a cascaded CRF model to obtain an XML document.
S112. Segment the XML document by a preset segmentation method to obtain a document with a preset document structure as the target document.
In the embodiment of the present application, a cascaded CRF model is used to parse the document to be read to obtain an XML document. CRF stands for Conditional Random Field. The cascaded CRF model is used in this embodiment because it parses the document to be read in a relatively short processing time and with relatively good results. XML stands for Extensible Markup Language. After the document to be read has been parsed into an XML document, the XML document needs to be segmented to obtain the segmented document as the target document. Specifically, the XML document can be segmented by a preset segmentation method, of which there are several. In this embodiment, the document is segmented at second-level headings. In other embodiments, other segmentation methods, such as first-level headings or article paragraphs, can be selected according to actual needs.
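Segmentation at second-level headings can be sketched as follows. This is only an illustration under stated assumptions: the patent does not describe the XML schema produced by the parser, so the flat layout and the tag names `h1`, `h2`, and `p` are hypothetical, as is the function name `segment_by_heading`.

```python
import xml.etree.ElementTree as ET


def segment_by_heading(xml_text, heading_tag="h2"):
    """Group paragraphs into segments, opening a new segment at each heading.

    Assumes a flat, hypothetical schema: <doc><h1/><h2/><p/>...</doc>.
    Content that appears before the first heading_tag is skipped.
    """
    root = ET.fromstring(xml_text)
    segments = []
    current = None
    for node in root:
        text = (node.text or "").strip()
        if node.tag == heading_tag:
            # A heading starts a new segment.
            current = {"title": text, "paragraphs": []}
            segments.append(current)
        elif current is not None and node.tag == "p" and text:
            # Body text is attached to the most recent heading.
            current["paragraphs"].append(text)
    return segments


sample = (
    "<doc><h1>Policy</h1>"
    "<h2>Coverage</h2><p>A</p><p>B</p>"
    "<h2>Claims</h2><p>C</p></doc>"
)
segments = segment_by_heading(sample)
```

Passing `heading_tag="h1"` would give the first-level-heading variant mentioned for other embodiments.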
It should be noted that, in other embodiments, other models may also be used to parse the document to be read, for example a hidden Markov model (HMM).
S120. According to the question to be answered and a preset screening model, screen out from the target document a paragraph that matches the question to be answered as a target paragraph.
Specifically, after the server has parsed the document to be read and segmented the parsed document to obtain the target document, it also needs to screen the target document to obtain a paragraph that matches the question to be answered as the target paragraph. In the embodiment of the present application, the question to be answered is a question in a question template stored in a preset database. After uploading the document to be read on the FAQ web page of the terminal, the user can select a corresponding question template according to the content or name of the uploaded document. For example, if the document uploaded on the FAQ web page concerns life insurance or accident insurance, the user selects the question template related to life insurance or accident insurance, which includes multiple common questions on that subject. The server retrieves the corresponding question template according to the user's selection and, according to the questions in the template and the preset screening model, screens out from the target document the paragraphs that match those questions as target paragraphs.
In some embodiments, such as the embodiment of the present application, as shown in FIG. 4, step S120 may include the following steps S121-S123.
S121. Encode the target document according to the question to be answered and the preset screening model to obtain first paragraph text vectors.
In the embodiment of the present application, to screen the target document for paragraphs that match the question to be answered, the target document first needs to be encoded according to the question to be answered and the screening model to obtain first paragraph text vectors. The preset screening model is, for example, a BERT (Bidirectional Encoder Representations from Transformers) model. The BERT model is a bidirectional language model based on the Transformer. It can extract the syntactic and semantic information of a document and can take the document's context into account when doing so. Specifically, the server generates the first paragraph text vectors for the target document according to the question to be answered and the BERT model. The first paragraph text vector is a three-dimensional vector that serves as the text vector representation of the target document. The BERT model is used here because it extracts the syntactic and semantic information of the target document in combination with its context, which improves the accuracy of the extraction.
It should be noted that, in other embodiments, other models can also be used to screen the target document for the target paragraph according to actual needs, such as the Word2vec (word to vector) model.
S122. Calculate, according to the question to be answered, the probability that each first paragraph text vector matches the question to be answered.
S123. Determine the paragraph corresponding to the first paragraph text vector with the highest probability as the paragraph that matches the question to be answered, and use it as the target paragraph.
In the embodiment of the present application, after the server has generated the first paragraph text vectors for the target document according to the question to be answered and the BERT model, it also needs to calculate, according to the question to be answered, the probability that each first paragraph text vector matches the question. Specifically, the Softmax function is used to calculate these probabilities. The calculated probabilities are then sorted, and the paragraph corresponding to the first paragraph text vector with the highest probability is taken as the target paragraph.
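Steps S122-S123 can be illustrated with a minimal sketch. It assumes that the screening model has already reduced each question-paragraph pair to a single real-valued match score; how those scores are produced is not shown, and the names `softmax` and `select_target_paragraph` are illustrative rather than taken from the patent.

```python
import math


def softmax(scores):
    """Convert raw match scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_target_paragraph(paragraphs, scores):
    """Return the paragraph with the highest match probability and that probability."""
    probs = softmax(scores)
    best = max(range(len(paragraphs)), key=lambda i: probs[i])
    return paragraphs[best], probs[best]


# Hypothetical scores for three segmented paragraphs.
paragraphs = ["Coverage terms ...", "Waiting period ...", "Claims process ..."]
target, prob = select_target_paragraph(paragraphs, [1.0, 3.0, 2.0])
```

Because Softmax is monotonic, the selected paragraph is the same one an argmax over the raw scores would pick; the probabilities are useful mainly when a confidence value is needed.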
S130. Generate a FAQ question-answer pair from the target paragraph and the question to be answered, based on a preset reading comprehension model.
Specifically, after the server screens out of the target document the paragraph that matches the question to be answered, it generates a FAQ question-answer pair from that paragraph and the question. The pair can be generated automatically by a preset reading comprehension model, which predicts, from the target paragraph and the question to be answered, the start and end positions within the target paragraph of the answer corresponding to the question, thereby determining the answer and generating the FAQ question-answer pair. In this embodiment, because only the paragraph matching the question is selected from the target document as the target paragraph, the server can generate the FAQ question-answer pair from the target paragraph and the question based on the preset generation model without processing non-target paragraphs. This reduces, to some extent, the interference that non-target paragraphs introduce when the pair is generated, so the generated pairs are relatively accurate. The influence of crossing domains is thereby weakened, and FAQ question-answer pairs with relatively high matching accuracy can be generated for cross-domain questions.
In some embodiments, such as this embodiment, as shown in FIG. 5, step S130 may include the following steps S131-S134.
S131. Encode the target paragraph and the question to be answered separately to obtain a second paragraph text vector and a question text vector.
In this embodiment, after the server screens out of the target document the paragraph that matches the question to be answered, it encodes the target paragraph and the question to be answered separately. Specifically, a preset model such as the Bert model first encodes the target paragraph and the question, and a preset model such as an Encoder Block then re-encodes the results, yielding the second paragraph text vector and the question text vector.
The second paragraph text vector and the question text vector are both three-dimensional vectors. The first component is Batch_Size, a batch-processing parameter whose upper limit is the total number of samples in the training set. In this embodiment Batch_Size is 32, meaning that the preset model processes target paragraphs and questions in mini-batches. In other embodiments Batch_Size may be set to other values, provided that encoding the target paragraph and the question still yields the second paragraph text vector and the question text vector. The second component is the sentence length, and the third component is the dimension of the vector corresponding to each word.
The Encoder Block includes a convolutional neural network, a self-attention mechanism and a feedforward neural network. A convolutional neural network (CNN) is a feedforward neural network that includes convolution computations and has a deep structure, and is one of the representative algorithms of deep learning. The self-attention mechanism uses the Attention mechanism to produce representations that take context into account. Specifically, the Bert model first encodes the target paragraph and the question to obtain a first temporary paragraph text vector and a first temporary question text vector, both three-dimensional and referred to as first three-dimensional vectors. The convolutional neural network then encodes these to obtain a second temporary paragraph text vector and a second temporary question text vector, also three-dimensional and referred to as second three-dimensional vectors, in which the per-word dimension is lower than in the first three-dimensional vectors. Next, for each word in the second temporary vectors, the self-attention mechanism computes a weight for every other word and takes the weighted sum, giving a third temporary paragraph text vector and a third temporary question text vector, referred to as third three-dimensional vectors. Each component of a third three-dimensional vector has the same meaning as in a second three-dimensional vector; the third is simply a further extraction from the second, with a lower per-word dimension.
Finally, the third temporary paragraph text vector and the third temporary question text vector are passed through a feedforward neural network to obtain the final paragraph text vector and question text vector. The final paragraph text vector is defined as the second paragraph text vector, and the per-word dimension of the second paragraph text vector and the question text vector is lower than that of the third temporary vectors. In this step only one Encoder Block is stacked, but a single Encoder Block contains multiple network layers, and the deeper a network is, the more it suffers from vanishing gradients during backpropagation. To alleviate this problem, in this embodiment a residual connection is added around each of the convolutional neural network, the self-attention mechanism and the feedforward neural network.
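The residual idea in the Encoder Block can be illustrated with a small sketch. It assumes that every sub-layer keeps the per-word width unchanged so that input and output can be added. The three sub-layers are toy stand-ins with hypothetical names, not the convolution, self-attention or feedforward layers of the embodiment.

```python
def residual(sublayer, x):
    """Return x + sublayer(x), added element by element."""
    y = sublayer(x)
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]

def toy_conv(x):
    """Toy convolution: average each word vector with its neighbours."""
    out = []
    for i in range(len(x)):
        window = x[max(0, i - 1): i + 2]
        out.append([sum(col) / len(window) for col in zip(*window)])
    return out

def toy_self_attention(x):
    """Toy attention with uniform weights: each word gets the mean vector."""
    mean = [sum(col) / len(x) for col in zip(*x)]
    return [list(mean) for _ in x]

def toy_feedforward(x):
    """Toy feedforward layer: position-wise ReLU."""
    return [[max(0.0, v) for v in row] for row in x]

def encoder_block(x):
    """Apply the three sub-layers in order, each wrapped in a residual."""
    for sublayer in (toy_conv, toy_self_attention, toy_feedforward):
        x = residual(sublayer, x)
    return x

words = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 words, width 2
encoded = encoder_block(words)
```

Because each sub-layer's input is added back to its output, the gradient has a direct path around the sub-layer, which is why residuals help with vanishing gradients in deeper stacks.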
It should be noted that, in other embodiments, other models may be used according to actual needs to encode the target paragraph and the question to be answered separately, provided that the second paragraph text vector and the question text vector are obtained; for example, an RNN (Recurrent Neural Network) model may replace the self-attention mechanism.
S132. Encode the second paragraph text vector and the question text vector to obtain a new text vector.
In this embodiment, after the target paragraph and the question to be answered have been encoded separately to obtain the second paragraph text vector and the question text vector, these two vectors are encoded to obtain a new text vector. Specifically, an Attention encoding operation is performed on the second paragraph text vector and the question text vector in a Context-Query Attention layer. The operation has two directions, Context-to-Query and Query-to-Context. In the Context-to-Query direction, the Context of length N and the Query of length M form an N*M correlation matrix; Softmax is applied to each row of the matrix to obtain Attention scores; and the scores are used to take a weighted sum of the original Query text vectors, giving text vectors that contain Attention information.
In the Query-to-Context direction, the Query of length M and the Context of length N form an M*N correlation matrix; Softmax is applied to each row of the matrix to obtain Attention scores; and the scores are used to take a weighted sum of the original Context text vectors, giving text vectors that contain Attention information. Performing both the Context-to-Query and the Query-to-Context encoding operations on the second paragraph text vector and the question text vector in the Context-Query Attention layer yields the new text vector.
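The two attention directions described above can be sketched in plain Python. The dot product used as the correlation measure is an assumption, since the text does not say how the correlation matrix is filled, and the function names are hypothetical. How the two results are merged into the new text vector is not shown.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attend(side_a, side_b):
    """For each vector in side_a, return a softmax-weighted sum of side_b.

    Builds a len(side_a) x len(side_b) correlation matrix (dot products,
    an assumed choice), applies softmax to each row, and uses the row as
    weights over the vectors of side_b.
    """
    matrix = [[dot(a, b) for b in side_b] for a in side_a]
    weights = [softmax(row) for row in matrix]
    width = len(side_b[0])
    return [
        [sum(w * b[d] for w, b in zip(row, side_b)) for d in range(width)]
        for row in weights
    ]

context = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # N = 3 words
query = [[1.0, 0.0], [0.0, 2.0]]                # M = 2 words

context_to_query = attend(context, query)  # N rows, built from Query vectors
query_to_context = attend(query, context)  # M rows, built from Context vectors
```

Each output row is a convex combination of the other side's vectors, so a context word that correlates strongly with one query word ends up close to that query word's vector.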
It should be noted that the new text vector in this embodiment is also a three-dimensional vector, and each of its components has the same meaning as in the second paragraph text vector and the question text vector. It differs in that it captures the interaction between the second paragraph text vector and the question text vector, and its third component, the per-word dimension, is larger. For brevity, the details are not repeated here.
S133. Encode the new text vector according to a preset extraction model to obtain a target text vector.
In this embodiment, after the second paragraph text vector and the question text vector have been encoded to obtain the new text vector, the new text vector is encoded according to a preset extraction model to obtain the target text vector. The preset extraction model is, for example, the Encoder Block. The number of Encoder Blocks here differs from that in step S131, but each block likewise includes a convolutional neural network, a self-attention mechanism and a feedforward neural network, and a residual connection is added around each of them when the new text vector is encoded. In this step three Encoder Blocks are stacked to encode the new text vector and further extract the target text vector from it. The per-word dimension of the target text vector is lower than that of the new text vector, which makes the matching accuracy of the generated FAQ question-answer pair higher.
S134. Perform calculation on the target text vector to obtain the start and end positions of the answer to the question to be answered, thereby generating the FAQ question-answer pair.
In this embodiment, after the Encoder Blocks have encoded the new text vector to obtain the target text vector, calculation is performed on the target text vector to obtain the start and end positions of the answer to the question to be answered, thereby generating the FAQ question-answer pair. Specifically, the text vector produced by the first Encoder Block in step S133 is concatenated with the text vector produced by the second Encoder Block to form the representation of the start position of the answer, and the text vector produced by the first Encoder Block is concatenated with the text vector produced by the third Encoder Block to form the representation of the end position of the answer. A Softmax operation is then applied to each representation to obtain the probabilities of the start and end positions, and the positions with the highest probabilities are taken as the start and end of the answer. The FAQ question-answer pair is generated from these positions, which realizes the automatic construction of FAQ question-answer pairs.
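The final selection in step S134 can be sketched as below. The per-token scores are made-up stand-ins for what the concatenated Encoder Block outputs would produce after projection, and the function names are hypothetical. Returning no answer when the predicted end precedes the predicted start is an assumption, as the text does not cover that case.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)

def build_faq_pair(question, tokens, start_scores, end_scores):
    """Pick the most probable start and end tokens and return (question, answer)."""
    start = argmax(softmax(start_scores))
    end = argmax(softmax(end_scores))
    if end < start:
        # Assumed fallback: no valid span was predicted.
        return question, None
    return question, " ".join(tokens[start:end + 1])

tokens = ["Orders", "ship", "within", "three", "business", "days", "."]
start_scores = [0.1, 0.2, 0.3, 2.0, 0.5, 0.1, 0.0]  # illustrative values
end_scores = [0.0, 0.1, 0.2, 0.3, 0.4, 2.5, 0.1]    # illustrative values
pair = build_faq_pair("How long does shipping take?", tokens, start_scores, end_scores)
```

Softmax does not change which position has the highest score, but it is kept so the sketch follows the described order of operations and exposes the probabilities.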
FIG. 6 is a schematic flowchart of a method for automatically constructing FAQ question-answer pairs provided by another embodiment of the present application. As shown in FIG. 6, in this embodiment the method includes steps S100-S190; that is, the method further includes steps S140-S190 after step S130 of the foregoing embodiment.
S140. Obtain the FAQ question-answer pair and feed it back to the user.
In this embodiment, the user uploads the document to be read on the FAQ web page and selects a relevant question template according to the content or name of the document. The server invokes the selected question template and, according to the question in the template and the Bert screening model, screens out of the target document the paragraph that matches the question in the template as the target paragraph. After generating the FAQ question-answer pair from the target paragraph and the question to be answered, the server feeds the pair back to the user. Specifically, the server obtains the generated FAQ question-answer pair and displays it on the FAQ web page, and the user can then perform follow-up operations as needed. For example, a user who is satisfied with the generated pair can export it directly, and a user who is not satisfied can modify it in the modification interface of the FAQ web page.
S150. Determine whether a modification instruction sent by the user has been received.
In this embodiment, after the server obtains the FAQ question-answer pair and feeds it back to the user, it determines whether a modification instruction sent by the user has been received.
S160. If a modification instruction sent by the user is received, take the question entered by the user in the modification instruction as the question to be answered, and return to step S120 of screening out of the target document, according to the question to be answered and the preset screening model, the paragraph that matches the question to be answered as the target paragraph.
In this embodiment, if the server receives a modification instruction sent by the user, the user is not satisfied with the FAQ question-answer pair. The user can enter the question they wish to ask on the modification page of the FAQ web page; the server then takes the question entered by the user in the modification instruction as the question to be answered and returns to step S120. That is, once the question to be answered has been re-determined, the paragraph matching it is screened out of the target document as the target paragraph according to the question and the preset screening model, the subsequent steps are executed in sequence, and the resulting FAQ question-answer pair is fed back to the user again.
S170. If no modification instruction sent by the user is received, determine whether the question to be answered is a question in a preset database question template.
S180. If the question to be answered is not a question in the preset database question template, update the questions in the preset database question template according to the question to be answered.
S190. If the question to be answered is a question in the preset database question template, do not update the questions in the preset database question template.
In this embodiment, if the server does not receive a modification instruction from the user, the user is satisfied with the generated FAQ question-answer pair, and the server determines whether the question to be answered is a question in the preset database question template. If it is, the question to be answered came from the question template and the user is satisfied with the result, so the generated FAQ question-answer pair can be exported. If it is not, the question to be answered was entered by the user and was answered with relatively high accuracy, so the question is added to the preset database question template to update and expand the template. The next time FAQ question-answer pairs are generated automatically, the template will then contain a richer set of questions and better meet user needs.
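The branching in steps S150-S190 can be summarized in a short sketch. The function name, its parameters and the use of a plain list for the question template are all hypothetical; the preset database is not modeled.

```python
def process_feedback(question, template_questions, modified_question=None):
    """Mirror steps S150-S190 for one generated FAQ question-answer pair.

    Returns the question to rerun from step S120, or None when the pair
    is accepted. `template_questions` is updated in place.
    """
    # S150/S160: a modification instruction carries a new question to answer.
    if modified_question is not None:
        return modified_question
    # S170/S180: an accepted question that is not yet in the template is added.
    if question not in template_questions:
        template_questions.append(question)
    # S190: a question already in the template leaves the template unchanged.
    return None

templates = ["What is the refund policy?"]
first = process_feedback("What is the refund policy?", templates)
second = process_feedback("How long is the warranty?", templates)
third = process_feedback("How long is the warranty?", templates,
                         modified_question="Is the warranty transferable?")
```

After these three calls the template holds two questions: the original one and the user-entered question that was accepted without modification.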
It should be noted that, in this embodiment, the modifications the user makes in the modification interface of the FAQ web page are recorded, and the modified results are kept as the history of questions to be answered. These history records can serve as a large body of annotated data for optimizing the FAQ question-answer pair model.
FIG. 7 is a schematic block diagram of an apparatus 200 for automatically constructing FAQ question-answer pairs provided by an embodiment of the present application. As shown in FIG. 7, corresponding to the above method for automatically constructing FAQ question-answer pairs, the present application also provides an apparatus 200 for automatically constructing FAQ question-answer pairs. The apparatus 200 includes units for executing the above method and may be configured in a server. Specifically, referring to FIG. 7, the apparatus 200 includes an acquiring unit 201, a parsing and segmenting unit 202, a screening unit 203 and a generating unit 204.
The acquiring unit 201 is configured to acquire the document to be read. The parsing and segmenting unit 202 is configured to parse the document to be read and segment the parsed document to obtain a segmented document as the target document. The screening unit 203 is configured to screen out of the target document, according to the question to be answered and a preset screening model, the paragraph that matches the question to be answered as the target paragraph. The generating unit 204 is configured to generate a FAQ question-answer pair from the target paragraph and the question to be answered, based on a preset reading comprehension model.
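The way the four units hand data to one another can be sketched as a small pipeline. The class name and its fields are hypothetical, and every unit below is a trivial stand-in: blank-line splitting replaces the parsing model, word overlap replaces the screening model, and returning the whole paragraph replaces the reading comprehension model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class FaqPairBuilder:
    acquire: Callable[[str], str]                    # stands in for unit 201
    parse_and_segment: Callable[[str], List[str]]    # stands in for unit 202
    screen: Callable[[str, List[str]], str]          # stands in for unit 203
    generate: Callable[[str, str], Tuple[str, str]]  # stands in for unit 204

    def build(self, source: str, question: str) -> Tuple[str, str]:
        """Run the units in order and return a (question, answer) pair."""
        document = self.acquire(source)
        paragraphs = self.parse_and_segment(document)
        target = self.screen(question, paragraphs)
        return self.generate(question, target)

def overlap(question: str, paragraph: str) -> int:
    """Count the words shared by the question and the paragraph."""
    return len(set(question.lower().split()) & set(paragraph.lower().split()))

builder = FaqPairBuilder(
    acquire=lambda text: text,
    parse_and_segment=lambda doc: [p.strip() for p in doc.split("\n\n") if p.strip()],
    screen=lambda q, ps: max(ps, key=lambda p: overlap(q, p)),
    generate=lambda q, p: (q, p),
)

doc = "Refunds are issued within ten days.\n\nOrders ship in three days."
pair = builder.build(doc, "when do orders ship")
```

Keeping each unit as a separate callable matches the block diagram: any one of them can be swapped for a real model without touching the others.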
In some embodiments, such as this embodiment, as shown in FIG. 8, the parsing and segmenting unit 202 includes a parsing unit 2021 and a segmenting unit 2022.
The parsing unit 2021 is configured to parse the document to be read using a cascaded CRF model to obtain an XML document. The segmenting unit 2022 is configured to segment the XML document in a preset segmentation manner to obtain a document with a preset document structure as the target document.
In some embodiments, such as this embodiment, as shown in FIG. 9, the screening unit 203 includes a first encoding unit 2031, a calculation unit 2032 and a determining unit 2033.
The first encoding unit 2031 is configured to encode the target document according to the question to be answered and the preset screening model to obtain first paragraph text vectors. The calculation unit 2032 is configured to calculate, according to the question to be answered, the probability that each first paragraph text vector matches the question to be answered. The determining unit 2033 is configured to determine the paragraph corresponding to the first paragraph text vector with the highest probability as the paragraph that matches the question to be answered, and to use it as the target paragraph.
In some embodiments, such as this embodiment, as shown in FIG. 10, the generating unit 204 includes a second encoding unit 2041, a third encoding unit 2042, a fourth encoding unit 2043 and a generating subunit 2044.
The second encoding unit 2041 is configured to encode the target paragraph and the question to be answered separately to obtain a second paragraph text vector and a question text vector. The third encoding unit 2042 is configured to encode the second paragraph text vector and the question text vector to obtain a new text vector. The fourth encoding unit 2043 is configured to encode the new text vector according to a preset extraction model to obtain a target text vector. The generating subunit 2044 is configured to perform calculation on the target text vector to obtain the start and end positions of the answer to the question to be answered, thereby generating the FAQ question-answer pair.
In some embodiments, such as this embodiment, as shown in FIG. 11, the apparatus 200 further includes a feedback unit 205, a first judging unit 206, a modification unit 207, a second judging unit 208 and an updating unit 209.
The feedback unit 205 is configured to obtain the FAQ question-answer pair and feed it back to the user. The first judging unit 206 is configured to determine whether a modification instruction sent by the user has been received. The modification unit 207 is configured to, if the modification instruction sent by the user is received, take the question entered by the user in the modification instruction as the question to be answered. The second judging unit 208 is configured to, if no modification instruction sent by the user is received, determine whether the question to be answered is a question in a preset database question template. The updating unit 209 is configured to, if the question to be answered is not a question in the preset database question template, update the questions in the preset database question template according to the question to be answered.
The above apparatus for automatically constructing FAQ question-answer pairs may be implemented in the form of a computer program, and the computer program can run on a computer device as shown in FIG. 12.
Please refer to FIG. 12, which is a schematic block diagram of a computer device provided by an embodiment of the present application. The computer device 300 is a server; specifically, the server may be an independent server or a server cluster composed of multiple servers.
Referring to FIG. 12, the computer device 300 includes a processor 302, a memory and a network interface 305 connected through a system bus 301, where the memory may include a non-volatile storage medium 303 and an internal memory 304.
The non-volatile storage medium 303 can store an operating system 3031 and a computer program 3032. When executed, the computer program 3032 causes the processor 302 to perform a method for automatically constructing FAQ question-answer pairs.
The processor 302 provides computing and control capabilities to support the operation of the entire computer device 300.
The internal memory 304 provides an environment for running the computer program 3032 stored in the non-volatile storage medium 303. When executed by the processor 302, the computer program 3032 implements the method for automatically constructing FAQ question-answer pairs of the embodiments of the present application.
The network interface 305 is used for network communication with other devices. Those skilled in the art will understand that the structure shown in FIG. 12 is only a block diagram of the part of the structure related to the solution of the present application and does not limit the computer device 300 to which the solution is applied. A specific computer device 300 may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
It should be understood that, in the embodiments of the present application, the processor 302 may be a central processing unit (CPU), and may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor or any conventional processor.
Those of ordinary skill in the art will understand that all or part of the processes in the methods of the foregoing embodiments can be completed by a computer program instructing the relevant hardware. The computer program may be stored in a storage medium, which is a computer-readable storage medium. The computer program is executed by at least one processor in the computer system to implement the process steps of the foregoing method embodiments.
Accordingly, the present application also provides a storage medium. The storage medium may be a computer-readable storage medium and stores a computer program. When executed by a processor, the computer program causes the processor to implement the method for automatically constructing FAQ question-answer pairs of the embodiments of the present application.
在某些实施例,例如本实施例中,所述处理器在执行所述计算机程序而实现所述对所述待阅读的文档进行解析并对解析后的文档进行分段以得到分段后的文档作为目标文档步骤时,具体实现如下步骤:对所述待阅读的文档采用层叠CRF模型进行解析以得到XML文档;通过预设分段方式对所述XML文档进行分段,以得到具有预设文档结构的文档作为目标文档。In some embodiments, such as this embodiment, the processor executes the computer program to realize the parsing of the document to be read and segmenting the parsed document to obtain the segmented document. When a document is used as a target document step, the following steps are specifically implemented: the document to be read is parsed using a cascading CRF model to obtain an XML document; the XML document is segmented by a preset segmentation method to obtain a preset The document with the document structure is used as the target document.
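The segmentation step can be illustrated with a minimal Python sketch. It assumes the parser has already produced an XML document; the tag names `document`, `section`, `title`, and `p`, the sample text, and the helper `segment_xml` are hypothetical, since this passage does not specify the XML schema or the preset segmentation manner. The cascaded CRF parsing itself is not shown.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML as a parser might emit it; the tag names are assumptions.
SAMPLE_XML = """
<document>
  <section>
    <title>Coverage</title>
    <p>The policy covers accidental damage.</p>
    <p>Claims must be filed within 30 days.</p>
  </section>
  <section>
    <title>Exclusions</title>
    <p>Intentional damage is not covered.</p>
  </section>
</document>
"""


def segment_xml(xml_text):
    """Split an XML document into paragraph records that keep their section title."""
    root = ET.fromstring(xml_text)
    paragraphs = []
    for section in root.iter("section"):
        title = (section.findtext("title") or "").strip()
        for p in section.findall("p"):
            text = (p.text or "").strip()
            if text:  # skip empty paragraphs
                paragraphs.append({"section": title, "text": text})
    return paragraphs


# The list of paragraph records plays the role of the "target document".
target_document = segment_xml(SAMPLE_XML)
```

Keeping the section title with each paragraph is one possible reading of "a preset document structure"; other segmentation rules would fit the description equally well.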
The storage medium may be any computer-readable storage medium capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, or an optical disc.
The steps of the methods in the embodiments of the present application may be reordered, combined, or deleted according to actual needs. The units of the apparatus in the embodiments of the present application may be combined, divided, or deleted according to actual needs. In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit.
If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it may be stored in a storage medium. On this understanding, the technical solution of the present application in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, may be embodied as a software product. The computer software product is stored in a storage medium and includes instructions that cause a computer device (which may be a personal computer, a terminal, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application.
In the above embodiments, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
Obviously, those skilled in the art can make various changes and modifications to the present application without departing from its spirit and scope. Thus, if these modifications and variations fall within the scope of the claims of the present application and their technical equivalents, the present application is also intended to cover them.
The above are only specific embodiments of the present application, but the scope of protection of the present application is not limited thereto. Any person skilled in the art can readily conceive of various equivalent modifications or replacements within the technical scope disclosed in the present application, and these modifications or replacements shall fall within the scope of protection of the present application. Therefore, the scope of protection of the present application shall be determined by the scope of protection of the claims.

Claims (20)

  1. A method for automatically constructing FAQ question-answer pairs, comprising:
    obtaining a document to be read;
    parsing the document to be read and segmenting the parsed document to obtain a segmented document as a target document;
    selecting, according to a question to be answered and a preset screening model, a paragraph matching the question to be answered from the target document as a target paragraph; and
    generating an FAQ question-answer pair based on a preset reading comprehension model according to the target paragraph and the question to be answered.
  2. The method according to claim 1, wherein parsing the document to be read and segmenting the parsed document to obtain the segmented document as the target document comprises:
    parsing the document to be read with a cascaded CRF model to obtain an XML document; and
    segmenting the XML document in a preset segmentation manner to obtain a document with a preset document structure as the target document.
  3. The method according to claim 1, wherein selecting, according to the question to be answered and the preset screening model, the paragraph matching the question to be answered from the target document as the target paragraph comprises:
    encoding the target document according to the question to be answered and the preset screening model to obtain first paragraph text vectors;
    calculating, according to the question to be answered, the probability that each first paragraph text vector matches the question to be answered; and
    determining the paragraph corresponding to the first paragraph text vector with the highest probability as the paragraph matching the question to be answered, and using it as the target paragraph.
  4. The method according to claim 1, wherein generating the FAQ question-answer pair based on the preset reading comprehension model according to the target paragraph and the question to be answered comprises:
    encoding the target paragraph and the question to be answered separately to obtain a second paragraph text vector and a question text vector;
    encoding the second paragraph text vector and the question text vector to obtain a new text vector;
    encoding the new text vector according to a preset extraction model to obtain a target text vector; and
    performing a calculation on the target text vector to obtain the start and end positions of the answer to the question to be answered, thereby generating the FAQ question-answer pair.
  5. The method according to claim 1, wherein, after generating the FAQ question-answer pair based on the preset reading comprehension model according to the target paragraph and the question to be answered, the method further comprises:
    obtaining the FAQ question-answer pair and feeding the obtained FAQ question-answer pair back to a user.
  6. The method according to claim 5, wherein, after obtaining the FAQ question-answer pair and feeding the obtained FAQ question-answer pair back to the user, the method further comprises:
    determining whether a modification instruction sent by the user has been received;
    if the modification instruction sent by the user has been received, using the question entered by the user in the modification instruction as the question to be answered; and
    returning to the step of selecting, according to the question to be answered and the preset screening model, the paragraph matching the question to be answered from the target document as the target paragraph.
  7. The method according to claim 6, wherein, after determining whether the modification instruction sent by the user has been received, the method further comprises:
    if the modification instruction sent by the user has not been received, determining whether the question to be answered is a question in a preset database question template; and
    if the question to be answered is not a question in the preset database question template, updating the questions in the preset database question template according to the question to be answered.
  8. An apparatus for automatically constructing FAQ question-answer pairs, comprising:
    an obtaining unit, configured to obtain a document to be read;
    a parsing and segmentation unit, configured to parse the document to be read and segment the parsed document to obtain a segmented document as a target document;
    a screening unit, configured to select, according to a question to be answered and a preset screening model, a paragraph matching the question to be answered from the target document as a target paragraph; and
    a generating unit, configured to generate an FAQ question-answer pair based on a preset reading comprehension model according to the target paragraph and the question to be answered.
  9. The apparatus for automatically constructing FAQ question-answer pairs according to claim 8, wherein the screening unit comprises:
    a first encoding unit, configured to encode the target document according to the question to be answered and the preset screening model to obtain first paragraph text vectors;
    a calculation unit, configured to calculate, according to the question to be answered, the probability that each first paragraph text vector matches the question to be answered; and
    a determining unit, configured to determine the paragraph corresponding to the first paragraph text vector with the highest probability as the paragraph matching the question to be answered, and to use it as the target paragraph.
  10. The apparatus for automatically constructing FAQ question-answer pairs according to claim 8, wherein the generating unit comprises:
    a second encoding unit, configured to encode the target paragraph and the question to be answered separately to obtain a second paragraph text vector and a question text vector;
    a third encoding unit, configured to encode the second paragraph text vector and the question text vector to obtain a new text vector;
    a fourth encoding unit, configured to encode the new text vector according to a preset extraction model to obtain a target text vector; and
    a generating subunit, configured to perform a calculation on the target text vector to obtain the start and end positions of the answer to the question to be answered, thereby generating the FAQ question-answer pair.
  11. A computer device, comprising a memory and a processor connected to the memory, wherein the memory is configured to store a computer program, and the processor is configured to run the computer program stored in the memory to perform the following steps:
    obtaining a document to be read;
    parsing the document to be read and segmenting the parsed document to obtain a segmented document as a target document;
    selecting, according to a question to be answered and a preset screening model, a paragraph matching the question to be answered from the target document as a target paragraph; and
    generating an FAQ question-answer pair based on a preset reading comprehension model according to the target paragraph and the question to be answered.
  12. The computer device according to claim 11, wherein the step of parsing the document to be read and segmenting the parsed document to obtain the segmented document as the target document comprises:
    parsing the document to be read with a cascaded CRF model to obtain an XML document; and
    segmenting the XML document in a preset segmentation manner to obtain a document with a preset document structure as the target document.
  13. The computer device according to claim 11, wherein the step of selecting, according to the question to be answered and the preset screening model, the paragraph matching the question to be answered from the target document as the target paragraph comprises:
    encoding the target document according to the question to be answered and the preset screening model to obtain first paragraph text vectors;
    calculating, according to the question to be answered, the probability that each first paragraph text vector matches the question to be answered; and
    determining the paragraph corresponding to the first paragraph text vector with the highest probability as the paragraph matching the question to be answered, and using it as the target paragraph.
  14. The computer device according to claim 11, wherein the step of generating the FAQ question-answer pair based on the preset reading comprehension model according to the target paragraph and the question to be answered comprises:
    encoding the target paragraph and the question to be answered separately to obtain a second paragraph text vector and a question text vector;
    encoding the second paragraph text vector and the question text vector to obtain a new text vector;
    encoding the new text vector according to a preset extraction model to obtain a target text vector; and
    performing a calculation on the target text vector to obtain the start and end positions of the answer to the question to be answered, thereby generating the FAQ question-answer pair.
  15. The computer device according to claim 11, wherein, after the step of generating the FAQ question-answer pair based on the preset reading comprehension model according to the target paragraph and the question to be answered, the steps further comprise:
    obtaining the FAQ question-answer pair and feeding the obtained FAQ question-answer pair back to a user.
  16. The computer device according to claim 15, wherein, after the step of obtaining the FAQ question-answer pair and feeding the obtained FAQ question-answer pair back to the user, the steps further comprise:
    determining whether a modification instruction sent by the user has been received;
    if the modification instruction sent by the user has been received, using the question entered by the user in the modification instruction as the question to be answered; and
    returning to the step of selecting, according to the question to be answered and the preset screening model, the paragraph matching the question to be answered from the target document as the target paragraph.
  17. The computer device according to claim 16, wherein, after the step of determining whether the modification instruction sent by the user has been received, the steps further comprise:
    if the modification instruction sent by the user has not been received, determining whether the question to be answered is a question in a preset database question template; and
    if the question to be answered is not a question in the preset database question template, updating the questions in the preset database question template according to the question to be answered.
  18. A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the following steps:
    obtaining a document to be read;
    parsing the document to be read and segmenting the parsed document to obtain a segmented document as a target document;
    selecting, according to a question to be answered and a preset screening model, a paragraph matching the question to be answered from the target document as a target paragraph; and
    generating an FAQ question-answer pair based on a preset reading comprehension model according to the target paragraph and the question to be answered.
  19. The computer-readable storage medium according to claim 18, wherein the step of selecting, according to the question to be answered and the preset screening model, the paragraph matching the question to be answered from the target document as the target paragraph comprises:
    encoding the target document according to the question to be answered and the preset screening model to obtain first paragraph text vectors;
    calculating, according to the question to be answered, the probability that each first paragraph text vector matches the question to be answered; and
    determining the paragraph corresponding to the first paragraph text vector with the highest probability as the paragraph matching the question to be answered, and using it as the target paragraph.
  20. The computer-readable storage medium according to claim 18, wherein the step of generating the FAQ question-answer pair based on the preset reading comprehension model according to the target paragraph and the question to be answered comprises:
    encoding the target paragraph and the question to be answered separately to obtain a second paragraph text vector and a question text vector;
    encoding the second paragraph text vector and the question text vector to obtain a new text vector;
    encoding the new text vector according to a preset extraction model to obtain a target text vector; and
    performing a calculation on the target text vector to obtain the start and end positions of the answer to the question to be answered, thereby generating the FAQ question-answer pair.
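
The two selection steps recited in the claims (choosing the paragraph with the highest match probability, then locating the start and end positions of the answer) can be sketched in Python as follows. This is an illustrative sketch only, not the claimed models: `overlap_score` is a hypothetical word-overlap stand-in for the preset screening model, the softmax normalisation is an assumption about how match probabilities might be obtained, and the start and end scores are made-up numbers standing in for the output of the reading comprehension model.

```python
import math


def softmax(scores):
    """Turn raw match scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def overlap_score(question, paragraph):
    """Hypothetical stand-in for the screening model: count shared words."""
    q_words = set(question.lower().split())
    p_words = set(paragraph.lower().split())
    return float(len(q_words & p_words))


def select_target_paragraph(question, paragraphs, score_fn):
    """Return the paragraph with the highest match probability, and that probability."""
    probs = softmax([score_fn(question, p) for p in paragraphs])
    best = max(range(len(paragraphs)), key=probs.__getitem__)
    return paragraphs[best], probs[best]


def best_span(start_scores, end_scores, max_len=30):
    """Pick (start, end) with start <= end that maximises start + end score."""
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            if s + end_scores[j] > best_score:
                best, best_score = (i, j), s + end_scores[j]
    return best


paragraphs = [
    "The policy covers accidental damage.",
    "Claims must be filed within 30 days.",
]
question = "When must claims be filed?"

target, prob = select_target_paragraph(question, paragraphs, overlap_score)

# Made-up per-token scores; a trained model would produce these.
tokens = target.split()
start_scores = [0.1, 0.0, 0.0, 0.2, 2.5, 1.0, 0.1]
end_scores = [0.0, 0.0, 0.1, 0.3, 0.2, 0.9, 2.7]

start, end = best_span(start_scores, end_scores)
faq_pair = {"question": question, "answer": " ".join(tokens[start:end + 1])}
```

In this toy run the second paragraph is selected and the extracted answer is "within 30 days.", which together with the question forms one FAQ question-answer pair.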
PCT/CN2019/118442 2019-10-12 2019-11-14 Automatic construction method and apparatus for faq question-answer pair, and computer device and storage medium WO2021068352A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910969443.4A CN111046152B (en) 2019-10-12 2019-10-12 Automatic FAQ question-answer pair construction method and device, computer equipment and storage medium
CN201910969443.4 2019-10-12

Publications (1)

Publication Number Publication Date
WO2021068352A1 true WO2021068352A1 (en) 2021-04-15

Family

ID=70231764

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/118442 WO2021068352A1 (en) 2019-10-12 2019-11-14 Automatic construction method and apparatus for faq question-answer pair, and computer device and storage medium

Country Status (2)

Country Link
CN (1) CN111046152B (en)
WO (1) WO2021068352A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113435213A (en) * 2021-07-09 2021-09-24 支付宝(杭州)信息技术有限公司 Method and device for returning answers aiming at user questions and knowledge base
CN114330718A (en) * 2021-12-23 2022-04-12 北京百度网讯科技有限公司 Method and device for extracting causal relationship and electronic equipment
CN117290483A (en) * 2023-10-09 2023-12-26 成都明途科技有限公司 Answer determination method, model training method, device and electronic equipment
CN113435213B (en) * 2021-07-09 2024-04-30 支付宝(杭州)信息技术有限公司 Method and device for returning answers to user questions and knowledge base

Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111597821B (en) * 2020-05-13 2021-03-19 北京嘀嘀无限科技发展有限公司 Method and device for determining response probability
CN113779203A (en) * 2020-06-09 2021-12-10 北京金山数字娱乐科技有限公司 Method and device for generating paragraph set and inference method and device
CN111858879B (en) * 2020-06-18 2024-04-05 达观数据有限公司 Question and answer method and system based on machine reading understanding, storage medium and computer equipment
CN111858878B (en) * 2020-06-18 2023-12-22 达观数据有限公司 Method, system and storage medium for automatically extracting answer from natural language text
CN112434149B (en) * 2020-06-24 2023-09-19 北京金山数字娱乐科技有限公司 Information extraction method, information extraction device, information extraction equipment and storage medium
CN114238787B (en) * 2020-08-31 2024-03-29 腾讯科技(深圳)有限公司 Answer processing method and device
CN112149838A (en) * 2020-09-03 2020-12-29 第四范式(北京)技术有限公司 Method, device, electronic equipment and storage medium for realizing automatic model building
CN112464641B (en) * 2020-10-29 2023-01-03 平安科技(深圳)有限公司 BERT-based machine reading understanding method, device, equipment and storage medium
CN112347226B (en) * 2020-11-06 2023-05-26 平安科技(深圳)有限公司 Document knowledge extraction method, device, computer equipment and readable storage medium
CN112634889B (en) * 2020-12-15 2023-08-08 深圳平安智慧医健科技有限公司 Electronic case input method, device, terminal and medium based on artificial intelligence
CN112507096A (en) * 2020-12-16 2021-03-16 平安银行股份有限公司 Document question-answer pair splitting method and device, electronic equipment and storage medium
CN112818093B (en) * 2021-01-18 2023-04-18 平安国际智慧城市科技股份有限公司 Evidence document retrieval method, system and storage medium based on semantic matching
CN112800202A (en) * 2021-02-05 2021-05-14 北京金山数字娱乐科技有限公司 Document processing method and device
CN113010679A (en) * 2021-03-18 2021-06-22 平安科技(深圳)有限公司 Question and answer pair generation method, device and equipment and computer readable storage medium
CN113065332B (en) * 2021-04-22 2023-05-12 深圳壹账通智能科技有限公司 Text processing method, device, equipment and storage medium based on reading model
CN114117022B (en) * 2022-01-26 2022-05-06 杭州远传新业科技有限公司 FAQ similarity problem generation method and system
CN114528821B (en) * 2022-04-25 2022-09-06 中国科学技术大学 Understanding-assisted dialog system manual evaluation method and device and storage medium
CN117556906B (en) * 2024-01-11 2024-04-05 卓世智星(天津)科技有限公司 Question-answer data set generation method and device, electronic equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101377777A (en) * 2007-09-03 2009-03-04 北京百问百答网络技术有限公司 Automatic inquiring and answering method and system
US20150178623A1 (en) * 2013-12-23 2015-06-25 International Business Machines Corporation Automatically Generating Test/Training Questions and Answers Through Pattern Based Analysis and Natural Language Processing Techniques on the Given Corpus for Quick Domain Adaptation
US20160292204A1 (en) * 2015-03-30 2016-10-06 Avaya Inc. System and method for compiling and dynamically updating a collection of frequently asked questions
CN107220296A (en) * 2017-04-28 2017-09-29 北京拓尔思信息技术股份有限公司 The generation method of question and answer knowledge base, the training method of neutral net and equipment
CN109918487A (en) * 2019-01-28 2019-06-21 平安科技(深圳)有限公司 Intelligent answer method and system based on network encyclopedia
CN110196901A (en) * 2019-06-28 2019-09-03 北京百度网讯科技有限公司 Construction method, device, computer equipment and the storage medium of conversational system

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110125734A1 (en) * 2009-11-23 2011-05-26 International Business Machines Corporation Questions and answers generation
CN110019719B (en) * 2017-12-15 2023-04-25 微软技术许可有限责任公司 Assertion-based question and answer
CN110083692B (en) * 2019-04-22 2023-01-24 齐鲁工业大学 Text interactive matching method and device for financial knowledge question answering

Also Published As

Publication number Publication date
CN111046152B (en) 2023-09-29
CN111046152A (en) 2020-04-21

Similar Documents

Publication Publication Date Title
WO2021068352A1 (en) Automatic construction method and apparatus for faq question-answer pair, and computer device and storage medium
US11487954B2 (en) Multi-turn dialogue response generation via mutual information maximization
US10515155B2 (en) Conversational agent
US10503834B2 (en) Template generation for a conversational agent
US11948058B2 (en) Utilizing recurrent neural networks to recognize and extract open intent from text inputs
US11468239B2 (en) Joint intent and entity recognition using transformer models
WO2019084867A1 (en) Automatic answering method and apparatus, storage medium, and electronic device
US20190236155A1 (en) Feedback for a conversational agent
US11934781B2 (en) Systems and methods for controllable text summarization
US20220172040A1 (en) Training a machine-learned model based on feedback
WO2020199595A1 (en) Long text classification method and device employing bag-of-words model, computer apparatus, and storage medium
US11170765B2 (en) Contextual multi-channel speech to text
CN114118100A (en) Method, apparatus, device, medium and program product for generating dialogue statements
EP3525107A1 (en) Conversational agent
CN115203206A (en) Data content searching method and device, computer equipment and readable storage medium
CN114970538A (en) Text error correction method and device
CN113378543B (en) Data analysis method, method for training data analysis model and electronic equipment
CN116957047B (en) Sampling network updating method, device, equipment and medium
CN112269856B (en) Text similarity calculation method and device, electronic equipment and storage medium
CN113284498B (en) Client intention identification method and device
EP3522037A1 (en) Feedback for a conversational agent
CN115374261A (en) Work order filling method, work order filling device, electronic equipment and medium
CN116484857A (en) Text generation method, apparatus, computer device and storage medium
CN116821309A (en) Context construction method based on large language model

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 19948667; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 19948667; Country of ref document: EP; Kind code of ref document: A1)