CN110955761A - Method and device for acquiring question and answer data in document, computer equipment and storage medium - Google Patents

Method and device for acquiring question and answer data in document, computer equipment and storage medium

Info

Publication number
CN110955761A
Authority
CN
China
Prior art keywords
document
candidate
answer
question
questioning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910970168.8A
Other languages
Chinese (zh)
Inventor
朱昱锦
徐国强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
OneConnect Smart Technology Co Ltd
OneConnect Financial Technology Co Ltd Shanghai
Original Assignee
OneConnect Financial Technology Co Ltd Shanghai
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by OneConnect Financial Technology Co Ltd Shanghai filed Critical OneConnect Financial Technology Co Ltd Shanghai
Priority to CN201910970168.8A priority Critical patent/CN110955761A/en
Publication of CN110955761A publication Critical patent/CN110955761A/en
Priority to PCT/CN2020/106124 priority patent/WO2021068615A1/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/338 Presentation of query results
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to the field of artificial intelligence and can be applied in smart-city scenarios. It particularly relates to a method and device for acquiring question and answer data in a document, a computer device, and a storage medium. The method in one embodiment comprises: acquiring a document to be processed and an input document question; identifying entity words in the document question through an entity word identification technology and taking the identified entity words as keywords of the document question; performing synonym expansion and semantic expansion on the keywords to obtain question factors; splitting the document to be processed into a plurality of document segments and taking the document segments containing the question factors as candidate segments; searching in the candidate segments based on the question factors to obtain candidate answers to the document question; and ranking the candidate answers according to similarity and taking the top-ranked candidate answer as the answer to the document question. Because the question factors cover both the synonym and the semantic level, the candidate answers have wide coverage, and the accuracy of the obtained answer can be effectively improved.

Description

Method and device for acquiring question and answer data in document, computer equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence technology, and in particular, to a method and apparatus for acquiring question and answer data in a document, a computer device, and a storage medium.
Background
A document is written material that records information and expresses intent. Documents are typically drafted according to a prescribed style and requirements in the activities of government bodies, organizations, enterprises, public institutions, and individuals. In scenarios such as archiving, reviewing, and updating document repositories, where large numbers of documents must be reviewed quickly, there is a pressing need to extract user-defined question and answer information from documents.
Conventional approaches to acquiring question and answer information from documents are generally based on keyword retrieval. Keyword retrieval operates only at the lexical level, and some of the retrieved content is only loosely related to the answer, so the accuracy of the question and answer information obtained this way is low.
Disclosure of Invention
In view of the above, there is a need to provide a method, an apparatus, a computer device, and a storage medium for acquiring question and answer data in a document that can improve accuracy.
A method for acquiring question and answer data in a document, the method comprising:
acquiring a document to be processed and an input document question;
identifying entity words in the document question through an entity word identification technology, and taking the identified entity words as keywords of the document question;
performing synonym expansion and semantic expansion on the keywords to obtain question factors;
splitting the document to be processed to obtain a plurality of document segments, and taking the document segments containing the question factors as candidate segments;
searching in the candidate segments based on the question factors to obtain candidate answers to the document question;
and ranking the candidate answers according to similarity, and taking the top-ranked candidate answer as the answer to the document question.
In one embodiment, the splitting the document to be processed to obtain a plurality of document segments includes:
converting the document to be processed into a character string, and splitting the document to be processed into different document segments by natural paragraphs when the length of the character string is greater than a preset length and the document to be processed comprises a plurality of natural paragraphs;
and splitting the document to be processed into different document segments based on a preset sliding window length and a preset stride when the length of the character string is less than or equal to the preset length.
In one embodiment, the searching in the candidate segments based on the question factors to obtain candidate answers to the document question includes:
acquiring a trained reading comprehension task model, wherein the reading comprehension task model comprises an embedding layer, an embedding encoding layer, a context-query attention layer, a model encoding layer, and an output layer which are connected in sequence;
inputting the question factors and the candidate segment into the embedding layer, and encoding the question factors and the candidate segment through the embedding encoding layer to obtain a question factor encoding block and a candidate segment encoding block;
acquiring the similarity between the question factor encoding block and the candidate segment encoding block through the context-query attention layer;
obtaining predicted positions of the candidate answer through the model encoding layer based on the similarity between the question factor encoding block and the candidate segment encoding block;
and calculating, by decoding in the output layer, the probability that each predicted position is a candidate answer start position and the probability that it is a candidate answer end position, taking a predicted position whose probability is greater than a preset first threshold as the candidate answer start position, and taking a predicted position whose probability is greater than a preset second threshold as the candidate answer end position.
In one embodiment, the ranking the candidate answers according to similarity and taking the top-ranked candidate answer as the answer to the document question includes:
performing pairwise similarity matching among the multiple candidate answers corresponding to a single candidate segment, and taking the candidate answer with the highest mean similarity as the candidate answer of that candidate segment;
taking the mean similarity between the candidate answer of the single candidate segment and the other candidate answers of that segment as the candidate weight of the single candidate segment;
obtaining the matching degree between the single candidate segment and the question factors, and obtaining the weight of the candidate answer according to the matching degree and the candidate weight of the single candidate segment;
and obtaining the weight corresponding to the candidate answer of each candidate segment, and taking the candidate answer with the highest weight as the answer to the document question.
In one embodiment, the obtaining the matching degree between the single candidate segment and the question factors includes:
acquiring a first number of words produced by the synonym expansion and a second number of words produced by the semantic expansion;
and inputting the ratio of the first number of words to the second number of words, together with the single candidate segment, into an Elasticsearch retrieval model to obtain the matching degree between the single candidate segment and the question factors.
An apparatus for acquiring question and answer data in a document, the apparatus comprising:
an information acquisition module, configured to acquire a document to be processed and an input document question;
a keyword acquisition module, configured to identify entity words in the document question through an entity word identification technology and take the identified entity words as keywords of the document question;
a question factor acquisition module, configured to perform synonym expansion and semantic expansion on the keywords to obtain question factors;
a candidate segment acquisition module, configured to split the document to be processed into a plurality of document segments and take the document segments containing the question factors as candidate segments;
a candidate answer acquisition module, configured to search in the candidate segments based on the question factors to obtain candidate answers to the document question;
and a candidate answer processing module, configured to rank the candidate answers according to similarity and take the top-ranked candidate answer as the answer to the document question.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
acquiring a document to be processed and an input document question;
identifying entity words in the document question through an entity word identification technology, and taking the identified entity words as keywords of the document question;
performing synonym expansion and semantic expansion on the keywords to obtain question factors;
splitting the document to be processed to obtain a plurality of document segments, and taking the document segments containing the question factors as candidate segments;
searching in the candidate segments based on the question factors to obtain candidate answers to the document question;
and ranking the candidate answers according to similarity, and taking the top-ranked candidate answer as the answer to the document question.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
acquiring a document to be processed and an input document question;
identifying entity words in the document question through an entity word identification technology, and taking the identified entity words as keywords of the document question;
performing synonym expansion and semantic expansion on the keywords to obtain question factors;
splitting the document to be processed to obtain a plurality of document segments, and taking the document segments containing the question factors as candidate segments;
searching in the candidate segments based on the question factors to obtain candidate answers to the document question;
and ranking the candidate answers according to similarity, and taking the top-ranked candidate answer as the answer to the document question.
In the above method, apparatus, computer device, and storage medium for acquiring question and answer data in a document, entity words in the input document question are identified through an entity word identification technology and taken as keywords of the document question; synonym expansion and semantic expansion are performed on the keywords to obtain question factors that cover both the synonym and the semantic level; the document to be processed is split into a plurality of document segments, and the document segments containing the question factors are taken as candidate segments, so that the candidate segments cover a wider range; candidate answers to the document question are obtained by searching in the candidate segments based on the question factors; and the candidate answers are ranked according to similarity, with the top-ranked candidate answer taken as the answer to the document question. Because the candidate answers have wide coverage and are then screened to determine the final answer, the accuracy of the obtained answer can be effectively improved.
Drawings
FIG. 1 is a diagram of an application environment of a method for acquiring question and answer data in a document in one embodiment;
FIG. 2 is a schematic flowchart of a method for acquiring question and answer data in a document in one embodiment;
FIG. 3 is a schematic flowchart of the candidate answer acquisition step in one embodiment;
FIG. 4 is a schematic flowchart of the candidate answer ranking step in one embodiment;
FIG. 5 is a block diagram of an apparatus for acquiring question and answer data in a document in one embodiment;
FIG. 6 is a diagram of the internal structure of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The method for acquiring question and answer data in a document provided by the present application can be applied in the application environment shown in FIG. 1, in which the terminal 102 communicates with the server 104 via a network. The server 104 acquires the document to be processed and the input document question from the terminal 102, identifies entity words in the document question through an entity word identification technology, and takes the identified entity words as keywords of the document question; performs synonym expansion and semantic expansion on the keywords to obtain question factors; and searches in the document to be processed based on the question factors to obtain the answer to the document question. The terminal 102 may be, but is not limited to, various personal computers, notebook computers, smart phones, and tablet computers, and the server 104 may be implemented by an independent server or a server cluster formed by a plurality of servers.
In one embodiment, as shown in fig. 2, a method for acquiring question and answer data in a document is provided, which is described by taking the application of the method to the server in fig. 1 as an example, and includes the following steps:
step 202, obtaining the document to be processed and the input document problem.
The document to be processed can be uploaded by the user through the user terminal and is the document against which the user wants an answer. The document question is a question posed by the user about the document to be processed. For example, when asking about the lawyer fee in a certain document, the document question may be "what is the lawyer fee" or "the amount of the lawyer fee". The question may also carry supplementary information, such as phrases or sentences that, in experience, often appear together with the question, or alternative names of words in the question; for example, words that often appear before or after "lawyer fee" may include "payment", "undertake", and the like.
Step 204, identifying entity words in the document question through an entity word identification technology, and taking the identified entity words as keywords of the document question.
Identifying the entity words in the document question through an entity word identification technology specifically means first performing word segmentation on the input document question. The segmentation may be performed with a word segmentation tool such as jieba, SnowNLP, PyNLPIR, or other segmentation tools, or with methods such as the maximum matching method or the reverse maximum matching method. For example, segmenting the document question "the amount of the lawyer fee" yields "lawyer fee / amount". After word segmentation, part-of-speech tagging is performed, that is, the segmented words are classified into categories such as nouns, verbs, and adjectives; the tagging can be based on probability statistics or on preset rules. Entity words are words representing names of persons, places, organizations, and the like, and are typically nouns. Taking the segmentation result "lawyer fee / amount" as an example, the words tagged as nouns are extracted as the keywords of the document question, yielding the keywords "lawyer fee" and "amount".
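By way of illustration and not limitation, the keyword extraction described above may be sketched in Python as follows, assuming the jieba word segmentation tool is used; the function name and the example question are illustrative only:

import jieba.posseg as pseg

def extract_keywords(document_question: str) -> list:
    """Segment the document question and keep noun-like words as keywords."""
    keywords = []
    for word, flag in pseg.cut(document_question):
        # jieba part-of-speech flags beginning with 'n' denote nouns
        # (e.g. n, nr person name, ns place name, nt organization name).
        if flag.startswith("n"):
            keywords.append(word)
    return keywords

# Example question: "律师费的数额" ("the amount of the lawyer fee").
# Assuming jieba tags the fee and amount terms as nouns, they are returned as keywords.
print(extract_keywords("律师费的数额"))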
Step 206, performing synonym expansion and semantic expansion on the keywords to obtain question factors.
Synonym expansion can be performed on the keywords based on a preset synonym dictionary: each keyword is looked up in the dictionary, and the synonyms corresponding to the keyword are returned when it is found. Semantic expansion is performed on the keywords based on a preset common-sense knowledge base: the synonymy relations of the knowledge network are applied, and all words synonymous with the keyword are obtained through the synonymy search of the knowledge network.
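By way of illustration and not limitation, the construction of question factors from keywords may be sketched as follows, assuming the preset synonym dictionary and the common-sense knowledge base are available as simple mappings; the dictionary contents shown are illustrative only:

# Preset synonym dictionary and common-sense knowledge base (illustrative contents).
SYNONYM_DICT = {"lawyer fee": ["attorney fee", "agency fee"]}
KNOWLEDGE_BASE = {"lawyer fee": ["litigation cost", "legal service cost"]}

def build_question_factors(keywords):
    """Expand each keyword at the synonym level and the semantic level."""
    factors = set(keywords)
    for kw in keywords:
        factors.update(SYNONYM_DICT.get(kw, []))     # synonym expansion
        factors.update(KNOWLEDGE_BASE.get(kw, []))   # semantic expansion via the knowledge base
    return list(factors)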
Step 208, splitting the document to be processed to obtain a plurality of document segments, and taking the document segments containing the question factors as candidate segments.
In one embodiment, splitting the document to be processed to obtain a plurality of document segments includes: converting the document to be processed into a character string; when the length of the character string is greater than a preset length and the document to be processed comprises a plurality of natural paragraphs, splitting the document to be processed into different document segments by natural paragraphs; and when the length of the character string is less than or equal to the preset length, splitting the document to be processed into different document segments based on a preset sliding window length and a preset stride. For example, when the character string exceeds 10,000 characters and the document contains multiple natural paragraphs, the document is split directly by natural paragraphs. When the character string is short, the document is split with a sliding window and a stride; for a short document of about 300 characters, for instance, the sliding window length may be defined as 5 sentences and the stride as 2 sentences, that is, every 5 sentences form a document segment and a new segment starts every 2 sentences. An example of such splitting is sketched below.
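By way of illustration and not limitation, the splitting rule may be sketched as follows; the length threshold, the window length, the stride, and the use of newlines as paragraph boundaries are assumptions matching the example values above:

import re

def split_document(text: str, preset_length: int = 10000,
                   window: int = 5, stride: int = 2):
    """Split a document into segments by paragraphs or by a sentence sliding window."""
    if len(text) > preset_length and "\n" in text:
        # Long document: split by natural paragraphs (assumed to be newline-delimited).
        return [p for p in text.split("\n") if p.strip()]
    # Short document: split into sentences, then group them with a sliding window,
    # so every `window` sentences form a segment and a new segment starts every `stride` sentences.
    sentences = [s for s in re.split(r"(?<=[。！？.!?])", text) if s.strip()]
    return ["".join(sentences[i:i + window])
            for i in range(0, max(len(sentences) - window + 1, 1), stride)]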
Step 210, searching in the candidate segments based on the question factors to obtain candidate answers to the document question.
The question factors and the candidate segments can be input into a standard reading comprehension task model, QANet, which outputs the candidate answers. When there are M question factors, the document to be processed contains N document segments that include question factors, and each question returns one answer, a total of M × N answers are generated.
In one embodiment, as shown in FIG. 3, searching in the candidate segments based on the question factors to obtain candidate answers to the document question includes: Step 302, acquiring a trained reading comprehension task model, wherein the reading comprehension task model comprises an embedding layer, an embedding encoding layer, a context-query attention layer, a model encoding layer, and an output layer which are connected in sequence; Step 304, inputting the question factors and the candidate segments into the embedding layer, and encoding the question factors and the candidate segments through the embedding encoding layer to obtain question factor encoding blocks and candidate segment encoding blocks; Step 306, acquiring the similarity between the question factor encoding block and the candidate segment encoding block through the context-query attention layer; Step 308, obtaining the predicted positions of the candidate answer through the model encoding layer based on the similarity between the question factor encoding block and the candidate segment encoding block; and Step 310, calculating, by decoding in the output layer, the probability that each predicted position is a candidate answer start position and the probability that it is a candidate answer end position, taking a predicted position whose probability is greater than a preset first threshold as the candidate answer start position, and taking a predicted position whose probability is greater than a preset second threshold as the candidate answer end position. The reading comprehension task model QANet contains five major components: an embedding layer, an embedding encoding layer, a context-query attention layer, a model encoding layer, and an output layer. The embedding encoder and the model encoder of QANet abandon the complex recursive structure of an RNN (Recurrent Neural Network) and instead build the network from convolutions and a self-attention mechanism, which greatly accelerates training and inference and allows input words to be processed in parallel. The candidate segments and the question factors are input to the embedding layer of the reading comprehension task model; the embedding encoding layer encodes the candidate segments and the question factors separately; the context-query attention layer then learns the similarity between the two encoding blocks; the model encoding layer encodes the vectors passed through the attention layer to obtain the predicted positions of the candidate answers; and finally the output layer decodes these positions to compute the probability that each predicted position is the start or the end of a candidate answer to the document question. Assuming the candidate segment C contains n words, it can be expressed as C = {c_1, c_2, ..., c_n}, and the question factor Q contains m words, Q = {q_1, q_2, ..., q_m}; the output is a span set S = {c_i, c_(i+1), ..., c_(i+j)}, where a span is a continuous segment extracted from the candidate segment as the answer. A sketch of the output-layer decoding is given below.
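By way of illustration and not limitation, the output-layer decoding of Step 310 may be sketched as follows, assuming per-position start and end probabilities have already been produced by the reading comprehension task model; the threshold values and the rule of pairing each start with the nearest qualifying end position are illustrative only:

def extract_spans(start_probs, end_probs, tokens,
                  start_threshold=0.5, end_threshold=0.5):
    """Return candidate answer spans c_i .. c_(i+j) from the candidate segment tokens."""
    starts = [i for i, p in enumerate(start_probs) if p > start_threshold]
    ends = [j for j, p in enumerate(end_probs) if p > end_threshold]
    spans = []
    for i in starts:
        # Pair each start position with the nearest end position at or after it.
        later = [j for j in ends if j >= i]
        if later:
            # Join without separators, which suits character/word tokens of Chinese text.
            spans.append("".join(tokens[i:later[0] + 1]))
    return spans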
Step 212, ranking the candidate answers according to similarity, and taking the top-ranked candidate answer as the answer to the document question.
In one embodiment, as shown in FIG. 4, ranking the candidate answers according to similarity and taking the top-ranked candidate answer as the answer to the document question includes: Step 402, performing pairwise similarity matching among the multiple candidate answers corresponding to a single candidate segment, and taking the candidate answer with the highest mean similarity as the candidate answer of that candidate segment; Step 404, taking the mean similarity between the candidate answer of the single candidate segment and the other candidate answers of that segment as the candidate weight of the single candidate segment; Step 406, obtaining the matching degree between the single candidate segment and the question factors, and obtaining the weight of the candidate answer according to the matching degree and the candidate weight of the single candidate segment; and Step 408, obtaining the weight corresponding to the candidate answer of each candidate segment, and taking the candidate answer with the highest weight as the answer to the document question. The pairwise similarity matching among the answers obtained from each candidate segment can be implemented with the FuzzyWuzzy model, which computes the matching degree between character strings: the answers obtained from each candidate segment are first converted into character strings, and the similarity between every two answers is then obtained by calling the FuzzyWuzzy matching functions on the converted strings. Specifically, the matching degree and the candidate weight of each candidate segment may be normalized, and a weighted sum of the normalized matching degree and candidate weight gives the weight of each candidate answer; the normalization simplifies the calculation and thereby improves the efficiency of answer acquisition. More specifically, the matching degree and the candidate weight may be combined in a weighted sum at a ratio of 6.5:3.5, a ratio at which repeated tests have shown the obtained answers to be more accurate. A sketch of this ranking step is given below.
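By way of illustration and not limitation, the ranking step may be sketched as follows using the fuzzywuzzy library; the helper names are illustrative, and it is assumed that the matching degree and the candidate weight passed to the final weighting have already been min-max normalized to [0, 1]:

from fuzzywuzzy import fuzz

def segment_candidate(answers):
    """Pick the answer with the highest mean similarity to the other answers of one segment."""
    def mean_sim(a):
        others = [b for b in answers if b is not a]
        # fuzz.ratio returns a similarity score on a 0-100 scale.
        return sum(fuzz.ratio(a, b) for b in others) / max(len(others), 1)
    best = max(answers, key=mean_sim)
    return best, mean_sim(best)   # candidate answer and its (un-normalized) candidate weight

def answer_weight(matching_degree, candidate_weight):
    """Weighted sum at the 6.5:3.5 ratio; both inputs assumed normalized to [0, 1]."""
    return 0.65 * matching_degree + 0.35 * candidate_weight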
In the above method for acquiring question and answer data in a document, entity words in the input document question are identified through an entity word identification technology and taken as keywords of the document question; synonym expansion and semantic expansion are performed on the keywords to obtain question factors that cover both the synonym and the semantic level; the document to be processed is split into a plurality of document segments, and the document segments containing the question factors are taken as candidate segments, so that the candidate segments cover a wider range; candidate answers to the document question are obtained by searching in the candidate segments based on the question factors; and the candidate answers are ranked according to similarity, with the top-ranked candidate answer taken as the answer to the document question. Because the candidate answers have wide coverage and are then screened to determine the final answer, the accuracy of the obtained answer can be effectively improved.
In one embodiment, obtaining the matching degree between a single candidate segment and the question factors includes: acquiring a first number of words produced by the synonym expansion and a second number of words produced by the semantic expansion; and inputting the ratio of the first number of words to the second number of words, together with the single candidate segment, into an Elasticsearch retrieval model to obtain the matching degree between the single candidate segment and the question factors. For example, each document segment can be stored in an Elasticsearch model, which is used to rapidly search the stored documents, with each document segment treated as a document. The Elasticsearch retrieval model may first perform a coarse extraction over the document segments according to the retrieval statement, that is, the question factors: each document segment is traversed with the question factors, the document segments that do not contain any question factor are excluded, and the remaining segments are the candidate segments, i.e. the document segments containing the question factors. The model then returns the matching degree between each candidate segment and the question factors according to the ratio between the synonym-expanded words and the semantically expanded words in the question factors; this ratio may specifically be 3:1, a ratio at which repeated tests have shown the obtained answers to be more accurate. The candidate segments may be placed in a candidate list for output; when the Elasticsearch retrieval model returns a candidate segment, it also returns the corresponding matching degree, which may specifically be a matching score. The score is min-max normalized and stored in a score list. A sketch of this retrieval step is given below.
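By way of illustration and not limitation, the coarse retrieval with Elasticsearch may be sketched as follows using the elasticsearch Python client, with synonym-expanded terms boosted roughly 3:1 over semantically expanded terms; the index name, field name, and client address are assumptions for illustration:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def candidate_segments(synonym_terms, semantic_terms, index="document_segments"):
    """Return (segment text, matching score) pairs for segments containing question factors."""
    should = ([{"match": {"text": {"query": t, "boost": 3.0}}} for t in synonym_terms] +
              [{"match": {"text": {"query": t, "boost": 1.0}}} for t in semantic_terms])
    resp = es.search(index=index,
                     query={"bool": {"should": should, "minimum_should_match": 1}})
    # Each hit carries a relevance score that serves as the matching degree of the segment.
    return [(hit["_source"]["text"], hit["_score"]) for hit in resp["hits"]["hits"]]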
It should be understood that although the steps in the flowcharts of FIGS. 2-4 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise, the steps are not strictly ordered and may be performed in other orders. Moreover, at least some of the steps in FIGS. 2-4 may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times, and these sub-steps or stages are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in FIG. 5, an apparatus for acquiring question and answer data in a document is provided, including: an information acquisition module 502, a keyword acquisition module 504, a question factor acquisition module 506, a candidate segment acquisition module 508, a candidate answer acquisition module 510, and a candidate answer processing module 512. The information acquisition module is configured to acquire a document to be processed and an input document question. The keyword acquisition module is configured to identify entity words in the document question through an entity word identification technology and take the identified entity words as keywords of the document question. The question factor acquisition module is configured to perform synonym expansion and semantic expansion on the keywords to obtain question factors. The candidate segment acquisition module is configured to split the document to be processed into a plurality of document segments and take the document segments containing the question factors as candidate segments. The candidate answer acquisition module is configured to search in the candidate segments based on the question factors to obtain candidate answers to the document question. The candidate answer processing module is configured to rank the candidate answers according to similarity and take the top-ranked candidate answer as the answer to the document question.
In one embodiment, the candidate segment acquisition module comprises: a first splitting unit, configured to convert the document to be processed into a character string and, when the length of the character string is greater than a preset length and the document to be processed comprises a plurality of natural paragraphs, split the document to be processed into different document segments by natural paragraphs; and a second splitting unit, configured to split the document to be processed into different document segments based on a preset sliding window length and a preset stride when the length of the character string is less than or equal to the preset length.
In one embodiment, the candidate answer acquisition module includes: a model acquisition unit, configured to acquire a trained reading comprehension task model comprising an embedding layer, an embedding encoding layer, a context-query attention layer, a model encoding layer, and an output layer which are connected in sequence; an encoding unit, configured to input the question factors and the candidate segments into the embedding layer, and encode the question factors and the candidate segments through the embedding encoding layer to obtain question factor encoding blocks and candidate segment encoding blocks; an encoding block processing unit, configured to acquire the similarity between the question factor encoding block and the candidate segment encoding block through the context-query attention layer; a position acquisition unit, configured to obtain the predicted positions of the candidate answer through the model encoding layer based on the similarity between the question factor encoding block and the candidate segment encoding block; and a position processing unit, configured to calculate, by decoding in the output layer, the probability that each predicted position is a candidate answer start position and the probability that it is a candidate answer end position, take a predicted position whose probability is greater than a preset first threshold as the candidate answer start position, and take a predicted position whose probability is greater than a preset second threshold as the candidate answer end position.
In one embodiment, the candidate answer processing module is further configured to perform pairwise similarity matching among the multiple candidate answers corresponding to a single candidate segment and take the candidate answer with the highest mean similarity as the candidate answer of that candidate segment; take the mean similarity between the candidate answer of the single candidate segment and the other candidate answers of that segment as the candidate weight of the single candidate segment; obtain the matching degree between the single candidate segment and the question factors, and obtain the weight of the candidate answer according to the matching degree and the candidate weight of the single candidate segment; and obtain the weight corresponding to the candidate answer of each candidate segment, and take the candidate answer with the highest weight as the answer to the document question.
In one embodiment, the candidate answer processing module is further configured to acquire a first number of words produced by the synonym expansion and a second number of words produced by the semantic expansion, and input the ratio of the first number of words to the second number of words, together with the single candidate segment, into an Elasticsearch retrieval model to obtain the matching degree between the single candidate segment and the question factors.
For specific limitations of the apparatus for acquiring question and answer data in a document, reference may be made to the above limitations of the method for acquiring question and answer data in a document, which are not repeated here. All or some of the modules in the above apparatus may be implemented in software, hardware, or a combination thereof. The modules may be embedded in hardware in, or independent of, a processor in the computer device, or stored in software in a memory of the computer device, so that the processor can invoke and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server whose internal structure may be as shown in FIG. 6. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used to store data such as documents to be processed, document questions, question factors, and candidate answers. The network interface of the computer device is used to communicate with an external terminal through a network connection. The computer program, when executed by the processor, implements a method for acquiring question and answer data in a document.
Those skilled in the art will appreciate that the structure shown in FIG. 6 is merely a block diagram of part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, which includes a memory storing a computer program and a processor that implements the steps of the method for acquiring question and answer data in a document in any one of the above embodiments when executing the computer program.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, implements the steps of the method for acquiring question and answer data in a document in any one of the above embodiments.
Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments may be implemented by a computer program instructing relevant hardware; the computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as the combinations are not contradictory, they should be considered within the scope of this specification.
The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art may make several variations and modifications without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A method for acquiring question and answer data in a document, the method comprising:
acquiring a document to be processed and an input document question;
identifying entity words in the document question through an entity word identification technology, and taking the identified entity words as keywords of the document question;
performing synonym expansion and semantic expansion on the keywords to obtain question factors;
splitting the document to be processed to obtain a plurality of document segments, and taking the document segments containing the question factors as candidate segments;
searching in the candidate segments based on the question factors to obtain candidate answers to the document question;
and ranking the candidate answers according to similarity, and taking the top-ranked candidate answer as the answer to the document question.
2. The method of claim 1, wherein the splitting the document to be processed to obtain a plurality of document segments comprises:
converting the document to be processed into a character string, and splitting the document to be processed into different document segments by natural paragraphs when the length of the character string is greater than a preset length and the document to be processed comprises a plurality of natural paragraphs;
and splitting the document to be processed into different document segments based on a preset sliding window length and a preset stride when the length of the character string is less than or equal to the preset length.
3. The method of claim 1, wherein the searching in the candidate segments based on the question factors to obtain candidate answers to the document question comprises:
acquiring a trained reading comprehension task model, wherein the reading comprehension task model comprises an embedding layer, an embedding encoding layer, a context-query attention layer, a model encoding layer, and an output layer which are connected in sequence;
inputting the question factors and the candidate segment into the embedding layer, and encoding the question factors and the candidate segment through the embedding encoding layer to obtain a question factor encoding block and a candidate segment encoding block;
acquiring the similarity between the question factor encoding block and the candidate segment encoding block through the context-query attention layer;
obtaining predicted positions of the candidate answer through the model encoding layer based on the similarity between the question factor encoding block and the candidate segment encoding block;
and calculating, by decoding in the output layer, the probability that each predicted position is a candidate answer start position and the probability that it is a candidate answer end position, taking a predicted position whose probability is greater than a preset first threshold as the candidate answer start position, and taking a predicted position whose probability is greater than a preset second threshold as the candidate answer end position.
4. The method of claim 1, wherein the ranking the candidate answers according to similarity and taking the top-ranked candidate answer as the answer to the document question comprises:
performing pairwise similarity matching among the multiple candidate answers corresponding to a single candidate segment, and taking the candidate answer with the highest mean similarity as the candidate answer of that candidate segment;
taking the mean similarity between the candidate answer of the single candidate segment and the other candidate answers of that segment as the candidate weight of the single candidate segment;
obtaining the matching degree between the single candidate segment and the question factors, and obtaining the weight of the candidate answer according to the matching degree and the candidate weight of the single candidate segment;
and obtaining the weight corresponding to the candidate answer of each candidate segment, and taking the candidate answer with the highest weight as the answer to the document question.
5. The method of claim 4, wherein the obtaining the matching degree between the single candidate segment and the question factors comprises:
acquiring a first number of words produced by the synonym expansion and a second number of words produced by the semantic expansion;
and inputting the ratio of the first number of words to the second number of words, together with the single candidate segment, into an Elasticsearch retrieval model to obtain the matching degree between the single candidate segment and the question factors.
6. An apparatus for acquiring question and answer data in a document, the apparatus comprising:
an information acquisition module, configured to acquire a document to be processed and an input document question;
a keyword acquisition module, configured to identify entity words in the document question through an entity word identification technology and take the identified entity words as keywords of the document question;
a question factor acquisition module, configured to perform synonym expansion and semantic expansion on the keywords to obtain question factors;
a candidate segment acquisition module, configured to split the document to be processed into a plurality of document segments and take the document segments containing the question factors as candidate segments;
a candidate answer acquisition module, configured to search in the candidate segments based on the question factors to obtain candidate answers to the document question;
and a candidate answer processing module, configured to rank the candidate answers according to similarity and take the top-ranked candidate answer as the answer to the document question.
7. The apparatus of claim 6, wherein the candidate segment acquisition module comprises:
a first splitting unit, configured to convert the document to be processed into a character string and, when the length of the character string is greater than a preset length and the document to be processed comprises a plurality of natural paragraphs, split the document to be processed into different document segments by natural paragraphs;
and a second splitting unit, configured to split the document to be processed into different document segments based on a preset sliding window length and a preset stride when the length of the character string is less than or equal to the preset length.
8. The apparatus of claim 6, wherein the candidate answer acquisition module comprises:
a model acquisition unit, configured to acquire a trained reading comprehension task model, wherein the reading comprehension task model comprises an embedding layer, an embedding encoding layer, a context-query attention layer, a model encoding layer, and an output layer which are connected in sequence;
an encoding unit, configured to input the question factors and the candidate segment into the embedding layer, and encode the question factors and the candidate segment through the embedding encoding layer to obtain a question factor encoding block and a candidate segment encoding block;
an encoding block processing unit, configured to acquire the similarity between the question factor encoding block and the candidate segment encoding block through the context-query attention layer;
a position acquisition unit, configured to obtain predicted positions of the candidate answer through the model encoding layer based on the similarity between the question factor encoding block and the candidate segment encoding block;
and a position processing unit, configured to calculate, by decoding in the output layer, the probability that each predicted position is a candidate answer start position and the probability that it is a candidate answer end position, take a predicted position whose probability is greater than a preset first threshold as the candidate answer start position, and take a predicted position whose probability is greater than a preset second threshold as the candidate answer end position.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 5 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 5.
CN201910970168.8A 2019-10-12 2019-10-12 Method and device for acquiring question and answer data in document, computer equipment and storage medium Pending CN110955761A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910970168.8A CN110955761A (en) 2019-10-12 2019-10-12 Method and device for acquiring question and answer data in document, computer equipment and storage medium
PCT/CN2020/106124 WO2021068615A1 (en) 2019-10-12 2020-07-31 Method and device for acquiring question and answer data in document, computer device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910970168.8A CN110955761A (en) 2019-10-12 2019-10-12 Method and device for acquiring question and answer data in document, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN110955761A true CN110955761A (en) 2020-04-03

Family

ID=69975597

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910970168.8A Pending CN110955761A (en) 2019-10-12 2019-10-12 Method and device for acquiring question and answer data in document, computer equipment and storage medium

Country Status (2)

Country Link
CN (1) CN110955761A (en)
WO (1) WO2021068615A1 (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111625635A (en) * 2020-05-27 2020-09-04 北京百度网讯科技有限公司 Question-answer processing method, language model training method, device, equipment and storage medium
CN111782790A (en) * 2020-07-03 2020-10-16 阳光保险集团股份有限公司 Document analysis method and device, electronic equipment and storage medium
CN112287080A (en) * 2020-10-23 2021-01-29 平安科技(深圳)有限公司 Question sentence rewriting method and device, computer equipment and storage medium
CN112417126A (en) * 2020-12-02 2021-02-26 车智互联(北京)科技有限公司 Question answering method, computing equipment and storage medium
CN112507079A (en) * 2020-12-15 2021-03-16 科大讯飞股份有限公司 Document case situation matching method, device, equipment and storage medium
WO2021068615A1 (en) * 2019-10-12 2021-04-15 深圳壹账通智能科技有限公司 Method and device for acquiring question and answer data in document, computer device, and storage medium
CN113157890A (en) * 2021-04-25 2021-07-23 深圳壹账通智能科技有限公司 Intelligent question and answer method and device, electronic equipment and readable storage medium
CN114330718A (en) * 2021-12-23 2022-04-12 北京百度网讯科技有限公司 Method and device for extracting causal relationship and electronic equipment
WO2022227165A1 (en) * 2021-04-28 2022-11-03 平安科技(深圳)有限公司 Question and answer method and apparatus for machine reading comprehension, computer device, and storage medium
CN116340467A (en) * 2023-05-11 2023-06-27 腾讯科技(深圳)有限公司 Text processing method, text processing device, electronic equipment and computer readable storage medium
CN117708283A (en) * 2023-11-29 2024-03-15 北京中关村科金技术有限公司 Recall content determining method, recall content determining device and electronic equipment

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113204976B (en) * 2021-04-19 2024-03-29 北京大学 Real-time question and answer method and system
CN113220832B (en) * 2021-04-30 2023-09-05 北京金山数字娱乐科技有限公司 Text processing method and device
CN113553412B (en) * 2021-06-30 2023-07-25 北京百度网讯科技有限公司 Question-answering processing method, question-answering processing device, electronic equipment and storage medium
CN113515932B (en) * 2021-07-28 2023-11-10 北京百度网讯科技有限公司 Method, device, equipment and storage medium for processing question and answer information
CN113536788B (en) * 2021-07-28 2023-12-05 平安科技(上海)有限公司 Information processing method, device, storage medium and equipment
CN113656393B (en) * 2021-08-24 2024-01-12 北京百度网讯科技有限公司 Data processing method, device, electronic equipment and storage medium
CN114840648B (en) * 2022-03-21 2024-08-20 阿里巴巴(中国)有限公司 Answer generation method, device and computer program product
CN115292469B (en) * 2022-09-28 2023-02-07 之江实验室 Question-answering method combining paragraph search and machine reading understanding
CN117056497B (en) * 2023-10-13 2024-01-23 北京睿企信息科技有限公司 LLM-based question and answer method, electronic equipment and storage medium
CN117669512B (en) * 2024-02-01 2024-05-14 腾讯科技(深圳)有限公司 Answer generation method, device, equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103902652A (en) * 2014-02-27 2014-07-02 深圳市智搜信息技术有限公司 Automatic question-answering system
US20180089569A1 (en) * 2016-09-28 2018-03-29 International Business Machines Corporation Generating a temporal answer to a question
CN109697228A (en) * 2018-12-13 2019-04-30 平安科技(深圳)有限公司 Intelligent answer method, apparatus, computer equipment and storage medium
CN109800284A (en) * 2018-12-19 2019-05-24 中国电子科技集团公司第二十八研究所 A kind of unstructured information intelligent Answer System construction method of oriented mission

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7389208B1 (en) * 2000-06-30 2008-06-17 Accord Solutions, Inc. System and method for dynamic knowledge construction
CN110955761A (en) * 2019-10-12 2020-04-03 深圳壹账通智能科技有限公司 Method and device for acquiring question and answer data in document, computer equipment and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103902652A (en) * 2014-02-27 2014-07-02 深圳市智搜信息技术有限公司 Automatic question-answering system
US20180089569A1 (en) * 2016-09-28 2018-03-29 International Business Machines Corporation Generating a temporal answer to a question
CN109697228A (en) * 2018-12-13 2019-04-30 平安科技(深圳)有限公司 Intelligent answer method, apparatus, computer equipment and storage medium
CN109800284A (en) * 2018-12-19 2019-05-24 中国电子科技集团公司第二十八研究所 A kind of unstructured information intelligent Answer System construction method of oriented mission

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021068615A1 (en) * 2019-10-12 2021-04-15 深圳壹账通智能科技有限公司 Method and device for acquiring question and answer data in document, computer device, and storage medium
US11645316B2 (en) 2020-05-27 2023-05-09 Beijing Baidu Netcom Science Technology Co., Ltd. Question answering method and language model training method, apparatus, device, and storage medium
CN111625635A (en) * 2020-05-27 2020-09-04 北京百度网讯科技有限公司 Question-answer processing method, language model training method, device, equipment and storage medium
CN111625635B (en) * 2020-05-27 2023-09-29 北京百度网讯科技有限公司 Question-answering processing method, device, equipment and storage medium
CN111782790A (en) * 2020-07-03 2020-10-16 阳光保险集团股份有限公司 Document analysis method and device, electronic equipment and storage medium
CN112287080A (en) * 2020-10-23 2021-01-29 平安科技(深圳)有限公司 Question sentence rewriting method and device, computer equipment and storage medium
CN112287080B (en) * 2020-10-23 2023-10-03 平安科技(深圳)有限公司 Method and device for rewriting problem statement, computer device and storage medium
CN112417126A (en) * 2020-12-02 2021-02-26 车智互联(北京)科技有限公司 Question answering method, computing equipment and storage medium
CN112417126B (en) * 2020-12-02 2024-01-23 车智互联(北京)科技有限公司 Question answering method, computing device and storage medium
CN112507079A (en) * 2020-12-15 2021-03-16 科大讯飞股份有限公司 Document case situation matching method, device, equipment and storage medium
CN112507079B (en) * 2020-12-15 2023-01-17 科大讯飞股份有限公司 Document case situation matching method, device, equipment and storage medium
CN113157890A (en) * 2021-04-25 2021-07-23 深圳壹账通智能科技有限公司 Intelligent question and answer method and device, electronic equipment and readable storage medium
CN113157890B (en) * 2021-04-25 2024-06-11 深圳壹账通智能科技有限公司 Intelligent question-answering method and device, electronic equipment and readable storage medium
WO2022227165A1 (en) * 2021-04-28 2022-11-03 平安科技(深圳)有限公司 Question and answer method and apparatus for machine reading comprehension, computer device, and storage medium
CN114330718A (en) * 2021-12-23 2022-04-12 北京百度网讯科技有限公司 Method and device for extracting causal relationship and electronic equipment
CN116340467A (en) * 2023-05-11 2023-06-27 腾讯科技(深圳)有限公司 Text processing method, text processing device, electronic equipment and computer readable storage medium
CN116340467B (en) * 2023-05-11 2023-11-17 腾讯科技(深圳)有限公司 Text processing method, text processing device, electronic equipment and computer readable storage medium
CN117708283A (en) * 2023-11-29 2024-03-15 北京中关村科金技术有限公司 Recall content determining method, recall content determining device and electronic equipment

Also Published As

Publication number Publication date
WO2021068615A1 (en) 2021-04-15

Similar Documents

Publication Publication Date Title
CN110955761A (en) Method and device for acquiring question and answer data in document, computer equipment and storage medium
CN110765265B (en) Information classification extraction method and device, computer equipment and storage medium
CN111160017B (en) Keyword extraction method, phonetics scoring method and phonetics recommendation method
CN110598206B (en) Text semantic recognition method and device, computer equipment and storage medium
CN111353310B (en) Named entity identification method and device based on artificial intelligence and electronic equipment
CN108427707B (en) Man-machine question and answer method, device, computer equipment and storage medium
CN112131350B (en) Text label determining method, device, terminal and readable storage medium
WO2020258506A1 (en) Text information matching degree detection method and apparatus, computer device and storage medium
CN113934830B (en) Text retrieval model training, question and answer retrieval method, device, equipment and medium
CN111191002B (en) Neural code searching method and device based on hierarchical embedding
CN109543007A (en) Put question to data creation method, device, computer equipment and storage medium
CN112307182B (en) Question-answering system-based pseudo-correlation feedback extended query method
CN114139551A (en) Method and device for training intention recognition model and method and device for recognizing intention
CN112085091B (en) Short text matching method, device, equipment and storage medium based on artificial intelligence
CN110309504B (en) Text processing method, device, equipment and storage medium based on word segmentation
CN111400340B (en) Natural language processing method, device, computer equipment and storage medium
CN111985228A (en) Text keyword extraction method and device, computer equipment and storage medium
CN112149410A (en) Semantic recognition method and device, computer equipment and storage medium
CN112766319A (en) Dialogue intention recognition model training method and device, computer equipment and medium
CN116804998A (en) Medical term retrieval method and system based on medical semantic understanding
CN116975212A (en) Answer searching method and device for question text, computer equipment and storage medium
CN115203388A (en) Machine reading understanding method and device, computer equipment and storage medium
CN118245564A (en) Method and device for constructing feature comparison library supporting semantic review and repayment
CN113516094A (en) System and method for matching document with review experts
CN114416925B (en) Sensitive word recognition method, device, equipment, storage medium and program product

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200403