CN110717324B

CN110717324B - Judgment document answer information extraction method, device, extractor, medium and equipment

Info

Publication number: CN110717324B
Application number: CN201910840224.6A
Authority: CN
Inventors: 孙玉霞; 杨梓铭; 邝庆璇
Original assignee: Jinan University
Current assignee: Jinan University
Priority date: 2019-09-06
Filing date: 2019-09-06
Publication date: 2023-04-07
Anticipated expiration: 2039-09-06
Also published as: CN110717324A

Abstract

The invention discloses a judgment document answer information extraction method, device, extractor, medium and equipment. Word segmentation at the granularity of Chinese words is first performed on the data in a training sample set comprising judgment documents, questions and reference answers, and each word block obtained by segmentation is then encoded to obtain its text features. An ERNIE model is trained on the text features of the word blocks in each training sample to obtain a prediction model, and a deep neural network is trained on the training samples to obtain an answer verification model. At test time, the judgment document and the question are segmented and encoded and then input into the prediction model to obtain a predicted answer; the predicted answer is input into the answer verification model for verification. If verification passes, the predicted answer is taken as the final answer; otherwise the question is judged unanswerable. The invention greatly improves the accuracy of judgment document answer information extraction.

Description

Judgment document answer information extraction method, device, extractor, medium and equipment

Technical Field

The invention relates to the technical field of judgment document information processing, and in particular to a judgment document answer information extraction method, device, extractor, medium and equipment.

Background

With the continuous development and improvement of the rule of law in China, a large number of Chinese legal judgment documents (except those involving privacy, state secrets or trade secrets) have been made public, accumulating a large body of reference material for judicial work, law enforcement and related legal research. Legal judgment documents contain abundant key information about cases, such as time, place and relationships between persons, so judges, lawyers and other legal personnel have long obtained relevant information by manually reading the related Chinese judgment documents.

However, manually reading judgment documents has the following disadvantages: (1) The number of judgment documents is huge, and searching and screening them takes much time and energy, which reduces the efficiency of information acquisition. (2) After the relevant judgment documents are retrieved, redundant information that the reader does not need must still be excluded manually in order to find the useful information. (3) Judgment documents contain specialized terms that are obscure to non-professionals and cannot be directly understood by the general public.

Therefore, research on intelligent machine extraction of legal document information is of great significance. The purpose of the technology is to mine judgment documents with a natural language processing model: when a user poses a question, a machine reads and understands the judgment document and extracts the corresponding answer information, namely the answer, from it, thereby improving working efficiency.

Related technologies available in the art for extracting text information include:

(1) The LSTM (Long short-term memory) model proposed by Hochreiter et al is suitable for parallel computing, and can receive and process sentence sequences of different lengths and capture text features therein.

(2) The CNN model proposed by Kim can also be used to extract text information from judgment documents. The model supports parallel computation, but it is affected by the negative correlation between its local parts and entities; after several layers of pooling, the positional features of words are lost, which adversely affects the prediction result.

(3) The BERT model is pre-trained on large data sets through unsupervised learning and, after pre-training, is fine-tuned on a specific data set to suit a particular scenario. BERT can achieve better results than LSTM and CNN in many scenarios, but it does not introduce a knowledge graph mechanism and is not well suited to knowledge-driven NLP tasks; experiments show that its information extraction on judgment documents is not as good as that of knowledge-enhanced network models.

In summary, existing work mainly relies on text feature extraction and does not consider problems such as the recognition of semantic knowledge entities in the text, context-dependent text feature extraction and text semantic reasoning. As a result, the prior-art methods do not give sufficiently accurate answer information for questions posed about a judgment document.

Disclosure of Invention

The first purpose of the invention is to overcome the defects of the prior art and provide a referee document answer information extraction method, which combines semantic knowledge entity text information extraction, knowledge enhancement model and semantic reasoning inspection to realize the extraction of referee document answer information and greatly improve the accuracy of referee document answer information extraction.

The second objective of the present invention is to provide a judgment document answer information extraction device.

The third objective of the present invention is to provide a judgment document answer information extractor.

A fourth object of the present invention is to provide a storage medium.

It is a fifth object of the invention to provide a computing device.

The first purpose of the invention is realized by the following technical scheme: a method for extracting referee document answer information comprises the following steps:

acquiring a plurality of judgment documents, wherein the reference answer to each question corresponding to each judgment document is known, and taking each judgment document together with one question corresponding to it as one training sample;

performing word segmentation by taking vocabulary as granularity aiming at the referee documents and the questions in each training sample and the reference answers of the referee documents corresponding to the questions in the training samples; then, coding is carried out on each word block obtained by word segmentation, and text characteristics corresponding to each word block of the judgment documents, the questions and the reference answers in the training samples are obtained;

taking the text features corresponding to each word block of the referee document and the question in each training sample as the input of an ERNIE model, and taking the text features corresponding to each word block of the reference answer in the training sample as a label to train the ERNIE model to obtain a prediction model;

for each training sample, determining whether the question corresponding to the judgment document in the training sample is answerable according to the reference answer to that question; taking the concatenation of the judgment document and the question of each training sample as a whole as the premise and the reference answer to the question as the hypothesis, inputting both into a deep neural network, taking the result of whether the question is answerable as the output of the deep neural network, and training the deep neural network to obtain an answer verification model;

for a judgment document from which an answer is to be extracted, taking the judgment document and a question posed about it as a test sample; performing word segmentation at word granularity on the judgment document and the question in the test sample; then encoding each word block obtained by segmentation to obtain the text features corresponding to each word block of the judgment document and the question in the test sample;

inputting the text characteristics corresponding to each word block in the referee document and the question in the test sample into a prediction model, and outputting a prediction answer corresponding to the test sample through the prediction model;

taking the concatenation of the judgment document and the question in the test sample as a whole as the premise and the predicted answer corresponding to the test sample as the hypothesis, inputting both into the answer verification model, and determining through the answer verification model whether the predicted answer corresponding to the test sample has answerability;

if so, taking the predicted answer obtained by the prediction model as the final answer to the question posed about the judgment document in the test sample;

if not, determining that the question posed about the judgment document is unanswerable.

Preferably, the deep neural network trained to obtain the answer verification model is a long short-term memory network (LSTM) or a BiLSTM.

Preferably, each word block of the referee document and the question in the training sample and the test sample is coded to generate a word segmentation vector, a sentence pair vector and a position vector which are used as text characteristics corresponding to each word block in the referee document and the question in the training sample and the test sample; and coding each word block of the reference answer of the corresponding question of the referee document in the training sample to generate word segmentation vectors, sentence pair vectors and position vectors which are used as the text characteristics of each word block of the reference answer in the training sample.

Furthermore, after the word segmentation vector, sentence pair vector and position vector of each word block of the judgment document and the question in the training samples and the test sample are obtained, the word segmentation vectors, sentence pair vectors and position vectors corresponding to all word blocks of the judgment document are fused to obtain a first tensor, and those corresponding to all word blocks of the question are fused to obtain a second tensor; after the word segmentation vector, sentence pair vector and position vector of each word block of the reference answer to the question corresponding to the judgment document in a training sample are obtained, the vectors corresponding to all word blocks of the reference answer are fused to obtain a third tensor;

when the ERNIE model is trained, a first tensor and a second tensor obtained under a training sample are used as the input of the ERNIE model, and a third tensor is used as a label of the ERNIE model to train the ERNIE model; when a test sample is tested, a first tensor and a second tensor obtained under the test sample are input into the prediction model.

Preferably, for the judgment documents and questions in the training samples and the test sample, and for the reference answers to the questions corresponding to the judgment documents in the training samples, a word segmenter processes the text using a Chinese whole-word masking (full-word coverage) model and performs word segmentation at word granularity to obtain the word blocks; each word block is then encoded to obtain its word segmentation vector, sentence pair vector and position vector.

Preferably, the concatenation of the judgment document and the question means appending the character string of the question after the character string of the judgment document.

The second purpose of the invention is realized by the following technical scheme: a judgment document answer information extraction device, comprising:

the training sample acquisition module is used for acquiring a plurality of judgment documents, wherein the reference answer to each question corresponding to each judgment document is known, and for taking each judgment document together with one question corresponding to it as one training sample;

the first word segmentation module is used for performing word segmentation by taking vocabulary as granularity aiming at the referee documents and the questions in each training sample and the reference answers of the referee documents corresponding to the questions in each training sample;

the first coding module is used for coding each word block obtained by the first segmentation module to obtain text characteristics corresponding to each word block of the judgment documents, the questions and the reference answers in the training samples;

the prediction model construction module is used for taking the text characteristics corresponding to each word block of the referee document and the question in each training sample as the input of the ERNIE model, taking the text characteristics corresponding to each word block of the reference answer in each training sample as the output of the ERNIE model, and training the ERNIE model to obtain a prediction model;

the answer verification model building module is used for determining, for each training sample, whether the question corresponding to the judgment document in the training sample is answerable according to the reference answer to that question; and for taking the concatenation of the judgment document and the question of each training sample as a whole as the premise and the reference answer to the question as the hypothesis, inputting both into a deep neural network, taking the result of whether the question is answerable as the label of the deep neural network, and training the deep neural network to obtain an answer verification model;

the test sample acquisition module is used for acquiring a test sample, the test sample being a judgment document and a question posed about that judgment document;

the second word segmentation module is used for performing word segmentation by taking the vocabulary as granularity aiming at the referee document and the problem in the test sample; then, coding is carried out on each word block obtained by word segmentation, and text characteristics corresponding to each word block of the judgment document and the question in the test sample are obtained;

the prediction module is used for inputting the text characteristics corresponding to each word block in the referee document and the question in the test sample into the prediction model and outputting a prediction answer corresponding to the test sample through the prediction model;

the verification module is used for taking the concatenation of the judgment document and the question in the test sample as a whole as the premise and the predicted answer corresponding to the test sample as the hypothesis, inputting both into the answer verification model, and determining through the answer verification model whether the predicted answer corresponding to the test sample has answerability;

and the answer obtaining module is used for taking the predicted answer obtained by the prediction model as the final answer to the question posed about the judgment document in the test sample when the verification module verifies that the predicted answer corresponding to the test sample has answerability, and otherwise for determining that the question posed about the judgment document is unanswerable.

The third purpose of the invention is realized by the following technical scheme: an answer information extractor for referee documents, which comprises a word segmentation device, an encoder, an answer extractor and an answer verifier; wherein:

the word segmenter is used for receiving as input the judgment document in a training sample, the question in a training sample, the reference answer to the question corresponding to the judgment document in a training sample, the judgment document in the test sample, or the question in the test sample, and for performing word segmentation at word granularity on the input to obtain the corresponding word blocks; each judgment document together with one corresponding question is taken as one training sample, and the reference answer to the question corresponding to the judgment document in a training sample is known; a judgment document from which an answer is to be extracted, together with a question posed about it, is taken as the test sample;

the encoder is used for inputting each word block output after word segmentation processing of the word segmenter and encoding each input word block to obtain text characteristics corresponding to each input word block;

the answer extractor is used for inputting text characteristics corresponding to each word block of the referee document and the question in the test sample, extracting answers corresponding to the question from the referee document of the test sample according to the text characteristics corresponding to each word block of the referee document and the question in the test sample, and taking the answers as predicted answers corresponding to the test sample;

the answer verifier is used for receiving as input the concatenation of the judgment document and the question in the test sample as a whole, which serves as the premise, and the predicted answer corresponding to the test sample extracted by the answer extractor, which serves as the hypothesis; it then uses natural language inference to judge whether an entailment relation exists between the premise and the hypothesis, that is, whether the predicted answer corresponding to the test sample has answerability; if so, the predicted answer obtained by the prediction model is taken as the final answer to the question posed about the judgment document in the test sample; if not, the question posed about the judgment document is judged unanswerable;

wherein:

the answer extractor is obtained by training the ERNIE model, with the text features corresponding to each word block of the judgment document and the question in each training sample as input and the text features corresponding to each word block of the reference answer in the training sample as the label;

and the answer verifier is obtained by training the deep neural network, with the concatenation of the judgment document and the question in each training sample as a whole input as the premise, the reference answer to the question corresponding to the judgment document in the training sample input as the hypothesis, and the result of whether the question corresponding to the judgment document in the training sample is answerable as the label.
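The way the four components hand data to one another can be sketched as follows. This is only an illustrative wiring under stated assumptions: the class name `AnswerInformationExtractor`, its field names and the toy lambdas are hypothetical and do not come from the patent, and the lambdas are trivial stand-ins for the trained ERNIE extractor and the trained LSTM/BiLSTM verifier.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

Features = List[List[float]]


@dataclass
class AnswerInformationExtractor:
    """Hypothetical wiring of the four components; each is a pluggable callable."""
    segment: Callable[[str], List[str]]           # word segmenter
    encode: Callable[[List[str]], Features]       # encoder
    extract: Callable[[Features, Features], str]  # answer extractor (trained ERNIE model)
    verify: Callable[[str, str], bool]            # answer verifier (trained LSTM/BiLSTM)

    def answer(self, document: str, question: str) -> Optional[str]:
        doc_features = self.encode(self.segment(document))
        question_features = self.encode(self.segment(question))
        predicted = self.extract(doc_features, question_features)
        # Premise: the question string appended after the document string.
        premise = document + question
        # Return the predicted answer only if verification passes.
        return predicted if self.verify(premise, predicted) else None


# Toy stand-ins so the sketch runs; none of them is a real model.
toy = AnswerInformationExtractor(
    segment=lambda text: text.split(),
    encode=lambda blocks: [[float(len(b))] for b in blocks],
    extract=lambda doc_f, q_f: "2018",
    verify=lambda premise, hypothesis: hypothesis in premise,
)
```

Here `answer` returns `None` to signal an unanswerable question; a real system would replace each lambda with the corresponding trained component.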

The fourth purpose of the invention is realized by the following technical scheme: a storage medium storing a program which, when executed by a processor, realizes the referee document answer information extraction method according to the first object of the present invention.

The fifth purpose of the invention is realized by the following technical scheme: a computing device comprising a processor and a memory for storing a program executable by the processor, wherein the processor executes the program stored in the memory to implement the method for extracting answer information from a referee document according to the first object of the present invention.

Compared with the prior art, the invention has the following advantages and effects:

(1) In the judging document answer information extraction method, word segmentation processing with Chinese vocabulary as granularity is carried out on data in a training sample set including a judging document, a question and a reference answer, and then each word block obtained by word segmentation is coded to obtain text characteristics of each word block; then training the ERNIE model through the text characteristics of each word block obtained from each training sample to obtain a prediction model; meanwhile, training is carried out on the deep neural network through the referee documents, the questions and the reference answers in the training samples, and an answer verification model is obtained; when the actual test is carried out, firstly, the referee document and the proposed problem are subjected to word segmentation treatment, and then, each word block is coded to obtain the text characteristics of each word block; inputting the text characteristics of each word block corresponding to the referee document and the question into a prediction model to obtain a prediction answer, then inputting the prediction answer into an answer verification model for verification, taking the prediction answer as a final answer under the condition that the verification is passed, and otherwise, judging that the question proposed for the referee document is unsolved. 
It can be seen from the above that the method of the invention combines word segmentation, the ERNIE language model and answer verification to extract answer information from judgment documents. The word segmentation uses prior semantic knowledge to convert the judgment document and the question into semantic knowledge entities; the knowledge-enhanced ERNIE language model derives the predicted answer from the text information extracted from the semantic knowledge entities; and the answer verification model built on a deep neural network then uses semantic reasoning to verify the answerability of the input question, that is, the usability of the extracted information. This greatly improves reading comprehension of the text and makes the extraction of answer information from judgment documents more accurate.

(2) In the judgment document answer information extraction method, for the judgment documents and questions in the training samples and the test sample, and for the reference answers to the questions corresponding to the judgment documents in the training samples, the word segmenter processes the text using a Chinese whole-word masking (full-word coverage) model and performs word segmentation at word granularity to obtain word blocks, namely semantic knowledge entities. Because whole-word masking covers all the word blocks of a word, the method can effectively help capture important features and improves the accuracy of lexical and syntactic analysis, thereby further improving the accuracy of answer information extraction from judgment documents.

(3) In the judgment document answer information extraction method, the deep neural network trained to obtain the answer verification model may be a long short-term memory network (LSTM) or a BiLSTM. When the predicted answer output by the prediction model is verified, the long short-term memory network performs semantic reasoning within the deep neural network and judges whether a logical entailment relation exists between {question, judgment document} and the predicted answer, that is, whether the question posed about the judgment document is answerable. The judgment document and the question form the premise and the predicted answer is the hypothesis; the classes output by the answer verification model obtained by training the long short-term memory network are entailment, contradiction and neutral. If the verification result is entailment or neutral, the question is answerable and the predicted answer is valid. Otherwise the question is unanswerable, and the judgment document reader answers "unknown". Because LSTM or BiLSTM automatically extracts and retains the lexical and semantic features of a sentence, a large amount of manual feature extraction is avoided and richer text features are obtained. In addition, LSTM or BiLSTM uses the lexical features of the premise to represent the hypothesis, establishes a textual entailment relation, and judges whether textual entailment holds between the premise and the hypothesis, thereby judging whether the question is answerable and improving the accuracy of the answers.
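The decision rule described in (3) can be illustrated with a short sketch. It assumes the verification model exposes one score per class; the names `NLI_CLASSES`, `classify` and `final_answer` and the score dictionaries are hypothetical and serve only to show how the three classes map to the final output.

```python
NLI_CLASSES = ("entailment", "contradiction", "neutral")
# Per the description above, entailment or neutral means the question is answerable.
ANSWERABLE = {"entailment", "neutral"}


def classify(scores):
    """Pick the class with the highest verifier score (scores: class -> float)."""
    return max(NLI_CLASSES, key=lambda c: scores[c])


def final_answer(predicted_answer, scores):
    """Return the predicted answer if verification passes, otherwise "unknown"."""
    label = classify(scores)
    return predicted_answer if label in ANSWERABLE else "unknown"


# Illustrative scores only; a trained LSTM/BiLSTM would produce them.
accepted = final_answer("2018", {"entailment": 0.7, "contradiction": 0.1, "neutral": 0.2})
rejected = final_answer("2018", {"entailment": 0.1, "contradiction": 0.8, "neutral": 0.1})
```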

Drawings

FIG. 1 is a distribution diagram of the data set acquired in an embodiment of the present invention.

FIG. 2 is a schematic structural diagram of the judgment document answer information extractor of the present invention.

FIG. 3 is a comparison chart of the method of the present invention and the prior-art approach that uses only the ERNIE model to extract answer information from judgment documents.

Detailed Description

The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.

Example 1

The embodiment discloses a method for extracting answer information of a referee document, which comprises the following steps:

S1, acquiring a plurality of judgment documents, wherein the reference answer to each question corresponding to each judgment document is known, and taking each judgment document together with one question corresponding to it as one training sample;

S2, performing word segmentation at word granularity on the judgment document and the question in each training sample and on the reference answer to the question corresponding to the judgment document in the training sample, to obtain the word blocks. In this embodiment, the word segmenter processes the judgment document, the question and the reference answer using a Chinese whole-word masking (full-word coverage) model: a word is split into several word blocks by the WordPiece segmentation method, and prior semantic knowledge is used to mask the word blocks belonging to one complete word, that is, all word blocks from the same word are covered by [MASK], yielding a semantic knowledge entity. The whole-word masking model is pre-trained on a large number of entities and can accurately capture rich features from text. When text is segmented, each sentence starts with the tag [CLS], and two sentences are separated by [SEP].
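The masking behaviour described in step S2 can be sketched as follows. This is a minimal illustration, not the pre-trained model itself: it assumes the text has already been split into words and each word into word blocks, and the function name `whole_word_mask`, the `mask_ratio` parameter and the sample sentence are hypothetical.

```python
import random


def whole_word_mask(words, mask_ratio=0.15, seed=0):
    """Mask every word block of each selected word.

    words: list of words, each given as a list of its word blocks.
    Returns the masked token sequence and the indices of the masked words.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(len(words) * mask_ratio))
    chosen = set(rng.sample(range(len(words)), n_mask))
    tokens = ["[CLS]"]  # each sentence starts with [CLS]
    for i, blocks in enumerate(words):
        if i in chosen:
            # All blocks from the same word are covered together.
            tokens.extend(["[MASK]"] * len(blocks))
        else:
            tokens.extend(blocks)
    tokens.append("[SEP]")
    return tokens, chosen


# "原告 向 法院 提起 诉讼", each word split into single-character blocks.
words = [["原", "告"], ["向"], ["法", "院"], ["提", "起"], ["诉", "讼"]]
tokens, chosen = whole_word_mask(words, mask_ratio=0.4, seed=1)
```

The point of the sketch is that a word is never partially masked: either all of its blocks become `[MASK]` or none do.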

Each word block of the judgment document, the question and the reference answer is then encoded to obtain the text features corresponding to each word block of the judgment document, the question and the reference answer in the training sample. In this embodiment, each word block obtained by segmentation is encoded to generate a word segmentation vector (Token Embedding), a sentence pair vector (Segment Embedding) and a position vector (Position Embedding), which serve as the text features corresponding to each word block in the judgment document, the question and the reference answer. The Token Embedding stores the meaning of the word block; the Segment Embedding stores the containment relation between the word block and the sentences, that is, which sentence the word block is located in; the Position Embedding stores the position of the word block in the sentence.
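As a rough illustration of the three kinds of features, the sketch below builds the integer indices that would select the Token, Segment and Position embeddings for a document and question pair. The vocabulary, the function name `build_inputs` and the choice to number positions over the combined sequence are assumptions made for the example; the patent does not specify them.

```python
def build_inputs(doc_blocks, question_blocks, vocab):
    """Return tokens plus token, segment and position ids for a document/question pair."""
    tokens = ["[CLS]"] + doc_blocks + ["[SEP]"] + question_blocks + ["[SEP]"]
    # Token ids select the Token Embedding; unknown blocks fall back to [UNK].
    token_ids = [vocab.get(t, vocab["[UNK]"]) for t in tokens]
    # Segment ids say which sentence a block belongs to (0: document, 1: question).
    segment_ids = [0] * (len(doc_blocks) + 2) + [1] * (len(question_blocks) + 1)
    # Position ids record where each block sits (here: in the combined sequence).
    position_ids = list(range(len(tokens)))
    return tokens, token_ids, segment_ids, position_ids


# Hypothetical toy vocabulary.
vocab = {"[UNK]": 0, "[CLS]": 1, "[SEP]": 2, "原告": 3, "起诉": 4, "被告": 5, "谁": 6}
tokens, token_ids, segment_ids, position_ids = build_inputs(
    ["原告", "起诉", "被告"], ["谁", "起诉"], vocab
)
```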

In this embodiment, after the word segmentation vector, sentence pair vector and position vector of each word block of the judgment document, the question and the reference answer in the training sample are obtained, the vectors corresponding to all word blocks of the judgment document are fused to obtain a first tensor; the vectors corresponding to all word blocks of the question are fused to obtain a second tensor; and the vectors corresponding to all word blocks of the reference answer are fused to obtain a third tensor;
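The text does not say how the three vectors are fused. One common choice, used in BERT-style encoders, is an element-wise sum, and the sketch below assumes that choice. The embedding tables are random placeholders, and `make_table`, `fuse_embeddings` and all sizes are hypothetical.

```python
import random


def make_table(num_rows, dim, seed):
    """Random placeholder embedding table; a trained model would supply real values."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_rows)]


def fuse_embeddings(token_ids, segment_ids, position_ids,
                    token_table, segment_table, position_table):
    """Fuse the three vectors of every word block by element-wise sum (an assumption)."""
    fused = []
    for t, s, p in zip(token_ids, segment_ids, position_ids):
        row = [a + b + c
               for a, b, c in zip(token_table[t], segment_table[s], position_table[p])]
        fused.append(row)
    return fused  # shape: (number of word blocks, dim)


DIM = 4
token_table = make_table(10, DIM, seed=0)
segment_table = make_table(2, DIM, seed=1)
position_table = make_table(16, DIM, seed=2)

# Ids for the word blocks of one text, e.g. the judgment document.
token_ids = [1, 3, 4, 5, 2]
segment_ids = [0, 0, 0, 0, 0]
position_ids = [0, 1, 2, 3, 4]
first_tensor = fuse_embeddings(token_ids, segment_ids, position_ids,
                               token_table, segment_table, position_table)
```

Calling the same function on the question and on the reference answer would give the second and third tensors under the same assumption.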

S3, taking the text features corresponding to each word block of the judgment document and the question in each training sample as the input of an ERNIE model, and taking the text features corresponding to each word block of the reference answer in the training sample as the label, training the ERNIE model to obtain a prediction model;

In this embodiment, step S2 fuses the word segmentation vectors, sentence pair vectors and position vectors corresponding to all word blocks of the judgment document in the training sample into a first tensor, those of the question into a second tensor, and those of the reference answer into a third tensor. Therefore, when the ERNIE model is trained in this step, the first tensor and the second tensor obtained for each training sample are used as the input of the ERNIE model, and the third tensor is used as its label.

S4, for each training sample, determining whether the question corresponding to the judgment document in the training sample is answerable according to the reference answer to that question; taking the concatenation of the judgment document and the question of each training sample as a whole as the premise and the reference answer to the question as the hypothesis, inputting both into the deep neural network, taking the result of whether the question is answerable as the output of the deep neural network, and training the deep neural network to obtain an answer verification model.

In this embodiment, the deep neural network may be a long short-term memory network (LSTM) or a BiLSTM. The concatenation of the judgment document and the question means that the character string of the question is appended after the character string of the judgment document, so that the two form a whole.
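The construction of training pairs for the answer verification model in step S4 can be sketched as follows. The patent only says that answerability is determined from the reference answer; the sketch assumes, purely for illustration, that an empty reference answer marks an unanswerable question. `VerifierExample`, `build_verifier_examples` and the sample data are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class VerifierExample:
    premise: str      # judgment document with the question appended
    hypothesis: str   # reference answer
    answerable: int   # label: 1 if answerable, 0 otherwise


def build_verifier_examples(samples: List[Tuple[str, str, str]]) -> List[VerifierExample]:
    """samples: (judgment document, question, reference answer) triples."""
    examples = []
    for document, question, reference in samples:
        # Assumption: an empty reference answer means the question is unanswerable.
        label = 1 if reference.strip() else 0
        # The question string is appended after the document string.
        examples.append(VerifierExample(document + question, reference, label))
    return examples


examples = build_verifier_examples([
    ("原告于2018年提起诉讼。", "原告何时提起诉讼?", "2018年"),
    ("原告于2018年提起诉讼。", "被告的住址在哪里?", ""),
])
```

These pairs would then be fed to the LSTM or BiLSTM, with `answerable` as the training target.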

S5, for a judgment document from which an answer is to be extracted, taking the judgment document and a question posed about it as a test sample; performing word segmentation at word granularity on the judgment document and the question in the test sample to obtain word blocks. In this embodiment, the judgment document and the question in the test sample are segmented with the same word segmentation method as in step S2, yielding the semantic knowledge entities corresponding to the judgment document and the question in the test sample.

Each word block of the referee document and the question in the test sample obtained above is then encoded to obtain the text features corresponding to each word block. In this embodiment, as in step S2, encoding each word block of the referee document and the question in the test sample generates a word segmentation vector (Token Embedding), a sentence pair vector (Segment Embedding) and a position vector (Position Embedding), which serve as the text features of that word block. The word segmentation vectors, sentence pair vectors and position vectors corresponding to all the word blocks of the referee document in the test sample are fused to obtain a first tensor, and those corresponding to all the word blocks of the question in the test sample are fused to obtain a second tensor.

S6, inputting the text features corresponding to each word block of the referee document and the question in the test sample into the prediction model, and outputting the predicted answer corresponding to the test sample through the prediction model. In this embodiment, since step S5 fuses the vectors of the referee document in the test sample into a first tensor and the vectors of the question in the test sample into a second tensor, in this step the first tensor and the second tensor obtained for the test sample are input into the prediction model to predict the answer.

S7, taking the referee document and the question in the test sample, connected into a whole, as the premise, taking the predicted answer corresponding to the test sample as the hypothesis, inputting both into the answer verification model, and obtaining through the answer verification model whether the predicted answer corresponding to the test sample is answerable;

In this embodiment, the referee documents constituting the training samples in step S1 may be collected from a Chinese referee document website; in total, 4,000 civil referee documents and 4,000 criminal referee documents are collected. Each referee document has 5 questions, and each question has 1 manually annotated reference answer. The questions are divided into 3 types: span-extraction questions, yes/no judgment questions and unanswerable questions. The data set is analyzed before the prediction model and the answer verification model are trained. The distribution of question types in the data set is shown in fig. 1: there are 40,000 questions in total, of which 38,766 are answerable and 1,234 are unanswerable. Among the answerable questions, 33,651 are span-extraction questions and 5,115 are yes/no judgment questions. The answer to a span-extraction question comes from the original text of the referee document and is a contiguous span of that text. Most yes/no judgment questions are asked in the form of general interrogative sentences, and unanswerable questions are those for which no relevant information can be found in the referee document.
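The counts reported above are mutually consistent, which the following short check illustrates. The variable names are hypothetical; only the numbers come from the text.

```python
# Counts reported in this embodiment.
documents = 4000 + 4000          # civil + criminal referee documents
questions_per_document = 5
total_questions = documents * questions_per_document

span_extraction = 33651          # span (segment) extraction questions
judgment_type = 5115             # judgment-type questions
unanswerable = 1234              # questions with no answer in the document
answerable = span_extraction + judgment_type

assert total_questions == 40000
assert answerable == 38766
assert answerable + unanswerable == total_questions

# Share of each question type, in percent (rounded to one decimal).
shares = {
    "span_extraction": round(100 * span_extraction / total_questions, 1),
    "judgment_type": round(100 * judgment_type / total_questions, 1),
    "unanswerable": round(100 * unanswerable / total_questions, 1),
}
```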

In this embodiment, the data set is divided into a training set and a test set in a ratio of 7:3; the training set is used to train the models, and the test set is used to test their reading comprehension performance. The prediction model and the answer verification model are obtained by applying the steps of the method of this embodiment to the training set; the test samples of the test set are then input into the prediction model and the answer verification model by the method of this embodiment, and the resulting accuracy of answer information extraction from referee documents is shown in fig. 3. Based on the same training set and test set, the average accuracy obtained when a single ERNIE model is used for answer extraction is also shown in fig. 3. Comparing the two in fig. 3, the accuracy of answer extraction by the method of this embodiment is significantly higher.
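A minimal sketch of a 7:3 split and an accuracy computation follows. Several details are assumptions, because the text does not specify them: the samples are shuffled with a fixed seed before splitting, the split is applied to individual samples, and accuracy is taken as the fraction of exact matches with the reference answers. The names `split_7_3` and `accuracy` are hypothetical.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_7_3(samples: Sequence[T], seed: int = 0) -> Tuple[List[T], List[T]]:
    """Shuffle samples reproducibly and split them 7:3 into train and test."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 7 // 10
    return shuffled[:cut], shuffled[cut:]


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the reference answers."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        return 0.0
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(references)


# 40,000 sample indices stand in for the question samples of the data set.
train_set, test_set = split_7_3(range(40000))
```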

Example 2

This embodiment discloses a referee document answer information extraction device, comprising:

the training sample acquisition module is used for acquiring a plurality of referee documents, wherein the question corresponding to each referee document has a known reference answer, and each referee document together with one of its corresponding questions is used as a training sample;

the first word segmentation module is used for performing word segmentation at vocabulary granularity on the referee document and the question in each training sample and on the reference answer of the question in each training sample. In this embodiment, the first word segmentation module processes the referee document, the question and the reference answer with a Chinese whole-word masking (full-word coverage) model through a word segmenter: the word segmenter divides words into a plurality of word blocks based on the WordPiece segmentation method, and then uses prior semantic knowledge to mask together the word blocks belonging to one complete word, the word blocks from the same word being covered by [MASK], so as to obtain semantic knowledge entities.
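The idea of masking all word blocks of a complete word together can be sketched as follows. This is a simplified illustration, not the embodiment's segmenter: it assumes the text has already been divided into words, that each Chinese character forms one word block, and that the caller chooses which words to mask. The name `whole_word_mask` is hypothetical.

```python
from typing import Iterable, List

MASK = "[MASK]"


def whole_word_mask(words: List[str], masked_word_indices: Iterable[int]) -> List[str]:
    """Split each word into character-level word blocks and mask whole words.

    Assumption: for Chinese text, each character is one word block, so a
    multi-character word yields several blocks that are masked together.
    """
    to_mask = set(masked_word_indices)
    blocks: List[str] = []
    for index, word in enumerate(words):
        pieces = list(word)  # one block per character
        if index in to_mask:
            blocks.extend([MASK] * len(pieces))
        else:
            blocks.extend(pieces)
    return blocks


# "原告 / 请求 / 赔偿" with the second word chosen for masking.
masked = whole_word_mask(["原告", "请求", "赔偿"], masked_word_indices=[1])
```

Both characters of the chosen word are replaced, so the model never sees half of a word while the other half is hidden.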

the first coding module is used for encoding each word block obtained by the first word segmentation module to obtain the text features corresponding to each word block of the referee document, the question and the reference answer in the training sample. In this embodiment, encoding each word block generates a word segmentation vector (Token Embedding), a sentence pair vector (Segment Embedding) and a position vector (Position Embedding), which serve as the text features of each word block in the referee document, the question and the reference answer.

the prediction model construction module is used for taking the text features corresponding to the word blocks of the referee document and the question in each training sample as the input of the ERNIE model, taking the text features corresponding to the word blocks of the reference answer in each training sample as the output of the ERNIE model, and training the ERNIE model to obtain the prediction model.
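The module above uses the text features of the reference answer as the training output. Since the answer to a span-extraction question is taken from the original text of the referee document, one commonly used alternative way to represent such a label is a pair of start and end indices into the word blocks of the document. The sketch below illustrates that alternative only; it is an assumption and not something this embodiment states, and `find_answer_span` is a hypothetical helper.

```python
from typing import List, Optional, Tuple


def find_answer_span(document_blocks: List[str],
                     answer_blocks: List[str]) -> Optional[Tuple[int, int]]:
    """Return (start, end) indices, inclusive, of the first occurrence of
    answer_blocks inside document_blocks, or None if it does not occur."""
    n, m = len(document_blocks), len(answer_blocks)
    if m == 0 or m > n:
        return None
    for start in range(n - m + 1):
        if document_blocks[start:start + m] == answer_blocks:
            return start, start + m - 1
    return None


# Character-level word blocks of a toy document and a toy reference answer.
document_blocks = list("被告借款五万元")
span = find_answer_span(document_blocks, list("五万元"))
```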

the answer verification model building module is used for determining, for each training sample, whether the question corresponding to the referee document in the training sample is answerable according to the reference answer of that question; taking the referee document and the question of each training sample, connected into a whole, as the premise, taking the reference answer of the question in each training sample as the hypothesis, inputting both into the deep neural network, taking the result of whether the question in the training sample is answerable as the label of the deep neural network, and training the deep neural network to obtain an answer verification model. In this embodiment, the deep neural network may be a long short-term memory network (LSTM) or a bidirectional LSTM (BiLSTM); connecting the referee document and the question into a whole means appending the character string of the question to the end of the character string of the referee document.

the test sample acquisition module is used for acquiring a test sample, wherein the test sample consists of a referee document from which an answer is to be extracted and a question raised about that referee document;

the second word segmentation module is used for performing word segmentation at vocabulary granularity on the referee document and the question in the test sample, and then encoding each word block obtained by word segmentation to obtain the text features corresponding to each word block of the referee document and the question in the test sample. In this embodiment, the second word segmentation module processes the referee document and the question in the test sample with a Chinese whole-word masking model through a word segmenter: the word segmenter divides words into a plurality of word blocks based on the WordPiece segmentation method, and then uses prior semantic knowledge to mask together the word blocks belonging to one complete word.

and the answer obtaining module is used for taking the predicted answer obtained by the prediction model as the final answer to the question raised about the referee document in the test sample when the verification module verifies that the predicted answer corresponding to the test sample is answerable; otherwise, the question raised about the referee document is judged to have no answer.
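The interplay of prediction, verification and the final decision can be sketched as follows. The two callables are hypothetical stand-ins for the trained prediction model and answer verification model; the toy functions exist only to make the sketch runnable and do not reflect how the trained models behave.

```python
from typing import Callable, Optional

# Hypothetical stand-ins for the trained models; the real ones would be the
# ERNIE-based prediction model and the LSTM/BiLSTM answer verification model.
Predictor = Callable[[str, str], str]    # (document, question) -> predicted answer
Verifier = Callable[[str, str], bool]    # (premise, hypothesis) -> answerable?


def extract_answer(document: str, question: str,
                   predict: Predictor, verify: Verifier) -> Optional[str]:
    """Return the verified predicted answer, or None when the question is
    judged to have no answer in the referee document."""
    predicted = predict(document, question)
    premise = document + question          # question appended after the document
    if verify(premise, predicted):
        return predicted
    return None


# Toy stand-ins, for illustration only.
def toy_predict(document: str, question: str) -> str:
    return document[-3:]


def toy_verify(premise: str, hypothesis: str) -> bool:
    return bool(hypothesis) and hypothesis in premise


answer = extract_answer("被告借款五万元", "被告借款多少？", toy_predict, toy_verify)
```

Returning `None` for the no-answer case keeps it distinct from an empty string, which a prediction model could in principle output.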

Example 3

This embodiment discloses an answer information extractor for a referee document which, as shown in fig. 2, comprises a word segmenter, an encoder, an answer extractor and an answer verifier; wherein:

the word segmenter is used for receiving as input the referee document in each training sample, the question in each training sample, the reference answer of the question in each training sample, the referee document in the test sample or the question in the test sample, and performing word segmentation at vocabulary granularity on the input to obtain the word blocks corresponding to the input; each referee document and one corresponding question are taken as a training sample, the question in a training sample having a known reference answer; a referee document from which an answer is to be extracted and a question raised about it are taken as a test sample;

in this embodiment, the word segmenter processes the referee document and the question in the test sample with a Chinese whole-word masking model: it divides words into a plurality of word blocks (Tokens) based on the WordPiece segmentation method, and uses prior semantic knowledge to mask together the word blocks belonging to one complete word. As shown in fig. 2, when the text is segmented, a [CLS] tag marks the start, and two sentences are separated by a [SEP] tag. Token1, Token2 and Token3 represent the 3 word blocks obtained by segmenting one sentence of the text, and Token4, Token5 and Token6 represent the 3 word blocks obtained by segmenting the next sentence.

the encoder is used for receiving each word block output by the word segmenter and encoding each input word block to obtain the text features corresponding to it; in this embodiment, after the encoder encodes each word block, a word segmentation vector (Token Embedding), a sentence pair vector (Segment Embedding) and a position vector (Position Embedding) are generated. In fig. 2, EToken1, EToken2, ... represent word segmentation vectors; EA, EB, ... represent sentence pair vectors, the sentence pair vectors of word blocks in the same sentence being the same; and E0, E1, ... represent position vectors.
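The layout described for fig. 2 can be sketched as follows. The sketch assumes the usual BERT-style arrangement [CLS] A [SEP] B [SEP], including a closing [SEP] that the text does not mention, and assigns the special tags to segments in the BERT manner; `build_inputs` is a hypothetical name. It produces the indices from which the sentence pair vectors (EA, EB) and position vectors (E0, E1, ...) would be looked up, not the vectors themselves.

```python
from typing import List, Tuple

CLS, SEP = "[CLS]", "[SEP]"


def build_inputs(blocks_a: List[str],
                 blocks_b: List[str]) -> Tuple[List[str], List[str], List[int]]:
    """Return (tokens, segment labels, position indices) for a sentence pair.

    Assumption: the layout is [CLS] A [SEP] B [SEP], the usual BERT-style
    convention; the text only states [CLS] at the start and [SEP] between.
    """
    tokens = [CLS] + blocks_a + [SEP] + blocks_b + [SEP]
    # Word blocks of the same sentence share one segment label (EA or EB).
    segments = ["A"] * (len(blocks_a) + 2) + ["B"] * (len(blocks_b) + 1)
    positions = list(range(len(tokens)))   # E0, E1, ...
    return tokens, segments, positions


tokens, segments, positions = build_inputs(
    ["Token1", "Token2", "Token3"], ["Token4", "Token5", "Token6"]
)
```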

the answer verifier is used for receiving the referee document and the question in the test sample, connected into a whole, as the premise, and receiving the predicted answer corresponding to the test sample extracted by the answer extractor as the hypothesis; it then judges, by natural language inference, whether an entailment relation exists between the premise and the hypothesis, that is, whether the predicted answer corresponding to the test sample is answerable; if so, the predicted answer obtained by the prediction model is taken as the final answer to the question raised about the referee document in the test sample; if not, the question raised about the referee document is judged to have no answer.

Wherein:

the answer verifier is obtained by training the deep neural network, with the referee document and the question in the training sample, connected into a whole, input as the premise, the reference answer of the question in the training sample input as the hypothesis, and the result of whether the question in the training sample is answerable used as the label.

Example 4

This embodiment discloses a storage medium storing a program, which when executed by a processor, implements the method for extracting answer information of a referee document according to embodiment 1, as follows:

acquiring a plurality of referee documents, wherein the question corresponding to each referee document has a known reference answer, and taking each referee document together with one of its corresponding questions as a training sample;

performing word segmentation at vocabulary granularity on the referee document and the question in each training sample and on the reference answer of the question in the training sample; then encoding each word block obtained by word segmentation to obtain the text features corresponding to each word block of the referee document, the question and the reference answer in the training sample;

for each training sample, determining whether the question corresponding to the referee document in the training sample is answerable according to the reference answer of that question; taking the referee document and the question of each training sample, connected into a whole, as the premise, taking the reference answer of the question in each training sample as the hypothesis, inputting both into the deep neural network, taking the result of whether the question in each training sample is answerable as the expected output (label) of the deep neural network, and training the deep neural network to obtain an answer verification model;

for a referee document from which an answer is to be extracted, taking the referee document and the question raised about it as a test sample; performing word segmentation at vocabulary granularity on the referee document and the question in the test sample; then encoding each word block obtained by word segmentation to obtain the text features corresponding to each word block of the referee document and the question in the test sample;

taking the referee document and the question in the test sample, connected into a whole, as the premise, taking the predicted answer corresponding to the test sample as the hypothesis, inputting both into the answer verification model, and obtaining through the answer verification model whether the predicted answer corresponding to the test sample is answerable;

The storage medium in this embodiment may be a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), a USB flash drive, a removable hard disk, or another medium.

Example 5

This embodiment discloses a computing device comprising a processor and a memory for storing a program executable by the processor, characterized in that, when the processor executes the program stored in the memory, the method for extracting answer information of a referee document according to embodiment 1 is implemented as follows:

for a referee document from which an answer is to be extracted, taking the referee document and the question raised about it as a test sample; performing word segmentation at vocabulary granularity on the referee document and the question in the test sample; then encoding each word block obtained by word segmentation to obtain the text features corresponding to each word block of the referee document and the question in the test sample;

if not, the question raised about the referee document is judged to have no answer.

The computing device in this embodiment may be a desktop computer, a notebook computer, a smart phone, a PDA handheld terminal, a tablet computer, or other terminal devices.

In summary, the present invention uses natural language processing to provide technical support for legal practice. Legal referee documents, with their rigorous language structure, provide good training data for natural language processing, so the invention makes full use of natural language processing technology and provides an innovative computer application in the legal field. The ERNIE model is trained by modeling the semantic knowledge entities in a large-scale corpus to obtain a language model for predicting answers, namely the prediction model, which has semantic representation capability and can identify information entities in real applications. The invention uses a deep neural network (such as a long short-term memory network, LSTM) to realize a semantic reasoning function, which checks the answerability of the input question, that is, judges whether an answer to the question can be found in the referee document. If so, the question is answerable; otherwise, the question has no answer. This semantic reasoning prevents the model from answering unanswerable questions with incorrect text segments, thereby improving the accuracy of information extraction. The invention takes the referee document and the question as the premise and the answer corresponding to the question as the hypothesis, implements a classifier with an LSTM, and judges whether an entailment relation holds between the premise and the hypothesis, thereby checking the answerability of the question. The invention is simple and convenient to use: after the user inputs the referee document to be analyzed and a question, the answer corresponding to the question is automatically obtained from the referee document, which greatly improves the efficiency with which the user analyzes referee documents.
The invention combines Chinese whole-word masking, the knowledge enhancement of the ERNIE model and semantic reasoning; it can extract information from text using prior semantic knowledge and check the answerability of the extracted information by semantic reasoning, thereby improving the accuracy of reading comprehension and text information extraction.

The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims

1. A judgment document answer information extraction method is characterized by comprising the following steps:

performing word segmentation by taking vocabulary as granularity aiming at the referee documents and the questions in each training sample and the reference answers of the referee documents corresponding to the questions in the training samples; then coding is carried out on each word block obtained by word segmentation, and text characteristics corresponding to each word block of the referee document, the question and the reference answer in the training sample are obtained;

for a referee document from which an answer is to be extracted, taking the referee document and the question raised about it as a test sample; performing word segmentation at vocabulary granularity on the referee document and the question in the test sample; then encoding each word block obtained by word segmentation to obtain the text features corresponding to each word block of the referee document and the question in the test sample;

coding each word block of the referee document and the question in the training sample and the test sample to generate a word segmentation vector, a sentence pair vector and a position vector which are used as text characteristics corresponding to each word block in the referee document and the question in the training sample and the test sample; coding each word block of a reference answer of a question corresponding to a referee document in a training sample to generate a word segmentation vector, a sentence pair vector and a position vector which are used as text characteristics of each word block of the reference answer in the training sample;

if not, the question raised about the referee document is judged to have no answer.

2. The method of claim 1, wherein the deep neural network trained to obtain the answer verification model is a long short-term memory network (LSTM) or a BiLSTM.

3. The method for extracting answer information of a referee document according to claim 1, wherein after the word segmentation vectors, sentence pair vectors and position vectors of each word block of the referee document and the question in the training sample and the test sample are obtained, the word segmentation vectors, sentence pair vectors and position vectors corresponding to all the word blocks of the referee document are fused to obtain a first tensor, and the word segmentation vectors, sentence pair vectors and position vectors corresponding to all the word blocks of the question are fused to obtain a second tensor; after the word segmentation vectors, sentence pair vectors and position vectors corresponding to all the word blocks of the reference answer of the question corresponding to the referee document in the training sample are obtained, they are fused to obtain a third tensor;

4. The method of claim 1, wherein the referee documents and questions in the training samples and the test samples, and the reference answers of the questions in the training samples, are processed with a Chinese whole-word masking model; word segmentation is performed at vocabulary granularity to obtain the word blocks, and each word block is then encoded to obtain the word segmentation vector, sentence pair vector and position vector corresponding to that word block.

5. The method of claim 1, wherein connecting the referee document and the question into a whole means appending the character string of the question to the end of the character string of the referee document.

6. An official document answer information extraction device, comprising:

the first coding module is used for coding each word block obtained by word segmentation of the first word segmentation module to obtain text characteristics corresponding to each word block of the judgment documents, the questions and the reference answers in the training samples;

coding each word block of the referee document and the question in the training sample and the test sample to generate a word segmentation vector, a sentence pair vector and a position vector which are used as text characteristics corresponding to each word block in the referee document and the question in the training sample and the test sample; coding each word block of a reference answer of a question corresponding to the referee document in the training sample to generate a word segmentation vector, a sentence pair vector and a position vector which are used as text characteristics of each word block of the reference answer in the training sample;

the prediction model construction module is used for taking the text characteristics corresponding to each word block of the referee document and the question in each training sample as the input of an ERNIE model, taking the text characteristics corresponding to each word block of the reference answer in each training sample as the output of the ERNIE model, and training the ERNIE model to obtain a prediction model;

the answer verification model building module is used for determining, for each training sample, whether the question corresponding to the referee document in the training sample is answerable according to the reference answer of that question; taking the referee document and the question of each training sample, connected into a whole, as the premise, taking the reference answer of the question in each training sample as the hypothesis, inputting both into a deep neural network, taking the result of whether the question in the training sample is answerable as the label of the deep neural network, and training the deep neural network to obtain an answer verification model;

7. An answer information extractor of a referee document is characterized by comprising a word segmentation device, a coder, an answer extractor and an answer verifier; wherein:

the word segmentation device is used for receiving as input the referee document in each training sample, the question in each training sample, the reference answer of the question in each training sample, the referee document in the test sample or the question in the test sample, and performing word segmentation at vocabulary granularity on the input to obtain the word blocks corresponding to the input; each referee document and one corresponding question are taken as a training sample, the question in a training sample having a known reference answer; a referee document from which an answer is to be extracted and a question raised about it are taken as a test sample;

coding each word block of the referee document and the question in the training sample and the test sample to generate word segmentation vectors, sentence pair vectors and position vectors which are used as text characteristics corresponding to each word block in the referee document and the question in the training sample and the test sample; coding each word block of a reference answer of a question corresponding to a referee document in a training sample to generate a word segmentation vector, a sentence pair vector and a position vector which are used as text characteristics of each word block of the reference answer in the training sample;

the answer extractor is used for inputting text characteristics corresponding to each word block of the referee document and the question in the test sample, extracting an answer corresponding to the question from the referee document of the test sample according to the text characteristics corresponding to each word block of the referee document and the question in the test sample, and taking the answer as a predicted answer corresponding to the test sample;

the answer verifier is used for receiving the referee document and the question in the test sample, connected into a whole, as the premise, and receiving the predicted answer corresponding to the test sample extracted by the answer extractor as the hypothesis; it then judges, by natural language inference, whether an entailment relation exists between the premise and the hypothesis, that is, whether the predicted answer corresponding to the test sample is answerable; if so, the predicted answer obtained by the prediction model is taken as the final answer to the question raised about the referee document in the test sample; if not, the question raised about the referee document is judged to have no answer;

wherein:

the answer extractor is obtained by training the ERNIE model, with the text features corresponding to each word block of the referee document and the question in the training sample as the input and the text features corresponding to each word block of the reference answer in the training sample as the label;

8. A storage medium storing a program for implementing the method of extracting answer information of a referee document according to any one of claims 1 to 5 when the program is executed by a processor.

9. A computing device comprising a processor and a memory for storing a program executable by the processor, wherein the processor implements the method for extracting answer information of official document as claimed in any one of claims 1 to 5 when executing the program stored in the memory.