CN115329058A - Machine reading understanding method and device for extractive insurance documents and readable medium - Google Patents

Machine reading understanding method and device for extractive insurance documents and readable medium

Info

Publication number
CN115329058A
Authority
CN
China
Prior art keywords
answer
insurance
question
discourse
probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210951328.6A
Other languages
Chinese (zh)
Inventor
成臻 (Cheng Zhen)
武悦娇 (Wu Yuejiao)
任君翔 (Ren Junxiang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Pacific Insurance Technology Co Ltd
Original Assignee
Pacific Insurance Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Pacific Insurance Technology Co Ltd filed Critical Pacific Insurance Technology Co Ltd
Priority to CN202210951328.6A
Publication of CN115329058A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval of unstructured textual data
    • G06F16/33: Querying
    • G06F16/332: Query formulation
    • G06F16/3329: Natural language query formulation or dialogue systems
    • G06F16/3331: Query processing
    • G06F16/334: Query execution
    • G06F16/3344: Query execution using natural language analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Financial Or Insurance-Related Operations Such As Payment And Settlement (AREA)
  • Character Discrimination (AREA)

Abstract

The invention provides a machine reading understanding method, device, and computer readable medium for extractive insurance documents, wherein the method comprises the following steps: structuring an insurance document and outputting the chapter set P and directory structure L corresponding to it; encoding the chapter set P and building an index I, computing cosine similarity to the question to be answered based on the index I, and returning the K insurance chapters most similar to the question; predicting, for each chapter, the starting position START and ending position END of the corresponding answer within the chapter; and summing the starting-position probability p(START) and ending-position probability p(END) as the confidence of each answer segment, taking the segment with the highest confidence as the answer. A machine reading understanding device for extractive insurance documents is also provided. The method effectively improves the accuracy of machine reading of insurance documents and meets practical application requirements.

Description

Machine reading understanding method and device for extractive insurance documents and readable medium
Technical Field
The invention belongs to the field of natural language understanding, and in particular relates to a machine reading understanding method and device for extractive insurance documents, and a computer readable medium, that comprehensively consider the structural information of various insurance documents.
Background
Machine reading understanding refers to a technique in which a machine, simulating how a human reads a document, finds answers within the document for an input question. As a form of natural-language intelligent question answering, it can work directly over unstructured text without relying on structured knowledge prepared in advance, so the maintenance cost of the question-answering knowledge base is lower.
By objective, machine reading comprehension can be divided into cloze-style, extractive, and generative variants; the most widely used is extractive machine reading understanding (hereinafter simply machine reading understanding). It can further be divided into broad-sense and narrow-sense machine reading understanding, and most current research focuses on the narrow sense:
(1) Broad-sense machine reading understanding: documents that may contain the answer must first be found within a large collection, and answers to the input question are then returned from the candidate documents. The prior art contains little research on this first-stage candidate screening; in scenarios with few documents the complete document set is traversed directly without screening, while in scenarios with many documents the prior art mostly adopts simple literal recall schemes such as TF-IDF and BM25, or word-vector representation methods such as DrQA. After candidate documents are obtained, deep neural network models answer the question over each candidate document in the second stage;
(2) Narrow-sense machine reading understanding: an answer only needs to be returned for an input question against a specified document. Early work mostly used deep models based on recurrent or convolutional neural networks as encoders to extract features from the document and the question, followed by multilayer perceptrons to predict the start and end positions of the answer segment. With the development of pre-trained models in recent years, the machine reading understanding field has gradually introduced pre-trained models such as BERT, RoBERTa, and ELECTRA as the underlying encoder, all of which are Transformer-based fixed-length pre-trained encoders.
Similar techniques are described, for example, in the documents "Reading Wikipedia to Answer Open-Domain Questions", "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks", and "RoFormer: Enhanced Transformer with Rotary Position Embedding".
Although the prior art offers partial solutions for machine reading understanding of insurance clauses, it still has defects and cannot handle insurance clauses well, as follows, mainly in comparison with the prior-art reference "BERT-based machine reading understanding method, device, equipment and storage medium" (Chinese patent application No. 202011187381.0, publication No. CN112464641A, applicant: Ping An Technology (Shenzhen) Co., Ltd.):
(1) Insufficient screening capability for candidate chapters: a long insurance clause document can be decomposed into many chapters, and current candidate-chapter screening mostly relies on literal recall schemes that cannot mine deep semantic information. For example, for insurance-specific terms such as "hesitation period" and "cooling-off period", the literal similarity is very low although the expressed semantics are the same, so missed recalls easily make questions unanswerable. The method introduced in the prior-art reference above relies on an external word-segmentation tool and only produces named-entity feature vectors, lacking an information encoding of the complete chapter.
(2) Underutilization of insurance-document structure information: the method described in the prior-art reference above does not make use of the structural information of an insurance document. Every chapter in an insurance document carries corresponding paragraph-title information, which plays an important role both in screening candidate chapters and in answering questions, but the prior art does not exploit this feature.
(3) Insurance-document paragraphs are too long: the method introduced in the prior-art reference above uses the fixed-length pre-trained model BERT, which can only process texts of up to 512 tokens. For insurance documents, however, even after decomposition by document-parsing technology the length of each resulting chapter exceeds the processing capacity of most current fixed-length pre-trained models.
(4) No explicit judgment of answerability: the method introduced in the prior-art reference above merely selects the answer segment with the highest probability by predicting the probabilities of the start and end positions, and has no explicit ability to judge whether the corresponding chapter can answer the question at all.
Disclosure of Invention
In view of the above defects in the prior art, the invention provides a machine reading understanding method and device for extractive insurance documents, and a computer readable medium.
To solve the above technical problem, the invention provides a machine reading understanding method for extractive insurance documents, which obtains the answer corresponding to a question to be answered from an insurance document, comprising the following steps:
step a: structuring an insurance document and outputting structured text information corresponding to it, the structured text information comprising at least a chapter set P and a directory structure L;
step b: encoding the chapter set P and building an index I, computing cosine similarity to the question to be answered based on the index I, and returning the K insurance chapters most similar to the question;
step c: predicting, for each of the K insurance chapters, whether it contains the answer corresponding to the question to be answered, and, when it does, labeling the starting position START and ending position END of the answer within the chapter;
step d: summing the starting-position probability p(START) and ending-position probability p(END) of each predicted answer segment as its confidence, ranking by confidence, and taking the answer segment with the highest confidence as the answer.
To solve the above technical problem, the invention further provides a machine reading understanding device for extractive insurance documents, comprising a memory for storing instructions executable by a processor, and a processor for executing the instructions to implement the machine reading understanding method described above.
To solve the above technical problem, the present invention provides a computer readable medium storing computer program code, which when executed by a processor implements a machine reading understanding method as described above.
Compared with the prior art, the machine reading understanding method, device, and computer readable medium for extractive insurance documents provided by the invention offer improvements in the following aspects:
(1) A completely new reading and understanding framework:
the technical path adopted by the invention is additionally provided with the problem answering prejudgment and problem answer positioning component, the component realizes the paragraph positioning function of human-like reading understanding by using the deep learning semantic understanding capability, and the accuracy level and the interpretable capability of the machine reading understanding can be greatly improved by the function.
(2) Higher response accuracy:
compared with a pre-training model with a fixed length, the variable-length pre-training model adopted by the invention can cover more insurance chapters, and by introducing the non-answer module for judging whether chapters can be answered or not, the condition that a plurality of candidate chapters obtained by screening do not contain answers is effectively processed, and the scene of answering questions still without answers is greatly reduced. The model effect is improved from two angles of improving the answering accuracy rate from the accurate context and eliminating the non-answering scenes through preprocessing.
(3) Higher quality set of candidate chapters:
the quality of the candidate chapters determines the upper limit of the accuracy of the follow-up question answering, and the patent adopts a more advanced pre-training model to screen the candidate chapters, so that a higher recall rate is ensured under the condition of less recall number compared with a general literal semantic recall scheme. Meanwhile, the model has better understanding and expression on paragraph semantics by providing more accurate context for the model.
(4) Faster response speed:
compared with a multi-candidate set predictive answer method of a general model, the method provided by the patent not only filters unanswered questions, but also provides few candidate chapters with high accuracy for the model, and compared with a scheme of directly asking and answering without screening chapters, the method has about several times of efficiency improvement within an acceptable accuracy loss range.
(5) Better interpretability:
compared with a general model method, the model can only reflect the model capability through indexes, and the method can simultaneously present the titles of the answer chapters, can assist business personnel in understanding the model output to a certain extent, and can also provide an optimization direction for algorithm personnel.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 illustrates a flow chart of a machine reading understanding method for extractive insurance documents according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a machine reading understanding method for extractive insurance documents according to an embodiment of the present invention;
FIG. 3 is a system block diagram of a machine reading understanding device for extractive insurance documents according to an embodiment of the present invention.
Detailed Description
As those skilled in the art will understand, the invention reads the specific content of an insurance document in a brand-new machine reading manner and thereby achieves reading accuracy; furthermore, by improving the machine reading algorithm, the invention improves reading accuracy and accurately returns the machine-read answer corresponding to the posed question.
Fig. 1 shows a flow chart of a machine reading understanding method 100 for extractive insurance documents according to an embodiment of the present invention. Specifically, in this embodiment the machine reading process is implemented by the following steps:
step S101 is executed first, and classification is performed according to the document structure corresponding to the insurance document. Wherein the insurance documents are classified into one of at least the following categories: the type 1 is that the left side in the insurance document is a title, and the right side is the content corresponding to the title; the type 2 is a multi-level directory structure and at least comprises any one or combination of a parent title, a subtitle and subtitle content; type 3 is a title and title-corresponding content. Further, those skilled in the art will appreciate that through the generalization of the several types of documents described above, the contents of an insurance document can be more easily manipulated in preparation for subsequent work. Further, those skilled in the art will appreciate that in one variation, a hybrid type of processing may be performed for a particular insurance document, such as a combination of type 1 and type 2, while remaining within the scope of the present invention.
The process then proceeds to step S102: perform the corresponding structuring on the insurance document according to its document-structure classification, i.e., process the insurance document into a file of that structure class.
Next, step S103 is executed: construct input chapter-question pairs (P_i, Q_j) using a twin-network (Siamese) structure, where, in a preferred embodiment, the base encoder is the variable-length pre-trained model RoFormer, Pairwise Loss is used, and cosine similarity serves as the basic distance metric. Specifically, in one variation, the base encoder is a 12-layer variable-length pre-trained RoFormer, Pairwise Loss is the optimization loss for model training, and cosine similarity is the distance metric; that is, if chapter P contains the answer to Q the target is cosine(P, Q) = 1, otherwise cosine(P, Q) = -1.
Then step S104 is carried out: encode the input chapter P_i and the question Q_j to be answered separately, apply average pooling to the embedding-layer representation and to the last layer's output, and average the two pooled representations to obtain low-dimensional vectors for the chapter and the question; if P_i contains information that can answer Q_j, the chapter P_i and the question Q_j are deemed similar. Through steps S103 and S104, the embedding-layer representation and the corresponding calculation determine whether P_i contains information that can answer Q_j. Specifically, in one variation, an input chapter P and question Q are encoded separately to obtain the embedding-layer representations e_P / e_Q and the last-layer representations o_P / o_Q of the encoder model; the representations are reduced by average pooling and combined to give the low-dimensional vectors r_P = e_P + o_P and r_Q = e_Q + o_Q; per the optimization loss, if P contains information that can answer Q, the chapter P and the question Q are deemed cosine-similar.
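To make this dual-encoder pooling concrete, the following is a minimal sketch using PyTorch and the transformers library. The checkpoint name, helper names, and example texts are illustrative assumptions; the patent specifies only a RoFormer base encoder and the r = e + o combination of pooled representations.

    import torch
    from transformers import AutoTokenizer, AutoModel

    # Hypothetical checkpoint; any variable-length encoder with accessible
    # hidden states would do (this particular tokenizer may need extra deps).
    MODEL_NAME = "junnyu/roformer_chinese_base"  # assumption, not from the patent

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    encoder = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True)

    def masked_mean(hidden, mask):
        # Average-pool token vectors, ignoring padding positions.
        mask = mask.unsqueeze(-1).float()
        return (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

    def represent(text: str) -> torch.Tensor:
        # r = pooled embedding-layer output + pooled last-layer output,
        # following the r_P = e_P + o_P formula in the text.
        batch = tokenizer(text, return_tensors="pt", truncation=True)
        with torch.no_grad():
            out = encoder(**batch)
        e = masked_mean(out.hidden_states[0], batch["attention_mask"])   # embedding layer
        o = masked_mean(out.hidden_states[-1], batch["attention_mask"])  # last layer
        return e + o

    r_p = represent("若被保险人在犹豫期内解除合同……")   # a chapter P_i (example text)
    r_q = represent("犹豫期内退保能退多少钱？")           # a question Q_j (example text)
    sim = torch.nn.functional.cosine_similarity(r_p, r_q)  # value in [-1, 1]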
Next, step S105 is executed: encode the chapter set P and build an index I, compute cosine similarity to the question to be answered based on the index I, and return the K insurance chapters most similar to the question, K being a positive integer. In this step the encoding of the chapter set P and the building of the index I are preferably performed offline, with the index I constructed using the fast vector-indexing tool faiss. The cosine-similarity computation returns the similarity to the question to be answered, from which the insurance chapters most similar to the question are determined; in the preferred embodiment, K insurance chapters are selected for the subsequent steps.
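A hedged sketch of this offline indexing and top-K retrieval with faiss follows; the vector dimension, corpus size, and random vectors are placeholders standing in for the real chapter representations r_P.

    import numpy as np
    import faiss  # assumes the faiss-cpu package is installed

    d, n_chapters, K = 768, 10_000, 5          # illustrative sizes
    rng = np.random.default_rng(0)
    chapter_vecs = rng.standard_normal((n_chapters, d)).astype("float32")

    # Cosine similarity equals inner product on L2-normalised vectors.
    faiss.normalize_L2(chapter_vecs)
    index = faiss.IndexFlatIP(d)               # the "index I"
    index.add(chapter_vecs)                    # offline, once per model/document update

    query = rng.standard_normal((1, d)).astype("float32")  # encoded question r_Q
    faiss.normalize_L2(query)
    scores, ids = index.search(query, K)       # K most similar insurance chapters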
Then step S106 is entered: splice the question text Q and the chapter text P, introducing the additional identifiers [CLS] and [SEP]: [[CLS], Q, [SEP], P].
Step S107 is then executed: encode the splicing result of step S106 using the variable-length pre-trained model RoFormer as the base encoder, obtaining the corresponding model representation.
Step S108: input the representation from step S107 into three linear classifiers, which respectively output the probability p(no_answer) that chapter P contains the answer to question Q (i.e., the answerability probability predicted by the ANS classifier), the probability p(START) of the answer's starting position, and the probability p(END) of the answer's ending position.
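The three-classifier head can be sketched as follows; the hidden size, the use of the [CLS] position for the answerability probability, and the softmax placement are assumptions about one plausible realization, not details fixed by the text.

    import torch
    import torch.nn as nn

    class ReaderHeads(nn.Module):
        # Three linear classifiers on top of the encoder's token representations,
        # as in step S108; padding masks are omitted here for brevity.
        def __init__(self, hidden: int = 768):
            super().__init__()
            self.ans = nn.Linear(hidden, 2)    # chapter answerability
            self.start = nn.Linear(hidden, 1)  # per-token start-position score
            self.end = nn.Linear(hidden, 1)    # per-token end-position score

        def forward(self, token_states: torch.Tensor):
            # token_states: (batch, seq_len, hidden) from the RoFormer encoder
            # over the [[CLS], Q, [SEP], P] concatenation.
            p_ans = self.ans(token_states[:, 0]).softmax(-1)        # from [CLS]
            p_start = self.start(token_states).squeeze(-1).softmax(-1)
            p_end = self.end(token_states).squeeze(-1).softmax(-1)
            return p_ans, p_start, p_end

    heads = ReaderHeads()
    p_ans, p_start, p_end = heads(torch.randn(1, 512, 768))  # dummy encoder output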
Finally, step S109 is executed: sum the starting-position probability p(START) and ending-position probability p(END) of each predicted answer segment as its confidence, rank by confidence, and take the answer segment with the highest confidence as the answer.
Those skilled in the art will further understand that steps S101 and S102 above can be summarized as: structure the insurance document and output the corresponding structured text information, which comprises at least a chapter set P and a directory structure L.
Further, in a variation, step S102 is followed by: analyze the line-break characteristics of different scenarios, screen out scenarios where the line break is caused by insufficient width, confirm the preconditions of abnormal line breaks, delete the abnormal line breaks under the global structure while keeping paragraphs within chapters independent of one another, and structure the insurance document accordingly.
Further, in another variation, steps S106-S108 may be summarized as: predict whether each of the K insurance chapters contains the answer to the question to be answered, and, when a chapter does, label the starting position START and ending position END of the answer within it.
Further, those skilled in the art will appreciate that, in combination with the embodiment shown in fig. 1, in one variation the technical object of the present invention can be achieved by the following steps:
structuring an insurance document and outputting the corresponding structured text information, comprising at least a chapter set P and a directory structure L;
encoding the chapter set P and building an index I, computing cosine similarity to the question to be answered based on the index I, and returning the K insurance chapters most similar to the question;
predicting whether each of the K insurance chapters contains the answer to the question to be answered and, when a chapter does, labeling the starting position START and ending position END of the answer within it;
summing the starting-position probability p(START) and ending-position probability p(END) of each predicted answer segment as its confidence, ranking by confidence, and taking the segment with the highest confidence as the answer. In step d, answer segments that do not meet the requirements are excluded according to specific judgment conditions, the remaining segments are retained, and the final machine-read answer segment is selected in order of confidence.
Further, in another variation, step S106 additionally includes: if the length of the splicing result exceeds 512, split it, e.g. into [[CLS], Q, [SEP], P_1], [[CLS], Q, [SEP], P_2], ..., [[CLS], Q, [SEP], P_M], where P = [P_1, P_2, ..., P_M]. For example, a splicing result of length 1000 is split into two: [[CLS], Q, [SEP], P_1] and [[CLS], Q, [SEP], P_2], where preferably the part containing P_1 has length 512 and the part containing P_2 has length 488. Other splice lengths yield other combinations, all within the scope of the present invention.
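A simplified sketch of this splitting rule follows; the token budgeting, the number of special tokens counted, and the absence of overlap between pieces are assumptions.

    def split_concatenation(q_tokens, p_tokens, max_len=512):
        # Split [[CLS], Q, [SEP], P] into windows of at most max_len tokens:
        # P is cut into consecutive pieces P_1 .. P_M so each window fits.
        # Whether a trailing [SEP] is also counted depends on the tokenizer;
        # two special tokens are assumed here.
        budget = max_len - len(q_tokens) - 2          # room left for chapter tokens
        pieces = [p_tokens[i:i + budget] for i in range(0, len(p_tokens), budget)]
        return [["[CLS]"] + q_tokens + ["[SEP]"] + piece for piece in pieces]

    chunks = split_concatenation(list("问题"), list("很长的条款正文" * 200))
    print(len(chunks), len(chunks[0]))                # M windows, each <= 512 tokens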
Further, those skilled in the art understand that, in another variation, the step S109 can also be implemented by the following steps:
step d1: sum the starting-position probability p(START) and ending-position probability p(END) of each predicted answer segment as its confidence;
step d2: screen the predicted answer segments and exclude those that do not meet the requirements;
step d3: rank the remaining segments by confidence and take the answer segment with the highest confidence as the answer.
Further, in step d1 above, the actual operation is: sum p(START) and p(END) of each predicted answer segment as its confidence, and rank by p(START) + p(END) to obtain start-end position combinations (i, j) in descending order of probability. The rationality of the answers is then screened in step d2: answer segments whose combination has i > j are filtered out, since i > j means the starting position lies after the ending position, which is not rational.
Further, to exclude unsatisfactory segments, step d2 may be realized by any one or more of the following:
i. excluding segments whose starting position START > ending position END;
ii. comparing the probability p(no_answer) that chapter P contains the answer to question Q with a first threshold p_ANS, and excluding the segment if p(no_answer) is smaller than p_ANS; or
iii. excluding the segment if p(START) + p(END) < the first threshold p_ANS.
Further, in one embodiment, steps i, ii, and iii may all be performed, or only some of them. Their purpose is to exclude unsatisfactory segments, i.e., segments that are clearly unsuitable for carrying out the invention: for example, a starting position found after the ending position indicates the determined segment is erroneous, since such a segment cannot actually exist. The other cases of steps i-iii reflect similar errors, so such segments cannot serve as candidate answers.
Further, in the above embodiments and variations, the first threshold p_ANS is preferably a probability value, for example 0.75: segments whose probability p(no_answer) is greater than 0.75 are retained and the rest excluded. The threshold may be adjusted to the actual machine reading application, and such adjustments remain within the scope of the present invention.
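Putting steps i-iii together, a minimal filtering sketch might look like this; the candidate-tuple layout and the shared threshold for steps ii and iii follow the text, while everything else is illustrative.

    def filter_spans(candidates, p_answer, p_ans_threshold=0.75):
        # Screen predicted answer segments per steps i-iii above (a sketch).
        # candidates: (start, end, p_start, p_end) tuples for one chapter;
        # p_answer: the chapter-level answerability probability p(no_answer).
        kept = []
        for start, end, p_s, p_e in candidates:
            if start > end:                      # step i: start after end is invalid
                continue
            if p_answer < p_ans_threshold:       # step ii: chapter fails the threshold
                continue
            if p_s + p_e < p_ans_threshold:      # step iii: span confidence too low
                continue
            kept.append((p_s + p_e, start, end))
        kept.sort(reverse=True)                  # rank by confidence p(START) + p(END)
        return kept

    print(filter_spans([(10, 25, 0.6, 0.5), (30, 12, 0.9, 0.8)], p_answer=0.9))
    # keeps only the valid span (10, 25); the (30, 12) span has start > end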
Further, referring to fig. 2, which shows a schematic diagram of a machine reading understanding method 200 for extractive insurance documents according to an embodiment of the present invention, the technical solution of this embodiment can be summarized into the following four main steps:
step S21: disassembling insurance files based on document parsing technology
The input of this step is a PDF file of insurance clauses, or a DOC file of insurance underwriting rules, to be decomposed; the output is the insurance chapter set P and the directory structure L. The specific embodiment of this step is as follows:
1. Judging the file structure: based on the existing input files, file structures are divided into three categories. The first is a two-column layout based on insurance clauses, with titles on the left of the page and the corresponding content on the right; the second is a multi-level directory structure based on underwriting documents, in which parent titles, subtitles, and subtitle content are interleaved; the third presents only titles followed by their corresponding content in sequence. In this patent, the document structure type of the input document is determined first, followed by the subsequent operations.
2. Collecting chapter information: collect the existing text information, judge from the information-distribution and font characteristics whether each text block is content or a title, determine the title-to-title and title-to-content mapping relations from the positional relations among text blocks, and output accurate structured text information.
3. Arranging chapter information: analyze the line-break characteristics of different scenarios, screen out line breaks caused by insufficient width, confirm the preconditions of abnormal line breaks, delete the abnormal line breaks under the global structure while keeping paragraphs within chapters independent of one another, and output the structured information.
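The text does not spell out the concrete preconditions for an "abnormal" line break, so the following heuristic sketch is purely illustrative: it merges a line into its predecessor when the predecessor looks unfinished and neither line looks like a title.

    import re

    TITLE = re.compile(r"^(第[一二三四五六七八九十百]+[条章节]|\d+[.、])")
    ENDS = re.compile(r"[。；：！？.;:!?]$")

    def merge_wrapped_lines(lines):
        # Merge line breaks caused by insufficient page width (heuristic sketch;
        # the rule set here is an assumption, not the patent's procedure).
        merged = []
        for line in lines:
            line = line.strip()
            if (merged
                    and not ENDS.search(merged[-1])   # previous line looks unfinished
                    and not TITLE.match(merged[-1])   # ...and is not itself a title
                    and not TITLE.match(line)):       # ...and this line is not a title
                merged[-1] += line                    # undo the abnormal line break
            else:
                merged.append(line)
        return merged

    print(merge_wrapped_lines(["第一条 保险责任", "本合同保险期间内，被保", "险人遭受意外伤害。"]))
    # ['第一条 保险责任', '本合同保险期间内，被保险人遭受意外伤害。']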
Step S22: filtering insurance chapters based on deep semantic recall technology
This step takes as input the complete insurance chapter set P obtained from document parsing and the question Q to be answered, and outputs the K chapters P' that the deep semantic recall model judges most likely to contain the answer. The specific embodiment is as follows:
1. Model training: a twin-network structure is adopted, with the variable-length pre-trained model RoFormer as the base encoder, to construct input chapter-question pairs (P_i, Q_j); Pairwise Loss is used, with cosine similarity as the basic distance metric. The model encodes an input chapter P_i and a question Q_j separately, applies average pooling to the embedding-layer representation and to the last layer's output, and averages the two to obtain low-dimensional vectors for the chapter and the question; if P_i contains information that can answer Q_j, the label is 1, otherwise the label is -1 (a loss sketch follows the next item).
2. Model inference: all existing insurance chapters P are encoded offline and an index I is constructed with the fast vector-indexing tool faiss; this part can be stored offline and need only be executed when the model or documents are updated, not on each incoming question. The question Q_j to be answered is encoded in real time, cosine similarity is computed against the index I, and the K insurance chapters with the highest cosine similarity are returned.
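As noted under item 1 above, the pairing of Pairwise Loss with cosine distance and labels of +1 / -1 can be realized, for example, with PyTorch's CosineEmbeddingLoss; this is one choice of realization, not one named by the patent.

    import torch
    import torch.nn as nn

    loss_fn = nn.CosineEmbeddingLoss(margin=0.0)

    r_p = torch.randn(8, 768, requires_grad=True)  # chapter representations r_P
    r_q = torch.randn(8, 768, requires_grad=True)  # question representations r_Q
    labels = torch.tensor([1, -1, 1, 1, -1, -1, 1, -1])  # 1 when P_i answers Q_j

    loss = loss_fn(r_p, r_q, labels)
    loss.backward()   # in training, gradients flow back into the twin encoder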
Step S23: machine reading understanding question and answer based on variable-length pre-training model
This step takes as input the K candidate insurance chapters P' screened by the deep semantic recall model and the question Q to be answered; the machine reading understanding model outputs, for each insurance chapter, whether it can be answered and the start and end positions of the answer segment.
Specific embodiments for this step S23 are as follows:
1. Splice the question and chapter text as [[CLS], Q_i, [SEP], P_j], and encode the splicing result with the variable-length pre-trained model RoFormer as the base encoder; if the length of the splicing result exceeds 512, split it, e.g. into [[CLS], Q_i, [SEP], P_j1] and [[CLS], Q_i, [SEP], P_j2], where P_j = [P_j1, P_j2];
2. Add three fully-connected layers ANS, START, and END on top of the base encoder, used respectively to predict whether the chapter contains the answer to question Q, the starting position with the highest answer-segment probability, and the ending position with the highest answer-segment probability. If the chapter contains no answer, the label of ANS is 0 and the labels of START and END are also 0 (i.e., the [CLS] position); conversely, the label of ANS is 1 and the labels of START and END are the corresponding start and end positions.
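The label scheme of item 2 can be sketched as a small helper; treating index 0 as the [CLS] position follows the text, while the function shape itself is an assumption.

    def make_labels(answer_span, seq_len):
        # Build ANS/START/END training targets for one [[CLS], Q, [SEP], P] window.
        # answer_span: (start, end) token indices of the answer inside the window,
        # or None when the chapter piece contains no answer; position 0 is [CLS].
        if answer_span is None:
            return {"ans": 0, "start": 0, "end": 0}   # point both positions at [CLS]
        start, end = answer_span
        assert 0 < start <= end < seq_len
        return {"ans": 1, "start": start, "end": end}

    print(make_labels(None, 512))      # {'ans': 0, 'start': 0, 'end': 0}
    print(make_labels((37, 52), 512))  # {'ans': 1, 'start': 37, 'end': 52}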
Step S24: answer ranking
The input of this step is, for each insurance chapter, the machine reading understanding model's judgment of whether it can be answered together with the starting position START and ending position END of the answer segment; the output is the answer segment with the highest overall probability. The specific embodiment is as follows:
1. First, the K insurance chapters are ranked by the ANS field according to whether they contain an answer segment for question Q; a chapter whose probability value is below 0.5 is judged not to contain the answer and is filtered out directly, leaving K' candidates;
2. The predicted answer segments are then ranked by p(START) + p(END): if argmax p(START) > argmax p(END), the predicted start and end positions conflict and the segment is filtered out directly; if p(START) + p(END) < p_ANS, where p_ANS is a configured answer-position confidence, the predicted segment is unsatisfactory and may likewise be filtered;
3. Finally, the answer text is restored from the chapter P according to START and END, and the answer segment with the highest confidence is returned; if no segment remains after filtering, the question is returned as unanswerable.
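An end-to-end sketch of this answer-ranking step follows, using the 0.5 answerability cutoff from item 1 and an example value for p_ANS; both thresholds are configurable in practice, and the input layout is an assumption.

    def pick_answer(chapters, p_ans_threshold=0.5, span_threshold=0.75):
        # chapters: dicts with 'text', 'p_ans' (ANS probability), and
        # 'spans' = [(start, end, p_start, p_end), ...] per chapter.
        scored = []
        for ch in chapters:
            if ch["p_ans"] < p_ans_threshold:        # item 1: filter unanswerable
                continue
            for start, end, p_s, p_e in ch["spans"]:
                if start > end or p_s + p_e < span_threshold:
                    continue                          # item 2: drop invalid spans
                scored.append((p_s + p_e, ch["text"][start:end + 1]))
        if not scored:
            return None                               # the question cannot be answered
        return max(scored)[1]                         # item 3: highest-confidence span

    answer = pick_answer([{"text": "犹豫期为十五日。", "p_ans": 0.92,
                           "spans": [(4, 6, 0.7, 0.6)]}])
    print(answer)  # 十五日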
Further, it is understood by those skilled in the art that the above description of the embodiment shown in fig. 2 corresponds to the embodiment shown in fig. 1 and that variations other than those of fig. 1 may be made while remaining within the scope of the present invention.
Further, with reference to the embodiments and variations shown in fig. 1 and fig. 2, the reading understanding method provided by the present invention can be understood as follows:
(1) Decomposing insurance clauses based on document parsing: the invention analyzes the title-content structure, font characteristics, and information-distribution characteristics of PDF files recording insurance clauses and DOC files recording underwriting rules, structures the multi-level title hierarchy from these characteristics, completes the alignment of titles with content, and detects abnormal line-break scenarios, thereby obtaining complete structured text information.
(2) Introducing a deep semantic model to screen candidate chapters: since the existing literal-semantic screening of candidate chapters is prone to mis-recall and missed recall, a pre-trained model is used to build deep semantic representations of insurance clauses for deep semantic recall, improving candidate-chapter quality and thus the accuracy of machine reading understanding.
(3) Introducing insurance-document structure information into answering: the invention introduces the structure information into the question-answering stage of machine reading understanding, thereby reducing the false-answer rate.
(4) Adopting a variable-length pre-trained model and introducing a no-answer module: to handle the excessive chapter lengths produced by decomposing insurance documents, the invention introduces a pre-trained model with variable-length position encoding, together with a no-answer module to handle chapters that contain no answer.
Fig. 3 is a system block diagram of a machine reading understanding apparatus for extractive insurance documents according to an embodiment of the present invention. Referring to fig. 3, the machine reading understanding apparatus 300 may include an internal communication bus 301, a processor 302, a read-only memory (ROM) 303, a random access memory (RAM) 304, and a communication port 305; when implemented on a personal computer, it may also include a hard disk 306. The internal communication bus 301 enables data communication among the components of the apparatus 300. The processor 302 makes determinations and issues prompts; in some embodiments it may consist of one or more processors. The communication port 305 enables the apparatus 300 to exchange data with the outside; in some embodiments, the apparatus 300 sends and receives information and data from a network through the communication port 305. The apparatus 300 may also include various forms of program storage and data storage units, such as the hard disk 306, the ROM 303, and the RAM 304, capable of storing the data files the computer processes and/or communicates as well as the program instructions executed by the processor 302. The processor executes these instructions to implement the main parts of the method; the results are transmitted to the user device through the communication port and displayed on the user interface.
The above operation method may be implemented as a computer program, stored on the hard disk 306, and loaded into the processor 302 for execution, so as to implement the machine reading understanding method of the present application.
The present invention also includes a computer readable medium having stored thereon computer program code which, when executed by a processor, implements the machine reading understanding method for extractive insurance documents described above.
When the machine reading understanding method for extractive insurance documents is implemented as a computer program, it may also be stored in a computer-readable storage medium as a product. For example, computer-readable storage media may include, but are not limited to, magnetic storage devices (e.g., hard disks, floppy disks, magnetic strips), optical disks (e.g., compact disk (CD), digital versatile disk (DVD)), smart cards, and flash memory devices (e.g., electrically erasable programmable read-only memory (EEPROM), cards, sticks, key drives). In addition, the various storage media described herein can represent one or more devices and/or other machine-readable media for storing information. The term "machine-readable medium" can include, without being limited to, wireless channels and various other media (and/or storage media) capable of storing, containing, and/or carrying code and/or instructions and/or data.
It should be understood that the above-described embodiments are illustrative only. The embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or any combination thereof. For a hardware implementation, the processor may be implemented within one or more Application Specific Integrated Circuits (ASICs), digital Signal Processors (DSPs), digital Signal Processing Devices (DSPDs), programmable Logic Devices (PLDs), field Programmable Gate Arrays (FPGAs), processors, controllers, micro-controllers, microprocessors, and/or other electronic units designed to perform the functions described herein, or a combination thereof.
Aspects of the present application may be embodied entirely in hardware, entirely in software (including firmware, resident software, micro-code, etc.), or in a combination of hardware and software, referred to herein as a "data block", "module", "engine", "unit", "component", or "system". The processor may be one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), processors, controllers, micro-controllers, microprocessors, or a combination thereof. Furthermore, aspects of the present application may be embodied as a computer product, comprising computer readable program code, on one or more computer readable media. For example, computer-readable media may include, but are not limited to, magnetic storage devices (e.g., hard disks, floppy disks, magnetic tapes, etc.), optical disks (e.g., compact disks (CD), digital versatile disks (DVD), etc.), smart cards, and flash memory devices (e.g., cards, sticks, key drives, etc.).
The computer readable medium may comprise a propagated data signal with the computer program code embodied therein, for example, on a baseband or as part of a carrier wave. The propagated signal may take any of a variety of forms, including electromagnetic, optical, and the like, or any suitable combination. The computer readable medium can be any computer readable medium that can communicate, propagate, or transport the program for use by or in connection with an instruction execution system, apparatus, or device. Program code on a computer readable medium may be propagated over any suitable medium, including radio, electrical cable, fiber optic cable, radio frequency signals, or the like, or any combination of the preceding.
Other technical solutions similar to the above main idea fall within the protection scope of the present invention and are not described here one by one.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes and modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention.

Claims (11)

1. A machine reading understanding method for extractive insurance documents, used to obtain from an insurance document the answer corresponding to a question to be answered, characterized by comprising the following steps:
step a: structuring an insurance document and outputting structured text information corresponding to it, the structured text information comprising at least a chapter set P and a directory structure L;
step b: encoding the chapter set P and building an index I, computing cosine similarity to the question to be answered based on the index I, and returning the K insurance chapters most similar to the question;
step c: predicting, for each of the K insurance chapters, whether it contains the answer corresponding to the question to be answered, and, when it does, labeling the starting position START and ending position END of the answer within the chapter;
step d: summing the starting-position probability p(START) and ending-position probability p(END) of each predicted answer segment as its confidence, ranking by confidence, and taking the answer segment with the highest confidence as the answer.
2. The machine reading understanding method of claim 1, wherein structuring the insurance document in step a comprises:
step a1: classifying the insurance document according to its document structure into one of the following categories:
type 1: titles on the left side of the document with the corresponding content on the right;
type 2: a multi-level directory structure, comprising at least any one or a combination of parent titles, subtitles, and subtitle content;
type 3: titles followed by their corresponding content;
step a2: structuring the insurance document according to its document-structure classification.
3. The machine reading understanding method of claim 2, wherein structuring the insurance document in step a further comprises:
step a3: analyzing the line-break characteristics of different scenarios, screening out line breaks caused by insufficient width, confirming the preconditions of abnormal line breaks, deleting the abnormal line breaks under the global structure while keeping paragraphs within chapters independent of one another, and structuring the insurance document accordingly.
4. The machine reading understanding method of claim 1, further comprising, before step b:
step B1: constructing input chapter-question pairs (P_i, Q_j) using a twin-network structure, wherein the base encoder is the variable-length pre-trained model RoFormer, Pairwise Loss is used, and cosine similarity serves as the basic distance metric;
step B2: encoding an input chapter P_i and a question Q_j to be answered separately, applying average pooling to the embedding-layer representation and to the last layer's output, and averaging the two pooled representations to obtain low-dimensional vectors for the chapter and the question; if the chapter P_i contains information that can answer Q_j, the chapter P_i and the question Q_j are deemed similar.
5. The machine reading understanding method of claim 1, wherein step c comprises:
step c1: splicing the question text Q and the chapter text P, introducing the additional identifiers [CLS] and [SEP]: [[CLS], Q, [SEP], P];
step c2: encoding the splicing result of step c1 using the variable-length pre-trained model RoFormer as the base encoder to obtain the corresponding model representation;
step c3: inputting the representation of step c2 into three linear classifiers, outputting the probability p(no_answer) that chapter P contains the answer to question Q, the probability p(START) of the answer's starting position, and the probability p(END) of the answer's ending position.
6. The machine reading understanding method of claim 5, wherein in step c1, if the length of the splicing result exceeds 512, the splicing result is split, e.g. into [[CLS], Q, [SEP], P_1], [[CLS], Q, [SEP], P_2], ..., [[CLS], Q, [SEP], P_M], where P = [P_1, P_2, ..., P_M].
7. The machine reading understanding method of claim 1, wherein step d comprises:
step d1: summing the starting-position probability p(START) and ending-position probability p(END) of each predicted answer segment as its confidence;
step d2: screening the predicted answer segments and excluding those that do not meet the requirements;
step d3: ranking the remaining segments by confidence and taking the answer segment with the highest confidence as the answer.
8. The machine reading understanding method of claim 7, wherein step d2 comprises at least any one or more of the following:
i. excluding segments whose starting position START > ending position END;
ii. comparing the probability p(no_answer) that chapter P contains the answer to question Q with a first threshold p_ANS, and excluding the segment if p(no_answer) is smaller than p_ANS; or
iii. excluding the segment if p(START) + p(END) < the first threshold p_ANS.
9. The machine reading understanding method of any one of claims 1 to 8, wherein encoding the chapter set P and building the index I in step b uses an offline encoded representation, and the index I is constructed with the fast vector-indexing tool faiss.
10. A machine reading understanding apparatus for extractive insurance documents, comprising:
a memory for storing instructions executable by the processor;
a processor for executing the instructions to implement the method of any one of claims 1-9.
11. A computer-readable medium having stored thereon computer program code which, when executed by a processor, implements the method of any of claims 1-9.
CN202210951328.6A 2022-08-09 2022-08-09 Machine reading understanding method and device for extractive insurance documents and readable medium Pending CN115329058A

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210951328.6A CN115329058A 2022-08-09 2022-08-09 Machine reading understanding method and device for extractive insurance documents and readable medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210951328.6A CN115329058A 2022-08-09 2022-08-09 Machine reading understanding method and device for extractive insurance documents and readable medium

Publications (1)

Publication Number Publication Date
CN115329058A true CN115329058A (en) 2022-11-11

Family

ID=83921597

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210951328.6A Pending CN115329058A (en) 2022-08-09 2022-08-09 Machine reading understanding method and device for removable insurance document and readable medium

Country Status (1)

Country Link
CN (1) CN115329058A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116401345A (en) * 2023-03-09 2023-07-07 北京海致星图科技有限公司 Intelligent question-answering method, device, storage medium and equipment


Similar Documents

Publication Publication Date Title
CN112015859B (en) Knowledge hierarchy extraction method and device for text, computer equipment and readable medium
CN110427623A (en) Semi-structured document Knowledge Extraction Method, device, electronic equipment and storage medium
CN111241237B (en) Intelligent question-answer data processing method and device based on operation and maintenance service
CN110135457A (en) Event trigger word abstracting method and system based on self-encoding encoder fusion document information
CN112101041B (en) Entity relationship extraction method, device, equipment and medium based on semantic similarity
CN113268586A (en) Text abstract generation method, device, equipment and storage medium
CN112464656A (en) Keyword extraction method and device, electronic equipment and storage medium
CN116304745B (en) Text topic matching method and system based on deep semantic information
CN116842126B (en) Method, medium and system for realizing accurate output of knowledge base by using LLM
CN114625866A (en) Method, device, equipment and medium for training abstract generation model
CN115390806A (en) Software design mode recommendation method based on bimodal joint modeling
CN116050352A (en) Text encoding method and device, computer equipment and storage medium
CN116881425A (en) Universal document question-answering implementation method, system, device and storage medium
CN115329058A (en) Machine reading understanding method and device for removable insurance document and readable medium
CN114611520A (en) Text abstract generating method
CN117036833B (en) Video classification method, apparatus, device and computer readable storage medium
CN115730051A (en) Text processing method and device, electronic equipment and storage medium
CN114925175A (en) Abstract generation method and device based on artificial intelligence, computer equipment and medium
CN114580397A (en) Method and system for detecting &lt; 35881 &gt; and cursory comments
CN114282537A (en) Social text-oriented cascade linear entity relationship extraction method
CN114611489A (en) Text logic condition extraction AI model construction method, extraction method and system
CN111309866A (en) System and method for intelligently retrieving written materials by utilizing semantic fuzzy search
CN112926340A (en) Semantic matching model for knowledge point positioning
CN115114915B (en) Phrase identification method, device, equipment and medium
CN115878815B (en) Legal document judgment result prediction method, legal document judgment result prediction device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination