CN114662482B

CN114662482B - Extraction method and device for answer text in text form

Info

Publication number: CN114662482B
Application number: CN202210306095.4A
Authority: CN
Inventors: 利秀明; 郎凯; 胡殿明; 刘雨亮
Original assignee: Beijing Ganyi Intelligent Technology Co ltd
Current assignee: Beijing Ganyi Intelligent Technology Co ltd
Priority date: 2022-03-25
Filing date: 2022-03-25
Publication date: 2024-06-18
Anticipated expiration: 2042-03-25
Also published as: CN114662482A

Abstract

The invention provides a method and a device for extracting answer text from a text table. The method comprises the following steps: extracting a table from text data to be processed, and acquiring a question text vector corresponding to a question text; acquiring a cell coordinate vector corresponding to the coordinates of each cell and a cell text vector corresponding to the text in each cell, and splicing the cell coordinate vector and the cell text vector into a cell splicing vector; inputting the cell splicing vector into an index identification model, and determining index cells and non-index cells; inputting the cell splicing vectors of the non-index cells and the cell splicing vectors of the index cells located to the left of and above the non-index cells into a feature fusion model to obtain context vectors; and splicing each context vector with the question text vector, inputting the result into an answer extraction model, determining answer cells and non-answer cells, and determining the text in the answer cells as the answer text. The method and the device for extracting answer text from a text table can improve the extraction precision.

Description

Extraction method and device for answer text in text form

Technical Field

The present invention relates to the field of computer technologies, and in particular, to a method and an apparatus for extracting answer text in a text table.

Background

A text form (text table) refers to a table in text data. Currently, answer text is extracted from a text table mainly by four types of methods: a template matching method that does not consider the table structure, a discriminant method that does not consider the table structure, a generative method that does not consider the table structure, and a template matching method for standard two-dimensional tables.

The template matching method that does not consider the table structure directly separates the cells with separators and then performs matching extraction through manually designed rules such as regular expressions. The table structure information and the logical connections between cells are lost, so matches are easily missed.

The discriminant method that does not consider the table structure directly flattens the table, splices the cell contents, treats the result as natural language text, and then performs subsequent processing as a discriminative natural language understanding task. The table structure information is lost and the semantics are incoherent, so the extraction precision is low.

The generative method that does not consider the table structure directly flattens the table, splices the cell contents, and then generates natural language text through a text generation model. Although this improves semantic coherence, the generation model is complex and difficult to train, and errors propagate, which affects the extraction precision.

The template matching method for standard two-dimensional tables targets only standard two-dimensional tables and performs template matching extraction through manually designed rules. It does not consider the varied organization structures of complex tables, so mismatches or missed matches occur easily. Complex structures may include merged cells, index cells located in the table body rather than in the header, and the like.

In summary, existing methods for extracting answer text from text tables suffer from low precision.

Disclosure of Invention

The invention provides a method and a device for extracting answer texts in a text form, which are used for solving the defect of lower extraction precision in the prior art and realizing higher-precision extraction of the answer texts in the form.

The invention provides a method for extracting answer text in a text form, which comprises the following steps:

extracting a table from text data to be processed, and acquiring a question text vector corresponding to a question text;

acquiring a cell coordinate vector corresponding to the coordinates of each cell in the table and a cell text vector corresponding to the text in each cell, and splicing the cell coordinate vector and the cell text vector into a cell splicing vector of each cell;

inputting cell splicing vectors of each cell into an index identification model respectively, classifying each cell, and determining index cells and non-index cells in each cell;

for each non-index cell, inputting a cell splicing vector of each non-index cell and cell splicing vectors of index cells positioned on the left side and the upper side of each non-index cell into a feature fusion model, and carrying out feature fusion to obtain a context vector of each non-index cell;

and respectively splicing the context vector of each non-index cell with the question text vector, inputting the result into an answer extraction model, classifying each non-index cell, determining answer cells and non-answer cells among the non-index cells, and determining the text in the answer cells as the answer text corresponding to the question text.

According to the method for extracting answer text in a text table provided by the invention, the method for obtaining the cell coordinate vector corresponding to the coordinates of each cell in the table comprises the following steps:

acquiring coordinates of each cell in the table;

and for each cell, inputting the coordinates of the cell into a coordinate feature extraction model, and carrying out vectorization representation on the coordinates of the cell to obtain the cell coordinate vector corresponding to the coordinates of the cell output by the coordinate feature extraction model.

According to the method for extracting answer text in a text table provided by the invention, for each non-index cell, a cell splicing vector of each non-index cell and cell splicing vectors of index cells positioned on the left side and the upper side of each non-index cell are input into a feature fusion model, feature fusion is performed, and a context vector of each non-index cell is obtained, wherein the method comprises the following steps:

determining index cells positioned on the left side and the upper side of each non-index cell based on the coordinates of each non-index cell and the coordinates of each index cell;

and inputting the cell splicing vector of each non-index cell and the cell splicing vector of each index cell positioned on the left side and the upper side of each non-index cell into a feature fusion model, and carrying out feature fusion to obtain the context vector of each non-index cell.

According to the method for extracting the answer text in the text table provided by the invention, the method for obtaining the question text vector corresponding to the question text comprises the following steps:

inputting the question text into a question text feature extraction model, and vectorizing the question text to obtain the question text vector corresponding to the question text output by the question text feature extraction model.

According to the method for extracting answer text in a text table provided by the invention, cell text vectors corresponding to texts in each cell are obtained, and the method comprises the following steps:

inputting the text in each cell into a cell text feature extraction model, and carrying out vectorization representation on the text in each cell to obtain the cell text vector corresponding to the text in each cell output by the cell text feature extraction model.

According to the method for extracting answer text in text form provided by the invention, before extracting the form in the text data to be processed and obtaining the question text vector corresponding to the question text, the method further comprises:

and acquiring the text data to be processed and the question text.

The invention also provides a device for extracting the answer text in the text form, which comprises:

a text representation module, used for extracting a table from text data to be processed and acquiring a question text vector corresponding to a question text;

a feature splicing module, used for acquiring a cell coordinate vector corresponding to the coordinates of each cell in the table and a cell text vector corresponding to the text in each cell, and splicing the cell coordinate vector and the cell text vector into a cell splicing vector of each cell;

an index identification module, used for respectively inputting the cell splicing vector of each cell into an index identification model, classifying each cell, and determining index cells and non-index cells among the cells;

a feature fusion module, used for, for each non-index cell, inputting the cell splicing vector of the non-index cell and the cell splicing vectors of the index cells positioned on the left side and the upper side of the non-index cell into a feature fusion model for feature fusion, to obtain the context vector of each non-index cell;

and an answer extraction module, used for respectively splicing the context vector of each non-index cell with the question text vector, inputting the result into an answer extraction model, classifying each non-index cell, determining answer cells and non-answer cells among the non-index cells, and determining the text in the answer cells as the answer text corresponding to the question text.

The invention also provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor, when executing the program, implements the method for extracting answer text from a text table as described in any one of the above.

The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method of extracting answer text from a text form as described in any of the above.

The invention also provides a computer program product comprising a computer program which when executed by a processor implements a method of extracting answer text from a text form as described in any one of the above.

According to the method and device for extracting answer text from a text table provided by the invention, the features of the coordinates of each cell and the features of its text are fused to obtain the cell splicing vector of the cell. Classification is performed based on the cell splicing vector of each cell to determine whether the cell is an index cell or a non-index cell. The cell splicing vector of each non-index cell and the cell splicing vectors of the index cells located to the left of and above that non-index cell are subjected to feature fusion to obtain the context vector of the non-index cell. Classification is then performed based on the context vector of each non-index cell and the question text vector corresponding to the question text to determine whether the non-index cell is an answer cell or a non-answer cell, so that the answer text corresponding to the question text is extracted. Because the varied organization structures of complex tables are taken into account through a unified cell characterization method, and the relation between the structural information of the table and the content semantics of the cells is used, the features of the cells can be represented more accurately, a more accurate extraction result of the answer text can be obtained, missed matches and mismatches can be reduced, and the extraction precision of answer text in text tables can be improved.

Drawings

In order to more clearly illustrate the invention or the technical solutions of the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the invention, and other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart of a method for extracting answer text from a text table according to the present invention;

FIG. 2 is a second flow chart of a method for extracting answer text from a text table according to the present invention;

FIG. 3 is a schematic diagram of steps for determining index cells and non-index cells provided by the present invention;

FIG. 4 is a diagram illustrating a step of obtaining a context vector of a non-index cell according to the present invention;

FIG. 5 is a schematic diagram of steps for determining answer cells and non-answer cells provided by the present invention;

FIG. 6 is a schematic diagram of a device for extracting answer text from a text table according to the present invention;

FIG. 7 is a schematic structural diagram of an electronic device provided by the present invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

In the description of embodiments of the present invention, the terms "first," "second," "third," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or order.

In describing embodiments of the present invention, it should be noted that, unless explicitly stated and limited otherwise, the terms "mounted," "connected," and "coupled" should be construed broadly. For example, a connection may be fixed, detachable, or integral; it may be mechanical or electrical; it may be direct or indirect through an intermediate medium; and it may be communication between two elements. The specific meaning of the above terms in embodiments of the present invention can be understood by those of ordinary skill in the art according to the specific circumstances.

The following describes a method and apparatus for extracting answer text in a text table according to the present invention with reference to fig. 1 to fig. 4.

Fig. 1 is a flow chart of a method for extracting answer text in a text table according to the present invention. As shown in fig. 1, an execution body of the method for extracting answer text in a text table according to an embodiment of the present invention may be an apparatus for extracting answer text in a text table, where the method includes: step 101, step 102, step 103, step 104 and step 105.

Step 101, extracting a table in text data to be processed, and acquiring a question text vector corresponding to a question text.

Specifically, the text data to be processed may be a PDF document or a Word document (e.g., a document with suffix doc, docx, wps, or the like), or the like.

The table in the text data to be processed may be extracted by any table extraction method, for example by a table extractor such as PDFMiner.

Question text refers to text for expressing a question.

The question text can be vectorized by any text vectorization method used in natural language processing (NLP), for example a bag-of-words-based method or a word-vector-based representation method. The vector obtained by vectorizing the question text is the question text vector corresponding to the question text.

Bag-of-words-based methods mainly include one-hot encoding, TF-IDF (term frequency-inverse document frequency), the n-gram model, and the like.

Word-vector-based representation methods mainly include Word2vec, Doc2vec, and the like.
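As a minimal sketch of the simplest bag-of-words option named above, the following Python turns a question text into a fixed-length count vector. The function names, the whitespace tokenizer, and the sample sentences are illustrative assumptions and are not taken from the patent; text without whitespace word boundaries would need a real tokenizer.

```python
from collections import Counter


def build_vocabulary(texts):
    """Collect the distinct lowercase whitespace tokens of all texts, sorted."""
    return sorted({token for text in texts for token in text.lower().split()})


def bag_of_words_vector(text, vocabulary):
    """Count how often each vocabulary token occurs in `text`.

    Tokens that are not in the vocabulary are ignored, so every text maps
    to a vector with the same dimension as the vocabulary.
    """
    counts = Counter(text.lower().split())
    return [counts[token] for token in vocabulary]


# Hypothetical corpus used only to fix the vocabulary.
corpus = ["what is the net profit", "net profit in 2021"]
vocabulary = build_vocabulary(corpus)
question_vector = bag_of_words_vector("what is the net profit", vocabulary)
```

Because the vocabulary is fixed before any question is encoded, every question text vector has the same dimension, which the later splicing steps rely on.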

Step 102, obtaining a cell coordinate vector corresponding to the coordinates of each cell in the table and a cell text vector corresponding to the text in each cell, and splicing the cell coordinate vector and the cell text vector into a cell splicing vector of each cell.

Specifically, for each cell in the table obtained in step 101, the coordinates of the cell may be converted into a vector based on any method of converting coordinates into a vector, so as to obtain a cell coordinate vector corresponding to the coordinates of the cell.

For the cell, the text in the cell can be vectorized by any vectorization representation method of the text in natural language processing, and the vectorization representation result of the text in the cell can be obtained. The vectorized representation of the text in the cell results in a vector that is the cell text vector corresponding to the text in the cell.

For any cell, after the cell coordinate vector corresponding to the coordinates of the cell and the cell text vector corresponding to the text in the cell are obtained, the cell coordinate vector and the cell text vector can be spliced, so that a cell splicing vector of the cell is obtained.

It will be appreciated that the dimensions of the cell coordinate vectors corresponding to the coordinates of each cell are the same, and the dimensions of the cell text vectors corresponding to the text in each cell are the same, and thus the dimensions of the cell splice vectors for each cell are the same.
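The splicing in step 102 can be read as plain vector concatenation. The sketch below assumes that reading; the function names and the example numbers are hypothetical, and the dimension check simply makes explicit the requirement that all cells share the same coordinate and text vector dimensions.

```python
def splice_cell_vector(coordinate_vector, text_vector):
    """Concatenate the coordinate vector and the text vector of one cell."""
    return list(coordinate_vector) + list(text_vector)


def splice_all(cells, coord_dim, text_dim):
    """Splice every (coordinate_vector, text_vector) pair, checking dimensions."""
    spliced = []
    for coordinate_vector, text_vector in cells:
        if len(coordinate_vector) != coord_dim or len(text_vector) != text_dim:
            raise ValueError("all cells must use the same vector dimensions")
        spliced.append(splice_cell_vector(coordinate_vector, text_vector))
    return spliced


# Two hypothetical cells: a 4-dimensional coordinate vector and a
# 3-dimensional text vector each.
cells = [
    ([0.0, 0.0, 1.0, 1.0], [0.2, 0.7, 0.1]),
    ([1.0, 0.0, 2.0, 1.0], [0.9, 0.0, 0.4]),
]
cell_splice_vectors = splice_all(cells, coord_dim=4, text_dim=3)
```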

Step 103, respectively inputting the cell splicing vectors of each cell into an index recognition model, classifying each cell, and determining index cells and non-index cells in each cell.

Specifically, the cell splice vector for each cell may be input into an index identification model to determine whether the cell is an index cell or a non-index cell, respectively.

The index identification model can be obtained after training based on cell splicing vectors of the sample cells and labels corresponding to the sample cells. And the label corresponding to the sample cell is used for indicating whether the sample cell is an index cell or a non-index cell.

It can be understood that the method for obtaining the cell splice vector of the sample cell is the same as the method for obtaining the cell splice vector of each cell in step 102, and will not be described herein. The dimension of the cell splice vector of the sample cell is the same as the dimension of the cell splice vector of each cell obtained in step 102.
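To illustrate training on labelled sample cells, here is a small sketch that fits a logistic (sigmoid) classifier by batch gradient descent. It is a stand-in chosen for brevity, not the model the patent prescribes; the two-dimensional sample vectors, the learning rate, and the epoch count are all made-up assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_index_classifier(samples, labels, learning_rate=0.5, epochs=200):
    """Fit weights and a bias by batch gradient descent on the logistic loss.

    `samples` are cell splicing vectors of sample cells; `labels` are 1 for
    index cells and 0 for non-index cells.
    """
    dim = len(samples[0])
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for features, label in zip(samples, labels):
            z = sum(w * x for w, x in zip(weights, features)) + bias
            error = sigmoid(z) - label
            for i, x in enumerate(features):
                grad_w[i] += error * x
            grad_b += error
        weights = [w - learning_rate * g / len(samples)
                   for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(samples)
    return weights, bias


def predict(features, weights, bias):
    """Return 1 (index cell) or 0 (non-index cell)."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if sigmoid(z) >= 0.5 else 0


# Hypothetical, linearly separable sample cells.
samples = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [1, 1, 0, 0]
weights, bias = train_index_classifier(samples, labels)
predictions = [predict(s, weights, bias) for s in samples]
```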

Alternatively, the index recognition model may be a model built based on any of a variety of deep learning methods. The index recognition model may include a feature extractor and a classifier, an output layer of the feature extractor being coupled to an input layer of the classifier.

Illustratively, the index recognition model may be a model constructed based on any neural network, e.g., a convolutional neural network (CNN), a recurrent neural network (RNN), or a Transformer. The representation produced by the output layer of the neural network is fed to a classifier based on the Sigmoid function, so that each cell can be classified as an index cell or a non-index cell.
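The "feature extractor followed by a Sigmoid classifier" structure can be sketched as follows. A single dense layer with tanh activation stands in for the CNN, RNN, or Transformer, and every parameter value is a hand-picked assumption rather than a trained weight.

```python
import math


def feature_extractor(vector, hidden_weights, hidden_biases):
    """One dense tanh layer, used here as a stand-in feature extractor."""
    return [
        math.tanh(sum(w * x for w, x in zip(row, vector)) + b)
        for row, b in zip(hidden_weights, hidden_biases)
    ]


def sigmoid_classifier(features, weights, bias):
    """Map extracted features to the probability of being an index cell."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def is_index_cell(vector, params, threshold=0.5):
    """Feed the extractor output into the classifier and apply a threshold."""
    hidden = feature_extractor(
        vector, params["hidden_weights"], params["hidden_biases"]
    )
    probability = sigmoid_classifier(
        hidden, params["out_weights"], params["out_bias"]
    )
    return probability >= threshold


# Hypothetical parameters for 3-dimensional cell splicing vectors.
params = {
    "hidden_weights": [[1.0, -1.0, 0.0], [0.0, 1.0, 1.0]],
    "hidden_biases": [0.0, 0.0],
    "out_weights": [2.0, -2.0],
    "out_bias": 0.0,
}
```

The 0.5 threshold is a common default for a Sigmoid output; the text does not state which threshold is used.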

Step 104, for each non-index cell, inputting the cell splicing vector of the non-index cell and the cell splicing vectors of the index cells located to the left of and above the non-index cell into a feature fusion model for feature fusion, and obtaining the context vector of each non-index cell.

Specifically, the feature fusion model may serve as a feature fuser. For each non-index cell, the cell splicing vector of the non-index cell and the cell splicing vectors of the index cells located to the left of and above the non-index cell can be input into the feature fusion model to perform feature fusion, so that the context vector of the non-index cell is obtained.

Alternatively, the feature fusion model may be a feature fusion model based on a neural network that contains an attention mechanism (e.g., a Transformer).
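One way to picture attention-based fusion is a single scaled dot-product attention step in which the non-index cell is the query and the candidates are the cell itself plus its left and upper index cells. This is only a sketch under those assumptions: it has no learned projections, and the patent does not specify the architecture beyond "contains an attention mechanism".

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, candidates):
    """Scaled dot-product attention weights of `query` over `candidates`."""
    scale = math.sqrt(len(query))
    scores = [
        sum(q * k for q, k in zip(query, candidate)) / scale
        for candidate in candidates
    ]
    return softmax(scores)


def fuse(cell_vector, index_vectors):
    """Return a context vector as the attention-weighted sum of candidates.

    Including the cell's own vector among the candidates is an assumption
    made so that a cell without index cells still yields a context vector.
    """
    candidates = [cell_vector] + list(index_vectors)
    weights = attention_weights(cell_vector, candidates)
    return [
        sum(w * candidate[i] for w, candidate in zip(weights, candidates))
        for i in range(len(cell_vector))
    ]


# Hypothetical 2-dimensional splicing vectors.
context = fuse([1.0, 0.0], [[0.0, 1.0], [1.0, 1.0]])
```

The context vector keeps the dimension of the cell splicing vector regardless of how many index cells are attended to, which is what allows merged or irregular tables to be handled uniformly.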

The feature fusion model can be obtained after training based on cell splicing vectors of the first sample non-index cells, cell splicing vectors of index cells positioned on the left side and the upper side of the first sample non-index cells in the same sample table, and labels corresponding to the first sample non-index cells. The label corresponding to the first sample non-index cell is the context vector of the first sample non-index cell.

It can be understood that the method for obtaining the cell splice vector of the non-index cell of the first sample is the same as the method for obtaining the cell splice vector of each cell in step 102, and will not be described herein. The dimension of the context vector of the first sample non-index cell is the same as the dimension of the context vector of each non-index cell obtained in step 104.

Step 105, respectively splicing the context vector of each non-index cell with the question text vector, inputting the result into an answer extraction model, classifying each non-index cell, determining answer cells and non-answer cells among the non-index cells, and determining the text in the answer cells as the answer text corresponding to the question text.

Specifically, for each non-index cell, the question text vector and the context vector of the non-index cell may be spliced to obtain the target vector of the non-index cell.

It will be appreciated that the dimensions of the context vector for each non-index cell are the same, and thus the dimensions of the target vector for each non-index cell are the same.

For each non-index cell, the target vector for that non-index cell may be input into an answer extraction model to determine whether the non-index cell is an answer cell or a non-answer cell.

An answer cell is a cell whose text is the answer text corresponding to the question text. That is, the content of the text in the answer cell answers the question in the question text.

A non-answer cell is a cell whose text is not the answer text corresponding to the question text. That is, the content of the text in the non-answer cell does not answer the question in the question text.

The answer extraction model may be obtained after training based on the target vector of the second sample non-index cell and the label corresponding to the second sample non-index cell. The label is used for indicating whether the second sample non-index cell is an answer cell or a non-answer cell.

It can be understood that the method for obtaining the target vector of the second sample non-index cell is the same as the method for obtaining the target vector of each non-index cell in step 105, and will not be described herein. The dimension of the context vector of the second sample non-index cell is the same as the dimension of the context vector of each non-index cell obtained in step 104; the question text vector corresponding to the first sample question text, which is used to obtain the target vector of the second sample non-index cell, has the same dimension as the question text vector in step 101.

Alternatively, the answer extraction model may be a model constructed based on any of the deep learning methods. The answer extraction model may include a feature extractor and a classifier, an output layer of the feature extractor being connected to an input layer of the classifier.

Illustratively, the answer extraction model may be a model constructed based on any neural network (e.g., CNN, RNN, or Transformer). The representation produced by the output layer of the neural network is fed to a classifier based on the Sigmoid function, so that each non-index cell can be classified as an answer cell or a non-answer cell.
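Step 105 can be sketched end to end as: splice each context vector with the question text vector to form the target vector, score it with a Sigmoid classifier, and keep the text of the cells scored as answer cells. The linear scorer, its weights, the threshold, and the two sample cells are assumptions for illustration only.

```python
import math


def target_vector(context_vector, question_vector):
    """Splice the context vector of a non-index cell with the question vector."""
    return list(context_vector) + list(question_vector)


def answer_probability(target, weights, bias):
    """Sigmoid score of a target vector under a linear stand-in model."""
    z = sum(w * x for w, x in zip(weights, target)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def extract_answers(cells, question_vector, weights, bias, threshold=0.5):
    """Return the text of every non-index cell classified as an answer cell.

    `cells` is a list of (cell_text, context_vector) pairs.
    """
    answers = []
    for text, context_vector in cells:
        target = target_vector(context_vector, question_vector)
        if answer_probability(target, weights, bias) >= threshold:
            answers.append(text)
    return answers


# Hypothetical non-index cells with 2-dimensional context vectors and a
# 1-dimensional question text vector.
cells = [("1,200", [1.0, 0.0]), ("980", [0.0, 1.0])]
question_vector = [1.0]
weights = [2.0, -2.0, 0.5]
bias = -1.0
answers = extract_answers(cells, question_vector, weights, bias)
```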

According to the embodiment of the invention, the features of the coordinates of each cell and the features of its text are fused to obtain the cell splicing vector of the cell. Classification is performed based on the cell splicing vector of each cell to determine whether the cell is an index cell or a non-index cell. The cell splicing vector of each non-index cell and the cell splicing vectors of the index cells located to the left of and above that non-index cell are subjected to feature fusion to obtain the context vector of the non-index cell. Classification is then performed based on the context vector of each non-index cell and the question text vector corresponding to the question text to determine whether the non-index cell is an answer cell or a non-answer cell, so that the answer text corresponding to the question text is extracted. Because the varied organization structures of complex tables are taken into account through a unified cell characterization method, and the relation between the structural information of the table and the content semantics of the cells is used, the features of the cells can be represented more accurately, a more accurate extraction result of the answer text can be obtained, missed matches and mismatches can be reduced, and the extraction precision of answer text in text tables can be improved.

Based on the foregoing any one of the embodiments, obtaining a cell coordinate vector corresponding to a coordinate of each cell in the table includes: the coordinates of each cell in the table are obtained.

Specifically, the upper left vertex of the first cell in the upper left corner of the table may be used as the origin of the two-dimensional coordinate system, and the left-right direction and the up-down direction may be respectively used as one coordinate axis in the two-dimensional coordinate system to establish the two-dimensional coordinate system.

Alternatively, the coordinates of the top left vertex and the coordinates of the bottom right vertex of each cell in the two-dimensional coordinate system may be obtained by a coordinate extractor, and the coordinates of the top left vertex and the coordinates of the bottom right vertex of the cell may be combined, so as to obtain the coordinates of the cell.

Alternatively, the coordinates of the lower left vertex and/or the upper right vertex of the cell may be combined on the basis of the coordinates of the upper left vertex and the coordinates of the lower right vertex of the cell, to obtain the coordinates of the cell.
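A small sketch of how cell coordinates might be formed from the upper-left and lower-right vertices, including a merged cell. It assumes that positions are measured in grid units from the table's upper-left corner and that each vertex is written as (position along the up-down axis, position along the left-right axis); the text fixes neither the unit nor the axis order, and the table content is invented.

```python
def cell_coordinates(row, col, row_span=1, col_span=1):
    """Return the upper-left vertex followed by the lower-right vertex.

    The result is (top, left, bottom, right) in grid units, with the origin
    at the upper-left vertex of the first cell of the table.
    """
    return (row, col, row + row_span, col + col_span)


# Hypothetical table whose first row is one cell merged across two columns.
table = [
    ("Item / Year", cell_coordinates(0, 0, col_span=2)),
    ("Revenue", cell_coordinates(1, 0)),
    ("1,200", cell_coordinates(1, 1)),
]
coordinates = {text: coords for text, coords in table}
```

Because a merged cell simply gets a larger lower-right vertex, merged and ordinary cells share one coordinate format.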

For each cell, inputting the coordinate of each cell into a coordinate feature extraction model, and vectorizing the coordinate of each cell to obtain a cell coordinate vector corresponding to the coordinate of each cell output by the coordinate feature extraction model.

Specifically, the coordinate feature extraction model may be used as a feature extractor, the coordinates of each cell may be input into the coordinate feature extraction model, and feature extraction and vectorization are performed on the coordinates of the cell, so as to obtain a cell coordinate vector corresponding to the coordinates of the cell.

The coordinate feature extraction model may be obtained after training based on the sample coordinates of cells and the labels corresponding to the sample coordinates. The label corresponding to the sample coordinates of a cell may be the cell coordinate vector corresponding to those sample coordinates.

It will be appreciated that the dimension of the cell coordinate vector corresponding to the sample coordinates of the cells is the same as the dimension of the cell coordinate vector corresponding to the coordinates of each cell obtained in step 102.

Alternatively, the coordinate feature extraction model may be a model constructed based on any of the deep learning methods.

Illustratively, the coordinate feature extraction model may be a model constructed based on any neural network (e.g., CNN or Transformer).

According to the embodiment of the invention, the coordinates of each cell are vectorized through the coordinate feature extraction model, the cell coordinate vector corresponding to the coordinates of each cell is obtained, the features of the cells can be more accurately represented by utilizing the structural information of the form through a unified cell characterization method, further, more accurate extraction results of answer texts are obtained, the phenomena of missing matching, mismatching and the like can be reduced, and the extraction precision of the answer texts in the text form can be improved.

Based on the content of any of the foregoing embodiments, for each non-index cell, inputting the cell splicing vector of the non-index cell and the cell splicing vectors of the index cells located to the left of and above the non-index cell into a feature fusion model, performing feature fusion, and obtaining the context vector of each non-index cell, includes: determining the index cells located to the left of and above each non-index cell based on the coordinates of each non-index cell and the coordinates of the index cells.

Specifically, for the coordinates of each non-index cell, the index cells located to the left and above the non-index cell may be determined by comparing the coordinates of the non-index cell with the coordinates of the index cells.

For a certain non-index cell a, among all index cells, an index cell whose abscissa of the upper left vertex is equal to the abscissa of the upper left vertex of the non-index cell a, or whose abscissa of the lower right vertex is equal to the abscissa of the lower right vertex of the non-index cell a, and whose ordinate of the lower right vertex is smaller than the ordinate of the lower right vertex of the non-index cell a, is the index cell located to the left of the non-index cell a.

For a certain non-index cell a, among all index cells, an index cell whose upper left vertex has an ordinate equal to the ordinate of the upper left vertex of the non-index cell a or whose lower right vertex has an ordinate equal to the ordinate of the lower right vertex of the non-index cell a, and whose upper left vertex has an abscissa smaller than the abscissa of the upper left vertex of the non-index cell a, is an index cell located above the non-index cell a.
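The two conditions above can be transcribed almost literally into code. The sketch below assumes that the "abscissa" runs along the up-down axis and the "ordinate" along the left-right axis, which is the reading under which the stated conditions select cells in the same row (left) and in the same column (above); the text does not state this axis assignment, and the `Cell` type and the sample table are hypothetical.

```python
from typing import NamedTuple


class Cell(NamedTuple):
    text: str
    ul_abscissa: float  # upper-left vertex, assumed up-down axis
    ul_ordinate: float  # upper-left vertex, assumed left-right axis
    lr_abscissa: float  # lower-right vertex, assumed up-down axis
    lr_ordinate: float  # lower-right vertex, assumed left-right axis


def left_index_cells(cell, index_cells):
    """Index cells sharing an abscissa with `cell` and a smaller ordinate."""
    return [
        c for c in index_cells
        if (c.ul_abscissa == cell.ul_abscissa
            or c.lr_abscissa == cell.lr_abscissa)
        and c.lr_ordinate < cell.lr_ordinate
    ]


def above_index_cells(cell, index_cells):
    """Index cells sharing an ordinate with `cell` and a smaller abscissa."""
    return [
        c for c in index_cells
        if (c.ul_ordinate == cell.ul_ordinate
            or c.lr_ordinate == cell.lr_ordinate)
        and c.ul_abscissa < cell.ul_abscissa
    ]


# Hypothetical 2x2 table: a corner cell, a column header, a row label and
# one value cell.
index_cells = [
    Cell("Item", 0, 0, 1, 1),
    Cell("2021", 0, 1, 1, 2),
    Cell("Revenue", 1, 0, 2, 1),
]
value_cell = Cell("1,200", 1, 1, 2, 2)
left = [c.text for c in left_index_cells(value_cell, index_cells)]
above = [c.text for c in above_index_cells(value_cell, index_cells)]
```

Matching on either the upper-left or the lower-right vertex lets a merged index cell that spans several rows or columns still be associated with cells aligned to one of its edges.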

Specifically, after determining each index cell located on the left side and the upper side of each non-index cell, the cell splicing vector of the non-index cell and the cell splicing vector of each index cell located on the left side and the upper side of the non-index cell may be input into a feature fusion model, and feature fusion is performed to obtain the context vector of the non-index cell.

According to the embodiment of the invention, the index cells positioned on the left and above the non-index cells are determined based on the coordinates of the non-index cells and the coordinates of the index cells, so that the index cells positioned on the left and above the non-index cells can be more accurately determined, the characteristics of the cells can be more accurately represented by utilizing the relation between the structural information of the form and the semantics of the content of the cells, more accurate extraction results of answer texts can be obtained, the phenomena of missed matching, mismatching and the like can be reduced, and the extraction precision of the answer texts in the text form can be improved.

Based on the content of any one of the above embodiments, obtaining a question text vector corresponding to the question text includes: inputting the question text into a question text feature extraction model, and vectorizing the question text to obtain the question text vector corresponding to the question text output by the question text feature extraction model.

Specifically, the question text feature extraction model can be used as a feature extractor, and the question text is vectorized by the question text feature extraction model, so that a question text vector corresponding to the question text is obtained.

The question text feature extraction model may be obtained after training based on the second sample question text and a question text vector corresponding to the second sample question text.

It will be appreciated that the dimension of the question text vector corresponding to the second sample question text is the same as the dimension of the question text vector in step 101.

Optionally, the question text feature extraction model may be a model built based on any of a variety of deep learning methods.

Illustratively, the question text feature extraction model may be a model built based on any neural network (e.g., a CNN, an RNN, or a Transformer).
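
To make the input and output of such a feature extractor concrete, the following Python sketch maps a text to a fixed-dimension vector using a hashed bag of words. This is a deliberately simple stand-in, not the trained CNN, RNN, or Transformer model of the embodiments; the function name `text_to_vector` and the default dimension are assumptions made for this example.

```python
import zlib


def text_to_vector(text, dim=8):
    """Map a text to a fixed-length vector of hashed token counts.

    Stand-in for a trained feature extraction model: each whitespace token
    is hashed with CRC32 into one of `dim` buckets and counted.
    """
    vec = [0.0] * dim
    for token in text.lower().split():
        bucket = zlib.crc32(token.encode("utf-8")) % dim
        vec[bucket] += 1.0
    return vec
```

Whatever model is used, the output dimension is fixed in advance, which is consistent with the requirement that the vectors seen during training and the vectors produced at inference have the same dimension.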

According to the embodiment of the invention, the question text is vectorized through the question text feature extraction model to obtain the question text vector corresponding to the question text, so that the features of the question text can be more accurately represented, a more accurate extraction result of the answer text can be obtained, missed matches and mismatches can be reduced, and the extraction precision of the answer text in the text table can be improved.

Based on the content of any of the foregoing embodiments, obtaining a cell text vector corresponding to the text in each cell includes: inputting the text in each cell into a cell text feature extraction model, and vectorizing the text in each cell to obtain the cell text vector corresponding to the text in each cell output by the cell text feature extraction model.

Specifically, the cell text feature extraction model may be used as a feature extractor, and the text in each cell is represented in a vectorization manner by the cell text feature extraction model, so as to obtain a cell text vector corresponding to the text in the cell.

The cell text feature extraction model may be obtained after training based on the sample text and a cell text vector corresponding to the sample text.

It will be appreciated that the dimensions of the cell text vector corresponding to the sample text are the same as the dimensions of the cell text vector corresponding to the text in each cell in step 102.

Optionally, the cell text feature extraction model may be a model built based on any of a variety of deep learning methods.

Illustratively, the cell text feature extraction model may be a model built based on any neural network (e.g., a CNN, an RNN, or a Transformer).

According to the embodiment of the invention, the text in each cell is vectorized through the cell text feature extraction model, so that the cell text vector corresponding to the text in the cell is obtained, the features of the text in the cell can be more accurately represented, further, a more accurate answer text extraction result is obtained, the phenomena of missing matching, mismatching and the like can be reduced, and the extraction precision of the answer text in the text table can be improved.

Based on the content of any one of the foregoing embodiments, before extracting a table in text data to be processed and obtaining a question text vector corresponding to a question text, the method further includes: acquiring the text data to be processed and the question text.

Specifically, the text data to be processed input by the user may be obtained, or the text data to be processed sent by other electronic devices may be received.

Question text entered by the user may be obtained, or question text sent by other electronic devices may be received.

According to the embodiment of the invention, the text data to be processed and the question text are acquired, so that the answer text in the text form can be extracted more conveniently.

FIG. 2 is a second flowchart of a method for extracting answer text from a text table according to the present invention. Illustratively, as shown in fig. 2, the extraction method of answer text in the text table may include the following steps:

Step 201, acquiring text data to be processed and a question text.

Step 202, extracting a table through a table extractor based on the text data to be processed.

Step 203, based on the question text, obtaining a question text vector corresponding to the question text through a feature extractor.

Step 204, based on the table, the coordinates of each cell are obtained by a coordinate extractor, and the cell coordinate vector corresponding to the coordinates of each cell is obtained by a feature extractor.

Step 205, based on the table, obtaining, by a feature extractor, a cell text vector corresponding to the text in each cell.

Step 206, splicing the cell coordinate vector corresponding to the coordinates of each cell and the cell text vector corresponding to the text in the cell to obtain the cell splicing vector of the cell, and classifying the cell splicing vector by a feature extractor connected to a classifier to obtain index cells and non-index cells.

The process of classifying the cells and determining index cells and non-index cells may be as shown in fig. 3.
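
The binary classification in step 206 may be sketched as follows, assuming for illustration that the classifier is a single linear layer followed by a sigmoid. The names `classify_cell`, `weights`, and `bias` are hypothetical; in practice the parameters would be learned during training rather than set by hand, and the classifier of the embodiments may have a different structure.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def classify_cell(splice_vec, weights, bias, threshold=0.5):
    """Return True when the cell is classified as an index cell.

    Assumed classifier head: a linear score over the cell splicing vector,
    passed through a sigmoid and compared with a threshold.
    """
    score = sum(w * x for w, x in zip(weights, splice_vec)) + bias
    return sigmoid(score) > threshold
```

The same form of classifier head could be reused for the answer / non-answer decision, with the spliced context and question vector as its input.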

Step 207, fusing the cell splicing vector of each non-index cell and the cell splicing vector of each index cell positioned on the left side and the upper side of each non-index cell by a feature fusion device to obtain the context vector of the non-index cell.

The process of obtaining the context vector for the non-indexed cell through feature fusion may be as shown in fig. 4.

The index cell located to the left of the non-index cell is the left index cell of the non-index cell; the index cell located above the non-index cell is the upper index cell of the non-index cell.

Step 208, splicing the context vector of each non-index cell with the question text vector, and classifying the spliced vector by a feature extractor connected to a classifier to obtain answer cells and non-answer cells, wherein the text in the answer cells is the answer text corresponding to the question text.

The process of classifying the non-index cells and determining answer cells and non-answer cells may be as shown in fig. 5.
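
The flow from step 204 to step 208 may be summarized by the following Python sketch. It shows only the order of operations; every model is passed in as a caller-supplied callable, and the function and parameter names (`extract_answer_texts`, `splice_vector`, `is_index`, `context_index_ids`, `fuse`, `is_answer`) are hypothetical. Each cell is assumed to be a dictionary with a `"text"` entry.

```python
def extract_answer_texts(cells, question_vec, splice_vector, is_index,
                         context_index_ids, fuse, is_answer):
    """Return the texts of the cells classified as answer cells.

    cells             -- list of dicts, each with at least a "text" key
    question_vec      -- question text vector (list of floats)
    splice_vector     -- callable: cell -> cell splicing vector
    is_index          -- callable: splicing vector -> bool
    context_index_ids -- callable: (cell id, index ids) -> (left ids, above ids)
    fuse              -- callable: (cell vec, left vecs, above vecs) -> context vec
    is_answer         -- callable: spliced context and question vector -> bool
    """
    # Steps 204 to 206: one splicing vector per cell, then index / non-index split.
    vectors = [splice_vector(cell) for cell in cells]
    index_ids = [i for i, vec in enumerate(vectors) if is_index(vec)]

    answers = []
    for i, cell in enumerate(cells):
        if i in index_ids:
            continue
        # Step 207: fuse the cell with its left and above index cells.
        left_ids, above_ids = context_index_ids(i, index_ids)
        context = fuse(vectors[i],
                       [vectors[j] for j in left_ids],
                       [vectors[j] for j in above_ids])
        # Step 208: splice the context vector with the question vector and classify.
        target = list(context) + list(question_vec)
        if is_answer(target):
            answers.append(cell["text"])
    return answers
```

Passing the models in as callables keeps the sketch independent of any particular network architecture, in line with the statement that each model may be built on any of a variety of deep learning methods.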

The device for extracting the answer text in the text table provided by the invention is described below, and the device for extracting the answer text in the text table described below and the method for extracting the answer text in the text table described above can be correspondingly referred to each other.

Fig. 6 is a schematic structural diagram of an answer text extracting device in a text table provided by the invention. Based on the content of any of the foregoing embodiments, as shown in fig. 6, the apparatus includes a text representation module 601, a feature stitching module 602, an index recognition module 603, a feature fusion module 604, and an answer extraction module 605, where:

The text representation module 601 is configured to extract a table in text data to be processed, and obtain a question text vector corresponding to a question text;

The feature stitching module 602 is configured to obtain a cell coordinate vector corresponding to the coordinates of each cell in the table and a cell text vector corresponding to the text in each cell, and splice the cell coordinate vector and the cell text vector into a cell splicing vector of each cell;

The index recognition module 603 is configured to input the cell splicing vector of each cell into an index recognition model, classify each cell, and determine index cells and non-index cells among the cells;

The feature fusion module 604 is configured to input, for each non-index cell, the cell splicing vector of the non-index cell and the cell splicing vectors of the index cells located to the left of and above the non-index cell into a feature fusion model, and perform feature fusion to obtain a context vector of each non-index cell;

The answer extraction module 605 is configured to splice the context vector of each non-index cell and the question text vector, input an answer extraction model, classify each non-index cell, determine an answer cell and a non-answer cell in each non-index cell, and determine a text in the answer cell as an answer text corresponding to the question text.

Specifically, the text representation module 601, the feature stitching module 602, the index recognition module 603, the feature fusion module 604, and the answer extraction module 605 may be electrically connected in sequence.

The text representation module 601 may extract the form in the text data to be processed by any form extraction method.

The text representation module 601 may further perform vectorization representation on the question text by using any vectorization representation method of the text in the natural language processing method, to obtain a question text vector corresponding to the question text.

For each cell in the table extracted by the text representation module 601, the feature stitching module 602 may convert the coordinate of the cell into a vector based on any method for converting the coordinate into a vector, so as to obtain a cell coordinate vector corresponding to the coordinate of the cell; the text in the cell can be vectorized through any vectorization representation method of the text in the natural language processing, so that a cell text vector corresponding to the text in the cell can be obtained; and the cell coordinate vector and the cell text vector can be spliced, so that a cell splicing vector of the cell is obtained.
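
One possible way to convert coordinates into a vector and splice it with a text vector is sketched below in Python. Normalizing each vertex coordinate by the table extent is an assumption made for this example; the embodiments allow any conversion method, including a trained coordinate feature extraction model. The names `coordinate_vector` and `splice` are hypothetical.

```python
def coordinate_vector(top_left, bottom_right, max_abscissa, max_ordinate):
    """Turn the two vertices of a cell into a 4-dimensional vector.

    Each vertex is an (abscissa, ordinate) pair. Coordinates are divided by
    the table extents, which are assumed to be positive.
    """
    return [
        top_left[0] / max_abscissa,
        top_left[1] / max_ordinate,
        bottom_right[0] / max_abscissa,
        bottom_right[1] / max_ordinate,
    ]


def splice(coord_vec, text_vec):
    """Concatenate the cell coordinate vector and the cell text vector."""
    return list(coord_vec) + list(text_vec)
```

The cell splicing vector produced this way has a dimension equal to the sum of the dimensions of the two input vectors.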

The index recognition module 603 may input the cell splicing vector of each cell into the index recognition model to determine whether the cell is an index cell or a non-index cell.

For each non-index cell, the feature fusion module 604 may input the cell splicing vector of the non-index cell and the cell splicing vectors of the index cells located to the left of and above the non-index cell into the feature fusion model to perform feature fusion, thereby obtaining the context vector of the non-index cell.

For each non-index cell, the answer extraction module 605 may splice the question text vector and the context vector of the non-index cell to obtain a target vector of the non-index cell; the target vector of the non-index cell may also be input into an answer extraction model to determine whether the non-index cell is an answer cell or a non-answer cell.

Optionally, the feature stitching module 602 may include:

The coordinate feature extraction unit is used for obtaining the coordinates of each cell in the table; for each cell, inputting the coordinate of each cell into a coordinate feature extraction model, and vectorizing the coordinate of each cell to obtain a cell coordinate vector corresponding to the coordinate of each cell output by the coordinate feature extraction model.

Optionally, the feature fusion module 604 may be specifically configured to:

determine the index cells located to the left of and above each non-index cell based on the coordinates of each non-index cell and the coordinates of each index cell;

Optionally, the text representation module 601 may include:

The question text feature extraction unit is used for inputting the question text into the question text feature extraction model, vectorizing the question text, and obtaining the question text vector corresponding to the question text output by the question text feature extraction model.

Optionally, the feature stitching module 602 may include:

The cell text feature extraction unit is used for inputting the text in each cell into the cell text feature extraction model, carrying out vectorization representation on the text in each cell, and obtaining a cell text vector corresponding to the text in each cell output by the cell text feature extraction model.

Optionally, the extracting device of answer text in the text table may further include:

The data acquisition module is used for acquiring the text data to be processed and the question text.

The device for extracting the answer text in the text table provided by the embodiment of the invention is used for executing the method for extracting the answer text in the text table, the implementation mode of the device is consistent with the implementation mode of the method for extracting the answer text in the text table provided by the invention, the same beneficial effects can be achieved, and the description is omitted here.

The extraction device of the answer text in the text table is used for the extraction method of the answer text in the text table in the foregoing embodiments. Therefore, the description and definition in the extraction method of answer text in the text table in the foregoing embodiments may be used for understanding each execution module in the embodiments of the present invention.

Fig. 7 is a schematic structural diagram of an electronic device according to the present invention. As shown in fig. 7, the electronic device may include: a processor 710, a communication interface 720, a memory 730, and a communication bus 740, wherein the processor 710, the communication interface 720, and the memory 730 communicate with each other via the communication bus 740. The processor 710 may invoke logic instructions in the memory 730 to perform a method for extracting answer text in a text table, the method comprising: extracting a table in text data to be processed, and acquiring a question text vector corresponding to a question text; acquiring a cell coordinate vector corresponding to the coordinates of each cell in the table and a cell text vector corresponding to the text in each cell, and splicing the cell coordinate vector and the cell text vector into a cell splicing vector of each cell; inputting the cell splicing vector of each cell into an index recognition model, classifying each cell, and determining index cells and non-index cells among the cells; for each non-index cell, inputting the cell splicing vector of the non-index cell and the cell splicing vectors of the index cells located to the left of and above the non-index cell into a feature fusion model for feature fusion, to obtain a context vector of each non-index cell; and splicing the context vector of each non-index cell with the question text vector, inputting the result into an answer extraction model, classifying each non-index cell, determining answer cells and non-answer cells among the non-index cells, and determining the text in the answer cells as the answer text corresponding to the question text.

Further, the logic instructions in the memory 730 described above may be implemented in the form of software functional units and, when sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part thereof that contributes over the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method of the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.

The processor 710 in the electronic device provided by the embodiment of the present application may call the logic instruction in the memory 730, and its implementation manner is consistent with the implementation manner of the answer text extraction method in the text table provided by the present application, and may achieve the same beneficial effects, which are not described herein again.

In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform the method for extracting answer text in a text table provided by the above embodiments, the method comprising: extracting a table in text data to be processed, and acquiring a question text vector corresponding to a question text; acquiring a cell coordinate vector corresponding to the coordinates of each cell in the table and a cell text vector corresponding to the text in each cell, and splicing the cell coordinate vector and the cell text vector into a cell splicing vector of each cell; inputting the cell splicing vector of each cell into an index recognition model, classifying each cell, and determining index cells and non-index cells among the cells; for each non-index cell, inputting the cell splicing vector of the non-index cell and the cell splicing vectors of the index cells located to the left of and above the non-index cell into a feature fusion model for feature fusion, to obtain a context vector of each non-index cell; and splicing the context vector of each non-index cell with the question text vector, inputting the result into an answer extraction model, classifying each non-index cell, determining answer cells and non-answer cells among the non-index cells, and determining the text in the answer cells as the answer text corresponding to the question text.

When the computer program product provided by the embodiment of the present application is executed, the method for extracting the answer text in the text table is implemented, and the specific implementation manner of the method is consistent with the implementation manner recorded in the embodiment of the method, and the same beneficial effects can be achieved, which is not repeated here.

In yet another aspect, the present invention further provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method for extracting answer text in a text table provided by the above embodiments, the method comprising: extracting a table in text data to be processed, and acquiring a question text vector corresponding to a question text; acquiring a cell coordinate vector corresponding to the coordinates of each cell in the table and a cell text vector corresponding to the text in each cell, and splicing the cell coordinate vector and the cell text vector into a cell splicing vector of each cell; inputting the cell splicing vector of each cell into an index recognition model, classifying each cell, and determining index cells and non-index cells among the cells; for each non-index cell, inputting the cell splicing vector of the non-index cell and the cell splicing vectors of the index cells located to the left of and above the non-index cell into a feature fusion model for feature fusion, to obtain a context vector of each non-index cell; and splicing the context vector of each non-index cell with the question text vector, inputting the result into an answer extraction model, classifying each non-index cell, determining answer cells and non-answer cells among the non-index cells, and determining the text in the answer cells as the answer text corresponding to the question text.

When the computer program stored on the non-transitory computer readable storage medium provided by the embodiment of the present application is executed, the method for extracting the answer text in the text table is implemented, and the specific implementation manner is consistent with the implementation manner recorded in the embodiment of the foregoing method, and the same beneficial effects can be achieved, which is not repeated here.

The apparatus embodiments described above are merely illustrative, wherein units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place, or may be distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the present invention without creative effort.

From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general hardware platform, or of course by means of hardware. Based on such understanding, the foregoing technical solutions, in essence or in the part that contributes over the prior art, may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disk, and which includes several instructions to cause a computer device (which may be a personal computer, a server, a network device, etc.) to perform the methods described in the embodiments or in some parts of the embodiments.

Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. The method for extracting the answer text in the text form is characterized by comprising the following steps:

Respectively splicing the context vector of each non-index cell with the question text vector, inputting an answer extraction model, classifying each non-index cell, determining an answer cell and a non-answer cell in each non-index cell, and determining the text in the answer cell as an answer text corresponding to the question text;

And inputting the cell splicing vector of each non-index cell and the cell splicing vector of each index cell positioned on the left side and the upper side of each non-index cell into a feature fusion model for feature fusion, and obtaining the context vector of each non-index cell, wherein the feature fusion comprises the following steps:

2. The method for extracting answer text from text table according to claim 1, wherein said obtaining cell coordinate vectors corresponding to coordinates of each cell in the table comprises:

acquiring coordinates of each cell in the table;

3. The method for extracting answer text from text table according to claim 1, wherein the obtaining a question text vector corresponding to a question text comprises:

4. The method for extracting answer text from text table according to claim 1, wherein obtaining cell text vectors corresponding to the text in each cell comprises:

5. The method for extracting answer text from text tables according to any one of claims 1 to 4, wherein before extracting a table in text data to be processed and obtaining a question text vector corresponding to a question text, the method further comprises:

and acquiring the text data to be processed and the question text.

6. An apparatus for extracting answer text from a text form, comprising:

The answer extraction module is used for respectively splicing the context vector of each non-index cell with the question text vector, inputting an answer extraction model, classifying each non-index cell, determining answer cells and non-answer cells in each non-index cell, and determining texts in the answer cells as answer texts corresponding to the question texts;

Wherein the device is further for:

7. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements a method for extracting answer text in a text form according to any one of claims 1 to 5 when the program is executed by the processor.

8. A non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor implements a method of extracting answer text from a text form according to any one of claims 1 to 5.

9. A computer program product comprising a computer program which, when executed by a processor, implements a method of extracting answer text from a text form as claimed in any one of claims 1 to 5.