CN114444465A - Information extraction method, device, equipment and storage medium

Info

Publication number: CN114444465A
Application number: CN202210121630.9A
Authority: CN
Inventors: 冯博豪; 许海洋; 陈禹燊
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2022-02-09
Filing date: 2022-02-09
Publication date: 2022-05-06

Abstract

The disclosure provides an information extraction method, apparatus, device and storage medium, and relates to the technical fields of big data, natural language processing and deep learning within data processing. The specific implementation is as follows: acquiring a document to be processed, wherein the document to be processed comprises a region to be processed; analyzing the region to be processed to obtain analysis information corresponding to the region to be processed; and extracting, based on a target key name, target information corresponding to the target key name from the analysis information corresponding to the region to be processed. The solution extracts information from the document to be processed automatically, and reduces the cost of manual extraction and the difficulty of maintenance.

Description

Information extraction method, device, equipment and storage medium

Technical Field

The present disclosure relates to the field of big data technology in data processing, and in particular, to an information extraction method, apparatus, device, and storage medium.

Background

With the increasing popularity of the internet, the volume of information has grown explosively, and a large amount of important data is presented as text, such as the public disclosures of various companies. If valuable structured data can be automatically analyzed, filtered and extracted according to actual requirements, researchers can quickly obtain investment clues and make timely and accurate decisions.

In the related art, bulletin information is mainly extracted by applying a preset rule set and/or dictionary set to the key information in a bulletin, so as to obtain valuable structured data. However, a pre-established rule set or dictionary set may not cover every form the bulletin text can take and must be updated and maintained continuously, which leads to high labor cost and difficult maintenance.

Disclosure of Invention

The disclosure provides an information extraction method, an information extraction device, information extraction equipment and a storage medium.

According to a first aspect of the present disclosure, there is provided an information extraction method, including:

acquiring a document to be processed, wherein the document to be processed comprises a region to be processed;

analyzing the area to be processed to obtain analysis information corresponding to the area to be processed;

and extracting target information corresponding to the target key name from the analysis information corresponding to the to-be-processed area based on the target key name.

According to a second aspect of the present disclosure, there is provided an information extraction apparatus including:

the acquisition unit is used for acquiring a document to be processed, wherein the document to be processed comprises a region to be processed;

the analysis unit is used for analyzing the area to be processed to obtain analysis information corresponding to the area to be processed;

and the extraction unit is used for extracting target information corresponding to the target key name from the analysis information corresponding to the to-be-processed area based on the target key name.

According to a third aspect of the present disclosure, there is provided an electronic device comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first aspect.

According to a fourth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of the first aspect.

According to a fifth aspect of the present disclosure, there is provided a computer program product comprising: a computer program, stored in a readable storage medium, from which at least one processor of an electronic device can read the computer program, execution of the computer program by the at least one processor causing the electronic device to perform the method of the first aspect.

The technical solution of the present disclosure extracts information automatically, which reduces labor cost and improves extraction accuracy.

It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.

Drawings

The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:

FIG. 1 is a schematic view of an application scenario to which an information extraction method provided in the embodiment of the present disclosure is applied;

FIG. 2 is a schematic diagram of an architecture of an information extraction system provided by an embodiment of the present disclosure;

FIG. 3 is a schematic flowchart of an information extraction method according to a first embodiment of the disclosure;

FIG. 4 is a schematic flowchart of an information extraction method according to a second embodiment of the disclosure;

FIG. 5 is a schematic flowchart of extracting target information by machine reading comprehension;

FIG. 6 is a schematic flowchart of an information extraction method according to a third embodiment of the disclosure;

FIG. 7 is a schematic illustration of a table region in a document to be processed before and after parsing;

FIG. 8 is a schematic flowchart of an information extraction method according to a fourth embodiment of the disclosure;

FIG. 9 is a schematic structural diagram of a form check for one page of a document to be processed;

FIG. 10 is a schematic flowchart of an information extraction method according to a fifth embodiment of the disclosure;

FIG. 11 is a schematic structural diagram of an information extraction apparatus provided in an embodiment of the present disclosure;

FIG. 12 is a schematic block diagram of an example electronic device used to implement embodiments of the present disclosure.

Detailed Description

Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

In recent years, with the increasing popularity of the internet, the volume of information has grown explosively, and abundant and diverse information is presented using the internet as a carrier. In the big data era, a large amount of important data is presented as text, and text data can be divided into three types according to structure: structured, unstructured and semi-structured. Structured information is information that can be represented and stored in a relational database; unstructured information has no fixed structure; semi-structured information lies between the two and has a structure that is implicit, irregular or incomplete.

In one possible example, as the financial field becomes closely integrated with the internet, a great amount of financial text is generated online every day, characterized by scattered content, sparse data and disordered, redundant information. To find useful information in a large amount of unstructured text quickly and efficiently, an information extraction system is needed. The main task of such a system is to take original natural-language texts as input, output structured information in a fixed format, and integrate that information into a back-end database in a unified manner, so that it is easy to look up and analyze.

In another possible example, with the advent of the internet finance era, companies publish a huge number of documents, such as bulletins, through information disclosure websites every day, and the information contained in the bulletins plays a crucial role in investment analysis, enterprise interests, market impact and socio-economic resource allocation. Company information disclosure means that a company reports accounting information, such as its own financial and operating status, to a supervisory department according to legal requirements and discloses it to public investors in the form of bulletins. Company information disclosure bulletins generally include annual reports, quarterly reports and the like, as well as bulletins on major events such as asset replacement, related-party transactions, share pledges, and investment and financing.

In investment research, company financial bulletins are important reference materials for investors. Mining the important information in these bulletins is critical, since it has an important influence on the market, management, the company and investors.

In practical applications, companies release a massive number of documents through information disclosure websites every day, and this volume of bulletin information is difficult to absorb. Traditionally, information is acquired mainly by manual extraction: for example, a dedicated rule set or dictionary library is first formulated for a specific task, and key information is then extracted based on the rule set and the dictionary library. This scheme has the following problems. First, the established rule set and dictionary library can hardly cover all situations and need to be continuously updated and maintained, so the labor cost is high. Second, as the rule set or dictionary library grows, its rules easily conflict with each other, which makes maintenance difficult and generalization poor. Third, a document may contain both plain text and tables, but existing manual extraction methods can often obtain only part of the information in the plain text and cannot extract information from plain text and tables at the same time. That is, existing manual information extraction methods are costly, cannot adapt to changing practical demands, and suffer from high labor cost and difficult maintenance.

In view of the above technical problems, the technical concept of the present disclosure was arrived at as follows: given the high labor cost and difficult maintenance of manual information extraction, if a machine can automatically analyze, filter and extract valuable structured data according to actual requirements, researchers can quickly obtain investment clues and make timely and accurate decisions. In addition, because an information disclosure bulletin is a type of unstructured text in which information is scattered and redundant information causes heavy interference, traditional information extraction systems have many limitations and can hardly extract the key information of a bulletin quickly, efficiently and accurately. If named entity recognition, machine reading comprehension and table parsing technologies are used, key information can be extracted from information disclosure bulletins.

Based on the technical concept process, the embodiment of the present disclosure provides an information extraction method, where a to-be-processed document including a to-be-processed region is acquired, the to-be-processed region is analyzed to obtain analysis information corresponding to the to-be-processed region, and finally, based on a target key name, target information corresponding to the target key name is extracted from the analysis information corresponding to the to-be-processed region. According to the technical scheme, the area to be processed can be automatically analyzed, the target information can be quickly positioned and extracted based on the target key name, the labor cost is reduced, and the extraction accuracy is improved.

Exemplarily, fig. 1 is a schematic view of an application scenario to which the information extraction method provided by the embodiment of the present disclosure is applied. As shown in fig. 1, the application scenario may include: an information extraction device 11, a network 12, and a server 13. Wherein, the information extraction device 11 can obtain the document to be processed from the server 13 through the network 12. Optionally, the application scenario shown in fig. 1 may further include a data storage device 14 connected to the information extraction device 11 and/or the server 13.

For example, in the application scenario shown in fig. 1, the information extraction device 11 may obtain the document to be processed from the server 13 through the network 12 by using the obtained document Uniform Resource Locator (URL) address, or may directly obtain the document to be processed uploaded locally.
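The two acquisition paths described above (fetching by URL through the network, or reading a locally uploaded file) could be sketched roughly as follows. This is only an illustration: the function name `load_document` and the `timeout` parameter are assumptions and do not come from the disclosure.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen


def load_document(source: str, timeout: float = 10.0) -> bytes:
    """Return the raw bytes of a document given a URL or a local path.

    Hypothetical helper: the disclosure only says a document may be
    obtained by URL or uploaded locally.
    """
    scheme = urlparse(source).scheme.lower()
    if scheme in ("http", "https"):
        # Remote case: fetch the document from the server over the network.
        with urlopen(source, timeout=timeout) as response:
            return response.read()
    # Local case: the document already lives on this machine.
    return Path(source).read_bytes()
```

Returning raw bytes keeps the acquisition step independent of the document format, so the later parsing step can decide how to interpret a Word, PDF or text file.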

As an example, the information extraction device 11 may execute the information extraction method provided by the present disclosure to extract information of the acquired document to be processed to obtain target information.

As another example, the information extraction device 11 may store the acquired document to be processed in the data storage device 14, and then directly use the acquired document to be processed in information extraction of a subsequent document to be processed.

In this embodiment, the data storage device 14 may include at least one database, and each database may store at least one task type data, so that after the information extraction device 11 extracts information of a document to be processed to obtain target information, the target information may be further stored in the at least one database of the data storage device 14 based on task requirements.

It should be noted that fig. 1 is only a schematic diagram of an application scenario provided by the embodiment of the present disclosure, and the embodiment of the present disclosure does not limit the devices included in fig. 1, nor does it limit the positional relationship between the devices in fig. 1, for example, in fig. 1, the data storage device 14 may be an external memory with respect to the server 13, and in other cases, the data storage device 14 may also be disposed in the server 13.

In the application scenario shown in fig. 1, the information extraction device 11 is a device with data extraction capability, and may be implemented by a server or a terminal device. In the embodiments of the present disclosure, the server or the terminal device for performing the data extraction task may be collectively referred to as an electronic device. Optionally, the information extraction method provided by the embodiment of the present disclosure is explained by using an electronic device as an execution subject.

Exemplarily, fig. 2 is a schematic diagram of an architecture of an information extraction system provided by an embodiment of the present disclosure. As shown in fig. 2, the information extraction system, divided by function, mainly includes: a document parsing module 201, a table information extraction module 202 and a plain text information extraction module 203.

In the embodiment of the present disclosure, the document parsing module 201 is mainly configured to parse a document to be processed to obtain parsing information corresponding to the document to be processed. The document to be processed may be a Word document or a PDF document. That is, the document parsing module 201 is mainly used for parsing a Word document or a PDF document, and the obtained parsing information is a text area and/or a table area.

It is understood that the document to be processed may also be a text document. If the document to be processed is a text document, the document to be processed does not need to be processed by the document parsing module 201.

The table information extraction module 202 mainly performs table analysis and table information extraction on a table area obtained by analyzing a document to be processed, and further obtains target information from the table area.

The plain text information extraction module 203 mainly performs paragraph division, key information positioning and plain text information extraction on a plain text region obtained by analyzing a document to be processed, and further acquires target information from the plain text region.

In one possible design, the information extraction system may further include: a display module 204 and/or a storage module 205. The display module 204 is mainly used for displaying the processing result of the document parsing module 201 and/or the table information extraction module 202 and/or the plain text information extraction module 203. The storage module 205 is mainly used for storing the processing result of the document parsing module 201 and/or the table information extraction module 202 and/or the plain text information extraction module 203.

It can be understood that the modules of the embodiment of the disclosure communicate with each other, which can improve the accuracy and efficiency of document interpretation.

It is understood that the embodiments of the present disclosure do not limit the specific components of the information extraction system; components may be added or removed according to the actual application scenario, which is not described herein again.
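The division of labor between the parsing output and the two extraction modules can be pictured as a small dispatch table. The sketch below is an assumption-laden illustration: `Region`, `extract_from_text`, `extract_from_table` and `extract` are hypothetical names, and the two extractors are deliberately trivial stand-ins for modules 203 and 202.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Union

Table = List[List[str]]  # rows of cell strings


@dataclass
class Region:
    kind: str                    # "text" or "table", as produced by parsing
    content: Union[str, Table]


def extract_from_text(region: Region, key: str) -> Optional[str]:
    """Toy stand-in for module 203: read the text after 'key:' on a line."""
    for line in str(region.content).splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip() == key:
            return value.strip()
    return None


def extract_from_table(region: Region, key: str) -> Optional[str]:
    """Toy stand-in for module 202: first cell is the key, second the value."""
    for row in region.content:
        if len(row) >= 2 and row[0].strip() == key:
            return row[1].strip()
    return None


EXTRACTORS: Dict[str, Callable[[Region, str], Optional[str]]] = {
    "text": extract_from_text,
    "table": extract_from_table,
}


def extract(regions: List[Region], key: str) -> List[str]:
    """Route each parsed region to the extractor for its kind."""
    hits = []
    for region in regions:
        value = EXTRACTORS[region.kind](region, key)
        if value is not None:
            hits.append(value)
    return hits
```

Keeping the routing in a dictionary means a further region type (for example a picture region) could be supported by registering one more extractor without touching the loop.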

The present disclosure provides an information extraction method, apparatus, device and storage medium, which are applied to the technical field of big data, the technical field of Natural Language Processing (NLP) and the technical field of deep learning in data processing, so as to achieve the purpose of extracting information from a document to be processed, and reduce the manual extraction cost and the maintenance difficulty.

It can be understood that in the technical scheme of the disclosure, the collection, storage, use, processing, transmission, provision and disclosure of the personal information of the users involved all comply with relevant laws and regulations and do not violate public order and good customs.

The following specific examples illustrate the technical solutions of the present disclosure and how the technical solutions of the present disclosure solve the above technical problems in detail. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present disclosure will be described below with reference to the accompanying drawings.

Fig. 3 is a flowchart illustrating an information extraction method according to a first embodiment of the disclosure. The method of the present embodiment may be performed by the electronic device of fig. 1. As shown in fig. 3, the information extraction method of this embodiment may include the following steps:

S301, acquiring a document to be processed, wherein the document to be processed comprises a region to be processed.

In the embodiment of the disclosure, the electronic device may acquire documents to be processed from a plurality of data sources, and the method of the embodiment may be executed for each document to be processed, so as to achieve the purpose of information extraction. Each document to be processed may include a region to be processed, and the region to be processed may be a plain text region composed of text information or a table region carrying text information in a table.

In practical applications, the document to be processed is usually a company information disclosure bulletin, most often a Word document or a PDF document. Therefore, document parsing is required first to determine the region to be processed in the document to be processed.

It can be understood that, based on business requirements, sometimes only the plain text region in the document to be processed needs to be extracted, sometimes only the table region needs to be extracted, and sometimes both the table region and the plain text region need to be extracted. Thus, the region to be processed may include: a plain text region and/or a table region.

In another possible example of the present disclosure, the to-be-processed region may further include other regions, for example, a picture region. The specific content included in the region to be processed is not limited in the embodiments of the present disclosure, and may be set based on actual requirements.

S302, analyzing the area to be processed to obtain analysis information corresponding to the area to be processed.

In this embodiment, when parsing the to-be-processed area of the to-be-processed document, the attribute of the to-be-processed area may be retained based on the format of the to-be-processed document. Illustratively, text information can be obtained by analyzing a plain text region in a Word document or a PDF document; the table area in the Word document or PDF document is analyzed to obtain structured information and the like.

As an example, if the document to be processed is a Word document, a plain text area and/or a table area included in the Word document may be first determined, and then the plain text area in the Word document is converted into text information by using a Word document parser, and the table area in the Word document is converted into structured information while retaining text attributes, such as a title, a body, a header, a footer, and the like.

Optionally, if the document to be processed is a PDF document, the PDF document may first be converted into a Word document, and the plain text region and/or the table region included in the Word document may then be analyzed to obtain analysis information.

For example, when processing the document to be processed, the document may be parsed page by page in order to preserve page number information.
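One way to picture page-by-page parsing that keeps both the page number and the text attribute (title, body, header, footer) is a flat list of small records. The sketch below assumes the pages have already been split into (attribute, text) pairs by some parser; `ParsedBlock`, `parse_by_page` and `body_text` are hypothetical names, and dropping headers and footers is an illustrative choice rather than something the disclosure requires.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class ParsedBlock:
    page: int        # 1-based page number, kept so results can be traced back
    attribute: str   # e.g. "title", "body", "header", "footer"
    text: str


def parse_by_page(pages: Iterable[Iterable[Tuple[str, str]]]) -> List[ParsedBlock]:
    """Flatten per-page (attribute, text) pairs into blocks that keep their page."""
    blocks = []
    for page_number, page in enumerate(pages, start=1):
        for attribute, text in page:
            if text.strip():  # skip empty fragments
                blocks.append(ParsedBlock(page_number, attribute, text.strip()))
    return blocks


def body_text(blocks: List[ParsedBlock]) -> List[ParsedBlock]:
    """Assumption for illustration: ignore headers and footers downstream."""
    return [b for b in blocks if b.attribute not in ("header", "footer")]
```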

S303, extracting target information corresponding to the target key name from the analysis information corresponding to the to-be-processed area based on the target key name.

For example, the key information in the document to be processed usually exists in the form of key-value pairs. Thus, for the analysis information corresponding to the region to be processed, the target key name (target key) may be used as an index to locate the position of the target information, and the target information corresponding to the target key name (i.e., the value corresponding to the target key) is then obtained from the located position. Optionally, in some scenarios, the target information may also be referred to as key information, which is not limited herein.

For example, for a document reporting the number of people in a grade of a school, the key of "number of people in a grade" may be first located, and then a specific value corresponding to the key, for example, 50 people, etc., may be determined.
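The "locate the key, then read the value" idea in the example above can be shown with a few lines of string handling. This is a deliberately naive sketch: `find_value` is a hypothetical helper, and taking the first number after the key is an assumption made only for illustration (the disclosure itself uses machine reading comprehension and table parsing for this step).

```python
import re
from typing import Optional


def find_value(text: str, key: str) -> Optional[str]:
    """Locate `key` in `text` and return the first number that follows it."""
    position = text.find(key)
    if position < 0:
        return None  # the key does not occur in this text
    tail = text[position + len(key):]
    # Digits with optional thousands separators or a decimal part.
    match = re.search(r"\d+(?:[.,]\d+)*", tail)
    return match.group(0) if match else None


sentence = "The number of people in a grade is 50 people."
value = find_value(sentence, "number of people in a grade")
```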

In the embodiment of the disclosure, the to-be-processed document including the to-be-processed region is acquired, the to-be-processed region is analyzed to obtain the analysis information corresponding to the to-be-processed region, and finally, the target information corresponding to the target key name is extracted from the analysis information corresponding to the to-be-processed region based on the target key name. According to the technical scheme, the area to be processed can be automatically analyzed, the target information can be quickly positioned and extracted based on the target key name, the labor cost is reduced, and the extraction accuracy is improved.

On the basis of the embodiment shown in fig. 3, the information extraction method provided by the embodiment of the present disclosure is described in more detail.

Fig. 4 is a flowchart illustrating an information extraction method according to a second embodiment of the disclosure. In this embodiment, the to-be-processed region includes a plain text region, and correspondingly, the analysis information corresponding to the to-be-processed region comprises text information corresponding to the plain text region, and the text information comprises the directory titles and body text of the to-be-processed document. Thus, the method of the present embodiment may be implemented as a possible implementation manner of S303 in fig. 3. As shown in fig. 4, the method of this embodiment may include:

S401, determining a target paragraph where the target key name is located based on the target key name and the directory title.

In an embodiment of the disclosure, when the electronic device parses the plain text region, it may preserve the text attributes of the plain text region, such as the format information of directory titles, body, headers and footers. Therefore, when extracting information from the text information corresponding to the plain text region, the target paragraph where the target key name is located may first be located through the directory titles included in the text information based on the target key name, and the target information may then be located in the target paragraph.

For example, since company bulletin documents usually have a fixed format, the paragraph where a target key name is located may be determined according to the directory titles and the target key name in order to locate information accurately, and the target information may then be found in a preset manner.
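Locating the target paragraph from directory titles might look like the following sketch. The scoring rule (word overlap between key and title, plus a bonus when the body mentions the key) is an assumption chosen for illustration, `locate_target_paragraph` is a hypothetical name, and whitespace tokenization is a simplification that would not suit Chinese text.

```python
from typing import List, Optional, Tuple

Section = Tuple[str, str]  # (directory title, body text)


def locate_target_paragraph(sections: List[Section], key: str) -> Optional[Section]:
    """Pick the section whose title best matches the key and whose body mentions it."""
    key_words = set(key.lower().split())
    best, best_score = None, 0
    for title, body in sections:
        title_overlap = len(key_words & set(title.lower().split()))
        body_hit = 1 if key.lower() in body.lower() else 0
        score = 2 * title_overlap + body_hit  # titles count more than bodies
        if score > best_score:
            best, best_score = (title, body), score
    return best  # None when no section matches at all
```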

S402, extracting target information from the text information of the target paragraph.

Optionally, information extraction can be performed by machine reading comprehension. In this embodiment, the electronic device can understand the contextual semantics of the text information in the target paragraph and thereby accurately extract the target information from it.

In one possible design of the embodiment of the present disclosure, fig. 5 is a schematic flowchart of extracting target information by machine reading comprehension. As shown in fig. 5, extracting target information from the text information of the target paragraph by machine reading comprehension is implemented as follows:

A1, performing sentence encoding on the text information in the target paragraph to obtain the segment embedding, token embedding and position embedding of each sentence.

In this step, the electronic device may first perform text embedding on the body information in the target paragraph. For example, the text information is mapped into word vectors, each input sentence is encoded, and the segment embedding, token embedding and position embedding of the sentence are obtained in turn.

Optionally, in the embodiment of the present disclosure, the ALBERT model is used for machine reading comprehension, and the tokenization process for the token embeddings includes both character-level tokenization and word-level tokenization, which improves the sentence encoding effect.

Optionally, in this step, text embedding is applied to both the keyword to be extracted and the plain text portion of the document itself.
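To make the three inputs concrete, the sketch below builds the ids that would index the token, segment and position embedding tables for one (key, sentence) pair in the usual BERT-style layout `[CLS] key [SEP] sentence [SEP]`. It is an illustration only: `encode_pair` and the toy `vocab` are hypothetical, and lower-cased whitespace splitting stands in for the real tokenizer that an ALBERT model would use.

```python
from typing import Dict, List, Tuple


def encode_pair(
    key: str, sentence: str, vocab: Dict[str, int]
) -> Tuple[List[int], List[int], List[int]]:
    """Return (token_ids, segment_ids, position_ids) for '[CLS] key [SEP] sentence [SEP]'."""
    key_tokens = key.lower().split()
    sent_tokens = sentence.lower().split()
    tokens = ["[CLS]"] + key_tokens + ["[SEP]"] + sent_tokens + ["[SEP]"]
    # Token ids select rows of the token embedding table; unknown words map to [UNK].
    token_ids = [vocab.get(t, vocab["[UNK]"]) for t in tokens]
    # Segment 0 covers the key part, segment 1 covers the sentence part.
    segment_ids = [0] * (len(key_tokens) + 2) + [1] * (len(sent_tokens) + 1)
    # Position ids simply count tokens from the start of the sequence.
    position_ids = list(range(len(tokens)))
    return token_ids, segment_ids, position_ids
```

In BERT-style encoders the three embeddings looked up from these ids are added element-wise to form the input representation of each token.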

A2, determining semantic expression information of each sentence based on the original features of the segment embedding, token embedding and position embedding of each sentence and on the context features of the segment embedding, token embedding and position embedding of each sentence.

In the disclosure, the electronic device may perform context feature fusion, which mainly fuses the original features of the segment embedding, token embedding and position embedding of each sentence with the context features of those embeddings in the forward and backward directions of the sentence, thereby enriching the semantic expression information of each sentence and improving the accuracy of subsequent semantic matching.

A3, determining a similarity matching matrix between the target key name and each sentence according to the target key name and the semantic expression information of each sentence.

Illustratively, the purpose of this step is to perform semantic matching. Specifically, an attention mechanism is used to calculate the similarity matching degree between each word in the target key name and each word in each sentence, yielding a similarity matching matrix between the target key name and each sentence.
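A word-by-word similarity matrix of this kind can be sketched with plain lists. The disclosure only states that an attention mechanism is used; the scaled dot product followed by a row-wise softmax below is one common form, assumed here for illustration, and `similarity_matrix` is a hypothetical name.

```python
import math
from typing import List

Vector = List[float]


def similarity_matrix(key_vecs: List[Vector], sent_vecs: List[Vector]) -> List[List[float]]:
    """Row i: attention weights of key word i over every word of the sentence."""
    dim = len(key_vecs[0])
    matrix = []
    for k in key_vecs:
        # Scaled dot product between one key word and each sentence word.
        scores = [sum(a * b for a, b in zip(k, s)) / math.sqrt(dim) for s in sent_vecs]
        # Numerically stable softmax so that each row sums to 1.
        peak = max(scores)
        exps = [math.exp(x - peak) for x in scores]
        total = sum(exps)
        matrix.append([e / total for e in exps])
    return matrix
```

The result has one row per key word and one column per sentence word, so a large entry marks a sentence word that matches the corresponding key word closely.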

A4, extracting the target information from the text information of the target paragraph according to the similarity matching matrix.

Optionally, after the similarity matching matrix is obtained, the electronic device may perform information prediction to determine a prediction result of the target information. Specifically, the similarity matching matrix is used to respectively determine the starting character sequence number and the ending character sequence number of the target information in each sentence.
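Choosing the starting and ending character sequence numbers can be illustrated as a search over per-character scores. How those scores are derived from the similarity matching matrix is not specified here, so the sketch simply takes two score lists as given; `best_span` and the `max_len` limit are assumptions for illustration.

```python
from typing import List, Tuple


def best_span(
    start_scores: List[float], end_scores: List[float], max_len: int = 30
) -> Tuple[int, int]:
    """Return (start, end) maximising start_scores[start] + end_scores[end], start <= end."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_scores):
        # Only ends at or after the start, and at most max_len characters long.
        for e in range(s, min(s + max_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best


text = "ratio is 12%"
start_scores = [0.0] * len(text)
end_scores = [0.0] * len(text)
start_scores[9] = 5.0   # character "1"
end_scores[11] = 4.0    # character "%"
end_scores[3] = 6.0     # distractor that lies before the best start
start, end = best_span(start_scores, end_scores)
answer = text[start:end + 1]
```

The constraint that the end may not precede the start is what lets the search ignore the high-scoring distractor in this example.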

In the embodiment of the present disclosure, a target paragraph where a target key name is located is determined based on the target key name and a directory title, and target information is extracted from text information of the target paragraph by machine reading comprehension. According to the technical scheme, the target information in the text information can be automatically determined, batch processing is supported, the processing speed is increased, and the extraction accuracy is improved by using technologies such as machine reading comprehension and semantic understanding.

Fig. 6 is a flowchart illustrating an information extraction method according to a third embodiment of the present disclosure. In this embodiment, the to-be-processed region includes a table region, and correspondingly, the analysis information corresponding to the to-be-processed region comprises the structured information corresponding to the table region. In practical applications, analyzing a table region mainly means obtaining the row and column numbers of the information in the table region. Thus, the method of this embodiment may be implemented as a possible implementation of S302 in fig. 3. As shown in fig. 6, the method of this embodiment may include:

S601, extracting a mask map of the table area by using a semantic segmentation model.

In the present embodiment, a mask map of the table region is extracted using a semantic segmentation model. Here, the semantic segmentation model is DeepLabv3+, which uses dilated (atrous) convolution to enlarge the receptive field used when extracting the mask map while keeping the image size unchanged.
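The effect of dilated convolution can be illustrated with a one-dimensional toy example. This is only a sketch of the principle, not the segmentation model itself; the functions `dilated_conv1d` and `receptive_field` are hypothetical helpers, and zero padding is assumed so that the output keeps the input length.

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """1-D cross-correlation with zero padding so the output length equals the input length."""
    k = len(kernel)
    span = (k - 1) * dilation                 # distance between first and last kernel tap
    pad = span // 2
    padded = [0.0] * pad + list(signal) + [0.0] * (span - pad)
    return [
        sum(kernel[t] * padded[i + t * dilation] for t in range(k))
        for i in range(len(signal))
    ]

def receptive_field(kernel_size, dilation):
    """Number of input positions covered by one output position."""
    return (kernel_size - 1) * dilation + 1

signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
kernel = [1.0, 1.0, 1.0]
dense = dilated_conv1d(signal, kernel, dilation=1)    # each output covers 3 inputs
atrous = dilated_conv1d(signal, kernel, dilation=2)   # each output covers 5 inputs
```

With the same three kernel weights, a dilation of 2 widens the receptive field from 3 to 5 positions while the output length stays at 7.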

Optionally, the network of the semantic segmentation model has an "encoding module-decoding module" structure. The encoding module includes a deep convolutional neural network (DCNN), which extracts features from the table region, connected to an atrous spatial pyramid pooling (ASPP) module, which extracts and fuses multi-scale features of the image. The decoding module includes an upsampling unit, which obtains the segmented mask map through upsampling.

In this step, because the DeepLabv3+ model introduces multi-scale information, the low-level features and the high-level features of the table region can be further fused, which greatly improves the accuracy of boundary segmentation in the mask map.

S602, determining a text detection box of the table area based on the mask map.

For example, the text detection box is obtained from the mask map obtained in S601, that is, from the result of semantic segmentation. The text detection box is processed using connected-component analysis from the OpenCV library and the Canny edge detection algorithm to obtain the coordinates of the edge of the text detection box. Taking the maximum and minimum values of these edge coordinates finally yields the four-point coordinates (x_min, x_max, y_min, y_max) of the text detection box; specifically, the four points are (x_min, y_min), (x_min, y_max), (x_max, y_min), and (x_max, y_max).
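The final min/max step can be sketched as follows, assuming the edge pixels of one text detection box are already available as a list of (x, y) pairs (the list `edge_points` and the helper names are hypothetical; the connected-component and edge detection steps are not reproduced here):

```python
def bounding_box(edge_points):
    """Return (x_min, x_max, y_min, y_max) from a list of (x, y) edge coordinates."""
    xs = [x for x, _ in edge_points]
    ys = [y for _, y in edge_points]
    return min(xs), max(xs), min(ys), max(ys)

def corners(box):
    """Expand (x_min, x_max, y_min, y_max) into the four corner points."""
    x_min, x_max, y_min, y_max = box
    return [(x_min, y_min), (x_min, y_max), (x_max, y_min), (x_max, y_max)]

# Assumed edge pixels of a single text detection box.
edge_points = [(12, 40), (58, 41), (13, 62), (57, 63), (30, 39)]
box = bounding_box(edge_points)
points = corners(box)
```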

S603, completing row-column alignment by using the coordinates of the text detection box to obtain a table row header and a table column header of the table area.

This step is mainly implemented based on the four-point coordinates (x_min, x_max, y_min, y_max) of the text detection box. Specifically, the column division coordinates are determined from the four-point coordinates of the text detection box and used to align the text detection boxes by column, obtaining the table row header of the table area. Similarly, the row division coordinates are determined from the four-point coordinates of the text detection box and used to align the text detection boxes by row, obtaining the table column header of the table area.
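One possible way to realize row-column alignment is sketched below. It assumes that boxes whose centers lie within a pixel tolerance of each other belong to the same row or column; this clustering rule, the tolerance value and the function names are illustrative assumptions rather than the division-coordinate procedure of the disclosure.

```python
def group_indices(centers, tol):
    """Assign each center a group index; a gap larger than tol starts a new group."""
    order = sorted(range(len(centers)), key=lambda i: centers[i])
    labels = [0] * len(centers)
    current = -1
    last = None
    for i in order:
        if last is None or centers[i] - last > tol:
            current += 1
        labels[i] = current
        last = centers[i]
    return labels

def align(boxes, tol=10):
    """Label each (x_min, x_max, y_min, y_max) box with a 'row_col' string."""
    x_centers = [(b[0] + b[1]) / 2 for b in boxes]
    y_centers = [(b[2] + b[3]) / 2 for b in boxes]
    cols = group_indices(x_centers, tol)
    rows = group_indices(y_centers, tol)
    return [f"{r}_{c}" for r, c in zip(rows, cols)]

# Four assumed text detection boxes forming a 2 x 2 grid.
boxes = [(0, 40, 0, 20), (100, 140, 0, 20), (0, 40, 30, 50), (102, 138, 31, 49)]
labels = align(boxes)
```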

Illustratively, fig. 7 is a schematic diagram of a table area in a document to be processed before and after parsing. As shown in fig. 7, a flow statement of company A is taken as an example. Assume that the table area is a table with 4 rows and 4 columns: the first column is the item, the second column is the note number, the third column is the current-period amount, and the fourth column is the amount for the same period of the previous year. The items in the first column include Xxxxx1, Xxxxx2 and Xxxxx3, where the current-period amount corresponding to Xxxxx1 is Yyy1 and the amount for the same period of the previous year is zzz1; the current-period amount corresponding to Xxxxx2 is Yyy2 and the amount for the same period of the previous year is zzz2; and the current-period amount corresponding to Xxxxx3 is Yyy3 and the amount for the same period of the previous year is zzz3. Referring to fig. 7, the parsed table area is marked with row and column information; for example, in the first row, the row and column information corresponding to the item is 0_0, that corresponding to the note number is 0_1, that corresponding to the current-period amount is 0_2, and that corresponding to the amount for the same period of the previous year is 0_3.

It is understood that the information shown in fig. 7 is an exemplary illustration, and the specific content of the table area is not limited by the embodiment of the disclosure.

S604, determining the structured information corresponding to the table area based on the table row header and the table column header.

For example, in this embodiment, the information of the table area may be converted into structured information according to the table row header and the table column header, for example, the structured information in json format, and accordingly, the structured information may be stored in the database for subsequent query and use.
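A minimal sketch of such a conversion is given below. It assumes that each cell has already been labeled "row_col", that row 0 holds the column headers and column 0 holds the row headers; the function `table_to_records`, the English header strings and the empty note-number cells are hypothetical, and the placeholder values mirror those used for fig. 7.

```python
import json

def table_to_records(cells):
    """Turn {'row_col': text} cells into {row header: {column header: value}}."""
    grid = {}
    for label, text in cells.items():
        r, c = (int(part) for part in label.split("_"))
        grid[(r, c)] = text
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    records = {}
    for r in range(1, n_rows):
        row_header = grid.get((r, 0), "")
        records[row_header] = {
            grid.get((0, c), ""): grid.get((r, c), "") for c in range(1, n_cols)
        }
    return records

cells = {
    "0_0": "Item", "0_1": "Note number",
    "0_2": "Current period", "0_3": "Same period of previous year",
    "1_0": "Xxxxx1", "1_1": "", "1_2": "Yyy1", "1_3": "zzz1",
    "2_0": "Xxxxx2", "2_1": "", "2_2": "Yyy2", "2_3": "zzz2",
}
records = table_to_records(cells)
text = json.dumps(records, ensure_ascii=False, indent=2)  # json string ready to be stored
```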

Illustratively, the following information is a partial example of the structured information corresponding to the table region in FIG. 7:

Accordingly, S303 can be implemented by the following step:

based on the target key name, extracting target information corresponding to the target key name from the structured information corresponding to the table area.

In the present embodiment, the analysis of the table area is completed through the above steps, but in order to use the information of the table area, the target information still needs to be extracted from it. For example, to acquire the information of "Xxxxx1", the json file obtained by analyzing the table area is matched against the target key name (key): the target information corresponding to the target key name "current period" is "Yyy1", and the target information corresponding to the target key name "same period of the previous year" is "zzz1".

For example, in this embodiment, the similarity matching model is used when the target key name is matched with the json file corresponding to the table area.
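The matching of a target key name against the json keys can be sketched as follows. The disclosure uses a similarity matching model; here the character-level ratio from Python's `difflib.SequenceMatcher` is swapped in purely as a stand-in, and the `records` content and function names are assumed for illustration.

```python
from difflib import SequenceMatcher

def best_key(target, candidates):
    """Return the candidate key most similar to the target key name."""
    return max(
        candidates,
        key=lambda k: SequenceMatcher(None, target.lower(), k.lower()).ratio(),
    )

def lookup(records, item, target_key):
    """Find the value in one table row whose column header best matches target_key."""
    row = records[item]
    return row[best_key(target_key, list(row))]

# Assumed structured information for one row of the table in fig. 7.
records = {
    "Xxxxx1": {
        "Note number": "",
        "Current period amount": "Yyy1",
        "Same period of previous year": "zzz1",
    }
}
current = lookup(records, "Xxxxx1", "current period")
previous = lookup(records, "Xxxxx1", "same period last year")
```

The approximate match lets a key name such as "current period" retrieve the column whose header is worded slightly differently.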

In the embodiment of the disclosure, a mask map of a table region is extracted by using a semantic segmentation model, a text detection box of the table region is further determined, row-column alignment is completed by using coordinates of the text detection box, a table row header and a table column header of the table region are obtained, and finally, structured information corresponding to the table region is determined based on the table row header and the table column header, so that the purpose of analyzing the table region and extracting target information is achieved, and the accuracy of information extraction is improved.

Optionally, in a possible design of the embodiment of the present disclosure, after the step S301 (acquiring the document to be processed), the information extraction method further includes:

performing layout analysis on the document to be processed, and determining a region to be processed in the document to be processed.

In practical applications, the structure of the document to be processed is diverse; for example, a Word document may include both a plain text region and a table region, and may also include a picture region. When a Word document includes a table area in addition to a plain text area and is parsed directly with a Word parser, the table structure cannot be retained, and as a result the information in the table area cannot be extracted.

As an example, the area to be processed includes a plain text area and a table area. Optionally, fig. 8 is a schematic flow chart of an information extraction method according to a fourth embodiment of the present disclosure. In this embodiment, as shown in fig. 8, performing layout analysis on the document to be processed to determine the region to be processed in the document to be processed may include:

S801, processing the document to be processed by using a table detection model with an encoder-decoder architecture, and determining a table area in the document to be processed.

In the process of extracting the target information, the contents of the plain text region and the table region are processed in different ways; therefore, the plain text region and the table region in the document to be processed need to be distinguished.

Illustratively, the algorithm applied for table detection is TableNet, which is implemented mainly on the basis of the inherent interdependency between the two tasks of table detection and table structure recognition. TableNet uses an encoder-decoder architecture. In the encoder network, VGG-19 is used as the base network, and detection uses table information and column information at the same time, so that the table area can be found more reliably. In the decoder network, a series of strided convolutional layers is used to enlarge the receptive field of the image, and finally the outputs of the two branches are computed to generate masks of the table and column regions. Because column information encoding and table information encoding are considered at the same time, the detection effect of the TableNet model is superior to that of table detection models of the same type.

In the embodiment of the disclosure, TableNet can be used to accurately locate the table area in the document to be processed and to distinguish the plain text area from the table area. Since Word documents and PDF documents are parsed page by page, the table detection process also locates tables on a page-by-page basis.

Illustratively, fig. 9 is a schematic structural diagram of table detection performed on a page of a document to be processed. As shown in fig. 9, the table detection in this step can detect the areas where tables 1 and 2 are located.

S802, determining a plain text area in the document to be processed according to the document to be processed and the table area.

Optionally, for the document to be processed, after the table area in the document to be processed is determined, the area excluding the table area is the plain text area.
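As a simple sketch of this exclusion, assume each detected table occupies a full-width vertical band (top, bottom) of a page, measured in pixels from the top; the plain text area is then what remains of the page. The band representation and the function name `plain_text_bands` are assumptions for illustration.

```python
def plain_text_bands(page_height, table_bands):
    """Return the vertical bands (top, bottom) of a page not covered by any table band."""
    bands = []
    cursor = 0
    for top, bottom in sorted(table_bands):
        if top > cursor:
            bands.append((cursor, top))     # gap before this table is plain text
        cursor = max(cursor, bottom)
    if cursor < page_height:
        bands.append((cursor, page_height))  # remainder below the last table
    return bands

# Two assumed table areas on a page that is 1000 pixels high.
text_bands = plain_text_bands(1000, [(600, 750), (200, 400)])
```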

In the embodiment of the disclosure, a table detection model with an encoder-decoder architecture is used to process the document to be processed and determine the table area in it, and the plain text area in the document to be processed is determined according to the document to be processed and the table area. The table area and the plain text area are thereby separated, which lays a foundation for subsequently improving the accuracy of information extraction.

Optionally, in a possible design of the embodiment of the present disclosure, the information extraction method may further include the following steps:

B1, verifying the target information to obtain a verification result;

B2, in response to the verification result indicating that error information exists in the target information, correcting the error information in the target information.

In this embodiment, after the target information in the document to be processed is obtained, the accuracy of the target information may also be checked, and the error information in the target information may be corrected.

Illustratively, the error information in the target information may come from multiple parts, such as: the document content itself has error information, the document analysis introduces error information, the table analysis introduces error information, and the like. The present embodiment does not limit the cause of the error information.

Optionally, in the embodiment of the present disclosure, the content of verifying the target information may include, but is not limited to: subject name verification, numerical consistency verification, and unit verification.

Subject name verification addresses the case where, when information is extracted using the method provided by the embodiment of the present disclosure, too many or too few characters are extracted, or the document itself contains a mistake. In this case, the subject name can be verified against the proper nouns in an existing information base to ensure that the subject name is error-free. For example, an extracted name that differs by a character from "xx stock control, Inc" would be corrected to the proper noun "xx stock control, Inc", and so forth.
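A hedged sketch of subject name verification is shown below. It assumes the information base is a plain list of proper nouns and uses `difflib.get_close_matches` from the Python standard library as one possible fuzzy matcher; the company names, the cutoff value and the function `verify_name` are hypothetical.

```python
from difflib import get_close_matches

def verify_name(extracted, known_names, cutoff=0.8):
    """Return the proper noun that best matches the extracted name, or None if none is close."""
    if extracted in known_names:
        return extracted
    matches = get_close_matches(extracted, known_names, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# Assumed information base of proper nouns.
known_names = ["Alpha Holdings Co., Ltd.", "Beta Energy Co., Ltd."]

corrected = verify_name("Alpha Holding Co., Ltd.", known_names)   # one character missing
unknown = verify_name("Gamma Bank", known_names)                  # no close proper noun
```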

Numerical consistency verification refers to checking numerical values with commonly used calculation formulas, so as to ensure the accuracy of the extracted values. For example: total asset turnover = net revenue / average total assets.
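The turnover formula can be used as a consistency check along the following lines. The sketch assumes that average total assets is the mean of the opening and closing total assets and that a 1% relative tolerance is acceptable; both choices, and the function name `check_turnover`, are illustrative assumptions.

```python
def check_turnover(net_revenue, assets_begin, assets_end, reported, rel_tol=0.01):
    """Recompute total asset turnover and compare it with the extracted (reported) value."""
    average_assets = (assets_begin + assets_end) / 2
    expected = net_revenue / average_assets
    consistent = abs(expected - reported) <= rel_tol * abs(expected)
    return consistent, expected

# Assumed extracted figures, all in the same unit.
ok, expected = check_turnover(500.0, 900.0, 1100.0, reported=0.5)
bad, _ = check_turnover(500.0, 900.0, 1100.0, reported=0.8)
```

A failed check indicates that at least one of the extracted values should be re-examined.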

Unit verification checks the units in the target information based on preset verification rules, for example the principle that units should be consistent between preceding and following text. In the information of the document to be processed, units may fail to match because of a typographical error or incomplete extraction. For example, in "the total assets of the issuer at the end of 2018 and at the end of September 2019 are 1,584,216.71 ten thousand yuan, 1,610,215.89 ten thousand yuan, 1,671,892.87 ten thousand yuan and 1,895,207.46, respectively", the last amount, 1,895,207.46, lacks the unit "ten thousand yuan".
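One possible consistency rule can be sketched with a regular expression: if every amount that carries a unit uses the same unit, any amount without a unit is flagged. The pattern, the two unit strings and the function `missing_units` are assumptions for illustration, not the preset rules of the disclosure.

```python
import re

# A number with optional thousands separators and decimals, optionally followed by a unit.
AMOUNT = re.compile(r"(\d+(?:,\d{3})*(?:\.\d+)?)\s*(ten thousand yuan|yuan)?")

def missing_units(text):
    """Return amounts lacking a unit when all other amounts share one unit."""
    found = [(m.group(1), m.group(2)) for m in AMOUNT.finditer(text)]
    units = {unit for _, unit in found if unit}
    if len(units) != 1:
        return []          # no unit, or conflicting units: this rule does not apply
    return [number for number, unit in found if unit is None]

text = ("1,584,216.71 ten thousand yuan, 1,610,215.89 ten thousand yuan, "
        "1,671,892.87 ten thousand yuan and 1,895,207.46")
flagged = missing_units(text)
```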

In the embodiment of the disclosure, the accuracy of the obtained information can be ensured by checking the target information.

In one possible design of the embodiment of the present disclosure, the information extraction method may further include:

C1, obtaining an information storage instruction, wherein the information storage instruction comprises at least one storage position, and each storage position is used for storing data of at least one task type;

C2, storing the target information to the at least one storage position based on the information storage instruction.

In practical applications, the extraction result of each type of information is significant for supporting a background database and for subsequently updating enterprise profile data. The obtained target information can therefore be stored in different storage positions, for example in different databases according to subsequent task types, so that it can later be read from a specified storage position based on task requirements.

Illustratively, the target information obtained after document analysis can be stored in a training database and used as the original corpus of a training task; important data such as various kinds of structured information can be stored in another database to serve as background data support for various tasks; and some special data, such as company names and senior executive names, can be stored in a non-relational database to serve as an information base for information verification.
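The routing described above can be sketched with in-memory lists standing in for the databases. The instruction format (a mapping from task type to storage position), the position names and the function `store` are hypothetical.

```python
def store(target_info, instruction, stores):
    """Append each task type's records to the storage position named in the instruction."""
    for task_type, position in instruction.items():
        records = target_info.get(task_type)
        if records:
            stores.setdefault(position, []).extend(records)
    return stores

# Assumed extraction results grouped by the task type that will consume them.
target_info = {
    "training": ["parsed paragraph text"],
    "verification": ["xx company name"],
}
# Assumed information storage instruction: task type -> storage position.
instruction = {"training": "training_db", "verification": "nosql_db"}

stores = store(target_info, instruction, {})
```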

It is understood that the embodiments of the present disclosure are explained by some examples, which may be determined according to actual requirements and are not described herein again.

In a possible design of the embodiment of the present disclosure, the step S301 (obtaining the document to be processed) may be implemented by:

acquiring a document network address, and loading a document to be processed corresponding to the document network address;

or

acquiring the locally uploaded document to be processed.

Illustratively, a user inputs a document network address, for example, a file url address, on a human-computer interaction interface of the electronic device, so that the electronic device can load a document to be processed corresponding to the document network address into an information extraction system of the electronic device based on the document network address.

Illustratively, the user can also import the local document to be processed into the information extraction system of the electronic device by uploading the file.
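The two acquisition paths can be sketched with the Python standard library. The function `acquire_document` is a hypothetical helper that treats an http or https address as a document network address and anything else as a local file path.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen

def acquire_document(source, timeout=10):
    """Return the raw bytes of a document given a network address or a local path."""
    scheme = urlparse(source).scheme
    if scheme in ("http", "https"):
        # Document network address: load the document over the network.
        with urlopen(source, timeout=timeout) as response:
            return response.read()
    # Otherwise treat the source as a locally uploaded file.
    return Path(source).read_bytes()
```

For instance, `acquire_document("report.pdf")` would read a local file, while passing a file url address would load the document from the network.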

Optionally, after the electronic device extracts the information of the acquired document to be processed to obtain the target information, the electronic device may also display the information extraction result of the document to be processed.

According to the content of the embodiments disclosed above, the embodiments of the present disclosure apply artificial intelligence technology to practical use and can assist business personnel in quickly analyzing document information, with strong generalization ability and extensibility. Document reading is fully automated and supports batch processing with high processing speed, so that document information is read automatically at low cost and high efficiency. Extraction accuracy is improved by using technologies such as machine reading comprehension, semantic understanding, and table parsing, and the included verification part further ensures the accuracy of the information.

Optionally, the method of the embodiment of the present disclosure may be applied to information extraction of various documents such as company bulletins, financial reports, and the like, and meanwhile, the method can support extraction of plain text information and extraction of form information.

By way of example, the following explains the complete flow of the embodiments of the present disclosure by one embodiment. Fig. 10 is a flowchart illustrating an information extraction method according to a fifth embodiment of the disclosure. As shown in fig. 10, the information extraction method may include the steps of:

S1001, acquiring a document to be processed.

Illustratively, a user inputs a document to be processed into the system through the front-end interface, or the user inputs a network address corresponding to the document into the system through the front-end interface, so that the system automatically loads the document to be processed.

S1002, analyzing the document to be processed.

Optionally, the electronic device performs document preprocessing on the document to be processed, and determines a target processing portion of the document to be processed.

S1003, performing document layout analysis through table detection.

Optionally, the electronic device performs layout analysis on a target processing portion of the document to be processed, distinguishes a text region, a table region, a title region, and the like, and forms a document layout analysis result.

For example, it is assumed that the electronic device performs information extraction only on the plain text area and the table area in the document layout analysis result.

S1004 to S1007, for the plain text region, determining the target paragraph where the target key name is located, and extracting the plain text target information based on the machine reading comprehension model.

S1008 to S1010, performing table analysis on the table area, and extracting the table target information.

Illustratively, the information extraction may further include:

S1011, checking and correcting.

Optionally, the electronic device may also use the verification module to correct the extracted target information.

S1012, displaying the result.

For example, the electronic device may display the extracted target information through the display module, so that the user can view the extraction result.

The implementation of each step in the embodiments of the present disclosure may refer to the record in each embodiment, and is not described herein again.

Fig. 11 is a schematic structural diagram of an information extraction device according to an embodiment of the present disclosure. The information extraction device provided in this embodiment may be the electronic device in fig. 1 or a device in the electronic device. As shown in fig. 11, the information extraction apparatus 1100 according to the present embodiment includes:

an obtaining unit 1101 configured to obtain a document to be processed, where the document to be processed includes a region to be processed;

an analyzing unit 1102, configured to analyze the to-be-processed region to obtain analysis information corresponding to the to-be-processed region;

an extracting unit 1103, configured to extract, based on a target key name, target information corresponding to the target key name from analysis information corresponding to the to-be-processed area.

In one possible implementation, the to-be-processed region includes a plain text area, the analysis information corresponding to the to-be-processed area comprises text information corresponding to the plain text area, and the text information comprises a directory title and text information of the to-be-processed document;

correspondingly, the extracting unit 1103 includes:

a first extraction module, configured to determine, based on the target key name and the directory title, a target paragraph where the target key name is located;

and the second extraction module is used for extracting the target information from the text information of the target paragraph.

Optionally, the second extraction module includes:

the coding submodule is used for carrying out sentence coding on the text information in the target paragraph to obtain a paragraph embedded text, a mark embedded text and a positioning embedded text of each sentence;

the first extraction submodule is used for determining semantic expression information of each sentence based on the paragraph embedded text, the mark embedded text and the original characteristic of the positioning embedded text of each sentence and the context characteristic of the paragraph embedded text, the mark embedded text and the positioning embedded text of each sentence;

the second extraction submodule is used for determining a similarity matching matrix of the target key name and each statement according to the target key name and the semantic expression information of each statement;

and the third extraction submodule is used for extracting the target information from the text information of the target paragraph according to the similarity matching matrix.

In one possible implementation manner, the to-be-processed region includes a table area, and the analysis information corresponding to the to-be-processed area comprises structured information corresponding to the table area;

correspondingly, the parsing unit 1102 includes:

a first analysis module, configured to extract a mask map of the table area by using a semantic segmentation model, and to determine a text detection box of the table area based on the mask map;

a second analysis module, configured to complete row-column alignment by using the coordinates of the text detection box to obtain a table row header and a table column header of the table area;

and a third analysis module, configured to determine the structured information corresponding to the table area based on the table row header and the table column header.

In a possible implementation manner, the apparatus further includes:

a determining unit (not shown) for performing layout analysis on the document to be processed and determining the region to be processed in the document to be processed.

Optionally, the area to be processed includes: a plain text area and a table area;

correspondingly, the determining unit includes:

a first determining module, configured to process the to-be-processed document by using a table detection model with an encoder-decoder architecture, and determine a table region in the to-be-processed document;

and the second determining module is used for determining the plain text area in the document to be processed according to the document to be processed and the table area.

In a possible implementation manner, the apparatus further includes:

a verification unit (not shown) for verifying the target information to obtain a verification result;

a correcting unit (not shown) for correcting the error information in the target information in response to the verification result indicating that the error information exists in the target information.

In a possible implementation manner, the obtaining unit 1101 is further configured to:

acquiring an information storage instruction, wherein the information storage instruction comprises at least one storage position, and each storage position is used for storing at least one task type data;

storing the target information to the at least one storage location based on the information storage instruction.

In a possible implementation manner, the obtaining unit 1101 is specifically configured to:

acquiring a document network address;

loading a document to be processed corresponding to the document network address;

or

And acquiring the locally uploaded document to be processed.

The information extraction device provided in this embodiment may be configured to execute the information extraction method of any of the above method embodiments, and the implementation principle and the technical effect are similar, which are not described herein again.

The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.

According to an embodiment of the present disclosure, the present disclosure also provides a computer program product comprising: a computer program, stored in a readable storage medium, from which at least one processor of the electronic device can read the computer program, the at least one processor executing the computer program causing the electronic device to perform the solution provided by any of the embodiments described above.

FIG. 12 shows a schematic block diagram of an example electronic device used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.

As shown in fig. 12, the device 1200 includes a computing unit 1201 which can perform various appropriate actions and processes in accordance with a computer program stored in a Read Only Memory (ROM) 1202 or a computer program loaded from a storage unit 1208 into a Random Access Memory (RAM) 1203. In the RAM 1203, various programs and data required for the operation of the device 1200 may also be stored. The computing unit 1201, the ROM 1202, and the RAM 1203 are connected to each other by a bus 1204. An input/output (I/O) interface 1205 is also connected to bus 1204.

Various components in the device 1200 are connected to the I/O interface 1205 including: an input unit 1206 such as a keyboard, a mouse, or the like; an output unit 1207 such as various types of displays, speakers, and the like; a storage unit 1208, such as a magnetic disk, optical disk, or the like; and a communication unit 1209 such as a network card, modem, wireless communication transceiver, etc. The communication unit 1209 allows the device 1200 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.

The computing unit 1201 may be a variety of general purpose and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 1201 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 1201 performs the respective methods and processes described above, such as the information extraction method. For example, in some embodiments, the information extraction method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 1208. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 1200 via the ROM 1202 and/or the communication unit 1209. When the computer program is loaded into the RAM 1203 and executed by the computing unit 1201, one or more steps of the information extraction method described above may be performed. Alternatively, in other embodiments, the computing unit 1201 may be configured to perform the information extraction method in any other suitable way (e.g., by means of firmware).

Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), system on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.

The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system that addresses the high management difficulty and weak service scalability of traditional physical hosts and Virtual Private Server (VPS) services. The server may also be a server of a distributed system, or a server incorporating a blockchain.

It should be understood that the various forms of flows shown above may be used with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in a different order, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved; the present disclosure is not limited in this respect.

The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims

1. An information extraction method, comprising:

2. The method of claim 1, wherein the region to be processed comprises a plain text area; the analysis information corresponding to the region to be processed comprises text information corresponding to the plain text area, and the text information comprises a directory title and body text information of the document to be processed;

correspondingly, the extracting, based on the target key name, the target information corresponding to the target key name from the analysis information corresponding to the to-be-processed area includes:

determining a target paragraph in which the target key name is located based on the target key name and the directory title;

and extracting the target information from the body text information of the target paragraph.
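The two steps of claim 2 can be illustrated with a minimal sketch. It is not the claimed implementation: the function name `find_target_paragraph`, the `(directory_title, body_text)` pair layout, and the use of `difflib` string similarity to match the key name against directory titles are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def find_target_paragraph(key_name, sections):
    """Pick the section whose directory title best matches the key name.

    `sections` is a hypothetical list of (directory_title, body_text) pairs
    produced by parsing the plain text area.
    """
    def score(title):
        # A title that literally contains the key name is a perfect match;
        # otherwise fall back to a fuzzy string-similarity ratio.
        if key_name.lower() in title.lower():
            return 1.0
        return SequenceMatcher(None, key_name.lower(), title.lower()).ratio()

    return max(sections, key=lambda section: score(section[0]))


sections = [
    ("1. Company Overview", "The company was founded in 2001."),
    ("2. Registered Capital", "The registered capital is 5 million yuan."),
    ("3. Risk Factors", "Market risk is moderate."),
]
title, body = find_target_paragraph("registered capital", sections)
print(title)  # 2. Registered Capital
```

The returned body text is then the input to the extraction step; how the target information is pulled out of it is the subject of claim 3.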

3. The method of claim 2, wherein said extracting target information from body information of the target paragraph comprises:

performing sentence encoding on the body text information in the target paragraph to obtain a paragraph embedded text, a mark embedded text and a positioning embedded text of each sentence;

determining semantic expression information of each sentence based on original features and context features of the paragraph embedded text, the mark embedded text and the positioning embedded text of each sentence;

determining a similarity matching matrix between the target key name and each sentence according to the semantic expression information of the target key name and of each sentence;

and extracting the target information from the text information of the target paragraph according to the similarity matching matrix.
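The similarity-matching step of claim 3 can be sketched as follows. This is a simplified illustration, not the claimed model: the learned sentence encoder is replaced by a bag-of-words count vector (the hypothetical `embed` function), and cosine similarity is assumed as the matching score. All function names are illustrative.

```python
import math
from collections import Counter


def embed(text):
    # Stand-in for the learned semantic representation: word counts.
    return Counter(text.lower().replace(".", "").split())


def cosine(a, b):
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def similarity_matrix(key_names, sentences):
    # Entry [i][j] scores key name i against sentence j.
    sentence_vectors = [embed(s) for s in sentences]
    return [[cosine(embed(k), v) for v in sentence_vectors] for k in key_names]


def extract(key_names, sentences):
    # For each key name, keep the sentence with the highest similarity.
    matrix = similarity_matrix(key_names, sentences)
    result = {}
    for key, row in zip(key_names, matrix):
        best = max(range(len(sentences)), key=lambda j: row[j])
        result[key] = sentences[best]
    return result


sentences = [
    "The registered capital is 5 million yuan.",
    "The legal representative is Zhang San.",
]
keys = ["registered capital", "legal representative"]
matches = extract(keys, sentences)
```

Here `matches["registered capital"]` is the first sentence and `matches["legal representative"]` is the second. A real system would substitute the encoder output for `embed` and would typically apply a threshold so that a key name with no good match returns nothing.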

4. The method of any one of claims 1-3, wherein the region to be processed comprises a table area; the analysis information corresponding to the region to be processed comprises structural information corresponding to the table area;

correspondingly, the analyzing the to-be-processed area to obtain analysis information corresponding to the to-be-processed area includes:

extracting a mask map of the table area by using a semantic segmentation model;

completing row-column alignment by using the coordinates of the character detection frame to obtain a table row header and a table column header of the table area;

and determining the structural information corresponding to the table area based on the table row header and the table column header.
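The row-column alignment and header steps of claim 4 can be sketched as below. The sketch skips the semantic segmentation model and starts from text detection boxes that are assumed to be already available as `(text, x, y)` tuples giving each box centre. The tolerance `y_tol`, the function names, and the assumption that the first row and first column hold the headers are illustrative choices.

```python
def align_cells(boxes, y_tol=10):
    """Group detection boxes into rows by y, then order each row by x."""
    rows = []
    for text, x, y in sorted(boxes, key=lambda b: b[2]):
        # A box joins the current row if its centre is close to the row's
        # first box; otherwise it starts a new row.
        if rows and abs(rows[-1][0] - y) <= y_tol:
            rows[-1][1].append((x, text))
        else:
            rows.append((y, [(x, text)]))
    return [[text for _, text in sorted(cells)] for _, cells in rows]


def table_structure(grid):
    """Map (row header, column header) pairs to cell values.

    Assumes a complete grid whose first row is the column header and
    whose first column is the row header.
    """
    column_header = grid[0][1:]
    structure = {}
    for row in grid[1:]:
        for j, column in enumerate(column_header):
            structure[(row[0], column)] = row[j + 1]
    return structure


boxes = [
    ("Item", 10, 10), ("2020", 100, 12), ("2021", 200, 9),
    ("Revenue", 10, 50), ("120", 100, 52), ("150", 200, 49),
    ("Profit", 10, 90), ("30", 100, 91), ("45", 200, 88),
]
grid = align_cells(boxes)
structure = table_structure(grid)
print(structure[("Revenue", "2021")])  # 150
```

Keying the result by header pair means a target key name can later be looked up directly against the row or column header. Merged cells and missing cells are not handled in this sketch.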

5. The method of any of claims 1-4, wherein, after the acquiring of the document to be processed, the method further comprises:

and performing layout analysis on the document to be processed to determine the region to be processed in the document to be processed.

6. The method of claim 5, wherein the region to be processed comprises: a plain text area and a table area;

correspondingly, the performing layout analysis on the document to be processed to determine the region to be processed in the document to be processed includes:

processing the document to be processed by using a table detection model of an encoder-decoder architecture, and determining the table area in the document to be processed;

and determining the plain text area in the document to be processed according to the document to be processed and the table area.
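The second step of claim 6, deriving the plain text area from the document and the detected table area, can be sketched as an interval complement. The sketch assumes a single-column page in which each table is described only by its vertical span `(top, bottom)`; the function name and this one-dimensional simplification are assumptions, and the table detection model itself is out of scope.

```python
def plain_text_regions(page_height, table_spans):
    """Return the vertical spans of a page not covered by any table."""
    regions = []
    cursor = 0
    for top, bottom in sorted(table_spans):
        # Anything between the previous table and this one is plain text.
        if top > cursor:
            regions.append((cursor, top))
        cursor = max(cursor, bottom)
    if cursor < page_height:
        regions.append((cursor, page_height))
    return regions


regions = plain_text_regions(1000, [(600, 700), (200, 400)])
print(regions)  # [(0, 200), (400, 600), (700, 1000)]
```

Overlapping table spans are merged implicitly because the cursor only moves forward.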

7. The method of any of claims 1-6, further comprising:

verifying the target information to obtain a verification result;

and correcting the error information in the target information in response to the verification result indicating that the error information exists in the target information.
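The verify-then-correct flow of claim 7 can be sketched as follows. The claim does not specify how verification or correction is performed, so everything here is hypothetical: the per-key regular-expression rules, the `fixers` mapping, and the example field names.

```python
import re


def verify(target_info, rules):
    """Return the keys whose values fail their (hypothetical) format rule."""
    return [
        key
        for key, value in target_info.items()
        if key in rules and not re.fullmatch(rules[key], value)
    ]


def correct(target_info, errors, fixers):
    """Apply a fixer to every erroneous key that has one; leave the rest."""
    fixed = dict(target_info)
    for key in errors:
        if key in fixers:
            fixed[key] = fixers[key](fixed[key])
    return fixed


# Hypothetical rule: the registered capital must be a plain number.
rules = {"registered capital": r"\d+(\.\d+)?"}
# Hypothetical fixer: drop everything except digits and the decimal point.
fixers = {"registered capital": lambda v: re.sub(r"[^\d.]", "", v)}

info = {
    "registered capital": "5,000,000 yuan",
    "legal representative": "Zhang San",
}
errors = verify(info, rules)
result = correct(info, errors, fixers)
print(result["registered capital"])  # 5000000
```

Correction runs only when `verify` reports at least one error, mirroring the "in response to" condition in the claim.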

8. An information extraction apparatus comprising:

9. The apparatus of claim 8, wherein the region to be processed comprises a plain text area; the analysis information corresponding to the region to be processed comprises text information corresponding to the plain text area, and the text information comprises a directory title and body text information of the document to be processed;

correspondingly, the extraction unit comprises:

the first extraction module is used for determining a target paragraph where the target key name is located based on the target key name and the directory title;

10. The apparatus of claim 9, wherein the second extraction module comprises:

11. The apparatus according to any one of claims 8-10, wherein the region to be processed comprises a table area; the analysis information corresponding to the region to be processed comprises structural information corresponding to the table area;

correspondingly, the parsing unit includes:

12. The apparatus of any of claims 8-11, further comprising:

the determining unit is used for performing layout analysis on the document to be processed and determining the region to be processed in the document to be processed.

13. The apparatus of claim 12, wherein the region to be processed comprises: a plain text area and a table area;

correspondingly, the determining unit includes:

14. The apparatus of any of claims 8-13, further comprising:

the verification unit is used for verifying the target information to obtain a verification result;

and the correcting unit is used for correcting the error information in the target information in response to the verification result indicating that the error information exists in the target information.

15. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.

16. A non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-7.

17. A computer program product comprising a computer program which, when executed by a processor, carries out the steps of the method of any one of claims 1 to 7.