CN114863455A - Method and device for extracting information - Google Patents
Method and device for extracting information
- Publication number: CN114863455A
- Application number: CN202210589436.3A
- Authority: CN (China)
- Prior art keywords: entity, picture, text, character
- Prior art date: 2022-05-26
- Legal status: Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
- G06V30/41—Analysis of document content
- G06V30/413—Classification of content, e.g. text, photographs or tables
- G06V30/10—Character recognition
- G06V30/18—Extraction of features or characteristics of the image
Abstract
The application discloses a method and a device for extracting information, relating to the technical fields of image recognition and structured information processing. The method comprises the following steps: acquiring a picture to be recognized; acquiring elements in the picture to be recognized, wherein the elements comprise a plurality of text entities in the picture to be recognized; determining an entity tag for each of the plurality of text entities; and extracting the text information in the picture to be recognized according to the entity tag of each text entity. The method can improve the efficiency and the accuracy of extracting text information from pictures.
Description
Technical Field
The present disclosure relates to the field of computer technologies, in particular to the fields of image recognition and structured information processing, and more particularly to a method and an apparatus for extracting information.
Background
Structured extraction of the text information in a picture facilitates efficient use and effective maintenance of that information. Existing methods for structured extraction of text information from a picture include: recognizing the content of preset fields in the picture using OCR (Optical Character Recognition) technology and splicing the results according to certain rules; or recognizing the text entities in the picture using deep learning technology and then splicing the entities.
However, these conventional methods for structured extraction of text information from pictures suffer from low extraction efficiency and poor accuracy.
Disclosure of Invention
The present disclosure provides a method, an apparatus, an electronic device, and a computer-readable storage medium for extracting information.
According to a first aspect, there is provided a method for extracting information, the method comprising: acquiring a picture to be recognized; acquiring elements in the picture to be recognized, wherein the elements comprise a plurality of text entities in the picture to be recognized; determining an entity tag for each of the plurality of text entities; and extracting the text information in the picture to be recognized according to the entity tag of each text entity.
According to a second aspect, there is provided an apparatus for extracting information, the apparatus comprising: an acquisition unit configured to acquire a picture to be recognized; an identification unit configured to acquire elements in the picture to be recognized, wherein the elements comprise a plurality of text entities in the picture to be recognized; a marking unit configured to determine an entity tag for each of the plurality of text entities; and an extraction unit configured to extract the text information in the picture to be recognized according to the entity tag of each text entity.
According to a third aspect, embodiments of the present disclosure provide an electronic device, comprising: one or more processors; and a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method for extracting information as provided in the first aspect.
According to a fourth aspect, embodiments of the present disclosure provide a computer-readable storage medium having a computer program stored thereon, where the program, when executed by a processor, implements the method for extracting information provided by the first aspect.
The method and the apparatus for extracting information acquire a picture to be recognized; acquire elements in the picture to be recognized, wherein the elements comprise a plurality of text entities in the picture to be recognized; determine an entity tag for each of the plurality of text entities; and extract the text information in the picture to be recognized according to the entity tag of each text entity. In this way, the efficiency and the accuracy of extracting text information from pictures can be improved.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. In the drawings:
FIG. 1 is an exemplary system architecture diagram in which embodiments of the present application may be applied;
FIG. 2 is a flow diagram of one embodiment of a method for extracting information, according to the present application;
FIG. 3 is a flow diagram of another embodiment of a method for extracting information according to the present application;
FIG. 4 is a schematic illustration of a region of a preset location in one embodiment of a method for extracting information according to the present application;
FIG. 5 is a flow diagram of yet another embodiment of a method for extracting information according to the present application;
FIG. 6 is a schematic diagram of calculating coverage between detection boxes in an embodiment of a method for extracting information according to the present application;
FIG. 7 is a schematic diagram of BIO sequence labeling in an application scenario of a method for extracting information according to the present application;
FIG. 8 is a schematic illustration of regions in a same row or column in an application scenario of a method for extracting information according to the present application;
FIG. 9 is a schematic diagram of a positional relationship between entities of key-value pair relationships in one application scenario of a method for extracting information according to the application;
FIG. 10 is a block diagram of one embodiment of an apparatus for extracting information according to the present application;
fig. 11 is a block diagram of an electronic device for implementing a method for extracting information according to an embodiment of the present application.
Detailed Description
The following description of exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments to aid understanding; these details are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Likewise, descriptions of well-known functions and constructions are omitted from the following description for clarity and conciseness.
Fig. 1 shows an exemplary system architecture 100 to which embodiments of the method for extracting information or the apparatus for extracting information of the present application may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may be user terminal devices on which various client applications may be installed, for example, data reading type applications, image type applications, video type applications, search type applications, financial type applications, etc.
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting the receipt of server messages, including but not limited to smartphones, tablets, e-book readers, media players, laptop computers, desktop computers, and the like.
The terminal devices 101, 102, and 103 may be hardware or software. When they are hardware, they may be any of the various electronic devices listed above; when they are software, they may be installed in the electronic devices listed above and may be implemented as multiple pieces of software or software modules (e.g., multiple software modules providing distributed services) or as a single piece of software or software module. No specific limitation is made here.
The server 105 may acquire a picture to be recognized; acquire elements in the picture to be recognized, wherein the elements comprise a plurality of text entities in the picture to be recognized; determine an entity tag for each of the plurality of text entities; and extract the text information in the picture to be recognized according to the entity tag of each text entity.
It should be noted that the method for extracting information provided by the embodiments of the present disclosure may be performed by the server 105, and accordingly, the apparatus for extracting information may be disposed in the server 105.
It should be understood that the number of devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of devices, networks, and servers, as desired for an implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of a method for extracting information in accordance with the present disclosure is shown. The method for extracting information comprises the following steps:

Step 201, acquiring a picture to be recognized.

In this embodiment, an executing body of the method for extracting information (for example, the server shown in fig. 1) acquires a picture to be recognized in a wired or wireless manner, where the picture to be recognized may be any picture containing text information, such as a medical document, an advertisement, or a warehouse document.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure, and other handling of the personal information of the users involved all comply with the provisions of relevant laws and regulations and do not violate public order and good morals.
Step 202, acquiring elements in the picture to be recognized, wherein the elements comprise a plurality of text entities in the picture to be recognized.

In this embodiment, the picture to be recognized may be recognized by using a character recognition technology (e.g., OCR technology, a pre-trained natural language recognition model, or a pre-trained vocabulary recognition model) to obtain a plurality of text entities in the picture to be recognized, where a text entity may be a word or a character; for example, "gender" in a medical document may be a text entity, and "female" in the medical document may also be a text entity.
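As a minimal sketch of this recognition step, the following Python code obtains text entities together with their bounding boxes; the use of pytesseract is an assumption made for illustration only, since the disclosure does not prescribe a particular OCR engine:

```python
import pytesseract
from PIL import Image

def recognize_text_entities(image_path: str):
    """Return the recognized text entities and their bounding boxes."""
    image = Image.open(image_path)
    data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
    entities = []
    for text, left, top, width, height in zip(
        data["text"], data["left"], data["top"], data["width"], data["height"]
    ):
        if text.strip():  # skip empty detections
            entities.append({"text": text, "box": (left, top, left + width, top + height)})
    return entities
```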
Step 203, determining an entity tag for each of the plurality of text entities.

In this embodiment, an entity tag of each of the plurality of text entities may be determined, where the entity tag may be any predefined tag, such as "heading", "title", "question", "answer", "header of table", "content of table", and so on.
Specifically, OCR technology and a predefined tag library may be employed to assign the recognized text to predefined entity tags; a tag prediction model trained in advance may be adopted to predict the tag of a text entity based on the features of the text entity; or the entity tag of each text entity may be looked up in a predetermined correspondence table between entities and tags; and so on.
Step 204, extracting the text information in the picture to be recognized according to the entity tag of each text entity.

In this embodiment, the text information in the picture to be recognized may be extracted according to the entity tag of each text entity. For example, the text entities with the "question" tag and the text entities with the "answer" tag may be extracted as the text information in the picture to be recognized. For another example, the text entities with the "header of table" tag and the text entities with the "content of table" tag may be extracted as the text information of the table in the picture to be recognized, and so on.
The method for extracting information provided by this embodiment acquires a picture to be recognized; acquires elements in the picture to be recognized, wherein the elements comprise a plurality of text entities; determines an entity tag for each text entity; and extracts the text information in the picture to be recognized according to the entity tag of each text entity. Because the text information is extracted based on the tags of the text entities, the efficiency of recognizing text information in the picture can be improved. Moreover, since every text entity in the picture is given a tag and information extraction is performed based on the tags, information omission can be avoided, as can the error of outputting adjacent text entities as a whole merely because their positions are close to each other; the accuracy of information extraction can therefore be improved.
Optionally, acquiring the elements in the picture to be recognized includes: inputting the picture to be recognized into a multi-modal information extraction pre-training model to obtain the elements in the picture to be recognized output by the multi-modal information extraction pre-training model.

In this embodiment, the multi-modal information extraction pre-training model is used to extract the features of the elements in the picture to be recognized, where the elements include the characters in the picture, the positions of the characters, and the like. The multi-modal information extraction pre-training model may be a model such as LayoutXLM, which can organize visual clues of a document by fields, build matching relationships between fields, and align image and text features; by constructing a masked visual-language model together with field length prediction and field direction prediction pre-training tasks, it promotes cross-modal feature interaction, helps the model learn the information correlation between modalities, and increases its capacity for comprehensive document understanding.
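A minimal sketch of this step, assuming LayoutXLM as loaded from the HuggingFace Transformers library (an implementation assumption; the disclosure names LayoutXLM but no specific implementation), with six labels mirroring the tag set described in the medical-document scenario below:

```python
import torch
from PIL import Image
from transformers import LayoutXLMProcessor, LayoutLMv2ForTokenClassification

# the processor runs OCR internally and produces token ids, boxes, and the image tensor
processor = LayoutXLMProcessor.from_pretrained("microsoft/layoutxlm-base")
# LayoutXLM shares the LayoutLMv2 architecture; num_labels=6 is an assumption
# matching the QUESTION/ANSWER/HEADER/TABLE/ELEMENT/OTHER tags described later
model = LayoutLMv2ForTokenClassification.from_pretrained(
    "microsoft/layoutxlm-base", num_labels=6
)

image = Image.open("document.png").convert("RGB")
encoding = processor(image, return_tensors="pt")
with torch.no_grad():
    logits = model(**encoding).logits  # shape: (1, sequence_length, num_labels)
predicted_tag_ids = logits.argmax(dim=-1).squeeze(0)  # one tag id per token
```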
With continued reference to FIG. 3, a flow 300 of another embodiment of a method for extracting information in accordance with the present disclosure is shown. The method for extracting information comprises the following steps:

Step 301, acquiring a picture to be recognized.

The description of step 301 in this embodiment is the same as that of step 201 and is not repeated here.
In this embodiment, a picture to be recognized may be recognized by using a character recognition technology (e.g., an OCR technology, a pre-trained natural language recognition model, a pre-trained vocabulary recognition model, or the like), and a plurality of character entities in the recognized picture to be recognized and a position of each character entity in the picture are obtained.
Step 303, determining an entity tag for each of the plurality of text entities.

In this embodiment, an entity tag may be determined for each of the plurality of text entities, where the entity tag may be "question" or "answer". For example, the entity tag of the text entity "gender" may be "question" and the entity tag of the text entity "female" may be "answer"; as another example, the entity tag of the text entity "date" may be "question" and the entity tag of the text entity "January 1, 2022" may be "answer".
Step 304, for a text entity whose entity tag is "question", determining the text entity and a text entity which is located at a preset position relative to it and whose entity tag is "answer" as a candidate key-value pair.

In this embodiment, for a text entity tagged "question", candidate answers are determined based on the positional relationship; that is, the text entity and each text entity that is located at a preset position relative to it and tagged "answer" are determined as a candidate key-value pair. For example, as shown in fig. 4, the preset position may be set as a sector area on the right side (positive x-axis direction) of the text entity, and the text entity in the figure may be combined with the three text entities in the sector area to form three candidate key-value pairs.
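A sketch of this pairing step under stated assumptions: boxes are axis-aligned (x0, y0, x1, y1), the sector opens rightward from the "question" entity as in fig. 4, and the 45-degree half-angle is an illustrative choice not fixed by the disclosure:

```python
import math

def box_center(box):
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)

def in_right_sector(question_box, answer_box, half_angle_deg=45.0):
    """True if the answer box lies in a sector opening rightward (positive x)
    from the question box, as in fig. 4."""
    qx, qy = box_center(question_box)
    ax, ay = box_center(answer_box)
    dx, dy = ax - qx, ay - qy
    if dx <= 0:
        return False  # the answer must lie to the right of the question
    return abs(math.degrees(math.atan2(dy, dx))) <= half_angle_deg

def candidate_key_value_pairs(questions, answers):
    """Pair each 'question' entity with every 'answer' entity in its sector."""
    return [(q, a) for q in questions for a in answers
            if in_right_sector(q["box"], a["box"])]
```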
Step 305, determining whether the text entities included in the candidate key-value pair are in a key-value relationship according to the vector representation of the text entities included in the candidate key-value pair.
In this embodiment, after a plurality of candidate key-value pairs are determined, for each candidate key-value pair, whether the text entities included in the pair are in a key-value relationship is determined according to the vector representations of those text entities; that is, the target key-value pairs are determined from among the candidate key-value pairs.
Specifically, the head vector representation of each text entity in the candidate key-value pair may be taken, and the concatenated head vector representations may be input into a pre-trained classification model to obtain a classification result, where the pre-trained classification model is trained on data whose samples are concatenated head vector representations of entity pairs and whose sample labels are the relationships between the entities. For example, if "gender" and "name" form a candidate key-value pair, the head vector representation of "gender" and the head vector representation of "name" are taken, concatenated, and input into the classification model, and the classification result is "unrelated".
Alternatively, the average vector representation of each text entity in the candidate key-value pair may be taken and input into a pre-trained classification model to obtain a classification result, where the pre-trained classification model is trained on data whose samples are average vector representations of entity pairs and whose sample labels are the relationships between the entities.
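Whether head vectors or average vectors are used, the pair classifier itself can be as simple as a linear layer over the concatenated representations. The following PyTorch sketch assumes 768-dimensional hidden states and a binary label set; both are illustrative assumptions:

```python
import torch
import torch.nn as nn

class PairRelationClassifier(nn.Module):
    """Binary classifier over a concatenated pair of entity vector representations."""

    def __init__(self, hidden_size: int = 768, num_classes: int = 2):
        super().__init__()
        self.classifier = nn.Linear(2 * hidden_size, num_classes)

    def forward(self, vec_a: torch.Tensor, vec_b: torch.Tensor):
        pair = torch.cat([vec_a, vec_b], dim=-1)  # splice the two entity vectors
        return self.classifier(pair)  # logits over {key-value, unrelated}

# usage sketch: logits = PairRelationClassifier()(h_question, h_answer)
# related = logits.argmax(dim=-1) == 0
```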
Step 306, determining the text entities having the key-value relationship as the text information in the picture to be recognized.

In this embodiment, the text entities having the key-value relationship are determined as the text information in the picture to be recognized.
Compared with the embodiment described in fig. 2, the method for extracting information provided in this embodiment specifies that, for a text entity with the "question" entity tag, the text entities with the "answer" entity tag that form candidate key-value pairs with it are determined based on the positional relationship; the key-value pairs having a true key-value relationship are then determined from the candidates based on a classification model and output as the extracted information. The accuracy of information extraction can thereby be improved.
With continued reference to FIG. 5, a flow 500 of yet another embodiment of a method for extracting information in accordance with the present disclosure is shown. The method for extracting information comprises the following steps:

Step 501, acquiring a picture to be recognized.

The description of step 501 in this embodiment is the same as that of step 201 and is not repeated here.
Step 502, acquiring elements in the picture to be recognized, wherein the elements comprise a plurality of text entities and the position of each text entity in the picture.

In this embodiment, the picture to be recognized may be recognized by using a character recognition technology (e.g., OCR technology, a pre-trained natural language recognition model, or a pre-trained vocabulary recognition model) to obtain a plurality of text entities in the picture to be recognized and the position of each text entity in the picture.
Step 503, determining an entity tag for each of the plurality of text entities.

In this embodiment, an entity tag of each of the plurality of text entities may be determined, where the entity tag may be "header of table" or "content of table".
In this embodiment, for a text entity whose entity label is "table head of table", the same column elements (text entities) under the table head may be found out first according to the table head of the table, and sorted according to the y coordinate (upper and lower positions), and then each column may be sequentially traversed from the column with the most elements according to the number of the elements in each column, so as to obtain the elements (text entities) in the same row, thereby implementing structured extraction of table information.
In this embodiment, for a text entity whose entity label is "table header of table", the elements (text entities) in the same row under the table header are found out according to the table header of the table, and are sorted according to the x coordinate (left and right positions), and then each row can be sequentially traversed from the row with the most elements according to the number of the elements in each row, so as to obtain the elements (text entities) in the same row, thereby implementing structured extraction of table information.
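A sketch of the column-first variant, assuming same_column and same_row predicates are available (for example, the projection-coverage test described below):

```python
def restore_table(headers, cells, same_column, same_row):
    """Column-first table restoration: headers and cells are dicts with
    'text' and 'box' (x0, y0, x1, y1); same_column/same_row are predicates
    on two boxes."""
    columns = []
    for header in headers:
        col = [c for c in cells if same_column(header["box"], c["box"])]
        col.sort(key=lambda c: c["box"][1])  # order cells top-to-bottom by y
        columns.append((header, col))
    # start from the column with the most elements, then collect its row mates
    columns.sort(key=lambda hc: len(hc[1]), reverse=True)
    anchor_header, anchor_col = columns[0]
    rows = []
    for cell in anchor_col:
        row = {anchor_header["text"]: cell["text"]}
        for header, col in columns[1:]:
            mates = [c for c in col if same_row(cell["box"], c["box"])]
            if mates:
                row[header["text"]] = mates[0]["text"]
        rows.append(row)
    return rows
```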
Compared with the embodiment described in fig. 2, the method for extracting information provided in this embodiment specifies that, when the entity tags distinguish text entities as "header of table" or "content of table", the text entities are extracted based on their row/column relationships, which can improve the accuracy of information extraction.
Optionally, the method for judging whether at least two elements have the same-row relationship includes: determining whether a first element and a second element have the same-row relationship based on the coverage rate between the transverse projection of the area where the first element is located and the transverse projection of the area where the second element is located. Alternatively, the method for judging whether at least two elements have the same-column relationship includes: determining whether a third element and a fourth element have the same-column relationship based on the coverage rate between the longitudinal projection of the area where the third element is located and the longitudinal projection of the area where the fourth element is located.
In this embodiment, since the picture to be recognized may be photographed obliquely rather than head-on and may therefore exhibit perspective distortion, when detecting whether two elements in the table have the same-row relationship (i.e., are located in the same row) or the same-column relationship (i.e., are located in the same column), whether the elements lie on the same horizontal or vertical line may be determined according to the coverage rate of the positions of adjacent detection boxes (the boxes used to detect the text entities), which improves the accuracy of the same-row/same-column determination. For example, the coverage rate in the vertical direction is calculated as shown in fig. 6, where coverage = area C / area A: the numerator is the overlapping part of A and B after alignment, and the denominator is the smaller of the areas of A and B. Whether A and B lie on the same line (row/column) may then be determined against a threshold; for example, when the coverage rate is greater than 0.5, the two are determined to lie on the same line. Here, A and B are both text entity detection/recognition boxes.
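A sketch of this coverage test on axis-aligned boxes (x0, y0, x1, y1); the one-dimensional projection form and the 0.5 threshold follow the example above, while the exact formulation in the disclosure is per fig. 6:

```python
def interval_coverage(a0, a1, b0, b1):
    """Overlap length of intervals [a0, a1] and [b0, b1], divided by the
    length of the shorter interval (the smaller of A and B, as in fig. 6)."""
    overlap = max(0.0, min(a1, b1) - max(a0, b0))
    shorter = min(a1 - a0, b1 - b0)
    return overlap / shorter if shorter > 0 else 0.0

def same_row(box_a, box_b, threshold=0.5):
    # the vertical (y) projections of the two boxes overlap enough -> same row
    return interval_coverage(box_a[1], box_a[3], box_b[1], box_b[3]) >= threshold

def same_column(box_a, box_b, threshold=0.5):
    # the horizontal (x) projections of the two boxes overlap enough -> same column
    return interval_coverage(box_a[0], box_a[2], box_b[0], box_b[2]) >= threshold
```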
In some application scenarios, the picture to be recognized is a medical document. When extracting the structured information in the medical document, the medical document is first acquired; after acquisition, OCR technology may be adopted to recognize the medical document, obtain the text entities and the positions of the text entities in the document, and determine the entity tags of the text entities based on a model.
Secondly, entity classification is carried out: the elements recognized by the OCR technology (including the characters, their positions, and the like) are input into a LayoutXLM pre-training model to obtain the features of each text entity, and the LayoutXLM pre-training model then performs BIO sequence labeling on the characters (as shown in fig. 7), where BIO labeling marks each element as "B", "I", or "O": "B" indicates that the element is at the starting position of its segment, "I" indicates that the element is at a middle position of its segment, and "O" indicates that the element does not belong to any type. An entity tag is then obtained for each text entity. For example, the OCR technology detects the text box "gender: male", whose predicted labels are "QQQA" (Q stands for QUESTION, A stands for ANSWER). To handle the case where the OCR technology recognizes a key and a value separated by a small interval into the same text box, the decoding algorithm is adapted so that the decoding result is ("gender:", QQQ), ("male", A). With this decoding rule, the situation in which a key and its value in the document are recognized within the same detection/recognition box by the OCR technology can be handled correctly.
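A sketch of the adapted decoding rule, assuming one character-level label per character of the recognized text box; splitting wherever the label changes is what yields ("gender:", QQQ), ("male", A) from a single box:

```python
def split_by_labels(text: str, labels: str):
    """Split a text box wherever its per-character label changes,
    e.g. labels 'QQQA' yield a 'QQQ' segment and an 'A' segment."""
    assert len(text) == len(labels), "one label per character is assumed"
    segments = []
    start = 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            segments.append((text[start:i], labels[start:i]))
            start = i
    return segments
```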
Because key-value information and table information in medical record documents need to be processed simultaneously, six entity tags (QUESTION, ANSWER, HEADER, TABLE, ELEMENT, and OTHER) are predefined to facilitate subsequent decoding and structured output.
thirdly, entity connection is carried out: according to the word entities and the entity labels, whether the two word entities can form a key value pair or not is judged, or whether the two word entities are in the same row or the same column in the table or not is judged, namely, the classification model can classify the relationship between the two word entities into the following categories: key value relation, same row relation in table, column relation in table, no relation.
Within the preset area of a text entity bearing the QUESTION tag (as shown in fig. 4 and fig. 8), the text entities bearing the ANSWER tag are traversed. The preset area may be the sector area in the schematic diagram shown in fig. 4, traversed according to the positions of the recognition boxes recognized by the OCR technology. The text entity with the QUESTION tag and each candidate text entity form candidate key-value pairs; the text entities included in each pair are input into the LayoutXLM pre-training model, the head vector representation of each entity in the pair (e.g., the vector of the first character of "gender") is taken out, the head vector representations are concatenated and input into the classification/prediction model, and whether the text entities in each candidate key-value pair have a key-value relationship is determined based on that model, so as to identify the text entities having the key-value relationship.
For cross-row key-value pair matching, the text entities E1, E2, and E3 in fig. 9 are in fact a single value entity; however, because such data is scarce in the training data and the model has difficulty learning this positional relationship feature, the model cannot connect both text entity E1 and text entity E3 to the key text entity. In this case, for the entities of type ANSWER that have no key-value connection to any key, namely text entities E1 and E3, the nearest entity of type QUESTION in the vicinity can be found according to a predefined positional relationship (for example, key on the left and value on the right, or key on top and value on the bottom), and the missing connection can be supplemented.
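A sketch of this supplementary connection, assuming "nearest" means smallest Euclidean distance between box centers and that a key may sit to the left of or above its value; both assumptions are illustrative readings of the predefined positional relationship:

```python
def box_center(box):
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)

def attach_orphan_answers(questions, answers, links):
    """Connect each ANSWER entity without a key to its nearest admissible QUESTION.
    links: dict mapping answer id -> question id for already-connected pairs."""
    for ans in answers:
        if ans["id"] in links:
            continue
        ax, ay = box_center(ans["box"])
        # admissible keys: to the left of, or above, the orphan value
        candidates = [
            q for q in questions
            if box_center(q["box"])[0] <= ax or box_center(q["box"])[1] <= ay
        ]
        if candidates:
            nearest = min(
                candidates,
                key=lambda q: (box_center(q["box"])[0] - ax) ** 2
                            + (box_center(q["box"])[1] - ay) ** 2,
            )
            links[ans["id"]] = nearest["id"]
    return links
```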
A LayoutXLM pre-training model is adopted to classify the layout of the medical document at a coarse granularity, dividing it into a table area and a non-table area, which improves the accuracy of entity classification. The entity tag of a text entity in the table area is restricted to one of TABLE and ELEMENT. For elements located in the same row or column of the table, the table is "restored", that is, structured, according to the relative position information of the text. Specifically, according to the header of the table, the elements in the same column under the header are found and sorted by their y coordinates (vertical positions); the column with the most elements is then found and traversed in sequence to obtain the elements in the same row, completing the structuring of the table.
With further reference to fig. 10, as an implementation of the methods shown in the above figures, the present disclosure provides an embodiment of an apparatus for extracting information, which corresponds to the method embodiments shown in fig. 2, fig. 3 or fig. 5, and which may be applied in various electronic devices in particular.
As shown in fig. 10, the apparatus 1000 for extracting information of the present embodiment includes: an acquisition unit 1001, an identification unit 1002, a marking unit 1003, and an extraction unit 1004. The acquisition unit 1001 is configured to acquire a picture to be recognized; the identification unit 1002 is configured to acquire elements in the picture to be recognized, wherein the elements comprise a plurality of text entities in the picture to be recognized; the marking unit 1003 is configured to determine an entity tag for each of the plurality of text entities; and the extraction unit 1004 is configured to extract the text information in the picture to be recognized according to the entity tag of each text entity.
In some embodiments, the identification unit comprises: a recognition module configured to input the picture to be recognized into a multi-modal information extraction pre-training model to obtain the elements in the picture to be recognized output by the multi-modal information extraction pre-training model.
In some embodiments, the elements further include position information of each text entity in the picture to be recognized, and the entity tags include: question and answer. The extraction unit comprises: a determining module configured to, for a text entity whose entity tag is question, determine the text entity and a text entity which is located at a preset position relative to it and whose entity tag is answer as a candidate key-value pair; a judging module configured to determine whether the text entities included in the candidate key-value pair are in a key-value relationship according to the vector representations of the text entities included in the candidate key-value pair; and a first extraction module configured to determine the text entities having the key-value relationship as the text information in the picture to be recognized.
In some embodiments, the elements further include position information of each text entity in the picture to be recognized, and the entity tags further include: header of table. The extraction unit comprises: a second extraction module configured to, for each text entity whose entity tag is header of table, extract the text entities in the table that have the same-column relationship with the text entity, and extract, for each of the same-column text entities, the text entities that have the same-row relationship with it; or a third extraction module configured to, for each text entity whose entity tag is header of table, extract the text entities in the table that have the same-row relationship with the text entity, and extract, for each of the same-row text entities, the text entities that have the same-column relationship with it.
In some embodiments, the means for judging whether at least two elements have the same-row relationship comprises: determining whether a first element and a second element have the same-row relationship based on the coverage rate between the transverse projection of the area where the first element is located and the transverse projection of the area where the second element is located. Alternatively, the means for judging whether at least two elements have the same-column relationship comprises: determining whether a third element and a fourth element have the same-column relationship based on the coverage rate between the longitudinal projection of the area where the third element is located and the longitudinal projection of the area where the fourth element is located.
The units in the apparatus 1000 described above correspond to the steps in the method described with reference to fig. 2, 3 or 5. Thus, the operations, features and technical effects described above for the method for extracting information are also applicable to the apparatus 1000 and the units included therein, and are not described herein again.
There is also provided, in accordance with an embodiment of the present application, an electronic device, a readable storage medium, and a computer program product.
FIG. 11 shows a schematic block diagram of an example electronic device 1100 that may be used to implement embodiments of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only and are not meant to limit implementations of the present application described and/or claimed herein.
As shown in fig. 11, the device 1100 comprises a computing unit 1101, which may perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 1102 or a computer program loaded from a storage unit 1108 into a random access memory (RAM) 1103. In the RAM 1103, various programs and data necessary for the operation of the device 1100 may also be stored. The computing unit 1101, the ROM 1102, and the RAM 1103 are connected to each other by a bus 1104. An input/output (I/O) interface 1105 is also connected to the bus 1104.

A number of components in the device 1100 are connected to the I/O interface 1105, including: an input unit 1106 such as a keyboard, mouse, or the like; an output unit 1107 such as various types of displays, speakers, and the like; the storage unit 1108 such as a magnetic disk, optical disk, or the like; and a communication unit 1109 such as a network card, modem, wireless communication transceiver, or the like. The communication unit 1109 allows the device 1100 to exchange information/data with other devices via a computer network such as the Internet and/or various telecommunication networks.

The computing unit 1101 may be any of various general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 1101 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, and the like. The computing unit 1101 performs the respective methods and processes described above, such as the method for extracting information. For example, in some embodiments, the method for extracting information may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 1108. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 1100 via the ROM 1102 and/or the communication unit 1109. When the computer program is loaded into the RAM 1103 and executed by the computing unit 1101, one or more steps of the method for extracting information described above may be performed. Alternatively, in other embodiments, the computing unit 1101 may be configured by any other suitable means (e.g., by means of firmware) to perform the method for extracting information.
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present application may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowcharts and/or block diagrams to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this application, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be understood that the various forms of flow shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders; no limitation is imposed here as long as the desired result of the technical solution disclosed in the present application can be achieved.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.
Claims (13)
1. A method for extracting information, comprising:
acquiring a picture to be recognized;
acquiring elements in the picture to be recognized, wherein the elements comprise a plurality of text entities in the picture to be recognized;
determining an entity tag for each of the plurality of text entities;
and extracting the text information in the picture to be recognized according to the entity tag of each text entity.
2. The method of claim 1, wherein the obtaining the element in the picture to be recognized comprises:
inputting the picture to be recognized into a multi-modal information extraction pre-training model to obtain the elements in the picture to be recognized output by the multi-modal information extraction pre-training model.
3. The method of claim 1, wherein the elements further include position information of each text entity in the picture to be recognized, and the entity tags include: question and answer;
the extracting the text information in the picture to be recognized according to the entity tag of each text entity comprises the following steps:
for a text entity whose entity tag is question, determining the text entity and a text entity which is located at a preset position of the text entity and whose entity tag is answer as a candidate key-value pair;
determining whether the text entities included in the candidate key-value pair are in a key-value relationship according to the vector representations of the text entities included in the candidate key-value pair;
and determining the text entities having the key-value relationship as the text information in the picture to be recognized.
4. The method of claim 1, wherein the elements further include position information of each text entity in the picture to be recognized, and the entity tags further include: header of table;
the extracting the text information in the picture to be recognized according to the entity tag of each text entity comprises the following steps:
for each text entity whose entity tag is header of table, extracting the text entities in the table that have the same-column relationship with the text entity, and extracting, for each of the same-column text entities, the text entities that have the same-row relationship with it; or,
for each text entity whose entity tag is header of table, extracting the text entities in the table that have the same-row relationship with the text entity, and extracting, for each of the same-row text entities, the text entities that have the same-column relationship with it.
5. The method of claim 4, wherein determining whether at least two elements have the same-row relationship comprises:
determining whether a first element and a second element have the same-row relationship based on the coverage rate between the transverse projection of the area where the first element is located and the transverse projection of the area where the second element is located; or,
determining whether at least two elements have the same-column relationship comprises:
determining whether a third element and a fourth element have the same-column relationship based on the coverage rate between the longitudinal projection of the area where the third element is located and the longitudinal projection of the area where the fourth element is located.
6. An apparatus for extracting information, comprising:
an acquisition unit configured to acquire a picture to be recognized;
an identification unit configured to acquire elements in the picture to be recognized, wherein the elements comprise a plurality of text entities in the picture to be recognized;
a marking unit configured to determine an entity tag for each of the plurality of text entities;
and an extraction unit configured to extract the text information in the picture to be recognized according to the entity tag of each text entity.
7. The apparatus of claim 6, wherein the identifying unit comprises:
a recognition module configured to input the picture to be recognized into a multi-modal information extraction pre-training model to obtain the elements in the picture to be recognized output by the multi-modal information extraction pre-training model.
8. The apparatus of claim 6, wherein the elements further include position information of each text entity in the picture to be recognized, and the entity tags include: question and answer;
the extraction unit comprises:
a determining module configured to, for a text entity whose entity tag is question, determine the text entity and a text entity which is located at a preset position of the text entity and whose entity tag is answer as a candidate key-value pair;
a judging module configured to determine whether the text entities included in the candidate key-value pair are in a key-value relationship according to the vector representations of the text entities included in the candidate key-value pair;
and a first extraction module configured to determine the text entities having the key-value relationship as the text information in the picture to be recognized.
9. The apparatus of claim 6, wherein the elements further include position information of each text entity in the picture to be recognized, and the entity tags further include: header of table;
the extraction unit comprises:
a second extraction module configured to, for each text entity whose entity tag is header of table, extract the text entities in the table that have the same-column relationship with the text entity, and extract, for each of the same-column text entities, the text entities that have the same-row relationship with it; or,
a third extraction module configured to, for each text entity whose entity tag is header of table, extract the text entities in the table that have the same-row relationship with the text entity, and extract, for each of the same-row text entities, the text entities that have the same-column relationship with it.
10. The apparatus of claim 9, wherein the means for judging whether at least two elements have the same-row relationship comprises:
determining whether a first element and a second element have the same-row relationship based on the coverage rate between the transverse projection of the area where the first element is located and the transverse projection of the area where the second element is located; or,
the means for judging whether at least two elements have the same-column relationship comprises:
determining whether a third element and a fourth element have the same-column relationship based on the coverage rate between the longitudinal projection of the area where the third element is located and the longitudinal projection of the area where the fourth element is located.
11. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-5.
12. A non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-5.
13. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210589436.3A CN114863455A (en) | 2022-05-26 | 2022-05-26 | Method and device for extracting information |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114863455A true CN114863455A (en) | 2022-08-05 |
Family
ID=82641048
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210589436.3A Pending CN114863455A (en) | 2022-05-26 | 2022-05-26 | Method and device for extracting information |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114863455A (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190102375A1 (en) * | 2017-09-29 | 2019-04-04 | Tata Consultancy Services Limited | Automated cognitive processing of source agnostic data |
US20210124874A1 (en) * | 2019-10-25 | 2021-04-29 | Element Ai Inc. | Method and system for extracting information from a document |
CN113537221A (en) * | 2020-04-15 | 2021-10-22 | 阿里巴巴集团控股有限公司 | Image recognition method, device and equipment |
CN114037985A (en) * | 2021-11-04 | 2022-02-11 | 北京有竹居网络技术有限公司 | Information extraction method, device, equipment, medium and product |
CN114328679A (en) * | 2021-10-22 | 2022-04-12 | 腾讯科技(深圳)有限公司 | Image processing method, image processing apparatus, computer device, and storage medium |
CN114429542A (en) * | 2021-12-10 | 2022-05-03 | 北京航空航天大学 | Structured recognition method for medical laboratory test reports |
CN114494751A (en) * | 2022-02-16 | 2022-05-13 | 国泰新点软件股份有限公司 | License information identification method, device, equipment and medium |
- 2022-05-26: CN CN202210589436.3A patent/CN114863455A/en active Pending
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |