CN117197825A - Information extraction method, device, equipment and storage medium - Google Patents

Information extraction method, device, equipment and storage medium

Info

Publication number
CN117197825A
Authority
CN
China
Prior art keywords
character, electronic invoice, blocks, features, block
Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Application number
CN202311214003.0A
Other languages
Chinese (zh)
Inventor
乔梁 (Qiao Liang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Goldway Intelligent Transportation System Co Ltd
Original Assignee
Shanghai Goldway Intelligent Transportation System Co Ltd
Application filed by Shanghai Goldway Intelligent Transportation System Co Ltd filed Critical Shanghai Goldway Intelligent Transportation System Co Ltd
Priority to CN202311214003.0A priority Critical patent/CN117197825A/en
Publication of CN117197825A publication Critical patent/CN117197825A/en
Pending legal-status Critical Current


Abstract

The application provides an information extraction method, device, equipment and storage medium, relating to the field of computer technology and used for improving the accuracy of electronic invoice information extraction while reducing the extraction cost. The method includes: acquiring the characters in an electronic invoice and the position information of each character in the electronic invoice, and performing character space clustering on the characters according to the position information and a preset character expansion strategy to obtain a plurality of character blocks, where a character block contains at least one character and the distance between any two adjacent characters in the same character block is smaller than or equal to a preset distance; determining the positional relationship among the character blocks through a pre-trained graph network model; splicing the character blocks, using the positional relationship among them as the splicing order, to obtain the text content of the electronic invoice; and performing semantic extraction on the text content to obtain the information of the electronic invoice.

Description

Information extraction method, device, equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to an information extraction method, apparatus, device, and storage medium.
Background
With the increasing popularity of electronic invoices, electronic invoice format files are used ever more widely, and many business systems need to extract ticket information after receiving them. Quick and accurate extraction of invoice information is therefore a key technology for any business system that wants to make effective use of electronic invoice format files.
In the prior art, information is generally extracted from the electronic invoice by template matching. Specifically, the invoice is first divided into areas (for example, area A - ticket header, area B - buyer information, area C - taxable item details and totals, area D - total price and tax, area E - seller information, area F - ticket tail), and a positioning module then searches for the element information labels of the electronic invoice according to this area division. In practice, however, invoice formats change frequently, so the template matching approach requires the templates to be maintained regularly, which is time-consuming and laborious.
Disclosure of Invention
In view of these technical problems, the application provides an information extraction method, device, equipment and storage medium, which can improve the accuracy of electronic invoice information extraction and reduce the extraction cost.
In a first aspect, the present application provides an information extraction method, the method comprising: acquiring the characters in an electronic invoice and the position information of each character in the electronic invoice, and performing character space clustering on the characters according to the position information and a preset character expansion strategy to obtain a plurality of character blocks, where a character block comprises at least one character and the distance between any two adjacent characters in the same character block is smaller than or equal to a preset distance; determining the positional relationship among the character blocks through a pre-trained graph network model; splicing the plurality of character blocks, using the positional relationship among the character blocks as the splicing order, to obtain the text content of the electronic invoice; and performing semantic extraction on the text content to obtain the information of the electronic invoice.
In a possible implementation manner, determining the positional relationship between the character blocks through a pre-trained graph network model includes: extracting the position features, text features and visual features of each character block through the pre-trained graph network model, and predicting the positional relationship among the character blocks according to the position features, text features and visual features.
In one possible implementation, the graph network model includes a language vector encoder, a multi-layer perceptron, and an image feature extractor; extracting the position features, text features and visual features of each character block through the pre-trained graph network model includes: for any character block, inputting the characters in the character block into the language vector encoder to obtain the text features corresponding to the character block; inputting the original coordinates and the mapping coordinates of the character block on the electronic invoice into the multi-layer perceptron to obtain the position features corresponding to the character block, where the mapping coordinates are obtained by linear transformation of the original coordinates; and inputting the image corresponding to the character block on the electronic invoice into the image feature extractor to obtain local image features, which serve as the visual features corresponding to the character block.
In one possible implementation manner, predicting the positional relationship among the character blocks according to the position features, text features and visual features includes: taking each character block as a graph node, and generating the edge features corresponding to each graph node according to the position features, text features and visual features of the graph nodes; and determining the connection relationships between the graph nodes according to the edge features, taking these connection relationships as the positional relationships between the character blocks, where the similarity of the edge features of any two connected graph nodes is greater than or equal to a preset similarity.
In a possible implementation manner, performing character space clustering on the characters according to the position information and a preset character expansion strategy to obtain a plurality of character blocks includes: determining a target character, the target character being any character in the electronic invoice; taking the target character as a starting point, performing a transverse search on the electronic invoice according to a preset width and a longitudinal search according to a preset height to obtain a search result; and forming the characters corresponding to the search result into a character block, thereby obtaining a plurality of character blocks.
In one possible implementation manner, performing the transverse search on the electronic invoice according to the preset width includes: taking the target character as a starting point, transversely expanding by the preset width on the electronic invoice, and if a new character exists in the expanded area, continuing the transverse expansion with the new character as the starting point until no new character exists in the expanded area. Performing the longitudinal search on the electronic invoice according to the preset height includes: taking the target character as a starting point, longitudinally expanding by the preset height on the electronic invoice, and if a new character exists in the expanded area, continuing the longitudinal expansion with the new character as the starting point until no new character exists in the expanded area.
In the present application, the electronic device first extracts the characters in the electronic invoice and the position information of each character in the electronic invoice, so as to splice characters belonging to the same home region into character strings according to the position information. On an electronic invoice, different regions reflect different contents, so splicing characters by home region yields character strings whose semantics better match the real content of the invoice. The electronic device then splices the character strings based on the positional relationships among the home regions to obtain the complete text content of the electronic invoice, on which semantic extraction can be performed to obtain the information of the electronic invoice. Compared with extracting information from the electronic invoice by template matching as in the prior art, the method does not rely on a fixed template: based on the different positions of the characters on the electronic invoice, it splices the characters into text content and performs semantic extraction on that text. Therefore, however the format of the electronic invoice changes, the information extraction result is not affected, and no template needs to be maintained regularly, which reduces the cost of information extraction.
In a second aspect, the present application provides an information extraction apparatus, the apparatus including an acquisition unit and a processing unit. The acquisition unit is used for acquiring the characters in the electronic invoice and the position information of each character in the electronic invoice. The processing unit is used for performing character space clustering on the characters according to the position information and a preset character expansion strategy to obtain a plurality of character blocks, where a character block comprises at least one character and the distance between any two adjacent characters in the same character block is smaller than or equal to a preset distance. The processing unit is further used for determining the positional relationship among the character blocks through a pre-trained graph network model; for splicing the plurality of character blocks, using the positional relationship among the character blocks as the splicing order, to obtain the text content of the electronic invoice; and for performing semantic extraction on the text content to obtain the information of the electronic invoice.
In a possible implementation manner, the processing unit is specifically configured to: extract the position features, text features and visual features of each character block through a pre-trained graph network model, and predict the positional relationship among the character blocks according to the position features, text features and visual features.
In one possible implementation, the graph network model includes a language vector encoder, a multi-layer perceptron, and an image feature extractor; the processing unit is specifically configured to: for any character block, input the characters in the character block into the language vector encoder to obtain the text features corresponding to the character block; input the original coordinates and the mapping coordinates of the character block on the electronic invoice into the multi-layer perceptron to obtain the position features corresponding to the character block, where the mapping coordinates are obtained by linear transformation of the original coordinates; and input the image corresponding to the character block on the electronic invoice into the image feature extractor to obtain local image features, which serve as the visual features corresponding to the character block.
In a possible implementation manner, the processing unit is specifically configured to: take each character block as a graph node, and generate the edge features corresponding to each graph node according to the position features, text features and visual features of the graph nodes; and determine the connection relationships between the graph nodes according to the edge features, taking these connection relationships as the positional relationships between the character blocks, where the similarity of the edge features of any two connected graph nodes is greater than or equal to a preset similarity.
In a possible implementation manner, the processing unit is specifically configured to: determine a target character, the target character being any character in the electronic invoice; take the target character as a starting point, perform a transverse search on the electronic invoice according to a preset width and a longitudinal search according to a preset height to obtain a search result; and form the characters corresponding to the search result into a character block, thereby obtaining a plurality of character blocks.
In a possible implementation manner, the processing unit is specifically configured to: take the target character as a starting point, transversely expand by the preset width on the electronic invoice, and if a new character exists in the expanded area, continue the transverse expansion with the new character as the starting point until no new character exists in the expanded area; and to take the target character as a starting point, longitudinally expand by the preset height on the electronic invoice, and if a new character exists in the expanded area, continue the longitudinal expansion with the new character as the starting point until no new character exists in the expanded area.
In a third aspect, the present application provides an electronic device comprising: a processor and a memory; the memory stores instructions executable by the processor; the processor is configured to execute the instructions to cause the electronic device to implement the method of the first aspect as described above.
In a fourth aspect, the present application provides a computer program product which, when run in an electronic device, causes the electronic device to perform the method of the first aspect described above.
In a fifth aspect, the present application provides a computer readable storage medium comprising software instructions which, when executed in an electronic device, cause the electronic device to implement the method of the first aspect described above.
For the advantages of the second to fifth aspects described above, refer to the first aspect; they are not repeated here.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic structural diagram of an information extraction system according to an embodiment of the present application;
Fig. 2 is a schematic diagram of an electronic device according to an embodiment of the present application;
Fig. 3 is a flow chart of an information extraction method according to an embodiment of the present application;
Fig. 4 is a schematic diagram of an effect of layout division of an electronic invoice according to an embodiment of the present application;
Fig. 5 is a schematic diagram of performing character search on an electronic invoice according to an embodiment of the present application;
Fig. 6 is a schematic diagram of a positional relationship of home regions according to an embodiment of the present application;
Fig. 7 is a schematic diagram of an information extraction process according to an embodiment of the present application;
Fig. 8 is a schematic diagram of the composition of an information extraction device according to an embodiment of the present application.
Detailed Description
In order to enable a person skilled in the art to better understand the technical solutions of the present application, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented in sequences other than those illustrated or otherwise described herein. The implementations described in the following exemplary examples do not represent all implementations consistent with the application. Rather, they are merely examples of apparatus and methods consistent with aspects of the application as detailed in the accompanying claims.
In addition, in the description of the embodiments of the present application, unless otherwise indicated, "/" means or, for example, a/B may mean a or B. "and/or" herein is merely an association relationship describing an association object, and means that three relationships may exist, for example, a and/or B may mean: a exists alone, A and B exist together, and B exists alone. In addition, in the description of the embodiments of the present application, "plurality" means two or more than two.
Before explaining the embodiments of the present application in detail, some related terms and related techniques related to the embodiments of the present application are described.
With the increasing popularity of electronic invoices, electronic invoice format files are used ever more widely, and many business systems need to extract ticket information after receiving them. There are two typical application scenarios: first, in enterprise reimbursement systems, electronic invoice information needs to be collected to reduce the workload of manual entry; second, in enterprise electronic accounting files, invoice data is parsed for full-text retrieval. Quick and accurate extraction of invoice information is therefore a key technology for any business system that wants to make effective use of electronic invoice format files.
Electronic invoices are increasingly widely used. Domestic third-party electronic invoice service platforms generally have the capability of issuing electronic invoices in the Portable Document Format (PDF), so the vast majority of current electronic invoices are in PDF format. In addition, electronic invoices in the Open Fixed-layout Document (OFD) format have also become popular in recent years.
In the current electronic invoice information extraction method, the PDF/OFD electronic invoice format file is first converted into an image format by an imaging system, and information is then extracted from the electronic invoice by template matching: an information label is fixed in advance for each characteristic area, and the character information of the area is recognized by optical character recognition (OCR) technology.
For example, in the template matching manner, the invoice is divided into areas (for example, area A - ticket header, area B - buyer information, area C - taxable item details and totals, area D - total price and tax, area E - seller information, area F - ticket tail), and a positioning module then searches for the element information tags of the electronic invoice according to this area division.
In practical applications, however, invoice formats change frequently. Once the format of an electronic invoice changes, the original template fails. For example, suppose that in the original format area A is the ticket header and area B is the buyer information, while after the format change area A is the buyer information and area B is the ticket header. If the original template is still used for extraction, the extracted "ticket header" information is actually the buyer's information, so the extraction is inaccurate.
The template matching approach to information extraction therefore generalizes poorly, and the templates need to be updated and maintained regularly, which is time-consuming and laborious.
In view of this, the application provides an information extraction method, device, equipment and storage medium, which splice the characters on an electronic invoice into text content based on the different positions of the characters, so that semantic extraction can be performed on the text content without relying on a fixed template, and the information extraction result is not affected however the format of the electronic invoice changes.
The information extraction method provided by the embodiment of the application is described in detail below with reference to the accompanying drawings.
The information extraction method provided by the embodiment of the application can be applied to an electronic invoice information extraction system (hereinafter referred to as an information extraction system), and fig. 1 shows a schematic structure of the information extraction system. As shown in fig. 1, the information extraction system 10 includes an information extraction device 11 and a server 12. Wherein the information extraction device 11 is connected to a server 12. The information extraction device 11 and the server 12 may be connected by a wired manner or may be connected by a wireless manner, which is not limited in the embodiment of the present application.
The information extraction device 11 may acquire various electronic invoices from the server 12 and extract information of the electronic invoices. The specific information extraction process may refer to the information extraction method described in the following method embodiments, which is not described herein.
The information extraction device 11 may be an electronic device having a computing and processing function. For example, the information extraction device 11 may be a mobile phone, a tablet computer, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a personal digital assistant (PDA), a desktop computer, a cloud server, etc.; the embodiment of the present application does not limit the specific type of the electronic device.
The server 12 may be a single server or a server cluster including a plurality of servers. In some implementations, the server cluster may also be a distributed cluster. Optionally, the server may also be implemented on a cloud platform, which may include, for example, a private cloud, public cloud, hybrid cloud, community cloud, distributed cloud, inter-cloud, multi-cloud, or any combination thereof. The embodiments of the present application are not limited in this regard.
In fig. 1, the information extraction device 11 and the server 12 are described as separate devices, and alternatively, the information extraction device 11 and the server 12 may be combined into one device. For example, the server 12 or its corresponding function, and the information extraction device 11 or its corresponding function may be integrated in one device. The embodiments of the present application are not limited in this regard.
The execution subject of the information extraction method provided by the embodiment of the present application may be the information extraction apparatus 11 described above. As described above, the information extraction apparatus 11 may be an electronic device having a computing and processing function, such as a computer or a server. Alternatively, the information extraction apparatus 11 may be a processor (e.g., a central processing unit (CPU)) in the aforementioned electronic device; or an application (APP) with an information extraction function installed in the electronic device; or a functional module or the like with an information extraction function in the electronic device. The embodiments of the present application are not limited in this regard.
For simplicity of description, the information extraction apparatus 11 will be described below by taking an electronic device as an example.
Fig. 2 is a schematic diagram of an electronic device according to an embodiment of the present application. As shown in fig. 2, the electronic device may include: a processor 20, a memory 21, a communication line 22, a communication interface 23, and an input-output interface 24.
The processor 20, the memory 21, the communication interface 23, and the input/output interface 24 may be connected by a communication line 22.
A processor 20 for executing the instructions stored in the memory 21 to implement the information extraction method provided in the following embodiments of the present application. The processor 20 may be a CPU, a general-purpose processor, a network processor (NP), a digital signal processor (DSP), a microprocessor, a microcontroller (MCU), a programmable logic device (PLD), or any combination thereof. The processor 20 may also be any other device having a processing function, such as a circuit, a device, or a software module; the embodiments of the present application are not limited in this respect. In one example, the processor 20 may include one or more CPUs, such as CPU0 and CPU1 in fig. 2. As an alternative implementation, the electronic device may include multiple processors, for example, a processor 25 (illustrated in phantom in fig. 2) in addition to the processor 20.
A memory 21 for storing instructions, for example a computer program. Alternatively, the memory 21 may be a read-only memory (ROM) or another type of static storage device capable of storing static information and/or instructions, a random access memory (RAM) or another type of dynamic storage device capable of storing information and/or instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), a magnetic disk storage medium, or another magnetic storage device; the embodiments of the present application are not limited in this respect.
It should be noted that, the memory 21 may exist separately from the processor 20 or may be integrated with the processor 20. The memory 21 may be located within the electronic device or may be located outside the electronic device, which is not limited in this regard by the embodiment of the application.
Communication lines 22 for conveying information between components included in the electronic device.
A communication interface 23 for communicating with other devices (for example, the server 12 described above) or with other communication networks. The other communication network may be an Ethernet, a radio access network (RAN), a wireless local area network (WLAN), or the like. The communication interface 23 may be a module, a circuit, a transceiver, or any device capable of enabling communication.
An input-output interface 24 for enabling human-machine interaction between the user and the electronic device, such as action interactions or information interactions.
The input/output interface 24 may be a mouse, a keyboard, a display screen, or a touch-sensitive display screen, for example. The action interaction or information interaction between the user and the electronic equipment can be realized through a mouse, a keyboard, a display screen, a touch display screen or the like.
It should be noted that the structure shown in fig. 2 does not constitute a limitation of the electronic device; the electronic device may include more or fewer components than those shown in fig. 2, a combination of some components, or a different arrangement of components.
The information extraction method provided by the embodiment of the application is described below.
Fig. 3 is a flow chart of an information extraction method according to an embodiment of the present application. Alternatively, the method may be performed by an electronic device having the above-described hardware structure shown in fig. 2, and as shown in fig. 3, the method includes S301 to S304.
S301, acquiring characters in the electronic invoice and position information of the characters in the electronic invoice.
As one possible implementation, the electronic device obtains the electronic invoice file to be identified from the server and determines the format of the electronic invoice file. For different formats, the electronic device can adopt different recognition modes to extract the characters in the electronic invoice and the position information of each character in the electronic invoice.
The characters extracted from the electronic invoice by the electronic device may be text characters, numerals, punctuation marks, and the like, and the position information may be the coordinate values corresponding to each character in the electronic invoice.
Illustratively, in the case that the format of the electronic invoice is an image format, the electronic device may obtain the characters of the electronic invoice and the position information of each character by performing optical character recognition on the electronic invoice. In the case that the format of the electronic invoice is a non-image format, the electronic device can parse the electronic invoice through a preset parsing tool to obtain the characters of the electronic invoice and the position information of each character.
In practical applications, since the characters of a PDF file are embedded in the file itself, the electronic device can parse the file directly using certain dedicated parsing tools; the parsing process preserves all the character content and the position information of each character in the electronic invoice, and characters separated by spaces remain separated by spaces.
In some embodiments, if the electronic invoice is in a picture format, the electronic device may extract text content and coordinates by invoking an OCR engine, and map the text content and coordinates onto a solid white background for viewing by the user.
It will be appreciated that the process of using OCR introduces some recognition noise, while direct PDF parsing can avoid noise generation.
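As an illustration of the parsing branch above, the following is a minimal sketch assuming the open-source pdfplumber library (the embodiments do not name a specific parsing tool, so this choice and the file name are assumptions):

```python
import pdfplumber

def extract_chars(pdf_path: str):
    """Return per-character text and bounding boxes from the first page.

    pdfplumber is an assumed stand-in for the "preset parsing tool"
    mentioned above; any parser exposing per-character boxes would do.
    """
    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[0]
        # page.chars is a list of dicts, one per character, carrying its
        # text and bounding box (x0, top, x1, bottom) in page coordinates.
        return [
            {"char": c["text"],
             "box": (c["x0"], c["top"], c["x1"], c["bottom"])}
            for c in page.chars
        ]

chars = extract_chars("invoice.pdf")  # hypothetical file name
```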
S302, carrying out character space clustering on each character according to the position information and a preset character expansion strategy to obtain a plurality of character blocks.
Wherein, a character block comprises at least one character, and the distance between any two adjacent characters in the same character block is smaller than or equal to the preset distance.
As a possible implementation manner, the electronic device spatially clusters the characters based on the position information, so that the character strings in the same home region of the electronic invoice form character blocks, thereby obtaining a plurality of character blocks.
It should be noted that, the electronic device may implement character clustering in different manners, and a clustering manner that may be involved is described below.
Exemplary one: the electronic device may perform layout division on the electronic invoice to obtain a plurality of areas, where each area corresponds to one position range. Further, the electronic device matches the position information of each character against the position range of each area to determine the home region of each character, and concatenates the characters in the same home region into one character string (also called a character block), thereby obtaining a plurality of character strings.
In practical applications, the electronic device may perform layout division on the electronic invoice according to a pre-trained layout division model to obtain a plurality of areas. The layout division model is trained on a plurality of sample electronic invoices with pre-divided areas.
In some embodiments, the electronic device treats layout division of the electronic invoice as a target detection task, and the layout division model may be a target detection model (e.g., YOLO, Fast R-CNN, etc.).
As shown in fig. 4, the electronic device trains a model by constructing a certain amount of training data, and the model obtained by training directly predicts on the converted image, so that different detection frames can be obtained. Furthermore, the electronic device can combine the characters falling into the same detection frame into a character string according to the single character position obtained by PDF analysis.
To obtain a semantically coherent character string, the electronic device can splice the characters in the same detection frame in order from left to right and from top to bottom, as shown in the sketch below. For example, for the first detection frame in fig. 4, "third class B traditional Chinese medicine hospital", the electronic device first splices from left to right to obtain "third class B traditional Chinese medicine", and then continues with the next row to obtain "third class B traditional Chinese medicine hospital".
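A minimal sketch of this in-box stitching, assuming the detection frames have already been produced by the layout model and reusing the per-character records from the parsing sketch above (the box format, helper name, and row-quantization tolerance are illustrative):

```python
def stitch_box(chars, box, line_tol=3.0):
    """Splice the characters of one detection frame left-to-right, top-to-bottom.

    chars: records like {"char": str, "box": (x0, top, x1, bottom)};
    box: detection frame (x1, y1, x2, y2); line_tol absorbs per-line jitter.
    """
    x1, y1, x2, y2 = box
    kept = []
    for c in chars:
        cx = (c["box"][0] + c["box"][2]) / 2   # character center
        cy = (c["box"][1] + c["box"][3]) / 2
        if x1 <= cx <= x2 and y1 <= cy <= y2:  # character falls in the frame
            kept.append(c)
    # Quantize the top coordinate into rows, then sort left-to-right.
    kept.sort(key=lambda c: (round(c["box"][1] / line_tol), c["box"][0]))
    return "".join(c["char"] for c in kept)
```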
Exemplary two: for any character (called target character) in the electronic invoice, the electronic equipment takes the target character as a starting point, performs transverse search on the electronic invoice according to the preset width, and performs longitudinal search on the electronic invoice according to the preset height, so as to obtain a search result. Further, the electronic device composes the characters corresponding to the search result into a character string, and a plurality of character strings are obtained.
In practical applications, when performing the transverse search, the electronic device can take the target character as a starting point and transversely expand by the preset width on the electronic invoice; if a new character exists in the expanded area, it continues the transverse expansion with the new character as the starting point until no new character exists in the expanded area. Similarly, when performing the longitudinal search, the electronic device can take the target character as a starting point and longitudinally expand by the preset height; if a new character exists in the expanded area, it continues the longitudinal expansion with the new character as the starting point until no new character exists in the expanded area.
As shown in fig. 5, the electronic device may select a character with the smallest abscissa as a starting point, expand to the right and below, and expand a certain number of pixels (typically set to the width or height of a character). If the expanded region has a character falling therein, the character is absorbed into the character block. The previous expansion is then continued with the newly absorbed character as a reference until no new character can be found to be absorbed. Further, the electronic device selects the next upper left character to perform the same operation until all the characters in the figure are searched.
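A minimal sketch of this greedy expansion search, assuming the character records from the parsing sketch above; the preset width/height dx and dy (roughly one character's size) are illustrative values:

```python
def cluster_chars(chars, dx=12.0, dy=12.0):
    """Character space clustering by repeated rightward/downward expansion."""
    remaining = list(chars)
    blocks = []
    while remaining:
        # Seed a new block with the top-left-most unassigned character.
        seed = min(remaining, key=lambda c: (c["box"][1], c["box"][0]))
        remaining.remove(seed)
        block = [seed]
        grew = True
        while grew:                # keep expanding until nothing is absorbed
            grew = False
            for c in list(remaining):
                for b in block:
                    right = (0 <= c["box"][0] - b["box"][2] <= dx
                             and abs(c["box"][1] - b["box"][1]) <= dy)
                    below = (0 <= c["box"][1] - b["box"][3] <= dy
                             and abs(c["box"][0] - b["box"][0]) <= dx)
                    if right or below:   # c falls inside the expanded area
                        block.append(c)
                        remaining.remove(c)
                        grew = True
                        break
        blocks.append(block)
    return blocks
```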
S303, determining the position relation among the character blocks through a pre-trained graph network model, and splicing the plurality of character blocks by taking the position relation among the character blocks as a splicing sequence to obtain the text content of the electronic invoice.
As a possible implementation manner, the electronic device extracts the position feature, the text feature and the visual feature of each character block through a pre-trained graph network model, and predicts the position relationship among each character block according to the position feature, the text feature and the visual feature. Furthermore, the electronic equipment uses the position relation among the character blocks as a splicing sequence to splice the plurality of character blocks, so as to obtain the text content of the electronic invoice.
The graph network model comprises a language vector encoder, a multi-layer perceptron and an image feature extractor.
In practical applications, for any character block, the electronic device may input the characters in the character block into the language vector encoder to obtain the text features corresponding to the character block. The embodiment of the application uses a language vector encoder to encode the text sequence: the electronic device inputs the text into a pre-trained language vector encoder (such as BERT), which produces an embedding vector for each character or word, and the electronic device averages the embedding vectors of all words in the text string as the feature vector of the whole string. It will be appreciated that text features encoded by the language vector encoder carry more semantic information.
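A sketch of this text-feature step, assuming the Hugging Face transformers library and the bert-base-chinese checkpoint (the embodiments only specify "a pre-trained language vector encoder such as BERT", so the checkpoint is an assumption):

```python
import torch
from transformers import AutoTokenizer, AutoModel

# The checkpoint is an assumption; any BERT-style encoder would do.
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
encoder = AutoModel.from_pretrained("bert-base-chinese")

def text_feature(block_text: str) -> torch.Tensor:
    """Average the per-token embeddings into one block-level feature vector."""
    inputs = tokenizer(block_text, return_tensors="pt")
    with torch.no_grad():
        outputs = encoder(**inputs)
    # last_hidden_state: (1, seq_len, hidden); mean over tokens -> (hidden,)
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)
```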
The electronic device can input the original coordinates and the mapping coordinates of the character block on the electronic invoice into the multi-layer perceptron to obtain the position features corresponding to the character block. The original coordinates are the fixed coordinates of the character block's position on the electronic invoice, such as its upper-left and lower-right corner points (x1, y1, x2, y2). The mapping coordinates are obtained from the original coordinates by a linear transformation. The embodiment of the application adopts a learnable dynamic position coding mechanism to encode the position features.
Illustratively, the original coordinates B = (x1, y1, x2, y2) of the character block are first mapped through a learnable linear operation:
B' = atan(Linear(B)) + B
where Linear(B) = WB + b is a linear transformation whose weight W and bias b are both learnable parameters. The electronic device then converts the 4-dimensional original coordinates and mapping coordinates into 256-dimensional features through an encoding layer composed of a multi-layer perceptron (MLP):
F = MLP(B)
F' = MLP(B')
Finally, the electronic device weights the two types of position codes to obtain the final position feature:
P = αF + βF'
where the weighting coefficients α and β are learnable parameters.
It can be appreciated that the mapping coordinates of character blocks at similar positions can be dynamically scaled or adjusted during training, giving the model a certain ability to pull semantically close texts together in the feature space. For example, some texts are very close in space but in fact have little semantic relationship (such as texts in the same line but in different columns); through dynamic learning, the above position coding can pull such features apart in space.
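A minimal PyTorch sketch of this dynamic position coding; mapping both F and F' to the same 256-dimensional width, so that the weighted sum is well-defined, is our assumption:

```python
import torch
import torch.nn as nn

class DynamicPositionEncoder(nn.Module):
    """Learnable dynamic position coding: P = alpha*MLP(B) + beta*MLP(B')."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.linear = nn.Linear(4, 4)    # for B' = atan(Linear(B)) + B
        self.mlp = nn.Sequential(nn.Linear(4, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.mlp_mapped = nn.Sequential(nn.Linear(4, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.alpha = nn.Parameter(torch.tensor(1.0))   # learnable weights
        self.beta = nn.Parameter(torch.tensor(1.0))

    def forward(self, boxes: torch.Tensor) -> torch.Tensor:
        # boxes: (N, 4) original coordinates (x1, y1, x2, y2) per block
        mapped = torch.atan(self.linear(boxes)) + boxes   # B'
        f = self.mlp(boxes)                               # F  = MLP(B)
        f_mapped = self.mlp_mapped(mapped)                # F' = MLP(B')
        return self.alpha * f + self.beta * f_mapped      # P
```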
The electronic device can input the image corresponding to the character block on the electronic invoice into the image feature extractor to obtain local image features, which serve as the visual features corresponding to the character block.
In some embodiments, the electronic device may take each character block as a graph node and generate the edge features corresponding to each graph node according to the position features, text features, and visual features of the graph nodes. Further, the electronic device determines the connection relationships between the graph nodes according to the edge features and takes these connection relationships as the positional relationships between the character blocks, where the similarity of the edge features of any two connected graph nodes is greater than or equal to a preset similarity.
It should be noted that each edge in the graph network model is formed by two nodes. If the coordinate frames of the two nodes are B1 = [x1_1, y1_1, x1_2, y1_2] and B2 = [x2_1, y2_1, x2_2, y2_2], the coordinates of the edge may be defined as B1_B2 = [x1_1 - x2_1, y1_1 - y2_1, x1_2 - x2_2, y1_2 - y2_2]. The learnable dynamic position coding operation maps these into a new feature vector, namely the edge position feature of each edge. Assuming that the electronic device obtains the position features F = {f1, f2, … fn} of the N nodes and the N×N edge position features E = {e_11, e_12, …, e_nn}, the electronic device may update the node features through a position-aware self-attention mechanism, so that each node is associated with all other nodes via spatial relative position information.
furthermore, the electronic device can splice the node characteristics updated by every two nodes to obtain a final edge characteristic set, and classify the edge characteristics according to the types of the edge characteristics.
For example, the graph network model may take each character block as a node and predict the connections of adjacent edges using the three features described above. If one character block is the preceding node of another character block, the graph network model predicts that a connection relationship exists between the two character blocks, with a prediction result of 1; otherwise the prediction result is 0. After prediction, the final model forms a complete path connecting all character blocks in series, and the electronic device can splice the character blocks in this order to obtain the text content of the electronic invoice.
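A minimal PyTorch sketch of this edge prediction; the node features are assumed to be the fused, attention-updated vectors described above, and the two-layer classification head is illustrative:

```python
import torch
import torch.nn as nn

class EdgeClassifier(nn.Module):
    """Score whether node i is the preceding node of node j (1) or not (0)."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 2))

    def forward(self, nodes: torch.Tensor) -> torch.Tensor:
        # nodes: (N, dim). Build every ordered pair by concatenation.
        n = nodes.size(0)
        pairs = torch.cat(
            [nodes.unsqueeze(1).expand(n, n, -1),   # feature of node i
             nodes.unsqueeze(0).expand(n, n, -1)],  # feature of node j
            dim=-1)                                 # (N, N, 2*dim)
        return self.head(pairs)   # (N, N, 2) logits per candidate edge
```

The edges predicted as connected then chain the character blocks into the splicing path described above.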
As another possible implementation, the electronic device first determines the row information of each character block, where the row information reflects the row level of each character block and its position within the row. Further, the electronic device splices the character blocks at the same row level in position order to obtain a plurality of row splicing results, and then splices these row results by row level to obtain the text content of the electronic invoice.
In some embodiments, to obtain the row information of each character block, for any two character blocks (e.g., a first character block and a second character block), the electronic device determines a first ordinate of the first character block in the electronic invoice and a second ordinate of the second character block in the electronic invoice. If the deviation between the first ordinate and the second ordinate is smaller than or equal to a preset threshold, the electronic device determines that the two character blocks are at the same row level, and determines their positions in the row from their transverse positional relationship, thereby obtaining the row information of each character block.
It should be noted that the deviation between the first ordinate and the second ordinate may be reflected by the height overlap range of the first character block and the second character block, that is, the ratio of the length over which the two character blocks overlap in the longitudinal direction to the smaller of their two heights.
As shown in fig. 6, assume the ordinate ranges of two character blocks (i.e., the rectangular boxes in the figure) are (y1, y2) and (y3, y4), respectively. Several spatial relationships are possible, including relationship 1, relationship 2, and relationship 3 in fig. 6. The height overlap range can be calculated as overlap = max(0, min(y2, y4) - max(y1, y3)) / min(y2 - y1, y4 - y3).
the electronic device may determine two character blocks in the relationship 1 as the same row, determine two character blocks in the relationship 2 as the same row, and determine two character blocks in the relationship 3 as different rows.
Illustratively, the electronic device takes the character block with the smallest ordinate as a reference and checks all the remaining character blocks in turn. If a character block's height overlap range with the current character block is greater than a certain threshold (e.g., 50%), it is considered to belong to the same row and is merged into the row set. The electronic device continues until the row set can no longer be expanded, at which point all character blocks of one row have been found. It then starts the next round: finding the character block with the smallest ordinate among the remaining blocks and searching for the character blocks in the same row. Within each row, the character blocks are ordered by abscissa from small to large. Finally, the electronic device splices the rows in turn by row level, thereby splicing the character strings into the text content of the electronic invoice.
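A minimal sketch of this row-level stitching; for simplicity each candidate block is compared against the row's seed block only, whereas the procedure above expands the row set iteratively:

```python
def v_overlap(b1, b2):
    """Height overlap range: overlapped length over the smaller height."""
    inter = max(0.0, min(b1[3], b2[3]) - max(b1[1], b2[1]))
    return inter / min(b1[3] - b1[1], b2[3] - b2[1])

def stitch_rows(blocks, thresh=0.5):
    """blocks: list of (text, (x1, y1, x2, y2)). Group into rows, then splice."""
    remaining = sorted(blocks, key=lambda b: b[1][1])   # by top ordinate
    lines = []
    while remaining:
        base = remaining.pop(0)             # smallest ordinate among the rest
        row = [base]
        for b in list(remaining):
            if v_overlap(base[1], b[1]) > thresh:   # same row level
                row.append(b)
                remaining.remove(b)
        row.sort(key=lambda b: b[1][0])     # left to right within the row
        lines.append("".join(t for t, _ in row))
    return "\n".join(lines)
```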
S304, carrying out semantic extraction on the text content to obtain the information of the electronic invoice.
As a possible implementation manner, the electronic device performs semantic extraction on the text content through named entity recognition (NER) technology to obtain the information of the electronic invoice.
It should be noted that the information of the electronic invoice to be extracted may be set according to actual requirements; for example, the invoice serial number, account information, and payment amount may be extracted.
For example, the electronic device may use a pre-trained language model (e.g., Bidirectional Encoder Representations from Transformers (BERT)) to semantically extract the text content. The input of the model can be the text content of the electronic invoice obtained by splicing the character strings, and the output is the category of each character in the string. For the language model, all characters other than the preset key characters can be regarded as a background class. The language model then combines the characters of each category to obtain the final structured information.
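A sketch of this extraction step using the transformers token-classification pipeline; the checkpoint below is a real English NER demo model standing in for an invoice-specific one, which the embodiments assume but do not name:

```python
from transformers import pipeline

ner = pipeline("token-classification",
               model="dslim/bert-base-NER",     # demo stand-in, not invoice-tuned
               aggregation_strategy="simple")   # merge sub-tokens into entities

def extract_fields(invoice_text: str) -> dict:
    """Group recognized entities into structured invoice information."""
    fields = {}
    for ent in ner(invoice_text):
        # ent: {"entity_group": ..., "word": ..., "score": ..., ...}
        fields.setdefault(ent["entity_group"], []).append(ent["word"])
    return fields
```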
In some embodiments, the electronic device may apply suitable data augmentation strategies to enrich the model's training samples and thereby improve the accuracy of information extraction by the language model, as sketched below.
For example, during training, the electronic device may randomly exchange the positions of two character strings with a preset probability (for example, 50%), so that the model sees as many different character block combinations as possible and can simulate the various character splicing orders that may occur.
For another example, the electronic device may also randomly insert irrelevant text between strings with a preset probability, to simulate background text noise and improve the noise immunity of the language model.
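A minimal sketch of the two augmentation strategies above; the probabilities and filler strings are illustrative:

```python
import random

def augment(blocks, swap_p=0.5, noise_p=0.1,
            noise_pool=("备注", "第二联", "***")):  # illustrative filler text
    """Randomly swap two block strings and inject irrelevant background text."""
    blocks = list(blocks)
    if len(blocks) >= 2 and random.random() < swap_p:
        i, j = random.sample(range(len(blocks)), 2)
        blocks[i], blocks[j] = blocks[j], blocks[i]   # simulate order changes
    out = []
    for b in blocks:
        out.append(b)
        if random.random() < noise_p:
            out.append(random.choice(noise_pool))     # simulated noise
    return out
```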
In one design, as shown in fig. 7, the electronic device may implement information extraction for the electronic invoice through a parsing module, a text splicing module, and a semantic extraction module. Specifically, the electronic device can parse the PDF/OFD-format electronic invoice through the parsing module, extracting the characters in the electronic invoice and their position information, to obtain the result shown on the right of the parsing module. The text splicing module first performs character-space aggregation, i.e., aggregates characters belonging to the same block (or close in distance) into one text block. After obtaining the independent text blocks, the text splicing module splices the different text blocks in order to obtain the complete text content. Further, the semantic extraction module performs semantic recognition on the text content, extracts the key information in the electronic invoice, and outputs it.
In the present application, the electronic device first extracts the characters in the electronic invoice and the position information of each character in the electronic invoice, so as to splice characters belonging to the same home region into character strings according to the position information. On an electronic invoice, different regions reflect different contents, so splicing characters by home region yields character strings whose semantics better match the real content of the invoice. The electronic device then splices the character strings based on the positional relationships among the home regions to obtain the complete text content of the electronic invoice, on which semantic extraction can be performed to obtain the information of the electronic invoice. Compared with extracting information from the electronic invoice by template matching as in the prior art, the method does not rely on a fixed template: based on the different positions of the characters on the electronic invoice, it splices the characters into text content and performs semantic extraction on that text. Therefore, however the format of the electronic invoice changes, the information extraction result is not affected, and no template needs to be maintained regularly, which reduces the cost of information extraction.
The foregoing description of the solution provided by the embodiments of the present application has mainly been presented in terms of the method. To achieve the above functions, corresponding hardware structures and/or software modules are included to perform the respective functions. Those skilled in the art will readily appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as hardware or as combinations of hardware and computer software. Whether a function is implemented as hardware or as computer-software-driven hardware depends upon the particular application and the design constraints imposed on the solution. Skilled artisans may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of the present application.
In an exemplary embodiment, the embodiment of the application further provides an information extraction device. Fig. 8 is a schematic diagram of the composition of an information extraction device according to an embodiment of the present application. As shown in fig. 8, the information extraction apparatus includes: an acquisition unit 201 and a processing unit 202.
The acquisition unit 201 is configured to acquire the characters in the electronic invoice and the position information of each character in the electronic invoice. The processing unit 202 is configured to perform character space clustering on the characters according to the position information and a preset character expansion strategy to obtain a plurality of character blocks, where a character block comprises at least one character and the distance between any two adjacent characters in the same character block is smaller than or equal to a preset distance. The processing unit 202 is further configured to determine the positional relationship among the character blocks through a pre-trained graph network model; to splice the plurality of character blocks, using the positional relationship among the character blocks as the splicing order, to obtain the text content of the electronic invoice; and to perform semantic extraction on the text content to obtain the information of the electronic invoice.
In a possible implementation manner, the processing unit 202 is specifically configured to: extract the position features, text features and visual features of each character block through a pre-trained graph network model, and predict the positional relationship among the character blocks according to the position features, text features and visual features.
In one possible implementation, the graph network model includes a language vector encoder, a multi-layer perceptron, and an image feature extractor; the processing unit 202 is specifically configured to: for any character block, input the characters in the character block into the language vector encoder to obtain the text features corresponding to the character block; input the original coordinates and the mapping coordinates of the character block on the electronic invoice into the multi-layer perceptron to obtain the position features corresponding to the character block, where the mapping coordinates are obtained by linear transformation of the original coordinates; and input the image corresponding to the character block on the electronic invoice into the image feature extractor to obtain local image features, which serve as the visual features corresponding to the character block.
In a possible implementation manner, the processing unit 202 is specifically configured to: take each character block as a graph node, and generate the edge features corresponding to each graph node according to the position features, text features and visual features of the graph nodes; and determine the connection relationships between the graph nodes according to the edge features, taking these connection relationships as the positional relationships between the character blocks, where the similarity of the edge features of any two connected graph nodes is greater than or equal to a preset similarity.
In a possible implementation manner, the processing unit 202 is specifically configured to: determine a target character, where the target character is any character in the electronic invoice; taking the target character as a starting point, perform a transverse search on the electronic invoice by a preset width and a longitudinal search on the electronic invoice by a preset height to obtain a search result; and form the characters corresponding to the search result into a character block, thereby obtaining a plurality of character blocks.
In a possible implementation manner, the processing unit 202 is specifically configured to: taking the target character as a starting point, expand transversely by the preset width on the electronic invoice, and if a new character exists in the expanded area, continue expanding transversely with the new character as the starting point until no new character appears in the expanded area; and, taking the target character as a starting point, expand longitudinally by the preset height on the electronic invoice, and if a new character exists in the expanded area, continue expanding longitudinally with the new character as the starting point until no new character appears in the expanded area.
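The expansion search can be sketched as follows, assuming one (x, y) anchor point per character; preset_w and preset_h play the role of the preset width and height, and the tolerance that decides whether two characters share a row or column is an illustrative choice.

```python
from dataclasses import dataclass

@dataclass
class Char:
    text: str
    x: float
    y: float

def expand_from(seed: int, chars, preset_w: float, preset_h: float):
    """Grow one character block from a target character: expand laterally
    by preset_w and longitudinally by preset_h; each newly covered
    character becomes a fresh starting point until no new character is
    found in the expanded area."""
    found, frontier = {seed}, [seed]
    while frontier:
        i = frontier.pop()
        for j, c in enumerate(chars):
            if j in found:
                continue
            dx, dy = abs(c.x - chars[i].x), abs(c.y - chars[i].y)
            same_row, same_col = dy < 1.0, dx < 1.0  # tolerance assumption
            if (dx <= preset_w and same_row) or (dy <= preset_h and same_col):
                found.add(j)
                frontier.append(j)
    return sorted(found)

chars = [Char("发", 0, 0), Char("票", 12, 0), Char("号", 24, 0), Char("码", 24, 15)]
print(expand_from(0, chars, preset_w=15, preset_h=20))  # -> [0, 1, 2, 3]
```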
It should be noted that the division of the modules in fig. 8 is schematic and is merely a division by logical function; other division manners may be used in practice. For example, two or more functions may be integrated in one processing module. The integrated module may be implemented in the form of hardware or in the form of a software functional unit.
In an exemplary embodiment, the application also provides a computer readable storage medium comprising software instructions which, when run on an electronic device, cause the electronic device to perform any of the methods provided by the above embodiments.
In an exemplary embodiment, the application also provides a computer program product comprising computer-executable instructions which, when run on an electronic device, cause the electronic device to perform any of the methods provided by the above embodiments.
In the above embodiments, the implementation may be realized in whole or in part by software, hardware, firmware, or any combination thereof. When implemented by a software program, the implementation may take the form of a computer program product, in whole or in part. The computer program product includes one or more computer-executable instructions. When the computer-executable instructions are loaded and executed on a computer, the processes or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer-executable instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example from one website, computer, server, or data center to another by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center integrating one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, or a magnetic tape), an optical medium (e.g., a DVD), a solid state disk (SSD), or the like.
Although the application is described herein in connection with various embodiments, those skilled in the art can, in practicing the claimed application, understand and effect other variations of the disclosed embodiments from a study of the drawings, the disclosure, and the appended claims. In the claims, the word "comprising" does not exclude other elements or steps, and the indefinite article "a" or "an" does not exclude a plurality. A single processor or other unit may fulfill the functions of several items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.
Although the application has been described in connection with specific features and embodiments thereof, it is evident that various modifications and combinations can be made without departing from the spirit and scope of the application. Accordingly, the specification and drawings are merely exemplary illustrations of the application as defined by the appended claims, and are considered to cover any and all modifications, variations, combinations, or equivalents that fall within the scope of the application.
The foregoing is merely a specific embodiment of the present application, and the protection scope of the present application is not limited thereto; any change or substitution within the technical scope disclosed by the present application shall be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. An information extraction method, characterized in that the method comprises:
acquiring characters in an electronic invoice and position information of each character in the electronic invoice, and performing character spatial clustering on each character according to the position information and a preset character expansion policy to obtain a plurality of character blocks; wherein each character block comprises at least one character, and the distance between any two adjacent characters in the same character block is less than or equal to a preset distance;
determining the positional relationship between the character blocks through a pre-trained graph network model;
splicing the plurality of character blocks, using the positional relationship between the character blocks as the splicing order, to obtain the text content of the electronic invoice;
and carrying out semantic extraction on the text content to obtain the information of the electronic invoice.
2. The method according to claim 1, wherein determining the positional relationship between the character blocks through the pre-trained graph network model comprises:
extracting position features, text features and visual features of each character block through the pre-trained graph network model, and predicting the positional relationship between the character blocks according to the position features, the text features and the visual features.
3. The method of claim 2, wherein the graph network model comprises a language vector encoder, a multi-layer perceptron, and an image feature extractor; and the extracting the position features, the text features and the visual features of each character block through the pre-trained graph network model comprises:
for any character block, inputting the characters in the character block into the language vector encoder to obtain the text features corresponding to the character block;
inputting the original coordinates and the mapping coordinates of the character block on the electronic invoice into the multi-layer perceptron to obtain the position features corresponding to the character block, wherein the mapping coordinates are coordinates obtained by linear transformation of the original coordinates;
and inputting the image corresponding to the character block on the electronic invoice into the image feature extractor to obtain local image features, and taking the local image features as the visual features corresponding to the character block.
4. The method according to claim 2 or 3, wherein the predicting the positional relationship between the character blocks according to the position features, the text features and the visual features comprises:
taking each character block as a graph node, and generating edge features corresponding to each graph node according to the position features, text features and visual features of the graph node;
determining the connection relationship between the graph nodes according to the edge features, and taking the connection relationship as the positional relationship between the character blocks; wherein the similarity of the edge features of any two adjacent graph nodes is greater than or equal to a preset similarity.
5. The method of claim 1, wherein the performing character spatial clustering on each character according to the position information and the preset character expansion policy to obtain a plurality of character blocks comprises:
determining a target character; the target character is any character in the electronic invoice;
taking the target character as a starting point, performing transverse search on the electronic invoice according to a preset width, and performing longitudinal search on the electronic invoice according to a preset height to obtain a search result;
and forming the characters corresponding to the search result into a character block, thereby obtaining a plurality of character blocks.
6. The method of claim 5, wherein the performing a transverse search on the electronic invoice according to the preset width comprises:
expanding transversely by the preset width on the electronic invoice, taking the target character as a starting point, and if a new character exists in the expanded area, continuing to expand transversely with the new character as the starting point until no new character exists in the expanded area;
and the performing a longitudinal search on the electronic invoice according to the preset height comprises:
expanding longitudinally by the preset height on the electronic invoice, taking the target character as a starting point, and if a new character exists in the expanded area, continuing to expand longitudinally with the new character as the starting point until no new character exists in the expanded area.
7. An information extraction apparatus, characterized in that the apparatus comprises an acquisition unit and a processing unit;
the acquiring unit is used for acquiring characters in an electronic invoice and position information of each character in the electronic invoice;
the processing unit is used for performing character spatial clustering on each character according to the position information and a preset character expansion policy to obtain a plurality of character blocks; wherein each character block comprises at least one character, and the distance between any two adjacent characters in the same character block is less than or equal to a preset distance;
the processing unit is further used for determining the positional relationship between the character blocks through a pre-trained graph network model;
the processing unit is further used for splicing the plurality of character blocks, using the positional relationship between the character blocks as the splicing order, to obtain the text content of the electronic invoice;
and the processing unit is further used for performing semantic extraction on the text content to obtain the information of the electronic invoice.
8. The apparatus according to claim 7, wherein the processing unit is specifically configured to:
extracting position features, text features and visual features of each character block through a pre-trained graph network model, and predicting the positional relationship between the character blocks according to the position features, the text features and the visual features;
the graph network model comprises a language vector encoder, a multi-layer perceptron and an image feature extractor; the processing unit is specifically configured to:
for any character block, inputting the characters in the character block into the language vector encoder to obtain the text features corresponding to the character block;
inputting the original coordinates and the mapping coordinates of the character block on the electronic invoice into the multi-layer perceptron to obtain the position features corresponding to the character block, wherein the mapping coordinates are coordinates obtained by linear transformation of the original coordinates;
inputting the image corresponding to the character block on the electronic invoice into the image feature extractor to obtain local image features, and taking the local image features as the visual features corresponding to the character block;
and the processing unit is specifically configured to:
taking each character block as a graph node, and generating edge features corresponding to each graph node according to the position features, text features and visual features of the graph node;
determining the connection relationship between the graph nodes according to the edge features, and taking the connection relationship as the positional relationship between the character blocks; wherein the similarity of the edge features of any two adjacent graph nodes is greater than or equal to a preset similarity;
and the processing unit is specifically configured to:
determining a target character; the target character is any character in the electronic invoice;
taking the target character as a starting point, performing transverse search on the electronic invoice according to a preset width, and performing longitudinal search on the electronic invoice according to a preset height to obtain a search result;
forming a character block from characters corresponding to the search result to obtain a plurality of character blocks;
and the processing unit is specifically configured to:
expanding transversely by the preset width on the electronic invoice, taking the target character as a starting point, and if a new character exists in the expanded area, continuing to expand transversely with the new character as the starting point until no new character exists in the expanded area;
the processing unit is specifically configured to:
and expanding longitudinally by the preset height on the electronic invoice, taking the target character as a starting point, and if a new character exists in the expanded area, continuing to expand longitudinally with the new character as the starting point until no new character exists in the expanded area.
9. An electronic device, comprising: a processor and a memory;
the memory stores instructions executable by the processor;
the processor is configured to, when executing the instructions, cause the electronic device to implement the method of any one of claims 1-6.
10. A computer-readable storage medium, comprising software instructions which, when executed on an electronic device, cause the electronic device to implement the method according to any one of claims 1-6.
CN202311214003.0A 2023-09-19 2023-09-19 Information extraction method, device, equipment and storage medium Pending CN117197825A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311214003.0A 2023-09-19 2023-09-19 Information extraction method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117197825A 2023-12-08

Family ID: 89003270

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination