CN114495133A - Electronic invoice information extraction method and device, electronic equipment and storage medium - Google Patents

Electronic invoice information extraction method and device, electronic equipment and storage medium

Info

Publication number
CN114495133A
CN114495133A (application CN202210067279.XA)
Authority
CN
China
Prior art keywords
cell
information
cells
electronic invoice
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210067279.XA
Other languages
Chinese (zh)
Inventor
刘东煜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Life Insurance Company of China Ltd
Original Assignee
Ping An Life Insurance Company of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Life Insurance Company of China Ltd filed Critical Ping An Life Insurance Company of China Ltd
Priority to CN202210067279.XA priority Critical patent/CN114495133A/en
Publication of CN114495133A publication Critical patent/CN114495133A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of artificial intelligence and provides an electronic invoice information extraction method and device, an electronic device, and a storage medium. The method comprises: recognizing an electronic invoice image, and performing semantic segmentation on the recognition result to obtain cell information of a plurality of cells; performing label classification based on the cell information to obtain a label for each cell; identifying the labels of the cells and determining the mapping relation among the cells; normalizing the first target text information of each cell in the electronic invoice image to obtain second target text information; and extracting the electronic invoice information from the second target text information according to the mapping relation among the cells. By performing semantic segmentation on the recognition result and splitting two overlapping bounding boxes before extraction, the invention improves the accuracy of the extracted electronic invoice information.

Description

Electronic invoice information extraction method and device, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to an electronic invoice information extraction method and device, electronic equipment and a storage medium.
Background
Charging items in a medical electronic invoice are generally presented in table form, and the table layout is relatively fixed. To extract information from an electronic invoice, the prior art usually performs structured processing on a medical charging bill in picture form and then acquires and manages the bill information.
However, tables in a medical electronic invoice may overlap. If information is extracted directly after structuring the medical charging bill, there is no guarantee that each extracted charging item matches its charging amount, so the extracted electronic invoice information is disordered and of low accuracy.
Therefore, a method for accurately extracting electronic invoice information is needed.
Disclosure of Invention
In view of the above, it is necessary to provide an electronic invoice information extraction method and device, an electronic device, and a storage medium that perform semantic segmentation on the recognition result, split two overlapping bounding boxes, and then extract the electronic invoice information from each, thereby improving the accuracy of the extracted electronic invoice information.
The first aspect of the invention provides an electronic invoice information extraction method, which comprises the following steps:
receiving an electronic invoice image of a text to be extracted, and identifying the electronic invoice image to obtain an identification result;
performing semantic segmentation on the recognition result to obtain cell information of a plurality of cells of the electronic invoice image;
performing label classification based on the cell information of the plurality of cells to obtain a label of each cell;
identifying a plurality of labels of the plurality of cells, and determining the mapping relation among the cells;
normalizing the first target text information of each cell in the electronic invoice image to obtain second target text information of each cell in the electronic invoice image;
and according to the mapping relation among the cells, extracting the electronic invoice information from a plurality of second target text information of a plurality of cells of the electronic invoice image.
Optionally, the semantic segmentation of the recognition result to obtain the cell information of the multiple cells of the electronic invoice image includes:
carrying out sequence labeling on the first text information of each boundary box in the recognition result;
inputting the first text information with the sequence mark into a sequence mark model trained in advance for recognition to obtain second text information of each bounding box;
and identifying the label in the second text information of each boundary box, and performing semantic segmentation on each corresponding boundary box to obtain the cell information of a plurality of cells of the electronic invoice image.
Optionally, the identifying the label in the second text information of each bounding box and performing semantic segmentation on each corresponding bounding box to obtain the cell information of the multiple cells of the electronic invoice image includes:
when a plurality of labels in the second text message of any one of the plurality of bounding boxes are identified, segmenting the any one bounding box according to the labels to obtain a plurality of cells;
performing coordinate conversion on the first coordinate information of any one of the bounding boxes to obtain second coordinate information of each of the plurality of cells;
determining first target text information of each cell according to the second coordinate information of each cell in the plurality of cells;
and updating the identification result according to the second coordinate information and the first target text information of each unit cell in the plurality of unit cells to obtain the unit cell information of the plurality of unit cells of the electronic invoice image.
Optionally, the performing coordinate conversion on the first coordinate information of any one of the bounding boxes to obtain the second coordinate information of each of the multiple cells includes:
identifying character types to which all characters in the second text information in any one bounding box belong;
determining a standard character of each character according to the character type of each character;
converting all characters in the second text information of any one bounding box into standard characters according to the character types to which all the characters belong and the corresponding standard characters of each character, calculating the sum of the number of the standard characters, and determining the sum as the sum of the number of the standard characters of the second text information of any one bounding box;
and calculating the coordinate information of each character in any boundary box by adopting a preset formula according to the sum of the first coordinate information of any boundary box and the number of the standard characters of the second text information of any boundary box, and calculating the second coordinate information of each cell in the multiple cells according to the coordinate information of each character.
Optionally, the performing label classification based on the cell information of the multiple cells to obtain a label of each cell includes:
inputting the cell information into a label classification model trained in advance to obtain a label of each cell, wherein the training process of the label classification model comprises the following steps:
obtaining historical cell information;
extracting historical text information and coordinate information of each cell from the historical cell information;
determining the basic feature of each cell according to the historical text information of each cell, and determining the column alignment area feature, the relative position feature and the row adjacent cell feature of each cell according to the coordinate information of each cell;
associating the basic feature, the column alignment area feature, the relative position feature and the line adjacent cell feature of each cell to obtain a target feature of each cell;
determining a training set and a test set from the target features of the plurality of cells;
training a preset fine adjustment model based on the training set to obtain a label classification model;
inputting the test set into the label classification model for testing, and calculating the test passing rate;
if the test passing rate is greater than or equal to a preset passing rate threshold value, determining that the training of the label classification model is finished; and if the test passing rate is smaller than the preset passing rate threshold value, increasing the number of the training sets, and re-training the label classification model.
Optionally, the determining, according to the coordinate information of each cell, the column alignment area feature, the relative position feature, and the row adjacent cell feature of each cell includes:
randomly selecting any cell in any column as a target cell, and calculating the column alignment feature of the next row of cells of the target cell, wherein the calculating the column alignment feature of the next row of cells of the target cell comprises: starting recursion from the target cell, sequentially traversing the next row of cells of the target cell, and calculating the column height of the next row of cells of the target cell to obtain the column height; calculating the column distance difference between the target cell and the next row of cells to obtain the column distance difference; calculating the overlapping rate between the target cell and the next row of cells to obtain the row overlapping rate; when the quotient of the column distance difference and the column height is smaller than or equal to a preset first threshold value and the row overlapping rate is larger than or equal to a preset second threshold value, determining that the target unit cell and a unit cell in the next row of the target unit cell form an alignment area; when the target cell and the next row of cells of the target cell form an alignment area, calculating an average value of basic features of the next row of cells of the target cell, and determining the average value as a column alignment feature of the next row of cells of the target cell; repeatedly executing the calculation of the column alignment features of the next row of the target unit cells until the column alignment features of all the unit cells are extracted;
identifying the label of each cell to determine a title cell and an information cell, calculating the relative row distance between each information cell and each title cell, extracting the relative row distance, and determining the relative row distance as the relative position characteristic of the corresponding information cell;
and merging the basic features and the basic features of the left adjacent cell and the right adjacent cell of each cell to obtain merged basic features, and determining the merged basic features as the row adjacent cell features of each cell.
Optionally, the normalizing the first target text information of each cell in the electronic invoice image to obtain the second target text information of each cell in the electronic invoice image includes:
extracting a plurality of first keywords hitting a preset dictionary from the first target text information of each cell;
extracting a plurality of second keywords with a preset number from the preset dictionary according to the plurality of first keywords;
calculating the similarity between any one first keyword and any one second keyword;
and selecting the second keyword with the maximum similarity from the calculated similarities to determine the second keyword as the second target text information of the corresponding cell.
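The normalization step above can be sketched as follows. This is a minimal illustration only: the patent does not specify the similarity measure or how candidate keywords are retrieved, so the containment test, `difflib` ratio, and `top_n` cutoff here are all assumptions.

```python
from difflib import SequenceMatcher

def normalize_cell_text(first_target_text, dictionary, top_n=3):
    """Map a cell's raw text to the closest standard term in a preset dictionary.

    Hypothetical sketch: first keywords are dictionary terms contained in the
    cell text (falling back to the whole text), second keywords are the top_n
    most similar dictionary terms, and the most similar one is kept.
    """
    # Step 1: first keywords from the cell text that hit the preset dictionary.
    first_keywords = [t for t in dictionary if t in first_target_text] or [first_target_text]

    best_term, best_score = first_target_text, -1.0
    for kw in first_keywords:
        # Step 2: take the top_n dictionary terms closest to this keyword.
        candidates = sorted(dictionary,
                            key=lambda t: SequenceMatcher(None, kw, t).ratio(),
                            reverse=True)[:top_n]
        # Steps 3-4: keep the second keyword with the maximum similarity.
        for cand in candidates:
            score = SequenceMatcher(None, kw, cand).ratio()
            if score > best_score:
                best_term, best_score = cand, score
    return best_term
```

For example, `normalize_cell_text("levofloxacin eye drops", ["levofloxacin hydrochloride eye drops", "blood routine examination"])` selects the full standard drug name as the second target text information.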
A second aspect of the present invention provides an electronic invoice information extraction apparatus, the apparatus comprising:
the receiving and identifying module is used for receiving the electronic invoice image of the text to be extracted and identifying the electronic invoice image to obtain an identification result;
the segmentation module is used for performing semantic segmentation on the recognition result to obtain cell information of a plurality of cells of the electronic invoice image;
the classification module is used for performing label classification based on the cell information of the plurality of cells to obtain a label of each cell;
the determining module is used for identifying a plurality of labels of the plurality of cells and determining the mapping relation among the cells;
the normalization processing module is used for performing normalization processing on the first target text information of each cell in the electronic invoice image to obtain second target text information of each cell in the electronic invoice image;
and the extraction module is used for extracting the electronic invoice information from a plurality of second target text information of a plurality of cells of the electronic invoice image according to the mapping relation among the cells.
A third aspect of the present invention provides an electronic device, which includes a processor and a memory, wherein the processor is configured to implement the electronic invoice information extraction method when executing a computer program stored in the memory.
A fourth aspect of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the electronic invoice information extraction method.
In summary, in the electronic invoice information extraction method and device, electronic device, and storage medium described above, the electronic invoice image is recognized and the recognition result is semantically segmented to obtain the cell information of a plurality of cells of the electronic invoice image, which solves the problem of one bounding box containing several cells and improves the accuracy of the subsequently extracted electronic invoice information. Label classification based on the cell information yields the label of each cell, which prevents labels of different types from being grouped together, improves the accuracy of label classification, and ensures the accuracy of each label. Normalizing the first target text information of each cell updates it to a standard value, ensuring that the extracted electronic invoice information is standardized and improving its readability. Finally, extracting the electronic invoice information from the second target text information of the cells according to the mapping relation among the cells avoids the confusion caused by extracting each piece of invoice information in isolation, improving both the accuracy and the efficiency of extraction.
Drawings
Fig. 1 is a flowchart of an electronic invoice information extraction method according to an embodiment of the present invention.
Fig. 2 is a structural diagram of an electronic invoice information extraction apparatus according to a second embodiment of the present invention.
Fig. 3 is a schematic structural diagram of an electronic device according to a third embodiment of the present invention.
Detailed Description
In order that the above objects, features and advantages of the present invention can be more clearly understood, a detailed description of the present invention will be given below with reference to the accompanying drawings and specific embodiments. It should be noted that the embodiments of the present invention and features of the embodiments may be combined with each other without conflict.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
Example one
Fig. 1 is a flowchart of an electronic invoice information extraction method according to an embodiment of the present invention.
In this embodiment, the electronic invoice information extraction method may be applied to an electronic device, and for an electronic device that needs to extract electronic invoice information, the electronic device may directly integrate the function of extracting electronic invoice information provided by the method of the present invention, or operate in the electronic device in the form of a Software Development Kit (SDK).
The embodiment of the invention can acquire and process related data based on artificial intelligence technology. Artificial Intelligence (AI) is the theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results.
The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning, deep learning and the like.
As shown in fig. 1, the electronic invoice information extraction method specifically includes the following steps, and the order of the steps in the flowchart may be changed, and some steps may be omitted according to different requirements.
And S11, receiving the electronic invoice image of the text to be extracted, and identifying the electronic invoice image to obtain an identification result.
In this embodiment, when a user needs to extract electronic invoice information, a client sends the electronic invoice image of the text to be extracted to the server. Specifically, the client may be a smartphone, a tablet such as an iPad, or another intelligent device, and the server may be an electronic invoice information extraction subsystem. During extraction, the client sends the electronic invoice image of the text to be extracted to the electronic invoice information extraction subsystem, and the subsystem recognizes the image upon receipt.
In this embodiment, in the digital medical technology field, the electronic invoice image may be a payment item invoice, a hospital charging electronic invoice, an outpatient service charging bill, or a physical examination report, for example, a blood routine examination report, a urine routine examination report, or other physical examination reports.
In an optional embodiment, the identifying the electronic invoice image to obtain an identification result includes:
recognizing the electronic invoice image by using OCR (Optical Character Recognition) to obtain information of a plurality of bounding boxes, wherein the information of each bounding box comprises its first coordinate information, confidence, and first text information;
and determining the information of the plurality of bounding boxes as a recognition result.
In this embodiment, the first coordinate information of each bounding box includes coordinate information of an upper left corner, coordinate information of a lower left corner, coordinate information of an upper right corner, and coordinate information of a lower right corner of each bounding box.
In this embodiment, for the outpatient service charging ticket, the first text information may be other information such as a project name, an amount, a remark, and the like.
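The structure of the recognition result described above can be sketched as a simple data type. This is an illustrative data model only; the field and type names (`BoundingBoxInfo`, `first_text`, and so on) are assumptions, not the patent's own naming.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) pixel coordinates in the invoice image

@dataclass
class BoundingBoxInfo:
    """One OCR bounding box: the first coordinate information (four corners),
    the recognition confidence, and the first text information."""
    top_left: Point
    top_right: Point
    bottom_right: Point
    bottom_left: Point
    confidence: float
    first_text: str

# The recognition result is simply the list of all bounding boxes on the image.
RecognitionResult = List[BoundingBoxInfo]
```

A box for an item-name cell of an outpatient charging bill might then look like `BoundingBoxInfo((10, 20), (210, 20), (210, 60), (10, 60), 0.97, "项目名称 金额")`.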
And S12, performing semantic segmentation on the recognition result to obtain cell information of a plurality of cells of the electronic invoice image.
In this embodiment, the cell information includes the second coordinate information and the first target text information of each cell. Cells are generally separated by blank areas, so one bounding box normally corresponds to one table unit, that is, one cell. In special cases, however, a long cell text may spatially adhere to adjacent cells, so that one bounding box contains several cells. Such bounding boxes need semantic segmentation, which solves the problem of one bounding box containing multiple cells and improves the accuracy of the subsequently extracted electronic invoice information.
In an optional embodiment, the semantic segmentation of the recognition result to obtain the cell information of the multiple cells of the electronic invoice image includes:
carrying out sequence labeling on the first text information of each boundary box in the recognition result;
inputting the first text information with the sequence mark into a sequence mark model trained in advance for recognition to obtain second text information of each bounding box;
and identifying the label in the second text information of each boundary box, and performing semantic segmentation on each corresponding boundary box to obtain the cell information of a plurality of cells of the electronic invoice image.
In this embodiment, BIO sequence labeling may be adopted to label the first text information of each bounding box. BIO sequence labeling is a well-known technique and is not described further here.
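The BIO-based splitting step can be sketched as follows: each `B-*` tag in the labeled second text information opens a new cell, so a bounding box whose text carries several `B-*` tags is split into several cells. The label names (`B-ITEM`, `B-AMT`) are hypothetical examples, not labels defined by the patent.

```python
def split_by_bio(tokens, tags):
    """Group a BIO-tagged token sequence into cell-level text spans.

    A 'B-*' tag starts a new cell, 'I-*' continues the current one; this is
    how one adhered bounding box is semantically segmented into cells.
    """
    cells, current = [], []
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                cells.append("".join(current))
            current = [tok]
        elif tag.startswith("I-") and current:
            current.append(tok)
    if current:
        cells.append("".join(current))
    return cells
```

For example, a single box reading "眼药水12.50" tagged `B-ITEM I-ITEM I-ITEM B-AMT I-AMT I-AMT I-AMT I-AMT` is split into the two cells "眼药水" and "12.50".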
Further, the identifying the label in the second text information of each bounding box and performing semantic segmentation on each corresponding bounding box to obtain the cell information of the multiple cells of the electronic invoice image includes:
when a plurality of labels in the second text message of any one of the plurality of bounding boxes are identified, segmenting the any one bounding box according to the labels to obtain a plurality of cells;
performing coordinate conversion on the first coordinate information of any one of the bounding boxes to obtain second coordinate information of each of the plurality of cells;
determining first target text information of each cell according to the second coordinate information of each cell in the plurality of cells;
and updating the identification result according to the second coordinate information and the first target text information of each unit cell in the plurality of unit cells to obtain the unit cell information of the plurality of unit cells of the electronic invoice image.
In this embodiment, semantic segmentation is performed on each bounding box, so that electronic invoice information can be extracted after two overlapping bounding boxes are split. This avoids the disorder that results from extracting electronic invoice information directly from overlapping bounding boxes and improves the accuracy of the extracted information.
Further, the coordinate converting the first coordinate information of any one of the bounding boxes to obtain the second coordinate information of each of the plurality of cells includes:
identifying the character types to which all characters in the second text information in any one bounding box belong;
determining a standard character of each character according to the character type of each character;
converting all characters in the second text information of any one bounding box into standard characters according to the character types to which all the characters belong and the corresponding standard characters of each character, calculating the sum of the number of the standard characters, and determining the sum as the sum of the number of the standard characters of the second text information of any one bounding box;
and calculating the coordinate information of each character in any boundary box by adopting a preset formula according to the sum of the first coordinate information of any boundary box and the number of the standard characters of the second text information of any boundary box, and calculating the second coordinate information of each cell in the multiple cells according to the coordinate information of each character.
Specifically, the coordinate information of the upper-left corner and the upper-right corner of the nth character in any one bounding box is calculated by the following preset formulas:

x_{n,3} = x_{n,0} + (x_{ori,3} - x_{ori,0}) × normal_char_n / all_normal_char

y_{n,3} = y_{n,0} + (y_{ori,3} - y_{ori,0}) × normal_char_n / all_normal_char

x_{n,0} = x_{n-1,3}

y_{n,0} = y_{n-1,3}

x_{0,0} = x_{ori,0}

y_{0,0} = y_{ori,0}

wherein (x_{ori,0}, y_{ori,0}) denotes the coordinate information of the upper-left corner of the bounding box, (x_{ori,3}, y_{ori,3}) denotes the coordinate information of the upper-right corner of the bounding box, all_normal_char denotes the sum of the standard-character counts of the second text information of the bounding box, normal_char_n denotes the standard-character count of the nth character, (x_{n,0}, y_{n,0}) denotes the coordinate information of the upper-left corner of the nth character, and (x_{n,3}, y_{n,3}) denotes the coordinate information of the upper-right corner of the nth character.

Illustratively, the standard-character count of a Chinese character is 1, of an uppercase English letter 0.75, of a lowercase English letter 0.5, and of a punctuation mark 0.5. For example, if the second text information is "盐酸左氧氟沙星滴眼液" (levofloxacin hydrochloride eye drops) and the nth character is "盐" ("salt"), then normal_char_n is 1, since "盐" is a Chinese character.
Specifically, the principle of calculating the coordinate information of the lower left corner and the coordinate information of the lower right corner of the nth character in the arbitrary bounding box is the same as the principle of calculating the coordinate information of the upper left corner and the coordinate information of the upper right corner of the nth character in the arbitrary bounding box.
In this embodiment, in order to ensure the accuracy of semantic segmentation, the first text information in each bounding box is subjected to sequence labeling and then subjected to semantic segmentation, and meanwhile, the first text information in each bounding box is converted into standard characters and then subjected to coordinate conversion, so that the cell information of multiple cells of the electronic invoice image is obtained.
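The per-character coordinate conversion described above can be sketched as follows. This is a minimal reading of the preset formulas: each character's top edge is obtained by interpolating along the top edge of the bounding box in proportion to the standard-character counts. The width scheme and function names are assumptions.

```python
def char_widths(text):
    """Standard-character count per character, per the scheme above:
    Chinese 1, uppercase English 0.75, everything else (lowercase,
    punctuation) 0.5 in this simplified sketch."""
    widths = []
    for ch in text:
        if '\u4e00' <= ch <= '\u9fff':   # CJK Unified Ideographs block
            widths.append(1.0)
        elif ch.isupper():
            widths.append(0.75)
        else:
            widths.append(0.5)
    return widths

def char_top_coords(text, top_left, top_right):
    """Upper-left/upper-right coordinates of each character: the nth
    character's left corner is the (n-1)th character's right corner, and
    its right corner advances by (box width) * normal_char_n / all_normal_char."""
    widths = char_widths(text)
    total = sum(widths)                       # all_normal_char
    (x0, y0), (x3, y3) = top_left, top_right  # (x_ori, y_ori) corners
    coords, x_left, y_left = [], x0, y0
    for w in widths:
        x_right = x_left + (x3 - x0) * w / total
        y_right = y_left + (y3 - y0) * w / total
        coords.append(((x_left, y_left), (x_right, y_right)))
        x_left, y_left = x_right, y_right     # x_{n,0} = x_{n-1,3}
    return coords
```

For example, two uppercase letters across a 150-pixel-wide box split it at x = 75, since each contributes an equal standard-character count of 0.75.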
And S13, performing label classification based on the cell information of the plurality of cells to obtain the label of each cell.
In this embodiment, after the cell information is obtained, since the cell information includes the second coordinate information and the first target text information of each cell, the label of each cell can be obtained according to the first target text information of each cell.
In an optional embodiment, the performing label classification based on the cell information of the plurality of cells to obtain a label of each cell includes:
and inputting the cell information of the plurality of cells into a label classification model trained in advance to obtain the label of each cell.
Specifically, the training process of the label classification model includes:
obtaining historical cell information;
extracting historical text information and coordinate information of each cell from the historical cell information;
determining the basic feature of each cell according to the historical text information of each cell, and determining the column alignment area feature, the relative position feature and the row adjacent cell feature of each cell according to the coordinate information of each cell;
associating the basic feature, the column alignment area feature, the relative position feature and the line adjacent cell feature of each cell to obtain a target feature of each cell;
determining a training set and a test set from the target features of the plurality of cells;
training a preset fine adjustment model based on the training set to obtain a label classification model;
inputting the test set into the label classification model for testing, and calculating the test passing rate;
if the test passing rate is larger than or equal to a preset passing rate threshold value, determining that the training of the label classification model is finished; and if the test passing rate is smaller than the preset passing rate threshold value, increasing the number of the training sets, and re-training the label classification model.
In this embodiment, a fine adjustment model may be preset, where the fine adjustment model is preset based on the basic features, the column alignment region features, the relative position features, and the row adjacent cell features of the history cells.
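The train-test-retrain loop described above can be sketched generically. The hook functions (`train_fn`, `evaluate_fn`, `grow_fn`) and the toy majority-label "model" standing in for the fine-tuning model are hypothetical placeholders, not the patent's implementation.

```python
from collections import Counter

def train_until_pass(train_fn, evaluate_fn, train_set, test_set,
                     pass_threshold=0.9, grow_fn=None, max_rounds=5):
    """Train a label-classification model and accept it only when the test
    pass rate reaches the preset threshold; otherwise enlarge the training
    set and retrain."""
    model, pass_rate = None, 0.0
    for _ in range(max_rounds):
        model = train_fn(train_set)
        pass_rate = evaluate_fn(model, test_set)
        if pass_rate >= pass_threshold:
            break
        if grow_fn is None:                  # no way to add samples: stop
            break
        train_set = grow_fn(train_set)       # increase the number of training samples
    return model, pass_rate

# Toy stand-ins for the fine-tuning model: always predict the majority label.
def train_fn(samples):                       # samples: list of (features, label)
    return Counter(label for _, label in samples).most_common(1)[0][0]

def evaluate_fn(model, test_set):            # pass rate = fraction of correct labels
    return sum(1 for _, label in test_set if label == model) / len(test_set)

model, rate = train_until_pass(
    train_fn, evaluate_fn,
    train_set=[({}, "title"), ({}, "title"), ({}, "info")],
    test_set=[({}, "title"), ({}, "title")])
```

With the toy data above, the majority label "title" passes on every test sample, so the loop accepts the model on the first round.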
Further, the determining the basic feature of each cell according to the historical text information of each cell includes:
when the historical text information of each cell hits a keyword in a preset dictionary, determining that the first basic feature of each cell is 1; or when the historical text information of each cell does not hit the keywords in the preset dictionary, determining that the first basic feature of each cell is 0;
counting the first times of the keywords in the preset dictionary when any two words in the historical text information of each cell are ordered and hit, and determining the first times as the second basic characteristic of each cell;
counting a second number of times that each word in the historical text information of each cell hits a keyword in the preset dictionary, and determining the second number as the third basic feature of each cell;
when the historical text information of each cell is matched with a preset amount regular expression, determining that the fourth basic characteristic of each cell is 1; or when the historical text information of each cell is not matched with the preset amount regular expression, determining that the fourth basic characteristic of each cell is 0;
when the numerical value in the historical text information of each cell is 1, determining that the fifth basic characteristic of each cell is 1; alternatively, when the numerical value in the history text information of each cell is not 1, it is determined that the fifth basic feature of each cell is 0.
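The five basic features can be computed per cell roughly as follows (a hedged sketch: the amount regular expression and the treatment of each character as a "word" are assumptions, since the text does not give them explicitly):

```python
import re

def basic_features(text, dictionary):
    """Compute the five basic features of a cell from its text:
    1) whether the whole text hits a dictionary keyword (0/1),
    2) how many ordered two-word combinations hit a keyword,
    3) how many single words hit a keyword,
    4) whether the text matches an amount regular expression (0/1),
    5) whether the numeric value in the text equals 1 (0/1)."""
    words = list(text)  # each character treated as a "word" (assumption)
    f1 = 1 if text in dictionary else 0
    f2 = sum(1 for i in range(len(words)) for j in range(i + 1, len(words))
             if words[i] + words[j] in dictionary)
    f3 = sum(1 for w in words if w in dictionary)
    amount_re = re.compile(r"^\d+(\.\d{1,2})?$")          # assumed amount pattern
    f4 = 1 if amount_re.match(text) else 0
    digits = re.findall(r"\d+(?:\.\d+)?", text)
    f5 = 1 if digits and float(digits[0]) == 1 else 0
    return [f1, f2, f3, f4, f5]
```

The five returned values correspond one-to-one to the first through fifth basic features above.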
Further, the determining the column alignment area feature, the relative position feature and the row adjacent cell feature of each cell according to the coordinate information of each cell includes:
randomly selecting any cell in any column as a target cell, and calculating the column alignment feature of the next-row cell of the target cell, wherein the calculating of the column alignment feature of the next-row cell of the target cell comprises: starting recursion from the target cell, sequentially traversing the next-row cells of the target cell, and calculating the column height of the next-row cell of the target cell to obtain the column height; calculating the column distance difference between the target cell and the next-row cell to obtain the column distance difference; calculating the overlap rate between the target cell and the next-row cell to obtain the row overlap rate; when the quotient of the column distance difference and the column height is smaller than or equal to a preset first threshold and the row overlap rate is larger than or equal to a preset second threshold, determining that the target cell and the next-row cell of the target cell form an alignment area; when the target cell and the next-row cell of the target cell form an alignment area, calculating the average value of the basic features of the next-row cell of the target cell, and determining the average value as the column alignment feature of the next-row cell of the target cell; and repeatedly executing the calculation of the column alignment feature of the next-row cell until the column alignment features of all the cells are extracted;
identifying the label of each cell to determine a title cell and an information cell, calculating the relative row distance between each information cell and each title cell, extracting the relative row distance, and determining the relative row distance as the relative position characteristic of the corresponding information cell;
and merging the basic features and the basic features of the left adjacent cell and the right adjacent cell of each cell to obtain merged basic features, and determining the merged basic features as the row adjacent cell features of each cell.
In this embodiment, the preset first threshold may be 2, and the preset second threshold may be 0.8, which is not limited herein.
In this embodiment, the column height refers to the distance between the upper boundary line and the lower boundary line of each cell; the column distance difference is the difference between the minimum y-axis coordinate of the next-row cell of the target cell and the maximum y-axis coordinate of the target cell; and the row overlap rate is the quotient of the difference between the upper boundary line of the target cell and the lower boundary line of the next-row cell of the target cell divided by the sum of the column height of the target cell and the column height of the next-row cell of the target cell.
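Under these definitions, the alignment-area test can be sketched as follows (an illustration, not the exact procedure: image coordinates with the y-axis increasing downward are assumed, and each cell is represented by its bounding box corners):

```python
def forms_alignment_area(target, nxt, first_threshold=2.0, second_threshold=0.8):
    """Decide whether a target cell and the next-row cell form an
    alignment area: the column-distance-to-column-height quotient must
    not exceed the first threshold, and the row overlap rate must reach
    the second threshold. Cells are (x_min, y_min, x_max, y_max)."""
    h_target = target[3] - target[1]              # column height of the target cell
    h_next = nxt[3] - nxt[1]                      # column height of the next-row cell
    gap = nxt[1] - target[3]                      # column distance difference
    overlap = (nxt[3] - target[1]) / (h_target + h_next)  # row overlap rate
    return gap / h_next <= first_threshold and overlap >= second_threshold
```

With the defaults (first threshold 2, second threshold 0.8), two vertically close cells of similar height pass the test, while a distant cell does not.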
Further, after determining that the target cell and the cell next to the target cell form an alignment area, the method further includes:
calculating that the quotient of the difference between the column minimum value of the next-row cell of the target cell and the column minimum value of the target cell and the column height is smaller than a preset third threshold, and determining that the target cell and the next-row cell of the target cell form left alignment; or
calculating that the quotient of the difference between the column maximum value of the next-row cell of the target cell and the column maximum value of the target cell and the column height is smaller than the preset third threshold, and determining that the target cell and the next-row cell of the target cell form right alignment; or
calculating that the quotient of the difference between the column middle value of the next-row cell of the target cell and the column middle value of the target cell and the column height is smaller than the preset third threshold, and determining that the target cell and the next-row cell of the target cell form middle alignment.
In this embodiment, the preset third threshold may be 1, and this embodiment is not limited herein.
Further, the method further comprises:
and calculating that the quotient of the difference between the column minimum value of the next-row cell of the target cell and the column minimum value of the target cell and the column height is greater than or equal to the preset third threshold, or that the quotient of the difference between the column maximum value of the next-row cell and the column maximum value of the target cell and the column height is greater than or equal to the preset third threshold, or that the quotient of the difference between the column middle value of the next-row cell and the column middle value of the target cell and the column height is greater than or equal to the preset third threshold, and determining that the target cell and the next-row cell of the target cell do not form column alignment.
In the embodiment, the position coordinates of the corresponding cells can be adjusted by determining the left alignment, the right alignment and the middle alignment, and when the electronic invoice information is extracted at the later stage, the adjusted position coordinates of the cells are taken into consideration, so that the accuracy of extracting the electronic invoice information is improved.
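The left/right/middle alignment rules can be sketched as follows (a simplified illustration; absolute differences and the next-row cell's column height are assumed where the text is ambiguous):

```python
def column_alignment_type(target, nxt, third_threshold=1.0):
    """Classify the column alignment between a target cell and the
    next-row cell as left, right, middle, or none, per the preset
    third-threshold rule. Cells are (x_min, y_min, x_max, y_max)."""
    h = nxt[3] - nxt[1]                           # column height of the next-row cell
    if abs(nxt[0] - target[0]) / h < third_threshold:
        return "left"                             # column minimum values close
    if abs(nxt[2] - target[2]) / h < third_threshold:
        return "right"                            # column maximum values close
    mid_t = (target[0] + target[2]) / 2
    mid_n = (nxt[0] + nxt[2]) / 2
    if abs(mid_n - mid_t) / h < third_threshold:
        return "middle"                           # column middle values close
    return "none"                                 # no column alignment
```

A "none" result corresponds to the case above where none of the three quotients falls below the third threshold.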
In this embodiment, when performing label classification on the cell information of the plurality of cells, four aspects of the cell information of each cell are considered: the basic feature, the column alignment area feature, the relative position feature, and the row adjacent cell feature. Specifically, the basic feature is determined from five dimensions: whether the historical text information of each cell hits a keyword in the preset dictionary, the number of times ordered combinations of any two words hit keywords in the preset dictionary, the number of times each word hits a keyword in the preset dictionary, whether the text matches the preset amount regular expression, and whether the numerical value is 1, so that the completeness of the basic feature is ensured. The column alignment area feature considers whether the cells form a column, ensuring that the obtained cells are not misaligned; the relative position feature considers the relative position of each cell and the title cells under the condition that column alignment is ensured, further ensuring the accuracy of the position coordinates of each cell; and the row adjacent cell feature expands the basic feature of each cell, enriching the features of each cell.
In the embodiment, the labels of each cell are determined by considering from multiple dimensions, so that the labels of different classes are prevented from being divided into the same group, the accuracy of label classification is improved, and the accuracy of each label is ensured.
And S14, identifying the labels of the cells and determining the mapping relation among the cells.
In this embodiment, when determining the mapping relationship between the cells, a row relationship identification algorithm may be adopted, and specifically, the row relationship identification algorithm divides a plurality of cells corresponding to each row from the plurality of cells, and identifies a position relationship between the plurality of cells corresponding to each row, thereby determining the mapping relationship between the cells.
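One plausible form of such a row relationship identification algorithm is to group cells by vertical overlap and order each row left to right (the exact algorithm is not specified above, so this is an assumed sketch; each cell is `(x_min, y_min, x_max, y_max, text)`):

```python
def group_cells_into_rows(cells, overlap_ratio=0.5):
    """Divide cells into rows by vertical overlap and order each row
    left to right, yielding the per-row position relationships from
    which the mapping relation between cells can be read."""
    rows = []
    for cell in sorted(cells, key=lambda c: c[1]):        # top to bottom
        for row in rows:
            ref = row[0]
            # vertical overlap between this cell and the row's first cell
            inter = min(cell[3], ref[3]) - max(cell[1], ref[1])
            if inter > overlap_ratio * min(cell[3] - cell[1], ref[3] - ref[1]):
                row.append(cell)
                break
        else:
            rows.append([cell])
    return [sorted(row, key=lambda c: c[0]) for row in rows]
```

Cells whose vertical extents overlap by more than half of the smaller height land in the same row; sorting each row by x gives the left-to-right position relationship.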
And S15, performing normalization processing on the first target text information of each cell in the electronic invoice image to obtain second target text information of each cell in the electronic invoice image.
In this embodiment, the normalization processing is to update the first target text information in each cell, so as to ensure that the extracted electronic invoice information is a standard value, and improve the readability of the electronic invoice information.
In an optional embodiment, the normalizing the first target text information of each cell in the electronic invoice image to obtain the second target text information of each cell in the electronic invoice image includes:
extracting a plurality of first keywords hitting a preset dictionary from the first target text information of each cell;
extracting a plurality of second keywords with a preset number from the preset dictionary according to the plurality of first keywords;
calculating the similarity between any one first keyword and any one second keyword;
and selecting the second keyword with the maximum similarity from the calculated similarities to determine the second keyword as the second target text information of the corresponding cell.
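These steps can be sketched with a generic string-similarity measure (here `difflib`, a stand-in for the unspecified similarity calculation; the dictionary entries are hypothetical):

```python
from difflib import SequenceMatcher

def normalize_text(first_target_text, preset_dictionary):
    """Select the standard value in the preset dictionary that is most
    similar to the cell's first target text information, yielding the
    second target text information."""
    best, best_score = first_target_text, -1.0
    for standard in preset_dictionary:
        score = SequenceMatcher(None, first_target_text, standard).ratio()
        if score > best_score:
            best, best_score = standard, score
    return best
```

The keyword with the maximum similarity becomes the normalized (second) target text of the cell.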
In this embodiment, since the preset dictionary contains standard values, the first target text information is mapped to its standard value; for example, if the first target text information is traditional Chinese medicine decoction pieces, it is normalized to the traditional Chinese medicine fee.
And S16, extracting the electronic invoice information from the second target text information of the cells of the electronic invoice image according to the mapping relation among the cells.
In this embodiment, for a charging electronic invoice, the mapping relationship refers to the association between a charging item and its amount or quantity; for example, if the extracted charging item is a bed fee, the mapping to the corresponding amount is: XXXX yuan.
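Given rows of labeled, normalized cells, the item-to-amount mapping can be read off as in this simplified sketch (the `item`/`amount` label names are assumptions):

```python
def extract_invoice_info(rows):
    """Pair each charging item with its amount via the row-wise mapping
    relation: within one row, the cell labeled as the item maps to the
    cell labeled as the amount. Each row is a list of (label, text)."""
    info = {}
    for row in rows:
        item = next((t for l, t in row if l == "item"), None)
        amount = next((t for l, t in row if l == "amount"), None)
        if item is not None and amount is not None:
            info[item] = amount
    return info
```

Rows that contain only title cells (or lack either side of the mapping) contribute nothing, so only item-amount pairs survive.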
In this embodiment, the mapping relationship among the cells is determined and the electronic invoice information is extracted according to this mapping relationship, which avoids the confusion that would result from extracting each piece of electronic invoice information in isolation and improves the accuracy and efficiency of electronic invoice information extraction.
In summary, in the electronic invoice information extraction method according to this embodiment, the electronic invoice image is identified and the identification result is subjected to semantic segmentation to obtain the cell information of the multiple cells of the electronic invoice image, which solves the problem of one bounding box containing multiple cells and improves the accuracy of the subsequently extracted electronic invoice information. Label classification is performed based on the cell information of the plurality of cells to obtain the label of each cell, which prevents labels of different types from being grouped together, improves the accuracy of label classification, and ensures the accuracy of each label. The first target text information of each cell in the electronic invoice image is normalized, that is, updated to a standard value, which improves the readability of the extracted electronic invoice information. Finally, the electronic invoice information is extracted from the second target text information of the cells of the electronic invoice image according to the mapping relationship among the cells, which avoids the confusion caused by extracting each piece of information in isolation and improves the accuracy and efficiency of electronic invoice information extraction.
Example two
Fig. 2 is a structural diagram of an electronic invoice information extraction apparatus according to a second embodiment of the present invention.
In some embodiments, the electronic invoice information extraction apparatus 20 may include a plurality of functional modules composed of program code segments. The program code of each program segment in the electronic invoice information extraction apparatus 20 can be stored in the memory of the electronic device and executed by the at least one processor to perform the functions of electronic invoice information extraction (described in detail in fig. 1).
In this embodiment, the electronic invoice information extraction device 20 may be divided into a plurality of functional modules according to the functions performed by the electronic invoice information extraction device. The functional module may include: a receiving and identifying module 201, a segmentation module 202, a classification module 203, a determination module 204, a normalization processing module 205, and an extraction module 206. The module referred to herein is a series of computer readable instruction segments stored in a memory that can be executed by at least one processor and that can perform a fixed function. In the present embodiment, the functions of the modules will be described in detail in the following embodiments.
The receiving and identifying module 201 is configured to receive an electronic invoice image of a text to be extracted, and identify the electronic invoice image to obtain an identification result.
In this embodiment, when a user extracts electronic invoice information, the user sends the electronic invoice image of the text to be extracted to the server through a client. Specifically, the client may be a smart phone, an iPad, or another existing intelligent device, and the server may be an electronic invoice information extraction subsystem. In the extraction process, the client sends the electronic invoice image of the text to be extracted to the electronic invoice information extraction subsystem, and the subsystem identifies the electronic invoice image upon receiving it.
In this embodiment, in the digital medical technology field, the electronic invoice image may be a payment item invoice, a hospital charging electronic invoice, an outpatient service charging bill, or a physical examination report, for example, a blood routine examination report, a urine routine examination report, or other physical examination reports.
In an optional embodiment, the receiving and identifying module 201 identifies the electronic invoice image, and obtaining the identification result includes:
recognizing the electronic invoice image by adopting an OCR (optical character recognition), and obtaining information of a plurality of boundary boxes, wherein each boundary box information comprises first coordinate information, confidence and first text information of each boundary box;
and determining the information of the plurality of bounding boxes as a recognition result.
In this embodiment, the first coordinate information of each bounding box includes coordinate information of an upper left corner, coordinate information of a lower left corner, coordinate information of an upper right corner, and coordinate information of a lower right corner of each bounding box.
In this embodiment, for the outpatient service charging ticket, the first text information may be other information such as a project name, an amount, a remark, and the like.
And the segmentation module 202 is configured to perform semantic segmentation on the recognition result to obtain cell information of multiple cells of the electronic invoice image.
In this embodiment, the cell information includes the second coordinate information and the first target text information of each cell. Cells are generally separated by blank areas, and one bounding box normally represents one table unit, that is, one cell. There are also special cases: when the text of a cell is long, it may spatially adhere to adjacent cells, so that one bounding box contains a plurality of cells. Semantic segmentation therefore needs to be performed on such cells, which solves the problem of one bounding box containing a plurality of cells and improves the accuracy of the subsequently extracted electronic invoice information.
In an optional embodiment, the semantic segmentation performed on the recognition result by the segmentation module 202 to obtain the cell information of the multiple cells of the electronic invoice image includes:
carrying out sequence labeling on the first text information of each boundary box in the recognition result;
inputting the first text information with the sequence mark into a sequence mark model trained in advance for recognition to obtain second text information of each bounding box;
and identifying the label in the second text information of each boundary box, and performing semantic segmentation on each corresponding boundary box to obtain the cell information of a plurality of cells of the electronic invoice image.
In this embodiment, BIO sequence labeling may be adopted to perform sequence labeling on the first text information of each bounding box; BIO sequence labeling is a known technique and is not described in detail here.
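As a toy illustration of the BIO scheme (not the trained sequence labeling model used above), tags can be produced from known spans like so:

```python
def bio_tags(tokens, entity_spans):
    """Produce BIO sequence labels for a token list: B- marks the first
    token of a span, I- its continuation, and O everything else. Spans
    are (start, end, label) with end exclusive."""
    tags = ["O"] * len(tokens)
    for start, end, label in entity_spans:
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return tags
```

For example, two cells glued into one bounding box ("bed fee 12.00") would be tagged as an ITEM span followed by an AMOUNT span, which is exactly the signal the segmentation step uses to split the box.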
Further, the identifying the label in the second text information of each bounding box and performing semantic segmentation on each corresponding bounding box to obtain the cell information of the multiple cells of the electronic invoice image includes:
when a plurality of labels in the second text message of any one of the plurality of bounding boxes are identified, segmenting the any one bounding box according to the labels to obtain a plurality of cells;
performing coordinate conversion on the first coordinate information of any one of the bounding boxes to obtain second coordinate information of each of the plurality of cells;
determining first target text information of each cell according to the second coordinate information of each cell in the plurality of cells;
and updating the identification result according to the second coordinate information and the first target text information of each unit cell in the plurality of unit cells to obtain the unit cell information of the plurality of unit cells of the electronic invoice image.
In the embodiment, semantic segmentation is performed on each boundary box, electronic invoice information can be extracted after two overlapped boundary boxes are segmented, the problem that electronic invoice information is extracted in a disordered manner due to the fact that electronic invoice information is extracted directly from the boundary boxes is solved, and the accuracy of the extracted electronic invoice information is improved.
Further, the coordinate converting the first coordinate information of any one of the bounding boxes to obtain the second coordinate information of each of the plurality of cells includes:
identifying the character types to which all characters in the second text information in any one bounding box belong;
determining a standard character of each character according to the character type of each character;
converting all characters in the second text information of any one bounding box into standard characters according to the character types to which all the characters belong and the corresponding standard characters of each character, calculating the sum of the number of the standard characters, and determining the sum as the sum of the number of the standard characters of the second text information of any one bounding box;
and calculating the coordinate information of each character in any boundary box by adopting a preset formula according to the sum of the first coordinate information of any boundary box and the number of the standard characters of the second text information of any boundary box, and calculating the second coordinate information of each cell in the multiple cells according to the coordinate information of each character.
Specifically, the coordinate information of the upper left corner and the coordinate information of the upper right corner of the nth character in any one of the bounding boxes are calculated by using the following preset formulas:

xn_3 = xn_0 + (xori_3 − xori_0) × normal_charn / all_normal_char

yn_3 = yn_0 + (yori_3 − yori_0) × normal_charn / all_normal_char

xn_0 = xn-1_3

yn_0 = yn-1_3

x0_0 = xori_0

y0_0 = yori_0

wherein (xori_0, yori_0) represents the coordinate information of the upper left corner of the bounding box, (xori_3, yori_3) represents the coordinate information of the upper right corner of the bounding box, all_normal_char represents the sum of the standard character counts of the second text information of the bounding box, normal_charn represents the standard character count of the nth character, (xn_0, yn_0) represents the coordinate information of the upper left corner of the nth character, and (xn_3, yn_3) represents the coordinate information of the upper right corner of the nth character.
Illustratively, the standard character count of a Chinese character is 1, that of an uppercase English character is 0.75, that of a lowercase English character is 0.5, and that of a punctuation mark is 0.5, and normal_charn represents the standard character count of the nth character. For example, if the second text information is the Chinese name of levofloxacin hydrochloride eye drops and the nth character is the 6th character "盐" (a Chinese character), then normal_charn is 1.
Specifically, the principle of calculating the coordinate information of the lower left corner and the coordinate information of the lower right corner of the nth character in the arbitrary bounding box is the same as the principle of calculating the coordinate information of the upper left corner and the coordinate information of the upper right corner of the nth character in the arbitrary bounding box.
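The per-character coordinate computation along the x-axis can be sketched as follows (the standard character counts follow the example above; the count for digits is an assumption, since the text does not give it):

```python
def char_widths_and_corners(text, box_left, box_right):
    """Distribute a bounding box's width over its characters in
    proportion to their standard character counts, yielding each
    character's upper-left and upper-right x-coordinates as in the
    formulas above."""
    def normal_char(ch):
        if ch.isupper():
            return 0.75        # uppercase English character
        if ch.islower() or not ch.isalnum():
            return 0.5         # lowercase English character or punctuation
        return 1.0             # Chinese character; digits assumed 1.0 here
    total = sum(normal_char(c) for c in text)        # all_normal_char
    width = box_right - box_left
    corners, x = [], box_left                        # x0_0 = xori_0
    for ch in text:
        right = x + width * normal_char(ch) / total  # xn_3
        corners.append((x, right))
        x = right                                    # xn_0 = x(n-1)_3
    return corners
```

Each character's left edge is the previous character's right edge, and its width is its standard-character share of the box width; the y-coordinates follow the same principle.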
In the embodiment, in order to ensure the accuracy of semantic segmentation, the first text information in each bounding box is subjected to sequence labeling and then subjected to semantic segmentation, and meanwhile, the first text information in each bounding box is converted into standard characters and then subjected to coordinate conversion, so that the cell information of a plurality of cells of the electronic invoice image is obtained.
And the classification module 203 is configured to perform label classification based on the cell information of the multiple cells to obtain a label of each cell.
In this embodiment, after the cell information is obtained, since the cell information includes the second coordinate information and the first target text information of each cell, the label of each cell can be obtained according to the first target text information of each cell.
In an optional embodiment, the classifying module 203 performs label classification based on the cell information of the plurality of cells, and obtaining the label of each cell includes:
and inputting the cell information of the plurality of cells into a label classification model trained in advance to obtain the label of each cell.
Specifically, the training process of the label classification model includes:
obtaining historical cell information;
extracting historical text information and coordinate information of each cell from the historical cell information;
determining the basic feature of each cell according to the historical text information of each cell, and determining the column alignment area feature, the relative position feature and the row adjacent cell feature of each cell according to the coordinate information of each cell;
associating the basic feature, the column alignment area feature, the relative position feature and the row adjacent cell feature of each cell to obtain a target feature of each cell;
determining a training set and a test set from the target features of the plurality of cells;
training a preset fine-tuning model based on the training set to obtain a label classification model;
inputting the test set into the label classification model for testing, and calculating a test passing rate;
if the test passing rate is greater than or equal to a preset passing rate threshold value, determining that the training of the label classification model is finished; and if the test passing rate is smaller than the preset passing rate threshold value, increasing the number of the training sets, and re-training the label classification model.
In this embodiment, a fine-tuning model may be preset, where the fine-tuning model is built in advance based on the basic features, the column alignment area features, the relative position features, and the row adjacent cell features of historical cells.
Further, the determining the basic feature of each cell according to the historical text information of each cell includes:
when the historical text information of each cell hits a keyword in a preset dictionary, determining that the first basic feature of each cell is 1; or when the historical text information of each cell does not hit the keywords in the preset dictionary, determining that the first basic feature of each cell is 0;
counting a first number of times that ordered combinations of any two words in the historical text information of each cell hit keywords in the preset dictionary, and determining the first number as the second basic feature of each cell;
counting a second number of times that each word in the historical text information of each cell hits a keyword in the preset dictionary, and determining the second number as the third basic feature of each cell;
when the historical text information of each cell is matched with a preset amount regular expression, determining that the fourth basic characteristic of each cell is 1; or when the historical text information of each cell is not matched with the preset amount regular expression, determining that the fourth basic characteristic of each cell is 0;
when the numerical value in the historical text information of each cell is 1, determining that the fifth basic characteristic of each cell is 1; alternatively, when the numerical value in the history text information of each cell is not 1, it is determined that the fifth basic feature of each cell is 0.
Further, the determining the column alignment area feature, the relative position feature and the row adjacent cell feature of each cell according to the coordinate information of each cell includes:
randomly selecting any cell in any column as a target cell, and calculating the column alignment feature of the next-row cell of the target cell, wherein the calculating of the column alignment feature of the next-row cell of the target cell comprises: starting recursion from the target cell, sequentially traversing the next-row cells of the target cell, and calculating the column height of the next-row cell of the target cell to obtain the column height; calculating the column distance difference between the target cell and the next-row cell to obtain the column distance difference; calculating the overlap rate between the target cell and the next-row cell to obtain the row overlap rate; when the quotient of the column distance difference and the column height is smaller than or equal to a preset first threshold and the row overlap rate is larger than or equal to a preset second threshold, determining that the target cell and the next-row cell of the target cell form an alignment area; when the target cell and the next-row cell of the target cell form an alignment area, calculating the average value of the basic features of the next-row cell of the target cell, and determining the average value as the column alignment feature of the next-row cell of the target cell; and repeatedly executing the calculation of the column alignment feature of the next-row cell until the column alignment features of all the cells are extracted;
identifying the label of each cell to determine a title cell and an information cell, calculating the relative row distance between each information cell and each title cell, extracting the relative row distance, and determining the relative row distance as the relative position characteristic of the corresponding information cell;
and merging the basic features and the basic features of the left adjacent cell and the right adjacent cell of each cell to obtain merged basic features, and determining the merged basic features as the row adjacent cell features of each cell.
In this embodiment, the preset first threshold may be 2, and the preset second threshold may be 0.8, which is not limited herein.
In this embodiment, the column height refers to the distance between the upper boundary line and the lower boundary line of each cell; the column distance difference is the difference between the minimum y-axis coordinate of the next-row cell of the target cell and the maximum y-axis coordinate of the target cell; and the row overlap rate is the quotient of the difference between the upper boundary line of the target cell and the lower boundary line of the next-row cell of the target cell divided by the sum of the column height of the target cell and the column height of the next-row cell of the target cell.
Further, after determining that the target cell and the cell next to the target cell form an alignment area, the method further includes:
calculating that the quotient of the difference between the column minimum value of the next-row cell of the target cell and the column minimum value of the target cell and the column height is smaller than a preset third threshold, and determining that the target cell and the next-row cell of the target cell form left alignment; or
calculating that the quotient of the difference between the column maximum value of the next-row cell of the target cell and the column maximum value of the target cell and the column height is smaller than the preset third threshold, and determining that the target cell and the next-row cell of the target cell form right alignment; or
calculating that the quotient of the difference between the column middle value of the next-row cell of the target cell and the column middle value of the target cell and the column height is smaller than the preset third threshold, and determining that the target cell and the next-row cell of the target cell form middle alignment.
In this embodiment, the preset third threshold may be 1, and this embodiment is not limited herein.
Further, when it is calculated that the quotient of the difference between the column minimum value of the next-row cell of the target cell and the column minimum value of the target cell and the column height is greater than or equal to the preset third threshold, or that the quotient of the difference between the column maximum value of the next-row cell and the column maximum value of the target cell and the column height is greater than or equal to the preset third threshold, or that the quotient of the difference between the column middle value of the next-row cell and the column middle value of the target cell and the column height is greater than or equal to the preset third threshold, it is determined that the target cell and the next-row cell of the target cell do not form column alignment.
In this embodiment, the position coordinates of the corresponding cells can be adjusted once left alignment, right alignment, or middle alignment is determined, and the adjusted position coordinates are taken into account when the electronic invoice information is extracted later, which improves the accuracy of the extraction.
In this embodiment, when performing label classification on the cell information of the plurality of cells, four aspects of the cell information of each cell are considered: a basic feature, a column alignment area feature, a relative position feature, and a row-adjacent-cell feature. Specifically, the basic feature is determined from five dimensions: whether the historical text information of each cell hits a key in a preset dictionary, the number of times any ordered pair of words hits a keyword in the preset dictionary, the number of times each single word hits a keyword in the preset dictionary, whether the text matches a preset money regular expression, and whether the numerical value is 1, which ensures the completeness of the basic feature. The column alignment area feature considers whether each cell belongs to a column, ensuring that the obtained cells are not misaligned. The relative position feature considers the relative position of each cell with respect to the title cell while ensuring column alignment, further ensuring the accuracy of the position coordinates of each cell. The row-adjacent-cell feature extends the basic feature of each cell with the basic features of its neighbors, enriching the features of each cell.
In this embodiment, the label of each cell is determined from multiple dimensions, which prevents labels of different classes from being divided into the same group, improves the accuracy of label classification, and ensures the accuracy of each label.
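As a sketch of the five basic-feature dimensions listed above; the dictionary keys, the money regular expression, and the 0/1 encoding are hypothetical choices, not fixed by the embodiment:

```python
import re
from itertools import permutations

PRESET_KEYS = {"bed fee", "amount", "quantity"}  # hypothetical dictionary keys
MONEY_RE = re.compile(r"^\d+(\.\d{1,2})?$")      # hypothetical money pattern

def basic_features(text):
    """Five-dimensional basic feature of a cell's text information."""
    words = text.split()
    return [
        int(any(key in text for key in PRESET_KEYS)),  # hits a dictionary key?
        sum(" ".join(pair) in PRESET_KEYS              # ordered word-pair hits
            for pair in permutations(words, 2)),
        sum(word in PRESET_KEYS for word in words),    # single-word hits
        int(bool(MONEY_RE.match(text))),               # matches money format?
        int(text.strip() == "1"),                      # numerical value is 1?
    ]
```

The resulting vector would then be concatenated with the column alignment area, relative position, and row-adjacent-cell features to form the target feature of each cell.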
A determining module 204, configured to identify multiple tags of the multiple cells, and determine a mapping relationship between the cells.
In this embodiment, a row relationship identification algorithm may be adopted when determining the mapping relationship among the cells. Specifically, the row relationship identification algorithm divides the plurality of cells into the cells corresponding to each row and identifies the positional relationship among the cells in each row, thereby determining the mapping relationship among the cells.
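A minimal sketch of the row-partition step of such a row relationship identification algorithm; the vertical-overlap criterion and its threshold are assumptions for illustration:

```python
def group_rows(cells, overlap=0.5):
    """Partition cells into rows by vertical overlap.

    Each cell is (x_min, y_min, y_max, text); two cells share a row when
    their vertical spans overlap by at least `overlap` of the shorter span.
    Returns each row's texts ordered left to right.
    """
    rows = []
    for cell in sorted(cells, key=lambda c: c[1]):
        y0, y1 = cell[1], cell[2]
        for row in rows:
            inter = min(y1, row["y1"]) - max(y0, row["y0"])
            if inter > 0 and inter / min(y1 - y0, row["y1"] - row["y0"]) >= overlap:
                row["cells"].append(cell)
                break
        else:  # no existing row overlaps enough: start a new row
            rows.append({"y0": y0, "y1": y1, "cells": [cell]})
    return [[c[3] for c in sorted(row["cells"], key=lambda c: c[0])]
            for row in rows]
```

Each returned row can then be inspected for charging-item and amount cells to build the mapping relationship.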
The normalization processing module 205 is configured to perform normalization processing on the first target text information of each cell in the electronic invoice image to obtain second target text information of each cell in the electronic invoice image.
In this embodiment, the normalization processing is to update the first target text information in each cell, so as to ensure that the extracted electronic invoice information is a standard value, and improve the readability of the electronic invoice information.
In an optional embodiment, the normalization processing module 205 performing normalization processing on the first target text information of each cell in the electronic invoice image to obtain the second target text information of each cell in the electronic invoice image includes:
extracting a plurality of first keywords hitting a preset dictionary from the first target text information of each cell;
extracting a plurality of second keywords with a preset number from the preset dictionary according to the plurality of first keywords;
calculating the similarity between any one first keyword and any one second keyword;
and selecting, from the calculated similarities, the second keyword with the maximum similarity and determining it as the second target text information of the corresponding cell.
In this embodiment, the preset dictionary contains standard values. For example, if the first target text information is "traditional Chinese medicine decoction pieces", it is normalized to the standard value "traditional Chinese medicine fee".
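The four normalization steps above can be sketched like this; the dictionary contents and the use of a difflib ratio as the similarity measure are assumptions, since the embodiment does not prescribe a specific similarity computation:

```python
from difflib import SequenceMatcher

# Hypothetical preset dictionary: keyword -> candidate standard values.
PRESET_DICT = {
    "traditional Chinese medicine": ["traditional Chinese medicine fee"],
    "bed": ["bed fee"],
}

def normalize(first_target_text):
    """Replace raw cell text with the most similar standard value, if any."""
    # Step 1: first keywords in the text that hit the preset dictionary.
    hits = [key for key in PRESET_DICT if key in first_target_text]
    # Step 2: candidate second keywords drawn from the dictionary.
    candidates = [c for key in hits for c in PRESET_DICT[key]]
    if not candidates:
        return first_target_text  # no hit: keep the original text
    # Steps 3 and 4: choose the candidate with the maximum similarity.
    return max(candidates,
               key=lambda c: SequenceMatcher(None, first_target_text, c).ratio())
```

With this sketch, "traditional Chinese medicine decoction pieces" is mapped to the standard value "traditional Chinese medicine fee", matching the example in the embodiment.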
And an extracting module 206, configured to extract the electronic invoice information from the second target text information of the multiple cells of the electronic invoice image according to the mapping relationship between the cells.
In this embodiment, for a charging electronic invoice, the mapping relationship refers to the association between a charging item and its amount or quantity. For example, if the extracted charging item is a bed fee, the amount mapped to it is: XXXX yuan.
In this embodiment, the mapping relationship among the cells is determined, and the electronic invoice information is extracted according to the mapping relationship, which avoids the confusion that would arise from extracting each item of information independently and improves the accuracy and efficiency of electronic invoice information extraction.
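Under that mapping relationship, the final extraction step amounts to pairing each charging item with the amount it maps to; the labelled row data below are hypothetical:

```python
def extract_invoice_info(rows):
    """Collect charging-item -> amount pairs from labelled rows."""
    info = {}
    for row in rows:
        cells = dict(row)  # label -> cell text within one row
        if "item" in cells and "amount" in cells:
            info[cells["item"]] = cells["amount"]
    return info

# Hypothetical labelled rows produced by the mapping step.
rows = [
    [("item", "bed fee"), ("amount", "120.00")],
    [("item", "traditional Chinese medicine fee"), ("amount", "86.50")],
]
```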
In summary, the electronic invoice information extraction device described in this embodiment identifies the electronic invoice image and performs semantic segmentation on the identification result to obtain the cell information of a plurality of cells of the electronic invoice image, thereby solving the problem of one bounding box containing multiple cells and improving the accuracy of the subsequently extracted electronic invoice information. Label classification is performed based on the cell information of the plurality of cells to obtain the label of each cell, which prevents labels of different classes from being divided into the same group, improves the accuracy of label classification, and ensures the accuracy of each label. Normalization processing is performed on the first target text information of each cell in the electronic invoice image; the normalization updates the first target text information of each cell, which ensures that the extracted electronic invoice information is a standard value and improves its readability. According to the mapping relation among the cells, the electronic invoice information is extracted from the second target text information of the cells of the electronic invoice image, which avoids the confusion that would arise from extracting each item of information independently and improves the accuracy and efficiency of electronic invoice information extraction.
EXAMPLE III
Fig. 3 is a schematic structural diagram of an electronic device according to a third embodiment of the present invention. In the preferred embodiment of the present invention, the electronic device 3 comprises a memory 31, at least one processor 32, at least one communication bus 33 and a transceiver 34.
It will be appreciated by those skilled in the art that the configuration of the electronic device shown in fig. 3 does not constitute a limitation of the embodiment of the present invention; it may be a bus-type or a star-type configuration, and the electronic device 3 may include more or fewer hardware or software components than shown, or a different arrangement of components.
In some embodiments, the electronic device 3 is an electronic device capable of automatically performing numerical calculation and/or information processing according to instructions set or stored in advance, and the hardware thereof includes but is not limited to a microprocessor, an application specific integrated circuit, a programmable gate array, a digital processor, an embedded device, and the like. The electronic device 3 may also include a client device, which includes, but is not limited to, any electronic product that can interact with a client through a keyboard, a mouse, a remote controller, a touch pad, or a voice control device, for example, a personal computer, a tablet computer, a smart phone, a digital camera, and the like.
It should be noted that the electronic device 3 is only an example; other existing or future electronic products adaptable to the present invention should also be included within the scope of protection of the present invention and are incorporated herein by reference.
In some embodiments, the memory 31 is used for storing program code and various data, such as the electronic invoice information extraction device 20 installed in the electronic device 3, and realizes high-speed, automatic access to programs or data during the operation of the electronic device 3. The memory 31 includes a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), a One-Time Programmable Read-Only Memory (OTPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Compact Disc Read-Only Memory (CD-ROM) or other optical disk storage, magnetic disk storage, magnetic tape storage, or any other computer-readable medium capable of carrying or storing data.
In some embodiments, the at least one processor 32 may be composed of an integrated circuit, for example, a single packaged integrated circuit, or of a plurality of integrated circuits packaged with the same or different functions, including one or more Central Processing Units (CPUs), microprocessors, digital processing chips, graphics processors, and combinations of various control chips. The at least one processor 32 is the control unit of the electronic device 3: it connects the components of the electronic device 3 by using various interfaces and lines, and executes the functions of the electronic device 3 and processes its data by running or executing programs or modules stored in the memory 31 and calling data stored in the memory 31.
In some embodiments, the at least one communication bus 33 is arranged to enable connection communication between the memory 31 and the at least one processor 32 or the like.
Although not shown, the electronic device 3 may further include a power supply (such as a battery) for supplying power to each component, and optionally, the power supply may be logically connected to the at least one processor 32 through a power management device, so as to implement functions of managing charging, discharging, and power consumption through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The electronic device 3 may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.
The integrated unit implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, an electronic device, or a network device) or a processor (processor) to execute parts of the methods according to the embodiments of the present invention.
In a further embodiment, in conjunction with fig. 2, the at least one processor 32 may execute the operating system of the electronic device 3 and various installed application programs (such as the electronic invoice information extraction device 20), program code, and the like, for example the modules described above.
The memory 31 has program code stored therein, and the at least one processor 32 can call the program code stored in the memory 31 to perform related functions. For example, the modules illustrated in fig. 2 are program code stored in the memory 31 and executed by the at least one processor 32, so as to implement the functions of the modules for the purpose of electronic invoice information extraction.
Illustratively, the program code may be divided into one or more modules/units, which are stored in the memory 31 and executed by the processor 32 to accomplish the present application. The one or more modules/units may be a series of computer readable instruction segments capable of performing certain functions, which are used for describing the execution process of the program code in the electronic device 3. For example, the program code may be partitioned into a receiving and identifying module 201, a segmenting module 202, a classifying module 203, a determining module 204, a normalizing module 205, and an extracting module 206.
In one embodiment of the present invention, the memory 31 stores a plurality of computer-readable instructions that are executed by the at least one processor 32 to implement the functionality of electronic invoice information extraction.
Specifically, the at least one processor 32 may refer to the description of the relevant steps in the embodiment corresponding to fig. 1, and details are not repeated here.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from its spirit or essential attributes. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description; all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements, and that the singular does not exclude the plural. A plurality of units or means recited in the present invention may also be implemented by one unit or means through software or hardware. The terms first, second, etc. are used to denote names and do not denote any particular order.
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims (10)

1. An electronic invoice information extraction method, characterized in that the method comprises:
receiving an electronic invoice image of a text to be extracted, and identifying the electronic invoice image to obtain an identification result;
performing semantic segmentation on the recognition result to obtain cell information of a plurality of cells of the electronic invoice image;
performing label classification based on the cell information of the plurality of cells to obtain a label of each cell;
identifying a plurality of labels of the plurality of cells, and determining the mapping relation among the cells;
normalizing the first target text information of each cell in the electronic invoice image to obtain second target text information of each cell in the electronic invoice image;
and according to the mapping relation among the cells, extracting the electronic invoice information from a plurality of second target text information of a plurality of cells of the electronic invoice image.
2. The method for extracting electronic invoice information according to claim 1, wherein the semantically segmenting the recognition result to obtain the cell information of the plurality of cells of the electronic invoice image comprises:
carrying out sequence labeling on the first text information of each boundary box in the recognition result;
inputting the first text information with the sequence mark into a sequence mark model trained in advance for recognition to obtain second text information of each bounding box;
and identifying the label in the second text information of each boundary box, and performing semantic segmentation on each corresponding boundary box to obtain the cell information of a plurality of cells of the electronic invoice image.
3. The method for extracting electronic invoice information according to claim 2, wherein the identifying the label in the second text information of each bounding box and performing semantic segmentation on each corresponding bounding box to obtain the cell information of the multiple cells of the electronic invoice image comprises:
when a plurality of labels in the second text message of any one of the plurality of bounding boxes are identified, segmenting the any one bounding box according to the labels to obtain a plurality of cells;
performing coordinate conversion on the first coordinate information of any one of the bounding boxes to obtain second coordinate information of each of the plurality of cells;
determining first target text information of each cell according to the second coordinate information of each cell in the plurality of cells;
and updating the identification result according to the second coordinate information and the first target text information of each unit cell in the plurality of unit cells to obtain the unit cell information of the plurality of unit cells of the electronic invoice image.
4. The electronic invoice information extraction method of claim 3, wherein the coordinate transforming the first coordinate information of any one of the bounding boxes to obtain the second coordinate information of each of the plurality of cells comprises:
identifying the character types to which all characters in the second text information in any one bounding box belong;
determining a standard character of each character according to the character type of each character;
converting all characters in the second text information of any one bounding box into standard characters according to the character types to which all the characters belong and the corresponding standard characters of each character, calculating the sum of the number of the standard characters, and determining the sum as the sum of the number of the standard characters of the second text information of any one bounding box;
and calculating the coordinate information of each character in any boundary box by adopting a preset formula according to the sum of the first coordinate information of any boundary box and the number of the standard characters of the second text information of any boundary box, and calculating the second coordinate information of each cell in the multiple cells according to the coordinate information of each character.
5. The electronic invoice information extraction method of claim 1, wherein the performing label classification based on the cell information of the plurality of cells to obtain a label of each cell comprises:
inputting the cell information into a label classification model trained in advance to obtain a label of each cell, wherein the training process of the label classification model comprises the following steps:
obtaining historical cell information;
extracting historical text information and coordinate information of each cell from the historical cell information;
determining the basic feature of each cell according to the historical text information of each cell, and determining the column alignment area feature, the relative position feature and the row adjacent cell feature of each cell according to the coordinate information of each cell;
associating the basic feature, the column alignment area feature, the relative position feature and the line adjacent cell feature of each cell to obtain a target feature of each cell;
determining a training set and a test set from the target features of the plurality of cells;
training a preset fine adjustment model based on the training set to obtain a label classification model;
inputting the test set into the label classification model for testing, and calculating the test passing rate;
if the test passing rate is greater than or equal to a preset passing rate threshold value, determining that the training of the label classification model is finished; and if the test passing rate is smaller than the preset passing rate threshold value, increasing the number of the training sets, and re-training the label classification model.
6. The electronic invoice information extraction method of claim 5, wherein the determining the column alignment area feature, the relative position feature, and the row adjacent cell feature for each cell from the coordinate information for each cell comprises:
randomly selecting any cell in any column as a target cell, and calculating the column alignment feature of the next row of cells of the target cell, wherein the calculating the column alignment feature of the next row of cells of the target cell comprises: starting recursion from the target cell, sequentially traversing the next row of cells of the target cell, and calculating the column height of the next row of cells of the target cell to obtain the column height; calculating the column distance difference between the target cell and the next row of cells to obtain the column distance difference; calculating the overlapping rate between the target cell and the next row of cells to obtain the row overlapping rate; when the quotient of the column distance difference and the column height is smaller than or equal to a preset first threshold value and the row overlapping rate is larger than or equal to a preset second threshold value, determining that the target unit cell and a unit cell in the next row of the target unit cell form an alignment area; when the target cell and the next row of cells of the target cell form an alignment area, calculating an average value of basic features of the next row of cells of the target cell, and determining the average value as a column alignment feature of the next row of cells of the target cell; repeatedly executing the calculation of the column alignment features of the next row of the target unit cells until the column alignment features of all the unit cells are extracted;
identifying the label of each cell to determine a title cell and an information cell, calculating the relative row distance between each information cell and each title cell, extracting the relative row distance, and determining the relative row distance as the relative position characteristic of the corresponding information cell;
and merging the basic features and the basic features of the left adjacent cell and the right adjacent cell of each cell to obtain merged basic features, and determining the merged basic features as the row adjacent cell features of each cell.
7. The method for extracting electronic invoice information according to claim 1, wherein the normalizing the first target text information of each cell in the electronic invoice image to obtain the second target text information of each cell in the electronic invoice image comprises:
extracting a plurality of first keywords hitting a preset dictionary from the first target text information of each cell;
extracting a plurality of second keywords with a preset number from the preset dictionary according to the plurality of first keywords;
calculating the similarity between any one first keyword and any one second keyword;
and selecting, from the calculated similarities, the second keyword with the maximum similarity and determining it as the second target text information of the corresponding cell.
8. An electronic invoice information extraction apparatus, the apparatus comprising:
the receiving and identifying module is used for receiving the electronic invoice image of the text to be extracted and identifying the electronic invoice image to obtain an identification result;
the segmentation module is used for performing semantic segmentation on the recognition result to obtain cell information of a plurality of cells of the electronic invoice image;
the classification module is used for performing label classification based on the cell information of the plurality of cells to obtain a label of each cell;
the determining module is used for identifying a plurality of labels of the plurality of cells and determining the mapping relation among the cells;
the normalization processing module is used for performing normalization processing on the first target text information of each cell in the electronic invoice image to obtain second target text information of each cell in the electronic invoice image;
and the extraction module is used for extracting the electronic invoice information from a plurality of second target text information of a plurality of cells of the electronic invoice image according to the mapping relation among the cells.
9. An electronic device, characterized in that the electronic device comprises a processor and a memory, the processor being configured to implement the electronic invoice information extraction method according to any one of claims 1 to 7 when executing the computer program stored in the memory.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the electronic invoice information extraction method according to any one of claims 1 to 7.
CN202210067279.XA 2022-01-20 2022-01-20 Electronic invoice information extraction method and device, electronic equipment and storage medium Pending CN114495133A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210067279.XA CN114495133A (en) 2022-01-20 2022-01-20 Electronic invoice information extraction method and device, electronic equipment and storage medium


Publications (1)

Publication Number Publication Date
CN114495133A true CN114495133A (en) 2022-05-13

Family

ID=81472104

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210067279.XA Pending CN114495133A (en) 2022-01-20 2022-01-20 Electronic invoice information extraction method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114495133A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination