CN115205884A - Bill information extraction method and device, equipment, medium and product thereof - Google Patents

Bill information extraction method and device, equipment, medium and product thereof

Info

Publication number
CN115205884A
Authority
CN
China
Prior art keywords
text; information; feature map; features; image
Prior art date
Legal status
Pending
Application number
CN202210887294.9A
Other languages
Chinese (zh)
Inventor
罗丹
Current Assignee
Guangzhou Huanju Shidai Information Technology Co Ltd
Original Assignee
Guangzhou Huanju Shidai Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Guangzhou Huanju Shidai Information Technology Co Ltd
Priority to CN202210887294.9A
Publication of CN115205884A


Classifications

    • G06V 30/42: Document-oriented image-based pattern recognition based on the type of document
    • G06V 10/82: Image or video recognition or understanding using pattern recognition or machine learning, using neural networks
    • G06V 30/18: Character recognition; extraction of features or characteristics of the image
    • G06V 30/19173: Character recognition using electronic means; classification techniques
    • G06V 30/414: Analysis of document content; extracting the geometrical structure, e.g. layout tree; block segmentation, e.g. bounding boxes for graphics or text


Abstract

The application discloses a bill information extraction method and a device, equipment, medium and product thereof. The method comprises the following steps: performing text recognition on a bill image to obtain the text box position information and the text information of each text area in the image; acquiring text visual features at the corresponding positions according to the text box position information of each text area in the bill image; fusing the text information and text box position information of each text area to obtain a second text information feature of each text area; and finally, fusing the text visual features and second text information features corresponding to each text region to obtain multi-modal features of each text region, inputting the multi-modal features into a classification neural network for discrimination, and determining the classification label corresponding to the text information of each text region. The method and the device can accurately extract the text information required for the order data from the bill image.

Description

Bill information extraction method and device, equipment, medium and product thereof
Technical Field
The present application relates to the field of e-commerce information technology, and in particular, to a method for extracting ticket information, and a corresponding apparatus, computer device, computer readable storage medium, and computer program product.
Background
With the progress and development of the information age, automatic information extraction from bill images has become an increasingly important requirement in related industries, for example invoice information extraction, shopping list information extraction, and payment voucher information extraction.
In some e-commerce scenarios operated partly offline, a consumer user of the e-commerce platform may provide the bill information of an e-commerce order, such as the commodity information required for placing an order or the remittance payment information corresponding to paying for an order, in the form of a screenshot from a third-party application, an electronic picture of a paper document, or the like; accordingly, the data related to the order needs to be acquired from that picture.
Obtaining the order data from such pictures involves the rapid extraction and structuring of the key information in transaction orders. It helps merchants quickly collect all transaction orders and complete the statistics of order information, allows them to grasp the overall transaction situation so that corresponding measures such as promotion, delivery and production can be taken, and also lets them keep hold of commercially confidential information quickly while avoiding omissions, leaks and similar situations.
Most existing image-based key information extraction techniques rely on customized templates and extract information from specific regions. This approach requires multiple templates to be manually customized in advance, places high requirements on the bill structure, has poor generalization and robustness and an insufficient degree of intelligence, and in particular lacks universality when bill information extraction services covering multiple languages must be provided for multiple regions.
Disclosure of Invention
A primary object of the present application is to solve the above problems and provide a bill information extraction method, and a corresponding device, computer device, computer-readable storage medium, and computer program product.
To meet the various purposes of the application, the following technical solutions are adopted:
the bill information extraction method adapted to one of the purposes of the application comprises the following steps:
performing character recognition on the bill image to obtain text box position information of each text area in the image and text information of each text area;
acquiring text visual features of corresponding positions according to the text box position information of each text area in the bill image;
fusing text information and text box position information in each text area to obtain second text information characteristics of each text area;
and fusing the text visual features and the second text information features corresponding to the text regions to obtain multi-modal features of the text regions, inputting the multi-modal features into a classification neural network for discrimination, and determining classification labels corresponding to the text information of the text regions.
The bill information extraction device comprises a text recognition module, a visual feature module, an information feature module and a text classification module. The text recognition module is used for performing character recognition on the bill image to obtain text box position information of each text area in the image and text information of each text area; the visual feature module is used for acquiring text visual features of corresponding positions according to the text box position information of each text area in the bill image; the information characteristic module is used for fusing text information and text box position information in each text area to obtain second text information characteristics of each text area; and the text classification module is used for fusing the text visual features and the second text information features corresponding to the text regions to obtain multi-modal features of the text regions, inputting the multi-modal features into the classification neural network for discrimination, and determining the classification labels corresponding to the text information of the text regions.
The computer device comprises a central processing unit and a memory, wherein the central processing unit is used for calling and running a computer program stored in the memory to execute the steps of the bill information extraction method.
A computer-readable storage medium is provided for storing, in the form of computer-readable instructions, a computer program implementing the bill information extraction method; when called and run by a computer, the computer program performs the steps included in the method.
A computer program product adapted to another object of the present application is provided, comprising a computer program/instructions which, when executed by a processor, implement the steps of the bill information extraction method described in any one of the embodiments of the present application.
Compared with the prior art, the technical scheme of the application at least comprises the following technical advantages:
the method comprises the steps of carrying out text recognition on a bill image to obtain text box position information of each text area in the image and text information of each text area; acquiring text visual characteristics of corresponding positions according to text box position information of each text area in the bill image; fusing text information and text box position information in each text area to obtain second text information characteristics of each text area; and finally, fusing the text visual features and the second text information features corresponding to the text regions to obtain multi-modal features of the text regions, inputting the multi-modal features into a classification neural network for discrimination, and determining classification labels corresponding to the text information of the text regions.
According to the method and the device, the text box position information and the text information of the text area in the bill image are automatically extracted based on the text recognition technology, the problem of complex operation of customizing the template in advance is solved, the extraction efficiency of the key information of the image is improved, and the automatic processing performance of the image is enhanced.
The method and the device fuse the position information of the text box and the text information, extract deep features of the text box by adopting a neural network structure, and can fuse and complement the position information of the text and the semantic information of the text, thereby providing more complete text data.
The method and the device fuse visual features of the image and text information features, adopt a neural network structure based on multi-mode feature extraction, can fully utilize information complementation of the text and the image, and provide additional image information supplementation for feature extraction of text data, thereby improving the accuracy of key information extraction and the accuracy of text information classification.
Drawings
The above and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a schematic flow diagram of an exemplary embodiment of a ticket information extraction method of the present application;
FIG. 2 is a schematic flow chart illustrating a process of detecting and recognizing line texts from bill images in the embodiment of the present application;
FIG. 3 is a flowchart illustrating a process of obtaining visual text features of a text area in a document image according to an embodiment of the present application;
FIG. 4 is a flowchart illustrating the operation of obtaining visual features at multiple resolutions according to an embodiment of the present application;
FIG. 5 is a flowchart illustrating a process of acquiring a second text information feature of a text area in a document image according to an embodiment of the present application;
FIG. 6 is a flow diagram illustrating a process for fusing multimodal features and classifications in an embodiment of the present application;
FIG. 7 is a schematic block diagram of a ticket information extraction device of the present application;
fig. 8 is a schematic structural diagram of a computer device used in the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.
It will be understood by those within the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
As will be appreciated by those skilled in the art, "client," "terminal," and "terminal device" as used herein include both devices that are wireless signal receivers, which are devices having only wireless signal receivers without transmit capability, and devices that are receive and transmit hardware, which have receive and transmit hardware capable of two-way communication over a two-way communication link. Such a device may include: cellular or other communication devices such as personal computers, tablets, etc. having single or multi-line displays or cellular or other communication devices without multi-line displays; PCS (Personal Communications Service), which may combine voice, data processing, facsimile and/or data communication capabilities; a PDA (Personal Digital Assistant), which may include a radio frequency receiver, a pager, internet/intranet access, a web browser, a notepad, a calendar and/or a GPS (Global Positioning System) receiver; a conventional laptop and/or palmtop computer or other appliance having and/or including a radio frequency receiver. As used herein, a "client," "terminal device" may be portable, transportable, installed in a vehicle (aeronautical, maritime, and/or land-based), or situated and/or configured to operate locally and/or in a distributed fashion at any other location(s) on earth and/or in space. The "client", "terminal Device" used herein may also be a communication terminal, a web terminal, a music/video playing terminal, such as a PDA, an MID (Mobile Internet Device) and/or a Mobile phone with music/video playing function, and may also be a smart tv, a set-top box, and the like.
The hardware referred to by the names "server", "client", "service node", etc. in the present application is essentially an electronic device with the performance of a personal computer, namely a hardware device having the necessary components described by the von Neumann principle, such as a central processing unit (including an arithmetic unit and a controller), a memory, an input device, and an output device. A computer program is stored in the memory, and the central processing unit loads a program stored in the external memory into the internal memory to run it, executes the instructions in the program, and interacts with the input and output devices, thereby accomplishing specific functions.
It should be noted that the concept of "server" in the present application can be extended to the case of server cluster. According to the network deployment principle understood by those skilled in the art, the servers should be logically divided, and in physical space, the servers may be independent from each other but can be called through an interface, or may be integrated into one physical computer or a set of computer clusters. Those skilled in the art should understand this variation and should not be so constrained as to implement the network deployment of the present application.
One or more technical features of the present application, unless expressly specified otherwise, may be deployed on a server and accessed by a client remotely invoking an online service interface provided by that server, or may be deployed and run directly on the client for access.
Unless expressly specified otherwise, the neural network models referred to, or possibly referred to, in the application can be deployed on a remote server and called remotely from a client, or can be deployed on a client with sufficient device capability for direct invocation.
Unless expressly specified otherwise, the various data referred to in the present application may be stored remotely on a server or on a local terminal device, as long as the data is suitable for being invoked by the technical solution of the present application.
Those skilled in the art will appreciate that, although the various methods of the present application are described based on the same concept so as to be common to one another, they may be performed independently unless otherwise specified. Likewise, each embodiment disclosed in the present application is proposed based on the same inventive concept, and therefore concepts expressed identically, and concepts whose expressions differ but which have been adjusted only for convenience, should be understood equally.
The embodiments to be disclosed herein can be flexibly constructed by cross-combining the related technical features of the embodiments, unless a mutual exclusion relationship between those technical features is expressly stated, as long as the combination does not depart from the inventive spirit of the present application and can meet the needs of the prior art or remedy deficiencies therein. Those skilled in the art will appreciate such variations.
The bill information extraction method can be programmed into a computer program product and is realized by being deployed in a client or a server for running, so that the method can be executed by accessing an open interface after the computer program product runs and performing man-machine interaction with a process of the computer program product through a graphical user interface.
Referring to fig. 1, in an exemplary embodiment of the method for extracting ticket information of the present application, the method includes the following steps:
step S1100, performing character recognition on the bill image to obtain the position information of a text box of each text area in the image and the text information of each text area;
the ticket image, in the exemplary e-commerce scenario of the present application, generally refers to an image of textual information containing order related data. In this application, the order related data includes various types of data related to one or more business links in the e-commerce platform order business process, such as: an off-line remittance payment link of the E-commerce order; the order related data may include related data such as total amount, total name, payment amount name, commission amount name, payment time, other information, and the like. For the logistics link of the e-commerce order, the order related data can comprise different types of data such as express delivery order numbers, logistics carriers and the like. Such can be considered order related data.
It should be noted that "offline" in the present application means offline with respect to e-commerce transactions, i.e. operations not performed within the e-commerce platform of the present application; it does not mean that the user's operation is independent of the internet. For example, in the foregoing example, the user performs the corresponding payment operation through a third-party application program and then takes a screenshot to obtain the bill image; this is regarded as the offline remittance operation referred to in this application.
The bill image may be obtained by cropping, from an original picture submitted by a user of the e-commerce platform, the area where the image containing the order data is located. A picture determined to be a bill image is generally one in which an image containing order data can be identified manually; for the computer device of the e-commerce platform, however, whether a picture contains order data can be judged in advance by technical means. Pictures not containing order data need not be processed further, while for pictures containing order data the bill image therein can be obtained and text recognition processing further performed on it.
In one embodiment, the determination of whether the original picture includes order data may be implemented by a neural network model trained in advance to a convergence state. The neural network model has the capability of performing representation learning on the original picture to obtain corresponding image feature information, which is then input into a binary classifier for a classification decision, so as to determine whether the original picture includes order data and to obtain the bill image from original pictures that do.
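By way of illustration only, such a pre-filter could be sketched as follows; the backbone, the feature width and the two-class head are assumptions, since the application does not fix a concrete architecture for this step.

```python
import torch.nn as nn
from torchvision import models

class OrderDataFilter(nn.Module):
    """Illustrative binary pre-filter: does the original picture contain order data?"""
    def __init__(self):
        super().__init__()
        backbone = models.resnet18(weights=None)   # any backbone trained to convergence would do
        backbone.fc = nn.Identity()                # keep the 512-d image feature vector
        self.backbone = backbone
        self.head = nn.Linear(512, 2)              # two classes: contains / does not contain order data

    def forward(self, picture):                    # picture: (N, 3, H, W), already normalized
        feature = self.backbone(picture)           # image feature information from representation learning
        return self.head(feature).softmax(dim=-1)  # class probabilities of the binary classifier
```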
It is understood that in a bill image containing order data there are text areas, each containing text information, and an image corresponding to each text area, i.e. a text image. When performing text recognition on a bill image containing order data, the text box corresponding to each text area may first be detected to obtain the corresponding text image. A text box may be represented by the coordinate information of its text area, for example by the coordinates of the four corner points of the text area. The corresponding text images can then be cropped from the bill image according to this coordinate information, and text recognition performed on each text image, so that the text information corresponding to each text image is obtained.
The determination of the text images in the bill image and the recognition of the corresponding text information from them can be implemented by conventional optical character recognition (OCR) technology, or by extraction with a neural network model based on deep learning.
Step S1200, acquiring text visual characteristics of corresponding positions according to the text box position information of each text area in the bill image.
For the bill image, firstly, a convolutional neural network is adopted for feature extraction so as to obtain the depth visual features of the bill image. The convolutional neural network can adopt sufficient training samples to carry out prior training to a convergence state, so that the convolutional neural network can extract features capable of representing image depth semantic information from a given bill image, and the convolutional neural network is used in the application. In one exemplary application of the present application, the convolutional neural network employs the ResNet50 model, which is superior in measured performance.
Secondly, according to the depth visual features, downsampling operations of different multiples are adopted to obtain the depth visual features at different resolutions. This gives the network receptive fields of different sizes and captures information at different scales. In the exemplary application of the present application, feature maps at different resolutions are obtained by four-fold, eight-fold, sixteen-fold and thirty-two-fold downsampling operations, and upsampling fusion is then performed to combine key detail features with depth semantic features; the specific implementation steps can be seen in the subsequent step S1220.
Then, according to the text box position information obtained from text detection, the visual features of the corresponding text area are cropped from the visual features at the different resolutions using the ROIAlign method, and serve as the text visual features of that text area. Thus, the one or more text regions in a bill image correspond to one or more sets of text visual features at different resolutions. The text visual features can represent the semantic feature information of the corresponding text regions in the bill image.
S1300, fusing text information and text box position information in each text area to obtain second text information characteristics of each text area;
the method and the device have the advantages that the text visual features are extracted, and meanwhile the text information features are also extracted, so that the fusion supplement of the image visual information, the text information and the text position information is realized, the more complete bill image text feature representation is achieved, and the detection precision is improved.
In the present application, a multi-language dictionary is provided in advance. The text information in each text area is segmented into words, the feature value corresponding to each segmented word is obtained by querying the multi-language dictionary, and these feature values are encoded into a text embedding vector. The multi-language dictionary may consist of multiple basic dictionaries provided independently for different languages, or may be a comprehensive dictionary formed by merging the dictionaries corresponding to all languages.
Secondly, the text box information corresponding to each text area, namely the information of its four position coordinate points, is vectorized through a first fully-connected linear layer to obtain a position embedding vector; meanwhile, the text embedding vector is input into a second fully-connected linear layer, a first one-dimensional convolutional layer Conv1d, a first activation layer ReLU and a first pooling layer MaxPool to obtain a first text information feature.
Then, the text box information and the text information are fused by adding the position embedding vector and the first text information feature together. After fusion, the result is input into a BertLayer network structure pre-trained to convergence to obtain a second text information feature.
And S1400, fusing the text visual features and the second text information features corresponding to the text regions to obtain multi-modal features of the text regions, inputting the multi-modal features into a classification neural network for judgment, and determining classification labels corresponding to the text information of the text regions.
After the text visual feature and the second text information feature of each text region are obtained, the two are fused by additive fusion to obtain a multi-modal feature, achieving a comprehensive feature representation of the image information and the text information; the different modal features complement one another, which makes the probability prediction more robust.
The fused multi-modal features are input into the classification space of a preset classification network to complete the classification mapping, the probabilities of the multi-modal features belonging to each preset class are obtained, and the class with the largest probability value is selected as the final classification label. The preset categories are set by the relevant technician according to the actual application scenario. The number of classes of the classification network is configured at the network training stage, and the network is trained to a convergence state.
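Taken together, steps S1100 to S1400 form one per-image pipeline. The sketch below only illustrates the control flow; each component is passed in as a callable and stands for the corresponding module described above, so none of the names are APIs defined by this application.

```python
def extract_bill_information(bill_image, detector, recognizer,
                             visual_encoder, text_encoder, classifier):
    """Illustrative control flow of steps S1100-S1400; all components are injected callables."""
    boxes = detector(bill_image)                        # S1100: text box position information
    texts = [recognizer(bill_image, b) for b in boxes]  # S1100: text information of each text area
    results = []
    for box, text in zip(boxes, texts):
        visual = visual_encoder(bill_image, box)        # S1200: text visual feature of the region
        textual = text_encoder(text, box)               # S1300: second text information feature
        multimodal = visual + textual                   # S1400: additive multi-modal fusion
        results.append((text, classifier(multimodal)))  # classification label for the region's text
    return results                                      # structured order data
```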
From the above embodiments, it can be seen that the present application has various advantages, including but not limited to:
the method comprises the steps of performing text recognition on a bill image to obtain text box position information of each text area in the image and text information of each text area; acquiring text visual features of corresponding positions according to the text box position information of each text area in the bill image; fusing text information and text box position information in each text area to obtain second text information characteristics of each text area; and finally, fusing the text visual features and the second text information features corresponding to the text regions to obtain multi-modal features of the text regions, inputting the multi-modal features into a classification neural network for discrimination, and determining classification labels corresponding to the text information of the text regions.
According to the method and the device, the text box position information and the text information of the text areas in the bill image are automatically extracted based on text recognition technology, the complex operation of customizing templates in advance in the conventional technology is avoided, the extraction efficiency of the key information of the image is improved, and the automatic processing performance of the image is enhanced.
The method and the device fuse the position information of the text box and the text information, extract deep features of the text box by adopting a neural network structure, and can fuse and complement the position information of the text and the semantic information of the text, thereby providing more complete text data.
The method and the device fuse visual features of the image and text information features, adopt a neural network structure based on multi-mode feature extraction, can fully utilize information complementation of the text and the image, and provide additional image information supplementation for feature extraction of text data, thereby improving the accuracy of key information extraction and the accuracy of text information classification.
Referring to fig. 2, in an embodiment deepened on the basis of any of the above embodiments, the step S1100 performs text recognition on the ticket image, and includes the following steps:
step S1110, calling a text detection model which is trained to a convergence state in advance to detect the bill image, and obtaining text box position information corresponding to each text area;
the method comprises the steps of extracting key information of a bill image, firstly carrying out text detection on the bill image, and accordingly adopting a preset text detection model trained to a convergence state in advance for text detection of the bill image so as to obtain text box coordinate information of a text area in the bill image. The text detection model needs to restrict the size of the input image through the input parameters, so that the bill image containing order data can be preprocessed according to the input parameters, and the bill image is adjusted to a specific size, such as 512 pixels by 512 pixels, through operations such as cutting and scaling, so as to obtain the preprocessed bill image, thereby meeting the input requirements of the text detection model.
The text detection model can be implemented with any basic neural network model, such as a CNN or ResNet, that can perform representation learning on images and realize text detection in combination with a classifier. Likewise, the text detection model is trained to a convergence state in advance; a person skilled in the art can train it with a sufficient number of corresponding training samples so that it learns to recognize the text boxes corresponding to the multiple lines of text images in a given preprocessed bill image.
Step S1120, capturing a text region image corresponding to each text box in the bill image according to the text box position information, inputting the image into a text recognition model pre-trained to a convergence state for text recognition, and obtaining text information corresponding to each text region.
And determining the position of a text area in the corresponding bill image according to the position information of the text box, namely coordinates of four corner points, intercepting the text area image at the position, inputting the text area image into a text recognition model which is pre-trained to be in a convergence state for text recognition, and thus obtaining the text information in the text area.
The text recognition model can be implemented with any basic neural network model capable of text recognition, such as CRNN or AttentionOCR; in the CRNN, the RNN part takes a bidirectional LSTM as its backbone to enhance feature extraction, while the CNN part adopts a common convolutional neural network model.
Similarly, the text recognition model is trained to converge in advance, and can be trained accordingly by those skilled in the art using a sufficient number of corresponding training samples to learn the ability to recognize text information from a given line of text images.
By performing text recognition on the line text image corresponding to each text box through the text recognition model, text information in each text region existing in the bill image can be obtained.
In this embodiment, text detection is first performed on the bill image and the text boxes containing text information are detected, so as to determine the coordinate information of each text area in the bill image. The corresponding text area images are then cropped from the bill image according to the text box coordinate information, and a preset text recognition model performs text recognition on each text area image to obtain the corresponding text content, i.e. the corresponding text information, thereby completing the extraction of the text information corresponding to the order data in the bill image. This provides an effective data source for the rapid and accurate identification of order data, enables automatic extraction of the order data, and enhances the degree of intelligence of the processing.
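As a concrete illustration of the recognition stage, the sketch below follows the CRNN pattern named above: a convolutional feature extractor, a bidirectional LSTM, and per-timestep character logits suitable for CTC decoding. Layer sizes and the character set are assumptions, not the network actually used in this application.

```python
import torch.nn as nn

class TinyCRNN(nn.Module):
    """Generic CRNN recognizer: CNN features -> bidirectional LSTM -> per-step character logits."""
    def __init__(self, num_chars, hidden=256):
        super().__init__()
        self.cnn = nn.Sequential(                       # shrinks height, keeps width as the time axis
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 1)),
            nn.AdaptiveAvgPool2d((1, None)),            # collapse the height dimension to 1
        )
        self.rnn = nn.LSTM(128, hidden, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(hidden * 2, num_chars + 1)  # +1 for the CTC blank symbol

    def forward(self, crop):                            # crop: (N, 3, H, W) text-area image
        f = self.cnn(crop).squeeze(2)                   # (N, 128, W')
        f = f.permute(0, 2, 1)                          # (N, W', 128): a sequence along the width
        seq, _ = self.rnn(f)                            # bidirectional LSTM backbone
        return self.fc(seq)                             # (N, W', num_chars + 1) logits per timestep
```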
Referring to fig. 3, in an embodiment deepened based on any of the above embodiments, the step S1200 of obtaining the visual text feature of the corresponding position according to the text box position information of each text region includes the following steps:
step S1210, obtaining depth visual characteristics of the bill image by adopting a convolution neural network structure pre-trained to a convergence state;
the convolutional neural network structure is suitable for a task of the application and needs to be trained to a convergence state in advance to obtain corresponding characteristic representation capacity, and accordingly, the convolutional neural network structure takes the preprocessed bill image as input and extracts the depth visual characteristic with the key information of the representation image for the bill image.
The convolutional neural network structure is a preferred neural network model implementation, for example in the present embodiment, the convolutional neural network structure is ResNet50 trained to converge. Alternatively, the neural network model may be selected from a variety of well known convolutional neural network structures, including but not limited to: the ResNet model, the EfficientNet series model, the MobileNet series model and the like are mature convolutional neural network structures.
The preprocessed bill image is processed in a preprocessing mode preset by a relevant technician according to an example application scenario, for example, in the embodiment of the present application, the preprocessing mode is as follows: the size specification of the input bill images is uniformly adjusted to 512 × 512, and then normalization processing with a mean value (123.675, 116.28, 103.53) and a variance (58.395, 57.12, 57.375) is performed, so that the normalized images are obtained as the input of the subsequent operation.
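Under the preprocessing convention just stated, the resize-and-normalize step can be sketched as below; the listed variance values are applied here as per-channel standard deviations, which is an assumption about how they are meant to be used.

```python
import torch
import torch.nn.functional as F

MEAN = torch.tensor([123.675, 116.28, 103.53]).view(1, 3, 1, 1)
STD = torch.tensor([58.395, 57.12, 57.375]).view(1, 3, 1, 1)

def preprocess(bill_image):                    # (N, 3, H, W) tensor with 0-255 pixel values
    x = F.interpolate(bill_image.float(), size=(512, 512),
                      mode="bilinear", align_corners=False)
    return (x - MEAN) / STD                    # normalized input for the backbone
```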
Step S1220, visual features under multiple resolutions are obtained according to the depth visual features;
the depth visual features are unified names of visual features extracted from the bill images by the convolutional neural network structure which is pre-trained to be in a convergence state, and the unified names comprise a first feature map, a second feature map, a third feature map and a fourth feature map which are obtained by downsampling operations of different multiples under different resolutions.
Then, the fourth, third and second feature maps are respectively upsampled and additively fused with the third, second and first feature maps, obtaining a fifth, a sixth and a seventh feature map.
And finally, the obtained visual features under different resolutions are used as the input of subsequent operations, and the visual features are respectively a seventh feature map, a sixth feature map, a fifth feature map and a fourth feature map.
Step S1230, obtaining visual features at each resolution in the corresponding position of each text region according to the text box position information of each text region, as the text visual features of each text region.
The position information of the text box after the text detection in step S1100 includes coordinates of four corner points of the text box. And obtaining the visual features of the corresponding text areas in the visual features under different resolutions by adopting a ROIAlign method according to the coordinates of the four corner points of the text box and the size of the preprocessed bill image. Therefore, the first text feature map, the second text feature map, the third text feature map and the fourth text feature map can be obtained according to the seventh feature map, the sixth feature map, the fifth feature map and the fourth feature map.
Accordingly, if one or more text regions exist in one bill image, one or more groups of text feature maps under different resolutions exist correspondingly. The text feature map can represent semantic feature information of a corresponding text region in the bill image. And all the text feature graphs are collectively called as text visual features and used as input of subsequent operations.
In this embodiment, feature extraction is first performed on the bill image to obtain feature maps of the bill image at different resolutions in a deep representation space, and fusion after up- and down-sampling is then carried out to effectively combine deep semantic information with shallow detail information, making the representation capability of the feature maps more sufficient. The text visual features of the text areas in the bill image are then obtained according to the text box position information from the previous step and the ROIAlign method, and serve as the text visual modal information for the subsequent operations. This provides the application with a modal data source of important supplementary value; thanks to the up/down-sampling fusion and the text-box-targeted feature extraction, its feature representation of the text information is more targeted and more sufficient.
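A sketch of the per-region cropping with torchvision's roi_align is given below. It assumes the four corner points of each text box have already been reduced to an axis-aligned (x1, y1, x2, y2) box in the coordinates of the preprocessed image, and that the four fused maps have strides 4, 8, 16 and 32 relative to that image.

```python
import torch
from torchvision.ops import roi_align

def crop_text_features(feature_maps, boxes, strides=(4, 8, 16, 32), out_size=(7, 7)):
    """feature_maps: list of (1, C, H/s, W/s) maps; boxes: (K, 4) in input-image pixels."""
    batch_idx = torch.zeros(boxes.size(0), 1)              # single image in the batch
    rois = torch.cat([batch_idx, boxes.float()], dim=1)    # (K, 5) rows: (batch, x1, y1, x2, y2)
    return [roi_align(fm, rois, output_size=out_size, spatial_scale=1.0 / s)
            for fm, s in zip(feature_maps, strides)]       # one (K, C, 7, 7) crop per resolution
```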
Referring to fig. 4, in an embodiment embodied on the basis of any of the above embodiments, the step S1220 of obtaining visual features at multiple resolutions according to the depth visual feature includes the following steps:
step S1221, obtaining a first feature map, a second feature map, a third feature map and a fourth feature map under different resolutions by adopting downsampling operations of different multiples;
according to step S1210, depth visual features of the input preprocessed bill image may be obtained, which includes obtaining feature maps at different resolutions by down-sampling operations of different multiples. In the embodiment of the present application, the first feature map img4x after four times of down-sampling, the second feature map img8x after eight times of down-sampling, the third feature map img16x after sixteen times of down-sampling, and the fourth feature map img32x after thirty-two times of down-sampling are included.
Step S1222, performing an up-sampling operation on the fourth feature map, and adding and fusing the fourth feature map and the third feature map to obtain a fifth feature map;
the fourth feature map img32x obtained in step S1221 is subjected to an upsampling operation and then is additively fused with the third feature map img16x to obtain a fifth feature map img16x _ new.
Step S1223, performing up-sampling operation on the third feature map, and adding and fusing the third feature map and the second feature map to obtain a sixth feature map;
the third feature map img16x obtained in step S1221 is subjected to an upsampling operation, and then is added and fused with the second feature map img8x to obtain a sixth feature map img8x _ new.
Step S1224, performing an upsampling operation on the second feature map, and performing an additive fusion with the first feature map to obtain a seventh feature map;
the second feature map img8x obtained in step S1221 is subjected to an upsampling operation, and then is added and fused with the first feature map img4x, so that a seventh feature map img4x _ new is obtained.
And step S1225, taking the seventh feature map, the sixth feature map, the fifth feature map and the fourth feature map as visual features under corresponding resolutions.
The seventh feature map img4x_new, the sixth feature map img8x_new, the fifth feature map img16x_new and the fourth feature map img32x, obtained in steps S1224, S1223, S1222 and S1221 respectively, serve as the input of the subsequent step S1230.
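Steps S1222 to S1224 amount to an upsample-and-add pass over the feature pyramid. A minimal sketch follows, assuming all four maps already share the same channel count (the application does not state how channel dimensions are matched); as described above, the original maps, not the newly fused ones, are upsampled.

```python
import torch.nn.functional as F

def fuse_pyramid(img4x, img8x, img16x, img32x):
    """Upsample-and-add fusion of steps S1222-S1224 (channel counts assumed equal)."""
    up2 = lambda x: F.interpolate(x, scale_factor=2, mode="nearest")
    img16x_new = img16x + up2(img32x)   # fifth feature map (step S1222)
    img8x_new = img8x + up2(img16x)     # sixth feature map (step S1223)
    img4x_new = img4x + up2(img8x)      # seventh feature map (step S1224)
    return img4x_new, img8x_new, img16x_new, img32x   # inputs of step S1230 (step S1225)
```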
In the embodiment, the depth visual features are extracted on different resolution scales to obtain visual features under multiple resolutions, so that richer deep information is obtained.
Referring to fig. 5, in an embodiment deepened based on any of the above embodiments, the step S1300 of fusing the text information and the text box position information in each text area to obtain the second text information feature of each text area includes the following steps:
Step S1310, encoding text information in each text area according to a preset language dictionary to obtain text embedding vectors of each text area;
the preset language dictionary can correspond to a plurality of basic dictionaries independently provided in different languages, and dictionaries corresponding to all languages can be collected into the same comprehensive dictionary. The preset of which can be set by the skilled person according to the actual application scenario.
In this embodiment, first, text segmentation is performed on text information in a text region, feature values corresponding to each segmented word are obtained by searching in the multi-language dictionary, and a text embedding vector of the text region is formed by the feature values, so as to be used in step S1330.
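A minimal sketch of this dictionary-based embedding step is given below; the whitespace tokenizer, the dictionary contents and the embedding width are placeholder assumptions, since the application leaves the segmentation and feature-value details open.

```python
import torch
import torch.nn as nn

class TextEmbedder(nn.Module):
    """Looks up segmented tokens in a (multi-language) dictionary and embeds their ids."""
    def __init__(self, vocab, dim=128):
        super().__init__()
        self.vocab = vocab                                  # token -> integer id, ids starting at 1
        self.embed = nn.Embedding(len(vocab) + 1, dim)      # id 0 is reserved for unknown tokens

    def forward(self, text):
        ids = [self.vocab.get(tok, 0) for tok in text.split()]   # naive whitespace segmentation
        return self.embed(torch.tensor(ids))                # (num_tokens, dim) text embedding vector
```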
Step S1320, coding the text box position information of each text area to obtain a position embedding vector;
The text box position information of a text area comes from the text boxes obtained from the bill image by text detection in step S1110. The position information of each text box comprises the coordinate values of its four corner points, which describe the position of the text area in the bill image and carry potentially useful key information, so it is worth extracting features from it. The text box position information of the text area is directly converted into feature vector form, yielding the feature information of the coordinate information, namely the position embedding vector.
Step S1330, inputting the text embedding vector into a convolution neural network structure which is pre-trained to be convergent to obtain a first text information characteristic;
The text embedding vector obtained in step S1310 is input into a convolutional neural network structure pre-trained to convergence, which may be configured by the relevant technician according to the actual business scenario. In this embodiment, the structure comprises a fully-connected linear layer, a one-dimensional convolutional layer Conv1d, an activation layer ReLU and a pooling layer MaxPool, and its output is the first text information feature.
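One possible shape of this layer stack is sketched below; the feature width, kernel size and the pooling that collapses the token axis are assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

class FirstTextFeature(nn.Module):
    """Fully-connected layer -> Conv1d -> ReLU -> MaxPool over the token axis (sizes illustrative)."""
    def __init__(self, dim=128):
        super().__init__()
        self.fc = nn.Linear(dim, dim)                       # the fully-connected linear layer
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.act = nn.ReLU()

    def forward(self, token_embeddings):                    # (num_tokens, dim) text embedding vector
        x = self.fc(token_embeddings)
        x = x.transpose(0, 1).unsqueeze(0)                  # (1, dim, num_tokens) for Conv1d
        x = self.act(self.conv(x))
        x = F.max_pool1d(x, kernel_size=x.size(-1))         # pool away the token axis
        return x.squeeze(-1).squeeze(0)                     # (dim,) first text information feature
```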
Step S1340, fusing the position embedded vector and the first text information characteristic, and inputting the position embedded vector and the first text information characteristic into a network structure which is pre-trained to be converged to obtain a second text information characteristic;
and fusing the text box information and the text information by adopting an addition and summarization mode on the position embedding vector and the first text information characteristic. And after fusion, inputting the second text information into a pretrained to converged convolutional neural network structure Bertlayer to obtain the second text information characteristic.
The second text information characteristic integrates the content information of the text and the position information of the text in the bill image, and has more complete related data from the viewpoint of data source information.
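Step S1340 sums the position embedding with the first text information feature and passes the result through a BertLayer-style encoder block. In the sketch below a standard transformer encoder layer stands in for that block, purely for illustration; the corner-point width, feature size and head count are assumptions.

```python
import torch.nn as nn

class SecondTextFeature(nn.Module):
    """Box-position embedding + first text feature -> encoder block (stand-in for the BertLayer)."""
    def __init__(self, dim=128):
        super().__init__()
        self.pos_fc = nn.Linear(8, dim)               # 8 = x,y coordinates of the four corner points
        self.encoder = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)

    def forward(self, box_corners, first_text_feature):    # (8,) and (dim,)
        pos = self.pos_fc(box_corners)                      # position embedding vector
        fused = (pos + first_text_feature).view(1, 1, -1)   # additive fusion, add batch/sequence dims
        return self.encoder(fused).view(-1)                 # (dim,) second text information feature
```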
In the embodiment, the text position coordinate information and the text content information of the text region are subjected to feature extraction and feature fusion, so that on one hand, deep spatial representation of the text information can be obtained, and on the other hand, the position information of the text region in the bill image can provide some auxiliary judgment information to a certain extent; thus making its characterization capabilities more robust.
Referring to fig. 6, in an embodiment deepened on the basis of any of the above embodiments, in step S1400, fusing the text visual features and the second text information features corresponding to each text region to obtain multi-modal features of each text region, inputting the multi-modal features into a classification neural network for discrimination, and determining a classification label corresponding to text information of each text region, the method includes the following steps:
step S1410, adding the text visual features of the text regions to the corresponding second text information features to obtain multi-modal features of the corresponding text regions;
the text visual characteristic represents key information in a text region image in the bill image, and the second text information characteristic represents key information of text content information and text position information in the text region in the bill image; the former characterizes image understanding, the latter characterizes text understanding; the two are feature representations of different modes, and can play a role in mutual information supplement and mutual promotion.
There are many ways of fusing multi-modal features; common ones are additive fusion, weighted fusion and fusion through a neural network structure. The specific fusion mode can be chosen by the relevant technical personnel according to the actual application scenario and business requirements; in the embodiment of the present application the additive fusion mode is adopted, which is simpler to operate and achieves the expected effect. The multi-modal features of the text regions in the bill image are thus obtained.
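Because the per-region visual crops and the second text information feature have different shapes, additive fusion needs them brought to a common width first; the pooling and projection in the sketch below are assumptions that the application leaves open.

```python
import torch.nn as nn

class MultiModalFusion(nn.Module):
    """Pools a (C, 7, 7) visual crop, projects it to the text width, and adds the two features."""
    def __init__(self, visual_channels=256, dim=128):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.proj = nn.Linear(visual_channels, dim)

    def forward(self, visual_crop, second_text_feature):    # (C, 7, 7) and (dim,)
        v = self.pool(visual_crop.unsqueeze(0)).flatten(1)   # (1, C)
        v = self.proj(v).squeeze(0)                          # (dim,) projected text visual feature
        return v + second_text_feature                       # multi-modal feature of the text region
```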
Step S1420, inputting the multi-modal characteristics into a classification neural network which is pre-trained to be convergent, and obtaining corresponding probabilities of a plurality of preset categories;
the multi-modal features obtained in step S1410 are input into a pre-trained to converged classification neural network, which has been pre-configured as a classification number at the stage of participating in network training, to perform classification mapping, so that the probabilities of the respective classes can be obtained sequentially correspondingly.
Step S1430, taking the category with the highest probability value as the classification label of the text information in the corresponding text area.
And according to the category probabilities of the preset number obtained in the step S1420, selecting a category with the highest probability value as the category label of the current text region, thereby completing classification of the text information of the text region in the bill image, obtaining corresponding structured data, and using the structured data as order data.
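The classification step itself reduces to a linear mapping over the preset label set followed by selecting the most probable class; the label names and feature width below are purely illustrative.

```python
import torch.nn as nn

LABELS = ["total_amount", "payment_time", "order_number", "other"]   # example preset categories

classifier = nn.Sequential(nn.Linear(128, len(LABELS)), nn.Softmax(dim=-1))

def classify_region(multimodal_feature):            # (128,) fused feature of one text region
    probs = classifier(multimodal_feature.unsqueeze(0)).squeeze(0)   # probability of each category
    return LABELS[int(probs.argmax())]               # category with the highest probability value
```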
In the embodiment, the visual features and the text information features of the image are fused, a neural network structure based on multi-modal feature extraction is adopted, the information complementation of the text and the image can be fully utilized, and the additional image information complementation is provided for the feature extraction of the text data, so that the accuracy of key information extraction and the accuracy of text information classification are improved.
Referring to fig. 7, a bill information extraction apparatus adapted to one of the purposes of the present application is a functional implementation of the bill information extraction method of the present application, and the apparatus includes a text recognition module 1100, a visual feature module 1200, a text feature module 1300, and a text classification module 1400. The text recognition module 1100 is configured to perform text recognition on a ticket image to obtain text box position information of each text region in the image and text information of each text region; the visual feature module 1200 is configured to obtain a text visual feature of a corresponding position according to the text box position information of each text region; the text feature module 1300 is configured to fuse text information and text box position information in each text region to obtain a second text information feature of each text region; the text classification module 1400 integrates the text visual features and the second text information features corresponding to each text region to obtain multi-modal features of each text region, inputs the multi-modal features into the classification neural network for discrimination, and determines the classification label corresponding to the text information of each text region.
In an embodiment deepened on the basis of any of the above embodiments, the text recognition module 1100 includes: the text detection unit is used for calling a pre-trained text detection model to detect the bill image and acquiring the position information of the text box corresponding to each text area; and the text recognition unit is used for intercepting the text region images corresponding to the text boxes in the bill images according to the text box position information, inputting the text region images into a text recognition model which is pre-trained to a convergence state for text recognition, and obtaining the text information corresponding to the text regions.
In an embodiment deepened on the basis of any of the above embodiments, the visual feature module 1200 includes: the depth visual feature acquisition unit is used for acquiring the depth visual features of the bill images by adopting a convolutional neural network structure which is pre-trained to a convergence state; a multi-resolution obtaining unit, configured to obtain visual features at multiple resolutions according to the depth visual features; and the text visual characteristic acquisition unit is used for acquiring visual characteristics under each resolution in the corresponding position of each text area as the text visual characteristics of each text area according to the text box position information of each text area.
In an embodiment as embodied on the basis of any of the above embodiments, the multi-resolution obtaining unit includes: the down-sampling sub-unit is used for obtaining a first feature map, a second feature map, a third feature map and a fourth feature map under different resolutions by adopting down-sampling operation of different multiples; a fifth feature map obtaining subunit, configured to perform upsampling operation on the fourth feature map, add and fuse the upsampled operation with the third feature map, and obtain a fifth feature map; a sixth feature map obtaining subunit, configured to perform upsampling operation on the third feature map, and add and fuse the third feature map and the second feature map to obtain a sixth feature map; a seventh feature map obtaining subunit, configured to perform upsampling operation on the second feature map, add and fuse the second feature map and the first feature map, and obtain a seventh feature map; the multi-resolution visual feature subunit is used for taking the seventh feature map, the sixth feature map, the fifth feature map and the fourth feature map as visual features under corresponding resolutions;
In an embodiment further developed on the basis of any of the above embodiments, the text feature module 1300 includes: a text embedding unit, configured to encode the text information of each text region according to a preset language dictionary to obtain a text embedding vector of each text region; a position embedding unit, configured to encode the text box position information of each text region to obtain a position embedding vector; a first text information feature unit, configured to input the text embedding vector into a convolutional neural network structure pre-trained to convergence to obtain a first text information feature; and a second text information feature unit, configured to fuse the position embedding vector with the first text information feature and input the result into a network structure pre-trained to convergence to obtain a second text information feature.
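A minimal sketch of this text-feature path, assuming the language dictionary maps characters to integer ids, a 1-D convolution produces the first text information feature, and a linear layer over the concatenation of that feature with the position embedding produces the second text information feature. The class name, layer widths, and the concatenation-based fusion are hypothetical choices for illustration.

```python
import torch
import torch.nn as nn

# Hedged sketch of the text feature module; every size below is a placeholder.
class TextFeature(nn.Module):
    def __init__(self, vocab_size=5000, dim=128):
        super().__init__()
        self.char_emb = nn.Embedding(vocab_size, dim)   # language-dictionary lookup
        self.pos_emb = nn.Linear(4, dim)                # encode (x1, y1, x2, y2) box
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, char_ids, boxes):
        # char_ids: (N, L) token ids per region; boxes: (N, 4) float coordinates
        text = self.char_emb(char_ids).transpose(1, 2)      # (N, dim, L)
        first = self.conv(text).max(dim=2).values           # first text information feature
        pos = self.pos_emb(boxes)                           # position embedding vector
        second = self.fuse(torch.cat([first, pos], dim=1))  # second text information feature
        return second
```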
In an embodiment further developed on the basis of any of the above embodiments, the text classification module 1400 includes: a fusion unit, configured to add the text visual features of each text region to the corresponding second text information features to obtain the multi-modal features of the corresponding text region; a probability calculation unit, configured to input the multi-modal features into a classification neural network pre-trained to convergence to obtain the corresponding probabilities of a plurality of preset categories; and a classification unit, configured to select the category with the highest probability value as the classification label of the text information in the corresponding text region.
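The fusion-and-classification step can be sketched as follows, assuming the two modalities were projected to the same width so that element-wise addition is valid; the hidden layer and the number of preset categories are placeholders.

```python
import torch.nn as nn

# Minimal sketch of the fusion unit, probability calculation unit and
# classification unit; layer sizes and class count are placeholders.
class RegionClassifier(nn.Module):
    def __init__(self, dim=128, num_classes=10):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                  nn.Linear(dim, num_classes))

    def forward(self, visual_feat, text_feat):
        multimodal = visual_feat + text_feat            # per-region multi-modal feature
        probs = self.head(multimodal).softmax(dim=-1)   # probability of each preset category
        return probs.argmax(dim=-1), probs              # label = highest-probability category
```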
In order to solve the above technical problem, an embodiment of the present application further provides a computer device. Fig. 8 schematically illustrates the internal structure of the computer device. The computer device includes a processor, a computer-readable storage medium, a memory, and a network interface connected through a system bus. The computer-readable storage medium of the computer device stores an operating system, a database, and computer-readable instructions; the database may store control information sequences, and the computer-readable instructions, when executed by the processor, may cause the processor to implement a bill information extraction method. The processor of the computer device provides computing and control capabilities and supports the operation of the entire computer device. The memory of the computer device may store computer-readable instructions which, when executed by the processor, may cause the processor to perform the bill information extraction method of the present application. The network interface of the computer device is used for connecting to and communicating with a terminal. Those skilled in the art will appreciate that the structure shown in fig. 8 is merely a block diagram of part of the structure related to the solution of the present application and does not limit the computer device to which the solution of the present application is applied; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In this embodiment, the processor is configured to execute the specific functions of each module and its submodules shown in fig. 7, and the memory stores the program code and the various data required for executing these modules or submodules. The network interface is used for data transmission to and from a user terminal or a server. The memory in this embodiment stores the program code and data required for executing all of the modules and submodules of the bill information extraction apparatus of the present application, and the server can call its program code and data to execute the functions of all of the submodules.
The present application further provides a storage medium storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the bill information extraction method of any embodiment of the present application.
The present application also provides a computer program product comprising computer programs/instructions which, when executed by one or more processors, implement the steps of the method as described in any of the embodiments of the present application.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments may be implemented by a computer program, which may be stored in a computer-readable storage medium; when the computer program is executed, the processes of the embodiments of the above methods may be included. The storage medium may be a computer-readable storage medium such as a magnetic disk, an optical disk, a read-only memory (ROM), or a random access memory (RAM).
In summary, the present application can automatically recognize order data expressed in multiple languages from a bill image, so that the order data can be directly used by an e-commerce order business process. This improves the efficiency of recognizing order data from pictures in order to execute the e-commerce order business process, enhances the automated processing capability of the e-commerce order business process, and facilitates an e-commerce order business process that combines online and offline operations. By fusing the visual features of the image with the features of the text information and adopting a neural network structure based on multi-modal feature extraction, the present application effectively improves both the accuracy of extracting key information from the bill image and the accuracy of classifying the text information.
Those skilled in the art will appreciate that the various operations, methods, and steps in the processes, acts, or solutions discussed in this application may be interchanged, modified, combined, or deleted. Furthermore, other steps, measures, or schemes in the various operations, methods, or flows discussed in this application may also be interchanged, modified, rearranged, decomposed, combined, or deleted. Furthermore, steps, measures, or schemes in the prior art that form part of the various operations, methods, or flows disclosed in this application may also be interchanged, modified, rearranged, decomposed, combined, or deleted.
The foregoing describes only some embodiments of the present application. It should be noted that those skilled in the art may make various improvements and modifications without departing from the principle of the present application, and these improvements and modifications shall also fall within the protection scope of the present application.

Claims (10)

1. A bill information extraction method is characterized by comprising the following steps:
performing character recognition on the bill image to obtain text box position information of each text area in the image and text information of each text area;
acquiring text visual characteristics of corresponding positions according to text box position information of each text area in the bill image;
fusing text information and text box position information in each text area to obtain second text information characteristics of each text area;
and fusing the text visual features and the second text information features corresponding to the text regions to obtain multi-modal features of the text regions, inputting the multi-modal features into a classification neural network for discrimination, and determining classification labels corresponding to the text information of the text regions.
2. The bill information extraction method according to claim 1, wherein the character recognition of the bill image comprises the steps of:
calling a text detection model which is trained to a convergence state in advance to detect the bill image, and obtaining the position information of the text box corresponding to each text area;
and cropping, according to the text box position information, the text region image corresponding to each text box from the bill image, inputting the text region image into a text recognition model pre-trained to a convergence state for text recognition, and obtaining the text information corresponding to each text region.
3. The bill information extraction method according to claim 1, wherein the text visual feature of the corresponding position is obtained based on the text box position information of each text region in the bill image, comprising the steps of:
obtaining deep visual features of the bill image by adopting a convolutional neural network structure which is pre-trained to a convergence state;
obtaining visual features at a plurality of resolutions according to the deep visual features;
and acquiring, according to the text box position information of each text area, the visual features at each resolution within the corresponding position of each text area as the text visual features of each text area.
4. The bill information extraction method according to claim 3, wherein obtaining visual features at a plurality of resolutions based on the deep visual features comprises the steps of:
adopting downsampling operation of different multiples to obtain a first feature map, a second feature map, a third feature map and a fourth feature map under different resolutions;
carrying out up-sampling operation on the fourth feature map, and adding and fusing the fourth feature map and the third feature map to obtain a fifth feature map;
performing up-sampling operation on the third feature map, and adding and fusing the third feature map and the second feature map to obtain a sixth feature map;
performing an up-sampling operation on the second feature map, and adding and fusing it with the first feature map to obtain a seventh feature map;
and taking the seventh feature map, the sixth feature map, the fifth feature map and the fourth feature map as visual features under corresponding resolutions.
5. The bill information extraction method according to claim 1, wherein the text information and the text box position information in each text area are fused to obtain the second text information characteristic of each text area, comprising the steps of:
encoding the text information in each text area according to a preset language dictionary to obtain a text embedding vector of each text area;
encoding the text box position information of each text area to obtain a position embedding vector;
inputting the text embedding vector into a convolution neural network structure which is pre-trained to be convergent to obtain a first text information characteristic;
and fusing the position embedding vector and the first text information characteristic, and inputting the position embedding vector and the first text information characteristic into a network structure which is pre-trained to be converged to obtain a second text information characteristic.
6. The bill information extraction method according to claim 1, wherein the multi-modal features of each text region are obtained by fusing the text visual features and the second text information features corresponding to each text region, and the multi-modal features are input to a classification neural network for discrimination, and the classification label corresponding to the text information of each text region is determined, comprising the steps of:
adding the corresponding second text information characteristic to the text visual characteristic of each text region to obtain the multi-modal characteristic of the corresponding text region;
inputting the multi-modal features into a classification neural network which is pre-trained to be convergent, and obtaining corresponding probabilities of a plurality of preset classes;
and selecting the category with the highest probability value as a classification label of the text information in the corresponding text area.
7. A bill information extraction apparatus, characterized by comprising:
the text recognition module is used for carrying out character recognition on the bill image to obtain text box position information of each text area in the image and text information of each text area;
the visual feature module is used for acquiring the text visual features of corresponding positions according to the text box position information of each text area in the bill image;
the text feature module is used for fusing the text information and the text box position information in each text area to obtain second text information features of each text area;
and the text classification module is used for fusing the text visual features and the second text information features corresponding to the text regions to obtain multi-modal features of the text regions, inputting the multi-modal features into the classification neural network for discrimination, and determining the classification labels corresponding to the text information of the text regions.
8. A computer device comprising a central processor and a memory, characterized in that the central processor is adapted to invoke execution of a computer program stored in the memory to perform the steps of the method according to any one of claims 1 to 7.
9. A computer-readable storage medium, characterized in that it stores, in the form of computer-readable instructions, a computer program implemented according to the method of any one of claims 1 to 7, which, when invoked by a computer, performs the steps comprised by the corresponding method.
10. A computer program product comprising computer program/instructions, characterized in that the computer program/instructions, when executed by a processor, implement the steps of the method as claimed in any one of claims 1 to 7.
CN202210887294.9A 2022-07-26 2022-07-26 Bill information extraction method and device, equipment, medium and product thereof Pending CN115205884A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210887294.9A CN115205884A (en) 2022-07-26 2022-07-26 Bill information extraction method and device, equipment, medium and product thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210887294.9A CN115205884A (en) 2022-07-26 2022-07-26 Bill information extraction method and device, equipment, medium and product thereof

Publications (1)

Publication Number Publication Date
CN115205884A true CN115205884A (en) 2022-10-18

Family

ID=83583814

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210887294.9A Pending CN115205884A (en) 2022-07-26 2022-07-26 Bill information extraction method and device, equipment, medium and product thereof

Country Status (1)

Country Link
CN (1) CN115205884A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116152833A (en) * 2022-12-30 2023-05-23 北京百度网讯科技有限公司 Training method of form restoration model based on image and form restoration method
CN116152833B (en) * 2022-12-30 2023-11-24 北京百度网讯科技有限公司 Training method of form restoration model based on image and form restoration method

Similar Documents

Publication Publication Date Title
CN112685565B (en) Text classification method based on multi-mode information fusion and related equipment thereof
CN113837102B (en) Image-text fusion classification method and device, equipment, medium and product thereof
CN114155543A (en) Neural network training method, document image understanding method, device and equipment
CN112380870A (en) User intention analysis method and device, electronic equipment and computer storage medium
CN112085088A (en) Image processing method, device, equipment and storage medium
CN113869048A (en) Commodity object searching method and device, equipment, medium and product thereof
CN113515942A (en) Text processing method and device, computer equipment and storage medium
CN113850201A (en) Cross-modal commodity classification method and device, equipment, medium and product thereof
CN113762050B (en) Image data processing method, device, equipment and medium
CN113239807B (en) Method and device for training bill identification model and bill identification
CN114037985A (en) Information extraction method, device, equipment, medium and product
CN116226785A (en) Target object recognition method, multi-mode recognition model training method and device
CN114429633A (en) Text recognition method, model training method, device, electronic equipment and medium
WO2021237227A1 (en) Method and system for multi-language text recognition model with autonomous language classification
CN114863440A (en) Order data processing method and device, equipment, medium and product thereof
CN114782943A (en) Bill information extraction method and device, equipment, medium and product thereof
CN113711232A (en) Object detection and segmentation for inking applications
CN112801099A (en) Image processing method, device, terminal equipment and medium
CN115205884A (en) Bill information extraction method and device, equipment, medium and product thereof
CN114282019A (en) Target multimedia data searching method and device, computer equipment and storage medium
CN113326701A (en) Nested entity recognition method and device, computer equipment and storage medium
CN113255501A (en) Method, apparatus, medium, and program product for generating form recognition model
CN112668341A (en) Text regularization method, device and equipment and readable storage medium
CN116844573A (en) Speech emotion recognition method, device, equipment and medium based on artificial intelligence
CN116246287A (en) Target object recognition method, training device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination