CN117373046A - Information extraction method, device, computer equipment and storage medium


Info

Publication number: CN117373046A
Authority: CN (China)
Prior art keywords: picture, information, sub, text, identified
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Application number: CN202311440396.7A
Other languages: Chinese (zh)
Inventors: 熊玉竹, 周红林
Current assignee: Qichacha Technology Co., Ltd. (the listed assignee may be inaccurate; Google has not performed a legal analysis)
Original assignee: Qichacha Technology Co., Ltd.
Application filed by Qichacha Technology Co., Ltd.
Priority to CN202311440396.7A
Publication of CN117373046A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
        • G06V30/10 Character recognition
            • G06V30/14 Image acquisition
                • G06V30/1444 Selective acquisition, locating or processing of specific regions, e.g. highlighted text, fiducial marks or predetermined fields
                • G06V30/148 Segmentation of character regions
                    • G06V30/15 Cutting or merging image elements, e.g. region growing, watershed or clustering-based techniques
        • G06V30/40 Document-oriented image-based pattern recognition
            • G06V30/41 Analysis of document content
                • G06V30/413 Classification of content, e.g. text, photographs or tables
                • G06V30/414 Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text

Landscapes

  • Engineering & Computer Science
  • Computer Vision & Pattern Recognition
  • Physics & Mathematics
  • General Physics & Mathematics
  • Multimedia
  • Theoretical Computer Science
  • Artificial Intelligence
  • Computer Graphics
  • Geometry
  • Character Input

Abstract

The application relates to an information extraction method, an information extraction device, a computer device and a storage medium. The method comprises the following steps: acquiring a picture to be identified; inputting the picture to be identified into a preset text line detection model to obtain the position frame, output by the text line detection model, of each piece of text content in the picture to be identified; dividing the picture to be identified into a plurality of sub-pictures based on the position frames, and determining picture information of the sub-pictures; performing text recognition on the text content in the sub-pictures to obtain text information and position information corresponding to each of the plurality of sub-pictures; inputting the text information, position information and picture information corresponding to each sub-picture into a category prediction model, and determining key information of each sub-picture; and splicing the key information of the sub-pictures to obtain the information in the picture to be identified. With this method, key information in a text can be extracted effectively.

Description

Information extraction method, device, computer equipment and storage medium
Technical Field
The present disclosure relates to the field of image processing technologies, and in particular, to an information extraction method, an information extraction device, a computer device, and a storage medium.
Background
Current business involves a large and complicated volume of information, so manually extracting key information, for example from invoices and bills, is slow. In the related art, text recognition can be performed on a document; however, in some places two lines of characters are so close to each other that recognition is affected, and parts of a document can be very blurry, which affects the recognition accuracy of the text.
Disclosure of Invention
Based on this, it is necessary to provide an information extraction method that addresses the above technical problems.
In a first aspect, the present application provides an information extraction method. The method comprises the following steps:
acquiring a picture to be identified;
inputting the picture to be identified into a preset text line detection model to obtain a position frame of each text content in the picture to be identified, which is output by the text line detection model;
dividing the picture to be identified into a plurality of sub-pictures based on the position frame, and determining picture information of the sub-pictures;
text recognition is carried out on the text content in the sub-pictures, and text information and position information corresponding to the plurality of sub-pictures are obtained;
inputting text information, position information and picture information corresponding to each sub-picture into a category prediction model, and determining key information of each sub-picture;
and splicing the key information of each sub-picture to obtain the information in the picture to be identified.
In one embodiment, after the obtaining the position frame of each text content in the picture to be identified output by the text line detection model, the method further includes:
judging whether intersection exists between two adjacent position frames;
screening the two adjacent position frames under the condition that intersection exists between the two adjacent position frames;
and combining the two screened adjacent position frames to obtain a combined position frame.
In one embodiment, the determining the key information of each sub-picture includes:
dividing characters in the text information based on the position information and the text information of the sub-picture, and identifying association relations among the characters;
based on the category prediction model, the classification of each character in the sub-picture is realized;
taking a category whose occurrence count meets a preset category threshold as the representative category of each sub-picture;
and acquiring key information in the sub-picture based on the representative category.
In one embodiment, before the text information, the position information and the picture information corresponding to each sub-picture are input into the category prediction model, the method further includes:
scaling the sub-picture based on the picture information;
scaling the text in the sub-picture in equal proportion based on the position information and the scaling of the sub-picture;
judging the length of characters in the text information, and deleting the characters in the intermediate field when the length of the characters exceeds a preset length threshold value.
In one embodiment, the merging the two screened adjacent position frames to obtain the merged position frame includes:
obtaining vertex coordinates of a first position frame and a second position frame, wherein the two adjacent position frames comprise the first position frame and the second position frame;
and determining boundary coordinates of the first position frame and the second position frame based on the vertex coordinates to obtain a combined position frame.
In a second aspect, the present application further provides an information extraction apparatus, the apparatus including:
the acquisition module is used for acquiring the picture to be identified;
the output module is used for inputting the picture to be identified into a preset text line detection model to obtain a position frame of each text content in the picture to be identified, which is output by the text line detection model;
the dividing module is used for dividing the picture to be identified into a plurality of sub-pictures based on the position frame and determining picture information of the sub-pictures;
the identification module is used for carrying out text identification on the text content in the sub-pictures to obtain text information and position information corresponding to the plurality of sub-pictures;
the determining module is used for inputting text information, position information and picture information corresponding to each sub-picture into the category prediction model and determining key information of each sub-picture;
and the splicing module is used for splicing the key information of each sub-picture to obtain the information in the picture to be identified.
In one embodiment, after the obtaining the position frame of each text content in the picture to be identified output by the text line detection model, the apparatus further includes:
judging whether intersection exists between two adjacent position frames;
screening the two adjacent position frames under the condition that intersection exists between the two adjacent position frames;
and combining the two screened adjacent position frames to obtain a combined position frame.
In one embodiment, the determining the key information of each sub-picture includes:
dividing characters in the text information based on the position information and the text information of the sub-picture, and identifying association relations among the characters;
based on the category prediction model, the classification of each character in the sub-picture is realized;
taking a category whose occurrence count meets a preset category threshold as the representative category of each sub-picture;
and acquiring key information in the sub-picture based on the representative category.
In one embodiment, before the text information, the position information and the picture information corresponding to each sub-picture are input into the class prediction model, the apparatus further includes:
scaling the sub-picture based on the picture information;
scaling the text in the sub-picture in equal proportion based on the position information and the scaling of the sub-picture;
judging the length of characters in the text information, and deleting the characters in the intermediate field when the length of the characters exceeds a preset length threshold value.
In one embodiment, the merging the two screened adjacent position frames to obtain the merged position frame includes:
obtaining vertex coordinates of a first position frame and a second position frame, wherein the two adjacent position frames comprise the first position frame and the second position frame;
and determining boundary coordinates of the first position frame and the second position frame based on the vertex coordinates to obtain a combined position frame.
In a third aspect, the present disclosure also provides a computer device. The computer device comprises a memory storing a computer program and a processor that, when executing the computer program, implements the steps of the information extraction method.
In a fourth aspect, the present disclosure also provides a computer-readable storage medium. The computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the information extraction method.
In a fifth aspect, the present disclosure also provides a computer program product. The computer program product comprises a computer program which, when executed by a processor, implements the steps of the information extraction method.
The information extraction method has at least the following beneficial effects:
according to the embodiments provided by the disclosure, the picture to be identified can be analyzed to obtain its text information, position information and picture information, and key information is extracted based on a comprehensive judgment over the text information, the position information and the picture information, so that key information can be obtained with high recognition accuracy.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure or of the conventional techniques, the drawings required for describing the embodiments or the conventional techniques are briefly introduced below. It is apparent that the drawings in the following description show only some embodiments of the present disclosure, and that those of ordinary skill in the art may obtain other drawings from them without inventive effort.
FIG. 1 is an application environment diagram of an information extraction method in one embodiment;
FIG. 2 is a flow chart of a method of information extraction in one embodiment;
FIG. 3 is a schematic diagram of a method of information extraction in one embodiment;
FIG. 4 is a block diagram showing the structure of an information extracting apparatus in one embodiment;
FIG. 5 is an internal block diagram of a computer device in one embodiment;
fig. 6 is an internal structural diagram of a server in one embodiment.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions of the present disclosure, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the foregoing figures are used to distinguish between similar objects and not necessarily to describe a particular sequence or chronological order. It is to be understood that data so used may be interchanged where appropriate, so that the embodiments of the disclosure described herein can be practiced in sequences other than those illustrated or described herein. The implementations described in the following exemplary examples do not represent all implementations consistent with the present disclosure; rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the appended claims. The terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such a process, method, article, or apparatus. Without further limitation, the presence of additional identical or equivalent elements in a process, method, article, or apparatus that comprises a described element is not excluded. For example, words such as first and second are used to indicate a name, not any particular order.
The embodiments of the present disclosure provide an information extraction method, which can be applied to the application environment shown in fig. 1, wherein a terminal 102 communicates with a server 104 via a network. A data storage system may store the data that the server 104 needs to process; it may be integrated on the server 104 or placed on a cloud or other network server. The terminal 102 may be, but is not limited to, any of various personal computers, notebook computers, smart phones, tablet computers, internet-of-things devices, and portable wearable devices, where the internet-of-things devices may be smart speakers, smart televisions, smart air conditioners, smart vehicle-mounted devices, and the like, and the portable wearable devices may be smart watches, smart bracelets, headsets, and the like. The server 104 may be implemented as a stand-alone server or as a server cluster composed of multiple servers.
In some embodiments of the present disclosure, as shown in fig. 2, an information extraction method is provided, and an example of processing a picture to be identified by the server in fig. 1 is described. It will be appreciated that the method may be applied to a server, and may also be applied to a system comprising a terminal and a server, and implemented by interaction of the terminal and the server. In a specific embodiment, the method may include the steps of:
s202: and obtaining a picture to be identified.
The picture to be identified may be a license, an invoice, a bill, or the like. The picture to be identified may be a picture-type PDF (Portable Document Format) document containing text content; if the document to be identified is in another format, it first needs to be preprocessed and converted into a picture-type PDF document, thereby obtaining the preprocessed picture to be identified.
S204: and inputting the picture to be identified into a preset text line detection model to obtain a position frame of each text content in the picture to be identified, which is output by the text line detection model.
The text line detection model can detect horizontal or slightly inclined text lines. It enriches the contour feature representation by aggregating contour information, suppresses the influence of redundant and noisy contour points, and can produce more accurate localization for text of arbitrary shape. The picture to be identified is input into the preset text line detection model, which outputs a position frame for each piece of text content, where each position frame encloses one line of text content in the picture to be identified.
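A minimal Python sketch of this step follows; the patent does not name a concrete model, so the detector object and its predict method are assumptions standing in for the preset text line detection model:

```python
# Sketch of step S204: obtain one position frame per line of text content.
# `detector` stands in for the preset text line detection model; its
# predict() interface is assumed, not taken from the patent.
from dataclasses import dataclass
from typing import List

@dataclass
class PositionFrame:
    left: int    # corner-point coordinates of the frame
    top: int
    right: int
    bottom: int

def detect_text_lines(image, detector) -> List[PositionFrame]:
    """Run the text line detection model and wrap its output boxes."""
    raw_boxes = detector.predict(image)  # assumed: [(l, t, r, b), ...]
    return [PositionFrame(*box) for box in raw_boxes]
```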
S206: and dividing the picture to be identified into a plurality of sub-pictures based on the position frame, and determining picture information of the sub-pictures.
S208: and carrying out text recognition on the text content in the sub-pictures to obtain text information and position information corresponding to the plurality of sub-pictures.
Based on the picture to be identified, the text information, position information and picture information corresponding to each position frame are obtained. The text information corresponding to a position frame, i.e. the text inside that frame, can be understood as the character-string content of the frame. The position information corresponding to a position frame can be identified by the coordinates of its four corner points. The picture information corresponding to a position frame can be regarded as the image content enclosed by that frame.
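As an illustration of steps S206 and S208, the following sketch crops one sub-picture per position frame with Pillow and attaches an OCR result; axis-aligned frames and the recognize_text callable are assumptions, since the patent does not name an OCR engine:

```python
# Sketch of steps S206/S208: split the picture to be identified into
# sub-pictures and collect picture, position and text information.
from PIL import Image

def split_into_sub_pictures(image: Image.Image, frames):
    """Crop one sub-picture per position frame (assumed axis-aligned)."""
    return [{
        "picture": image.crop((f.left, f.top, f.right, f.bottom)),
        "position": (f.left, f.top, f.right, f.bottom),
    } for f in frames]

def recognize_sub_pictures(sub_pictures, recognize_text):
    """Attach text information to each sub-picture via a given OCR callable."""
    for sp in sub_pictures:
        sp["text"] = recognize_text(sp["picture"])
    return sub_pictures
```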
S210: and inputting text information, position information and picture information corresponding to each sub-picture into a category prediction model, and determining key information of each sub-picture.
The text information, position information and picture information corresponding to each position frame are input into the category prediction model to obtain the key information of each sub-picture. In this embodiment, taking one position frame as an example: based on the picture information, operations such as noise removal and sharpness improvement can be performed on the text content in the frame; based on the position information of the sub-picture, adjustments such as rotation can be applied to that content; based on the position information among the plurality of sub-pictures, the association relationships among the pieces of text content can be acquired; and based on the text information, the key information of the text content in the frame can be obtained.
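A hypothetical sketch of such pre-cleaning is shown below; median filtering stands in for the noise removal named above, and the deskew angle is assumed to be derived from the position information (the patent names only the operations, not how they are realized):

```python
# Hypothetical cleanup of one sub-picture before category prediction:
# median filtering as a simple noise-removal stand-in, then a rotation
# to level slightly inclined text. The filter choice and the source of
# the angle are assumptions.
from PIL import Image, ImageFilter

def clean_sub_picture(picture: Image.Image, skew_degrees: float) -> Image.Image:
    denoised = picture.filter(ImageFilter.MedianFilter(size=3))
    return denoised.rotate(-skew_degrees, expand=True, fillcolor="white")
```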
S212: and splicing the key information of each sub-picture to obtain the information in the picture to be identified.
The key information of each sub-picture can be spliced to obtain the information in the picture to be identified.
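A minimal sketch of the splicing step follows; ordering the sub-pictures top-to-bottom then left-to-right is an assumption, since the patent only states that the key information is spliced:

```python
# Sketch of step S212: splice the per-sub-picture key information
# (set earlier by the category prediction step) into one result.
def splice_key_information(sub_pictures) -> str:
    ordered = sorted(sub_pictures,
                     key=lambda sp: (sp["position"][1], sp["position"][0]))
    return "\n".join(sp["key_info"] for sp in ordered if sp.get("key_info"))
```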
According to the information extraction method, the picture to be identified can be analyzed to obtain the text information, the position information and the picture information of the picture to be identified, the key information is extracted based on comprehensive judgment of the text information, the position information and the picture information, and the key information with high identification accuracy can be obtained.
Fig. 3 is a schematic diagram of the information extraction method in one embodiment: text line detection and text recognition are performed on the picture to be identified to obtain the position information, picture information and text information, which are then input into the category prediction model to obtain the key information of the text.
In some embodiments of the present disclosure, after the obtaining the position box of each text content in the picture to be identified output by the text line detection model, the method further includes:
judging whether intersection exists between two adjacent position frames;
screening the two adjacent position frames under the condition that intersection exists between the two adjacent position frames;
and combining the two screened adjacent position frames to obtain a combined position frame.
The respective areas of two adjacent position frames can be obtained and their overlapping area calculated; when the overlapping area exceeds a threshold, it can be determined that the two adjacent position frames intersect. Meanwhile, the text information in the two frames can be compared to further confirm the intersection. The vertex coordinates of the first position frame and the second position frame can then be obtained, the two adjacent position frames comprising the first position frame and the second position frame, and the boundary coordinates of the first and second position frames determined from the vertex coordinates to obtain the merged position frame.
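The following sketch, reusing the PositionFrame type from the earlier detection sketch, illustrates this embodiment; the ratio-to-smaller-area criterion and the 0.5 value are assumptions, since the patent only requires the overlapping area to exceed a threshold:

```python
# Sketch of the frame-merging embodiment: test two adjacent frames for
# intersection by overlap area, then merge them via the bounding box of
# both frames' vertex coordinates.
def overlap_area(a, b) -> int:
    w = min(a.right, b.right) - max(a.left, b.left)
    h = min(a.bottom, b.bottom) - max(a.top, b.top)
    return max(0, w) * max(0, h)

def should_merge(a, b, ratio: float = 0.5) -> bool:
    smaller = min((a.right - a.left) * (a.bottom - a.top),
                  (b.right - b.left) * (b.bottom - b.top))
    return smaller > 0 and overlap_area(a, b) / smaller >= ratio

def merge_frames(a, b):
    # Boundary coordinates of the merged frame from both frames' vertices.
    return PositionFrame(min(a.left, b.left), min(a.top, b.top),
                         max(a.right, b.right), max(a.bottom, b.bottom))
```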
In some embodiments of the disclosure, the determining the key information of each sub-picture includes:
dividing characters in the text information based on the position information and the text information of the sub-picture, and identifying association relations among the characters;
based on the category prediction model, the classification of each character in the sub-picture is realized;
taking a category whose occurrence count meets a preset category threshold as the representative category of each sub-picture;
and acquiring key information in the sub-picture based on the representative category.
The category prediction model (a LayoutLMv3 model) can predict the category of each character of the input text; the categories are defined according to business requirements when the model is trained and correspond to the key-information fields. Based on the category prediction model, each character in the sub-picture is classified. Because of differing character segmentations, the characters of one line may receive several different categories; the occurrences of each category are counted, the category with the largest count can be taken as the predicted category of the text line, and the extracted key information then represents the information of that line.
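The vote itself reduces to counting labels, as in this sketch; the label strings are illustrative, since the real categories would be the business-defined key-information fields:

```python
# Sketch of the representative-category vote: the most frequent
# per-character category becomes the category of the whole text line.
from collections import Counter

def representative_category(char_categories):
    """Return the category predicted for the most characters in the line."""
    return Counter(char_categories).most_common(1)[0][0]

# Illustrative labels: a line whose characters mostly vote "invoice_number".
assert representative_category(
    ["invoice_number"] * 7 + ["other"] * 2) == "invoice_number"
```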
In some embodiments of the present disclosure, before the text information, the position information, and the picture information corresponding to each sub-picture are input into the class prediction model, the method further includes:
scaling the sub-picture based on the picture information;
scaling the text in the sub-picture in equal proportion based on the position information and the scaling of the sub-picture;
judging the length of characters in the text information, and deleting the characters in the intermediate field when the length of the characters exceeds a preset length threshold value.
The text length can be preprocessed before being input into the model: for text longer than 20 characters, the head and tail contents are retained and the middle part is removed, reducing the input text length to within the preset length threshold at the cost of a small amount of semantic information. The picture input into the LayoutLMv3 model can be scaled proportionally, and the text positions scaled by the same ratio, which keeps the text and the picture consistent and reduces the GPU memory requirement.
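A sketch of both preprocessing steps follows; the 20-character threshold comes from the description, while the even head/tail split and the 224-pixel target size are assumptions (224 being a typical LayoutLMv3-style input size):

```python
# Sketch of the preprocessing embodiment: proportional scaling of the
# sub-picture and its text coordinates, plus head/tail text truncation.
from PIL import Image

def scale_sub_picture(picture: Image.Image, box, target: int = 224):
    scale = target / max(picture.size)            # one factor for both axes
    resized = picture.resize((round(picture.width * scale),
                              round(picture.height * scale)))
    scaled_box = tuple(round(v * scale) for v in box)  # keep text consistent with picture
    return resized, scaled_box

def truncate_text(text: str, max_len: int = 20) -> str:
    if len(text) <= max_len:
        return text
    head, tail = max_len // 2, max_len - max_len // 2
    return text[:head] + text[-tail:]  # delete the intermediate field
```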
In some embodiments of the present disclosure, the merging the two screened neighboring position frames to obtain a merged position frame includes:
obtaining vertex coordinates of a first position frame and a second position frame, wherein the two adjacent position frames comprise the first position frame and the second position frame;
and determining boundary coordinates of the first position frame and the second position frame based on the vertex coordinates to obtain a combined position frame.
It should be understood that, although the steps in the flowcharts of the above embodiments are shown sequentially as indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated herein, the order of execution is not strictly limited, and the steps may be executed in other orders. Moreover, at least some of the steps may comprise several sub-steps or stages that are not necessarily performed at the same moment but may be performed at different times; their order of execution is not necessarily sequential, and they may be performed in turn or alternately with other steps or sub-steps.
Based on the same inventive concept, the embodiments of the present disclosure also provide an information extraction apparatus for implementing the above-mentioned information extraction method. The implementation scheme of the solution provided by the device is similar to the implementation scheme described in the above method, so the specific limitation in the embodiments of the information extraction device provided below can be referred to the limitation of the information extraction method hereinabove, and will not be repeated here.
The apparatus may comprise systems (including distributed systems), software (applications), modules, components, servers, clients, etc. that employ the methods described in the embodiments of this specification in combination with the necessary hardware. Based on the same innovative concept, the devices in one or more embodiments of the present disclosure are described in the following examples. Because the implementation scheme by which the apparatus solves the problem is similar to that of the method, the implementation of the apparatus in the embodiments of the present disclosure may refer to the implementation of the foregoing method, and the repetition is not described again. As used below, the term "unit" or "module" may be a combination of software and/or hardware that implements an intended function. While the means described in the following embodiments are preferably implemented in software, implementation in hardware or in a combination of software and hardware is also possible and contemplated.
In one embodiment, as shown in fig. 4, an information extraction apparatus 400 is provided, which may be the aforementioned server, or a module, component, device, unit, etc. integrated with the server. The apparatus 400 may include:
an acquisition module 402, configured to acquire a picture to be identified;
the output module 404 is configured to input the picture to be identified into a preset text line detection model, and obtain a position frame of each text content in the picture to be identified output by the text line detection model;
a dividing module 406, configured to divide the picture to be identified into a plurality of sub-pictures based on the position frame, and determine picture information of the sub-pictures;
the identifying module 408 is configured to perform text identification on the text content in the sub-pictures, so as to obtain text information and position information corresponding to the plurality of sub-pictures;
a determining module 410, configured to input text information, position information and picture information corresponding to each sub-picture into a category prediction model, and determine key information of each sub-picture;
and the splicing module 412 is configured to splice the key information of each sub-picture to obtain information in the picture to be identified.
In one embodiment, after the obtaining the position frame of each text content in the picture to be identified output by the text line detection model, the apparatus further includes:
judging whether intersection exists between two adjacent position frames;
screening the two adjacent position frames under the condition that intersection exists between the two adjacent position frames;
and combining the two screened adjacent position frames to obtain a combined position frame.
In one embodiment, the determining the key information of each sub-picture includes:
dividing characters in the text information based on the position information and the text information of the sub-picture, and identifying association relations among the characters;
based on the category prediction model, the classification of each character in the sub-picture is realized;
taking a category whose occurrence count meets a preset category threshold as the representative category of each sub-picture;
and acquiring key information in the sub-picture based on the representative category.
In one embodiment, before the text information, the position information and the picture information corresponding to each sub-picture are input into the class prediction model, the apparatus further includes:
scaling the sub-picture based on the picture information;
scaling the text in the sub-picture in equal proportion based on the position information and the scaling of the sub-picture;
judging the length of characters in the text information, and deleting the characters in the intermediate field when the length of the characters exceeds a preset length threshold value.
In one embodiment, the merging the two screened adjacent position frames to obtain the merged position frame includes:
obtaining vertex coordinates of a first position frame and a second position frame, wherein the two adjacent position frames comprise the first position frame and the second position frame;
and determining boundary coordinates of the first position frame and the second position frame based on the vertex coordinates to obtain a combined position frame.
The specific manner in which the various modules perform the operations in the apparatus of the above embodiments have been described in detail in connection with the embodiments of the method, and will not be described in detail herein.
The above-described respective modules in the information extraction apparatus may be implemented in whole or in part by software, hardware, and combinations thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a server, the internal structure of which may be as shown in fig. 5. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is used to store key information. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement an information extraction method.
In one embodiment, a computer device is provided, which may be a terminal, and the internal structure of which may be as shown in fig. 6. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless mode can be realized through WIFI, a mobile cellular network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement an information extraction method. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, can also be keys, a track ball or a touch pad arranged on the shell of the computer equipment, and can also be an external keyboard, a touch pad or a mouse and the like.
Those skilled in the art will appreciate that the structures shown in fig. 5 and 6 are merely block diagrams of partial structures associated with the disclosed aspects and do not constitute a limitation of the computer device on which the disclosed aspects may be applied, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, implements the method of any of the embodiments of the present disclosure.
In one embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the method described in any of the embodiments of the present disclosure.
Those skilled in the art will appreciate that all or part of the above-described methods may be implemented by a computer program stored on a non-transitory computer-readable storage medium which, when executed, may comprise the steps of the method embodiments described above. Any reference to memory, database, or other medium used in the embodiments provided by the present disclosure may include at least one of non-volatile and volatile memory. The non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, and the like. The volatile memory may include random access memory (RAM) or an external cache, among others. By way of illustration and not limitation, RAM can take many forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM). The databases referred to in the various embodiments provided by the present disclosure may include at least one of a relational database and a non-relational database; non-relational databases may include, but are not limited to, blockchain-based distributed databases and the like. The processors involved in the embodiments provided by the present disclosure may be general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic devices, data processing logic based on quantum computing, etc., without limitation thereto.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, any combination of them that involves no contradiction should be considered within the scope of this specification.
The foregoing examples express only a few embodiments of the present disclosure, and their descriptions are comparatively specific and detailed, but they should not be construed as limiting the scope of the present disclosure. It should be noted that those skilled in the art can make variations and modifications without departing from the spirit of the disclosure, and these fall within its scope. Accordingly, the scope of the present disclosure should be determined by the appended claims.

Claims (13)

1. An information extraction method, characterized in that the method comprises:
acquiring a picture to be identified;
inputting the picture to be identified into a preset text line detection model to obtain a position frame of each text content in the picture to be identified, which is output by the text line detection model;
dividing the picture to be identified into a plurality of sub-pictures based on the position frame, and determining picture information of the sub-pictures;
text recognition is carried out on the text content in the sub-pictures, and text information and position information corresponding to the plurality of sub-pictures are obtained;
inputting text information, position information and picture information corresponding to each sub-picture into a category prediction model, and determining key information of each sub-picture;
and splicing the key information of each sub-picture to obtain the information in the picture to be identified.
2. The method according to claim 1, wherein after the obtaining the position box of each text content in the picture to be identified output by the text line detection model, the method further comprises:
judging whether intersection exists between two adjacent position frames;
screening the two adjacent position frames under the condition that intersection exists between the two adjacent position frames;
and combining the two screened adjacent position frames to obtain a combined position frame.
3. The method of claim 1, wherein the determining key information for each sub-picture comprises:
dividing characters in the text information based on the position information and the text information of the sub-picture, and identifying association relations among the characters;
based on the category prediction model, the classification of each character in the sub-picture is realized;
taking a category whose occurrence count meets a preset category threshold as the representative category of each sub-picture;
and acquiring key information in the sub-picture based on the representative category.
4. The method of claim 1, wherein before inputting the text information, the position information, and the picture information corresponding to each sub-picture into the class prediction model, the method further comprises:
scaling the sub-picture based on the picture information;
scaling the text in the sub-picture in equal proportion based on the position information and the scaling of the sub-picture;
judging the length of characters in the text information, and deleting the characters in the intermediate field when the length of the characters exceeds a preset length threshold value.
5. The method of claim 2, wherein the merging the two filtered neighboring position frames to obtain a merged position frame comprises:
obtaining vertex coordinates of a first position frame and a second position frame, wherein the two adjacent position frames comprise the first position frame and the second position frame;
and determining boundary coordinates of the first position frame and the second position frame based on the vertex coordinates to obtain a combined position frame.
6. An information extraction apparatus, characterized in that the apparatus comprises:
the acquisition module is used for acquiring the picture to be identified;
the output module is used for inputting the picture to be identified into a preset text line detection model to obtain a position frame of each text content in the picture to be identified, which is output by the text line detection model;
the dividing module is used for dividing the picture to be identified into a plurality of sub-pictures based on the position frame and determining picture information of the sub-pictures;
the identification module is used for carrying out text identification on the text content in the sub-pictures to obtain text information and position information corresponding to the plurality of sub-pictures;
the determining module is used for inputting text information, position information and picture information corresponding to each sub-picture into the category prediction model and determining key information of each sub-picture;
and the splicing module is used for splicing the key information of each sub-picture to obtain the information in the picture to be identified.
7. The apparatus according to claim 6, wherein after the obtaining the position frame of each text content in the picture to be identified output by the text line detection model, the apparatus further comprises:
judging whether intersection exists between two adjacent position frames;
screening the two adjacent position frames under the condition that intersection exists between the two adjacent position frames;
and combining the two screened adjacent position frames to obtain a combined position frame.
8. The apparatus of claim 6, wherein the determining the key information for each sub-picture comprises:
dividing characters in the text information based on the position information and the text information of the sub-picture, and identifying association relations among the characters;
based on the category prediction model, the classification of each character in the sub-picture is realized;
taking a category whose occurrence count meets a preset category threshold as the representative category of each sub-picture;
and acquiring key information in the sub-picture based on the representative category.
9. The apparatus of claim 6, wherein before inputting the text information, the position information, and the picture information corresponding to each sub-picture into the class prediction model, the apparatus further comprises:
scaling the sub-picture based on the picture information;
scaling the text in the sub-picture in equal proportion based on the position information and the scaling of the sub-picture;
judging the length of characters in the text information, and deleting the characters in the intermediate field when the length of the characters exceeds a preset length threshold value.
10. The apparatus of claim 7, wherein the merging the two filtered neighboring position frames to obtain a merged position frame comprises:
obtaining vertex coordinates of a first position frame and a second position frame, wherein the two adjacent position frames comprise the first position frame and the second position frame;
and determining boundary coordinates of the first position frame and the second position frame based on the vertex coordinates to obtain a combined position frame.
11. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 5 when the computer program is executed.
12. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 5.
13. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements the steps of the method according to any one of claims 1 to 5.
CN202311440396.7A, "Information extraction method, device, computer equipment and storage medium", filed 2023-11-01 (priority date 2023-11-01), pending; published as CN117373046A (en).

Priority Applications (1)

Application Number: CN202311440396.7A
Priority Date / Filing Date: 2023-11-01
Title: Information extraction method, device, computer equipment and storage medium

Publications (1)

Publication Number: CN117373046A
Publication Date: 2024-01-09

Family

ID: 89392639

Country Status (1)

CN: CN117373046A (en)


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination