CN117593757A

CN117593757A - Text element extraction method, device and storage medium in scanned item

Info

Publication number: CN117593757A
Application number: CN202311718243.4A
Authority: CN
Inventors: 朱运运; 姚树宇; 何同飞
Original assignee: China Merchants Fund Management Co ltd
Current assignee: China Merchants Fund Management Co ltd
Priority date: 2023-12-13
Filing date: 2023-12-13
Publication date: 2024-02-23
Anticipated expiration: 2043-12-13
Also published as: CN117593757B

Abstract

The invention discloses a method, a device and a storage medium for extracting text elements in a scanned item, and relates to the technical field of information extraction. The method comprises the following steps: identifying first text content in a scanned item to be extracted through an optical character recognition model; splicing the first text content based on the text format of the first text content in the scanned item to be extracted, wherein the text format is a text paragraph or a table; and inputting the spliced first text content into a pre-trained element extraction model, which outputs at least one text element in the first text content and the position of each text element. The method, device and storage medium can reduce the labor and development cost of extracting text elements, avoid new problems caused by maintaining a large amount of code, and can be widely used for extracting text elements from scanned items in various fields.

Description

Text element extraction method, device and storage medium in scanned item

Technical Field

The invention belongs to the technical field of information extraction, and particularly relates to a method, a device and a storage medium for extracting text elements in a scanned item.

Background

In the financial industry, a large amount of business is conducted through scanned items: for example, key elements must be extracted from documents for data entry, and data must be reviewed against stamped scanned items. Because this data is unstructured, the related business cannot be automated and relies on a large amount of manual work.

At present, a common processing scheme in the industry is to extract the text, the tables and their position information using an optical character recognition (OCR) model, then analyze each type of file template, find the position rule of each element, and write corresponding rule code to extract the text elements. However, because document templates vary widely, this approach requires dedicated code for each template, which increases development cost and makes it difficult to adapt to newly appearing templates or to existing templates with small modifications. In addition, daily maintenance cost is high: the code must be modified frequently, new errors are easily introduced, and the extraction of text elements from scanned items becomes very inconvenient.

Therefore, how to provide an effective solution to facilitate the extraction of text elements in a scanned document has become a challenge in the prior art.

Disclosure of Invention

The invention aims to provide a text element extraction method, a text element extraction device and a storage medium in a scanning piece, which are used for solving the problems in the prior art.

In order to achieve the above purpose, the present invention adopts the following technical scheme:

In a first aspect, the present invention provides a method for extracting text elements in a scanned item, including:

identifying first text content in a scanned item to be extracted through an optical character recognition model;

splicing the first text content based on the text format of the first text content in the scanning piece to be extracted, wherein the text format is a text paragraph or a table;

inputting the spliced first text content into a pre-trained element extraction model, and outputting at least one text element in the first text content and the position of each text element;

the element extraction model comprises a T5 model and a pointer network layer, wherein the pointer network layer is connected with the last output layer of the T5 model, the last output layer of the T5 model is used for outputting at least one text element in the first text content, and the pointer network layer is used for taking the at least one text element as input and outputting the position of each text element in the first text content.

Based on the above disclosure, the present invention identifies the first text content in the scanned item to be extracted through the optical character recognition model; splicing the first text content based on the text format of the first text content in the scanning piece to be extracted, wherein the text format is a text paragraph or a table; inputting the spliced first text content into a pre-trained element extraction model, and outputting at least one text element in the first text content and the position of each text element. Therefore, text elements in the scanned item to be extracted can be extracted very conveniently, the labor input is reduced, meanwhile, corresponding codes are not required to be developed for each file template, the file patterns which are newly appeared generally can be well adapted, the development cost is reduced, and in addition, a large number of codes are not required to be maintained because the corresponding codes are not required to be developed for each file template, so that new problems caused by frequent code modification are avoided, and the method can be widely used for extracting the text elements in the scanned item in various fields, in particular for extracting the text elements in the scanned item in the financial field.

In one possible design, the splicing the first text content includes:

if the text format is a text paragraph, splicing two adjacent lines of text contents in the first text content through a first splicing symbol;

and if the text format is a table, splicing two adjacent columns of text contents in the first text contents through a second splicing symbol according to the sequence from left to right from top to bottom, and splicing two adjacent rows of text contents in the first text contents through the first splicing symbol.

In one possible design, the method further comprises:

identifying second text content in the sample scan by the optical character recognition model;

splicing the second text content based on the text format of the second text content in the sample scanning piece;

and training the spliced second text content marked with the text elements as the input of the element extraction model to obtain a trained element extraction model.

In one possible design, the loss function of the element extraction model is defined over the following symbols (the formula itself is not reproduced in this text): $D_{Task}$ represents the training samples, $s$ represents the prompt word of a text element, $x$ represents the spliced second text content annotated with text elements, $y$ represents the content of the element to be extracted, $\theta_e$ represents the encoding parameters in the element extraction model, and $\theta_d$ represents the decoding parameters in the element extraction model.

In one possible design, the file format of the scan piece to be extracted is JPG or PDF.

In a second aspect, the present invention provides a text element extraction device in a scan piece, the text element extraction device in the scan piece including:

the identification unit is used for identifying the first text content in the scanned item to be extracted through the optical character recognition model;

the splicing unit is used for splicing the first text content based on the text format of the first text content in the scanning piece to be extracted, wherein the text format is a text paragraph or a table;

the computing unit is used for inputting the spliced first text content into a pre-trained element extraction model and outputting at least one text element in the first text content and the position of each text element;

In a third aspect, the present invention provides another text element extraction device in a scanned item, including a memory, a processor and a transceiver, which are communicatively connected in sequence, where the memory is configured to store a computer program, the transceiver is configured to send and receive a message, and the processor is configured to read the computer program, and perform the text element extraction method in the scanned item as described in the first aspect or any one of the possible designs of the first aspect.

In a fourth aspect, the present invention provides a computer readable storage medium having instructions stored thereon which, when executed on a computer, perform the text element extraction method in a scanned item as described in the first aspect or any one of the possible designs of the first aspect.

In a fifth aspect, the invention provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the text element extraction method in a scanned item as described in the first aspect or any one of the possible designs of the first aspect.

The beneficial effects are that:

the text element extraction method, the text element extraction device and the storage medium in the scanned item can conveniently extract the text element in the scanned item to be extracted, reduce the labor input, simultaneously, can well adapt to the commonly newly-appearing file patterns without developing corresponding codes for each file template, reduce the development cost, and in addition, because the corresponding codes are not required to be developed for each file template, a large number of codes are not required to be maintained, the new problem caused by frequently modifying the codes is avoided, and the method and the device can be widely applied to the extraction of the text element in the scanned item in various fields, in particular to the extraction of the text element in the scanned item in the financial field.

Drawings

Fig. 1 is a flowchart of a method for extracting text elements in a scanned item according to an embodiment of the present application;

Fig. 2 is a schematic block diagram of a text element extraction device in a scanned item according to an embodiment of the present application;

Fig. 3 is a schematic block diagram of another text element extraction device in a scanned item according to an embodiment of the present application.

Detailed Description

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the present invention is briefly described below with reference to the accompanying drawings and the embodiments. Obviously, the drawings described below show only some embodiments of the present invention, and a person skilled in the art can obtain other drawings from them without inventive effort. It should be noted that the description of these embodiments is intended to aid understanding of the present invention, not to limit it.

It will be understood that, although the terms first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another element. For example, a first element could be termed a second element, and, similarly, a second element could be termed a first element, without departing from the scope of example embodiments of the present invention.

It should be understood that the term "and/or", where it appears herein, merely describes an association between associated objects and indicates that three relationships may exist; for example, A and/or B may represent: A alone, B alone, or both A and B. The term "/and", where it appears herein, describes another association and indicates that two relationships may exist; for example, A/and B may represent: A alone, or both A and B. In addition, the character "/", where it appears herein, generally indicates that the associated objects before and after it are in an "or" relationship.

In order to facilitate extraction of text elements in a scanned item, the embodiment of the application provides a method, a device and a storage medium for extracting text elements in a scanned item, which can reduce the labor investment and development cost of text element extraction, avoid the new problem caused by maintaining a large number of codes, and can be widely used for extracting text elements in scanned items in various fields.

The text element extraction method in the scanned item provided by the embodiment of the application can be applied to a user terminal or a server, wherein the user terminal can be, but is not limited to, a personal computer, a smart phone, a tablet computer, a laptop portable computer, a personal digital assistant and the like. It is understood that the execution bodies do not constitute limitations on the embodiments of the present application.

The text element extraction method in the scanned item provided in the embodiment of the present application will be described in detail below, and the text element extraction method in the scanned item may include, but is not limited to, the following steps S101 to S103.

S101, recognizing first text content in the scanned item to be extracted through an optical character recognition model.

In this embodiment of the present application, the first text content in the scanned item to be extracted is identified by the optical character recognition model. The file format of the scanned item to be extracted may be, but is not limited to, JPG or PDF. The first text content may include, but is not limited to, Chinese characters, letters or numbers, and may be in the form of a text paragraph or a table.

The identification of the first text content in the scanned item to be extracted by the optical character recognition model is known in the art and will not be described in detail herein.

S102, splicing the first text content based on the text format of the first text content in the scanned item to be extracted.

Wherein the text format may be a text paragraph or a table. In the embodiment of the application, the text format of the first text content in the scanned piece to be extracted can be identified directly through the optical character recognition model.

When the first text content is spliced, if the text format is a text paragraph, every two adjacent lines of text in the first text content can be joined by a first splicing symbol. If the text format is a table, the content is traversed left to right and top to bottom: every two adjacent columns of text are joined by a second splicing symbol, and every two adjacent rows are joined by the first splicing symbol. The first and second splicing symbols may be set as needed; for example, the first splicing symbol may be "<br>" and the second splicing symbol may be "\\". It will be appreciated that other symbols may be used in other embodiments.
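The splicing rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function names `splice_paragraph` and `splice_table` are hypothetical, OCR output is assumed to arrive as a list of lines or a list of rows of cells, and the exact characters of the second splicing symbol are an assumption since the text only gives an example.

```python
from typing import Sequence

ROW_SEP = "<br>"  # first splicing symbol (example value given in the text)
COL_SEP = "\\\\"  # second splicing symbol; the exact characters are an assumption


def splice_paragraph(lines: Sequence[str], row_sep: str = ROW_SEP) -> str:
    """Join adjacent OCR text lines with the first splicing symbol."""
    return row_sep.join(line.strip() for line in lines)


def splice_table(
    rows: Sequence[Sequence[str]],
    row_sep: str = ROW_SEP,
    col_sep: str = COL_SEP,
) -> str:
    """Join cells left to right within each row, then rows top to bottom."""
    return row_sep.join(
        col_sep.join(cell.strip() for cell in row) for row in rows
    )


if __name__ == "__main__":
    print(splice_paragraph(["Fund confirmation", "Dated 2023-12-13"]))
    print(splice_table([["Fund name", "Fund code"], ["ABC Growth", "001234"]], col_sep="|"))
```

Keeping the separators as parameters reflects the statement that other symbols may be used in other embodiments.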

S103, inputting the spliced first text content into a pre-trained element extraction model, and outputting at least one text element in the first text content and the position of each text element.

The element extraction model adopts an improved T5 (Text to Text Transfer Transformer) model, the element extraction model comprises a T5 model and a pointer network layer, the pointer network layer is connected with the last output layer of the T5 model, the last output layer of the T5 model is used for outputting at least one text element (entity) in the first text content, and the pointer network layer is used for taking at least one text element as input and outputting the position of each text element in the first text content. In this way, by inputting the spliced first text content into the element extraction model, at least one text element and the position of each text element in the first text content can be output.

For example, in a fund confirmation list, required elements, such as a fund name, a fund code, an application share, an application amount, a confirmation share, a confirmation amount and the like, and the positions of the text elements can be extracted according to service requirements, wherein the positions of the extracted text elements can facilitate the verification of the extracted text elements by related personnel.
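To make the "element plus position" output concrete, the following sketch shows one way a reviewer-facing result could be structured. It deliberately substitutes a plain substring search for the learned pointer network layer, so it only illustrates the shape of the output and is not the model described here. The names `ElementSpan` and `locate_elements` and the sample content string are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ElementSpan:
    name: str             # element type, e.g. "fund code"
    value: str            # extracted element text
    start: Optional[int]  # character offset in the spliced content, or None
    end: Optional[int]    # exclusive end offset, or None


def locate_elements(content: str, elements: Dict[str, str]) -> List[ElementSpan]:
    """Attach character offsets to extracted elements via substring search.

    This is a stand-in for the pointer network layer: it finds the first
    occurrence of each value rather than predicting a position.
    """
    spans = []
    for name, value in elements.items():
        start = content.find(value)
        if start == -1:
            spans.append(ElementSpan(name, value, None, None))
        else:
            spans.append(ElementSpan(name, value, start, start + len(value)))
    return spans


if __name__ == "__main__":
    content = "Fund name|ABC Growth<br>Fund code|001234"
    for span in locate_elements(content, {"fund name": "ABC Growth", "fund code": "001234"}):
        print(span)
```

With offsets attached, a reviewer can highlight `content[start:end]` in the spliced text to verify each extracted element.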

In the embodiment of the application, a T5 model is used as the initial model and is fine-tuned, so that different information extraction tasks are modeled in a unified way. The basic idea is to cast every text-related problem as a "text-to-text" problem, which has the advantage that every NLP (natural language processing) task can be handled with the same model, objective function, training procedure and decoding procedure, giving high flexibility. In addition, the T5 model transfers and generalizes well, supports key information extraction without restricting the industry field or extraction targets, enables a fast zero-shot cold start, and has strong few-shot fine-tuning capability, so it can be adapted to extraction targets in a specific field at low cost. Thus, efficient extraction of text elements can be achieved with few samples in the financial field.

Before training the element extraction model, the second text content in the sample scanning piece can be identified through the optical character recognition model, and the second text content is spliced based on the text format of the second text content in the sample scanning piece. During training, the spliced second text content can be used as a training sample, text elements needing to be extracted in the spliced second text content are marked, and then the spliced second text content marked with the text elements is used as input of an element extraction model to train, so that the trained element extraction model is obtained. The annotated text elements may be, for example, the name of the fund in the fund confirmation list, the fund code, the applied share, the applied amount, the confirmed share, the confirmed amount, and the like.

The spliced second text content annotated with text elements can be represented as, for example, "[CLS] Prompt [SEP] Content [SEP]", where Prompt is the prompt word (i.e., the text element), which tells the model which text element to extract, and Content is the content from which the element is to be extracted (i.e., the spliced second text content). For example, to extract the fund category element from the text paragraph "category of asset management plan, this plan is a fixed-income product", the fund category is used as the prompt word Prompt, followed by the text content, with the two joined by a separator.
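A minimal sketch of assembling this input string is shown below. The function name `build_model_input` is hypothetical, and the marker strings are written literally as they appear in the text; how they map onto a particular tokenizer's special tokens is not specified here and is left open.

```python
CLS = "[CLS]"  # start marker, written literally as in the text
SEP = "[SEP]"  # separator between prompt and content, and end marker


def build_model_input(prompt: str, content: str) -> str:
    """Format one sample as '[CLS] Prompt [SEP] Content [SEP]'."""
    return f"{CLS} {prompt.strip()} {SEP} {content.strip()} {SEP}"


if __name__ == "__main__":
    print(
        build_model_input(
            "fund category",
            "category of asset management plan, this plan is a fixed-income product",
        )
    )
```

Placing the prompt first means the same spliced content can be paired with a different prompt for each element to be extracted.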

In the embodiment of the application, fine-tuning can be performed for a specific task scenario based on the trained element extraction model. The amount of training sample data can be chosen according to task complexity (the more text elements to be extracted, the more complex the task), and the samples can be divided into a training set, a validation set and a test set in a certain proportion, for example 7.5:1.5:1.
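The 7.5:1.5:1 division can be sketched as follows. This is an illustrative helper only: the name `split_samples`, the shuffling step and the fixed seed are assumptions, since the text specifies only the ratio.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_samples(
    samples: Sequence[T],
    ratios: Tuple[float, float, float] = (7.5, 1.5, 1.0),
    seed: int = 0,
) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle samples and split them into train / validation / test sets."""
    items = list(samples)
    random.Random(seed).shuffle(items)  # seeded shuffle for reproducibility
    total = sum(ratios)
    n_train = int(len(items) * ratios[0] / total)
    n_val = int(len(items) * ratios[1] / total)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


if __name__ == "__main__":
    train, val, test = split_samples(range(100))
    print(len(train), len(val), len(test))
```

Any rounding remainder falls into the test set, so no sample is dropped.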

In the embodiment of the application, the loss function of the element extraction model is defined over the following symbols (the formula itself is not reproduced in this text): $D_{Task}$ represents the training samples, $s$ represents the prompt word of a text element, $x$ represents the spliced second text content annotated with text elements, $y$ represents the content of the element to be extracted, $\theta_e$ represents the encoding parameters in the element extraction model, and $\theta_d$ represents the decoding parameters in the element extraction model.
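The symbol definitions above are consistent with a standard negative log-likelihood objective for sequence-to-sequence fine-tuning. As a hedged sketch only (this form is an assumption rather than something stated in the text, and the symbol $\mathcal{L}$ is introduced here for illustration), such a loss could be written as:

```latex
\mathcal{L} = \sum_{(s,\,x,\,y)\,\in\,D_{Task}} -\log p\left(y \mid s, x;\ \theta_e, \theta_d\right)
```

Under this reading, training adjusts the encoding parameters $\theta_e$ and decoding parameters $\theta_d$ to maximize the probability of the annotated element content $y$ given the prompt word $s$ and the spliced content $x$.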

In one or more embodiments, only part of the text content in the scanned item to be extracted may need to be extracted (for example, only text elements in a specified chapter or page). In that case, negative samples can be added for training by including text elements that appear in locations other than those from which extraction is required. During training, the process can be stopped when the number of training rounds reaches a set value or when the loss function converges.
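The two stopping conditions can be expressed as a small check run after each training round. This is a sketch under stated assumptions: the text does not define convergence, so the tolerance-and-patience rule below, the default values and the name `should_stop` are all hypothetical.

```python
from typing import Sequence


def should_stop(
    epoch: int,
    losses: Sequence[float],
    max_epochs: int = 20,
    tol: float = 1e-4,
    patience: int = 3,
) -> bool:
    """Return True when training should stop.

    Stops when the round count reaches max_epochs, or when the loss has
    improved by no more than tol in each of the last `patience` rounds
    (an assumed definition of convergence).
    """
    if epoch >= max_epochs:
        return True
    if len(losses) <= patience:
        return False  # not enough history to judge convergence
    recent = losses[-(patience + 1):]
    return all(prev - cur <= tol for prev, cur in zip(recent, recent[1:]))


if __name__ == "__main__":
    print(should_stop(5, [0.9, 0.5, 0.3, 0.2]))
    print(should_stop(5, [0.2, 0.2, 0.2, 0.2]))
```

In practice the convergence check would usually be applied to the validation loss, which also guards against overfitting on small sample sets.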

In summary, according to the text element extraction method in the scanned item provided by the embodiment of the application, the first text content in the scanned item to be extracted is identified through the optical character recognition model; splicing the first text content based on the text format of the first text content in the scanning piece to be extracted, wherein the text format is a text paragraph or a table; inputting the spliced first text content into a pre-trained element extraction model, and outputting at least one text element in the first text content and the position of each text element. Therefore, text elements in the scanned item to be extracted can be extracted very conveniently, the labor input is reduced, meanwhile, corresponding codes are not required to be developed for each file template, the file patterns which are newly appeared generally can be well adapted, the development cost is reduced, and in addition, a large number of codes are not required to be maintained because the corresponding codes are not required to be developed for each file template, so that new problems caused by frequent code modification are avoided, and the method can be widely used for extracting the text elements in the scanned item in various fields, in particular for extracting the text elements in the scanned item in the financial field.
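The three steps S101 to S103 summarized above can be sketched as a thin orchestration layer. The OCR model, splicing routine and element extraction model are passed in as callables because none of them is specified in code here; the names `OcrResult`, `ExtractedElement` and `extract_text_elements` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Union


@dataclass
class OcrResult:
    text_format: str  # "paragraph" or "table"
    content: Union[Sequence[str], Sequence[Sequence[str]]]


@dataclass
class ExtractedElement:
    text: str   # extracted text element
    start: int  # start offset in the spliced content
    end: int    # exclusive end offset in the spliced content


def extract_text_elements(
    scan_path: str,
    ocr: Callable[[str], OcrResult],
    splice: Callable[[OcrResult], str],
    model: Callable[[str], List[ExtractedElement]],
) -> List[ExtractedElement]:
    """Run recognition, splicing and element extraction in sequence."""
    ocr_result = ocr(scan_path)   # S101: recognize the first text content
    spliced = splice(ocr_result)  # S102: splice according to text format
    return model(spliced)         # S103: extract elements and positions
```

Because no step depends on a particular file template, supporting a new document layout under this design means retraining or fine-tuning the model rather than writing new rule code.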

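Referring to Fig. 2, a second aspect of the embodiments of the present application provides a text element extraction device in a scanned item, the device including:

an identification unit, configured to identify the first text content in the scanned item to be extracted through the optical character recognition model;

a splicing unit, configured to splice the first text content based on the text format of the first text content in the scanned item to be extracted, wherein the text format is a text paragraph or a table;

a computing unit, configured to input the spliced first text content into the pre-trained element extraction model and output at least one text element in the first text content and the position of each text element.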

The working process, working details and technical effects of the text element extraction device in the scanning piece provided in the second aspect of the present embodiment may refer to the first aspect of the embodiment, and are not repeated herein.

As shown in fig. 3, a third aspect of the embodiment of the present application provides another text element extraction device in a scan piece, which includes a memory, a processor and a transceiver that are sequentially communicatively connected, where the memory is configured to store a computer program, the transceiver is configured to send and receive a message, and the processor is configured to read the computer program, and execute the text element extraction method in the scan piece according to the first aspect of the embodiment.

By way of specific example, the memory may include, but is not limited to, random access memory (RAM), read-only memory (ROM), flash memory, first-in-first-out memory (FIFO), and/or first-in-last-out memory (FILO), etc.; the processor may be, but is not limited to, an STM32F105-series microprocessor, a processor using an architecture such as ARM (Advanced RISC Machines) or x86, or a processor integrating an NPU (neural-network processing unit); the transceiver may be, but is not limited to, a WiFi (wireless fidelity) wireless transceiver, a Bluetooth wireless transceiver, a General Packet Radio Service (GPRS) wireless transceiver, a ZigBee (a low-power local area network protocol based on the IEEE 802.15.4 standard) wireless transceiver, a 3G transceiver, a 4G transceiver, and/or a 5G transceiver, etc.

A fourth aspect of the present embodiment provides a computer-readable storage medium storing instructions that implement the method for extracting text elements in a scanned item according to the first aspect of the embodiment; that is, the computer-readable storage medium has instructions stored thereon which, when executed on a computer, perform the method for extracting text elements in a scanned item according to the first aspect. The computer-readable storage medium is a carrier for storing data and may include, but is not limited to, a floppy disk, an optical disk, a hard disk, flash memory, and/or a memory stick, etc.; the computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device.

A fifth aspect of the present embodiment provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method for extracting text elements in a scanned item according to the first aspect of the embodiment, wherein the computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device.

It should be understood that specific details are provided in the following description to provide a thorough understanding of the example embodiments. However, it will be understood by those of ordinary skill in the art that the example embodiments may be practiced without these specific details. For example, a system may be shown in block diagrams in order to avoid obscuring the examples with unnecessary detail. In other instances, well-known processes, structures, and techniques may be shown without unnecessary detail in order to avoid obscuring the example embodiments.

Finally, it should be noted that: the foregoing description is only of the preferred embodiments of the invention and is not intended to limit the scope of the invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method for extracting text elements in a scanned item, comprising:

2. The method for extracting text elements from a scanned item of claim 1, wherein the stitching the first text content comprises:

3. The method of text element extraction in a scanned item of claim 1, further comprising:

4. The method for extracting text elements in a scanned item of claim 3, wherein the loss function of the element extraction model is defined over the following symbols (the formula itself is not reproduced in this text): $D_{Task}$ represents the training samples, $s$ represents the prompt word of a text element, $x$ represents the spliced second text content annotated with text elements, $y$ represents the content of the element to be extracted, $\theta_e$ represents the encoding parameters in the element extraction model, and $\theta_d$ represents the decoding parameters in the element extraction model.

5. The method for extracting text elements from a scanned item according to claim 1, wherein the file format of the scanned item to be extracted is JPG or PDF.

6. A text element extraction device in a scanned item, comprising:

7. A text element extraction device in a scanned item, comprising a memory, a processor and a transceiver in communication with each other in sequence, wherein the memory is configured to store a computer program, the transceiver is configured to send and receive a message, and the processor is configured to read the computer program and perform the text element extraction method in a scanned item according to any one of claims 1 to 5.

8. A computer readable storage medium having instructions stored thereon which, when executed on a computer, perform a method of extracting text elements in a scanned item as claimed in any one of claims 1 to 5.