CN118038248A - Image processing method and device, and text image processing model training method and device - Google Patents


Publication number: CN118038248A
Authority: CN (China)
Prior art keywords: target, sample, image processing, text, task
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Application number: CN202410044810.0A
Other languages: Chinese (zh)
Inventors: 万建强, 宋思博, 余文文, 姚聪, 杨志博
Current Assignee: Zhejiang Alibaba Robot Co ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original Assignee: Zhejiang Alibaba Robot Co ltd
Priority date: (the priority date is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed)
Application filed by Zhejiang Alibaba Robot Co ltd
Priority to CN202410044810.0A
Publication of CN118038248A

Landscapes

  • Character Discrimination (AREA)

Abstract

The embodiments of this specification provide an image processing method and device, and a text image processing model training method and device. The image processing method includes: determining a target text image and a target task identifier corresponding to the target text image; and inputting the target text image and the target task identifier into a text image processing model to obtain a target text recognition result in the target text image corresponding to the target image processing task, wherein the text image processing model is trained on sample text images, sample task identifiers corresponding to a plurality of sample image processing tasks, and the sample text recognition results in the sample text images corresponding to each sample image processing task. Because multiple image processing tasks are completed with one model, the overall processing link is simplified, the efficiency of information interaction and sharing between different tasks is improved, and more efficient and more accurate processing is achieved.

Description

Image processing method and device, and text image processing model training method and device
Technical Field
The embodiment of the specification relates to the technical field of computers, in particular to an image processing method and device, and a text image processing model training method and device.
Background
Document processing is an indispensable task in today's information age, and the text detection and recognition task, the key information extraction task, and the form recognition task, as its core tasks, are of great significance. At present, however, these tasks are typically completed with separate purpose-built models, which makes the overall image processing link cumbersome and complex and consumes a large amount of computing resources.
Therefore, there is a need for an image processing method that simplifies the document processing link across multiple document processing tasks and achieves more efficient and more accurate document processing.
Disclosure of Invention
In view of this, the present embodiment provides an image processing method. One or more embodiments of the present specification relate to a text image processing model training method, an image processing apparatus, a text image processing model training apparatus, a computing device, a computer readable storage medium, and a computer program product, which solve the technical drawbacks of the prior art.
According to a first aspect of embodiments of the present specification, there is provided an image processing method including:
Determining a target text image and a target task identifier corresponding to the target text image, wherein the target task identifier is a task identifier of a target image processing task, and the target image processing task is any one or more of a plurality of initial image processing tasks;
Inputting the target text image and the target task identifier into a text image processing model to obtain a target text recognition result corresponding to the target image processing task in the target text image,
The text image processing model is obtained through training of sample text images, sample task identifiers corresponding to a plurality of sample image processing tasks respectively, and sample text recognition results corresponding to the sample image processing tasks in the sample text images.
According to a second aspect of embodiments of the present specification, there is provided an image processing apparatus comprising:
A determining module, configured to determine a target text image and a target task identifier corresponding to the target text image, wherein the target task identifier is a task identifier of a target image processing task, and the target image processing task is any one or more of a plurality of initial image processing tasks;
An obtaining module configured to input the target text image and the target task identifier into a text image processing model, obtain a target text recognition result corresponding to the target image processing task in the target text image,
The text image processing model is obtained through training of sample text images, sample task identifiers corresponding to a plurality of sample image processing tasks respectively, and sample text recognition results corresponding to the sample image processing tasks in the sample text images.
According to a third aspect of embodiments of the present disclosure, there is provided a text image processing model training method, including:
determining a plurality of sample image processing tasks, sample task identifiers corresponding to the sample image processing tasks and sample text images;
determining a sample text recognition result corresponding to a target sample image processing task in a target sample text image, wherein the target sample image processing task is any one of the plurality of sample image processing tasks, and the target sample text image is the sample text image corresponding to the target sample image processing task;
training a text image processing model according to the target sample text image, the sample task identifier corresponding to the target sample image processing task, and the sample text recognition result.
According to a fourth aspect of embodiments of the present specification, there is provided a text image processing model training apparatus, comprising:
The first determining module is configured to determine a plurality of sample image processing tasks, sample task identifiers corresponding to the sample image processing tasks and sample text images;
A second determining module, configured to determine a sample text recognition result corresponding to a target sample image processing task in a target sample text image, where the target sample image processing task is any one of the plurality of sample image processing tasks, and the target sample text image is a sample text image corresponding to the target sample image processing task;
A training module, configured to train a text image processing model according to the target sample text image, the sample task identifier corresponding to the target sample image processing task, and the sample text recognition result.
According to a fifth aspect of embodiments of the present disclosure, there is provided a text image processing model training method, applied to a cloud, including:
Receiving a plurality of sample image processing tasks, sample task identifiers corresponding to the sample image processing tasks and sample text images sent by a client;
determining a plurality of sample image processing tasks, sample task identifiers corresponding to the sample image processing tasks and sample text images;
determining a sample text recognition result corresponding to a target sample image processing task in a target sample text image, wherein the target sample image processing task is any one of the plurality of sample image processing tasks, and the target sample text image is the sample text image corresponding to the target sample image processing task;
training a text image processing model according to the target sample text image, the sample task identifier corresponding to the target sample image processing task, and the sample text recognition result.
According to a sixth aspect of embodiments of the present disclosure, there is provided a text image processing model training device applied to a cloud, including:
the receiving module is configured to receive a plurality of sample image processing tasks, sample task identifiers corresponding to the sample image processing tasks and sample text images sent by the client;
The first determining module is configured to determine a plurality of sample image processing tasks, sample task identifiers corresponding to the sample image processing tasks and sample text images;
A second determining module, configured to determine a sample text recognition result corresponding to a target sample image processing task in a target sample text image, where the target sample image processing task is any one of the plurality of sample image processing tasks, and the target sample text image is a sample text image corresponding to the target sample image processing task;
A training module, configured to train a text image processing model according to the target sample text image, the sample task identifier corresponding to the target sample image processing task, and the sample text recognition result.
According to a seventh aspect of embodiments of the present specification, there is provided a computing device comprising:
A memory and a processor;
The memory is configured to store computer-executable instructions that, when executed by the processor, implement the steps of the image processing method or the text image processing model training method described above.
According to an eighth aspect of embodiments of the present specification, there is provided a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the steps of the above-described image processing method or text image processing model training method.
According to a ninth aspect of embodiments of the present specification, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the steps of the above-described image processing method, or text image processing model training method.
An embodiment of the present specification provides an image processing method including: determining a target text image and a target task identifier corresponding to the target text image, wherein the target task identifier is a task identifier of a target image processing task, and the target image processing task is any one or more of a plurality of initial image processing tasks; and inputting the target text image and the target task identifier into a text image processing model to obtain a target text recognition result corresponding to the target image processing task in the target text image, wherein the text image processing model is obtained through training of a sample text image, sample task identifiers respectively corresponding to a plurality of sample image processing tasks and sample text recognition results corresponding to each sample image processing task in the sample text image.
Based on the above, the text image processing model is obtained based on sample text images, sample task identifiers respectively corresponding to a plurality of sample image processing tasks and sample text recognition results corresponding to the sample image processing tasks in the sample text images, and the information interaction and sharing efficiency among different tasks can be improved by carrying out joint processing on the plurality of image processing tasks, so that more efficient and more accurate document processing can be realized; and a plurality of image processing tasks are completed through one text image processing model, so that the problems of slow information transmission and high calculation cost among a plurality of independent models can be avoided, and the whole link of document processing is greatly simplified.
Drawings
Fig. 1 is a schematic view of a scenario of an image processing method according to an embodiment of the present disclosure;
FIG. 2 is a flow chart of an image processing method provided in one embodiment of the present disclosure;
FIG. 3 is a flow chart of a text image processing model training method according to one embodiment of the present disclosure;
FIG. 4 is a flowchart of a training method of text image processing model applied to cloud according to an embodiment of the present disclosure;
FIG. 5 is a process diagram of an image processing method according to one embodiment of the present disclosure;
fig. 6 is a schematic structural view of an image processing apparatus according to an embodiment of the present specification;
FIG. 7 is a schematic diagram of a training device for text image processing model according to an embodiment of the present disclosure;
FIG. 8 is a block diagram of a computing device provided in one embodiment of the present description.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present specification. The present specification may, however, be embodied in many forms other than those described herein, and those skilled in the art may make similar generalizations without departing from its spirit; therefore, the present specification is not limited to the specific implementations disclosed below.
The terminology used in the one or more embodiments of the specification is for the purpose of describing particular embodiments only and is not intended to be limiting of the one or more embodiments of the specification. As used in this specification, one or more embodiments and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present specification refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that, although the terms first, second, etc. may be used in one or more embodiments of this specification to describe various information, this information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, without departing from the scope of one or more embodiments of the present specification, a first may also be referred to as a second, and similarly, a second may also be referred to as a first. The word "if" as used herein may be interpreted as "when" or "upon" or "in response to determining", depending on the context.
Furthermore, it should be noted that, user information (including, but not limited to, user equipment information, user personal information, etc.) and data (including, but not limited to, data for analysis, stored data, presented data, etc.) according to one or more embodiments of the present disclosure are information and data authorized by a user or sufficiently authorized by each party, and the collection, use, and processing of relevant data is required to comply with relevant laws and regulations and standards of relevant countries and regions, and is provided with corresponding operation entries for the user to select authorization or denial.
In one or more embodiments of the present description, a large model refers to a deep learning model with large-scale parameters, typically containing hundreds of millions, billions, or even more model parameters. A large model, also called a foundation model, is pre-trained on large-scale unlabeled corpora to produce a pre-trained model with more than one hundred million parameters; such a model can adapt to a wide range of downstream tasks and has strong generalization ability, for example a large language model (Large Language Model, LLM) or a multi-modal pre-training model.
In practical applications, a pre-trained large model can be adapted to different tasks by fine-tuning on only a small number of samples. Large models are widely used in fields such as natural language processing (Natural Language Processing, NLP) and computer vision, in particular in computer vision tasks such as visual question answering (Visual Question Answering, VQA), image captioning (Image Captioning, IC), and image generation, and in natural language processing tasks such as text-based sentiment classification, text summarization, and machine translation. Main application scenarios of large models include digital assistants, intelligent robots, search, online education, office software, e-commerce, and intelligent design.
First, terms related to one or more embodiments of the present specification will be explained.
OCR: optical Character Recognition, optical character recognition, refers to the process of electronic devices (e.g., scanners or digital cameras) checking characters printed on paper, determining their shape by detecting dark and light patterns, and then translating the shape into computer text using a character recognition method.
Transformer, an encoder-decoder model of the attention mechanism, is one of the commonly used deep learning models.
In the present information age, a large number of documents and pictures require automated processing and analysis. However, conventional document processing methods often require a lot of manpower and time, are inefficient and prone to error; to solve this problem, automated document processing techniques have been developed which aim to utilize related techniques such as computer vision and natural language processing to improve the efficiency and accuracy of document processing. The text detection and recognition, the key information extraction and the form recognition are core tasks of document processing; the text detection and recognition refers to automatically detecting and recognizing text information in pictures through an image processing technology, and the task has wide application value in the fields of automatic driving, medical treatment, education, insurance and the like. For example, in automatic driving, vehicles need to identify traffic signs and signs on roads; in the medical field, doctors need to extract medical record information of patients from medical reports.
Key information extraction refers to extracting important information, such as addresses and dates, from a document; this information is critical for subsequent analysis and processing. For example, in the insurance industry, claims personnel need to extract key information, such as the insured person's identity and the time of the accident, from insurance policies and accident reports.
The table identification refers to converting a table in a document picture into an editable electronic table, so that the digital processing of the document is realized. Forms are widely used in various industries as an important information presentation form. For example, in the financial industry, analysts need to convert form data in financial statements to digital form for further analysis.
However, these document processing tasks currently typically require the use of separate custom models to complete, resulting in cumbersome and inefficient overall image processing links. Therefore, the embodiment of the specification provides a unified parsing method for text images, which aims to integrate text detection recognition, key information extraction and form recognition into one model so as to optimize an overall processing link and improve processing efficiency and accuracy.
In the present specification, an image processing method is provided. The present specification also relates to a text image processing model training method, an image processing apparatus, a text image processing model training apparatus, a computing device, and a computer-readable storage medium, which are described in detail one by one in the following embodiments.
Referring to fig. 1, fig. 1 shows a schematic view of a scene of an image processing method according to an embodiment of the present specification.
Specifically, the image processing method is implemented by an end-side device 102 and a server 104, where the end-side device 102 is configured to send a target text image and a target task identifier to the server 104. The target text image can be understood as an image, sent by the user, that is to undergo image processing and contains text; the target task identifier can be understood as the mode in which the user wants the target text image to be processed. For example, when the user needs to perform text detection and recognition on the target text image, the user may send the target task identifier 1 corresponding to the text detection and recognition task. In practical applications, the user can select any one or more of processing tasks such as text detection and recognition, key information extraction, and form recognition for the target text image, so that image processing tasks can be completed more efficiently.
The text image processing model is obtained through training in the server 104, wherein the text image processing model is obtained through training of sample text images, sample task identifiers corresponding to a plurality of sample image processing tasks respectively, and sample text recognition results corresponding to the sample image processing tasks in the sample text images.
The text image processing model includes an image encoder, a structured sequence decoder, a detection box decoder, and a recognition content decoder, and can be a large model. When the server 104 receives the target text image and the target task identifier sent by the end-side device 102, it inputs them into the text image processing model: the image encoder of the text image processing model produces an image feature vector for the target text image, and the image feature vector and the target task identifier are then input into the structured sequence decoder of the text image processing model to obtain the text center point sequence of the target text image corresponding to the target task identifier. The text center point sequence may carry structural symbols, which can represent category information in the key information extraction task and table structure information in the form recognition task.
The text center point sequence and the image feature vector are then input into the text-center-point-based detection box decoder and the text-center-point-based recognition content decoder, which respectively produce the detection bounding box and the recognized content, in the target text image, corresponding to each text center point. By fusing the detection and recognition results (i.e., the detection bounding boxes and the recognized content) with the structural symbols, a complete output result can be obtained, such as a formatted output for key information extraction or an HTML (Hyper Text Markup Language) representation for form recognition; the output result is returned to the end-side device 102.
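The decoding flow described above can be sketched in simplified form. All components below are hypothetical Python stand-ins for illustration only, not the patented model: they only show how one shared image encoding plus a task identifier can drive the structured sequence decoder, the detection box decoder, and the recognition content decoder, and how the structural symbols are fused back into the result.

```python
# A minimal, runnable sketch of the described pipeline. Every component is a
# hypothetical placeholder, not the actual model.

def encode_image(image):
    # Stand-in image encoder: one "feature" per image row.
    return [float(sum(row)) for row in image]

def decode_center_points(features, task_id):
    # Stand-in structured sequence decoder: one text center point per feature,
    # interleaved with structural symbols ("<td>" marks a table cell when the
    # form recognition task is requested).
    symbol = "<td>" if task_id == "form" else ""
    return [((i, feat), symbol) for i, feat in enumerate(features)]

def decode_box_and_text(center_point):
    # Stand-in detection box and recognition content decoders, both
    # conditioned on the same text center point.
    i, _ = center_point
    return {"box": (i, i + 1), "text": f"text_{i}"}

def process(image, task_id):
    features = encode_image(image)
    results = []
    for center_point, symbol in decode_center_points(features, task_id):
        out = decode_box_and_text(center_point)
        out["symbol"] = symbol  # fuse the structural symbol with the result
        results.append(out)
    return results
```

Using the text center point as the single shared representation is what lets the two downstream decoders stay in step: both consume the same sequence, so box and content always refer to the same text instance.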
The end-side device 102 may include a browser, an APP (Application), a web application such as an H5 (Hyper Text Markup Language version 5) application, a light application (also called an applet, a lightweight application), a cloud application, or the like. The end-side device may be developed based on a software development kit (SDK, Software Development Kit) of the corresponding service provided by the server, for example an SDK based on real-time communication (RTC, Real Time Communication). The end-side device may be deployed in an electronic device and run depending on the device or on some APP in the device. The electronic device may have a display screen and support information browsing, and may be, for example, a personal mobile terminal such as a mobile phone, tablet computer, or personal computer. Various other types of applications are also commonly deployed in electronic devices, such as human-machine dialogue applications, model training applications, text processing applications, web browser applications, shopping applications, search applications, instant messaging tools, mailbox clients, and social platform software.
Server 104 may be understood as a server providing various services, including physical servers, cloud servers, such as servers providing communication services for multiple clients, servers for background training that provide support for models used on clients, servers that process data sent by clients, and so forth. It should be noted that, the server 104 may be implemented as a distributed server cluster formed by a plurality of servers, or may be implemented as a single server. The server 104 may also be a server of a distributed system or a server that incorporates a blockchain. The server 104 may also be a cloud server for cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (CDN, content Delivery Network), basic cloud computing services such as big data and artificial intelligence platforms, or an intelligent cloud computing server or intelligent cloud host with artificial intelligence technology.
It should be noted that, the image processing method provided in the embodiment of the present disclosure may be executed by the server 104, and in other embodiments of the present disclosure, a text image processing model may be deployed in the end-side device 102, so that the end-side device 102 may also have a similar function to the server 104, so as to execute the image processing method provided in the embodiment of the present disclosure; in other embodiments, the image processing method provided in the embodiments of the present disclosure may also be performed by the end-side device 102 in conjunction with the server 104.
According to the image processing method provided by the embodiment of the specification, the text center point of the text image is used as the unified feature expression among the plurality of image processing tasks, and the plurality of image processing tasks are processed in a combined mode, so that the information interaction and sharing efficiency among different tasks can be improved, and more efficient and more accurate document processing can be realized; and a plurality of image processing tasks are completed through one text image processing model, so that the problems of slow information transmission and high calculation cost among a plurality of independent models can be avoided, and the whole link of document processing is greatly simplified.
Referring to fig. 2, fig. 2 shows a flowchart of an image processing method according to an embodiment of the present disclosure, which specifically includes the following steps.
Step 202: determining a target text image and a target task identifier corresponding to the target text image, wherein the target task identifier is a task identifier of a target image processing task, and the target image processing task is any one or more of a plurality of initial image processing tasks.
The target text image can be understood as an image that is to undergo image processing and contains text. The target image processing task can be understood as the processing mode applied to the target text image, and may be, but is not limited to, one or more of the text detection and recognition task, the key information extraction task, and the like; the initial image processing tasks include, but are not limited to, the text detection and recognition task, the key information extraction task, and the form recognition task.
Specifically, determining a text image to be subjected to image processing, and determining a processing mode for performing image processing on the text image by using a task identifier; for example, when the user sends a target text image a and target task identifiers 1 and 3 through the client, and the target task identifier 1 represents a text detection and recognition task and the target task identifier 3 represents a form recognition task, it may be determined that text detection and recognition and form recognition are to be performed on the target text image a.
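As a hedged illustration of this step, the identifier-to-task mapping below is an assumption (the text fixes only that identifier 1 denotes text detection and recognition and identifier 3 denotes form recognition); it sketches how one or more task identifiers sent with a target text image might be resolved:

```python
# Hypothetical identifier-to-task mapping; concrete values beyond the
# examples in the text (1 and 3) are assumptions.
TASK_IDS = {
    1: "text_detection_recognition",
    2: "key_information_extraction",
    3: "form_recognition",
}

def resolve_tasks(task_ids):
    """Map the task identifiers sent with a target text image to tasks."""
    unknown = [t for t in task_ids if t not in TASK_IDS]
    if unknown:
        raise ValueError(f"unknown task identifiers: {unknown}")
    return [TASK_IDS[t] for t in task_ids]
```

For the example in the text, `resolve_tasks([1, 3])` would yield both text detection and recognition and form recognition for target text image A.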
In one or more embodiments of the present description, the target image processing task includes a text detection recognition task, a key information extraction task, and/or a form recognition task; accordingly, the target task identifications are different according to different target image processing tasks, and one or more task identifications corresponding to the target text image can be determined for the target text image. The specific implementation mode is as follows:
The determining the target text image and the target task identifier corresponding to the target text image comprises the following steps:
determining a target text image, and a detection and identification task identifier, an information extraction task identifier and/or a form identification task identifier corresponding to the target text image, wherein the detection and identification task identifier is a task identifier of the text detection and identification task, the information extraction task identifier is a task identifier of the key information extraction task, and the form identification task identifier is a task identifier of the form identification task.
The text detection and recognition task can be understood as finding the position of the text in the image, usually represented as a rectangular or polygonal bounding box, and then further analyzing the specific content of each character in the located text region and converting it into an editable, searchable text format.
The key information extraction task can be understood as extracting information with important meaning and value from a large amount of unstructured text data in a document or image, including but not limited to specific fields such as address, date, and amount; the key information extraction task requires not only recognizing all the text but also understanding its meaning in order to filter out the key content.
The form recognition task not only needs to detect the form boundary and the division of rows and columns in the image, but also to accurately recognize the characters in each cell and convert the information into the form of a spreadsheet or database; the structural integrity of the original form is preserved so that a machine can understand and process the form data.
Specifically, different target image processing tasks correspond to different target task identifications, and a text detection recognition task, a key information extraction task and a form recognition task correspond to a detection recognition task identification, an information extraction task identification and a form recognition task identification respectively.
For the target text image, one or more target task identifications corresponding to the target text image may be determined, for example, an information extraction task identification and a form recognition task identification corresponding to the target text image B may be determined, whereby it may be determined that a key information extraction task and a form recognition task are to be performed for the target text image B.
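For illustration only, the correspondence between task identifiers and tasks can be sketched as a small lookup; the identifier values 1 and 3 follow the example in the text, while 2 is an assumed identifier for key information extraction, not part of the specification:

```python
# Hypothetical task-identifier table; values are illustrative assumptions.
TASK_TABLE = {
    1: "text_detection_recognition",
    2: "key_information_extraction",
    3: "form_recognition",
}

def resolve_tasks(task_ids):
    """Determine which target image processing tasks to run for a target
    text image, given one or more target task identifiers."""
    unknown = [t for t in task_ids if t not in TASK_TABLE]
    if unknown:
        raise ValueError(f"unknown task identifiers: {unknown}")
    return [TASK_TABLE[t] for t in task_ids]
```

With this sketch, identifiers 1 and 3 resolve to text detection and recognition plus form recognition, matching the earlier example of target text image A.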
According to the image processing method provided by the embodiment of the specification, one or more task identifications corresponding to the target text image are determined, so that one or more target image processing tasks of the target text image can be completed according to the one or more target task identifications; and a plurality of tasks are completed through one text image processing model, so that the complexity of the system and the demand on computing resources are reduced.
Step 204: and inputting the target text image and the target task identifier into a text image processing model to obtain a target text recognition result corresponding to the target image processing task in the target text image.
The text image processing model is obtained through training of sample text images, sample task identifiers corresponding to a plurality of sample image processing tasks respectively, and sample text recognition results corresponding to the sample image processing tasks in the sample text images, and the sample text recognition results are obtained based on text center points of the sample text images.
The text image processing model can be understood as a model for processing text images; the target text recognition result can be understood as the output result of the target text image corresponding to the target image processing task.
Specifically, taking a target image processing task as a text detection and recognition task as an example, inputting an image containing a license plate as a target text image into a text image processing model, and inputting a detection and recognition task identifier corresponding to the text detection and recognition task into the text image processing model; after receiving an image containing a license plate, the text image processing model detects and identifies the text of the license plate part to obtain the corresponding number and letter sequence of the license plate, such as 'Beijing A123'.
In one or more embodiments of the present disclosure, by inputting the target text image and the task identifications of the different target image processing tasks into the text image processing model, the recognition result of the corresponding target image processing task can be obtained, and the recognition result of the corresponding target image processing tasks can be obtained when the different target task identifications are input. The specific implementation mode is as follows:
Inputting the target text image and the target task identifier into a text image processing model to obtain a target text recognition result corresponding to the target image processing task in the target text image, wherein the method comprises the following steps:
inputting the target text image and the detection and identification task identifier into a text image processing model to obtain a text detection and identification result corresponding to the text detection and identification task in the target text image; and/or
Inputting the target text image and the information extraction task identifier into a text image processing model to obtain a key information identification result corresponding to the key information extraction task in the target text image; and/or
And inputting the target text image and the table recognition task identifier into a text image processing model to obtain a table text recognition result corresponding to the table recognition task in the target text image.
The text detection and recognition result can be understood as a result of locating a text region in a target text image and recognizing specific text; the key information recognition result can be understood as a result of locating a text region of a specific category in a target text image and recognizing a specific text; the table text recognition result may be understood as a result that a table boundary in the target text image is detected, text in each cell is recognized, and converted into a form of an electronic table or database.
Specifically, for different target image processing tasks, after the target text image and the corresponding target task identifier are input into the text image processing model, different output results are obtained.
For example, in the case that the target task identifier is a detection and identification task identifier, the document picture A is taken as the target text image, the detection and identification task identifier is attached, and both are input into the text image processing model; the text image processing model can automatically locate each text region in the document picture A and identify the corresponding text information in those regions. In practical application, for each text region of the document picture A, a detection bounding box is used to display the text content corresponding to it; for example, if the document picture A contains the text content 'date: 4/8/2019; time: 5 pm', detection bounding boxes are used to frame 'date', '4/8/2019', 'time' and '5 pm' respectively, and the corresponding text information 'date', '4/8/2019', 'time' and '5 pm' is displayed; the located detection bounding boxes and the identified text information constitute the text detection and identification result corresponding to the text detection and identification task.
In the case that the target task identifier is an information extraction task identifier, the text image processing model extracts specific key information; for example, when time is the key information of interest, the model concentrates on searching for and extracting that specific key information field, that is, the text at that position, and outputs the identification result. Taking the document picture A as the target text image again, the key information identification result obtained with the text image processing model is that detection bounding boxes frame 'time' and '5 pm', and the corresponding text information 'time' and '5 pm' is displayed.
In the case that the target task identifier is a table identification task identifier, a text image B with a table structure is taken as the target text image, the table identification task identifier is attached, and the target text image B and the table identification task identifier are input into the text image processing model together; the text image processing model needs to identify all the text information in the table to form structured data, and can identify the table boundary and each piece of table text content therein, where the table text identification result is structured table data, which may include the column titles and all corresponding cell contents.
According to the image processing method provided by the embodiment of the specification, the target text image and the task identifications of different target image processing tasks are input into the text image processing model, so that output results of different forms corresponding to the target image processing tasks can be obtained, the actual requirements of users are met, and the user experience is improved.
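The single-model, task-conditioned interface described above can be sketched as follows; the model itself is replaced here by a stub with canned outputs, and all names and output shapes are illustrative assumptions rather than the actual model interface:

```python
def process(model_fn, image, task_id):
    """One entry point for all tasks: the target task identifier selects
    which recognition result the text image processing model returns for
    the same target text image."""
    return model_fn(image, task_id)

def stub_model(image, task_id):
    # Stand-in for the trained text image processing model, returning the
    # three output forms described in the text.
    outputs = {
        "detect": [("box_date", "date"), ("box_value", "4/8/2019")],
        "kie": {"date": "4/8/2019", "time": "5 pm"},
        "table": '<table><tr><td>date</td><td colspan="2">time</td></tr></table>',
    }
    return outputs[task_id]
```

The point of the sketch is that only the task identifier changes between calls; the image and the model remain the same.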
In one or more embodiments of the present disclosure, a target text image and a target task identifier are specifically input into the text image processing model, and then a text center point sequence corresponding to the target task identifier is generated by using a structured sequence decoder of the text image processing model, so as to obtain a target text recognition result corresponding to the target image processing task. The specific implementation mode is as follows:
the text image processing model comprises an image encoder and a structured sequence decoder;
Inputting the target text image and the target task identifier into a text image processing model to obtain a target text recognition result corresponding to the target image processing task in the target text image, wherein the method comprises the following steps:
Inputting the target text image and the target task identifier into the text image processing model, and obtaining a target image feature vector of the target text image by using the image encoder;
Inputting the target image feature vector and the target task identifier into the structured sequence decoder to obtain a character center point sequence corresponding to the target task identifier;
And obtaining a target character recognition result corresponding to the target image processing task in the target character image according to the target image feature vector and the character center point sequence corresponding to the target task identifier.
The image encoder is used for encoding the input target text image so as to obtain an image feature vector corresponding to the target text image; specifically, the image encoder may be composed of a Swin Transformer (shifted-window Transformer) and a feature pyramid network FPN (Feature Pyramid Network). When the Swin Transformer is combined with the FPN to construct the image encoder, the Transformer model's capability of capturing global context information and the FPN's strength in multi-scale feature expression can be fully utilized to build an efficient image encoder that acquires deep abstract features while maintaining spatial details, and better performance can be obtained in downstream tasks such as object detection and instance segmentation.
The structured-sequence decoder is used for generating an output sequence with a specific structure or constraint; for example, in the text image processing model provided in the embodiment of the present specification, the structured sequence decoder is configured to generate a text center point sequence of the target text image, and specifically, the text center point may be understood as a geometric center of each text element in the text image, where the embodiment of the present specification uses a form of coordinates to represent the text center point; the character center point sequence is understood as a one-dimensional sequence in which all character center point coordinates in the character image are arranged in order from top to bottom and from left to right.
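A minimal sketch of the top-to-bottom, left-to-right flattening described for the character center point sequence; the row-grouping tolerance `row_tol` is an assumed heuristic for deciding when two center points sit on the same visual line, not part of the specification:

```python
def center_point_sequence(points, row_tol=5):
    """Flatten 2-D text center points into a 1-D coordinate sequence ordered
    top-to-bottom, then left-to-right within each line."""
    pts = sorted(points, key=lambda p: p[1])  # sort by y first
    rows, current = [], [pts[0]]
    for p in pts[1:]:
        # group points whose y coordinates are within row_tol into one line
        if abs(p[1] - current[-1][1]) <= row_tol:
            current.append(p)
        else:
            rows.append(current)
            current = [p]
    rows.append(current)
    seq = []
    for row in rows:
        for x, y in sorted(row, key=lambda p: p[0]):  # left-to-right
            seq.extend([x, y])
    return seq
```

In a full system the start prompt (e.g. <S_TS>) and end prompt would wrap this coordinate list.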
Specifically, when the target text image is the target text image shown in fig. 1 and the target task identifier is the detection and identification task identifier, the target text image and the detection and identification task identifier are input into the text image processing model, and the image encoder is used to obtain the target image feature vector of the target text image; the target image feature vector and the detection and identification task identifier are input into the structured sequence decoder, and the character center point sequence corresponding to the detection and identification task identifier is obtained by means of cross attention: {<S_TS>, x1, y1, x2, y2, x3, y3, x4, y4, </S>}.
Wherein <S_TS> is the start prompt of the character center point sequence, indicating that it is the character center point sequence corresponding to the detection and identification task identifier, and </S> is the end prompt of the character center point sequence; (x1, y1) represents the character center point coordinates corresponding to 'date' in the target character image; (x2, y2) those corresponding to 'time'; (x3, y3) those corresponding to '4/8/2019'; and (x4, y4) those corresponding to '5 pm'. According to the target image feature vector and the character center point sequence corresponding to the detection and identification task identifier, the character detection and recognition result corresponding to the character detection and recognition task in the target character image is obtained.
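The detection-task sequence above can be decoded back into center point coordinates with a few lines; this sketch assumes the sequence has already been tokenized into the start prompt, numeric coordinates, and the end prompt:

```python
def parse_detection_sequence(tokens):
    """Parse a center point sequence of the form
    ['<S_TS>', x1, y1, x2, y2, ..., '</S>'] into (x, y) coordinate pairs."""
    if tokens[0] != "<S_TS>" or tokens[-1] != "</S>":
        raise ValueError("not a detection-task center point sequence")
    coords = tokens[1:-1]
    if len(coords) % 2 != 0:
        raise ValueError("coordinates must come in (x, y) pairs")
    # pair up consecutive (x, y) values
    return [(coords[i], coords[i + 1]) for i in range(0, len(coords), 2)]
```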
In the case that the target text image is the target text image as shown in fig. 1 and the target task identifier is the information extraction task identifier, a text center point sequence corresponding to the information extraction task identifier is obtained: {<S_KIE>, <date>, x3, y3, </date>, <time>, x4, y4, </time>, </S>}. This text center point sequence carries structural symbols; specifically, in the case that the target task identifier is the information extraction task identifier, that is, the target image processing task is the key information extraction task, the structural symbols are used to represent category information in the key information extraction task, such as date and time.
Wherein <S_KIE> is the start prompt of the character center point sequence, indicating that it is the character center point sequence corresponding to the information extraction task identifier, and </S> is the end prompt of the character center point sequence; (x3, y3) represents the character center point coordinates corresponding to '4/8/2019' in the target character image; (x4, y4) represents those corresponding to '5 pm'. According to the target image feature vector and the text center point sequence corresponding to the information extraction task identifier, the key information identification result corresponding to the key information extraction task in the target text image is obtained.
In the case that the target text image is the target text image as shown in fig. 1 and the target task identifier is the form identification task identifier, a text center point sequence corresponding to the form identification task identifier is obtained: {<S_TR>, <tr>, <td>, x1, y1, </td>, <td colspan="2">, x2, y2, </td>, </tr>, <tr>, <td colspan="2">, x3, y3, </td>, <td>, x4, y4, </td>, </tr>, </S>}. This text center point sequence carries structural symbols; in the case that the target task identifier is the table recognition task identifier, that is, the target image processing task is the table recognition task, the structural symbols are used to represent the table structure information in the table recognition task. For example, the table structure is defined and represented with the <tr> and <td> tags of HTML (hypertext markup language), where <tr> (Table Row) represents one row in the table; <td> (Table Data Cell) represents one cell in the table; and colspan="2" is an attribute setting applied to the <td> tag, indicating that the cell spans two consecutive columns of the row in which it is located.
Wherein <S_TR> is the start prompt of the character center point sequence, indicating that it is the character center point sequence corresponding to the table identification task identifier, and </S> is the end prompt of the character center point sequence; (x1, y1) represents the character center point coordinates corresponding to 'date' in the target character image; (x2, y2) those corresponding to 'time'; (x3, y3) those corresponding to '4/8/2019'; and (x4, y4) those corresponding to '5 pm'. According to the target image feature vector and the character center point sequence corresponding to the table identification task identifier, the table character recognition result corresponding to the table recognition task in the target character image is obtained.
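The table-task sequence can likewise be decoded into a row/cell structure. This sketch assumes a tokenization with one string per tag and numbers for coordinates; the exact tokenization is not fixed by the text:

```python
def parse_table_sequence(tokens):
    """Parse a table-task center point sequence, e.g.
    ['<S_TR>', '<tr>', '<td>', x, y, '</td>', ..., '</tr>', '</S>'],
    into rows of cells of the form {'point': (x, y), 'colspan': n}."""
    if tokens[0] != "<S_TR>":
        raise ValueError("not a table-task center point sequence")
    rows, row = [], []
    i = 1
    while tokens[i] != "</S>":
        tok = tokens[i]
        if tok == "<tr>":
            row = []          # start a new table row
            i += 1
        elif tok == "</tr>":
            rows.append(row)  # close the current row
            i += 1
        elif isinstance(tok, str) and tok.startswith("<td"):
            colspan = 2 if 'colspan="2"' in tok else 1
            row.append({"point": (tokens[i + 1], tokens[i + 2]), "colspan": colspan})
            i += 4            # skip '<td...>', x, y, '</td>'
        else:
            i += 1
    return rows
```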
According to the image processing method provided by the embodiment of the specification, the text center point sequence corresponding to the target task identifier can be generated by using the structured sequence decoder of the text image processing model, so that the target text recognition result corresponding to the target image processing task can be accurately obtained.
In one or more embodiments of the present description, the text image processing model further includes a detection frame decoder and an identification content decoder; the target detection bounding box and the target characters are obtained by using the detection frame decoder and the identification content decoder, thereby obtaining the target character identification result of the target character image. The specific implementation mode is as follows:
the step of obtaining a target character recognition result corresponding to the target image processing task in the target character image according to the target image feature vector and the character center point sequence corresponding to the target task identifier, comprises the following steps:
Inputting the target image feature vector and the character center point sequence corresponding to the target task identifier into the detection frame decoder to obtain a target detection bounding box;
inputting the target image feature vector and the character center point sequence corresponding to the target task identifier into the identification content decoder to obtain target characters;
And obtaining a target character recognition result corresponding to the target image processing task in the target character image according to the target detection bounding box and the target characters, wherein the target detection bounding box and the target characters are determined based on character center points in the target character image.
The detection bounding box is understood to be a rectangular or polygonal bounding box, which completely wraps a specific object or region of interest in the image, and in the embodiment of the present specification, is used to wrap a text region; the target text may be understood as text recognized in the target text image.
Specifically, following the above example, in the case that the target task identifier is the detection and identification task identifier, the text center point sequence corresponding to the detection and identification task identifier is {<S_TS>, x1, y1, x2, y2, x3, y3, x4, y4, </S>}; thus, in the case that the target image feature vector and the text center point sequence are input into the text-center-point-based detection frame decoder, target detection bounding boxes determined by the text center points in the sequence are obtained, namely at (x1, y1), (x2, y2), (x3, y3) and (x4, y4); in the case that the target image feature vector and the text center point sequence are input into the text-center-point-based identification content decoder, the target text determined by the text center points in the sequence is obtained, namely the text at (x1, y1), (x2, y2), (x3, y3) and (x4, y4) is recognized, the recognized text being 'date', 'time', '4/8/2019' and '5 pm'; and the text detection and recognition result corresponding to the text detection and recognition task in the target text image is obtained according to the target detection bounding boxes and the target text.
In the case that the target task identifier is the information extraction task identifier, the text center point sequence corresponding to the information extraction task identifier is {<S_KIE>, <date>, x3, y3, </date>, <time>, x4, y4, </time>, </S>}; thus, in the case that the target image feature vector and the text center point sequence are input into the text-center-point-based detection frame decoder, target detection bounding boxes determined by the text center points in the sequence are obtained, namely at (x3, y3) and (x4, y4); in the case that the target image feature vector and the text center point sequence are input into the text-center-point-based identification content decoder, the target text determined by the text center points in the sequence is obtained, namely the text at (x3, y3) and (x4, y4) is recognized, the recognized text being '4/8/2019' and '5 pm'; and according to the target detection bounding boxes and the target text, the key information identification result corresponding to the key information extraction task in the target text image is obtained: {"date": "4/8/2019", "time": "5 pm"}.
In the case that the target task identifier is the form identification task identifier, the text center point sequence corresponding to the form identification task identifier is {<S_TR>, <tr>, <td>, x1, y1, </td>, <td colspan="2">, x2, y2, </td>, </tr>, <tr>, <td colspan="2">, x3, y3, </td>, <td>, x4, y4, </td>, </tr>, </S>}; thus, in the case that the target image feature vector and the text center point sequence are input into the text-center-point-based detection frame decoder, target detection bounding boxes determined by the text center points in the sequence are obtained, namely at (x1, y1), (x2, y2), (x3, y3) and (x4, y4); in the case that the target image feature vector and the text center point sequence are input into the text-center-point-based identification content decoder, the target text determined by the text center points in the sequence is obtained, namely the text at (x1, y1), (x2, y2), (x3, y3) and (x4, y4) is recognized, the recognized text being 'date', 'time', '4/8/2019' and '5 pm'; and according to the target detection bounding boxes and the target text, the form recognition result corresponding to the form recognition task in the target text image is obtained.
In practical application, after a detection bounding box and recognition content of a text center point corresponding to a target text image are obtained by using a text center point-based detection box decoder and a text center point-based recognition content decoder, respectively, the detection recognition result (the detection bounding box and the recognition content) and a structural symbol in a text center point sequence are fused, so that a complete output result, such as formatted output in key information extraction and HTML (hypertext markup language) representation output in form recognition, is obtained.
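The fusion step can be sketched as follows: per-center-point bounding boxes and recognized content are combined, and structural symbols (here, category tags taken from a key information sequence) turn the result into a complete formatted output. The argument layout is an illustrative assumption:

```python
def fuse_results(boxes, texts, categories=None):
    """Fuse the detection bounding boxes and recognized content obtained per
    text center point. Without structural symbols, the result is a plain list
    of (box, text) detections; with category tags it becomes a formatted
    dictionary, as in the {'date': '4/8/2019', 'time': '5 pm'} example."""
    if categories is None:
        return list(zip(boxes, texts))
    return dict(zip(categories, texts))
```

A table result would analogously be fused with the <tr>/<td> structural symbols to produce an HTML representation.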
According to the image processing method provided by the embodiment of the specification, a detection surrounding frame and identification content corresponding to a character center point can be respectively obtained by using a detection frame decoder and an identification content decoder of a character image processing model; thus, the complete target character recognition result corresponding to the target image processing task is accurately obtained.
In one or more embodiments of the present disclosure, in order to enhance the user experience, the target text image and the target task identifier sent by the user through a user interaction interface of the client are received, and after the target text recognition result is obtained, the target text recognition result is displayed to the user through the user interaction interface of the client. The specific implementation mode is as follows:
Before the target text image and the target task identifier corresponding to the target text image are determined, the method further comprises the following steps:
receiving a target text image sent by a user through a user interaction interface of a client and a target task identifier corresponding to the target text image;
after the target text recognition result corresponding to the target image processing task in the target text image is obtained, the method further comprises the following steps:
and displaying the target character recognition result to the user through a user interaction interface of the client.
The core objective of user interaction interface design is to create an intuitive, easy-to-use and efficient user experience, so that users of any skill level can easily understand how to operate and can acquire the required information and services through visual, auditory or other sensory input modes.
Therefore, the user can send the target text image and the target task identifier corresponding to the target text image by utilizing the user interaction interface of the client; and under the condition that the target character recognition result is obtained, displaying the target character recognition result to a user through a user interaction interface of the client.
According to the image processing method provided by the embodiment of the specification, the target text image and the target task identifier sent by the user are received through the user interaction interface of the client, and the target text recognition result is displayed to the user through the user interaction interface, so that natural and efficient interaction with the user can be realized.
In one or more embodiments of the present specification, the text image processing model is obtained through training of the following steps:
determining a plurality of sample image processing tasks, sample task identifiers corresponding to the sample image processing tasks and sample text images;
determining a sample character recognition result corresponding to a target sample image processing task in a target sample character image, wherein the target sample image processing task is any one of the sample image processing tasks, and the target sample character image is a sample character image corresponding to the target sample image processing task;
and training to obtain the text image processing model according to the target sample text image, the sample task identifier corresponding to the target sample image processing task and the sample text recognition result.
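The three training-data steps above can be sketched as assembling (sample text image, sample task identifier, sample text recognition result) triples for joint multi-task training; the `task_specs` structure and field names are assumptions for illustration only:

```python
def build_training_samples(task_specs):
    """For each sample image processing task, pair every labelled sample text
    image with the task's sample task identifier and its sample text
    recognition result, yielding the triples used to train the model."""
    samples = []
    for task, spec in task_specs.items():
        for image, recognition_result in spec["samples"]:
            samples.append({
                "task": task,
                "task_id": spec["task_id"],
                "image": image,
                "target": recognition_result,
            })
    return samples
```

Because every triple carries its task identifier, a single model can be trained on all tasks from one mixed sample stream.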
According to the image processing method provided by the embodiment of the specification, the text center point of the text image is used as the unified feature expression among the plurality of image processing tasks, and the plurality of image processing tasks are processed in a combined mode, so that the information interaction and sharing efficiency among different tasks can be improved, and more efficient and more accurate document processing can be realized; and a plurality of image processing tasks are completed through one text image processing model, so that information transmission and calculation cost among a plurality of independent models can be avoided, and the whole link of document processing is greatly simplified.
Referring to fig. 3, fig. 3 shows a flowchart of a text image processing model training method according to an embodiment of the present disclosure, which specifically includes the following steps.
Step 302: and determining a plurality of sample image processing tasks, sample task identifiers corresponding to the sample image processing tasks and sample text images.
The sample image processing task may be understood as an initial image processing task in the above embodiment, including but not limited to a text detection and recognition task, a key information extraction task, and a form recognition task; the sample task identifier may be understood as the target task identifier in the above embodiment; a sample text image is understood to be an image containing text corresponding to each sample image processing task.
Specifically, training data for training a text image processing model is obtained, for example, tasks for determining that the text image processing model can process include a text detection and recognition task, a key information extraction task, a form recognition task and the like, a sample task identifier corresponding to the text detection and recognition task is determined as a detection and recognition task identifier, and a sample text image corresponding to the text detection and recognition task is an image containing any text; determining a sample task identifier corresponding to a key information extraction task as an information extraction task identifier, wherein a sample text image corresponding to the key information extraction task is a text image containing category information, and the category information can comprise specific fields such as name, age, date, time and the like; and determining a sample task identifier corresponding to the table identification task as a table identification task identifier, wherein a sample text image corresponding to the table identification task is a text image containing a table structure.
In one or more embodiments of the present description, in order for a text image processing model to better learn the distribution of text locations and semantic features in an image, an initial text image processing model is trained to obtain a text image processing model. The specific implementation mode is as follows:
Before determining the plurality of sample image processing tasks, the sample task identifiers corresponding to the sample image processing tasks and the sample text images, the method further comprises:
Determining a pre-training character image and character recognition results which accord with a target space window prompt and a target word prefix prompt in the pre-training character image;
inputting the pre-training text image, the target space window prompt and the target word prefix prompt into an initial text image processing model to obtain a predicted text recognition result conforming to the target space window prompt and the target word prefix prompt;
And training the initial text image processing model according to the predicted text recognition result and the text recognition result to obtain the text image processing model.
The pre-training text image is understood as an image containing text; the target space window prompt can be understood as a preset search area or range, which is used for defining a rectangular window in the pre-training text image, and guiding the structured sequence decoder to read the text in the appointed rectangular window; the target word prefix hint may be understood as a pre-set lexical prefix range for directing the structured sequence decoder to identify words within the specified lexical prefix range.
Specifically, the method for prompting the space window is to define a rectangular window in the image, and the initial text image processing model only outputs all the text within the window according to the sequence from top to bottom and from left to right in the pre-training process; the word prefix hinting method is to define a character range, for example, from a to f, and the initial text image processing model only outputs words with word initials in the character range in the pre-training process.
The space window prompt and word prefix prompt methods enhance the expression capacity of the text image processing model. The space window prompt constrains the model to output only words within the specified window, so that the model better perceives the spatial position distribution in the text image; the word prefix prompt constrains the model to output only words whose initial letters fall within the specified character range, so that the model better understands the semantic relations of words in the text image. Through these two designs, the text image processing model can achieve good performance on a plurality of tasks.
For example, a pre-training text image A is determined, together with the text recognition results in image A that conform to a target space window prompt and a target word prefix prompt; in the case that the target space window prompt is (46,32,140,165) and the target word prefix prompt is B-H, the pre-training text image A, the target space window prompt (46,32,140,165) and the target word prefix prompt B-H are input into the initial text image processing model to obtain a predicted text recognition result conforming to both prompts. If the words conforming to the target space window prompt in the pre-training text image A comprise "Is", "Ha", "Am" and "Co", then under the target word prefix prompt B-H the predicted text recognition result is "Ha" and "Co". The initial text image processing model is then trained with the predicted text recognition result and the real text recognition result to obtain the text image processing model.
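The two pre-training prompt filters in the example above can be sketched as follows. This is an illustrative reconstruction, not the patent's implementation: the word list, function name, and the assumption that a word's center point decides window membership are all hypothetical.

```python
def filter_by_prompts(words, window, prefix_range):
    """words: list of (text, cx, cy) with (cx, cy) the word's center point;
    window: (x1, y1, x2, y2); prefix_range: (lo, hi) initial letters,
    inclusive and case-insensitive."""
    x1, y1, x2, y2 = window
    lo, hi = (c.lower() for c in prefix_range)
    kept = [
        (t, cx, cy) for (t, cx, cy) in words
        if x1 <= cx <= x2 and y1 <= cy <= y2        # spatial window prompt
        and lo <= t[:1].lower() <= hi               # word prefix prompt
    ]
    kept.sort(key=lambda w: (w[2], w[1]))           # top-to-bottom, left-to-right
    return [t for (t, _, _) in kept]

words = [("Is", 60, 40), ("Ha", 70, 90), ("Am", 90, 120),
         ("Co", 100, 150), ("Zz", 300, 300)]
print(filter_by_prompts(words, (46, 32, 140, 165), ("B", "H")))  # ['Ha', 'Co']
```

During pre-training, only the surviving words would be supervision targets; "Is" and "Am" are dropped by the prefix prompt and "Zz" by the window prompt, matching the example above.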
After the text image processing model is obtained through this pre-training, a plurality of complex text image analysis tasks can be completed in one text image processing model; the pre-training enables the model to better learn the commonality among different tasks and reduces the data labeling quantity required by each task.
According to the text image processing model training method provided by the embodiment of the specification, the initial text image processing model is trained, so that the text image processing model can better sense the distribution of text images on the spatial position and the semantic relation, and a more accurate text detection recognition result, a more accurate key information extraction result and a more accurate form recognition result are provided.
Step 304: determining a sample character recognition result corresponding to a target sample image processing task in a target sample character image, wherein the target sample image processing task is any one of the sample image processing tasks, and the target sample character image is the sample character image corresponding to the target sample image processing task.
Specifically, when training the text image processing model, any one target sample image processing task is selected from the plurality of sample image processing tasks, and the sample text image corresponding to the target sample image processing task is determined, so that the sample text recognition result corresponding to the target sample image processing task in that sample text image can be determined.
And determining sample character recognition results corresponding to the target sample image processing tasks in the target sample character image, so that the character image processing model can process a plurality of sample image processing tasks, namely, a plurality of complex character image analysis tasks are completed in one character image processing model.
Step 306: training to obtain a character image processing model according to the target sample character image, the sample task identifier corresponding to the target sample image processing task and the sample character recognition result.
Specifically, a predicted sample character recognition result corresponding to a target sample image processing task is obtained through a target sample character image and a sample task identifier by utilizing a character image processing model; and training to obtain a text image processing model according to the predicted sample text recognition result and the sample text recognition result.
In one or more embodiments of the present description, a text image processing model includes a structured sequence decoder, a detection frame decoder, and an identification content decoder; training is required to be performed on the structured-sequence decoder, the detection-frame decoder, and the identification-content decoder, respectively, so as to train and obtain a text-image processing model. The specific implementation mode is as follows:
The training to obtain the text image processing model according to the target sample text image, the sample task identifier corresponding to the target sample image processing task and the sample text recognition result comprises the following steps:
Training the structured sequence decoder, the detection frame decoder and the identification content decoder according to the target sample text image and the sample task identifier corresponding to the target sample image processing task;
And training to obtain the text image processing model by using the structured sequence decoder, the detection frame decoder and the identification content decoder according to the sample text recognition result.
Specifically, the structured sequence decoder, the detection frame decoder and the identification content decoder are trained according to the target sample text image and the sample task identifier corresponding to the target sample image processing task, so that each decoder generates, for the target sample text image, the generation result corresponding to the target sample image processing task; the text image processing model is then obtained by training with the generation results of the structured sequence decoder, the detection frame decoder and the identification content decoder together with the sample text recognition result.
According to the text image processing model training method provided by the embodiment of the specification, under the condition that the text image processing model comprises the structured sequence decoder, the detection frame decoder and the identification content decoder, the structured sequence decoder, the detection frame decoder and the identification content decoder are respectively trained, so that the text image processing model with better training effect is obtained.
In one or more embodiments of the present disclosure, the generated results of the structured sequence decoder, the detection frame decoder and the identification content decoder are the character center point sequence, the detection bounding box and the characters, respectively; therefore, each of the three decoders is trained with its own generated result and the corresponding real result. The specific implementation mode is as follows:
Training the structured sequence decoder, the detection frame decoder and the identification content decoder according to the target sample text image and the sample task identifier corresponding to the target sample image processing task, including:
determining a sample character center point sequence, a sample detection bounding box and sample characters corresponding to the sample task identifier in the target sample character image, wherein the sample task identifier is a task identifier corresponding to the target sample image processing task, and the sample detection bounding box and the sample characters are obtained based on the character center point of the target sample character image;
training the structured sequence decoder according to the target sample text image and the sample text center point sequence;
training the detection frame decoder according to the target sample character image, the sample character center point sequence and the sample detection bounding box;
Training the identification content decoder according to the target sample character image, the sample character center point sequence and the sample characters.
The sample character center point sequence can be understood as a real character center point coordinate sequence corresponding to the sample task identifier in the target sample character image; the sample detection bounding box can be understood as a detection bounding box of characters in the real target sample character image in the target sample character image; sample text is understood to be the actual text content in the target sample text image.
Specifically, taking a text detection and recognition task as an example, determining a character center point sequence of a target sample character image aiming at the target sample character image, namely arranging all character center point coordinates of the target sample character image into a one-dimensional sequence according to the sequence from top to bottom and from left to right; based on the coordinates of the character center point, generating a detection bounding box of the corresponding sample character, wherein the detection bounding box is used for accurately marking the position of the character in the image; based on the coordinates of the character center point, the actual character content is extracted as a sample character.
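The target construction just described, ordering center points top-to-bottom then left-to-right and pairing each point with its bounding box and text, can be sketched as below. The annotation format and the derivation of a center point from an axis-aligned box are illustrative assumptions.

```python
def build_targets(annotations):
    """annotations: list of dicts with 'text' and 'box' = (x1, y1, x2, y2).
    Returns the one-dimensional center-point sequence plus per-point boxes
    and texts, ordered top-to-bottom, then left-to-right."""
    items = []
    for a in annotations:
        x1, y1, x2, y2 = a["box"]
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
        items.append((cy, cx, a["box"], a["text"]))
    items.sort()  # primary key cy (top-to-bottom), secondary cx (left-to-right)
    seq = [c for (cy, cx, _, _) in items for c in (cx, cy)]
    boxes = [b for (_, _, b, _) in items]
    texts = [t for (_, _, _, t) in items]
    return seq, boxes, texts

anns = [{"text": "B", "box": (50, 50, 70, 60)},
        {"text": "A", "box": (10, 10, 30, 20)}]
seq, boxes, texts = build_targets(anns)
print(seq, texts)  # [20.0, 15.0, 60.0, 55.0] ['A', 'B']
```

The flattened `seq` is the supervision target for the structured sequence decoder, while `boxes` and `texts` supervise the detection frame decoder and the identification content decoder, respectively.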
Training the structured sequence decoder by utilizing the target sample text image and the corresponding text center point sequence thereof, so that the structured sequence decoder can output the structured information of the target sample text image based on the text center point; training a detection frame decoder by combining a target sample text image, a text center point sequence and a sample detection bounding box, so that the detection frame decoder can locate and identify each independent text region in the image; training the recognition content decoder using the target sample text image, the text center point sequence, and the actual sample text, such that the recognition content decoder recognizes and outputs the specific character content of each detected text in the image.
In practice, the underlying structures of the structured sequence decoder, the detection frame decoder and the identification content decoder may be based on a Transformer decoder design and trained with training data for each specific task, so as to impart different functions to these decoders.
According to the training method for the text image processing model, the text image processing model can be better focused on key features of each task through the respective training of the structured sequence decoder, the detection frame decoder and the identification content decoder; and each decoder only pays attention to the required data part, thereby avoiding the influence of irrelevant information on the training process and improving the use efficiency of the computing resources.
In one or more embodiments of the present description, the text image processing model further includes an image encoder; and obtaining sample image feature vectors of the target sample text image through the image encoder, so that the structured sequence decoder is trained according to the sample image feature vectors, the sample task identifier and the sample text center point sequence. The specific implementation mode is as follows:
Training the structured sequence decoder according to the target sample text image and the sample text center point sequence, including:
Inputting the target sample text image into the image encoder to obtain a sample image feature vector of the target sample text image;
Inputting the sample image feature vector and the sample task identifier into the structured sequence decoder to obtain a predicted sample character center point sequence corresponding to the target sample image processing task in the target sample character image;
training the structured sequence decoder according to the sample character center point sequence and the predicted sample character center point sequence.
Specifically, inputting a target sample text image into an image encoder, extracting features of the target sample text image by the image encoder, and outputting a high-dimensional feature vector which contains compressed representation of important visual information in the target sample text image; inputting the sample image feature vector obtained from the image encoder and the corresponding sample task identifier to a structured sequence decoder; obtaining a predicted character center point coordinate sequence (i.e. a predicted sample character center point sequence) of each character or word corresponding to the sample task identifier in the target sample character image; and comparing the real sample character center point sequence with the predicted sample character center point sequence predicted by the structured sequence decoder, and updating parameters of the structured sequence decoder according to the back propagation of the loss function by calculating the loss function and other modes, so that the accuracy of the predicted character center point sequence is gradually improved.
According to the training method for the text image processing model, the structural sequence decoder can be used for decoding image features in a targeted mode to generate the prediction result related to the target sample image processing task, so that the text image processing model can concentrate on completing a specific task, and the text image processing model can carry out modularized training by obtaining the coordinate sequence of the text center point.
In one or more embodiments of the present disclosure, to achieve accurate positioning of characters in a text image, a detection frame decoder is trained, and specifically, the detection frame decoder is trained by a sample image feature vector, a sample character center point sequence, and a sample detection bounding box. The specific implementation mode is as follows:
Training the detection frame decoder according to the target sample text image, the sample text center point sequence and the sample detection bounding box, including:
Inputting the sample image feature vector and the sample character center point sequence into the detection frame decoder to obtain a predicted sample detection bounding box of the target sample character image, wherein the predicted sample detection bounding box is determined based on character center points in the sample character center point sequence;
and training the detection frame decoder according to the sample detection bounding box and the prediction sample detection bounding box.
Specifically, the sample image feature vector and the sample character center point sequence are used as input and provided for a detection frame decoder; the detection frame decoder can understand the image context according to the sample image feature vector and generate a predicted detection bounding box by combining the sample text center point sequence. For example, the center point coordinates of the character "a" exist in the sample text center point sequence, and the detection frame decoder predicts a rectangular detection bounding box suitable for containing the entire character "a" based on this center point coordinates.
In the training process, the predicted detection bounding box is compared with the actually marked sample detection bounding box, and a loss function and the like between the predicted detection bounding box and the actually marked sample detection bounding box are calculated, so that a detection bounding box decoder can gradually learn how to predict the detection bounding box of the characters more accurately based on the image characteristics and the character center point sequence.
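The comparison between predicted and annotated bounding boxes can combine an L1 regression term with an IoU term; the text only says "a loss function and the like", so this particular pair is an assumed common choice, sketched here for a single box:

```python
def box_losses(pred, gt):
    """pred, gt: (x1, y1, x2, y2). Returns (mean L1 over coordinates, 1 - IoU)."""
    l1 = sum(abs(p - g) for p, g in zip(pred, gt)) / 4
    # intersection rectangle
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(pred) + area(gt) - inter
    return l1, 1 - inter / union

print(box_losses((0, 0, 2, 2), (0, 0, 2, 2)))  # (0.0, 0.0)
```

A perfectly predicted box yields zero for both terms; as the overlap with the annotated sample detection bounding box shrinks, both terms grow, which is the gradient signal the detection frame decoder learns from.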
According to the text image processing model training method provided by the embodiment of the specification, in order to achieve accurate positioning of characters in a text image, the detection frame decoder decodes detection frames by utilizing the sample character center point sequence, so that the actual position range of each character can be calculated more accurately.
In one or more embodiments of the present disclosure, to achieve accurate recognition of text in a text image, the recognition content decoder is trained, specifically by a sample image feature vector, a sample text center point sequence, and sample text. The specific implementation mode is as follows:
Training the identifying content decoder according to the target sample character image, the sample character center point sequence and the sample characters, comprising:
inputting the sample image feature vector and the sample character center point sequence into the identification content decoder to obtain a predicted sample character of the target sample character image;
and training the identification content decoder according to the sample characters and the predicted sample characters.
Specifically, the sample image feature vector and the sample character center point sequence are used as input and provided for an identification content decoder; the recognition content decoder analyzes characters or words corresponding to the actual positions of the sample character center point sequences from the sample image feature vectors, thereby outputting predicted characters.
Calculating the difference between the predicted text and the real sample text by comparing the predicted text with the real sample text and using cross entropy loss or other loss functions suitable for the text sequence, and adjusting parameters of the identified content decoder based on the back propagation; through continuous iterative training, the identification content decoder can gradually improve the capability of accurately predicting the text content in the image.
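The cross-entropy comparison mentioned above can be sketched as an average negative log-likelihood over the character sequence. The per-step probability dictionaries are a simplification of a real softmax output; the function name and data are illustrative.

```python
import math

def char_cross_entropy(probs_per_step, target):
    """probs_per_step: list of dicts mapping character -> predicted probability
    (one dict per decoding step); target: the ground-truth string.
    Returns the mean negative log-likelihood of the target characters."""
    return -sum(math.log(p[c]) for p, c in zip(probs_per_step, target)) / len(target)

# perfect prediction -> zero loss; uncertain prediction -> positive loss
print(char_cross_entropy([{"H": 1.0}, {"i": 1.0}], "Hi"))  # 0.0
print(char_cross_entropy([{"a": 0.5, "o": 0.5}], "a"))     # 0.6931471805599453
```

Back-propagating this quantity through the identification content decoder is what gradually raises the probability assigned to the correct characters at each step.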
According to the text image processing model training method provided by the embodiment of the specification, the spatial layout and the context relation of the text in the image can be more accurately understood by the identification content decoder by combining the image characteristics and the text center point sequence, so that the identification accuracy of text content is improved, and the identification content decoder is facilitated to locate the text region in a complex image scene on the basis that the sample text center point sequence provides the position information of the text region, so that the identification error caused by the undefined text position is reduced.
In one or more embodiments of the present disclosure, corresponding prediction results are obtained by three decoders, the prediction results of the three decoders are fused to obtain a predicted word recognition result, and a word image processing model is obtained through training according to the predicted word recognition result and the sample word recognition result. The specific implementation mode is as follows:
According to the sample text recognition result, training to obtain the text image processing model by using the structured sequence decoder, the detection frame decoder and the recognition content decoder, including:
obtaining a predicted sample character center point sequence output by the structured sequence decoder, a predicted sample detection bounding box output by the detection box decoder and a predicted sample character output by the identification content decoder;
determining a predicted character recognition result according to the predicted sample character center point sequence, the predicted sample detection bounding box and the predicted sample character;
And training to obtain the text image processing model according to the sample text recognition result and the predicted text recognition result.
Specifically, the structured sequence decoder outputs a predicted center point coordinate sequence for each character or word, and the detection frame decoder outputs a predicted detection bounding box for each text block, bounding each individual word unit in the image; for each detected text block, the identification content decoder further gives the predicted text content, such as "Hello", "World", etc. All this information is integrated to determine the position, size and predicted text content of each text region in the text image, and is combined with the structural symbol information carried in the text center point coordinate sequence to generate a predicted text recognition result carrying the structural symbol information; the text image processing model is then obtained by training with the sample text recognition result and the predicted text recognition result.
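One way the three decoder outputs could be merged is sketched below: structural tags in the center-point sequence are carried along while each center point consumes one predicted box and one predicted text. The tag syntax (modeled on the key-information example later in this document) and the data structures are assumptions for illustration.

```python
def fuse_outputs(center_seq_with_tags, boxes, texts):
    """center_seq_with_tags: list mixing tag strings like '<date>'/'</date>'
    with (cx, cy) center-point tuples; boxes and texts are the per-point
    outputs of the detection frame and identification content decoders."""
    results, i, current_tag = [], 0, None
    for tok in center_seq_with_tags:
        if isinstance(tok, str):
            # opening tag sets the category; closing tag clears it
            current_tag = tok.strip("<>") if not tok.startswith("</") else None
        else:
            results.append({"tag": current_tag, "center": tok,
                            "box": boxes[i], "text": texts[i]})
            i += 1
    return results

fused = fuse_outputs(
    ["<date>", (10, 20), "</date>", "<time>", (30, 40), "</time>"],
    [(5, 15, 15, 25), (25, 35, 35, 45)],
    ["2024-01-01", "12:00"])
print(fused[0]["tag"], fused[0]["text"])  # date 2024-01-01
```

The fused records carry position, content, and category together, which is the "predicted text recognition result carrying the structural symbol information" compared against the sample recognition result during training.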
According to the text image processing model training method provided by the embodiment of the specification, the text center point is adopted as the intermediate representation, and tasks with greatly different output forms are skillfully fused in one text image processing model through the combination of the image encoder, the structured sequence decoder, the text center point-based detection frame decoder and the text center point-based identification content decoder, so that a plurality of complicated text image analysis tasks are completed in one text image processing model, the system complexity and the computing resource requirements are reduced, unnecessary computation and redundant operation are reduced, and the processing efficiency is improved; and the design of the pre-training task enhances the perception capability of the text image processing model on the text image, and provides more accurate results for each task.
Referring to fig. 4, fig. 4 shows a flowchart of a training method of a text image processing model applied to a cloud according to an embodiment of the present disclosure, which specifically includes the following steps.
Step 402: and receiving a plurality of sample image processing tasks, sample task identifiers corresponding to the sample image processing tasks and sample text images sent by the client.
Step 404: and determining a plurality of sample image processing tasks, sample task identifiers corresponding to the sample image processing tasks and sample text images.
Step 406: determining a sample character recognition result corresponding to a target sample image processing task in a target sample character image, wherein the target sample image processing task is any one of the sample image processing tasks, and the target sample character image is the sample character image corresponding to the target sample image processing task.
Step 408: training to obtain a character image processing model according to the target sample character image, the sample task identifier corresponding to the target sample image processing task and the sample character recognition result.
The specific implementation can be referred to the above embodiments, and will not be described herein.
The above is a schematic scheme applied to the cloud and text image processing model training method in this embodiment. It should be noted that, the technical solution applied to the cloud and the text image processing model training method and the technical solution of the text image processing model training method belong to the same concept, and details which are not described in detail in the technical solution applied to the cloud and the text image processing model training method can be referred to the description of the technical solution of the text image processing model training method.
Referring to fig. 5, fig. 5 is a process diagram of an image processing method according to an embodiment of the present disclosure.
In practical application, the three tasks of text detection and recognition, key information extraction and form recognition are all essentially OCR-related analysis tasks, and their inputs and outputs are all centered on text, so they naturally lend themselves to unification.
Specifically, as shown in fig. 5, a text image and a task identifier are input into a text image processing model, for example, the task identifier includes a detection and identification task identifier corresponding to a text detection task, an information extraction task identifier corresponding to a key information extraction task, and a form identification task identifier corresponding to a form identification task; an image encoder of the text image processing model is utilized to obtain an image feature vector corresponding to the text image, and the image feature vector is used as an input of a subsequent decoder.
Inputting the image feature vector and the task identifier into a structured sequence decoder, and generating a character center point sequence with structural symbols through the structured sequence decoder; the structural symbols can be used for representing category information in the key information extraction task and table structure information in form recognition. By adopting the text center point representation, the position information of text in the text image can be effectively captured, the perception capability of the text image processing model on spatial text position is improved, and the text image processing model can generate the text center point sequence more accurately. For example, the detection and identification task identifier yields the text center point sequence A = {<S_TS>, x1, y1, x2, y2, x3, y3, x4, y4, </S>}; the information extraction task identifier yields the text center point sequence B = {<S_KIE>, <date>, x3, y3, </date>, <time>, x4, y4, </time>, </S>}; and the form recognition task identifier yields the text center point sequence C = {<S_TR>, <tr>, <td>, x1, y1, </td>, <td,colspan="2",>, x2, y2, </td>, </tr>, <tr>, <td,colspan="2",>, x3, y3, </td>, <td>, x4, y4, </td>, </tr>, </S>}.
The obtained text center point sequences A, B and C and the image feature vector are input into the text-center-point-based detection frame decoder and the text-center-point-based recognition content decoder to obtain, respectively, the detection bounding boxes and the recognition content of the corresponding text; the detection and recognition results (detection bounding boxes and recognition content) are then fused with the structural symbol information to obtain a complete output result, such as the formatted output in key information extraction and the HTML representation output in form recognition (e.g. the text detection and recognition result, key information extraction result and form recognition result in fig. 5).
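For the form recognition task, the HTML output mentioned above could be rendered by walking the structural tokens of a sequence like C and substituting each coordinate pair with the recognized cell text. The token syntax below follows the sequence C notation; the renderer itself is an illustrative sketch, not the patent's implementation.

```python
def table_seq_to_html(tokens, cell_texts):
    """tokens: structured sequence for the form recognition task;
    cell_texts: recognized content for each cell, in sequence order."""
    html, i = [], 0
    for tok in tokens:
        if tok == "<tr>":
            html.append("<tr>")
        elif tok == "</tr>":
            html.append("</tr>")
        elif tok.startswith("<td"):          # cell token, possibly with colspan
            attr = ' colspan="2"' if 'colspan="2"' in tok else ""
            html.append(f"<td{attr}>{cell_texts[i]}</td>")
            i += 1
        # coordinate pairs, </td>, and sequence delimiters need no HTML output
    return "".join(html)

tokens = ["<S_TR>", "<tr>", "<td>", "(x1,y1)", "</td>",
          '<td,colspan="2",>', "(x2,y2)", "</td>", "</tr>",
          "<tr>", '<td,colspan="2",>', "(x3,y3)", "</td>",
          "<td>", "(x4,y4)", "</td>", "</tr>", "</S>"]
print(table_seq_to_html(tokens, ["A", "B", "C", "D"]))
```

This produces a two-row table in which the second cell of row one and the first cell of row two span two columns, matching the colspan markers in sequence C.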
In practical application, the image processing method is not limited to three tasks of character detection and recognition, key information extraction and form recognition, and can also process other OCR related tasks, such as hierarchical character detection; in the case of performing hierarchical text detection tasks, it is only necessary to introduce hierarchical line and paragraph separators into the structural symbols contained in the text center point sequence generated by the structured sequence decoder.
The image processing method has the advantages of high efficiency, accuracy, high multitask adaptability and the like through the design of a character center point representation and a pre-training method, and the three tasks of character detection and identification, key information extraction and form identification are jointly modeled through taking the character center point as a unified feature representation; by the method, information interaction and sharing efficiency among different tasks can be improved, and therefore more efficient and more accurate document processing is achieved.
According to the image processing method provided by the embodiment of the specification, multiple complex text image analysis tasks are completed in one text image processing model, so that multi-task learning and cross-task migration are realized; by sharing the intermediate representation and the feature learning, the data labeling amount required by each task is reduced, and the generalization capability of the text image processing model on each task is improved.
Corresponding to the above method embodiments, the present disclosure further provides an image processing apparatus embodiment, and fig. 6 shows a schematic structural diagram of an image processing apparatus according to one embodiment of the present disclosure. As shown in fig. 6, the apparatus includes:
A determining module 602, configured to determine a target text image and a target task identifier corresponding to the target text image, where the target task identifier is a task identifier of a target image processing task, and the target image processing task is any one or more of a plurality of initial image processing tasks;
An obtaining module 604 configured to input the target text image and the target task identifier into a text image processing model, obtain a target text recognition result corresponding to the target image processing task in the target text image,
The text image processing model is obtained through training of sample text images, sample task identifiers corresponding to a plurality of sample image processing tasks respectively, and sample text recognition results corresponding to the sample image processing tasks in the sample text images.
Optionally, the obtaining module 604 is further configured to:
Inputting the target text image and the target task identifier into the text image processing model, and obtaining a target image feature vector of the target text image by using the image encoder;
Inputting the target image feature vector and the target task identifier into the structured sequence decoder to obtain a character center point sequence corresponding to the target task identifier;
And obtaining a target character recognition result corresponding to the target image processing task in the target character image according to the target image feature vector and the character center point sequence corresponding to the target task identifier.
Optionally, the obtaining module 604 is further configured to:
Inputting the target image feature vector and the character center point sequence corresponding to the target task identifier into the detection frame decoder to obtain a target detection bounding box;
inputting the target image feature vector and the character center point sequence corresponding to the target task identifier into the identification content decoder to obtain target characters;
And obtaining a target character recognition result corresponding to the target image processing task in the target character image according to the target detection bounding box and the target characters, wherein the target detection bounding box and the target characters are determined based on character center points in the target character image.
Optionally, the determining module 602 is further configured to:
determining a target text image, and a detection and identification task identifier, an information extraction task identifier and/or a form identification task identifier corresponding to the target text image, wherein the detection and identification task identifier is a task identifier of the text detection and identification task, the information extraction task identifier is a task identifier of the key information extraction task, and the form identification task identifier is a task identifier of the form identification task.
Optionally, the obtaining module 604 is further configured to:
inputting the target text image and the detection and identification task identifier into a text image processing model to obtain a text detection and identification result corresponding to the text detection and identification task in the target text image; and/or
Inputting the target text image and the information extraction task identifier into a text image processing model to obtain a key information identification result corresponding to the key information extraction task in the target text image; and/or
And inputting the target text image and the table recognition task identifier into a text image processing model to obtain a table text recognition result corresponding to the table recognition task in the target text image.
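Routing three tasks through one model via a task identifier might look like the following toy dispatcher. The identifier strings and result fields are invented for illustration only; in the described model the branching is learned, not an explicit `if` chain:

```python
def run_model(image, task_id):
    # One callable plays the role of the single text image processing model;
    # the task identifier token selects which kind of result is produced.
    shared = sum(sum(row) for row in image)  # stands in for shared image features
    if task_id == "det_rec":
        return {"task": "text detection and recognition", "score": shared}
    if task_id == "kie":
        return {"task": "key information extraction", "score": shared}
    if task_id == "table":
        return {"task": "table recognition", "score": shared}
    raise ValueError(f"unknown task identifier: {task_id}")

result = run_model([[1, 2], [3, 4]], "kie")
print(result)  # {'task': 'key information extraction', 'score': 10}
```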
The device further comprises:
And the receiving module is configured to receive a target text image sent by a user through a user interaction interface of the client and a target task identifier corresponding to the target text image.
The device further comprises:
And the display module is configured to display the target character recognition result to the user through a user interaction interface of the client.
The image processing apparatus provided by the embodiments of the present specification jointly processes a plurality of image processing tasks, which improves the efficiency of information interaction and sharing among different tasks and thereby achieves more efficient and more accurate document processing. Moreover, completing the plurality of image processing tasks with a single text image processing model avoids the cost of transferring information and computing across multiple independent models, greatly simplifying the entire document processing pipeline.
The above is a schematic scheme of an image processing apparatus of the present embodiment. It should be noted that, the technical solution of the image processing apparatus and the technical solution of the image processing method belong to the same concept, and details of the technical solution of the image processing apparatus, which are not described in detail, can be referred to the description of the technical solution of the image processing method.
Corresponding to the above method embodiment, the present disclosure further provides an embodiment of a text image processing model training device, and fig. 7 shows a schematic structural diagram of a text image processing model training device according to one embodiment of the present disclosure. As shown in fig. 7, the apparatus includes:
A first determining module 702 configured to determine a plurality of sample image processing tasks, sample task identifiers corresponding to each sample image processing task, and sample text images;
A second determining module 704, configured to determine a sample text recognition result corresponding to a target sample image processing task in a target sample text image, where the target sample image processing task is any one of the plurality of sample image processing tasks, and the target sample text image is a sample text image corresponding to the target sample image processing task;
The training module 706 is configured to train to obtain a text image processing model according to the target sample text image, the sample task identifier corresponding to the target sample image processing task, and the sample text recognition result.
Optionally, the training module 706 is further configured to:
Training the structured sequence decoder, the detection frame decoder and the identification content decoder according to the target sample text image and the sample task identifier corresponding to the target sample image processing task;
And training to obtain the text image processing model by using the structured sequence decoder, the detection frame decoder and the identification content decoder according to the sample text recognition result.
Optionally, the training module 706 is further configured to:
determining a sample character center point sequence, a sample detection bounding box and sample characters corresponding to the sample task identifier in the target sample character image, wherein the sample task identifier is a task identifier corresponding to the target sample image processing task, and the sample detection bounding box and the sample characters are obtained based on the character center point of the target sample character image;
training the structured sequence decoder according to the target sample text image and the sample text center point sequence;
training the detection frame decoder according to the target sample character image, the sample character center point sequence and the sample detection bounding box;
Training the identification content decoder according to the target sample character image, the sample character center point sequence and the sample characters.
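The per-decoder supervision described above — each decoder trained against its own target derived from character center points — can be illustrated with toy L1 losses. The numbers and the choice of loss are purely illustrative; the text does not specify which loss functions are used:

```python
def l1_loss(pred, target):
    # Sum of absolute element-wise differences.
    return sum(abs(p - t) for p, t in zip(pred, target))

# Toy ground truth derived from annotated character center points:
sample_centers = [1, 2]
sample_boxes = [(0.5, 1.5), (1.5, 2.5)]
sample_chars = [3, 3]          # character labels encoded as integers

# Toy predictions from the three decoders:
pred_centers = [1, 3]
pred_boxes = [(0.6, 1.4), (1.5, 2.5)]
pred_chars = [3, 2]

seq_loss = l1_loss(pred_centers, sample_centers)                         # structured sequence decoder
box_loss = sum(l1_loss(p, t) for p, t in zip(pred_boxes, sample_boxes))  # detection frame decoder
rec_loss = l1_loss(pred_chars, sample_chars)                             # identification content decoder
total_loss = seq_loss + box_loss + rec_loss
print(seq_loss, rec_loss, round(box_loss, 2), round(total_loss, 2))  # 1 1 0.2 2.2
```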
Optionally, the training module 706 is further configured to:
Inputting the target sample text image into the image encoder to obtain a sample image feature vector of the target sample text image;
Inputting the sample image feature vector and the sample task identifier into the structured sequence decoder to obtain a predicted sample character center point sequence corresponding to the target sample image processing task in the target sample character image;
training the structured sequence decoder according to the sample character center point sequence and the predicted sample character center point sequence.
Optionally, the training module 706 is further configured to:
Inputting the sample image feature vector and the sample character center point sequence into the detection frame decoder to obtain a predicted sample detection bounding box of the target sample character image, wherein the predicted sample detection bounding box is determined based on character center points in the sample character center point sequence;
and training the detection frame decoder according to the sample detection bounding box and the prediction sample detection bounding box.
Optionally, the training module 706 is further configured to:
inputting the sample image feature vector and the sample character center point sequence into the identification content decoder to obtain a predicted sample character of the target sample character image;
and training the identification content decoder according to the sample characters and the predicted sample characters.
Optionally, the training module 706 is further configured to:
obtaining a predicted sample character center point sequence output by the structured sequence decoder, a predicted sample detection bounding box output by the detection frame decoder and a predicted sample character output by the identification content decoder;
determining a predicted character recognition result according to the predicted sample character center point sequence, the predicted sample detection bounding box and the predicted sample character;
And training to obtain the text image processing model according to the sample text recognition result and the predicted text recognition result.
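Assembling the three predicted outputs into one predicted text recognition result could be as simple as zipping them along the shared center points; the result schema here is a hypothetical sketch, not a format defined by the source:

```python
def assemble_result(centers, boxes, chars):
    # Join the three decoder outputs into one structured recognition result,
    # keyed by the shared character center points.
    return [{"center": c, "box": b, "char": ch}
            for c, b, ch in zip(centers, boxes, chars)]

result = assemble_result([1, 2], [(0.5, 1.5), (1.5, 2.5)], ["h", "i"])
print(result[0])                           # first recognized character
print("".join(r["char"] for r in result))  # hi
```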
The device further comprises:
The pre-training module is configured to determine a pre-training text image and a text recognition result that conforms, in the pre-training text image, to a target space window prompt and a target word prefix prompt; input the pre-training text image, the target space window prompt and the target word prefix prompt into an initial text image processing model to obtain a predicted text recognition result conforming to the target space window prompt and the target word prefix prompt; and train the initial text image processing model according to the predicted text recognition result and the text recognition result to obtain the text image processing model.
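The pre-training objective — recognize only text that satisfies both a target space window prompt and a target word prefix prompt — can be mimicked with a toy filter over annotated words. The positions, prompt encodings, and matching rule below are invented for illustration; they are not the patent's formulation:

```python
# Toy annotations: each "word" in a pre-training image has a position and text.
words = [
    {"pos": (10, 20), "text": "invoice"},
    {"pos": (10, 80), "text": "total"},
    {"pos": (60, 20), "text": "invalid"},
]

def matches(word, window, prefix):
    # A word conforms to both prompts when its position lies inside the
    # space window AND its transcription starts with the word prefix.
    (x0, y0), (x1, y1) = window
    x, y = word["pos"]
    return x0 <= x <= x1 and y0 <= y <= y1 and word["text"].startswith(prefix)

def pretrain_target(words, window, prefix):
    return [w["text"] for w in words if matches(w, window, prefix)]

print(pretrain_target(words, ((0, 0), (50, 100)), "inv"))   # ['invoice']
print(pretrain_target(words, ((0, 0), (100, 100)), "inv"))  # ['invoice', 'invalid']
```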
According to the text image processing model training device provided by the embodiments of the present specification, the character center point is adopted as an intermediate representation, and tasks whose output forms differ greatly are fused into a single text image processing model through the combination of the image encoder, the structured sequence decoder, the detection frame decoder based on character center points, and the identification content decoder based on character center points. Multiple complicated text image analysis tasks are thereby completed within one model, reducing unnecessary computation and redundant operations and improving processing efficiency. In addition, the design of the pre-training task enhances the model's perception of text images, yielding more accurate results for each task.
The above is a schematic scheme of a text image processing model training device of this embodiment. It should be noted that, the technical solution of the text image processing model training device and the technical solution of the above text image processing model training method belong to the same concept, and details of the technical solution of the text image processing model training device which are not described in detail can be referred to the description of the technical solution of the above text image processing model training method.
Fig. 8 illustrates a block diagram of a computing device 800 provided in accordance with one embodiment of the present description. The components of computing device 800 include, but are not limited to, memory 810 and processor 820. Processor 820 is coupled to memory 810 through bus 830 and database 850 is used to hold data.
Computing device 800 also includes an access device 840, which enables computing device 800 to communicate via one or more networks 860. Examples of such networks include the public switched telephone network (PSTN), a local area network (LAN), a wide area network (WAN), a personal area network (PAN), or a combination of communication networks such as the Internet. The access device 840 may include one or more of any type of network interface, wired or wireless, such as a network interface card (NIC), an IEEE 802.11 wireless local area network (WLAN) interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, or a Near Field Communication (NFC) interface.
In one embodiment of the present description, the above-described components of computing device 800, as well as other components not shown in FIG. 8, may also be connected to each other, such as by a bus. It should be understood that the block diagram of the computing device illustrated in FIG. 8 is for exemplary purposes only and is not intended to limit the scope of the present description. Those skilled in the art may add or replace other components as desired.
Computing device 800 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), a mobile phone (e.g., smartphone), a wearable computing device (e.g., smart watch, smart glasses, etc.), or another type of mobile device, or a stationary computing device such as a desktop computer or personal computer (PC). Computing device 800 may also be a mobile or stationary server.
The processor 820 is configured to execute computer-executable instructions which, when executed by the processor, implement the steps of the above image processing method or of the text image processing model training method.
The foregoing is a schematic illustration of a computing device of this embodiment. It should be noted that, the technical solution of the computing device and the technical solution of the image processing method or the text image processing model training method belong to the same concept, and details of the technical solution of the computing device, which are not described in detail, can be referred to the description of the technical solution of the image processing method or the text image processing model training method.
An embodiment of the present disclosure also provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the steps of the above-described image processing method or text image processing model training method.
The above is an exemplary version of a computer-readable storage medium of the present embodiment. It should be noted that, the technical solution of the storage medium and the technical solution of the above image processing method or the text image processing model training method belong to the same concept, and details of the technical solution of the storage medium which are not described in detail can be referred to the description of the technical solution of the above image processing method or the text image processing model training method.
An embodiment of the present specification also provides a computer program product, including a computer program which, when executed by a processor, implements the steps of the above-described image processing method or text image processing model training method.
The foregoing is a schematic version of a computer program product of this embodiment. It should be noted that, the technical solution of the computer program product and the technical solution of the image processing method or the text image processing model training method belong to the same concept, and details of the technical solution of the computer program product, which are not described in detail, can be referred to the description of the technical solution of the image processing method or the text image processing model training method.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The computer instructions include computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the content encompassed by the computer readable medium may be expanded or restricted as required by legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, the computer readable medium excludes electrical carrier signals and telecommunications signals in accordance with legislation and patent practice.
It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of combinations of actions, but it should be understood by those skilled in the art that the embodiments are not limited by the order of actions described, as some steps may be performed in other order or simultaneously according to the embodiments of the present disclosure. Further, those skilled in the art will appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily all required for the embodiments described in the specification.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments.
The preferred embodiments of the present specification disclosed above are merely intended to aid in explaining the specification. The alternative embodiments are not described exhaustively, nor is the invention limited to the precise forms disclosed. Obviously, many modifications and variations are possible in light of the teaching of the embodiments. The embodiments were chosen and described in order to best explain the principles of the embodiments and their practical application, thereby enabling others skilled in the art to best understand and utilize the invention. This specification is to be limited only by the claims and their full scope and equivalents.

Claims (18)

1. An image processing method, comprising:
Determining a target text image and a target task identifier corresponding to the target text image, wherein the target task identifier is a task identifier of a target image processing task, and the target image processing task is any one or more of a plurality of initial image processing tasks;
Inputting the target text image and the target task identifier into a text image processing model to obtain a target text recognition result corresponding to the target image processing task in the target text image,
The text image processing model is obtained through training of sample text images, sample task identifiers corresponding to a plurality of sample image processing tasks respectively, and sample text recognition results corresponding to the sample image processing tasks in the sample text images.
2. The image processing method of claim 1, the text image processing model comprising an image encoder, a structured sequence decoder;
Inputting the target text image and the target task identifier into a text image processing model to obtain a target text recognition result corresponding to the target image processing task in the target text image, wherein the method comprises the following steps:
Inputting the target text image and the target task identifier into the text image processing model, and obtaining a target image feature vector of the target text image by using the image encoder;
Inputting the target image feature vector and the target task identifier into the structured sequence decoder to obtain a character center point sequence corresponding to the target task identifier;
And obtaining a target character recognition result corresponding to the target image processing task in the target character image according to the target image feature vector and the character center point sequence corresponding to the target task identifier.
3. The image processing method of claim 2, the text image processing model further comprising a detection frame decoder, an identification content decoder;
the step of obtaining a target character recognition result corresponding to the target image processing task in the target character image according to the target image feature vector and the character center point sequence corresponding to the target task identifier, comprises the following steps:
Inputting the target image feature vector and the character center point sequence corresponding to the target task identifier into the detection frame decoder to obtain a target detection bounding box;
inputting the target image feature vector and the character center point sequence corresponding to the target task identifier into the identification content decoder to obtain target characters;
And obtaining a target character recognition result corresponding to the target image processing task in the target character image according to the target detection bounding box and the target character, wherein the target detection bounding box and the target character are determined based on a character center point in the target character image.
4. The image processing method according to claim 1, wherein the target image processing task includes a text detection recognition task, a key information extraction task, and/or a form recognition task;
The determining the target text image and the target task identifier corresponding to the target text image comprises the following steps:
determining a target text image, and a detection and identification task identifier, an information extraction task identifier and/or a form identification task identifier corresponding to the target text image, wherein the detection and identification task identifier is a task identifier of the text detection and identification task, the information extraction task identifier is a task identifier of the key information extraction task, and the form identification task identifier is a task identifier of the form identification task.
5. The image processing method according to claim 4, wherein the inputting the target text image and the target task identifier into a text image processing model to obtain a target text recognition result corresponding to the target image processing task in the target text image includes:
inputting the target text image and the detection and identification task identifier into a text image processing model to obtain a text detection and identification result corresponding to the text detection and identification task in the target text image; and/or
Inputting the target text image and the information extraction task identifier into a text image processing model to obtain a key information identification result corresponding to the key information extraction task in the target text image; and/or
And inputting the target text image and the table recognition task identifier into a text image processing model to obtain a table text recognition result corresponding to the table recognition task in the target text image.
6. The image processing method according to claim 1, wherein the text image processing model is obtained by training:
determining a plurality of sample image processing tasks, sample task identifiers corresponding to the sample image processing tasks and sample text images;
determining a sample character recognition result corresponding to a target sample image processing task in a target sample character image, wherein the target sample image processing task is any one of the sample image processing tasks, and the target sample character image is a sample character image corresponding to the target sample image processing task;
and training to obtain the text image processing model according to the target sample text image, the sample task identifier corresponding to the target sample image processing task and the sample text recognition result.
7. The image processing method according to claim 1, further comprising, before the determining the target text image and the target task identifier corresponding to the target text image:
receiving a target text image sent by a user through a user interaction interface of a client and a target task identifier corresponding to the target text image;
after the target text recognition result corresponding to the target image processing task in the target text image is obtained, the method further comprises the following steps:
and displaying the target character recognition result to the user through a user interaction interface of the client.
8. A training method of a text image processing model comprises the following steps:
determining a plurality of sample image processing tasks, sample task identifiers corresponding to the sample image processing tasks and sample text images;
determining a sample character recognition result corresponding to a target sample image processing task in a target sample character image, wherein the target sample image processing task is any one of the sample image processing tasks, and the target sample character image is a sample character image corresponding to the target sample image processing task;
Training to obtain a character image processing model according to the target sample character image, the sample task identifier corresponding to the target sample image processing task and the sample character recognition result.
9. The text-to-image processing model training method of claim 8, the text-to-image processing model comprising a structured sequence decoder, a detection frame decoder, and an identification content decoder;
The training to obtain the text image processing model according to the target sample text image, the sample task identifier corresponding to the target sample image processing task and the sample text recognition result comprises the following steps:
Training the structured sequence decoder, the detection frame decoder and the identification content decoder according to the target sample text image and the sample task identifier corresponding to the target sample image processing task;
And training to obtain the text image processing model by using the structured sequence decoder, the detection frame decoder and the identification content decoder according to the sample text recognition result.
10. The text-image processing model training method of claim 9, wherein training the structured sequence decoder, the detection frame decoder, and the identification content decoder according to the target sample text image, the sample task identifier corresponding to the target sample image processing task, comprises:
determining a sample character center point sequence, a sample detection bounding box and sample characters corresponding to the sample task identifier in the target sample character image, wherein the sample task identifier is a task identifier corresponding to the target sample image processing task, and the sample detection bounding box and the sample characters are obtained based on the character center point of the target sample character image;
training the structured sequence decoder according to the target sample text image and the sample text center point sequence;
training the detection frame decoder according to the target sample character image, the sample character center point sequence and the sample detection bounding box;
Training the identification content decoder according to the target sample character image, the sample character center point sequence and the sample characters.
11. The text image processing model training method according to claim 9, wherein training to obtain the text image processing model by using the structured sequence decoder, the detection frame decoder, and the recognition content decoder according to the sample text recognition result comprises:
obtaining a predicted sample character center point sequence output by the structured sequence decoder, a predicted sample detection bounding box output by the detection frame decoder and a predicted sample character output by the identification content decoder;
determining a predicted character recognition result according to the predicted sample character center point sequence, the predicted sample detection bounding box and the predicted sample character;
And training to obtain the text image processing model according to the sample text recognition result and the predicted text recognition result.
12. The text image processing model training method of claim 8, further comprising, before determining the plurality of sample image processing tasks, the sample task identifier corresponding to each sample image processing task, and the sample text image:
Determining a pre-training character image and character recognition results which accord with a target space window prompt and a target word prefix prompt in the pre-training character image;
inputting the pre-training text image, the target space window prompt and the target word prefix prompt into an initial text image processing model to obtain a predicted text recognition result conforming to the target space window prompt and the target word prefix prompt;
And training the initial text image processing model according to the predicted text recognition result and the text recognition result to obtain the text image processing model.
13. A training method of a text image processing model is applied to a cloud and comprises the following steps:
Receiving a plurality of sample image processing tasks, sample task identifiers corresponding to the sample image processing tasks and sample text images sent by a client;
determining a plurality of sample image processing tasks, sample task identifiers corresponding to the sample image processing tasks and sample text images;
determining a sample character recognition result corresponding to a target sample image processing task in a target sample character image, wherein the target sample image processing task is any one of the sample image processing tasks, and the target sample character image is a sample character image corresponding to the target sample image processing task;
Training to obtain a character image processing model according to the target sample character image, the sample task identifier corresponding to the target sample image processing task and the sample character recognition result.
14. An image processing apparatus comprising:
The system comprises a determining module, a processing module and a processing module, wherein the determining module is configured to determine a target text image and a target task identifier corresponding to the target text image, wherein the target task identifier is a task identifier of a target image processing task, and the target image processing task is any one or more of a plurality of initial image processing tasks;
An obtaining module configured to input the target text image and the target task identifier into a text image processing model, obtain a target text recognition result corresponding to the target image processing task in the target text image,
The text image processing model is obtained through training of sample text images, sample task identifiers corresponding to a plurality of sample image processing tasks respectively, and sample text recognition results corresponding to the sample image processing tasks in the sample text images.
15. A text image processing model training device, comprising:
The first determining module is configured to determine a plurality of sample image processing tasks, sample task identifiers corresponding to the sample image processing tasks and sample text images;
A second determining module, configured to determine a sample text recognition result corresponding to a target sample image processing task in a target sample text image, where the target sample image processing task is any one of the plurality of sample image processing tasks, and the target sample text image is a sample text image corresponding to the target sample image processing task;
And the training module is configured to train to obtain a word image processing model according to the target sample word image, the sample task identifier corresponding to the target sample image processing task and the sample word recognition result.
16. A computing device, comprising:
A memory and a processor;
the memory is configured to store computer executable instructions that, when executed by the processor, implement the steps of the image processing method of any one of claims 1 to 7, or the text image processing model training method of any one of claims 8 to 12.
17. A computer readable storage medium storing computer executable instructions which when executed by a processor implement the steps of the image processing method of any one of claims 1 to 7 or the text image processing model training method of any one of claims 8 to 12.
18. A computer program product comprising a computer program which, when executed by a processor, performs the steps of the image processing method of any one of claims 1 to 7 or the text image processing model training method of any one of claims 8 to 12.
CN202410044810.0A 2024-01-11 2024-01-11 Image processing method and device, and text image processing model training method and device Pending CN118038248A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410044810.0A CN118038248A (en) 2024-01-11 2024-01-11 Image processing method and device, and text image processing model training method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410044810.0A CN118038248A (en) 2024-01-11 2024-01-11 Image processing method and device, and text image processing model training method and device

Publications (1)

Publication Number Publication Date
CN118038248A 2024-05-14

Family

ID=90986821

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410044810.0A Pending CN118038248A (en) 2024-01-11 2024-01-11 Image processing method and device, and text image processing model training method and device

Country Status (1)

Country Link
CN (1) CN118038248A (en)

Similar Documents

Publication Publication Date Title
CN112507125A (en) Triple information extraction method, device, equipment and computer readable storage medium
CN113469067B (en) Document analysis method, device, computer equipment and storage medium
CN116049397B (en) Sensitive information discovery and automatic classification method based on multi-mode fusion
CN111767883A (en) Title correction method and device
CN113705733A (en) Medical bill image processing method and device, electronic device and storage medium
CN117521675A (en) Information processing method, device, equipment and storage medium based on large language model
CN114330966A (en) Risk prediction method, device, equipment and readable storage medium
CN114596566A (en) Text recognition method and related device
CN116610781A (en) Task model training method and device
CN116595154A (en) Task processing method and automatic question-answering method
CN108595466B (en) Internet information filtering and internet user information and network card structure analysis method
CN114328679A (en) Image processing method, image processing apparatus, computer device, and storage medium
CN113297525A (en) Webpage classification method and device, electronic equipment and storage medium
CN112883202A (en) Knowledge graph-based multi-component modeling method and system
CN111881900A (en) Corpus generation, translation model training and translation method, apparatus, device and medium
CN117093864A (en) Text generation model training method and device
CN117253239A (en) End-to-end document image translation method and device integrating layout information
CN116701604A (en) Question and answer corpus construction method and device, question and answer method, equipment and medium
CN116543798A (en) Emotion recognition method and device based on multiple classifiers, electronic equipment and medium
CN110472121A (en) Card information searching method, device, electronic equipment and computer readable storage medium
CN116311322A (en) Document layout element detection method, device, storage medium and equipment
CN116306506A (en) Intelligent mail template method based on content identification
CN118038248A (en) Image processing method and device, and text image processing model training method and device
CN115512340A (en) Intention detection method and device based on picture
CN113535970A (en) Information processing method and apparatus, electronic device, and computer-readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination