CN114495129B - Character detection model pre-training method and device


Info

Publication number
CN114495129B
Authority
CN
China
Prior art keywords: image, text, character, sample, result
Legal status: Active
Application number
CN202210405265.4A
Other languages
Chinese (zh)
Other versions
CN114495129A
Inventor
宋思博
万建强
杨志博
姚聪
Current Assignee
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Application filed by Alibaba China Co Ltd
Priority to CN202210405265.4A
Publication of CN114495129A
Application granted
Publication of CN114495129B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/237: Lexical tools
    • G06F 40/242: Dictionaries
    • G06F 40/30: Semantic analysis

Abstract

An embodiment of the present specification provides a character detection model pre-training method and device. The method includes: inputting text samples into a text encoder to obtain text features, and inputting image samples into an image encoder to obtain image features, wherein the text samples are extracted from the image samples; determining whether the image samples contain the text samples according to a data dictionary and the image features, to obtain a text inclusion result, wherein the data dictionary includes the text samples; determining the correspondence between the text samples and the image samples according to the text features and the image features, to obtain an image-text correspondence result; predicting the masked text samples according to the text features and the image features, to obtain a text prediction result; and adjusting parameters of the image encoder according to the text inclusion result, the image-text correspondence result and the text prediction result, to obtain a pre-trained text detection model. By endowing the visual representation with semantic knowledge, problems such as row/column ambiguity caused by insufficient semantic knowledge are avoided.

Description

Character detection model pre-training method and device
Technical Field
The embodiment of the specification relates to the technical field of model training, in particular to a pre-training method for a character detection model.
Background
With the rapid spread of personal consumer electronics (digital cameras, mobile phones, etc.) and the rapid digitization and informatization of industries such as finance, logistics, medical care and education, Optical Character Recognition (OCR) technology, which extracts text information from visual signals of multiple modalities (such as document images, card pictures and street-view videos), has been widely applied. With the advent of the deep learning era, OCR has gradually moved from recognizing text in simple scanned documents to handling complex text scenes in a wide range of settings, such as complex text layouts, artistic fonts, table text and even handwritten formulas. A general OCR pipeline is usually divided into three stages: text detection, text recognition and format understanding. In the text detection stage, a text detection model is trained on labeled data using deep learning to locate text line regions in a picture. Traditional text detection methods are data-driven: they learn the visual characteristics of text characters and of text lines (such as character spacing, font style and font size similarity) to detect text at line granularity. However, segmenting text lines in complex layouts often requires understanding the semantics of the text content (especially in Chinese scenes with row/column ambiguity or widely spaced words); text lines cannot be determined simply by the distance between characters, which degrades the text detection effect.
Disclosure of Invention
In view of this, the embodiments of the present disclosure provide a method for pre-training a text detection model. One or more embodiments of the present disclosure also relate to a pre-training apparatus for a text detection model, a computing device, a computer-readable storage medium, and a computer program, so as to solve technical deficiencies in the prior art.
According to a first aspect of embodiments of the present specification, there is provided a text detection model pre-training method, including:
inputting a text sample into a text encoder to obtain text features, and inputting an image sample into an image encoder to obtain image features, wherein the text sample corresponds to the image sample;
determining whether the image sample contains the text sample according to a data dictionary and the image features, to obtain a text inclusion result, wherein the data dictionary includes the text sample;
determining the correspondence between the text sample and the image sample according to the text features and the image features, to obtain an image-text correspondence result;
predicting the masked text sample according to the text features and the image features, to obtain a text prediction result;
and adjusting parameters of the image encoder according to the text inclusion result, the image-text correspondence result and the text prediction result, to obtain a pre-trained text detection model.
According to a second aspect of the embodiments of the present specification, there is provided a text detection model pre-training apparatus, including:
an encoding module, configured to input a text sample into a text encoder to obtain text features, and to input an image sample into an image encoder to obtain image features, wherein the text sample corresponds to the image sample;
a first task module, configured to determine whether the image sample contains the text sample according to a data dictionary and the image features, to obtain a text inclusion result, wherein the data dictionary includes the text sample;
a second task module, configured to determine the correspondence between the text sample and the image sample according to the text features and the image features, to obtain an image-text correspondence result;
a third task module, configured to predict the masked text sample according to the text features and the image features, to obtain a text prediction result;
and a parameter adjusting module, configured to adjust parameters of the image encoder according to the text inclusion result, the image-text correspondence result and the text prediction result, to obtain a pre-trained text detection model.
According to a third aspect of embodiments herein, there is provided a computing device comprising:
a memory and a processor;
the memory is used for storing computer executable instructions, and the processor is used for executing the computer executable instructions, and the computer executable instructions are executed by the processor to realize the steps of the character detection model pre-training method.
According to a fourth aspect of the embodiments of the present specification, there is provided a computer-readable storage medium storing computer-executable instructions, which when executed by a processor, implement the steps of the above-mentioned text detection model pre-training method.
According to a fifth aspect of embodiments herein, there is provided a computer program, wherein the computer program, when executed on a computer, causes the computer to perform the steps of the above-mentioned text detection model pre-training method.
An embodiment of the present specification provides a character detection model pre-training method and device. The method includes: inputting text samples into a text encoder to obtain text features, and inputting image samples into an image encoder to obtain image features, wherein the text samples are extracted from the image samples; determining whether the image samples contain the text samples according to a data dictionary and the image features, to obtain a text inclusion result, wherein the data dictionary includes the text samples; determining the correspondence between the text samples and the image samples according to the text features and the image features, to obtain an image-text correspondence result; predicting the masked text samples according to the text features and the image features, to obtain a text prediction result; and adjusting parameters of the image encoder according to the text inclusion result, the image-text correspondence result and the text prediction result, to obtain a pre-trained text detection model. Through joint training of the vision and language models, the language model and corpus information are embedded into the visual representation during the pre-training stage, so that the visual representation carries semantic knowledge; more accurate text line positions and line granularity can therefore be obtained in the text detection fine-tuning stage, and problems such as row/column ambiguity caused by insufficient semantic knowledge are avoided.
Drawings
FIG. 1 is a diagram illustrating a pre-training method for a text detection model according to an embodiment of the present disclosure;
FIG. 2 is a flow chart of a method for pre-training a text detection model according to an embodiment of the present disclosure;
FIG. 3 is a flowchart illustrating a processing procedure of a pre-training method for a text detection model according to an embodiment of the present disclosure;
FIG. 4 is a schematic structural diagram of a pre-training apparatus for a text detection model according to an embodiment of the present disclosure;
FIG. 5 is a block diagram of a computing device according to an embodiment of the present disclosure.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present description. This description may be implemented in many ways other than those specifically set forth herein, and those skilled in the art will appreciate that the present description is susceptible to similar generalizations without departing from the scope of the description, and thus is not limited to the specific implementations disclosed below.
The terminology used in the description of the one or more embodiments is for the purpose of describing the particular embodiments only and is not intended to be limiting of the description of the one or more embodiments. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present specification refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, etc. may be used in one or more embodiments herein to describe various information, this information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, without departing from the scope of one or more embodiments of the present specification, a first may also be referred to as a second and, similarly, a second may also be referred to as a first. Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining".
First, the noun terms to which one or more embodiments of the present specification relate are explained.
Data dictionary: used to define and describe data items, data structures, data flows, data storage, processing logic and the like; its purpose is to describe each element in a data-flow diagram in detail, and it serves as a simple modeling tool. In short, a data dictionary is a collection of information that describes data, namely the set of definitions of all data elements used in a system.
Feed-forward neural network: also called a forward network, the simplest type of neural network. Neurons are arranged in layers, each neuron is connected only to neurons in the preceding layer, receives the output of the previous layer, and passes its own output to the next layer, with no feedback between layers. It is one of the most widely applied and fastest-developing artificial neural networks.
Transformer: an attention-based encoder-decoder model, one of the commonly used deep learning models.
BERT: these are called Bidirectional Encoder responses from transforms, and can be roughly understood as a Bidirectional transform encoding method.
Residual network: a residual network is easy to optimize and can gain accuracy from considerably increased depth. Its internal residual blocks use skip connections, which alleviate the vanishing-gradient problem caused by increasing depth in deep neural networks.
Attention mechanism: originates from the study of human vision. In cognitive science, because of bottlenecks in information processing, humans selectively attend to a portion of all available information while ignoring the rest; this is commonly referred to as the attention mechanism. Different parts of the human retina have different information-processing abilities, i.e., acuity, and only the foveal region has the strongest acuity. To make reasonable use of limited visual processing resources, a human selects a specific part of the visual field and focuses on it; for example, when reading, a person usually attends to and processes only the few words currently being read. In summary, the attention mechanism has two main aspects: deciding which part of the input to attend to, and allocating the limited information-processing resources to the important parts.
Multi-head attention: uses multiple queries to select, in parallel, multiple pieces of information from the input.
Feature pyramid (Feature Pyramid Network): a basic component of recognition systems for detecting objects at different scales.
Loss function: a function that maps the value of a random event, or of its associated random variables, to a non-negative real number representing the "risk" or "loss" of that event. In applications, the loss function is usually associated with an optimization problem as a learning criterion, i.e., the model is solved and evaluated by minimizing the loss function.
Model fine-tuning: given a pre-trained model, the model is further trained (fine-tuned) for the target task. Compared with training a model from scratch, fine-tuning saves a large amount of computing resources and time, improves computational efficiency, and can even improve accuracy.
Existing text detection pre-training algorithms fall mainly into two categories: pre-training schemes supervised by text position labels, and pre-training schemes supervised by text content. Among them, pre-training on pure visual signals with picture and position supervision is the most common and most widely used. Specifically, there are the following two schemes.
The first is backbone network (backbone) pre-training on cross-domain, non-text datasets (e.g., ImageNet). In this scheme, a model is trained on classification and detection tasks over general datasets for picture classification, object detection and the like, so that the backbone network acquires the ability to represent low-level visual features (such as edges and textures). The backbone network is then transferred as the initialization of the text detection model's backbone and further trained in the fine-tuning stage. The second is pre-training the backbone network together with a text detection head on cross-domain (real-domain and synthetic-domain) text datasets (such as MLT, SynthText and the like).
Methods based on text position labels are direct, but because they do not model the semantic information of the text content in the image, they are constrained by the need for semantic understanding in Chinese scenes such as row/column ambiguity and widely spaced words: they cannot accurately predict how characters should be joined into lines or at what granularity, which reduces the text detection effect. For example, the vertical lines of an ancient poem are often mistakenly detected as lines laid out horizontally, and fields such as "name" or "gender" that frequently appear on cards followed by large gaps are detected as two separate blocks of text. In these scenarios, without the ability to understand semantic information, such line granularity cannot be judged from visual features alone, which also creates great difficulty for downstream text understanding and extraction.
In addition, pre-training schemes supervised by text content are currently rare. The STKM method uses an image encoder combined with an attention-based text decoder to decode the picture end to end in a recognition-like paradigm, training the model to predict the text content. It uses character-level text content labels as supervision to train a backbone network whose visual representation contains text semantics, and implicitly aligns the semantic and visual signals of the text through the attention mechanism so that the network learns to locate text regions. However, because this method maps the picture signal from the visual modality to the semantic signal through a recognition-like process, this single-direction modality mapping leaves the interaction between the two modalities insufficient, and the semantic signal is not well embedded into and aligned with the visual signal. Meanwhile, because decoding is performed at the character level, word-sequence information in dictionaries and corpora is not well exploited and the language model's ability to model text is not fully used; as a result, insufficient linguistic information is embedded in the visual representation and the text detection effect cannot be improved well.
Based on this, in the present specification, a method for pre-training a text detection model is provided, and the present specification also relates to a device for pre-training a text detection model, a computing device, and a computer-readable storage medium, which are described in detail in the following embodiments one by one.
Referring to FIG. 1, FIG. 1 is a schematic diagram illustrating a text detection model pre-training method according to an embodiment of the present disclosure. The framework includes: a text encoder, an image encoder, a cross-modal decoder, a data dictionary, an Image-Text Contrastive learning (ITC) module, a Word-in-Image Prediction (WIP) module and a Masked Language Modeling (MLM) module.
The image encoder mainly consists of a residual network layer (ResNet), a feature pyramid network layer (Feature Pyramid Network) and an attention pooling layer, and outputs image feature vectors from the input image samples. The data dictionary contains correct (positive) text samples and incorrect (negative) text samples. The Image-Text Contrastive learning (ITC) module constructs a batch of image-text pairs as training data and implements two subtasks: the first subtask predicts, for each piece of text in the batch, the picture it corresponds to; the second subtask predicts, for each image in the batch, the text it corresponds to. The Word-in-Image Prediction (WIP) module performs the task of predicting whether a word appears in the picture. The Masked Language Modeling (MLM) module randomly masks part of the words on the text side and predicts the masked words from the image representation. The cross-modal decoder consists of a multi-head self-attention module, a multi-head cross-attention module and a feed-forward network module.
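To make the data flow between these modules concrete, the following is a minimal sketch of how one pre-training step could wire them together. It is an illustrative assumption rather than the patent's reference implementation: the class names, the use of PyTorch, the fixed temperature value, the pooling of the text via its first token and all dimensions are placeholders.

```python
import torch
import torch.nn as nn

class PretrainModel(nn.Module):
    """Illustrative wiring of the modules in FIG. 1 (an assumed sketch, not the patent's code)."""
    def __init__(self, img_encoder, txt_encoder, cross_decoder, dim=512, vocab_size=30522):
        super().__init__()
        self.img_encoder = img_encoder      # ResNet + feature pyramid + attention pooling
        self.txt_encoder = txt_encoder      # e.g. a BERT-style Transformer encoder
        self.cross_decoder = cross_decoder  # self-attention + cross-attention + feed-forward
        self.mlm_head = nn.Linear(dim, vocab_size)   # predicts masked words over the vocabulary
        self.tau = 0.07                              # temperature hyperparameter (assumed value)

    def forward(self, images, token_ids, dict_word_feats):
        img_feat = self.img_encoder(images)          # (B, D) pooled image features
        txt_feat = self.txt_encoder(token_ids)       # (B, L, D) token-level text features
        # Word-in-image prediction: score every dictionary word against every image
        wip_logits = img_feat @ dict_word_feats.t() / self.tau
        # Image-text contrastive learning: pool the text with its first token (assumption)
        itc_logits = img_feat @ txt_feat[:, 0].t() / self.tau
        # Masked language modeling: fuse text with a single pooled image token (assumption)
        fused = self.cross_decoder(txt_feat, img_feat.unsqueeze(1))
        mlm_logits = self.mlm_head(fused)
        return wip_logits, itc_logits, mlm_logits
```

The three returned logit tensors feed the word-in-image prediction, image-text contrastive learning and masked language modeling objectives described in the following steps.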
Referring to fig. 2, fig. 2 is a flowchart illustrating a method for pre-training a text detection model according to an embodiment of the present disclosure, which specifically includes the following steps.
Step 202: inputting the text samples into a text encoder to obtain text characteristics, and inputting the image samples into an image encoder to obtain image characteristics, wherein the text samples correspond to the image samples.
Here, the text sample may be a sample in Chinese, English or another text format; the image sample may be a sample in picture format; the text encoder may be a Transformer encoder, for example a BERT encoder, a RoBERTa encoder or a GPT encoder; the image encoder may be, for example, a ViT encoder or a ResNet-based encoder; the text features may be the vector features output by the text encoder; and the image features may be the vector features output by the image encoder. That the text sample corresponds to the image sample can be understood as follows: text samples are randomly generated from a corpus, and the image samples are synthesized by rendering the generated text samples into pictures.
For example, text samples include: "why", "last", and "the", wherein the image samples include picture 1, picture 2, and picture 3, and the text feature T vectors are obtained by inputting "why", "last", and "the" into the text encoder, and the image feature I vectors are obtained by inputting picture 1, picture 2, and picture 3 into the image encoder.
In one implementation, multiple network layers are included in the image encoder.
Specifically, the inputting an image sample into an image encoder to obtain an image feature includes:
and sequentially passing the image sample through a residual network layer, a feature pyramid layer and an attention pooling layer to obtain the image features.
Here, the residual network layer may be a network layer such as ResNeXt, DenseNet, ViT or Swin Transformer; the feature pyramid layer can be understood as a network layer built from feature pyramid components; and the attention pooling layer can be understood as a pooling layer with an attention mechanism added.
In practical applications, the final purpose of pre-training is to adjust the parameters of the residual network layer; the parameter-adjusted residual network layer is then used as the backbone network, so that a better text line detection effect is obtained in the fine-tuning stage.
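A minimal sketch of such an image encoder is given below, assuming a torchvision ResNet-50 as the residual network layer, torchvision's FeaturePyramidNetwork as the feature pyramid layer, and a small learned-query attention pooling layer; the specific backbone, channel sizes and the choice to pool the finest pyramid level are assumptions for illustration only.

```python
import torch
import torch.nn as nn
from collections import OrderedDict
from torchvision.models import resnet50
from torchvision.models.feature_extraction import create_feature_extractor
from torchvision.ops import FeaturePyramidNetwork

class AttentionPool(nn.Module):
    """Pools a set of spatial features into one vector with a learned query (assumed design)."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                          # x: (B, N, dim) flattened spatial features
        q = self.query.expand(x.size(0), -1, -1)
        pooled, _ = self.attn(q, x, x)
        return pooled.squeeze(1)                   # (B, dim)

class ImageEncoder(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        backbone = resnet50()
        # Residual network layer: take the four ResNet stage outputs
        self.body = create_feature_extractor(
            backbone, return_nodes={"layer1": "c2", "layer2": "c3", "layer3": "c4", "layer4": "c5"})
        # Feature pyramid layer
        self.fpn = FeaturePyramidNetwork([256, 512, 1024, 2048], dim)
        # Attention pooling layer
        self.pool = AttentionPool(dim)

    def forward(self, images):                     # images: (B, 3, H, W)
        feats = self.fpn(OrderedDict(self.body(images)))
        finest = feats["c2"]                       # pool the finest pyramid level, by assumption
        tokens = finest.flatten(2).transpose(1, 2) # (B, H*W, dim)
        return self.pool(tokens)                   # (B, dim) image feature
```

In the fine-tuning stage, only the `self.body` part (the residual network layer) would be carried over as the backbone of the text detection model.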
Step 204: and determining whether the image sample contains the character sample according to a data dictionary and the image characteristics to obtain a character containing result, wherein the data dictionary comprises the character sample.
The data dictionary may be established in advance or established when this step is performed; it includes positive samples and negative samples of the text samples, for example, the positive sample "why" and the negative sample "way". The text inclusion result can be understood as the result of whether the text sample is contained in the image sample; for example, the text in picture 1 is "why", so for the text sample "why" the text inclusion result is "contained".
In practical applications, whether each word is in the picture is predicted by comparing the picture representation with the representations of positive and negative word samples (Word-in-Image Prediction). The model is trained with the natural supervision of whether a word is in the picture, so that, among all candidate words, it predicts the positive-sample word that actually appears in the picture. To make the prediction harder, difficult negative word samples are further constructed by Online Hard Example Mining: several words whose representations are close to that of the positive-sample word are sampled from the dictionary as hard negatives, and the network must then predict, from among them, the positive-sample word contained in each picture.
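A sketch of such online hard example mining is shown below; it assumes the dictionary words have already been embedded by the text encoder, and the cosine-similarity criterion and the `num_neg` parameter are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def mine_hard_negatives(pos_indices, dict_word_feats, num_neg=5):
    """For each positive word in the batch, pick the dictionary words whose text-encoder
    features are closest to it (cosine similarity) as difficult negative samples."""
    feats = F.normalize(dict_word_feats, dim=-1)              # (V, D) all dictionary word features
    sim = feats[pos_indices] @ feats.t()                      # (B, V) positive word vs. all words
    sim.scatter_(1, pos_indices.unsqueeze(1), float("-inf"))  # never select the word itself
    return sim.topk(num_neg, dim=1).indices                   # (B, num_neg) hard-negative word ids
```

The positive words and the mined negatives together form the batch's data dictionary for the word-in-image prediction task.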
In an implementation manner, before determining whether the image sample contains the text sample according to the data dictionary and the image feature, the method further includes:
determining a character negative sample similar to the character sample according to the character sample;
and determining the data dictionary according to the character samples and the character negative samples.
For example, the text samples are "why", "last", "the", and a negative sample of the text sample is found, wherein the negative sample of "why" is "way", and the negative sample of "last" is: negative examples of "lost", "the" are: "she", then the data dictionary includes: "why", "way", "last", "lost", "the", "she".
The number of the negative examples of the text example may be plural, and the embodiments of the present specification are not limited thereto.
Further, the determining whether the image sample includes the text sample according to the data dictionary and the image feature to obtain a text-including result includes:
generating dictionary character sample characteristics according to the dictionary character samples in the data dictionary;
and comparing the dictionary character sample characteristics with the image characteristics to obtain the character containing result.
Following the above example, the text samples include "why", "last" and "the", the image samples include picture 1, picture 2 and picture 3, and the data dictionary includes "why", "way", "last", "lost", "the" and "she", where picture 1 contains "why", picture 2 contains "last" and picture 3 contains "the". Based on the image features and the dictionary word sample features, the model predicts which of "why", "way", "last", "lost", "the" and "she" is contained in picture 1, and correspondingly which of them is contained in picture 2 and which in picture 3; once these predictions are completed, the text inclusion result is obtained.
In the embodiment of the description, the positive sample and the negative sample of the text sample are used, so that sufficient semantic information can be embedded into the visual representation, and the text detection effect is improved.
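The word-in-image prediction of this step can be written as a classification over the data dictionary. The following sketch assumes an InfoNCE-style formulation with feature normalization and a temperature of 0.07; both choices are assumptions, not a form stated by the patent.

```python
import torch
import torch.nn.functional as F

def wip_loss(image_feats, dict_word_feats, pos_indices, tau=0.07):
    """Word-in-image prediction: score every dictionary word (positives plus hard negatives)
    against every image and train the image feature to pick out its positive word."""
    img = F.normalize(image_feats, dim=-1)        # (B, D) image features
    words = F.normalize(dict_word_feats, dim=-1)  # (V, D) dictionary word features
    logits = img @ words.t() / tau                # (B, V) word-in-image scores
    return F.cross_entropy(logits, pos_indices)   # pos_indices: (B,) index of each picture's word
```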
Step 206: and determining the corresponding relation between the text sample and the image sample according to the text characteristics and the image characteristics to obtain a text-image corresponding result.
The image-text corresponding result can be understood as an image sample corresponding to the text sample, or a text sample corresponding to the image sample.
In practical applications, matching the text samples to the image samples, or the image samples to the text samples, enables fine-grained cross-modal interaction and improves the text detection effect.
Specifically, the determining the corresponding relationship between the text sample and the image sample according to the text feature and the image feature includes:
and comparing each character feature with all picture features, and determining the corresponding relation between the character sample and the image sample.
For example, text samples include: why, last and the image samples comprise pictures 1, 2 and 3, the corresponding pictures in pictures 1, 2 and 3 are judged according to the character characteristics and the image characteristics, the corresponding pictures in pictures 1, 2 and 3 are judged according to "why", the corresponding pictures in pictures 1, 2 and 3 are judged according to "last", and the corresponding pictures in pictures 1, 2 and 3 are judged according to "the".
Preferably, the determining the correspondence between the text sample and the image sample according to the text feature and the image feature includes:
and comparing each image characteristic with all character characteristics to determine the corresponding relation between the character sample and the image sample.
For example, text samples include: "why", "last", "the", the image sample includes picture 1, picture 2, picture 3, the word corresponding to picture 1 in "why", "last", "the" is judged according to the character feature and the image feature, the word corresponding to picture 2 in "why", "last", "the" is judged, and the word corresponding to picture 3 in "why", "last", "the" is judged.
In the embodiment of the specification, the text sample is used for corresponding to the image sample, and the image sample is used for corresponding to the text sample, so that fine-grained modal interaction can be performed, and the text detection effect is improved.
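Both directions of this correspondence can be trained jointly with a batch-level contrastive objective. The sketch below assumes a CLIP-style formulation in which the j-th text in the batch is the positive match for the j-th image; the normalization, temperature and weighting values are assumptions.

```python
import torch
import torch.nn.functional as F

def itc_loss(image_feats, text_feats, tau=0.07, lam1=0.5, lam2=0.5):
    """Image-text contrastive learning: each image must pick its own text out of the batch
    (image-to-text) and each text must pick its own image (text-to-image)."""
    img = F.normalize(image_feats, dim=-1)            # (N, D)
    txt = F.normalize(text_feats, dim=-1)             # (N, D)
    logits = img @ txt.t() / tau                      # (N, N) image-to-text similarities
    targets = torch.arange(img.size(0), device=img.device)
    loss_i2t = F.cross_entropy(logits, targets)       # predict the text for each image
    loss_t2i = F.cross_entropy(logits.t(), targets)   # predict the image for each text
    return lam1 * loss_t2i + lam2 * loss_i2t
```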
Step 208: and predicting the covered character samples according to the character features and the image features to obtain a character prediction result.
A masked text sample can be understood as a text sample into which a masking identifier has been inserted; the masking identifier may be added manually or at random, and the embodiments of the present specification are not limited in this respect.
Specifically, predicting the masked text sample according to the text features and the image features includes:
masking the text features according to a preset rule to obtain partial text features;
and obtaining the masked text sample according to the partial text features and the image features.
The preset masking rule may be a rule for random masking, a rule specifying which positions to mask, and the like; for example, the text features at odd positions are masked.
For example, text samples include: "why", "last", and "the" image samples include picture 1, picture 2, and picture 3, and the "last" in the text samples is masked, so that the text features of "why" and "the" remain, and "last" is predicted according to the image features of picture 1, picture 2, and picture 3.
In one implementation, predicting the masked text samples from the text features and the image features may be performed by a cross decoder, as described below.
Specifically, the obtaining the masked text sample according to the partial text feature and the image feature includes:
and inputting the partial text features and the image features into a cross-modal decoder to obtain the text samples corresponding to the partial text features and the masked text samples, wherein the cross-modal decoder includes a multi-head self-attention module, a multi-head cross-attention module and a feed-forward network module.
In practical applications, the image features and text features obtained by the image encoder and the text encoder are passed to a cross-modal decoder composed of a multi-head self-attention module, a multi-head cross-attention module and a feed-forward network module, which predicts the masked text samples.
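A minimal sketch of one such cross-modal decoder layer is given below; the pre-norm residual structure, GELU activation and layer sizes are assumptions, and a full decoder would stack several of these layers.

```python
import torch
import torch.nn as nn

class CrossModalDecoderLayer(nn.Module):
    """Multi-head self-attention over the text tokens, multi-head cross-attention from the
    text tokens to the image tokens, then a feed-forward network (illustrative layer sizes)."""
    def __init__(self, dim=256, heads=8, ffn_dim=1024):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, ffn_dim), nn.GELU(), nn.Linear(ffn_dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, text_tokens, image_tokens):
        x = text_tokens
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h)[0]                                      # text attends to text
        x = x + self.cross_attn(self.norm2(x), image_tokens, image_tokens)[0]   # text attends to image
        x = x + self.ffn(self.norm3(x))
        return x   # fused features from which the masked words are predicted
```

The output at each masked position is then projected onto the vocabulary to predict the masked word.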
In the embodiments of the present specification, masking some text features and letting the network predict the masked words allows the semantic signal to be fully embedded into and aligned with the visual signal, improving the text detection effect.
Step 210: and adjusting parameters of the image encoder according to the inclusion result, the image-text corresponding result and the text prediction result to obtain a pre-training text detection model.
In practical applications, the text inclusion result, the image-text correspondence result and the text prediction result each correspond to a loss function, and these loss functions drive the parameter updates that train the image encoder.
Specifically, the performing parameter adjustment on the image encoder according to the inclusion result, the image-text corresponding result, and the text prediction result to obtain a pre-training text detection model includes:
obtaining a first loss function, a second loss function and a third loss function according to the inclusion result, the image-text corresponding result and the text prediction result;
obtaining a superposition loss function according to the first loss function, the second loss function and the third loss function;
and adjusting parameters of the image encoder according to the superposition loss function to obtain the pre-training character detection model.
In practical applications, the first loss function (for word-in-image prediction) may take the following form:

L_WIP = -log( exp(I · W+ / τ) / Σ_k exp(I · W_k / τ) )

where W_k denotes the feature of the k-th word in the data dictionary, W+ the feature of the positive-sample word contained in the image, I the image feature, exp the exponential function, log the logarithmic function, and τ a temperature hyperparameter.
The second loss function (for image-text contrastive learning) is a weighted sum of the text-to-image and image-to-text contrastive losses:

L_ITC = λ1 · L_t2i + λ2 · L_i2t

where λ1 and λ2 are weighting coefficients; preferably, λ1 and λ2 both have a value of 0.5.

L_i2t = -(1/N) Σ_j log( exp(I_j · T_j / τ) / Σ_k exp(I_j · T_k / τ) )

L_t2i = -(1/N) Σ_j log( exp(T_j · I_j / τ) / Σ_k exp(T_j · I_k / τ) )

where I_j denotes the feature of the j-th image, T_j the text feature contained in the j-th image, N the batch size, exp the exponential function, log the logarithmic function, and τ a temperature hyperparameter.
The third loss function (for masked language modeling) is:

L_MLM = -E[ log P(W_masked | W_unmasked, V) ]

where W_masked denotes the masked text features, W_unmasked the unmasked text features, E the expectation, P the predicted probability, and V the image features.
The superposition loss function is the sum of the three loss functions:

L = L_WIP + L_ITC + L_MLM
for example, if the value calculated by the first loss function is 5, the value calculated by the second loss function is 5, and the value calculated by the second loss function is 5, the value calculated by the superposition loss function is 15, and the image encoder is subjected to parameter adjustment according to 15, so as to obtain the pre-training character detection model.
Further, the performing parameter adjustment on the image encoder according to the superposition loss function to obtain the pre-training text detection model includes:
and modifying parameters in the image encoder according to the superposition loss function to obtain the pre-trained text detection model, wherein the image encoder includes a residual network layer, a feature pyramid layer and an attention pooling layer.
In practical applications, the loss is propagated backwards through the network layers in reverse order, i.e., parameters are modified from the last network layer towards the first. After the parameter adjustment is completed, a trained image encoder is obtained, from which the pre-trained text detection model can be derived.
Specifically, the modifying the parameters in the image encoder according to the superposition loss function to obtain the pre-training text detection model includes:
determining parameters of the feature pyramid layer according to the superposition loss function and the parameters of the attention pooling layer;
and determining parameters of the residual network layer according to the parameters of the feature pyramid layer, and taking the residual network layer as the pre-trained text detection model.
Following the above example, the parameters of the feature pyramid layer are updated according to the superposition loss value of 15 and the parameters of the attention pooling layer; the parameters of the residual network layer are then updated according to the parameters of the feature pyramid layer; once the parameters of the residual network layer have been updated, the residual network layer is taken as the pre-trained text detection model.
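A sketch of this parameter adjustment step is given below; the optimizer is assumed to hold the image encoder's parameters, and the `body` attribute name for the residual network layer follows the earlier illustrative `ImageEncoder` sketch, so both are assumptions rather than details stated by the patent.

```python
import torch

def pretrain_step(loss_wip, loss_itc, loss_mlm, optimizer):
    """Sum the three task losses into the superposition loss and back-propagate it; gradients
    flow from the attention pooling layer through the feature pyramid layer to the residual
    network layer held by the optimizer."""
    total_loss = loss_wip + loss_itc + loss_mlm   # e.g. 5 + 5 + 5 = 15
    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()
    return total_loss.item()

def export_backbone(image_encoder, path="pretrained_backbone.pt"):
    """Keep the parameter-adjusted residual network layer as the pre-trained backbone for the
    text detection fine-tuning stage ('body' is the assumed attribute name)."""
    torch.save(image_encoder.body.state_dict(), path)
```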
It should be noted that, after the pre-trained text detection model is obtained, model fine-tuning is still required in order to avoid risks such as non-convergence, insufficiently optimized parameters, low accuracy, poor generalization and overfitting, and to obtain parameters, and thus a model, better suited to the task requirements.
The embodiment of the present specification provides a character detection model pre-training method, including the following steps: inputting text samples into a text encoder to obtain text features, and inputting image samples into an image encoder to obtain image features, wherein the text samples are extracted from the image samples; determining whether the image samples contain the text samples according to a data dictionary and the image features, to obtain a text inclusion result, wherein the data dictionary includes the text samples; determining the correspondence between the text samples and the image samples according to the text features and the image features, to obtain an image-text correspondence result; predicting the masked text samples according to the text features and the image features, to obtain a text prediction result; and adjusting parameters of the image encoder according to the text inclusion result, the image-text correspondence result and the text prediction result, to obtain a pre-trained text detection model. Through joint training of the vision and language models, the language model and corpus information are embedded into the visual representation during the pre-training stage, so that the visual representation carries semantic knowledge; more accurate text line positions and line granularity can therefore be obtained in the text detection fine-tuning stage, and problems such as row/column ambiguity caused by insufficient semantic knowledge are avoided.
The following describes the pre-training method of the text detection model by taking the application of the pre-training method of the text detection model provided in this specification to a server as an example, with reference to fig. 3. Fig. 3 shows a processing flow chart of a method for pre-training a text detection model according to an embodiment of the present specification, which specifically includes the following steps.
Step 302: the server inputs the character sample into the text encoder to obtain character characteristics.
For example, text samples include: "why", "last", "the", inputting "why", "last", "the" into the text encoder to get a plurality of text feature T vectors.
Step 304: and the server inputs the image sample into an image encoder to obtain the image characteristics.
For example, the image samples include picture 1, picture 2, and picture 3, and the image 1, picture 2, and picture 3 are input to the image encoder to obtain a plurality of image feature I vectors.
Step 306: and the server determines whether the image sample contains the character sample according to the data dictionary and the image characteristics to obtain a character containing result.
Following the above example, the text samples include "why", "last" and "the", the image samples include picture 1, picture 2 and picture 3, and the data dictionary includes "why", "way", "last", "lost", "the" and "she", where picture 1 contains "why", picture 2 contains "last" and picture 3 contains "the". Based on the image features and the dictionary word sample features, the model predicts which of "why", "way", "last", "lost", "the" and "she" is contained in picture 1, and correspondingly which of them is contained in picture 2 and which in picture 3; once these predictions are completed, the text inclusion result is obtained.
Step 308: and the server determines the corresponding relation between the text sample and the image sample according to the text characteristics and the image characteristics to obtain a text-image corresponding result.
Following the above example, text samples include: "why", "last", and "the", wherein the image sample includes picture 1, picture 2, and picture 3, the corresponding picture of "why" in picture 1, picture 2, and picture 3 is judged according to the character feature and the image feature, the corresponding picture of "last" in picture 1, picture 2, and picture 3 is judged, and the corresponding picture of "the" in picture 1, picture 2, and picture 3 is judged. Judging the corresponding words of the picture 1 in the words of "why", "last", "the", judging the corresponding words of the picture 2 in the words of "why", "last", "the", and judging the corresponding words of the picture 3 in the words of "why", "last", "the", according to the character features and the image features.
Step 310: and the server predicts the covered character samples according to the character features and the image features to obtain a character prediction result.
Following the above example, text samples include: "why", "last", and "the" image samples include picture 1, picture 2, and picture 3, and the "last" in the text samples is masked, so that the text features of "why" and "the" remain, and "last" is predicted according to the image features of picture 1, picture 2, and picture 3.
Step 312: and the server obtains a first loss function, a second loss function and a third loss function according to the inclusion result, the image-text corresponding result and the text prediction result.
Following the above example, the value calculated from the first loss function is 5, the value calculated from the second loss function is 5, and the value calculated from the third loss function is 5.
Step 314: and the server obtains a superposition loss function according to the first loss function, the second loss function and the third loss function.
According to the above example, the superposition loss function is 5+5+5, the calculated value is 15, and the image encoder is subjected to parameter adjustment according to 15 to obtain the pre-training character detection model.
Step 316: and the server adjusts the parameters of the image encoder according to the superposition loss function to obtain the pre-training character detection model.
Following the above example, the parameters of the feature pyramid layer are updated according to the superposition loss value of 15 and the parameters of the attention pooling layer; the parameters of the residual network layer are then updated according to the parameters of the feature pyramid layer; once the parameters of the residual network layer have been updated, the residual network layer is taken as the pre-trained text detection model.
In the embodiment of the specification, through the joint training of the vision model and the language model, the language model and the corpus information are embedded into the vision representation in the pre-training stage, so that the vision representation has semantic knowledge, more accurate character row positions and row granularity can be obtained in the character detection fine-tuning stage, and the problems of row-column ambiguity and the like caused by insufficient semantic knowledge are avoided.
Corresponding to the above method embodiment, the present specification further provides an embodiment of a pre-training device for a text detection model, and fig. 4 shows a schematic structural diagram of the pre-training device for the text detection model provided in an embodiment of the present specification. As shown in fig. 4, the apparatus includes:
an encoding module 402 configured to input a text sample into a text encoder to obtain a text feature and input an image sample into an image encoder to obtain an image feature, wherein the text sample corresponds to the image sample;
a first task module 404, configured to determine whether the image sample contains the text sample according to the data dictionary and the image feature, and obtain a text containing result, where the data dictionary includes the text sample;
a second task module 406, configured to determine a correspondence between the text sample and the image sample according to the text feature and the image feature, so as to obtain a result corresponding to an image and text;
a third task module 408, configured to predict the masked text samples according to the text features and the image features, and obtain a text prediction result;
a parameter adjusting module 410, configured to perform parameter adjustment on the image encoder according to the inclusion result, the image-text corresponding result, and the text prediction result, so as to obtain a pre-training text detection model.
Optionally, the encoding module 402 is further configured to:
and sequentially passing the image sample through a residual network layer, a feature pyramid layer and an attention pooling layer to obtain the image features.
Optionally, the first task module 404 is further configured to:
determining a character negative sample similar to the character sample according to the character sample;
and determining the data dictionary according to the character samples and the character negative samples.
Optionally, the first task module 404 is further configured to:
generating dictionary character sample characteristics according to the dictionary character samples in the data dictionary;
and comparing the dictionary character sample characteristics with the image characteristics to obtain the character containing result.
Optionally, the second task module 406 is further configured to:
and comparing each character feature with all picture features, and determining the corresponding relation between the character sample and the image sample.
Optionally, the second task module 406 is further configured to:
and comparing each image characteristic with all character characteristics to determine the corresponding relation between the character sample and the image sample.
Optionally, the third task module 408 is further configured to:
masking the text features according to a preset rule to obtain partial text features;
and obtaining the masked text sample according to the partial text features and the image features.
Optionally, the third task module 408 is further configured to:
and inputting the partial text features and the image features into a cross-modal decoder to obtain the text samples corresponding to the partial text features and the masked text samples, wherein the cross-modal decoder includes a multi-head self-attention module, a multi-head cross-attention module and a feed-forward network module.
Optionally, the parameter adjusting module 410 is further configured to:
obtaining a first loss function, a second loss function and a third loss function according to the inclusion result, the image-text corresponding result and the text prediction result;
obtaining a superposition loss function according to the first loss function, the second loss function and the third loss function;
and adjusting parameters of the image encoder according to the superposition loss function to obtain the pre-training character detection model.
Optionally, the parameter adjusting module 410 is further configured to:
and modifying parameters in the image encoder according to the superposition loss function to obtain the pre-trained text detection model, wherein the image encoder includes a residual network layer, a feature pyramid layer and an attention pooling layer.
Optionally, the parameter adjusting module 410 is further configured to:
determining parameters of the feature pyramid layer according to the superposition loss function and the parameters of the attention pooling layer;
and determining parameters of the residual network layer according to the parameters of the feature pyramid layer, and taking the residual network layer as the pre-trained text detection model.
The embodiment of the present specification provides a character detection model pre-training device, which inputs text samples into a text encoder to obtain text features, and inputs image samples into an image encoder to obtain image features, wherein the text samples are extracted from the image samples; determines whether the image samples contain the text samples according to a data dictionary and the image features, to obtain a text inclusion result, wherein the data dictionary includes the text samples; determines the correspondence between the text samples and the image samples according to the text features and the image features, to obtain an image-text correspondence result; predicts the masked text samples according to the text features and the image features, to obtain a text prediction result; and adjusts parameters of the image encoder according to the text inclusion result, the image-text correspondence result and the text prediction result, to obtain a pre-trained text detection model. Through joint training of the vision and language models, the language model and corpus information are embedded into the visual representation during the pre-training stage, so that the visual representation carries semantic knowledge; more accurate text line positions and line granularity can therefore be obtained in the text detection fine-tuning stage, and problems such as row/column ambiguity caused by insufficient semantic knowledge are avoided.
The foregoing is a schematic scheme of a pre-training apparatus for a text detection model according to this embodiment. It should be noted that the technical scheme of the pre-training device for the text detection model and the technical scheme of the pre-training method for the text detection model belong to the same concept, and details of the technical scheme of the pre-training device for the text detection model, which are not described in detail, can be referred to the description of the technical scheme of the pre-training method for the text detection model.
FIG. 5 illustrates a block diagram of a computing device 500 provided in accordance with one embodiment of the present description. The components of the computing device 500 include, but are not limited to, a memory 510 and a processor 520. Processor 520 is coupled to memory 510 via bus 530, and database 550 is used to store data.
Computing device 500 also includes access device 540, which enables computing device 500 to communicate via one or more networks 560. Examples of such networks include the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the Internet. Access device 540 may include one or more of any type of network interface (e.g., a Network Interface Card (NIC)), whether wired or wireless, such as an IEEE 802.11 Wireless Local Area Network (WLAN) wireless interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, a Near Field Communication (NFC) interface, and so forth.
In one embodiment of the present description, the above-described components of computing device 500, as well as other components not shown in FIG. 5, may also be connected to each other, such as by a bus. It should be understood that the block diagram of the computing device structure shown in FIG. 5 is for illustration purposes only and is not intended to limit the scope of the present description. Those skilled in the art may add or replace other components as desired.
Computing device 500 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., tablet computer, personal digital assistant, laptop computer, notebook computer, netbook, etc.), mobile phone (e.g., smartphone), wearable computing device (e.g., smartwatch, smart glasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 500 may also be a mobile or stationary server.
The processor 520 is configured to execute computer-executable instructions, which when executed by the processor implement the steps of the text detection model pre-training method.
The above is an illustrative scheme of a computing device of the present embodiment. It should be noted that the technical solution of the computing device and the technical solution of the text detection model pre-training method belong to the same concept, and details of the technical solution of the computing device, which are not described in detail, can be referred to the description of the technical solution of the text detection model pre-training method.
An embodiment of the present specification further provides a computer-readable storage medium, which stores computer-executable instructions, and when the computer-executable instructions are executed by a processor, the steps of the text detection model pre-training method are implemented.
The above is an illustrative scheme of a computer-readable storage medium of the present embodiment. It should be noted that the technical solution of the storage medium and the technical solution of the text detection model pre-training method belong to the same concept, and details that are not described in detail in the technical solution of the storage medium can be referred to the description of the technical solution of the text detection model pre-training method.
An embodiment of the present specification further provides a computer program, wherein when the computer program is executed in a computer, the computer is caused to execute the steps of the text detection model pre-training method.
The above is an illustrative scheme of a computer program of the present embodiment. It should be noted that the technical solution of the computer program and the technical solution of the above-mentioned pre-training method for the text detection model belong to the same concept, and details of the technical solution of the computer program, which are not described in detail, can be referred to the description of the technical solution of the above-mentioned pre-training method for the text detection model.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The computer instructions comprise computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be appropriately increased or decreased as required by legislation and patent practice in the relevant jurisdiction; for example, in some jurisdictions, computer-readable media do not include electrical carrier signals and telecommunications signals in accordance with legislation and patent practice.
It should be noted that, for the sake of simplicity, the foregoing method embodiments are described as a series of acts, but those skilled in the art should understand that the present embodiment is not limited by the described acts, because some steps may be performed in other sequences or simultaneously according to the present embodiment. Further, those skilled in the art should also appreciate that the embodiments described in this specification are preferred embodiments and that acts and modules referred to are not necessarily required for an embodiment of the specification.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The preferred embodiments of the present specification disclosed above are intended only to aid in the description of the specification. Alternative embodiments are not exhaustive and do not limit the invention to the precise embodiments described. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the embodiments and the practical application, and to thereby enable others skilled in the art to best understand the specification and utilize the specification. The specification is limited only by the claims and their full scope and equivalents.

Claims (12)

1. A text detection model pre-training method, comprising the following steps:
inputting a text sample into a text encoder to obtain text features, and inputting an image sample into an image encoder to obtain image features, wherein the text sample corresponds to the image sample;
determining whether the image sample contains the text sample according to a data dictionary and the image features, to obtain a text inclusion result, wherein the data dictionary comprises a positive sample and a negative sample corresponding to the text sample;
determining a correspondence between the text sample and the image sample according to the text features and the image features, to obtain an image-text correspondence result;
masking the text features according to a preset rule to obtain partial text features, and inputting the partial text features and the image features into a cross decoder to predict the masked text samples, to obtain a text prediction result, wherein the cross decoder comprises a multi-head self-attention module, a multi-head cross-attention module and a feed-forward network module; and
adjusting parameters of the image encoder according to the text inclusion result, the image-text correspondence result and the text prediction result, to obtain a pre-trained text detection model.
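For illustration only and not as part of the claims: the following is a minimal PyTorch-style sketch of one pre-training step as recited in claim 1, combining the three pre-training tasks and a superposed loss. The module names (text_encoder, image_encoder, cross_decoder, inclusion_head), the specific loss functions, the temperature-free similarity and the tensor shapes are assumptions made for this sketch; the specification does not prescribe these choices.

```python
import torch
import torch.nn.functional as F

def pretraining_step(images, text_tokens, masked_text_tokens, mask_labels,
                     inclusion_labels, image_encoder, text_encoder,
                     cross_decoder, inclusion_head, optimizer):
    """One pre-training step covering the three tasks of claim 1 (sketch only)."""
    text_feat = text_encoder(text_tokens)      # text features,  shape [B, D]
    image_feat = image_encoder(images)         # image features, shape [B, D]

    # Task 1: text inclusion result -- score each image against the data dictionary.
    inclusion_logits = inclusion_head(image_feat)                       # [B, dict_size]
    loss_inclusion = F.binary_cross_entropy_with_logits(
        inclusion_logits, inclusion_labels)                             # multi-hot labels

    # Task 2: image-text correspondence result -- contrastive matching in both directions.
    sim = F.normalize(image_feat, dim=-1) @ F.normalize(text_feat, dim=-1).t()
    targets = torch.arange(images.size(0), device=images.device)
    loss_match = (F.cross_entropy(sim, targets) +
                  F.cross_entropy(sim.t(), targets)) / 2

    # Task 3: text prediction result -- recover the masked tokens with the cross decoder.
    pred_logits = cross_decoder(masked_text_tokens, image_feat)         # [B, L, vocab]
    loss_mlm = F.cross_entropy(pred_logits.flatten(0, 1),
                               mask_labels.flatten(), ignore_index=-100)

    # Superpose the three losses; the optimizer here holds the image encoder parameters.
    loss = loss_inclusion + loss_match + loss_mlm
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```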
2. The method of claim 1, wherein the inputting an image sample into an image encoder to obtain image features comprises:
passing the image sample sequentially through a residual network layer, a feature pyramid layer and an attention pooling layer to obtain the image features.
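For illustration only: a sketch of an image encoder with the structure recited in claim 2, assembled from torchvision components. Using ResNet-50 stages as the residual network layer, torchvision's FeaturePyramidNetwork as the feature pyramid layer, and a learned single-query multi-head attention as the attention pooling layer are assumptions of this sketch, not details fixed by the specification.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50
from torchvision.models.feature_extraction import create_feature_extractor
from torchvision.ops import FeaturePyramidNetwork

class ImageEncoder(nn.Module):
    """Residual network -> feature pyramid -> attention pooling (claim 2 sketch)."""
    def __init__(self, dim=256):
        super().__init__()
        backbone = resnet50(weights=None)
        # Expose the four residual stage outputs as inputs to the pyramid.
        self.backbone = create_feature_extractor(
            backbone, return_nodes={"layer1": "c2", "layer2": "c3",
                                    "layer3": "c4", "layer4": "c5"})
        self.fpn = FeaturePyramidNetwork([256, 512, 1024, 2048], dim)
        self.query = nn.Parameter(torch.zeros(1, 1, dim))
        self.attn_pool = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, images):                       # images: [B, 3, H, W]
        feats = self.fpn(self.backbone(images))      # dict of pyramid levels
        tokens = torch.cat(
            [f.flatten(2).transpose(1, 2) for f in feats.values()], dim=1)
        q = self.query.expand(images.size(0), -1, -1)
        pooled, _ = self.attn_pool(q, tokens, tokens)
        return pooled.squeeze(1)                     # global image feature [B, dim]
```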
3. The method of claim 1, further comprising, before determining whether the image sample contains the text sample according to the data dictionary and the image features:
determining a text negative sample similar to the text sample according to the text sample; and
determining the data dictionary according to the text sample and the text negative sample.
4. The method of claim 3, wherein the determining whether the image sample contains the text sample according to the data dictionary and the image features to obtain the text inclusion result comprises:
generating dictionary text sample features according to the dictionary text samples in the data dictionary; and
comparing the dictionary text sample features with the image features to obtain the text inclusion result.
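For illustration only: a sketch of the comparison step of claims 3-4, in which dictionary text sample features are compared against the pooled image features. The cosine similarity, the fixed temperature and the multi-hot containment label are assumptions of the sketch.

```python
import torch
import torch.nn.functional as F

def text_inclusion_result(image_feats, dictionary_feats, inclusion_labels=None):
    """image_feats:      [B, D] pooled image features
    dictionary_feats: [N, D] features of the N dictionary text samples
                      (texts extracted from the images plus similar negative texts)
    inclusion_labels: [B, N] multi-hot, 1 where the image contains that dictionary text
    Returns per-entry containment scores and, when labels are given, a loss."""
    scores = F.normalize(image_feats, dim=-1) @ F.normalize(dictionary_feats, dim=-1).t()
    if inclusion_labels is None:
        return scores
    loss = F.binary_cross_entropy_with_logits(scores / 0.07, inclusion_labels)
    return scores, loss
```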
5. The method of claim 1, wherein the determining a correspondence between the text sample and the image sample according to the text features and the image features comprises:
comparing each text feature with all image features to determine the correspondence between the text sample and the image sample.
6. The method of claim 5, wherein the determining a correspondence between the text sample and the image sample according to the text features and the image features comprises:
comparing each image feature with all text features to determine the correspondence between the text sample and the image sample.
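For illustration only: the bidirectional comparison of claims 5-6 can be realized as a contrastive matching loss, in which matched text/image pairs share an index within the batch. The symmetric cross-entropy formulation and the temperature value below are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def image_text_correspondence_loss(image_feat, text_feat, temperature=0.07):
    """Compare each text feature with all image features (claim 5) and each image
    feature with all text features (claim 6); both share one similarity matrix."""
    image_feat = F.normalize(image_feat, dim=-1)
    text_feat = F.normalize(text_feat, dim=-1)
    logits = image_feat @ text_feat.t() / temperature            # [B, B]
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)       # each image vs. all texts
    loss_t2i = F.cross_entropy(logits.t(), targets)   # each text vs. all images
    return (loss_i2t + loss_t2i) / 2
```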
7. The method of claim 1, wherein the adjusting parameters of the image encoder according to the text inclusion result, the image-text correspondence result and the text prediction result to obtain the pre-trained text detection model comprises:
obtaining a first loss function, a second loss function and a third loss function according to the text inclusion result, the image-text correspondence result and the text prediction result;
obtaining a superposition loss function according to the first loss function, the second loss function and the third loss function; and
adjusting parameters of the image encoder according to the superposition loss function to obtain the pre-trained text detection model.
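For illustration only: the superposition loss function of claim 7 can be read as a (possibly weighted) sum of the three task losses. The weights below are an assumption; the claim only requires combining the first, second and third loss functions.

```python
def superposition_loss(first_loss, second_loss, third_loss, w1=1.0, w2=1.0, w3=1.0):
    # Combine the text inclusion loss, the image-text correspondence loss and
    # the text prediction loss into a single objective for the image encoder.
    return w1 * first_loss + w2 * second_loss + w3 * third_loss
```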
8. The method of claim 7, wherein the adjusting parameters of the image encoder according to the superposition loss function to obtain the pre-trained text detection model comprises:
modifying parameters in the image encoder according to the superposition loss function to obtain the pre-trained text detection model, wherein the image encoder comprises a residual network layer, a feature pyramid layer and an attention pooling layer.
9. The method of claim 8, wherein the modifying parameters in the image encoder according to the superposition loss function to obtain the pre-trained text detection model comprises:
determining parameters of the feature pyramid layer according to the superposition loss function and the parameters of the attention pooling layer; and
determining parameters of the residual network layer according to the parameters of the feature pyramid layer, and using the residual network layer as a backbone network of the pre-trained text detection model.
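For illustration only: after pre-training, the residual network layer can be reused as the backbone of a downstream text detection model, as recited in claim 9. The sketch below assumes the hypothetical ImageEncoder shown earlier and simply detaches its residual stages for reuse; the detection head itself is outside the scope of the sketch.

```python
def build_detector_backbone(image_encoder):
    # Reuse the pre-trained residual network layer as the detector backbone;
    # its parameters were shaped by the superposition loss flowing back through
    # the attention pooling and feature pyramid layers during pre-training.
    backbone = image_encoder.backbone
    for p in backbone.parameters():
        p.requires_grad = True   # allow further fine-tuning inside the detector
    return backbone
```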
10. A text detection model pre-training apparatus, comprising:
an encoding module configured to input a text sample into a text encoder to obtain text features, and to input an image sample into an image encoder to obtain image features, wherein the text sample corresponds to the image sample;
a first task module configured to determine whether the image sample contains the text sample according to a data dictionary and the image features, to obtain a text inclusion result, wherein the data dictionary comprises a positive sample and a negative sample corresponding to the text sample;
a second task module configured to determine a correspondence between the text sample and the image sample according to the text features and the image features, to obtain an image-text correspondence result;
a third task module configured to mask the text features according to a preset rule to obtain partial text features, and to input the partial text features and the image features into a cross decoder to predict the masked text samples, to obtain a text prediction result, wherein the cross decoder comprises a multi-head self-attention module, a multi-head cross-attention module and a feed-forward network module; and
a parameter adjusting module configured to adjust parameters of the image encoder according to the text inclusion result, the image-text correspondence result and the text prediction result, to obtain a pre-trained text detection model.
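For illustration only: a sketch of the cross decoder used by the third task module, built from PyTorch's TransformerDecoder, whose layers contain exactly the multi-head self-attention, multi-head cross-attention and feed-forward blocks recited in claims 1 and 10. The embedding, depth and width choices are assumptions of the sketch.

```python
import torch
import torch.nn as nn

class CrossDecoder(nn.Module):
    """Predict masked text tokens from partial text features and image features."""
    def __init__(self, vocab_size, dim=256, heads=8, layers=4, max_len=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.pos = nn.Parameter(torch.zeros(1, max_len, dim))
        layer = nn.TransformerDecoderLayer(dim, heads, dim * 4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=layers)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, masked_tokens, image_feat):
        # masked_tokens: [B, L] token ids, some positions replaced by a mask id
        # image_feat:    [B, D] pooled image feature used as cross-attention memory
        x = self.embed(masked_tokens) + self.pos[:, :masked_tokens.size(1)]
        memory = image_feat.unsqueeze(1)                  # [B, 1, D]
        x = self.decoder(x, memory)
        return self.head(x)                               # [B, L, vocab_size]
```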
11. A computing device, comprising:
a memory and a processor;
the memory is configured to store computer-executable instructions, and the processor is configured to execute the computer-executable instructions which, when executed by the processor, implement the steps of the text detection model pre-training method of any one of claims 1 to 9.
12. A computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the steps of the text detection model pre-training method of any one of claims 1 to 9.
CN202210405265.4A 2022-04-18 2022-04-18 Character detection model pre-training method and device Active CN114495129B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210405265.4A CN114495129B (en) 2022-04-18 2022-04-18 Character detection model pre-training method and device


Publications (2)

Publication Number Publication Date
CN114495129A CN114495129A (en) 2022-05-13
CN114495129B true CN114495129B (en) 2022-09-09

Family

ID=81489421

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210405265.4A Active CN114495129B (en) 2022-04-18 2022-04-18 Character detection model pre-training method and device

Country Status (1)

Country Link
CN (1) CN114495129B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115115914B (en) * 2022-06-07 2024-02-27 腾讯科技(深圳)有限公司 Information identification method, apparatus and computer readable storage medium
CN115391588B (en) * 2022-10-31 2023-02-10 阿里巴巴(中国)有限公司 Fine adjustment method and image-text retrieval method of visual language pre-training model
CN115713535A (en) * 2022-11-07 2023-02-24 阿里巴巴(中国)有限公司 Image segmentation model determination method and image segmentation method
CN116778011A (en) * 2023-05-22 2023-09-19 阿里巴巴(中国)有限公司 Image generating method


Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110852086B (en) * 2019-09-18 2022-02-08 平安科技(深圳)有限公司 Artificial intelligence based ancient poetry generating method, device, equipment and storage medium
US20210286946A1 (en) * 2020-03-16 2021-09-16 Samsung Sds Co., Ltd. Apparatus and method for learning text detection model
CN113239705B (en) * 2021-07-12 2021-10-29 北京百度网讯科技有限公司 Pre-training method and device of semantic representation model, electronic equipment and storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105574513A (en) * 2015-12-22 2016-05-11 北京旷视科技有限公司 Character detection method and device
CN113792113A (en) * 2020-07-31 2021-12-14 北京京东尚科信息技术有限公司 Visual language model obtaining and task processing method, device, equipment and medium
CN111985464A (en) * 2020-08-13 2020-11-24 山东大学 Multi-scale learning character recognition method and system for court judgment documents
CN112560652A (en) * 2020-12-09 2021-03-26 第四范式(北京)技术有限公司 Text recognition method and system and text recognition model training method and system
CN113963359A (en) * 2021-12-20 2022-01-21 北京易真学思教育科技有限公司 Text recognition model training method, text recognition device and electronic equipment
CN114357148A (en) * 2021-12-27 2022-04-15 之江实验室 Image text retrieval method based on multi-level network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Effective Text Detection in Composite Images with Simple Background; Ahmathjan Mattohti et al.; 2019 2nd International Conference on Information Systems and Computer Aided Education (ICISCAE); 2020-04-23; pp. 263-266 *
A Survey of Natural Scene Text Detection and Recognition Based on Deep Learning; Wang Jianxin et al.; Journal of Software (软件学报); May 2020; pp. 1465-1496 *

Also Published As

Publication number Publication date
CN114495129A (en) 2022-05-13


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant