CN114743206A - Text detection method, model training method, device and electronic equipment

Text detection method, model training method, device and electronic equipment

Info

Publication number
CN114743206A
Authority
CN
China
Prior art keywords
convolution
convolution kernel
text
text information
text detection
Prior art date
Legal status
Granted
Application number
CN202210534115.3A
Other languages
Chinese (zh)
Other versions
CN114743206B (en)
Inventor
汪京晔 (Wang Jingye)
刘威威 (Liu Weiwei)
李晨霞 (Li Chenxia)
杜宇宁 (Du Yuning)
赖宝华 (Lai Baohua)
马艳军 (Ma Yanjun)
于佃海 (Yu Dianhai)
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202210534115.3A
Publication of CN114743206A
Application granted
Publication of CN114743206B
Status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Abstract

The disclosure provides a text detection method, a model training method, an apparatus, an electronic device, and a storage medium, and relates to the field of artificial intelligence, in particular to deep learning, image processing, text detection, and the like. The specific implementation scheme is as follows: performing convolution processing on a first image carrying text information according to at least one first convolution block representing a lightweight large convolution kernel, to obtain features related to text detection in the text information; evaluating, based on an attention mechanism, the features related to text detection in the text information, to obtain an evaluation result representing the importance of the features; and performing text detection according to the evaluation result to obtain a text detection result. By adopting the present disclosure, text detection precision can be improved.

Description

Text detection method, model training method, device and electronic equipment
Technical Field
The present disclosure relates to the field of artificial intelligence technology, and more particularly, to the fields of deep learning, image processing, text detection, and the like.
Background
Text in an image is key information in human-computer interaction and is widely applied in the field of computer vision. Text detection is a technology for extracting text information from an image; its precision directly influences the performance of subsequent text recognition. With the development of artificial intelligence, text detection is being applied in more and more scenarios, and its precision needs to be improved.
Disclosure of Invention
The disclosure provides a text detection method, a model training method, an apparatus, an electronic device, and a storage medium.
According to an aspect of the present disclosure, there is provided a text detection method including:
performing convolution processing on a first image carrying text information according to at least one first convolution block for representing a light-weight large convolution kernel to obtain characteristics related to text detection in the text information;
based on an attention mechanism, evaluating the characteristics related to text detection in the text information to obtain an evaluation result for representing the importance degree of the characteristics;
and performing text detection according to the evaluation result to obtain a text detection result.
According to an aspect of the present disclosure, there is provided a model training method, including:
inputting an image sample carrying text information into a text detection model to be trained;
in the text detection model to be trained, carrying out convolution processing on the image sample carrying the text information according to at least one first convolution block used for representing a light-weight large convolution kernel to obtain characteristics related to text detection in the text information;
in the text detection model to be trained, adopting an auxiliary detection head to predict the characteristics related to text detection in the text information to obtain a first prediction result;
in the text detection model to be trained, evaluating the first prediction result based on an attention mechanism to obtain an evaluation result for representing the feature importance degree;
and performing model training on the text detection model to be trained according to the evaluation result and the pre-training target to obtain the trained text detection model.
According to another aspect of the present disclosure, there is provided a text detection apparatus including:
the first processing module is used for carrying out convolution processing on a first image carrying text information according to at least one first convolution block used for representing a light-weight large convolution kernel to obtain characteristics related to text detection in the text information;
the first evaluation module is used for evaluating the characteristics related to text detection in the text information based on an attention mechanism to obtain an evaluation result for representing the importance degree of the characteristics;
and the detection module is used for carrying out text detection according to the evaluation result to obtain a text detection result.
According to another aspect of the present disclosure, there is provided a model training apparatus including:
the input module is used for inputting the image sample carrying the text information into the text detection model to be trained;
the second processing module is used for performing convolution processing on the image sample carrying the text information in the text detection model to be trained according to at least one first convolution block used for representing a light-weight large convolution kernel to obtain characteristics related to text detection in the text information;
the prediction module is used for predicting the characteristics related to text detection in the text information by adopting an auxiliary detection head in the text detection model to be trained to obtain a first prediction result;
the second evaluation module is used for evaluating the first prediction result in the text detection model to be trained on the basis of an attention mechanism to obtain an evaluation result for representing the importance degree of the characteristic;
and the training module is used for carrying out model training on the text detection model to be trained according to the evaluation result and the pre-training target to obtain the trained text detection model.
According to another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to cause the at least one processor to perform a method as provided by any one of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform a method provided by any one of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a computer program product comprising computer instructions which, when executed by a processor, implement the method provided by any one of the embodiments of the present disclosure.
By adopting the present disclosure, a first image carrying text information can be convolved according to at least one first convolution block representing a lightweight large convolution kernel to obtain features related to text detection in the text information; those features can be evaluated based on an attention mechanism to obtain an evaluation result representing their importance; and text detection can be performed according to the evaluation result to obtain a text detection result. Text detection precision can thereby be improved.
It should be understood that the statements in this section are not intended to identify key or critical features of the embodiments of the present disclosure, nor are they intended to limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram of a distributed cluster processing scenario according to an embodiment of the present disclosure;
FIG. 2 is a schematic flow diagram of a text detection method according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of the basic component structure of a text detection model according to an embodiment of the present disclosure;
FIG. 4 is a schematic flow diagram of a model training method according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a composition structure of a first convolution block for characterizing a lightweight large convolution kernel in an application example according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram illustrating the feature pyramid part and an auxiliary detection head in a text detection model according to an application example of the present disclosure;
FIG. 7 is a schematic diagram of the feature pyramid part, the applied attention mechanism, and the detection head in a text detection model according to an application example of the present disclosure;
FIG. 8 is a schematic diagram of a structure of a text detection device according to an embodiment of the present disclosure;
FIG. 9 is a schematic diagram of a component structure of a model training apparatus according to an embodiment of the present disclosure;
FIG. 10 is a block diagram of an electronic device for implementing a text detection method or a model training method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of embodiments of the present disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The term "and/or" herein is merely an association describing an associated object, meaning that three relationships may exist, e.g., a and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone. The term "at least one" herein means any combination of at least two of any one or more of a plurality, for example, including at least one of A, B, C, and may mean including any one or more elements selected from the group consisting of A, B and C. The terms "first" and "second" used herein refer to and distinguish one from another in the similar art, without necessarily implying a sequence or order, or implying only two, such as first and second, to indicate that there are two types/two, first and second, and first and second may also be one or more.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the subject matter of the present disclosure.
For text in an image, especially when the image contains long text (such as a long sentence) or a complex background, related approaches increase the receptive field of the model by using deformable convolution, dilated (atrous) convolution, and the like. When combining context information, for example with a feature pyramid, the importance of the feature information of each layer is not well accounted for, so the text detection precision is not high. Moreover, even where the receptive field is increased, the resulting models are too large, which is unfavorable for mobile-terminal deployment and real-time text detection.
Fig. 1 is a schematic diagram of a distributed cluster processing scenario according to an embodiment of the present disclosure. The distributed cluster system is one example of a cluster system and illustrates that text detection can be performed with it; the present disclosure is not limited to text detection on a single machine or on multiple machines, and text detection accuracy can be further improved by using distributed processing. As shown in fig. 1, the distributed cluster system includes a plurality of nodes (e.g., server cluster 101, server 102, server cluster 103, server 104, and server 105); the server 105 may further be connected to electronic devices, such as a mobile phone 1051 and a desktop 1052, and the nodes and the connected electronic devices may jointly perform one or more text detection tasks. Optionally, the nodes in the distributed cluster system may perform text detection in a data-parallel manner to improve text detection speed, thereby achieving both text detection precision and text detection speed. Optionally, after each round of training of the text detection task is completed, data exchange (e.g., data synchronization) may be performed between the nodes.
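For illustration only, below is a minimal Python sketch of the per-round data synchronization mentioned above, written in PyTorch style; the disclosure does not mandate any framework, and the function below is a hypothetical realization assuming a distributed process group is already initialized.

```python
import torch.distributed as dist

def sync_gradients(model):
    """Average gradients across nodes after a training round (data parallelism).

    Hypothetical sketch: the disclosure only states that nodes may exchange
    data after each round; all-reduce averaging is one common realization.
    """
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)  # sum grads over nodes
            param.grad /= world_size                            # average
```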
It should be noted that the present disclosure includes two parts, model use and model training; the trained text detection model is used directly in text detection. The basic components of the text detection model include: a backbone network (backbone), a neck network (neck), and a head network (head). The specific structure of the model differs between model use and model training. In model use, the feature pyramid structure part in the neck can be set to at least one first convolution block representing a lightweight large convolution kernel; a channel attention mechanism is used in the head and applied to the output of the feature pyramid structure, and a detection head in the head performs the prediction processing of the final text detection to obtain a text detection result. In model training, an auxiliary detection head is added relative to that model, so that effective information including more accurate semantic features (i.e., information that improves evaluation accuracy) is not lost. Specifically, the feature pyramid structure part in the neck can be set to at least one first convolution block representing a lightweight large convolution kernel; a channel attention mechanism is used in the head and applied to the output of the feature pyramid structure; and an auxiliary detection head and a detection head are arranged in the head, with the detection head performing the prediction processing of the final text detection.
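For orientation, a minimal PyTorch-style sketch of the backbone/neck/head composition described above; all class and attribute names here are hypothetical, and the auxiliary heads are active only during training, as stated above.

```python
import torch.nn as nn

class TextDetector(nn.Module):
    """Hypothetical skeleton of the backbone/neck/head composition."""
    def __init__(self, backbone, neck, head, aux_heads=None):
        super().__init__()
        self.backbone = backbone    # feature extraction from the input image
        self.neck = neck            # feature pyramid with lightweight large-kernel blocks
        self.head = head            # channel attention + detection head
        self.aux_heads = aux_heads  # auxiliary detection heads, training only

    def forward(self, image, training=False):
        feats = self.backbone(image)       # multi-scale image features
        pyramid = self.neck(feats)         # per-layer pyramid outputs
        preds = self.head(pyramid)         # final text detection prediction
        if training and self.aux_heads is not None:
            aux = [h(p) for h, p in zip(self.aux_heads, pyramid)]
            return preds, aux
        return preds
```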
According to an embodiment of the present disclosure, a text detection method is provided. Fig. 2 is a schematic flowchart of the text detection method according to an embodiment of the present disclosure. The method may be applied to a text detection apparatus; for example, the apparatus may be deployed on a terminal, a server, or another processing device in a single-machine, multi-machine, or cluster system, and may implement processing such as text detection. The terminal may be a User Equipment (UE), a mobile device, a Personal Digital Assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, or the like. In some possible implementations, the method may also be implemented by a processor calling computer-readable instructions stored in a memory. As shown in fig. 2, the method is applied to any node or electronic device (mobile phone, desktop, etc.) in the cluster system shown in fig. 1, and includes:
s201, performing convolution processing on a first image carrying text information according to at least one first convolution block used for representing the light-weight large convolution kernel to obtain characteristics related to text detection in the text information.
S202, evaluating the characteristics related to text detection in the text information based on an attention mechanism to obtain an evaluation result for representing the importance degree of the characteristics.
And S203, performing text detection according to the evaluation result to obtain a text detection result.
In an example of S201-S203, the first convolution block may be a light-weight large convolution block (or referred to as a light-weight large convolution module), and the first convolution block may include multiple layers of convolution kernels, so as to perform layer-by-layer convolution processing on the first image carrying the text information, and obtain a more accurate feature related to text detection in the text information. And based on the attention mechanism, evaluating the characteristics related to text detection in the text information to obtain an evaluation result for representing the importance degree of the characteristics, and performing text detection based on the evaluation result to obtain a more accurate text detection result.
By adopting the present disclosure, convolution processing can be performed on the first image carrying text information according to the at least one first convolution block representing a lightweight large convolution kernel to obtain features related to text detection in the text information; those features can be evaluated based on an attention mechanism to obtain an evaluation result representing their importance; and text detection can be performed according to the evaluation result to obtain a text detection result, so that text detection precision can be improved.
In an embodiment, performing convolution processing on a first image carrying text information according to at least one first convolution block to obtain a feature related to text detection in the text information includes: performing layer-by-layer convolution processing on the first image carrying the text information according to the multiple layers of convolution kernels in the at least one first convolution block to obtain the features related to text detection in the text information.
In some examples, the at least one first convolution block is set in the feature pyramid structure of the text detection model, and the multiple layers of convolution kernels in the at least one first convolution block are initially configured according to a convolution kernel size (e.g., the convolution kernel size is 7 pixels or 9 pixels). Wherein the multi-layered convolution kernel includes: the first convolution kernel, the second convolution kernel which receives the input of the first convolution kernel, the third convolution kernel which receives the input of the second convolution kernel after the activation function processing, and the fourth convolution kernel which performs the addition processing together with the third convolution kernel.
The text detection model shown in fig. 3 includes a backbone, a neck, and a head. The backbone, as the first module in the text detection model, mainly extracts features from the input first image so as to provide them for use by the neck and the head. The neck, as the second module, sits between the backbone and the head and makes better use of the features in the first image extracted by the backbone. The head, as the third module, performs the prediction of text detection using the image features extracted and optimized by the backbone and/or the neck, so as to realize text detection.
The first convolution block shown in fig. 5 includes: a first convolution kernel (or first convolution layer), a second convolution kernel (or second convolution layer), an activation layer, a third convolution kernel (or third convolution layer), and a fourth convolution kernel (or fourth convolution layer). When a convolution kernel size of 7 pixels is used, the first convolution kernel may be a 7x7 convolution kernel; the first convolution kernel may also be a 9x9 convolution kernel. The second convolution kernel may be a 1x1 convolution kernel connected to the first convolution kernel; the third convolution kernel may be a 1x1 convolution kernel connected to the activation layer (e.g., an activation layer using the ReLU activation function); and the fourth convolution kernel may be a 1x1 convolution kernel whose output is added to that of the third convolution kernel.
By adopting the embodiment, at least one first convolution block for representing the light-weight large convolution kernel can be applied to the text detection model, the size of the model is small, the receptive field of the model is increased, and the text detection precision is improved.
In one embodiment, performing layer-by-layer convolution processing on the first image carrying text information according to the multiple layers of convolution kernels in the at least one first convolution block, to obtain the features related to text detection in the text information, includes: inputting the first image carrying text information into the first convolution kernel for convolution processing; inputting the result into the second convolution kernel for convolution processing and activation-function processing; inputting that result into the third convolution kernel; separately inputting the first image carrying text information into the fourth convolution kernel for convolution processing; and adding the outputs of the third convolution kernel and the fourth convolution kernel to obtain the features related to text detection in the text information.
With the present embodiment, based on the first convolution block shown in fig. 5, the text detection accuracy is improved by the layer-by-layer convolution processing of the plurality of convolution kernels in the first convolution block.
In one embodiment, the method further comprises: based on a feature pyramid structure of a text detection model, feature extraction is carried out on a first image carrying text information to obtain a plurality of feature layers, and the feature layers are spliced to obtain splicing features. Wherein at least one of the plurality of feature layers is obtained by convolution processing performed by at least one first convolution block.
By adopting this embodiment, feature extraction is performed based on the feature pyramid structure, and the resulting plurality of feature layers can be spliced. Optionally, at least one of the feature layers can be obtained by feature extraction with the first convolution block shown in fig. 5, that is, the first convolution block shown in fig. 5 can be multiplexed into any part of the feature pyramid structure to achieve more accurate feature extraction, thereby improving text detection accuracy.
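As one concrete reading of the splicing step, a hedged sketch follows; the disclosure does not specify how pyramid layers of different resolutions are aligned before splicing, so the bilinear upsampling below is an assumption of this sketch.

```python
import torch
import torch.nn.functional as F

def splice_pyramid(feature_layers):
    """Splice feature pyramid layers along the channel dimension.

    Assumes each element is a (N, C_i, H_i, W_i) tensor; resizing every
    layer to the first layer's resolution is an assumption, not specified
    in the disclosure.
    """
    target_hw = feature_layers[0].shape[2:]
    resized = [F.interpolate(x, size=target_hw, mode="bilinear", align_corners=False)
               for x in feature_layers]
    return torch.cat(resized, dim=1)  # splicing feature, shape (N, sum(C_i), H, W)
```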
In one embodiment, evaluating the features related to text detection in the text information based on the attention mechanism, to obtain an evaluation result representing the importance of the features, includes: in the case where the feature related to text detection in the text information is the splicing feature, performing normalization processing for feature evaluation on the splicing feature based on the attention mechanism to obtain a normalized feature; obtaining a target feature according to the normalized feature and the splicing feature; and taking the target feature as the evaluation result representing the feature importance.
By adopting this embodiment, the target feature is obtained according to the normalized feature and the splicing feature: the normalized feature and the prior splicing feature can be multiplied element-wise (a per-position dot product) to obtain the target feature, which conveniently corrects the prior splicing feature so that the finally used target feature describes the features more accurately. In this way, the attention mechanism can effectively combine context information within the text detection model; the model size is small, and text detection precision is improved without affecting real-time text detection.
In one embodiment, performing text detection according to the evaluation result to obtain a text detection result includes: and under the condition that a head structure is set in the text detection model, performing text detection according to the target characteristics and the head structure to obtain a text detection result.
By adopting the embodiment, the modified target feature is obtained by performing point multiplication on the normalized feature and the previous splicing feature, and the modified target feature is used as the input of the head part of the text detection model to perform final prediction, so that the text detection precision is improved.
According to an embodiment of the present disclosure, a model training method is provided. Fig. 4 is a flowchart of the model training method according to an embodiment of the present disclosure. The method may be applied to a model training apparatus; for example, the apparatus may be deployed on a terminal, a server, or another processing device in a single-machine, multi-machine, or cluster system, and may implement processing such as model training. The terminal may be a User Equipment (UE), a mobile device, a Personal Digital Assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, or the like. In some possible implementations, the method may also be implemented by a processor calling computer-readable instructions stored in a memory. As shown in fig. 4, the method is applied to any node or electronic device (mobile phone, desktop, etc.) in the cluster system shown in fig. 1, and includes:
s401, inputting the image sample carrying the text information into a text detection model to be trained.
S402, in the text detection model to be trained, according to at least one first convolution block, performing convolution processing on the image sample carrying the text information to obtain the characteristics related to text detection in the text information.
S403, in the text detection model to be trained, an auxiliary detection head is adopted to predict the characteristics related to text detection in the text information, and a first prediction result is obtained.
S404, in the text detection model to be trained, evaluating the first prediction result based on an attention mechanism to obtain an evaluation result for representing the feature importance degree.
And S405, performing model training in the text detection model to be trained according to the evaluation result and the pre-training target to obtain the trained text detection model.
In an example of S401-S405, the first convolution block may be a light-weight large convolution block (or referred to as a light-weight large convolution module), and the first convolution block may include multiple layers of convolution kernels, so as to perform layer-by-layer convolution processing on an image sample carrying text information, and obtain a more accurate feature related to text detection in the text information. And predicting the characteristics related to text detection in the text information by using the auxiliary detection head to obtain a first prediction result. The evaluation result used for representing the feature importance degree can be obtained by evaluating the first prediction result based on an attention mechanism, so that model training is carried out on the text detection model to be trained according to the evaluation result and the pre-training target, the trained text detection model is obtained, the size of the text detection model is small, the text detection model can be deployed at a mobile terminal, text detection is achieved through the text detection model, and text detection precision can be improved.
By adopting the embodiment of the disclosure, in combination with the layer-by-layer convolution operation of the multilayer convolution kernels in the first convolution block, the effective characteristics extracted by the layer-by-layer convolution operation are prevented from being lost through the prediction processing of the auxiliary detection head, the importance of the effective characteristics is evaluated through an attention mechanism, so that the model training is performed in the text detection model to be trained according to the evaluation result and the pre-training target, and the text detection model obtained through training is high in model precision.
In one embodiment, the method further comprises: at least one first convolution block is arranged in a characteristic pyramid structure of a text detection model to be trained, an auxiliary detection head is arranged at the output part of the characteristic pyramid structure, and an attention mechanism is applied to the output part of the characteristic pyramid structure.
In some examples, the at least one first convolution block includes: a first convolution kernel, a second convolution kernel receiving the input of the first convolution kernel, a third convolution kernel receiving the input of the second convolution kernel after activation-function processing, and a fourth convolution kernel performing addition processing together with the third convolution kernel.
In some examples, the feature pyramid structure includes: a plurality of feature layers; wherein at least one of the plurality of feature layers is obtained by convolution processing performed by the at least one first convolution block.
The first convolution block shown in fig. 5 includes: a first convolution kernel (or first convolution layer), a second convolution kernel (or second convolution layer), an activation layer, a third convolution kernel (or third convolution layer), and a fourth convolution kernel (or fourth convolution layer). When a convolution kernel size of 7 pixels is used, the first convolution kernel may be a 7x7 convolution kernel; the first convolution kernel may also be a 9x9 convolution kernel. The second convolution kernel may be a 1x1 convolution kernel connected to the first convolution kernel; the third convolution kernel may be a 1x1 convolution kernel connected to the activation layer (e.g., an activation layer using the ReLU function); and the fourth convolution kernel may be a 1x1 convolution kernel whose output is added to that of the third convolution kernel.
By adopting the embodiment, at least one first convolution block for representing the light-weight large convolution kernel can be applied to the text detection model, the size of the model is small, the receptive field of the model is increased, and the precision of the model is improved, so that the text detection precision is improved when the model is used.
In one embodiment, in the text detection model to be trained, performing convolution processing on the image sample carrying text information according to the at least one first convolution block, to obtain the features related to text detection in the text information, includes: inputting the image sample carrying text information into the first convolution kernel for convolution processing; inputting the result into the second convolution kernel for convolution processing and activation-function processing; inputting that result into the third convolution kernel; separately inputting the image sample carrying text information into the fourth convolution kernel for convolution processing; and adding the outputs of the third convolution kernel and the fourth convolution kernel to obtain the features related to text detection in the text information.
By adopting the embodiment, the feature extraction is carried out through the layer-by-layer convolution operation of the convolution kernels in the first convolution block, and the extracted feature is used for model training, so that the model precision is improved.
In an embodiment, the method further includes: performing feature extraction on the image sample carrying the text information based on the feature pyramid structure to obtain a plurality of feature layers, and splicing the plurality of feature layers to obtain a splicing feature.
By adopting the embodiment, the plurality of characteristic layers are spliced to obtain the splicing characteristics, and the splicing characteristics are used as the fusion characteristics to carry out model training, so that the model precision is improved.
In one embodiment, in a text detection model to be trained, predicting features related to text detection in text information by using an auxiliary detection head to obtain a first prediction result, including: and under the condition that the feature related to the text detection in the text information is the splicing feature, predicting the splicing feature to obtain a first prediction result.
In some examples, the first prediction result may be further evaluated based on an attention mechanism to obtain an evaluation result for characterizing feature importance, for example, the first prediction result is subjected to normalization processing for feature evaluation based on the attention mechanism to obtain a normalized feature, a feature of the head structure to be input is obtained according to the normalized feature and the first prediction result, and the feature of the head structure to be input is used as the evaluation result for characterizing feature importance.
By adopting the embodiment, the head structure is arranged, the target feature of the head structure to be input, which is obtained by applying the attention mechanism to the output part of the feature pyramid structure, can be input into the head structure, and as the target feature is subjected to the prediction of the splicing feature and the feature evaluation of the first prediction result obtained by the prediction based on the attention mechanism, the model training is carried out by adopting the target feature, so that the model precision is improved.
The following is an example of a model training and a model using processing method provided by the embodiment of the present disclosure.
As shown in figs. 5-7, the text detection model is divided into three modules: the backbone, the neck, and the head. The backbone can use a lightweight network (such as MobileNetV3), the neck can use a detection model based on a feature pyramid, and the head can use a text detection head such as that of DBNet (Differentiable Binarization). The following mainly describes the improvements to the feature pyramid part of the neck and to the head of the text detection model, including the following:
1. A convolution block (or convolution module) is selected for the convolution processing, for example setting the padding attribute of the convolution block to 'same' (i.e., the padding of the convolution kernel is half of the convolution kernel size, rounded down);
2. To reduce the number of parameters of the model, at least one of the convolution blocks may use the first convolution block, which may be a lightweight large-kernel convolution block (or lightweight large-kernel convolution module). The convolution kernel size k may be set to 7, or to 9; besides the convolution kernel size, the numbers of input channels and output channels are also set.
Specifically, the convolution operation performed by the first convolution block includes: 1) initializing a layer-by-layer convolution kernel conv1 with the set convolution kernel size k, e.g., a depthwise convolution; 2) initializing a 1x1 convolution kernel conv2, e.g., a pointwise convolution, whose number of output channels is 4 times the number of input channels; 3) setting an activation function, such as a Rectified Linear Unit (ReLU); 4) initializing a 1x1 convolution kernel conv3, also a pointwise convolution; 5) setting a 1x1 convolution kernel conv4 and establishing a residual connection. The convolution block obtained by connecting these 5 network layers, namely the first convolution kernel (conv1), the second convolution kernel (conv2), the activation layer (ReLU layer), the third convolution kernel (conv3), and the fourth convolution kernel (conv4), is used as the first convolution block, which can be applied to the text detection network as a basic building block; for example, it can be applied to any one or more feature layers in the feature pyramid part.
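A minimal PyTorch-style sketch of steps 1)-5) above follows; the output width of conv3 is not fixed by the text, so projecting to the configured number of output channels is an assumption of this sketch.

```python
import torch.nn as nn

class LargeKernelBlock(nn.Module):
    """Sketch of the lightweight large-kernel convolution block (cf. fig. 5)."""
    def __init__(self, in_ch, out_ch, k=7):
        super().__init__()
        pad = k // 2                                   # 'same' padding: half of k, rounded down
        self.conv1 = nn.Conv2d(in_ch, in_ch, k, padding=pad, groups=in_ch)  # depthwise k x k
        self.conv2 = nn.Conv2d(in_ch, 4 * in_ch, 1)   # pointwise, 4x channel expansion
        self.act = nn.ReLU()                          # activation function
        self.conv3 = nn.Conv2d(4 * in_ch, out_ch, 1)  # pointwise projection (assumed width)
        self.conv4 = nn.Conv2d(in_ch, out_ch, 1)      # residual branch on the block input

    def forward(self, x):
        main = self.conv3(self.act(self.conv2(self.conv1(x))))
        return main + self.conv4(x)                   # residual connection
```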
3. Each layer of output of the feature pyramid part of the neck is connected to an auxiliary detection head; the specific configuration of the auxiliary detection head can be consistent with that of the detection head of the head module. The auxiliary detection heads are adopted to prevent the effective features of information related to text detection from being lost as the number of network layers deepens (e.g., through multiple layers of up-sampling and down-sampling).
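How the auxiliary predictions enter the training loss is not specified in the text; the sketch below shows one plausible combination, with `aux_weight` as a purely hypothetical parameter.

```python
def detection_loss(main_pred, aux_preds, target, criterion, aux_weight=0.4):
    """Combine the main detection loss with per-layer auxiliary-head losses.

    `aux_weight` is hypothetical: the disclosure states only that auxiliary
    heads retain effective features during training, not how losses mix.
    """
    loss = criterion(main_pred, target)            # final detection head
    for pred in aux_preds:                         # one prediction per pyramid layer
        loss = loss + aux_weight * criterion(pred, target)
    return loss
```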
4. For the feature pyramid part, the outputs of the layers of the feature pyramid part are also required to be spliced according to the channel dimension, so as to obtain a splicing feature. The channel attention mechanism can be used to evaluate the importance of the features for each layer output of the feature pyramid portion.
Specifically, for the spliced feature F ∈ R^(C×H×W) obtained by splicing the per-layer outputs of the feature pyramid part (C is the number of channels, H is the height of the feature, and W is the width of the feature), the channel attention mechanism can be used to evaluate the feature importance of each layer output of the feature pyramid part as follows: 1) first perform spatial-dimension max pooling and average pooling on F to obtain the corresponding features F_m ∈ R^(C×1) and F_a ∈ R^(C×1), where F_m is the feature obtained by max pooling F over the spatial dimensions and F_a is the feature obtained by average pooling F; 2) splice F_m and F_a together along the channel dimension to obtain the feature F_temp ∈ R^(2C×1), where F_temp represents an intermediate variable; 3) apply a fully connected layer to obtain F_att, namely F_temp ∈ R^(2C×1) → F_att ∈ R^(C×1); 4) normalize the values of the obtained F_att using the sigmoid function to obtain the importance of the corresponding feature layer; 5) expand the dimensions of the feature F_att obtained in 4), namely F_att ∈ R^(C×1) → F_att ∈ R^(C×H×W), and multiply it element-wise with F to obtain the target feature F′ ∈ R^(C×H×W) after evaluating the feature importance of each layer output of the feature pyramid part with the channel attention mechanism; 6) use F′ ∈ R^(C×H×W) as the input feature of the head part of the text detection model for the prediction of the final text detection.
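A minimal PyTorch-style sketch of steps 1)-6), assuming a batch-first layout (N, C, H, W); the single fully connected layer mirrors step 3), and the rest is a direct transcription of the steps above.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Sketch of the channel attention evaluation of steps 1)-6) above."""
    def __init__(self, channels):
        super().__init__()
        self.fc = nn.Linear(2 * channels, channels)  # step 3: F_temp -> F_att

    def forward(self, feat):                         # feat: spliced feature (N, C, H, W)
        f_m = feat.amax(dim=(2, 3))                  # step 1: spatial max pooling, (N, C)
        f_a = feat.mean(dim=(2, 3))                  # step 1: spatial average pooling, (N, C)
        f_temp = torch.cat([f_m, f_a], dim=1)        # step 2: concat along channels, (N, 2C)
        f_att = torch.sigmoid(self.fc(f_temp))       # steps 3-4: per-channel importance
        f_att = f_att[:, :, None, None]              # step 5: expand to (N, C, 1, 1)
        return feat * f_att                          # step 5: corrected target feature F'
```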
5. Combining the above items 1-4, the text detection model is improved, and the improved text detection model is applied to text detection scenarios.
It should be noted that, for the feature pyramid structure of the neck portion and the convolution structure of the head portion of the text detection model, the convolution operation may be performed using the first convolution block constructed in item 2 above. For the outputs of the intermediate layers of the feature pyramid (for example, the third-to-last and second-to-last layers), two auxiliary detection heads may be used for prediction respectively; the auxiliary detection heads do not participate in the final text detection and are only responsible for retaining most of the effective features related to text detection during model training. Finally, for the final output of the feature pyramid (for example, the splice of its last layers), the channel attention mechanism is used to evaluate the importance of the output features of each pyramid layer, and a correction is made according to the evaluation result, so as to obtain more accurate target features.
By adopting this application example, the receptive field of the text detection model can be effectively increased and the precision of the text detection model improved; context information can be effectively combined and multi-size information fully utilized, improving the robustness of the model to multi-scale text detection; and the effective features related to text detection in each feature layer of the model are retained. The application example can therefore be readily extended to various visual tasks on images carrying text.
According to an embodiment of the present disclosure, a text detection apparatus is provided, fig. 8 is a schematic diagram of a composition structure of the text detection apparatus according to an embodiment of the present disclosure, and as shown in fig. 8, the text detection apparatus includes: a first processing module 801, configured to perform convolution processing on a first image carrying text information according to at least one first convolution block used for representing a lightweight large convolution kernel, to obtain a feature related to text detection in the text information; a first evaluation module 802, configured to evaluate, based on an attention mechanism, features related to text detection in text information, and obtain an evaluation result for representing an importance degree of the features; and the detection module 803 is configured to perform text detection according to the evaluation result to obtain a text detection result.
In one embodiment, the first processing module is configured to: perform layer-by-layer convolution processing on the first image carrying the text information according to the multiple layers of convolution kernels in the at least one first convolution block to obtain the features related to text detection in the text information.
In one embodiment, the apparatus further comprises a configuration module configured to: set the at least one first convolution block in a feature pyramid structure of a text detection model; and configure the multiple layers of convolution kernels in the at least one first convolution block according to a convolution kernel size; wherein the multi-layer convolution kernel includes: a first convolution kernel, a second convolution kernel receiving the input of the first convolution kernel, a third convolution kernel receiving the input of the second convolution kernel after activation-function processing, and a fourth convolution kernel performing addition processing together with the third convolution kernel.
In one embodiment, the first processing module is configured to: input the first image carrying the text information into the first convolution kernel for convolution processing; input the result into the second convolution kernel for convolution processing and activation-function processing; input that result into the third convolution kernel; separately input the first image carrying the text information into the fourth convolution kernel for convolution processing; and add the outputs of the third convolution kernel and the fourth convolution kernel to obtain the features related to text detection in the text information.
In one embodiment, the apparatus further comprises an extraction module configured to: perform feature extraction on the first image carrying the text information based on the feature pyramid structure of the text detection model to obtain a plurality of feature layers; and splice the plurality of feature layers to obtain a splicing feature; wherein at least one of the plurality of feature layers is obtained by convolution processing performed by the at least one first convolution block.
In one embodiment, the first evaluation module is configured to: under the condition that the feature related to text detection in the text information is the splicing feature, performing normalization processing for feature evaluation on the splicing feature based on the attention mechanism to obtain a normalized feature; obtaining target characteristics according to the normalization characteristics and the splicing characteristics; and taking the target characteristic as the evaluation result of the importance degree of the characteristic.
In one embodiment, the detection module is configured to: and under the condition that a head structure is arranged in the text detection model, performing text detection according to the target feature and the head structure to obtain a text detection result.
According to an embodiment of the present disclosure, a model training apparatus is provided, fig. 9 is a schematic diagram of a composition structure of the model training apparatus according to the embodiment of the present disclosure, and as shown in fig. 9, the model training apparatus includes: an input module 901, configured to input an image sample carrying text information into a text detection model to be trained; a second processing module 902, configured to perform convolution processing on the image sample carrying text information according to at least one first convolution block used for representing a lightweight large convolution kernel in the text detection model to be trained, so as to obtain a feature related to text detection in the text information; a prediction module 903, configured to predict, in the text detection model to be trained, features related to text detection in the text information by using an auxiliary detection head, so as to obtain a first prediction result; a second evaluation module 904, configured to evaluate the first prediction result in the to-be-trained text detection model based on an attention mechanism to obtain an evaluation result for characterizing feature importance degree; and the training module 905 is configured to perform model training on the text detection model to be trained according to the evaluation result and the pre-training target, so as to obtain a trained text detection model.
In one embodiment, the apparatus further comprises a setting module configured to: set the at least one first convolution block in a feature pyramid structure of the text detection model to be trained; arrange the auxiliary detection head at the output part of the feature pyramid structure; and apply the attention mechanism at the output part of the feature pyramid structure.
In one embodiment, the at least one first volume block comprises: the system comprises a first convolution kernel, a second convolution kernel accepting the input of the first convolution kernel, a third convolution kernel accepting the input of the second convolution kernel after being processed by an activation function, and a fourth convolution kernel executing addition processing together with the third convolution kernel.
In one embodiment, the feature pyramid structure comprises: a plurality of feature layers; wherein at least one of the plurality of feature layers is obtained by convolution processing performed by the at least one first convolution block.
In one embodiment, the second processing module is configured to: input the image sample carrying the text information into the first convolution kernel for convolution processing; input the result into the second convolution kernel for convolution processing and activation-function processing; input that result into the third convolution kernel; separately input the image sample carrying the text information into the fourth convolution kernel for convolution processing; and add the outputs of the third convolution kernel and the fourth convolution kernel to obtain the features related to text detection in the text information.
In one embodiment, the apparatus further comprises a third processing module configured to: perform feature extraction on the image sample carrying the text information based on the feature pyramid structure to obtain a plurality of feature layers; and splice the plurality of feature layers to obtain a splicing feature.
In one embodiment, the prediction module is configured to: and under the condition that the feature related to the text detection in the text information is the splicing feature, predicting the splicing feature to obtain the first prediction result.
In one embodiment, the second evaluation module is configured to: performing normalization processing for feature evaluation on the first prediction result based on the attention mechanism to obtain normalized features; obtaining the characteristics of the head structure to be input according to the normalized characteristics and the first prediction result; and taking the characteristics of the head structure to be input as the evaluation result for representing the importance degree of the characteristics.
In one embodiment, the training module is configured to: setting the head structure in the text detection model; and performing model training based on the head structure according to the characteristics of the head structure to be input and the pre-training target to obtain the trained text detection model.
In the technical scheme of the disclosure, the acquisition, storage, application and the like of the personal information of the related user all accord with the regulations of related laws and regulations, and do not violate the good customs of the public order.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 10 shows a schematic block diagram of an example electronic device 1000 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 10, the electronic device 1000 includes a computing unit 1001 that can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 1002 or a computer program loaded from a storage unit 1008 into a random access memory (RAM) 1003. In the RAM 1003, various programs and data necessary for the operation of the electronic device 1000 can also be stored. The computing unit 1001, the ROM 1002, and the RAM 1003 are connected to each other by a bus 1004. An input/output (I/O) interface 1005 is also connected to the bus 1004.
A number of components in the electronic device 1000 are connected to the I/O interface 1005, including: an input unit 1006 such as a keyboard, a mouse, and the like; an output unit 1007 such as various types of displays, speakers, and the like; a storage unit 1008 such as a magnetic disk, an optical disk, or the like; and a communication unit 1009 such as a network card, a modem, a wireless communication transceiver, or the like. The communication unit 1009 allows the electronic device 1000 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 1001 may be any of various general and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 1001 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 1001 performs the methods and processes described above, such as the text detection method or the model training method. For example, in some embodiments, the text detection method or the model training method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 1008. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 1000 via the ROM 1002 and/or the communication unit 1009. When the computer program is loaded into the RAM 1003 and executed by the computing unit 1001, one or more steps of the text detection method or the model training method described above may be performed. Alternatively, in other embodiments, the computing unit 1001 may be configured to perform the text detection method or the model training method in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, and which receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in a different order, which is not limited herein as long as the desired results of the technical solutions of the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.

Claims (33)

1. A text detection method, comprising:
performing convolution processing on a first image carrying text information according to at least one first convolution block used for representing a light-weight large convolution kernel to obtain features related to text detection in the text information;
based on an attention mechanism, evaluating the features related to text detection in the text information to obtain an evaluation result for representing the importance degree of the features;
and performing text detection according to the evaluation result to obtain a text detection result.
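To make the three claimed steps concrete, the following toy sketch walks an image through a large-kernel convolution, an attention-style importance evaluation, and a detection head. It is an illustrative reading of the claim, not the patented implementation: the module names, the 9x9 kernel size, the squeeze-and-excitation style attention, and the sigmoid head are all assumptions.

import torch
import torch.nn as nn

class ToyTextDetector(nn.Module):
    def __init__(self, channels=32):
        super().__init__()
        # Step 1: a large-kernel convolution extracting text-related features
        # (9x9 is an assumed size; the claim only says "large convolution kernel").
        self.large_kernel = nn.Conv2d(3, channels, kernel_size=9, padding=4)
        # Step 2: per-channel importance weights via a squeeze-and-excitation
        # style attention; the claim only says "attention mechanism".
        self.score = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Step 3: a 1x1 head predicting a per-pixel text probability map.
        self.head = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, image):
        feats = self.large_kernel(image)                  # features related to text detection
        weights = self.score(feats)                       # evaluation of feature importance
        return torch.sigmoid(self.head(feats * weights))  # text detection result

probs = ToyTextDetector()(torch.randn(1, 3, 64, 64))
print(probs.shape)  # torch.Size([1, 1, 64, 64])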
2. The method of claim 1, wherein the performing convolution processing on the first image carrying the text information according to the at least one first convolution block used for representing a light-weight large convolution kernel to obtain features related to text detection in the text information comprises:
and performing layer-by-layer convolution processing on the first image carrying the text information according to the multi-layer convolution kernels in the at least one first convolution block to obtain the features related to text detection in the text information.
3. The method of claim 2, further comprising:
setting the at least one first convolution block in a feature pyramid structure of a text detection model;
configuring the multi-layer convolution kernels in the at least one first convolution block according to a convolution kernel size;
wherein the multi-layer convolution kernels include: a first convolution kernel, a second convolution kernel accepting the output of the first convolution kernel, a third convolution kernel accepting the output of the second convolution kernel after processing by an activation function, and a fourth convolution kernel whose output is added to that of the third convolution kernel.
4. The method of claim 3, wherein the performing layer-by-layer convolution processing on the first image carrying the text information according to the multi-layer convolution kernels in the at least one first convolution block to obtain the features related to text detection in the text information comprises:
inputting the first image carrying the text information into the first convolution kernel, inputting the output of the first convolution kernel into the second convolution kernel, and inputting the output of the second convolution kernel, after activation function processing, into the third convolution kernel;
and inputting the first image carrying the text information into the fourth convolution kernel for convolution processing, and adding the output of the fourth convolution kernel to the output of the third convolution kernel to obtain the features related to text detection in the text information.
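A minimal sketch of this four-kernel block follows: a serial branch (first, second, and third kernels, with an activation between the second and third) is added to a parallel branch (the fourth kernel) over the same input. The concrete kernel sizes, the depthwise grouping that keeps the large kernel lightweight, and the ReLU are assumptions not fixed by the claim.

import torch
import torch.nn as nn

class FirstConvBlock(nn.Module):
    def __init__(self, c_in=3, c=32):
        super().__init__()
        self.conv1 = nn.Conv2d(c_in, c, kernel_size=3, padding=1)  # first convolution kernel
        # Second kernel: an assumed 9x9 depthwise convolution, one common way
        # to keep a large kernel lightweight.
        self.conv2 = nn.Conv2d(c, c, kernel_size=9, padding=4, groups=c)
        self.act = nn.ReLU()                           # activation between second and third kernels
        self.conv3 = nn.Conv2d(c, c, kernel_size=1)    # third convolution kernel
        self.conv4 = nn.Conv2d(c_in, c, kernel_size=1) # fourth, parallel convolution kernel

    def forward(self, x):
        serial = self.conv3(self.act(self.conv2(self.conv1(x))))
        return serial + self.conv4(x)                  # addition of the two branches

out = FirstConvBlock()(torch.randn(1, 3, 64, 64))
print(out.shape)  # torch.Size([1, 32, 64, 64])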
5. The method of claim 3 or 4, further comprising:
based on the feature pyramid structure of the text detection model, performing feature extraction on the first image carrying the text information to obtain a plurality of feature layers;
splicing the plurality of feature layers to obtain a splicing feature;
wherein at least one of the plurality of feature layers is obtained by convolution processing performed by the at least one first convolution block.
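One plausible reading of this splicing step — resizing every pyramid level to a common resolution and concatenating along the channel dimension — is sketched below; the three-level pyramid and the bilinear upsampling are assumptions.

import torch
import torch.nn.functional as F

def splice_pyramid(feature_layers):
    # Resize every level to the finest level's spatial size, then splice
    # (concatenate) the levels along the channel dimension.
    target = feature_layers[0].shape[-2:]
    resized = [F.interpolate(f, size=target, mode="bilinear", align_corners=False)
               for f in feature_layers]
    return torch.cat(resized, dim=1)

layers = [torch.randn(1, 16, 64, 64), torch.randn(1, 16, 32, 32), torch.randn(1, 16, 16, 16)]
print(splice_pyramid(layers).shape)  # torch.Size([1, 48, 64, 64])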
6. The method of claim 5, wherein the evaluating the feature related to text detection in the text information based on the attention mechanism to obtain an evaluation result for characterizing the importance degree of the feature comprises:
under the condition that the feature related to text detection in the text information is the splicing feature, performing, based on the attention mechanism, normalization processing for feature evaluation on the splicing feature to obtain a normalized feature;
obtaining a target feature according to the normalized feature and the splicing feature;
and taking the target feature as the evaluation result of the importance degree of the feature.
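The sketch below shows one way such an evaluation could work: pool the splicing feature into per-channel statistics, normalize them into importance weights, and combine the weights with the splicing feature to form the target feature. The average pooling and the softmax normalization are assumptions; the claim does not fix a particular normalization.

import torch

def evaluate_splicing_feature(spliced):
    pooled = spliced.mean(dim=(2, 3))             # per-channel statistics, shape (N, C)
    weights = torch.softmax(pooled, dim=1)        # normalized feature importance
    target = spliced * weights[:, :, None, None]  # target feature: re-weighted splicing feature
    return target

target = evaluate_splicing_feature(torch.randn(1, 48, 64, 64))
print(target.shape)  # torch.Size([1, 48, 64, 64])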
7. The method of claim 6, wherein the performing text detection according to the evaluation result to obtain a text detection result comprises:
and under the condition that a head network structure is arranged in the text detection model, performing text detection according to the target feature and the head network structure to obtain the text detection result.
8. A model training method, comprising:
inputting an image sample carrying text information into a text detection model to be trained;
in the text detection model to be trained, performing convolution processing on the image sample carrying the text information according to at least one first convolution block used for representing a light-weight large convolution kernel to obtain features related to text detection in the text information;
in the text detection model to be trained, adopting an auxiliary detection head to predict the features related to text detection in the text information to obtain a first prediction result;
in the text detection model to be trained, evaluating the first prediction result based on an attention mechanism to obtain an evaluation result for representing the feature importance degree;
and performing model training on the text detection model to be trained according to the evaluation result and a pre-training target to obtain a trained text detection model.
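Read as a training loop, claim 8 could look roughly like the sketch below: backbone features go through an auxiliary detection head, the prediction is re-weighted by an attention-style evaluation, and the result is trained against a pre-training target. Every module, the sigmoid evaluation, and the binary cross-entropy loss are stand-ins; the patent does not publish code.

import torch
import torch.nn as nn
import torch.nn.functional as F

backbone = nn.Conv2d(3, 16, kernel_size=9, padding=4)  # stand-in for the large-kernel blocks
aux_head = nn.Conv2d(16, 1, kernel_size=1)             # auxiliary detection head
params = list(backbone.parameters()) + list(aux_head.parameters())
optimizer = torch.optim.SGD(params, lr=0.01)

sample = torch.randn(4, 3, 64, 64)     # image samples carrying text information
target_map = torch.rand(4, 1, 64, 64)  # pre-training target, e.g. a text-region mask

features = backbone(sample)            # features related to text detection
prediction = aux_head(features)        # first prediction result
weights = torch.sigmoid(prediction)    # attention-style evaluation of the prediction
evaluated = prediction * weights       # evaluation result used for training
loss = F.binary_cross_entropy_with_logits(evaluated, target_map)
loss.backward()
optimizer.step()
print(float(loss))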
9. The method of claim 8, further comprising:
setting the at least one first convolution block in a feature pyramid structure of the text detection model to be trained;
arranging the auxiliary detection head at an output portion of the feature pyramid structure;
and applying the attention mechanism at the output portion of the feature pyramid structure.
10. The method of claim 8 or 9, wherein the at least one first convolution block comprises: a first convolution kernel, a second convolution kernel accepting the output of the first convolution kernel, a third convolution kernel accepting the output of the second convolution kernel after processing by an activation function, and a fourth convolution kernel whose output is added to that of the third convolution kernel.
11. The method of claim 9, wherein the feature pyramid structure comprises a plurality of feature layers, wherein at least one feature layer of the plurality of feature layers is obtained by convolution processing performed by the at least one first convolution block.
12. The method of claim 10, wherein the performing convolution processing on the image sample carrying the text information in the text detection model to be trained according to the at least one first convolution block to obtain features related to text detection in the text information comprises:
inputting the image sample carrying the text information into the first convolution kernel, inputting the output of the first convolution kernel into the second convolution kernel, and inputting the output of the second convolution kernel, after activation function processing, into the third convolution kernel;
and inputting the image sample carrying the text information into the fourth convolution kernel for convolution processing, and adding the output of the fourth convolution kernel to the output of the third convolution kernel to obtain the features related to text detection in the text information.
13. The method of claim 11, further comprising:
based on the feature pyramid structure, performing feature extraction on the image sample carrying the text information to obtain the plurality of feature layers;
and splicing the plurality of feature layers to obtain a splicing feature.
14. The method of claim 13, wherein the predicting, in the text detection model to be trained, features related to text detection in the text information by using an auxiliary detection head to obtain a first prediction result comprises:
and under the condition that the feature related to the text detection in the text information is the splicing feature, predicting the splicing feature to obtain the first prediction result.
15. The method of claim 13, wherein the evaluating the first prediction result based on the attention mechanism to obtain an evaluation result for characterizing feature importance comprises:
performing, based on the attention mechanism, normalization processing for feature evaluation on the first prediction result to obtain a normalized feature;
obtaining a feature to be input into a head network structure according to the normalized feature and the first prediction result;
and taking the feature to be input into the head network structure as the evaluation result for representing the importance degree of the feature.
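A minimal sketch of this step, under the same assumptions as the earlier evaluation sketch (sigmoid as the normalization, element-wise re-weighting as the combination), follows.

import torch

def head_input_from_prediction(first_prediction):
    normalized = torch.sigmoid(first_prediction)  # normalization for feature evaluation
    return first_prediction * normalized          # feature to be input into the head network structure

feats = head_input_from_prediction(torch.randn(1, 8, 32, 32))
print(feats.shape)  # torch.Size([1, 8, 32, 32])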
16. A text detection apparatus comprising:
the first processing module is used for performing convolution processing on a first image carrying text information according to at least one first convolution block used for representing a light-weight large convolution kernel to obtain features related to text detection in the text information;
the first evaluation module is used for evaluating the features related to text detection in the text information based on an attention mechanism to obtain an evaluation result for representing the importance degree of the features;
and the detection module is used for carrying out text detection according to the evaluation result to obtain a text detection result.
17. The apparatus of claim 16, wherein the first processing module is configured to:
and performing layer-by-layer convolution processing on the first image carrying the text information according to the multi-layer convolution kernels in the at least one first convolution block to obtain the features related to text detection in the text information.
18. The apparatus of claim 17, further comprising: a configuration module to:
setting the at least one first convolution block in a feature pyramid structure of a text detection model;
configuring the multi-layer convolution kernels in the at least one first convolution block according to a convolution kernel size;
wherein the multi-layer convolution kernels include: a first convolution kernel, a second convolution kernel accepting the output of the first convolution kernel, a third convolution kernel accepting the output of the second convolution kernel after processing by an activation function, and a fourth convolution kernel whose output is added to that of the third convolution kernel.
19. The apparatus of claim 18, wherein the first processing module is configured to:
inputting the first image carrying the text information into the first convolution kernel, inputting the output of the first convolution kernel into the second convolution kernel, and inputting the output of the second convolution kernel, after activation function processing, into the third convolution kernel;
and inputting the first image carrying the text information into the fourth convolution kernel for convolution processing, and adding the output of the fourth convolution kernel to the output of the third convolution kernel to obtain the features related to text detection in the text information.
20. The apparatus of claim 18 or 19, further comprising: an extraction module to:
based on the feature pyramid structure of the text detection model, performing feature extraction on the first image carrying the text information to obtain a plurality of feature layers;
splicing the plurality of feature layers to obtain a splicing feature;
wherein at least one of the plurality of feature layers is obtained by convolution processing performed by the at least one first convolution block.
21. The apparatus of claim 20, wherein the first evaluation module is to:
under the condition that the feature related to text detection in the text information is the splicing feature, performing, based on the attention mechanism, normalization processing for feature evaluation on the splicing feature to obtain a normalized feature;
obtaining a target feature according to the normalized feature and the splicing feature;
and taking the target feature as the evaluation result of the importance degree of the feature.
22. The apparatus of claim 21, wherein the detection module is to:
and under the condition that a head network structure is set in the text detection model, performing text detection according to the target feature and the head network structure to obtain a text detection result.
23. A model training apparatus comprising:
the input module is used for inputting the image sample carrying the text information into the text detection model to be trained;
the second processing module is used for performing convolution processing, in the text detection model to be trained, on the image sample carrying the text information according to at least one first convolution block used for representing a light-weight large convolution kernel to obtain features related to text detection in the text information;
the prediction module is used for predicting, in the text detection model to be trained, the features related to text detection in the text information by adopting an auxiliary detection head to obtain a first prediction result;
the second evaluation module is used for evaluating, in the text detection model to be trained, the first prediction result based on an attention mechanism to obtain an evaluation result for representing the importance degree of the feature;
and the training module is used for carrying out model training on the text detection model to be trained according to the evaluation result and the pre-training target to obtain the trained text detection model.
24. The apparatus of claim 23, further comprising: a setup module to:
setting the at least one first convolution block in a feature pyramid structure of the text detection model to be trained;
arranging the auxiliary detection head at an output portion of the feature pyramid structure;
and applying the attention mechanism at the output portion of the feature pyramid structure.
25. The apparatus of claim 23 or 24, wherein the at least one first convolution block comprises: a first convolution kernel, a second convolution kernel accepting the output of the first convolution kernel, a third convolution kernel accepting the output of the second convolution kernel after processing by an activation function, and a fourth convolution kernel whose output is added to that of the third convolution kernel.
26. The apparatus of claim 24, wherein the feature pyramid structure comprises a plurality of feature layers, wherein at least one feature layer of the plurality of feature layers is obtained by convolution processing performed by the at least one first convolution block.
27. The apparatus of claim 25, wherein the second processing module is configured to:
inputting the image sample carrying the text information into the first convolution kernel, inputting the output of the first convolution kernel into the second convolution kernel, and inputting the output of the second convolution kernel, after activation function processing, into the third convolution kernel;
and inputting the image sample carrying the text information into the fourth convolution kernel for convolution processing, and adding the output of the fourth convolution kernel to the output of the third convolution kernel to obtain the features related to text detection in the text information.
28. The apparatus of claim 26, further comprising: a third processing module to:
based on the feature pyramid structure, performing feature extraction on the image sample carrying the text information to obtain the plurality of feature layers;
and splicing the plurality of feature layers to obtain a splicing feature.
29. The apparatus of claim 28, wherein the prediction module is to:
and under the condition that the feature related to the text detection in the text information is the splicing feature, predicting the splicing feature to obtain the first prediction result.
30. The apparatus of claim 28, wherein the second evaluation module is to:
performing, based on the attention mechanism, normalization processing for feature evaluation on the first prediction result to obtain a normalized feature;
obtaining a feature to be input into a head network structure according to the normalized feature and the first prediction result;
and taking the feature to be input into the head network structure as the evaluation result for representing the importance degree of the feature.
31. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-15.
32. A non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-15.
33. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-15.
CN202210534115.3A 2022-05-17 2022-05-17 Text detection method, model training method, device and electronic equipment Active CN114743206B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210534115.3A CN114743206B (en) 2022-05-17 2022-05-17 Text detection method, model training method, device and electronic equipment

Publications (2)

Publication Number Publication Date
CN114743206A true CN114743206A (en) 2022-07-12
CN114743206B CN114743206B (en) 2023-10-27

Family

ID=82285779

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210534115.3A Active CN114743206B (en) 2022-05-17 2022-05-17 Text detection method, model training method, device and electronic equipment

Country Status (1)

Country Link
CN (1) CN114743206B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110097049A (en) * 2019-04-03 2019-08-06 中国科学院计算技术研究所 A natural scene text detection method and system
CN111291759A (en) * 2020-01-17 2020-06-16 北京三快在线科技有限公司 Character detection method and device, electronic equipment and storage medium
CN111723841A (en) * 2020-05-09 2020-09-29 北京捷通华声科技股份有限公司 Text detection method and device, electronic equipment and storage medium
CN112966684A (en) * 2021-03-15 2021-06-15 北湾科技(武汉)有限公司 Cooperative learning character recognition method under attention mechanism
CN113033249A (en) * 2019-12-09 2021-06-25 中兴通讯股份有限公司 Character recognition method, device, terminal and computer storage medium thereof
CN113378832A (en) * 2021-06-25 2021-09-10 北京百度网讯科技有限公司 Text detection model training method, text prediction box method and device
CN113723352A (en) * 2021-09-13 2021-11-30 中国银行股份有限公司 Text detection method, system, storage medium and electronic equipment
CN113903022A (en) * 2021-09-23 2022-01-07 山东师范大学 Text detection method and system based on feature pyramid and attention fusion
US20220114822A1 (en) * 2021-03-16 2022-04-14 Beijing Baidu Netcom Science Technology Co., Ltd. Method, apparatus, device, storage medium and program product of performing text matching
CN114494697A (en) * 2022-01-26 2022-05-13 浙江大学医学院附属儿童医院 Semantic understanding method for hip bone image of newborn

Also Published As

Publication number Publication date
CN114743206B (en) 2023-10-27


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant