CN112818975A

CN112818975A - Text detection model training method and device and text detection method and device

Info

Publication number: CN112818975A
Application number: CN202110109985.1A
Authority: CN
Inventors: 张鹏远; 李长亮
Original assignee: Beijing Kingsoft Software Co Ltd
Current assignee: Beijing Kingsoft Software Co Ltd
Priority date: 2021-01-27
Filing date: 2021-01-27
Publication date: 2021-05-18
Anticipated expiration: 2041-01-27
Also published as: CN112818975B

Abstract

The application provides a text detection model training method and device and a text detection method and device, wherein the text detection model training method comprises the following steps: inputting a target training image into the text detection model, wherein the target training image is marked with a corresponding marking frame; extracting a plurality of initial feature maps with different scales corresponding to the target training image through the feature extraction layer; pooling the plurality of initial feature maps with different scales through the feature pooling layer to obtain a plurality of enhanced feature maps with different scales; fusing the enhanced feature maps with different scales through the feature fusion layer to obtain a plurality of prediction frames; and determining a target prediction frame in the prediction frames, determining a loss value based on the target prediction frame and a labeling frame corresponding to the target training image, and training the text detection model according to the loss value until a training stop condition is reached.

Description

Text detection model training method and device and text detection method and device

Technical Field

The present application relates to the field of computer technologies, and in particular, to a text detection model training method and apparatus, a text detection method and apparatus, a computing device, and a computer-readable storage medium.

Background

With the rapid development of computer technology, the image processing field has also been rapidly developed, wherein text detection is also a very important branch in the image processing field.

Most of the existing text detection is based on the training data of a manually marked text picture as a model, the training data needs to be marked by a large amount of manpower and material resources, or the marked data is purchased at a high cost, the cost is very high, in the existing text detection model, the relation between image channels is mostly not considered, when detecting text regions of complex backgrounds (such as complex colors, complex textures and the like), the missing phenomenon often occurs, the finally determined text detection position is often inaccurate, and the condition of misjudgment also occurs.

Therefore, how to solve the above problems becomes an urgent problem to be solved by the skilled person.

Disclosure of Invention

In view of this, embodiments of the present application provide a text detection model training method and apparatus, a text detection method and apparatus, a computing device, and a computer-readable storage medium, so as to solve technical defects in the prior art.

According to a first aspect of the embodiments of the present application, there is provided a text detection model training method, including:

inputting a target training image into a text detection model, wherein the target training image is marked with a corresponding marking frame, and the text detection model comprises a feature extraction layer, a feature pooling layer and a feature fusion layer;

extracting a plurality of initial feature maps with different scales corresponding to the target training image through the feature extraction layer;

pooling the plurality of initial feature maps with different scales through the feature pooling layer to obtain a plurality of enhanced feature maps with different scales;

fusing the enhanced feature maps with different scales through the feature fusion layer to obtain a plurality of prediction frames;

and determining a target prediction frame in the prediction frames, determining a loss value based on the target prediction frame and a labeling frame corresponding to the target training image, and training the text detection model according to the loss value until a training stop condition is reached.

According to a second aspect of the embodiments of the present application, there is provided a text detection method, including:

acquiring an image to be detected, wherein the image to be detected comprises a text to be detected;

inputting the image to be detected into a pre-trained text detection model, wherein the text detection model is obtained by training through the text detection model training method;

and the text detection model responds to the image to be detected as input to generate a predicted text box corresponding to the text to be detected.

According to a third aspect of the embodiments of the present application, there is provided a text detection model training apparatus, including:

the system comprises an acquisition module, a text detection module and a processing module, wherein the acquisition module is configured to input a target training image into a text detection model, the target training image is marked with a corresponding marking frame, and the text detection model comprises a feature extraction layer, a feature pooling layer and a feature fusion layer;

the extraction module is configured to extract a plurality of initial feature maps with different scales corresponding to the target training image through the feature extraction layer;

a pooling module configured to pool the plurality of initial feature maps of different scales through the feature pooling layer to obtain a plurality of enhanced feature maps of different scales;

a fusion module configured to fuse the plurality of enhanced feature maps of different scales through the feature fusion layer to obtain a plurality of prediction boxes;

the training module is configured to determine a target prediction box in the prediction boxes, determine a loss value based on the target prediction box and a label box corresponding to the target training image, and train the text detection model according to the loss value until a training stop condition is reached.

According to a fourth aspect of embodiments of the present application, there is provided a text detection apparatus, including:

the device comprises an acquisition module, a detection module and a display module, wherein the acquisition module is configured to acquire an image to be detected, and the image to be detected comprises a text to be detected;

the input module is configured to input the image to be detected to a pre-trained text detection model, wherein the text detection model is obtained by training through the text detection model training method;

and the generation module is configured to respond to the image to be detected as input by the text detection model and generate a predicted text box corresponding to the text to be detected.

According to a fifth aspect of embodiments herein, there is provided a computing device comprising a memory, a processor and computer instructions stored on the memory and executable on the processor, the processor implementing the steps of the text detection model training method or text detection method when executing the instructions.

According to a sixth aspect of embodiments of the present application, there is provided a computer-readable storage medium storing computer instructions which, when executed by a processor, implement the text detection model training method or the steps of the text detection method.

According to a seventh aspect of the embodiments of the present application, there is provided a chip storing computer instructions, which when executed by the chip, implement the steps of the text detection model training method or the text detection method.

The text detection model training method provided by the embodiment of the application comprises the following steps: inputting a target training image into a text detection model, wherein the target training image is marked with a corresponding marking frame, and the text detection model comprises a feature extraction layer, a feature pooling layer and a feature fusion layer; extracting a plurality of initial feature maps with different scales corresponding to the target training image through the feature extraction layer; pooling the plurality of initial feature maps with different scales through the feature pooling layer to obtain a plurality of enhanced feature maps with different scales; fusing the enhanced feature maps with different scales through the feature fusion layer to obtain a plurality of prediction frames; and determining a target prediction frame in the prediction frames, determining a loss value based on the target prediction frame and a labeling frame corresponding to the target training image, and training the text detection model according to the loss value until a training stop condition is reached. The text detection model provided by the method can effectively enhance the relation between the features through the feature extraction layer, effectively enhance the precision of the text in a complex background area, and simultaneously increase the network structure of the feature pooling layer, thereby effectively increasing the receptive field of a target area, reducing the missing detection phenomenon of small target objects, enhancing the recognition accuracy of the text detection model on the whole and improving the recognition efficiency.

Secondly, a novel data amplification form is adopted, the problem of inaccurate identification caused by insufficient artificial marking data and target shielding is solved, and meanwhile the generalization of a text detection model is enhanced.

Drawings

FIG. 1 is a block diagram of a computing device provided by an embodiment of the present application;

FIG. 2 is a flowchart of a text detection model training method provided in an embodiment of the present application;

FIG. 3 is a schematic diagram of a text detection model training method according to another embodiment of the present application;

fig. 4 is a schematic flowchart of a text detection method provided in an embodiment of the present application;

FIG. 5 is a schematic structural diagram of a text detection model training apparatus according to an embodiment of the present disclosure;

fig. 6 is a schematic structural diagram of a text detection apparatus according to an embodiment of the present application.

Detailed Description

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. This application is capable of implementation in many different ways than those herein set forth and of similar import by those skilled in the art without departing from the spirit of this application and is therefore not limited to the specific implementations disclosed below.

The terminology used in the one or more embodiments of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the one or more embodiments of the present application. As used in one or more embodiments of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present application refers to and encompasses any and all possible combinations of one or more of the associated listed items.

It will be understood that, although the terms first, second, etc. may be used herein in one or more embodiments of the present application to describe various information, these information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first aspect may be termed a second aspect, and, similarly, a second aspect may be termed a first aspect, without departing from the scope of one or more embodiments of the present application. The word "if," as used herein, may be interpreted as "responsive to a determination," depending on the context.

First, the noun terms to which one or more embodiments of the present invention relate are explained.

Text detection: given a text image, the location of the text is automatically located.

K-means clustering: the K-means clustering algorithm is a clustering analysis algorithm for iterative solution, and comprises the steps of dividing data into K groups in advance, randomly selecting K objects as initial clustering centers, calculating the distance between each object and each seed clustering center, and allocating each object to the nearest clustering center. For each object assigned, the cluster center of the cluster is recalculated based on the objects existing in the cluster.

Yolov 3: the method is based on a Darknet-53 target detection network structure, and the Darknet is a feature extraction network based on a residual error structure.

FPN: a Feature pyramid network (Feature pyramid Networks) is a multi-scale target detection method.

An attention mechanism is as follows: attention, a mechanism for resource allocation, can be understood as reallocating resources according to the importance of an attribute object for the otherwise evenly allocated resources.

ASPP: the void space convolutional pooling pyramid (empty spatial convolutional pyramid), a method of convolutional parallel sampling with voids of different sampling rates for a given input.

logistic layer: and the network structure is used for classifying the detection frames.

IOU: an index for calculating and evaluating the degree of coincidence between detection frames.

In the present application, a text detection model training method and apparatus, a text detection method and apparatus, a computing device, and a computer-readable storage medium are provided, and detailed descriptions are made one by one in the following embodiments.

FIG. 1 shows a block diagram of a computing device 100 according to an embodiment of the present application. The components of the computing device 100 include, but are not limited to, memory 110 and processor 120. The processor 120 is coupled to the memory 110 via a bus 130 and a database 150 is used to store data.

Computing device 100 also includes access device 140, access device 140 enabling computing device 100 to communicate via one or more networks 160. Examples of such networks include the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the internet. Access device 140 may include one or more of any type of network interface (e.g., a Network Interface Card (NIC)) whether wired or wireless, such as an IEEE802.11 Wireless Local Area Network (WLAN) wireless interface, a worldwide interoperability for microwave access (Wi-MAX) interface, an ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a bluetooth interface, a Near Field Communication (NFC) interface, and so forth.

In one embodiment of the present application, the above-mentioned components of the computing device 100 and other components not shown in fig. 1 may also be connected to each other, for example, by a bus. It should be understood that the block diagram of the computing device architecture shown in FIG. 1 is for purposes of example only and is not limiting as to the scope of the present application. Those skilled in the art may add or replace other components as desired.

Computing device 100 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), a mobile phone (e.g., smartphone), a wearable computing device (e.g., smartwatch, smartglasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 100 may also be a mobile or stationary server.

The processor 120 may perform the steps of the text detection model training method shown in fig. 2. Fig. 2 shows a flowchart of a text detection model training method according to an embodiment of the present application, including steps 202 to 210.

Step 202: inputting a target training image into a text detection model, wherein the target training image is marked with a corresponding marking frame, and the text detection model comprises a feature extraction layer, a feature pooling layer and a feature fusion layer.

The target training image is a text image used for training a text detection model, a corresponding marking box is marked in the text image, and the marking box is used for marking a text area needing to be recognized in the text image.

The text detection module at least comprises a feature extraction layer, a feature pooling layer and a feature fusion layer, wherein the feature extraction layer is preferably a feature extraction layer with a fusion attention mechanism.

In practical applications, before inputting the target training image into the text detection model, the method further includes:

and acquiring a target training image in a preset training set.

The preset training set is a training set comprising a plurality of text images, and a large number of target training images are stored in the preset training set.

In practical application, the target training images in the preset training set are target detection images based on a large number of manual labels, the labeling images can consume a large amount of labor and time through a manual labeling method, time and labor are wasted, and therefore training data in the training set can be increased through a data capacity expansion mode for the training set, and specifically, the target training images in the preset training set are obtained, and the method comprises the following steps:

acquiring an initial training set, wherein the initial training set comprises a plurality of training images;

and performing data amplification processing on the plurality of training images to generate a training set after data amplification.

Performing data amplification processing on the plurality of training images, wherein the data amplification processing comprises the following steps: and carrying out amplification processing on the data in a mode of any one or combination of several of random clipping, random translation, contrast change, brightness change, transparency change, random shielding and random filling on the plurality of training images.

After an initial training set is obtained, data amplification needs to be carried out on sample images in the training set, the data amplification method comprises a Cutout algorithm (random occlusion) and an FMix algorithm (random filling) besides the processing of random cutting, stretching, contrast, brightness and transparency, the Cutout algorithm is that a square area with a fixed size is randomly selected, and then all 0 is adopted for filling; the FMix algorithm binarizes an image according to high-frequency and low-frequency regions of the image, then uses a mask to weight pixels, and introduces the two data amplification algorithms to solve the problem of insufficient target shielding. Data amplification can increase samples of a training set, can effectively relieve the overfitting condition of the model, and can bring stronger generalization capability to the model.

In target detection, there is usually a concept of a prior frame, where the prior frame is preset with the width and height of a common target, and when performing prediction, the width and height that have been preset can be used to help perform target detection, the size of the prior frame is usually obtained by K-means clustering, for example, in Yolov3, 9 prior frames are usually obtained by K-means clustering algorithm clustering, 3 prior frames are respectively set for the large, medium, and small scales, the size of the prior frame of each size is generated according to training data clustering in practical application, and is not limited in this application, for example, the size of the prior frame may be 116 × 90, 156 × 198, 373 × 326, 30 × 61, 62 × 45, 59 × 119, 10 × 13, 16 × 30, 33 × 23; or 5 × 24, 5 × 36, 6 × 25, 9 × 65, 9 × 48, 9 × 70, 14 × 155, 15 × 178, 16 × 180, etc., wherein the application of larger prior frames (e.g., 14 × 155, 15 × 178, 16 × 180) on smaller scale signatures is suitable for detecting larger objects; applying medium a priori boxes (e.g. 9 x 65, 9 x 48, 9 x 70) on the medium scale feature map, suitable for detecting medium sized objects; on larger scale feature maps, smaller a priori boxes (e.g., 5 x 24, 5 x 36, 6 x 25) are applied, appropriate for detecting smaller objects.

The training images may include images of various scenes, such as live scenes, game scenes, outdoor scenes, and the like, and the training images include text information of various types of characters, shapes, languages, and the like, and at least one of the images or the texts may be recognized from the training images. The method includes the steps that a training image comprises a manually marked marking frame, the position of the marking frame is the position needing to be identified, the content in the marking frame is the content needing to be identified, the marking frame is usually a rectangular marking frame and can also be other polygonal marking frames, the method is not limited in the application, in practical application, a preset training set is divided into two parts, namely a training subset and a testing subset, in the model training process, a target training image is obtained from the training subset, and after model training is completed, a target detection image is obtained from the testing subset and used for detecting the performance of a model.

The text detection model trained by the application is used for detecting the position of the text area in the text image, so that the position of the text in the image can be quickly and accurately positioned, the time can be saved when the subsequent text recognition is carried out, and the efficiency is improved. The text detection model comprises a feature extraction layer, a feature pooling layer and a feature fusion layer which are fused with an attention mechanism.

In a specific embodiment provided by the application, the target training image is a photograph of the resume a, labeling boxes are marked at the name, age and telephone of the resume a, and the resume a of the target training image is input into a text detection model for training, wherein the text detection model comprises a feature extraction layer, a feature pooling layer and a feature fusion layer which are fused with an attention mechanism.

Step 204: and extracting a plurality of initial feature maps with different scales corresponding to the target training image through the feature extraction layer.

The feature extraction layer preferably fuses an attention mechanism, the feature extraction layer of the fused attention mechanism includes a plurality of channels, the attention mechanism is fused among the plurality of channels, and accordingly, a plurality of initial feature maps of different scales corresponding to the target training image are extracted through the feature extraction layer, and the method includes:

extracting a plurality of initial feature maps of different scales corresponding to the target training image through the plurality of channels and the attention mechanism fused among the plurality of channels.

The feature extraction layer of the fusion attention mechanism preferably uses a modified Darknet-53 structure in Yolov3, namely, on the basis of the Darknet-53 structure in Yolov3, the attention mechanism between channels is increased. The Darknet-53 in the Yolov3 is a full convolution network and is used for extracting a plurality of initial feature maps with different scales corresponding to a target training image, specifically, feature extraction is performed on the target training image through different feature channels, and feature notability is screened and weighted on channel dimensions through an attention mechanism, so that detection performance is improved, contact among channel features is enhanced, and a good effect is achieved for detecting a text region with complex features.

The feature extraction layer is used for extracting initial feature maps of the target training image on different scales, outputting 3 feature images X1, X2 and X3 of different scales, wherein the depths of X1, X2 and X3 are 255, the side length rule is 13:26:52, and 3 prediction frames are output in each feature image and the total number of the prediction frames is 9.

In a specific embodiment provided by the present application, the above example is used, and the photograph of the resume a is input to a Darknet-53-attention (feature extraction layer of the fusion attention mechanism) for feature extraction, so as to obtain a feature image X1 of a scale for detecting a large target, a feature image X2 of a scale for detecting a medium target, and a feature image X3 of a scale for detecting a small target.

Step 206: and pooling the plurality of initial feature maps with different scales through the feature pooling layer to obtain a plurality of enhanced feature maps with different scales.

In practical applications, in order to ensure that the characteristics of the picture have a larger receptive field and that the resolution of the characteristic map is not reduced too much (the resolution is reduced too much and detail information of the image boundary is lost), the above problem may be solved by a hole convolution method, and preferably, in the text detection model provided by the present application, the characteristic pooling layer includes a hole space convolution pooling pyramid;

correspondingly, pooling the plurality of initial feature maps of different scales by the feature pooling layer includes:

pooling the plurality of initial feature maps of different scales by the void space convolution pooling pyramid.

The empty space convolution pooling pyramid (ASPP) layer functions to increase the receptive field similarly without the use of pooling and downsampling. Each output of the convolution has information in a large range, the sensing visual field of a target area is increased, the cavity convolution with different sampling rates can effectively capture more scale information, and the omission phenomenon of small target objects is reduced.

In a specific embodiment provided by the present application, the feature images X1, X2, and X3 at different scales are input into the feature pooling layer for processing, and a plurality of feature-enhanced feature maps Y1, Y2, and Y3 at different scales are obtained.

Step 208: and fusing the plurality of enhanced feature graphs with different scales through the feature fusion layer to obtain a plurality of prediction frames.

Preferably, the feature fusion layer comprises a feature map pyramid network;

fusing the enhanced feature maps of different scales through the feature fusion layer to obtain a plurality of prediction frames, including:

and fusing the plurality of enhanced feature maps with different scales through the feature map pyramid network to obtain a plurality of prediction frames and a score corresponding to each prediction frame.

The method comprises the steps that a characteristic graph pyramid network (FPN) is adopted, the FPN solves the multi-scale problem in object detection, the network connection can be changed, the performance of small object detection is greatly improved on the premise that the original model calculation amount is not increased basically, in the characteristic graphs with different scales, the characteristic semantic information of the bottom layer is less, the target position is accurate, the characteristic semantic information of the high layer is rich, the target position is rough, the FPN prediction is independently carried out on the basis of the characteristic graphs with different scales, the top sampling is carried out through the characteristics of the high layer, the connection from top to bottom is carried out through the characteristics of the bottom layer, the corresponding prediction can be carried out on each layer, a plurality of different prediction results are output, a plurality of prediction boxes are finally generated, and meanwhile, the score corresponding to each prediction box is generated.

In a specific embodiment provided by the application, along with the above example, by inputting a plurality of enhanced feature maps Y1, Y2, and Y3 with different scales into the feature fusion layer for processing, prediction boxes with a plurality of scales can be generated, and each prediction box has a corresponding score.

Step 210: and determining a target prediction frame in the prediction frames, determining a loss value based on the target prediction frame and a labeling frame corresponding to the target training image, and training the text detection model according to the loss value until a training stop condition is reached.

In practical applications, after obtaining the prediction boxes in the above steps, a score corresponding to each prediction box may also be obtained, and accordingly, determining a target prediction box in the plurality of prediction boxes includes: and determining the prediction box with the highest score as a target prediction box.

In practical application, determining a loss value based on the target prediction frame and the labeling frame corresponding to the target training image includes:

and determining a loss value based on the position information of the target prediction frame and the position information of the labeling frame corresponding to the target training image.

After a predicted target prediction frame is obtained, the position information of the target prediction frame can be determined according to the coordinate of a certain vertex of the target prediction frame and the length and width of the target prediction frame, meanwhile, the position information of a labeling frame can be determined according to the coordinate of a certain vertex of the labeling frame corresponding to a target training object and the length and width of the labeling frame, a loss value can be determined based on the position information of the target prediction frame and the position information of the labeling frame corresponding to the target training image, and methods for determining the loss value are various, such as a cross entropy loss function, a maximum loss function, an average loss function and the like.

Optionally, training the text detection model according to the loss value includes:

and adjusting model parameters in a feature extraction layer, a feature pooling layer and a feature fusion layer in the text detection model according to the loss value.

Training the text detection model according to the loss value specifically comprises the step of adjusting model parameters in a feature extraction layer, a feature pooling layer and a feature fusion layer in the text detection model according to the loss value.

Selecting a prediction frame with the highest score as a target prediction frame according to scores corresponding to prediction frames with different scales, taking the target prediction frame as a prediction detection position of each region, determining a loss value based on a label frame in a target training image, and adjusting parameters of the text detection model according to back propagation of the loss value until a training stop condition is reached, wherein the training stop condition can be a preset training turn, the loss value can be lower than a preset threshold value, or a test can be performed according to a target detection image in a test subset, an obtained coincidence region of the position of the target prediction frame and the label frame is larger than the preset threshold value, and the training stop condition is not specifically limited in the application.

In a specific embodiment provided by the application, a target prediction frame 1 corresponding to a name in a resume, a target prediction frame 2 corresponding to an age and a target prediction frame 3 corresponding to a telephone are respectively determined, loss values are calculated according to the target prediction frame 1, the target prediction frame 2, the target prediction frame 3 and a labeled frame labeled in the resume, parameters of a text detection model are adjusted according to back propagation of the loss values, after a preset turn is trained, the text detection model is detected according to a target detection image in a test subset, and when the coincidence degree of the prediction frame output by the text detection model and the labeled frame in the target test image reaches more than 95%, namely the IOU value is more than 0.95, the text detection model is successfully trained.

The text detection model training method provided by the embodiment of the application comprises the steps of inputting a target training image into a text detection model, wherein the target training image is marked with a corresponding marking frame, and the text detection model comprises a feature extraction layer, a feature pooling layer and a feature fusion layer; extracting a plurality of initial feature maps with different scales corresponding to the target training image through the feature extraction layer; pooling the plurality of initial feature maps with different scales through the feature pooling layer to obtain a plurality of enhanced feature maps with different scales; fusing the enhanced feature maps with different scales through the feature fusion layer to obtain a plurality of prediction frames; and determining a target prediction frame in the prediction frames, determining a loss value based on the target prediction frame and a labeling frame corresponding to the target training image, and training the text detection model according to the loss value until a training stop condition is reached. The text detection model provided by the method can effectively enhance the relation between the features through the feature extraction layer, effectively enhance the precision of the text in a complex background area, and simultaneously increase the network structure of the feature pooling layer, thereby effectively increasing the receptive field of a target area, reducing the missing detection phenomenon of small target objects, enhancing the recognition accuracy of the text detection model on the whole and improving the recognition efficiency.

Fig. 3 is a schematic diagram illustrating a text detection model training method according to an embodiment of the present application, and as shown in fig. 3, the method includes steps 302 to 312.

Step 302: an initial training set is obtained.

Step 304: and performing data amplification processing on the plurality of training images to generate a training set after data amplification.

Step 306: and determining a prior frame of the training image through K-means clustering, and inputting the training image to the text detection model.

The text detection model comprises a feature extraction layer fused with an attention mechanism, a hollow space convolution pooling pyramid and a feature map pyramid network, and a training image is input to the feature extraction layer fused with the attention mechanism for feature extraction to obtain a plurality of initial feature maps with different scales; inputting a plurality of initial feature maps with different scales into a spatial cavity convolution pooling pyramid network for feature enhancement to obtain a plurality of enhanced feature maps with different scales; and inputting the plurality of initial feature maps with different scales and the plurality of enhanced feature maps with different scales into a feature map pyramid network for feature fusion, and outputting a plurality of prediction frames.

Step 308: obtaining a plurality of prediction boxes output by a text detection model, and determining a target prediction box in the prediction boxes.

Step 310: and determining a loss value based on the target prediction frame and the labeling frame corresponding to the target training image.

Step 312: and training the text detection model according to the loss value until a training stopping condition is reached.

According to the text detection model training method provided by the embodiment of the application, the relation among the features can be effectively enhanced through the feature extraction layer of the text strategy model, the precision of the text in a complex background area is effectively enhanced, meanwhile, the network structure of the feature pooling layer is added, the receptive field of a target area can be effectively increased, the phenomenon that small target objects are missed to be detected is reduced, the recognition accuracy of the text detection model is integrally enhanced, and the recognition efficiency is improved.

Fig. 4 shows a flowchart of a text detection method according to an embodiment of the present application, where the text detection method is described by taking text detection on a resume as an example, and includes steps 402 to 406.

Step 402: and acquiring an image to be detected, wherein the image to be detected comprises a text to be detected.

In a specific embodiment provided by the application, the obtained resume picture is an image to be detected, and the contents of name, gender, birth year and month, native place, contact way, work experience and the like in the resume are texts to be detected.

Step 404: and inputting the image to be detected into a pre-trained text detection model, wherein the text detection model is obtained by training through the text detection model training method.

In a specific embodiment provided by the present application, the resume picture is input to a pre-trained text detection model.

Step 406: and the text detection model responds to the image to be detected as input to generate a predicted text box corresponding to the text to be detected.

In a specific embodiment provided by the present application, the text detection model generates a predicted text box on the resume picture in response to the resume picture as an input, where the predicted text box corresponds to the content of the name, the gender, the year and month of birth, the native place, the contact way, the work experience, and the like in the resume picture.

Optionally, the method further comprises:

performing text recognition on the content in the predicted text box based on the predicted text box;

and acquiring text content information corresponding to the text to be detected.

In a specific embodiment provided by the application, text recognition is performed on the content in the resume picture based on the corresponding predicted text boxes for the content such as name, gender, birth year and month, native place, contact way, work experience and the like in the resume picture, so as to obtain the text content in the corresponding predicted text boxes in the resume picture, such as name: zhang III, sex: male, year and month of birth: year, month, by: somewhere, etc. And filling the obtained character content into a preset structured table to realize the process of converting the resume pictures into the text resumes.

The text detection method comprises the steps of obtaining an image to be detected, wherein the image to be detected comprises a text to be detected; inputting the image to be detected into a pre-trained text detection model, wherein the text detection model is obtained by training through the text detection model training method; the text detection model responds to the image to be detected as input to generate the predicted text box corresponding to the text to be detected.

Corresponding to the above embodiment of the text detection model training method, the present application further provides an embodiment of a text detection model training apparatus, and fig. 5 shows a schematic structural diagram of the text detection model training apparatus according to an embodiment of the present application. As shown in fig. 5, the apparatus includes:

an obtaining module 502, configured to input a target training image into a text detection model, where the target training image is labeled with a corresponding labeling box, and the text detection model includes a feature extraction layer, a feature pooling layer, and a feature fusion layer;

an extraction module 504 configured to extract a plurality of initial feature maps of different scales corresponding to the target training image through the feature extraction layer;

a pooling module 506 configured to pool the plurality of different scales of initial feature maps through the feature pooling layer to obtain a plurality of different scales of enhanced feature maps;

a fusion module 508 configured to fuse the plurality of enhanced feature maps of different scales through the feature fusion layer to obtain a plurality of prediction boxes;

a training module 510 configured to determine a target prediction box among the prediction boxes, determine a loss value based on the target prediction box and a label box corresponding to the target training image, and train the text detection model according to the loss value until a training stop condition is reached.

Optionally, the obtaining module 502 is further configured to obtain a target training image in a preset training set.

Optionally, the obtaining module 502 is further configured to:

and carrying out any data amplification processing of random clipping, random translation, contrast change, brightness change, transparency change, random shielding and random filling on the plurality of training images.

Optionally, the feature extraction layer of the fused attention mechanism comprises a plurality of channels, and the fused attention mechanism is formed among the plurality of channels;

the extraction module 504, further configured to:

Optionally, the feature pooling layer comprises a void space convolution pooling pyramid;

the pooling module 506, further configured to:

Optionally, the feature fusion layer comprises a feature map pyramid network;

the fusion module 508, further configured to:

Optionally, the training module 510 is further configured to:

and determining the prediction box with the highest score as a target prediction box.

Optionally, the training module 510 is further configured to:

The text detection model training device provided by the embodiment of the application comprises a step of inputting a target training image into a text detection model, wherein the target training image is marked with a corresponding marking frame, and the text detection model comprises a feature extraction layer, a feature pooling layer and a feature fusion layer; extracting a plurality of initial feature maps with different scales corresponding to the target training image through the feature extraction layer; pooling the plurality of initial feature maps with different scales through the feature pooling layer to obtain a plurality of enhanced feature maps with different scales; fusing the enhanced feature maps with different scales through the feature fusion layer to obtain a plurality of prediction frames; and determining a target prediction frame in the prediction frames, determining a loss value based on the target prediction frame and a labeling frame corresponding to the target training image, and training the text detection model according to the loss value until a training stop condition is reached. The text detection model provided by the device can effectively enhance the relation between the features through the feature extraction layer, effectively enhance the precision of the text under the complex background region, and simultaneously increase the network structure of the feature pooling layer, thereby effectively increasing the receptive field of the target region, reducing the phenomenon of missed detection of small target objects, enhancing the recognition accuracy of the text detection model on the whole and improving the recognition efficiency. Secondly, a novel data amplification form is adopted, the problem of inaccurate identification caused by insufficient artificial marking data and target shielding is solved, and meanwhile the generalization of a text detection model is enhanced.

The above is a schematic scheme of the text detection model training apparatus of this embodiment. It should be noted that the technical solution of the text detection model training apparatus and the technical solution of the text detection model training method belong to the same concept, and details of the technical solution of the text detection model training apparatus, which are not described in detail, can be referred to the description of the technical solution of the text detection model training method.

Corresponding to the above text detection method embodiment, the present application further provides a text detection apparatus embodiment, and fig. 6 shows a schematic structural diagram of the text detection apparatus according to an embodiment of the present application. As shown in fig. 6, the apparatus includes:

an obtaining module 602 configured to obtain an image to be detected, where the image to be detected includes a text to be detected;

an input module 604, configured to input the image to be detected to a pre-trained text detection model, where the text detection model is obtained by training the text detection model by a training method;

a generating module 606 configured to generate a predicted text box corresponding to the text to be detected by the text detection model in response to the image to be detected as an input.

Optionally, the apparatus further comprises:

an identification module configured to perform text identification on content in the predictive text box based on the predictive text box; and acquiring text content information corresponding to the text to be detected.

The text detection device comprises the steps of obtaining an image to be detected, wherein the image to be detected comprises a text to be detected; inputting the image to be detected into a pre-trained text detection model, wherein the text detection model is obtained by training through the text detection model training method; the text detection model responds to the image to be detected as input to generate the predicted text box corresponding to the text to be detected.

The above is a schematic scheme of a text detection apparatus of the present embodiment. It should be noted that the technical solution of the text detection apparatus and the technical solution of the text detection method belong to the same concept, and details that are not described in detail in the technical solution of the text detection apparatus can be referred to the description of the technical solution of the text detection method.

It should be noted that the components in the device claims should be understood as functional blocks which are necessary to implement the steps of the program flow or the steps of the method, and each functional block is not actually defined by functional division or separation. The device claims defined by such a set of functional modules are to be understood as a functional module framework for implementing the solution mainly by means of a computer program as described in the specification, and not as a physical device for implementing the solution mainly by means of hardware.

An embodiment of the present application further provides a computing device, which includes a memory, a processor, and computer instructions stored on the memory and executable on the processor, where the processor implements the text detection model training method or the text detection method when executing the instructions.

The above is an illustrative scheme of a computing device of the present embodiment. It should be noted that the technical solution of the computing device and the technical solution of the text detection model training method or the text detection method belong to the same concept, and details of the technical solution of the computing device, which are not described in detail, can be referred to the description of the technical solution of the text detection model training method or the text detection method.

An embodiment of the present application further provides a computer readable storage medium, which stores computer instructions, and when the instructions are executed by a processor, the method for training a text detection model or the steps of the method for detecting a text are implemented as described above.

The above is an illustrative scheme of a computer-readable storage medium of the present embodiment. It should be noted that the technical solution of the storage medium belongs to the same concept as the technical solution of the text detection model training method or the text detection method, and details that are not described in detail in the technical solution of the storage medium can be referred to the description of the technical solution of the text detection model training method or the text detection method.

The embodiment of the application discloses a chip, which stores computer instructions, and the instructions are executed by a processor to realize the steps of the text detection model training method or the text detection method.

The foregoing description of specific embodiments of the present application has been presented. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.

The computer instructions comprise computer program code which may be in the form of source code, object code, an executable file or some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, usb disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM), Random Access Memory (RAM), electrical carrier wave signals, telecommunications signals, software distribution medium, and the like. It should be noted that the computer readable medium may contain content that is subject to appropriate increase or decrease as required by legislation and patent practice in jurisdictions, for example, in some jurisdictions, computer readable media does not include electrical carrier signals and telecommunications signals as is required by legislation and patent practice.

It should be noted that, for the sake of simplicity, the above-mentioned method embodiments are described as a series of acts or combinations, but those skilled in the art should understand that the present application is not limited by the described order of acts, as some steps may be performed in other orders or simultaneously according to the present application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.

In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

The preferred embodiments of the present application disclosed above are intended only to aid in the explanation of the application. Alternative embodiments are not exhaustive and do not limit the invention to the precise embodiments described. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the application and its practical applications, to thereby enable others skilled in the art to best understand and utilize the application. The application is limited only by the claims and their full scope and equivalents.

Claims

1. A text detection model training method is characterized by comprising the following steps:

2. The method of training a text detection model according to claim 1, further comprising, before inputting the target training image to the text detection model:

and acquiring a target training image in a preset training set.

3. The method for training the text detection model according to claim 2, wherein the obtaining of the target training images in the preset training set comprises:

4. The method of training a text detection model according to claim 3, wherein the performing data augmentation processing on the plurality of training images comprises:

5. The training method for the text detection model according to claim 1, wherein the feature extraction layer of the fusion attention mechanism comprises a plurality of channels, and the fusion attention mechanism is arranged among the plurality of channels;

extracting a plurality of initial feature maps with different scales corresponding to the target training image through the feature extraction layer, wherein the method comprises the following steps:

6. The text detection model training method of claim 1, wherein the feature pooling layer comprises a hole space convolution pooling pyramid;

pooling the plurality of different scales of initial feature maps by the feature pooling layer, comprising:

7. The text detection model training method of claim 1, wherein the feature fusion layer comprises a feature map pyramid network;

8. The method of text detection model training of claim 7, wherein determining a target prediction box among the plurality of prediction boxes comprises:

9. The method for training the text detection model according to claim 1, wherein determining the loss value based on the target prediction box and the label box corresponding to the target training image comprises:

10. The method of claim 1, wherein training the text detection model based on the loss values comprises:

11. A text detection method, comprising:

inputting the image to be detected into a pre-trained text detection model, wherein the text detection model is obtained by training through the training method of any one of the claims 1 to 10;

12. The document detection method of claim 11, wherein the method further comprises:

13. A text detection model training device, comprising:

14. A text detection apparatus, comprising:

an input module, configured to input the image to be detected to a pre-trained text detection model, wherein the text detection model is obtained by training according to the training method of any one of claims 1 to 10;

15. A computing device comprising a memory, a processor, and computer instructions stored on the memory and executable on the processor, wherein the processor implements the steps of the method of any of claims 1-10 or 11-12 when executing the instructions.

16. A computer-readable storage medium storing computer instructions, which when executed by a processor, perform the steps of the method of any one of claims 1-10 or 11-12.