CN111091123A - Text region detection method and equipment - Google Patents


Info

Publication number
CN111091123A
Authority
CN
China
Prior art keywords
corner
text region
image
determining
text
Prior art date
Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Application number
CN201911215779.8A
Other languages
Chinese (zh)
Inventor
周康明 (Zhou Kangming)
吴昊 (Wu Hao)
Current Assignee: Shanghai Eye Control Technology Co Ltd (the listed assignee may be inaccurate)
Original Assignee: Shanghai Eye Control Technology Co Ltd
Application filed by Shanghai Eye Control Technology Co Ltd
Priority to CN201911215779.8A
Publication of CN111091123A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/60 - Type of objects
    • G06V20/62 - Text, e.g. of license plates, overlay texts or captions on TV images
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/44 - Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 - Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 - Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a text region detection method and device. Corner detection is performed on the text regions of an image containing text that requires character recognition, yielding the corresponding corner points, and a text region prediction box is then determined from those corner points. Text regions with multiple display orientations can thus be recognized, the precision of text region detection is improved, and the localized region fits the text characters more closely.

Description

Text region detection method and equipment
Technical Field
The present application relates to the field of image recognition, and in particular, to a text region detection method and apparatus.
Background
Optical character recognition (OCR) technology allows a computer to automatically recognize the text characters contained in an image. As the basis of optical character recognition, a text localization step is required first, that is, locating in the image the area where the text characters lie.
Currently, text regions are mostly localized with a localization-and-classification approach borrowed from object detection, treating the text region as a detection target in the image. The problem with this approach is that most text regions to be detected are elongated rectangles, so their aspect-ratio distribution is extreme and differs greatly from the typical object-detection target, whose aspect ratio is around 1; such text regions therefore cannot be detected well. In addition, a target object in object detection generally has a sharp, closed edge contour, whereas a text region does not, so existing object-detection schemes have difficulty distinguishing text characters from the image background when extracting image features of the text region.
Disclosure of Invention
An object of the present application is to provide a text region detection method and device that address the low recognition accuracy and poor localization of text region detection in the prior art.
In order to achieve the above object, the present application provides a text region detection method, wherein the method includes:
constructing a text region detection model, wherein the text region detection model determines corners through corner detection and determines a text region prediction frame according to the corners;
and performing text region detection on the text image to be detected according to the text region detection model, and determining a corresponding text region prediction frame, wherein the text regions have different display directions.
Further, constructing a text region detection model, comprising:
extracting the characteristics of the sample text image to obtain a corresponding characteristic image;
performing corner detection on the feature image, and determining corners in the sample text image, wherein the types of the corners comprise an upper left corner, an upper right corner, a lower left corner and a lower right corner of the text region prediction box;
classifying and combining the corner points, and determining a text region prediction box according to the classification-and-combination result;
determining the difference between the text region prediction frame and a pre-labeled text region identification frame, and adjusting the parameters of a text region detection model according to the difference;
and when a preset model training stopping condition is met, determining the current parameters of the text region detection model as the final parameters of the text region detection model.
Further, the feature extraction is performed on the sample text image, and a corresponding feature image is obtained, including:
inputting a sample text image into a stacked plurality of hourglass networks, and acquiring a characteristic image output by the plurality of hourglass networks, wherein the hourglass network comprises a convolution layer, a pooling layer, a down-sampling layer and an up-sampling layer.
Further, performing corner detection on the feature image, and determining corners in the sample text image, including:
performing convolution operation on the characteristic image to obtain a feature image after convolution;
performing pooling operation on the feature images after convolution to obtain pooled feature images corresponding to different types of corner points;
generating thermodynamic diagrams corresponding to the different types of corner points according to the pooled feature images corresponding to the different types of corner points;
in the thermodynamic diagram corresponding to each type of corner point, determining the positions where the activation response exceeds a preset threshold as the positions of corner points of that type;
determining the direction of the corner points corresponding to the corner points of the corresponding types according to the pooled feature images corresponding to the corner points of the different types;
and calculating the information difference between the corresponding type corner and the pre-marked corresponding type corner according to a preset loss function, optimizing the information difference according to a preset optimization method, and determining the optimized corresponding type corner.
Further, the size of the convolution kernel used in the convolution operation on the feature image is 3 × 5.
Further, performing pooling operation on the convolved feature images to obtain pooled feature images corresponding to different types of corners, including:
acquiring a first pixel point in the convolved characteristic image;
traversing other pixel points which belong to the same channel and the same row as the first pixel point and are positioned in the preset horizontal direction, and determining the maximum value of the pixel point as the maximum value in the horizontal direction, wherein the preset horizontal direction is determined according to the type of the angular point;
traversing other pixel points which belong to the same channel and the same column as the first pixel point and are positioned in a preset vertical direction, and determining the maximum value of the pixel point as the maximum value in the vertical direction, wherein the preset vertical direction is determined according to the type of the angular point;
and taking the sum of the maximum value in the horizontal direction and the maximum value in the vertical direction as the value of a second pixel point on the pooled feature image corresponding to the type corner, wherein the position of the second pixel point on the pooled feature image corresponding to the type corner corresponds to the position of the first pixel point on the feature image after convolution.
Further, determining the corner direction corresponding to the corner of the corresponding type according to the pooled feature images corresponding to the corner of the different types, including:
inputting the pooled feature images corresponding to the different types of corner points into a normalized exponential function for classification prediction, and determining the angle partition to which a corner point of the corresponding type belongs as the corner direction of that corner point, wherein the angle partitions are a plurality of partitions obtained by evenly dividing the full 360-degree angle.
Further, a preset loss function L used in the process of performing corner detection on the feature image is defined as follows:

L = L_conf + α·L_off

where L_conf is the corner confidence loss function, a focal-style term defined as:

L_conf = -(1/N) · Σ_{i=1..H} Σ_{j=1..W} (1 - p_ij)^γ · log(p_ij)

and L_off is the corner position loss function, defined as:

L_off = (1/N) · Σ_k smooth_L1(p_k - p̂_k)

where N is the number of text regions in the feature image, p_ij is the confidence that the pixel at coordinates (i, j) in the feature image belongs to a corner of the corresponding type, H is the number of columns of the feature image, W is the number of rows, α is the weight parameter of the corner position loss function, γ is a weighting factor, p_k is the k-th corner point, p̂_k is its pre-labeled position, and smooth_L1 is the smooth L1 loss function.
Further, classifying and combining the corner points and determining the text region prediction box according to the classification-and-combination result includes:
matching corner points of two types with opposite angular directions to obtain corner pairs;
determining a plurality of candidate text region prediction boxes according to the corner point pairs;
and if the intersection-over-union of a first candidate text region prediction box and a second candidate text region prediction box among the candidate boxes is greater than a preset threshold, merging the two candidate boxes into a single prediction box determined by the maximum and minimum coordinate values of the two, and taking the merged box as the text region prediction box.
Based on another aspect of the present application, the present application also provides an apparatus comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein the computer program instructions, when executed by the processor, cause the apparatus to perform the aforementioned text region detection method.
The present application further provides a computer readable medium having computer readable instructions stored thereon, the computer readable instructions being executable by a processor to implement the aforementioned text region detection method.
Compared with the prior art, the scheme provided by the application performs corner detection on the text regions of an image containing text that requires character recognition to obtain the corresponding corner points, and then determines a text region prediction box from those corner points. Text regions with multiple display orientations can thus be recognized, the precision of text region detection is improved, and the localized region fits the text characters more closely.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
fig. 1 is a flowchart illustrating a text region detection method according to some embodiments of the present disclosure.
Detailed Description
The present application is described in further detail below with reference to the attached figures.
In a typical configuration of the present application, the terminal and the network device each include one or more processors (CPUs), input/output interfaces, network interfaces, and memories.
The memory may include volatile memory in a computer-readable medium, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
Fig. 1 illustrates a text region detection method according to some embodiments of the present application, which may specifically include the following steps:
step S101, a text region detection model is constructed, wherein the text region detection model determines corners through corner detection and determines a text region prediction frame according to the corners;
step S102, performing text region detection on the text image to be detected according to the text region detection model, and determining the corresponding text region prediction boxes, wherein the text regions may have different display directions.
The scheme is particularly suitable for scenarios in which text characters are to be detected. Corner detection is performed on the text regions in the image by the constructed text region detection model to obtain the corresponding corner points, and the corresponding text region prediction boxes are then determined from those corner points; these boxes can serve as input regions for subsequent optical character recognition.
In step S101, a text region detection model for detecting text regions in an image is first constructed. The model first determines corner points through corner detection, and then determines the corresponding text region prediction box from those corner points; the prediction box identifies a text region in the image. The structure of the text region detection model is built from convolutional neural networks.
In some embodiments of the present application, constructing the text region detection model may specifically include the following steps:
1) extracting the characteristics of the sample text image to obtain a corresponding characteristic image;
2) carrying out corner detection on the characteristic image, and determining corners in the sample text image; here, the types of the corner points include an upper left corner point, an upper right corner point, a lower left corner point, and a lower right corner point of the text region prediction box;
3) carrying out classification combination on the angular points, and determining a text region prediction box according to a classification combination result;
4) determining the difference between the text region prediction frame and a pre-marked text region identification frame, and adjusting the parameters of a text region detection model according to the difference;
5) and when the preset model training stopping condition is met, determining the current parameters of the text region detection model as the final parameters of the text region detection model.
The sample text images are used to train the text region detection model. An image may contain several text regions separated by intervals; for example, an article contains several paragraphs, and each paragraph can serve as a text region. The text regions in a sample text image may also have different display directions: the text in one region may run horizontally while the text in another runs at 45 degrees to the horizontal, and so on. In addition, each sample text image has a corresponding pre-labeled text region identification box, which may be annotated automatically or manually and is used to correct the text region prediction box output by the model. Annotation marks the four vertices of the identification box, in either clockwise or counter-clockwise order, and labels the box type as text.
In some embodiments of the present application, feature extraction is performed on the sample text image to obtain a corresponding feature image as follows: the sample text image is input into a stack of hourglass networks, and the feature image output by the stack is obtained. Here, each hourglass network may include a convolution layer, a pooling layer, a down-sampling layer, and an up-sampling layer. The stacked hourglass network (Stacked HG) was first proposed in 2016 and was originally used to locate human-body keypoints for pose estimation. Preferably, two stacked hourglass networks are used for feature extraction of the sample text image. Within a single hourglass network, the image is convolved and pooled by the convolution and pooling layers; the resulting feature image is down-sampled several times by the down-sampling layer, then up-sampled several times by the up-sampling layer to produce the feature image output by that hourglass network, which is fed into the next hourglass network for further processing.
Downsampling, also known as image reduction, is generally used to generate a thumbnail of the corresponding image. For a feature image of size M × N, downsampling by a factor of S yields an image of size (M/S) × (N/S), where S should be a common divisor of M and N. Downsampling the feature image exposes its high-level semantic features at low resolution, and because the downsampled feature image is small in height and width, fewer pixels need to be computed, which reduces computational complexity. Here, high-level semantic features refer to texture structure, semantic information, and the like, whereas the initial convolution layers of the neural network learn the low-level shape and color features of the image.
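The size relation above can be illustrated with a minimal numpy sketch (the function name and block-averaging choice are ours for illustration; the patent does not specify the reduction method):

```python
import numpy as np

def downsample(img: np.ndarray, s: int) -> np.ndarray:
    """Reduce an M x N image by a factor s (s must be a common
    divisor of M and N), averaging each s x s block into one pixel."""
    m, n = img.shape
    assert m % s == 0 and n % s == 0, "s must be a common divisor of M and N"
    # Group pixels into s x s blocks, then average within each block.
    return img.reshape(m // s, s, n // s, s).mean(axis=(1, 3))

img = np.arange(16.0).reshape(4, 4)   # a 4 x 4 "feature image"
small = downsample(img, 2)
print(small.shape)  # (2, 2)
```

The output is (M/S) × (N/S) = 2 × 2, matching the formula in the text.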
Upsampling, also called image enlargement, is mainly intended to enlarge an image and usually uses interpolation: on the basis of the original image pixels, new pixels are inserted between the existing pixel points by a suitable interpolation algorithm. Upsampling the feature image through the up-sampling layer raises its resolution and improves the hourglass network's ability to predict text region positions.
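The simplest such interpolation is nearest-neighbour, sketched below in numpy (an illustrative choice; the patent does not name a specific interpolation algorithm):

```python
import numpy as np

def upsample_nearest(img: np.ndarray, s: int) -> np.ndarray:
    """Enlarge an image by an integer factor s with nearest-neighbour
    interpolation: each pixel is repeated s times along both axes."""
    return np.repeat(np.repeat(img, s, axis=0), s, axis=1)

big = upsample_nearest(np.array([[1, 2], [3, 4]]), 2)
print(big.shape)  # (4, 4)
```

Bilinear or bicubic interpolation would insert smoothed rather than copied values, at the cost of more computation.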
In some embodiments of the present application, performing corner detection on the feature image, and determining corners in the sample text image may specifically include the following steps:
a) performing convolution operation on the characteristic image to obtain a feature image after convolution;
a convolution operation is performed on the feature image by a convolution kernel, producing and outputting a convolved feature image from the input feature image. Preferably, the kernel size is 3 × 5: since a text region is usually a rectangle with a long side, a 3 × 5 kernel better matches the shape of the text region and can improve the accuracy of model prediction.
b) Performing pooling operation on the feature images after convolution to obtain pooled feature images corresponding to different types of corner points;
the method specifically comprises the following steps: firstly, acquiring a first pixel point in a feature image after convolution; traversing other pixel points which belong to the same channel and the same row as the first pixel point and are positioned in the preset horizontal direction, and determining the maximum value of the pixel point as the maximum value in the horizontal direction, wherein the preset horizontal direction is determined according to the type of the angular point; traversing other pixel points which belong to the same channel and the same column as the first pixel point and are positioned in the preset vertical direction, and determining the maximum value of the pixel point as the maximum value in the vertical direction, wherein the preset vertical direction is determined according to the type of the angular point; and taking the sum of the maximum value in the horizontal direction and the maximum value in the vertical direction as the value of a second pixel point on the pooled characteristic image corresponding to the type corner, wherein the position of the second pixel point on the pooled characteristic image corresponding to the type corner corresponds to the position of the first pixel point on the convolved characteristic image. The first pixel point is any one pixel point in the convolved characteristic image, the second pixel point is a pixel point corresponding to the position of the first pixel point on the pooled characteristic image, and the value of the second pixel point is the sum of the maximum value of other pixel points of the first pixel point in the preset horizontal direction and the maximum value of other pixel points in the preset vertical direction. Preferably, if the sum of the maximum values is greater than 255, the sum of the maximum values is set to 255.
Here, the data used in the convolution operation is in three-dimensional form, and can be regarded as a plurality of two-dimensional pictures, each of which is called a feature image and is also called a channel. The input image of the convolution operation is only one feature image if it is a grayscale image, and generally three feature images (red, green, and blue) if it is a color image.
There are four types of corner points: upper-left, upper-right, lower-left, and lower-right, and the specific pooling operation differs somewhat for each type. Performing the pooling operation of each corner type on the feature image yields a pooled feature image per corner type: one for upper-left corners, one for upper-right corners, one for lower-left corners, and so on. The preset directions used during pooling differ by corner type: for upper-left corners, the preset horizontal direction is rightward and the preset vertical direction is downward; for upper-right corners, leftward and downward; for lower-left corners, rightward and upward; for lower-right corners, leftward and upward.
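The per-type pooling just described can be sketched as a naive numpy loop over one channel (illustrative only; we include the pixel itself in each directional scan, as in standard corner pooling, and apply the 255 saturation the text mentions):

```python
import numpy as np

def corner_pool(feat: np.ndarray, corner_type: str) -> np.ndarray:
    """Corner pooling over a single channel: for each pixel, sum the
    maximum along the preset horizontal direction and the maximum
    along the preset vertical direction for the given corner type."""
    h, w = feat.shape
    out = np.zeros_like(feat)
    for i in range(h):
        for j in range(w):
            if corner_type == "top_left":        # scan right + down
                hmax, vmax = feat[i, j:].max(), feat[i:, j].max()
            elif corner_type == "top_right":     # scan left + down
                hmax, vmax = feat[i, :j + 1].max(), feat[i:, j].max()
            elif corner_type == "bottom_left":   # scan right + up
                hmax, vmax = feat[i, j:].max(), feat[:i + 1, j].max()
            else:                                # bottom_right: left + up
                hmax, vmax = feat[i, :j + 1].max(), feat[:i + 1, j].max()
            out[i, j] = min(hmax + vmax, 255)    # saturate at 255, per the text
    return out
```

In practice this is vectorized with running maxima, but the loop form mirrors the traversal described above.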
c) Generating thermodynamic diagrams corresponding to the different types of corner points according to the pooled feature images corresponding to the different types of corner points;
here, the value of a pixel point in the thermodynamic diagram (heat map) represents the score that the pixel belongs to a corner of a certain type; generally, the higher the value, the higher the confidence. For example, the thermodynamic diagram for upper-left corners may be generated from the pooled feature image for upper-left corners, and the value of each pixel in that diagram is the score that the pixel belongs to an upper-left corner. Specifically, the pooled feature image may be passed through a 3 × 3 convolution that preserves the spatial dimensions, combined with the feature image output by the hourglass network, and then passed through a ReLU activation and a further 3 × 3 convolution to obtain the corresponding thermodynamic diagram.
d) Determining the position of the corner point of the corresponding type in the thermodynamic diagram corresponding to the corner points of the different types, wherein the activation response of the corner point of the corresponding type exceeds a preset threshold value, as the position of the corner point of the corresponding type;
the pixel points in the thermodynamic diagram are activation responses of the pixel points to the corner points of the corresponding types, the higher the value of the pixel points is, the higher the activation response of the pixel points to the corner point types corresponding to the thermodynamic diagram is, the more likely the pixel points are to be a part of the corner points, if the value of the pixel points exceeds a preset threshold, the pixel points are used as a part of the corner points, if a plurality of pixel points exceed the preset threshold, the plurality of pixel points form the corner points, and the central positions of the plurality of pixel points can be determined as the positions of the corner points.
e) Determining the direction of the corner points corresponding to the corner points of the corresponding types according to the pooled feature images corresponding to the corner points of the different types;
specifically, the pooled feature images for the different corner types are input into a normalized exponential function for classification prediction: the angle partition to which a corner of the corresponding type belongs is determined, and that partition is taken as the corner direction of the corner. The angle partitions are obtained by evenly dividing the full 360-degree angle; for example, the 360 degrees may be divided into K partitions of 360/K degrees each, preferably K = 8 partitions of 45 degrees each. The pooled feature images are then passed through a normalized exponential function, such as Softmax, to predict which partition the angle of the corner belongs to.
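The partition scheme itself is a simple mapping from an angle to one of K bins (the function below is an illustration of the partitioning, not of the Softmax classifier):

```python
def angle_partition(angle_deg: float, k: int = 8) -> int:
    """Map an angle in degrees to one of k equal partitions of the
    full 360-degree circle (k = 8 gives 45-degree bins, as preferred
    in the text). Returns the partition index 0..k-1."""
    width = 360.0 / k
    return int(angle_deg % 360 // width)

print(angle_partition(100))  # 2  (100 degrees falls in the 90-135 bin)
```

The Softmax head then predicts this index as a K-way classification target.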
f) Calculating the information difference between the corresponding type corner point and the pre-marked corresponding type corner point according to a preset loss function, optimizing the information difference according to a preset optimization method, and determining the optimized corresponding type corner point;
here, the predetermined loss function L may be defined as follows:

L = L_conf + α·L_off

where L_conf is the corner confidence loss function, a focal-style term defined as:

L_conf = -(1/N) · Σ_{i=1..H} Σ_{j=1..W} (1 - p_ij)^γ · log(p_ij)

and L_off is the corner position loss function, defined as:

L_off = (1/N) · Σ_k smooth_L1(p_k - p̂_k)

where N is the number of text regions in the feature image, p_ij is the confidence that the pixel at coordinates (i, j) in the feature image belongs to a corner of the corresponding type, H is the number of columns of the feature image, W is the number of rows, α is the weight parameter of the corner position loss function, γ is a weighting factor, p_k is the k-th corner point, p̂_k is its pre-labeled position, and smooth_L1 is the smooth L1 loss function. Preferably, α is set to 0.8 by default, and γ is typically 2.
A loss function L of this form increases the penalty on positions with large offsets and on sample text images that are hard to classify, which helps locate the corner positions belonging to text regions.
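The smooth L1 term in the composite loss L = L_conf + α·L_off has the standard definition (quadratic near zero, linear beyond |x| = 1); a minimal numpy sketch, assuming that standard form since the patent does not restate it:

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1 loss: 0.5 * x^2 for |x| < 1, |x| - 0.5 otherwise.
    Quadratic near zero for stable gradients on small offsets,
    linear for large offsets so outliers are not over-penalised."""
    x = np.abs(np.asarray(x, dtype=float))
    return np.where(x < 1, 0.5 * x ** 2, x - 0.5)

print(float(smooth_l1(0.5)))   # 0.125
print(float(smooth_l1(-2.0)))  # 1.5
```

Applied to the offset p_k - p̂_k of each predicted corner, this gives the per-corner contribution to L_off.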
In some embodiments of the present application, the classifying and combining the corner points, and determining the text region prediction box according to the classifying and combining result may specifically include the following steps: firstly, matching two types of corner points opposite in angle direction to obtain a corner point pair; determining a plurality of candidate text region prediction boxes according to the corner pairs; and if the intersection ratio of the first candidate text region prediction frame and the second candidate text region prediction frame in the candidate text region prediction frames is larger than a preset threshold value, determining the prediction frame after the first candidate text region prediction frame and the second candidate text region prediction frame are combined according to the maximum value and the minimum value of the coordinates of the first candidate text region prediction frame and the second candidate text region prediction frame, and determining the prediction frame as the text region prediction frame.
Among the four corner types, the upper-left corner type and the lower-right corner type are opposite in angular direction, as are the upper-right corner type and the lower-left corner type, and a corresponding region prediction box can be determined from two corner points with opposite angular directions. Therefore, corner points in the upper-left corner set can be matched pairwise with corner points in the lower-right corner set to obtain a plurality of corner point pairs; similarly, corner points in the upper-right corner set can be matched pairwise with corner points in the lower-left corner set to obtain further corner point pairs.
A plurality of candidate text region prediction boxes are determined from the obtained corner point pairs, and whether to combine them is then decided according to the intersection-over-union (IoU) between candidate boxes, where the IoU is the ratio of the intersection to the union of two candidate text region prediction boxes; if the IoU is greater than a preset threshold, the two candidate boxes are combined. Preferably, the preset threshold is set to 0.7. When two candidate boxes are combined, the combined text region prediction box is determined by their outermost extent, i.e., the rectangle spanned by the maximum and minimum X coordinates and the maximum and minimum Y coordinates of the corner points of the two boxes. All candidate boxes that can be combined are merged to obtain the final text region prediction boxes; if there are multiple text regions, there are correspondingly multiple text region prediction boxes.
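The pairing-and-merging procedure described above can be sketched as follows (an illustrative sketch with hypothetical helper names, representing axis-aligned boxes as (x1, y1, x2, y2) tuples): an opposite-direction corner pair defines a candidate box, and candidates whose IoU exceeds the 0.7 threshold are merged into the rectangle spanned by their coordinate minima and maxima:

```python
def box_from_pair(top_left, bottom_right):
    """A candidate box from an (upper-left, lower-right) corner pair;
    None when the pair cannot form a valid rectangle."""
    (x1, y1), (x2, y2) = top_left, bottom_right
    return (x1, y1, x2, y2) if x1 < x2 and y1 < y2 else None

def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def merge_boxes(boxes, thresh=0.7):
    """Repeatedly merge any two boxes whose IoU exceeds thresh into the
    rectangle spanned by their coordinate minima and maxima."""
    boxes = list(boxes)
    merged = True
    while merged:
        merged = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                if iou(boxes[i], boxes[j]) > thresh:
                    a, b = boxes[i], boxes[j]
                    boxes[j] = (min(a[0], b[0]), min(a[1], b[1]),
                                max(a[2], b[2]), max(a[3], b[3]))
                    del boxes[i]
                    merged = True
                    break
            if merged:
                break
    return boxes
```

Merging runs until no remaining pair of boxes exceeds the threshold, so boxes covering the same text region collapse into one outermost rectangle.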
After the text region prediction box is obtained, it is compared with a pre-labeled text region identification box to determine the difference between the two, where the text region identification box is a rectangle pre-labeled around the text region that serves as the reference target during model training. The difference between the prediction box and the identification box may be calculated by a preset loss function, such as cross entropy. This difference is then reduced by an optimization method for the loss function, such as gradient descent, so that the parameters of the text region detection model are continuously optimized (i.e., the model is trained) and the text region prediction box gradually approaches the text region identification box.
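The gradient-descent optimization mentioned above reduces to one generic update rule. The toy sketch below (not the patent's training code; the quadratic "loss" and all names are hypothetical) shows a parameter being driven toward its reference target by repeated updates:

```python
def gradient_step(params, grad_fn, lr=0.1):
    """One gradient-descent update: params <- params - lr * grad(params)."""
    return [p - lr * g for p, g in zip(params, grad_fn(params))]

# Toy example: minimise the quadratic loss (w - 3)^2, whose gradient is
# 2 * (w - 3); the parameter w converges to the target value 3.
w = [0.0]
for _ in range(200):
    w = gradient_step(w, lambda p: [2.0 * (p[0] - 3.0)])
```

Model training repeats this step over the full parameter set until a stopping condition (iteration count or loss threshold) is met.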
And when the training of the text region detection model meets the preset model training stopping condition, determining the current model parameters as the final parameters of the text region detection model, and completing the construction of the text region detection model. The preset model training stopping condition may be, for example, that the training times reach a preset number, the loss function value is smaller than a preset threshold, and the like.
In step S102, text region detection is performed on the text image to be detected according to the text region detection model, and the corresponding text region prediction box is determined. Here, the text image to be detected has no corresponding pre-labeled text region identification box; the image must be processed by the text region detection model to obtain its text region prediction box. As with the sample text images, the text regions in the text image to be detected may have different display directions, which the text region detection model can identify.
Some embodiments of the present application also provide an apparatus comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein the computer program instructions, when executed by the processor, cause the apparatus to perform the aforementioned text region detection method.
Some embodiments of the present application also provide a computer readable medium having computer readable instructions stored thereon, the computer readable instructions being executable by a processor to implement the aforementioned text region detection method.
In summary, the solution provided by the present application performs corner detection of text regions on an image containing text that requires character recognition, obtains the corresponding corner points, and further determines the text region prediction box from those corner points. It can therefore recognize text regions with a plurality of display directions, improving the accuracy of text region detection while locating text regions more precisely and fitting the text characters more closely.
It should be noted that the present application may be implemented in software and/or a combination of software and hardware, for example, implemented using Application Specific Integrated Circuits (ASICs), general purpose computers or any other similar hardware devices. In one embodiment, the software programs of the present application may be executed by a processor to implement the steps or functions described above. Likewise, the software programs (including associated data structures) of the present application may be stored in a computer readable recording medium, such as RAM memory, magnetic or optical drive or diskette and the like. Additionally, some of the steps or functions of the present application may be implemented in hardware, for example, as circuitry that cooperates with the processor to perform various steps or functions.
In addition, some of the present application may be implemented as a computer program product, such as computer program instructions, which when executed by a computer, may invoke or provide methods and/or techniques in accordance with the present application through the operation of the computer. Program instructions which invoke the methods of the present application may be stored on a fixed or removable recording medium and/or transmitted via a data stream on a broadcast or other signal-bearing medium and/or stored within a working memory of a computer device operating in accordance with the program instructions. An embodiment according to the present application comprises a device comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein the computer program instructions, when executed by the processor, trigger the device to perform a method and/or a solution according to the aforementioned embodiments of the present application.
It will be evident to those skilled in the art that the present application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the apparatus claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.

Claims (11)

1. A text region detection method, wherein the method comprises:
constructing a text region detection model, wherein the text region detection model determines corners through corner detection and determines a text region prediction frame according to the corners;
and performing text region detection on the text image to be detected according to the text region detection model, and determining a corresponding text region prediction frame, wherein the text regions have different display directions.
2. The method of claim 1, wherein building a text region detection model comprises:
extracting features of the sample text image to obtain a corresponding feature image;
performing corner detection on the feature image, and determining corners in the sample text image, wherein the types of the corners comprise an upper left corner, an upper right corner, a lower left corner and a lower right corner of the text region prediction box;
classifying and combining the corner points, and determining a text region prediction box according to the classification and combination result;
determining the difference between the text region prediction frame and a pre-labeled text region identification frame, and adjusting the parameters of a text region detection model according to the difference;
and when a preset model training stopping condition is met, determining the current parameters of the text region detection model as the final parameters of the text region detection model.
3. The method of claim 2, wherein performing feature extraction on the sample text image to obtain a corresponding feature image comprises:
inputting a sample text image into a plurality of stacked hourglass networks, and acquiring the feature image output by the hourglass networks, wherein each hourglass network comprises a convolution layer, a pooling layer, a down-sampling layer and an up-sampling layer.
4. The method of claim 2, wherein performing corner detection on the feature image and determining corners in the sample text image comprises:
performing a convolution operation on the feature image to obtain a convolved feature image;
performing a pooling operation on the convolved feature image to obtain pooled feature images corresponding to the different types of corner points;
generating heat maps corresponding to the different types of corner points according to the pooled feature images corresponding to the different types of corner points;
determining, in the heat map corresponding to each type of corner point, the positions where the activation response exceeds a preset threshold value as the positions of the corner points of the corresponding type;
determining the corner directions corresponding to the corner points of the corresponding type according to the pooled feature images corresponding to the different types of corner points;
and calculating the information difference between the corner points of the corresponding type and the pre-labeled corner points of the corresponding type according to a preset loss function, optimizing the information difference according to a preset optimization method, and determining the optimized corner points of the corresponding type.
5. The method of claim 4, wherein the size of the convolution kernel used in the convolution operation on the feature image is 3 x 5.
6. The method of claim 4, wherein performing a pooling operation on the convolved feature image to obtain pooled feature images corresponding to the different types of corner points comprises:
acquiring a first pixel point in the convolved feature image;
traversing the other pixel points that belong to the same channel and the same row as the first pixel point and lie in a preset horizontal direction, and determining the maximum pixel value among them as the horizontal-direction maximum, wherein the preset horizontal direction is determined according to the type of the corner point;
traversing the other pixel points that belong to the same channel and the same column as the first pixel point and lie in a preset vertical direction, and determining the maximum pixel value among them as the vertical-direction maximum, wherein the preset vertical direction is determined according to the type of the corner point;
and taking the sum of the horizontal-direction maximum and the vertical-direction maximum as the value of a second pixel point on the pooled feature image corresponding to that corner type, wherein the position of the second pixel point on the pooled feature image corresponds to the position of the first pixel point on the convolved feature image.
7. The method of claim 4, wherein determining the corner directions corresponding to the corner points of the corresponding type according to the pooled feature images corresponding to the different types of corner points comprises:
inputting the pooled feature images corresponding to the different types of corner points into a normalized exponential (softmax) function for classification prediction, and determining the angle partition to which the corner point of the corresponding type belongs as the corner direction corresponding to that corner point, wherein the angle partitions are a plurality of partitions obtained by evenly dividing the full angle (360°).
8. The method according to claim 4, wherein the preset loss function L used in the corner detection process of the feature image is defined as follows:
L = Lconf + α·Loff

wherein Lconf is the corner direction loss function, defined as follows:

Lconf = -(1/N) Σ_{i=1}^{H} Σ_{j=1}^{W} (1 - p_ij)^γ · log(p_ij)

Loff is the corner position loss function, defined as follows:

Loff = (1/N) Σ_{k=1}^{N} smoothL1(p_k - p̂_k)

where N is the number of text regions in the feature image, p_ij is the confidence that the pixel point with coordinates (i, j) in the feature image belongs to a corner point of the corresponding type, H is the number of columns of the feature image, W is the number of rows of the feature image, α is the weight parameter of the corner position loss function, γ is a weight factor, p_k is the k-th corner point, p̂_k is its pre-labeled position, and smoothL1 is the smooth L1 loss function.
9. The method of claim 2, wherein the classifying and combining the corners and determining the text region prediction box according to the classifying and combining result comprises:
matching two types of corner points opposite in angle direction to obtain a corner point pair;
determining a plurality of candidate text region prediction boxes according to the corner point pairs;
and if the intersection ratio of the first candidate text region prediction frame and the second candidate text region prediction frame in the candidate text region prediction frames is greater than a preset threshold value, determining the prediction frame after the first candidate text region prediction frame and the second candidate text region prediction frame are combined according to the maximum value and the minimum value of the coordinates of the first candidate text region prediction frame and the second candidate text region prediction frame, and determining the prediction frame as the text region prediction frame.
10. An apparatus comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein the computer program instructions, when executed by the processor, cause the apparatus to perform the method of any of claims 1 to 9.
11. A computer readable medium having computer readable instructions stored thereon which are executable by a processor to implement the method of any one of claims 1 to 9.
CN201911215779.8A 2019-12-02 2019-12-02 Text region detection method and equipment Pending CN111091123A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911215779.8A CN111091123A (en) 2019-12-02 2019-12-02 Text region detection method and equipment


Publications (1)

Publication Number Publication Date
CN111091123A true CN111091123A (en) 2020-05-01

Family

ID=70393878

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911215779.8A Pending CN111091123A (en) 2019-12-02 2019-12-02 Text region detection method and equipment

Country Status (1)

Country Link
CN (1) CN111091123A (en)


Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103268481A (en) * 2013-05-29 2013-08-28 焦点科技股份有限公司 Method for extracting text in complex background image
CN108960115A (en) * 2018-06-27 2018-12-07 电子科技大学 Multi-direction Method for text detection based on angle point
CN109271967A (en) * 2018-10-16 2019-01-25 腾讯科技(深圳)有限公司 The recognition methods of text and device, electronic equipment, storage medium in image
CN110084222A (en) * 2019-05-08 2019-08-02 大连海事大学 A kind of vehicle checking method based on multiple target angle point pond neural network
CN110163889A (en) * 2018-10-15 2019-08-23 腾讯科技(深圳)有限公司 Method for tracking target, target tracker, target following equipment
US20190311227A1 (en) * 2018-04-06 2019-10-10 Dropbox, Inc. Generating searchable text for documents portrayed in a repository of digital images utilizing orientation and text prediction neural networks
CN110378338A (en) * 2019-07-11 2019-10-25 腾讯科技(深圳)有限公司 A kind of text recognition method, device, electronic equipment and storage medium
CN110414499A (en) * 2019-07-26 2019-11-05 第四范式(北京)技术有限公司 Text position localization method and system and model training method and system


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
TYCHSEN-SMITH, L. ET AL.: "DeNet: Scalable real-time object detection with directed sparse sampling", pages 1 - 9 *
阿卜杜外力・如则; 帕力旦・吐尔逊; 阿布都萨拉木・达吾提; 艾斯卡尔・艾木都拉: "Multi-directional Uyghur text region detection based on deep learning" (基于深度学习的多方向维吾尔文区域检测), vol. 43, no. 1, pages 8 *

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111931764A (en) * 2020-06-30 2020-11-13 华为技术有限公司 Target detection method, target detection framework and related equipment
CN111931764B (en) * 2020-06-30 2024-04-16 华为云计算技术有限公司 Target detection method, target detection frame and related equipment
CN113762259A (en) * 2020-09-02 2021-12-07 北京沃东天骏信息技术有限公司 Text positioning method, text positioning device, computer system and readable storage medium
CN112132163B (en) * 2020-09-21 2024-04-02 杭州睿琪软件有限公司 Method, system and computer readable storage medium for identifying object edges
CN112132163A (en) * 2020-09-21 2020-12-25 杭州睿琪软件有限公司 Method, system and computer readable storage medium for identifying edges of objects
CN112464798A (en) * 2020-11-24 2021-03-09 创新奇智(合肥)科技有限公司 Text recognition method and device, electronic equipment and storage medium
CN112200191A (en) * 2020-12-01 2021-01-08 北京京东尚科信息技术有限公司 Image processing method, image processing device, computing equipment and medium
CN112200191B (en) * 2020-12-01 2021-07-20 北京京东尚科信息技术有限公司 Image processing method, image processing device, computing equipment and medium
CN112287924A (en) * 2020-12-24 2021-01-29 北京易真学思教育科技有限公司 Text region detection method, text region detection device, electronic equipment and computer storage medium
CN112287924B (en) * 2020-12-24 2021-03-16 北京易真学思教育科技有限公司 Text region detection method, text region detection device, electronic equipment and computer storage medium
CN112949574A (en) * 2021-03-29 2021-06-11 中国科学院合肥物质科学研究院 Deep learning-based cascading text key field detection method
CN112990201A (en) * 2021-05-06 2021-06-18 北京世纪好未来教育科技有限公司 Text box detection method and device, electronic equipment and computer storage medium
CN113077390A (en) * 2021-06-04 2021-07-06 中建电子商务有限责任公司 Image rectification algorithm based on deep learning
CN113096170B (en) * 2021-06-09 2022-01-25 北京世纪好未来教育科技有限公司 Text image registration method, device, equipment and storage medium
CN113096170A (en) * 2021-06-09 2021-07-09 北京世纪好未来教育科技有限公司 Text image registration method, device, equipment, storage medium and program product
CN113807351B (en) * 2021-09-18 2024-01-16 京东鲲鹏(江苏)科技有限公司 Scene text detection method and device
CN113807351A (en) * 2021-09-18 2021-12-17 京东鲲鹏(江苏)科技有限公司 Scene character detection method and device
CN115205861A (en) * 2022-08-17 2022-10-18 北京睿企信息科技有限公司 Method for acquiring abnormal character recognition area, electronic equipment and storage medium
CN115205861B (en) * 2022-08-17 2023-03-31 北京睿企信息科技有限公司 Method for acquiring abnormal character recognition area, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN111091123A (en) Text region detection method and equipment
US10803554B2 (en) Image processing method and device
EP3979200A1 (en) Video target tracking method and apparatus, computer device and storage medium
CN112348815B (en) Image processing method, image processing apparatus, and non-transitory storage medium
EP3620981B1 (en) Object detection method, device, apparatus and computer-readable storage medium
US9697416B2 (en) Object detection using cascaded convolutional neural networks
CN112016614B (en) Construction method of optical image target detection model, target detection method and device
CN112800964B (en) Remote sensing image target detection method and system based on multi-module fusion
CN109977952B (en) Candidate target detection method based on local maximum
US8442327B2 (en) Application of classifiers to sub-sampled integral images for detecting faces in images
CN114627052A (en) Infrared image air leakage and liquid leakage detection method and system based on deep learning
CN110879972B (en) Face detection method and device
CN111814905A (en) Target detection method, target detection device, computer equipment and storage medium
CN112800955A (en) Remote sensing image rotating target detection method and system based on weighted bidirectional feature pyramid
CN110796669A (en) Vertical frame positioning method and equipment
CN114359665A (en) Training method and device of full-task face recognition model and face recognition method
CN108960247B (en) Image significance detection method and device and electronic equipment
CN112215266B (en) X-ray image contraband detection method based on small sample learning
CN112241736B (en) Text detection method and device
CN112101139B (en) Human shape detection method, device, equipment and storage medium
CN114332112A (en) Cell image segmentation method and device, electronic equipment and storage medium
CN111325194B (en) Character recognition method, device and equipment and storage medium
CN116363656A (en) Image recognition method and device containing multiple lines of text and computer equipment
CN113947524A (en) Panoramic picture saliency prediction method and device based on full-convolution graph neural network
CN114119594A (en) Oil leakage detection method and device based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination