CN111582021A - Method and device for detecting text in scene image and computer equipment - Google Patents

Method and device for detecting text in scene image and computer equipment Download PDF

Info

Publication number
CN111582021A
CN111582021A (application CN202010223195.1A; granted publication CN111582021B)
Authority
CN
China
Prior art keywords
text
pixel points
confidence
prediction box
text prediction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010223195.1A
Other languages
Chinese (zh)
Other versions
CN111582021B (en)
Inventor
高远 (Gao Yuan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202010223195.1A priority Critical patent/CN111582021B/en
Publication of CN111582021A publication Critical patent/CN111582021A/en
Priority to PCT/CN2020/131604 priority patent/WO2021189889A1/en
Application granted granted Critical
Publication of CN111582021B publication Critical patent/CN111582021B/en
Legal status: Active (granted)

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/414 Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G06V10/225 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition based on a marking or identifier characterising the area
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Geometry (AREA)
  • Computer Graphics (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of image processing, and in particular to a method, an apparatus and computer equipment for detecting text in scene images. The method comprises the following steps: detecting and determining a plurality of text prediction boxes in the scene image through a trained full convolution network model; screening high-confidence pixel points within the text prediction box; calculating the minimum bounding rectangle corresponding to the text prediction box from the high-confidence pixel points; calculating the degree of overlap between the text prediction box and the minimum bounding rectangle and, when the overlap exceeds a preset overlap threshold, adjusting the width of the text prediction box using the minimum bounding rectangle; and cutting the scene image to obtain a text image to be recognized and recognizing the text information in it. On the basis of text detection with the EAST method, the method provided by the embodiment of the invention corrects the width of the text prediction box using the high-confidence region, so that the width of the text prediction box is reliably narrowed and more accurate text recognition is achieved.

Description

Method and device for detecting text in scene image and computer equipment
Technical Field
The invention relates to the technical field of image processing, and in particular to a method, a device and computer equipment for detecting text in scene images.
Background
Computer-vision-based text recognition is of great practical significance in today's big data age. It underpins many intelligent functions, such as recommendation systems and machine translation. Text detection is the precondition of the character recognition process, and its accuracy has a marked influence on the recognition result.
In complex natural scenes, text appears in many different positions, in diverse arrangements, with inconsistent orientations and mixed languages, which makes text detection an extremely challenging task.
In the conventional technology, a text detection algorithm called CTPN (Connectionist Text Proposal Network) detects text in natural scenes based on the idea of first splitting a complete text into segments, detecting them, and then merging them back together. Because this split-and-recombine approach either loses detection precision or consumes excessive detection time, leading to poor user experience, a text detection method called EAST (Efficient and Accurate Scene Text detector) was proposed. EAST performs feature extraction and learning with an FCN (fully convolutional network) framework, is trained and optimized directly end-to-end, and eliminates unnecessary intermediate steps.
However, EAST still has many limitations in practical application and cannot fully meet real-world requirements. For example, the width of the resulting text prediction box often does not match the actual text in the scene, so the conventional technology needs further improvement on the basis of the practical application of EAST.
Disclosure of Invention
The invention aims to solve the technical problem that the detection precision of the conventional EAST algorithm cannot meet practical requirements.
In order to solve the above technical problem, in a first aspect, an embodiment of the present invention provides a method for detecting a text in a scene image, including: training and optimizing the full convolution network model;
detecting and determining a plurality of text prediction boxes in the scene image through the trained full convolution network model; screening pixel points with confidence degrees larger than a preset confidence degree threshold value in the text prediction box as high-confidence-degree pixel points, wherein the confidence degrees are the probability that the pixel points belong to the text prediction box and are output by the full convolution network model; according to the high-confidence-degree pixel points, calculating a minimum circumscribed rectangle corresponding to the text prediction box, wherein the minimum circumscribed rectangle is a rectangle which contains all the high-confidence-degree pixel points in the text prediction box and has the smallest area; calculating the overlapping degree between the text prediction box and the corresponding minimum bounding rectangle; when the overlapping degree is larger than a preset overlapping degree threshold value, the width of the text prediction box is adjusted through the minimum circumscribed rectangle; cutting the adjusted text prediction box in the scene image to obtain a text image to be recognized; and identifying characters in the text image to be identified.
Optionally, before calculating the overlap degree between the text prediction box and the corresponding minimum bounding rectangle, the method further includes:
calculating the confidence coefficient average value of the high-confidence-coefficient pixel points in the minimum circumscribed rectangle;
and when the confidence coefficient average value is smaller than a preset screening threshold value, rejecting the minimum circumscribed rectangle.
Optionally, the training and optimizing the full convolution network model includes: constructing a full convolution network model; labeling a training label, and constructing a training data set; and training and optimizing the full convolution network model through the training data set and a preset loss function.
Optionally, the calculating the degree of overlap between the text prediction box and the corresponding minimum bounding rectangle includes:
determining pixel points in the text prediction box and the minimum circumscribed rectangle at the same time as first pixel points; determining pixel points which only belong to the text prediction box or the minimum circumscribed rectangle as second pixel points; calculating the sum of the number of the first pixel points and the second pixel points; and calculating the ratio of the number of the first pixel points to the sum of the number of the first pixel points and the number of the second pixel points as the overlapping degree.
Optionally, when the overlap degree is greater than a preset overlap degree threshold, the text prediction box is adjusted by the following formula:
P1 = w*p + (1-w)*d,
wherein P1 is the width of the adjusted text prediction box, w is a weight coefficient, P is the width of the text prediction box, and d is the width of the corresponding minimum bounding rectangle.
Optionally, the calculating, according to the high-confidence pixel point, a minimum bounding rectangle corresponding to the text prediction box includes:
determining two high-confidence pixel points with the farthest distance from the high-confidence pixel points as length calibration pixel points;
determining two high-confidence-degree pixel points which are farthest away in a second direction perpendicular to the first direction as width calibration pixel points by taking a connecting line between the length calibration pixel points as the first direction;
and taking a first line segment which passes through the length calibration pixel points and is perpendicular to the connecting line between the length calibration pixel points as a length, and a second line segment which passes through the width calibration pixel points and is perpendicular to the connecting line between the width calibration pixel points as a width, to enclose the minimum circumscribed rectangle.
In a second aspect, an embodiment of the present invention provides a text detection apparatus for a scene image, including:
the training unit is used for training and optimizing the full convolution network model; the text prediction box detection unit is used for detecting and determining a plurality of text prediction boxes in the scene image through the trained full convolution network model; the screening unit is used for screening pixel points with confidence degrees larger than a preset confidence degree threshold value in the text prediction box as high-confidence-degree pixel points, wherein the confidence degree is the probability, output by the full convolution network model, that a pixel point belongs to the text prediction box; the minimum circumscribed rectangle determining unit is used for calculating a minimum circumscribed rectangle corresponding to the text prediction box according to the high-confidence-degree pixel points, wherein the minimum circumscribed rectangle is a rectangle which contains all the high-confidence-degree pixel points in the text prediction box and has the smallest area; the overlapping degree calculating unit is used for calculating the overlapping degree between the text prediction box and the corresponding minimum circumscribed rectangle; the adjusting unit is used for adjusting the width of the text prediction box through the minimum circumscribed rectangle when the overlapping degree is larger than a preset overlapping degree threshold value; the cutting unit is used for cutting the adjusted text prediction box in the scene image to obtain a text image to be recognized; and the text recognition unit is used for recognizing the text information in the text image to be recognized.
Optionally, the method further comprises: the confidence coefficient calculation unit is used for calculating the confidence coefficient average value of the high-confidence-coefficient pixel points in the minimum circumscribed rectangle; and the minimum circumscribed rectangle screening unit is used for rejecting the minimum circumscribed rectangle when the confidence coefficient average value is smaller than a preset screening threshold value.
In a third aspect, an embodiment of the present invention further provides a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the text detection method for a scene image when executing the computer program.
In a fourth aspect, the embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored, and the computer program, when executed by a processor, causes the processor to execute the method for detecting text of a scene image.
The text detection method provided by the embodiment of the invention can correct and adjust the width of the text prediction box through the high-confidence-degree area on the basis of realizing text detection by using an EAST method, so that the width of the text prediction box is reliably reduced, and more accurate text recognition is realized.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description are obviously only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic structural diagram of a computer device according to an embodiment of the present invention;
fig. 2 is a schematic flowchart of a method for detecting a text of a scene image according to an embodiment of the present invention;
FIG. 3 is a schematic flow chart of step 20 in FIG. 2;
fig. 4 is a schematic flow chart of screening a minimum bounding rectangle according to an embodiment of the present invention;
fig. 5 is a schematic diagram of a text detection apparatus for a scene image according to an embodiment of the present invention;
fig. 6 is a schematic diagram of a text detection apparatus for a scene image according to another embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
The embodiment of the invention firstly provides a text detection method of a scene image, and the text detection method of the scene image can adjust the width of a text detection box through a high-confidence-degree area on the basis of realizing text detection by using an EAST method, so as to realize more accurate text recognition.
Referring to fig. 1, fig. 1 is a schematic structural diagram of a computer device 100 according to an embodiment of the present invention. The computer device 100 may be a computer, a cluster of computers, a mainframe computer, a computing device dedicated to providing online content, or a computer network comprising a set of computers operating in a centralized or distributed manner.
As shown in fig. 1, the computer apparatus 100 includes: a processor 102, memory and network interface 105 connected by a system bus 101; the memory may include, among other things, a non-volatile storage medium 103 and an internal memory 104.
In the embodiment of the present invention, the processor 102 may be a Central Processing Unit (CPU); depending on the hardware used, the processor 102 may also be another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The number of processors 102 may be one or more, and the one or more processors 102 may execute sequences of computer program instructions to perform the various methods of text detection in scene images that will be described in more detail below.
The computer program instructions are stored in, accessed from and read out of the non-volatile storage medium 103 to be executed by the processor 102, thereby implementing the text detection method disclosed in the following embodiments of the present invention. For example, the non-volatile storage medium 103 stores a software application that executes the text detection method described below. Further, the non-volatile storage medium 103 may store the entire software application or only the portion of the software application that is executable by the processor 102. It should be noted that although only one block is shown in fig. 1, the non-volatile storage medium 103 may comprise a plurality of physical devices installed on a central processing device or on different computing devices.
The network interface 105 is used for network communication, such as transmitting data information. Those skilled in the art will appreciate that the configuration shown in fig. 1 is a block diagram of only part of the configuration relevant to the present invention and does not limit the computer device 100 to which the present invention is applied; a particular computer device 100 may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
The embodiment of the invention also provides a computer readable storage medium. The computer readable storage medium may be a non-volatile computer readable storage medium. The computer readable storage medium stores a computer program, wherein the computer program, when executed by a processor, implements the method for detecting text of a scene image disclosed by the embodiments of the present invention. The computer program product is embodied on one or more computer readable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer program code.
In the case of implementing the computer device 100 in software, fig. 2 shows a schematic flowchart of the text detection method for scene images according to an embodiment, and the method in fig. 2 is described in detail below. Referring to fig. 2, the method includes the following steps:
and 20, training and optimizing the full convolution network model.
The full convolution network model is one of neural network models. Before use, off-line training is required by using training data, and transfer weight parameters among neurons are determined.
In some embodiments, as shown in fig. 3, the step 20 specifically includes the following steps:
and 200, constructing a full convolution network model.
In this step, feature extraction is performed on the image data of the input scene picture through the full convolution network model, finally generating a single-channel pixel-level text score feature map and a multi-channel geometric figure feature map. Specifically, the network structure of the full convolution network model can be decomposed into three parts: a feature extraction layer, a feature merging layer and an output layer.
Firstly, the feature extraction layer adopts a general convolutional network as the base network. During training, the parameters of the convolutional network are initialized and feature extraction is then performed; after training, optimized convolutional network parameters are obtained. In practical application, a base network such as PVANet (Performance vs Accuracy network) or VGG16 (Visual Geometry Group 16) can be selected according to actual needs. In the embodiment of the invention, four levels of feature maps, with sizes of 1/32, 1/16, 1/8 and 1/4 of the input image data, are extracted in turn through the convolutional network. A large receptive field is needed to locate large text, while a small receptive field is needed to locate small text areas; using feature maps of different levels therefore meets the needs of natural scenes, where the sizes of text areas differ greatly.
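For illustration, the following sketch shows how the four feature-map levels might be tapped from a VGG16 backbone. It is a minimal sketch and not the patented implementation: PyTorch/torchvision, the module name FeatureExtractor, and the layer indices (following the standard torchvision VGG16 layout, where the pooling layers sit at indices 4, 9, 16, 23 and 30) are all assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class FeatureExtractor(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = vgg16(weights=None).features
        # After these pooling layers the map is 1/4, 1/8, 1/16 and 1/32 of the input.
        self.taps = {9: 'f4', 16: 'f3', 23: 'f2', 30: 'f1'}

    def forward(self, x):
        feats = {}
        for idx, layer in enumerate(self.backbone):
            x = layer(x)
            if idx in self.taps:
                feats[self.taps[idx]] = x
        return feats  # f1 = 1/32 level, ..., f4 = 1/4 level

maps = FeatureExtractor()(torch.randn(1, 3, 512, 512))
print({name: tuple(t.shape) for name, t in maps.items()})
```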
Secondly, the feature maps of the four levels are merged layer by layer following a U-shaped idea, which reduces the later computation overhead. The layer-by-layer merging can be expressed by the following formulas:
$g_i = \begin{cases} \mathrm{unpool}(h_i) & i \le 3 \\ \mathrm{conv}_{3\times 3}(h_i) & i = 4 \end{cases}$

$h_i = \begin{cases} f_i & i = 1 \\ \mathrm{conv}_{3\times 3}\big(\mathrm{conv}_{1\times 1}([g_{i-1};\, f_i])\big) & \text{otherwise} \end{cases}$

where $f_i$ is the $i$-th level feature map from the feature extraction layer, $h_i$ is the merged feature map of stage $i$, and $[\,\cdot\,;\,\cdot\,]$ denotes channel-wise concatenation.
the specific process of the above formula is as follows: in each merging stage, the feature map from the previous stage is first input into the upper pooling layer (unprool layer) to expand its size. Then, it is merged with the current layer feature map. Finally, the number of channels and the amount of computation is reduced by the convolutional layers (conv layers), specifically the conv1 x 1 layers, and the local information is fused by the conv3 x 3 layers to finally generate the output of the merging stage. After the last merge stage (i.e., i ═ 4), the conv3 × 3 layer generates the final signature graph of the merge branch and sends it to the output layer.
Finally, the output layer outputs the text score feature map and the geometric figure feature map at 1/4 the size of the original image; the text score feature map has 1 channel and the geometric figure feature map has 5 channels. The text score feature map represents, for each pixel point, the confidence that the pixel belongs to a text prediction box.
Step 202: labeling training labels and constructing a training data set.
In this step, the labeling of the training labels can be completed by any suitable method, and the labeling can be used as a training data set to train the full convolution network model. In some cases, the existing training data set may also be used directly for training or testing.
Step 204: training and optimizing the full convolution network model through the training data set and a preset loss function.
Training optimization is a learning optimization process for parameters of the full convolution network model. After parameter optimization is completed, the trained full convolution network model can be applied to text detection of an actual scene.
Besides the labeled training data, the optimization process needs to provide a suitable loss function for evaluating the effect of the full convolution network model, and parameter optimization is realized by minimizing loss.
In the present application, the loss function can be expressed by the following equation:
L = L_s + λ_g · L_g
where L is the total loss, L_s is the loss of the text score feature map, L_g is the loss of the geometric figure feature map, and λ_g weighs the relative importance of the two losses and can be set to 1.
Specifically, the loss of the text score feature map can be calculated with class-balanced cross entropy, and the loss of the geometric figure feature map can be calculated with an IoU (intersection-over-union) loss function.
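A minimal sketch of these two loss terms and their combination, assuming axis-aligned geometry encoded as four pixel-to-edge distances (the angle channel is ignored); the tensor layouts and function names are illustrative, not the patented code:

```python
import torch

def balanced_bce(score_pred, score_gt, eps=1e-6):
    # Class-balanced cross entropy for the text score map (L_s): the rarer
    # positive class is up-weighted by beta = 1 - mean(ground truth).
    beta = 1.0 - score_gt.mean()
    loss = -(beta * score_gt * torch.log(score_pred + eps)
             + (1.0 - beta) * (1.0 - score_gt) * torch.log(1.0 - score_pred + eps))
    return loss.mean()

def iou_geo_loss(d_pred, d_gt, eps=1e-6):
    # IoU-style loss (L_g) on tensors of shape (N, 4, H, W) holding the
    # distances to the top, right, bottom and left edges of the text box.
    area_pred = (d_pred[:, 0] + d_pred[:, 2]) * (d_pred[:, 1] + d_pred[:, 3])
    area_gt = (d_gt[:, 0] + d_gt[:, 2]) * (d_gt[:, 1] + d_gt[:, 3])
    h_inter = torch.min(d_pred[:, 0], d_gt[:, 0]) + torch.min(d_pred[:, 2], d_gt[:, 2])
    w_inter = torch.min(d_pred[:, 1], d_gt[:, 1]) + torch.min(d_pred[:, 3], d_gt[:, 3])
    inter = h_inter * w_inter
    union = area_pred + area_gt - inter
    return -torch.log((inter + eps) / (union + eps)).mean()

def total_loss(score_pred, score_gt, d_pred, d_gt, lambda_g=1.0):
    # L = L_s + lambda_g * L_g, with lambda_g = 1 as in the description above.
    return balanced_bce(score_pred, score_gt) + lambda_g * iou_geo_loss(d_pred, d_gt)
```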
Step 22: detecting and determining a plurality of text prediction boxes in the scene image through the trained full convolution network model.
The text prediction boxes in the scene image to be detected are determined through the trained full convolution network model; each text prediction box indicates a region of the scene image that contains text.
As described above, the output layer of the full convolution network model produces a text score feature map and a geometric figure feature map. The text score feature map records, for each pixel point mapped back to the image to be detected, the probability that it belongs to a text prediction box. The geometric figure feature map records, for each pixel point mapped back to the image to be detected, its distances to the borders of the text prediction box.
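For illustration, a simplified decoding sketch that turns the two output maps into candidate boxes; it assumes axis-aligned boxes (the angle channel of the 5-channel geometry map is ignored), a 1/4-scale output (scale=4), and a hypothetical score threshold:

```python
import numpy as np

def decode_boxes(score_map, geo_map, score_thresh=0.8, scale=4):
    # score_map: (H, W) confidences; geo_map: (H, W, 4) distances to the
    # top/right/bottom/left edges of the box containing each pixel.
    ys, xs = np.where(score_map > score_thresh)
    boxes = []
    for y, x in zip(ys, xs):
        d_top, d_right, d_bottom, d_left = geo_map[y, x]
        cx, cy = x * scale, y * scale        # map back to input-image coordinates
        boxes.append((cx - d_left, cy - d_top,
                      cx + d_right, cy + d_bottom, score_map[y, x]))
    return boxes  # (x1, y1, x2, y2, score) candidates, one per confident pixel
```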
The full convolution network model typically outputs a large number of candidate text prediction boxes. Thus, in a preferred embodiment, a non-maximum suppression algorithm can be applied to eliminate redundant candidates and determine the position of the best text prediction box, which serves as the text prediction box in the embodiments of the present invention.
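As a sketch of this step, plain greedy non-maximum suppression over the axis-aligned candidates from the previous sketch (EAST itself typically uses a locality-aware variant; the threshold value is an assumption):

```python
def box_iou(a, b, eps=1e-6):
    # Area-based IoU of two (x1, y1, x2, y2, score) boxes.
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + eps)

def nms(boxes, iou_thresh=0.3):
    # Greedy NMS: keep the highest-scoring box, drop candidates overlapping it.
    kept = []
    for b in sorted(boxes, key=lambda b: b[4], reverse=True):
        if all(box_iou(b, k) < iou_thresh for k in kept):
            kept.append(b)
    return kept
```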
In this embodiment, a scene picture is a picture taken in a real scene, for example a picture captured through the viewfinder of any suitable terminal equipped with a camera.
Step 24: screening pixel points whose confidence is greater than a preset confidence threshold in the text prediction box as high-confidence pixel points.
The confidence is the probability, output by the full convolution network model, that a pixel point belongs to the text prediction box. That is, the text score feature map represents the confidence of each pixel point and thereby reflects where text prediction boxes may exist. In this step, pixel points with higher confidence are screened out through a suitable screening method and can be used to further adjust and optimize the text prediction box.
Specifically, high-confidence pixel points can be screened in the text score feature map by setting a suitable confidence threshold. For example, with a confidence threshold of 0.7, each pixel point in the text score feature map is checked in turn against the threshold: if its confidence is greater, the pixel point is kept as a high-confidence pixel point; otherwise it is discarded.
In an image to be detected there may be several different text prediction boxes, so the high-confidence pixel points may belong to different text regions of the scene. To avoid errors during adjustment or correction, the high-confidence pixel points need to be marked and distinguished. Specifically, according to its position, each pixel point is attributed to the text prediction box it falls in, so that the high-confidence pixel points are grouped by their corresponding text prediction boxes.
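A minimal sketch of the screening and grouping just described, assuming axis-aligned boxes in score-map coordinates and an illustrative 0.7 threshold:

```python
import numpy as np

def assign_high_conf_pixels(score_map, boxes, conf_thresh=0.7):
    # boxes: list of (x1, y1, x2, y2) prediction boxes in score-map coordinates.
    # Returns, for each box index, the (x, y) high-confidence pixels inside it.
    ys, xs = np.nonzero(score_map > conf_thresh)
    groups = {i: [] for i in range(len(boxes))}
    for x, y in zip(xs, ys):
        for i, (x1, y1, x2, y2) in enumerate(boxes):
            if x1 <= x <= x2 and y1 <= y <= y2:
                groups[i].append((x, y))
                break  # attribute each pixel to a single box by position
    return groups
```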
Step 26: calculating the minimum bounding rectangle corresponding to the text prediction box according to the high-confidence pixel points.
The Minimum Bounding Rectangle (MBR), expressed in two-dimensional coordinates, bounds the maximum extent of the high-confidence pixel points of the same text prediction box. It is the rectangle of smallest area that contains all the high-confidence pixel points in the text prediction box, and it represents the rectangular region given by those pixel points.
Any suitable algorithm may be used to calculate and determine the minimum bounding rectangle of each text prediction box.
In some embodiments, the method specifically includes the following steps:
firstly, two high-confidence pixel points with the farthest distance are determined as length calibration pixel points in the high-confidence pixel points.
And then, taking a connecting line between the length calibration pixel points as a first direction, and determining two high-confidence-degree pixel points which are farthest away in a second direction perpendicular to the first direction as width calibration pixel points.
And finally, taking a first line segment which passes through the length calibration pixel points and is perpendicular to the connecting line between the length calibration pixel points as a length, and a second line segment which passes through the width calibration pixel points and is perpendicular to the connecting line between the width calibration pixel points as a width, the minimum bounding rectangle can be enclosed.
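The following sketch implements the farthest-pair construction just described with NumPy (an O(N²) pair search; the function name and details are illustrative). Note that the resulting rectangle is aligned with the farthest-pair axis, as the description specifies; a library routine such as OpenCV's cv2.minAreaRect is an alternative way to obtain an enclosing rectangle of a point set.

```python
import numpy as np

def enclosing_rect(points):
    # points: (N, 2) array of high-confidence pixel coordinates, N >= 2.
    pts = np.asarray(points, dtype=float)
    # 1. The farthest pair defines the length direction.
    dists = np.linalg.norm(pts[:, None] - pts[None, :], axis=2)
    i, j = np.unravel_index(np.argmax(dists), dists.shape)
    u = pts[j] - pts[i]
    u /= np.linalg.norm(u)
    v = np.array([-u[1], u[0]])           # perpendicular (width) direction
    # 2. Project every point onto both axes; the extremes bound the rectangle.
    su, sv = pts @ u, pts @ v
    corners = np.array([a * u + b * v
                        for a in (su.min(), su.max())
                        for b in (sv.min(), sv.max())])
    return corners                        # 4 corners enclosing all the points
```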
Step 28: calculating the degree of overlap between the text prediction box and the corresponding minimum bounding rectangle.
The degree of overlap, also called intersection-over-union (IoU), characterizes how well the text prediction box matches the corresponding minimum bounding rectangle. It is calculated as the ratio of the area of the intersection of the two boxes to the area of their union; a higher degree of overlap indicates a better match between the two boxes.
In some embodiments, the overlapping degree between the text prediction box and the corresponding minimum bounding rectangle may be specifically calculated by the following steps:
firstly, pixel points lying in both the text prediction box and the minimum bounding rectangle are determined as first pixel points, and pixel points belonging to only one of the two are determined as second pixel points;
then, the sum of the number of the first pixel points and the second pixel points is calculated.
And finally, calculating the ratio of the number of the first pixel points to the sum of the number of the first pixel points and the number of the second pixel points, and taking the ratio as the overlapping degree.
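A minimal sketch of this pixel-counting overlap, assuming boolean masks that mark the pixels inside each box:

```python
import numpy as np

def pixel_overlap(mask_a, mask_b, eps=1e-6):
    # mask_a, mask_b: boolean (H, W) arrays marking pixels inside each box.
    first = np.logical_and(mask_a, mask_b).sum()   # pixels inside both boxes
    second = np.logical_xor(mask_a, mask_b).sum()  # pixels inside exactly one
    return first / (first + second + eps)          # = intersection / union
```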
Step 30: when the degree of overlap is greater than a preset overlap threshold, adjusting the width of the text prediction box through the minimum bounding rectangle.
The overlap threshold is an empirical value and can be set by a technician as required by the actual situation. Typically, the width of the minimum bounding rectangle is less than the width of the text prediction box, which indicates that the region within the minimum bounding rectangle has a greater likelihood of belonging to a text region. Therefore, the text prediction box can be properly adjusted through the minimum bounding rectangle, and the width of the text prediction box is correspondingly reduced.
Specifically, when the overlapping degree is greater than a preset overlapping degree threshold, the text prediction box is adjusted by the following formula:
P1 = w*p + (1-w)*d,
wherein P1 is the width of the adjusted text prediction box, w is a weight coefficient, P is the width of the text prediction box, and d is the width of the corresponding minimum bounding rectangle.
Through the above formula, after a proper w value is given, the width of the text prediction box can be corrected and adjusted according to the smaller effective minimum bounding rectangle, so that the width of the text prediction box can be reliably reduced, and more accurate text recognition can be realized.
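As a one-line illustration of the adjustment formula (the default weight value is only a placeholder; in practice w is an empirical coefficient):

```python
def adjust_width(p, d, w=0.5):
    # P1 = w*p + (1-w)*d: blend the prediction-box width p toward the
    # minimum-bounding-rectangle width d.
    return w * p + (1.0 - w) * d
```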
Step 32: cutting the adjusted text prediction box out of the scene image to obtain the text image to be recognized.
The adjusted text prediction box indicates the position in the scene image that contains text. Therefore, the text prediction boxes can be cut out of the scene image to serve as text images to be recognized.
Step 34: identifying text information in the text image to be recognized.
Any suitable algorithm or approach can be selected to recognize and acquire the text information in the text image, yielding the final text detection result for the scene image. Such recognition methods are well known to those skilled in the art and are not described in detail here.
By applying the text detection method provided by the embodiment of the invention, the width of the text prediction box can be reliably reduced, more accurate text recognition is realized, the difficulty of subsequent processing is reduced, and the text detection accuracy is improved.
Since the minimum bounding rectangle is used as the criterion for finally adjusting the width of the text detection box, it must be sufficiently reliable; otherwise the subsequent adjustment may have adverse consequences.
In some embodiments, before performing step 28, the method may further include the step of screening the minimum bounding rectangle as shown in fig. 4:
step 401: and calculating the confidence coefficient average value of the high-confidence-coefficient pixel points in the minimum circumscribed rectangle.
The confidence average value is the mean confidence of the high-confidence pixel points inside the minimum bounding rectangle; it represents the overall probability that the minimum bounding rectangle belongs to a text region.
Step 402: judging whether the confidence average value is smaller than a preset screening threshold. If yes, go to step 403; if not, go to step 404.
Step 403: rejecting the minimum bounding rectangle.
It will be appreciated that a minimum bounding rectangle with a low confidence average does not have a sufficiently high probability of belonging to text and cannot serve as a criterion for correction. Such minimum bounding rectangles are therefore eliminated and are not used to correct the width of the text prediction box.
Step 404: reserving the minimum bounding rectangle as a valid minimum bounding rectangle. These valid minimum bounding rectangles can be used for further processing as references for adjusting the text detection box.
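A minimal sketch of this screening step, assuming each rectangle is paired with its group of high-confidence pixels and an illustrative screening threshold:

```python
import numpy as np

def filter_rects(rects, pixel_groups, score_map, screen_thresh=0.8):
    # Keep a rectangle only if the mean confidence of its high-confidence
    # pixels reaches the screening threshold (threshold value assumed).
    valid = []
    for rect, pixels in zip(rects, pixel_groups):
        mean_conf = np.mean([score_map[y, x] for (x, y) in pixels])
        if mean_conf >= screen_thresh:
            valid.append(rect)
    return valid
```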
Referring to fig. 5, fig. 5 provides a block diagram of a structure of a text detection apparatus for a scene image according to an embodiment of the present invention, and as shown in fig. 5, the text detection apparatus 500 includes: a training unit 50, a text prediction box detection unit 52, a filtering unit 54, a minimum bounding rectangle determination unit 56, an overlap calculation unit 58, an adjustment unit 60, a cutting unit 62, and a text recognition unit 64.
The training unit 50 is used for training and optimizing the full convolution network model.
The text prediction box detection unit 52 is configured to detect and determine a plurality of text prediction boxes in the scene image through the trained full convolution network model. The screening unit 54 is configured to screen pixel points whose confidence is greater than a preset confidence threshold in the text prediction box as high-confidence pixel points, where the confidence is the probability, output by the full convolution network model, that a pixel point belongs to the text prediction box. The minimum bounding rectangle determining unit 56 is configured to calculate the minimum bounding rectangle corresponding to the text prediction box according to the high-confidence pixel points, where the minimum bounding rectangle is the rectangle of smallest area that contains all the high-confidence pixel points in the text prediction box. The overlap calculating unit 58 is configured to calculate the degree of overlap between the text prediction box and the corresponding minimum bounding rectangle. The adjusting unit 60 is configured to adjust the width of the text prediction box through the minimum bounding rectangle when the degree of overlap is greater than a preset overlap threshold. The cutting unit 62 is configured to cut the adjusted text prediction box out of the scene image to obtain the text image to be recognized. The text recognition unit 64 is configured to recognize the text information in the text image to be recognized.
The text detection device for the scene image, provided by the embodiment of the invention, can correct and adjust the width of the text prediction box through the high-confidence-degree area on the basis of realizing text detection by using an EAST method, so that the width of the text prediction box is reliably reduced, and more accurate text recognition is realized.
In some embodiments, as shown in fig. 6, in addition to the functional modules shown in fig. 5, the text detection apparatus 500 may further include: a confidence calculation unit 66 and a minimum bounding rectangle screening unit 68.
The confidence calculating unit 66 is configured to calculate a confidence average value of the high-confidence pixel points in the minimum bounding rectangle. The minimum circumscribed rectangle screening unit 68 is configured to reject the minimum circumscribed rectangle when the confidence coefficient average value is smaller than a preset screening threshold.
The Minimum Bounding Rectangle (MBR) is the maximum extent, expressed in two-dimensional coordinates, of the high-confidence pixel points of the same text prediction box; it represents the rectangular region given by those pixel points. The minimum bounding rectangle may be determined or calculated in any suitable manner; calculating it from a set of known pixel points is well known to those skilled in the art and is not detailed here.
By applying the text detection device of the scene image provided by the embodiment of the invention, the width of the text prediction box can be reliably reduced, and more accurate text recognition can be realized.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses, devices and units may refer to the corresponding processes in the foregoing method embodiments and are not repeated here. Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two; to illustrate the interchangeability of hardware and software clearly, the components and steps of the examples have been described above generally in terms of their functions. Whether such functions are implemented in hardware or software depends on the particular application and the design constraints of the technical solution. Skilled artisans may implement the described functions differently for each particular application, but such implementations should not be considered beyond the scope of the present invention.
In the embodiments provided by the present invention, it should be understood that the disclosed apparatus, device and method can be implemented in other ways. For example, the above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only a logical division, and there may be other divisions when the actual implementation is performed, or units having the same function may be grouped into one unit, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may also be an electric, mechanical or other form of connection.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment of the present invention.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and various equivalent modifications and substitutions can be easily made by those skilled in the art within the technical scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A method for detecting a text of a scene image, comprising:
training and optimizing the full convolution network model;
detecting and determining a plurality of text prediction boxes in the scene image through the trained full convolution network model;
screening pixel points with confidence degrees larger than a preset confidence degree threshold value in the text prediction box as high-confidence-degree pixel points, wherein the confidence degrees are the probability that the pixel points belong to the text prediction box and are output by the full convolution network model;
according to the high-confidence-degree pixel points, calculating a minimum circumscribed rectangle corresponding to the text prediction box, wherein the minimum circumscribed rectangle is a rectangle which contains all the high-confidence-degree pixel points in the text prediction box and has the smallest area;
calculating the overlapping degree between the text prediction box and the corresponding minimum bounding rectangle;
when the overlapping degree is larger than a preset overlapping degree threshold value, the width of the text prediction box is adjusted through the minimum circumscribed rectangle;
cutting the adjusted text prediction box in the scene image to obtain a text image to be recognized;
and identifying text information in the text image to be identified.
2. The method of detecting text in an image of a scene as recited in claim 1, wherein before calculating the degree of overlap between the text prediction box and the corresponding minimum bounding rectangle, the method further comprises:
calculating the confidence coefficient average value of the high-confidence-coefficient pixel points in the minimum circumscribed rectangle;
and when the confidence coefficient average value is smaller than a preset screening threshold value, rejecting the minimum circumscribed rectangle.
3. The method for detecting the texts of the scene images according to claim 2, wherein the training optimization of the full convolution network model comprises:
constructing a full convolution network model;
labeling a training label, and constructing a training data set;
and training and optimizing the full convolution network model through the training data set and a preset loss function.
4. The method for detecting the text of the scene image according to claim 1, wherein the calculating the degree of overlap between the text prediction box and the corresponding minimum bounding rectangle comprises:
determining pixel points in the text prediction box and the minimum circumscribed rectangle at the same time as first pixel points;
determining pixel points which only belong to the text prediction box or the minimum circumscribed rectangle as second pixel points;
calculating the sum of the number of the first pixel points and the second pixel points;
and calculating the ratio of the number of the first pixel points to the sum of the number of the first pixel points and the number of the second pixel points as the overlapping degree.
5. The method for detecting the text of the scene image according to claim 1, wherein when the degree of overlap is greater than a preset threshold value of the degree of overlap, the text prediction box is adjusted according to the following formula:
P1 = w*p + (1-w)*d,
wherein P1 is the width of the adjusted text prediction box, w is a weight coefficient, P is the width of the text prediction box, and d is the width of the corresponding minimum bounding rectangle.
6. The method for detecting the text of the scene image according to claim 1, wherein the calculating the minimum bounding rectangle corresponding to the text prediction box according to the high-confidence pixel point includes:
determining two high-confidence pixel points with the farthest distance from the high-confidence pixel points as length calibration pixel points;
determining two high-confidence-degree pixel points which are farthest away in a second direction perpendicular to the first direction as width calibration pixel points by taking a connecting line between the length calibration pixel points as the first direction;
and taking a first line segment which passes through the length calibration pixel points and is perpendicular to the connecting line between the length calibration pixel points as a length, and a second line segment which passes through the width calibration pixel points and is perpendicular to the connecting line between the width calibration pixel points as a width, to enclose the minimum circumscribed rectangle.
7. A text detection apparatus for a scene image, comprising:
the training unit is used for training and optimizing the full convolution network model;
the text prediction box detection unit is used for detecting and determining a plurality of text prediction boxes in the scene image through the trained full convolution network model;
the screening unit is used for screening pixel points with confidence degrees larger than a preset confidence degree threshold value in the text prediction box as high-confidence-degree pixel points, wherein the confidence degree is the probability, output by the full convolution network model, that a pixel point belongs to the text prediction box;
the minimum circumscribed rectangle determining unit is used for calculating a minimum circumscribed rectangle corresponding to the text prediction box according to the high-confidence-degree pixel points, wherein the minimum circumscribed rectangle is a rectangle which contains all the high-confidence-degree pixel points in the text prediction box and has the smallest area;
the overlapping degree calculating unit is used for calculating the overlapping degree between the text prediction box and the corresponding minimum circumscribed rectangle;
the adjusting unit is used for adjusting the width of the text prediction box through the minimum circumscribed rectangle when the overlapping degree is larger than a preset overlapping degree threshold value;
the cutting unit is used for cutting the adjusted text prediction box in the scene image to obtain a text image to be recognized;
and the text recognition unit is used for recognizing the text information in the text image to be recognized.
8. The apparatus of claim 7, further comprising:
the confidence coefficient calculation unit is used for calculating the confidence coefficient average value of the high-confidence-coefficient pixel points in the minimum circumscribed rectangle;
and the minimum circumscribed rectangle screening unit is used for rejecting the minimum circumscribed rectangle when the confidence coefficient average value is smaller than a preset screening threshold value.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements a text detection method for a scene image according to any one of claims 1 to 6 when executing the computer program.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to execute the text detection method of a scene image according to any one of claims 1 to 6.
CN202010223195.1A 2020-03-26 2020-03-26 Text detection method and device in scene image and computer equipment Active CN111582021B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010223195.1A CN111582021B (en) 2020-03-26 2020-03-26 Text detection method and device in scene image and computer equipment
PCT/CN2020/131604 WO2021189889A1 (en) 2020-03-26 2020-11-26 Text detection method and apparatus in scene image, computer device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010223195.1A CN111582021B (en) 2020-03-26 2020-03-26 Text detection method and device in scene image and computer equipment

Publications (2)

Publication Number Publication Date
CN111582021A (en) 2020-08-25
CN111582021B (en) 2024-07-05

Family

ID=72124246

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010223195.1A Active CN111582021B (en) 2020-03-26 2020-03-26 Text detection method and device in scene image and computer equipment

Country Status (2)

Country Link
CN (1) CN111582021B (en)
WO (1) WO2021189889A1 (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111932577A (en) * 2020-09-16 2020-11-13 北京易真学思教育科技有限公司 Text detection method, electronic device and computer readable medium
CN111931784A (en) * 2020-09-17 2020-11-13 深圳壹账通智能科技有限公司 Bill recognition method, system, computer device and computer-readable storage medium
CN112232340A (en) * 2020-10-15 2021-01-15 马婧 Method and device for identifying printed information on surface of object
CN112329765A (en) * 2020-10-09 2021-02-05 中保车服科技服务股份有限公司 Text detection method and device, storage medium and computer equipment
CN112613561A (en) * 2020-12-24 2021-04-06 哈尔滨理工大学 EAST algorithm optimization method
CN112819937A (en) * 2021-04-19 2021-05-18 清华大学 Self-adaptive multi-object light field three-dimensional reconstruction method, device and equipment
CN113298079A (en) * 2021-06-28 2021-08-24 北京奇艺世纪科技有限公司 Image processing method and device, electronic equipment and storage medium
WO2021189889A1 (en) * 2020-03-26 2021-09-30 平安科技(深圳)有限公司 Text detection method and apparatus in scene image, computer device, and storage medium
CN114037826A (en) * 2021-11-16 2022-02-11 平安普惠企业管理有限公司 Text recognition method, device, equipment and medium based on multi-scale enhanced features
CN114495103A (en) * 2022-01-28 2022-05-13 北京百度网讯科技有限公司 Text recognition method, text recognition device, electronic equipment and medium

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114067237A (en) * 2021-10-28 2022-02-18 清华大学 Video data processing method, device and equipment
CN114973268A (en) * 2022-04-29 2022-08-30 北京智通东方软件科技有限公司 Text recognition method and device, storage medium and electronic equipment
CN115375987B (en) * 2022-08-05 2023-09-05 北京百度网讯科技有限公司 Data labeling method and device, electronic equipment and storage medium
CN117649635B (en) * 2024-01-30 2024-06-11 湖北经济学院 Method, system and storage medium for detecting shadow eliminating point of narrow water channel scene

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110135424A (en) * 2019-05-23 2019-08-16 阳光保险集团股份有限公司 Tilt text detection model training method and ticket image Method for text detection
CN110232713A (en) * 2019-06-13 2019-09-13 腾讯数码(天津)有限公司 A kind of image object positioning correction method and relevant device
CN110796082A (en) * 2019-10-29 2020-02-14 上海眼控科技股份有限公司 Nameplate text detection method and device, computer equipment and storage medium
CN110874618A (en) * 2020-01-19 2020-03-10 同盾控股有限公司 OCR template learning method and device based on small sample, electronic equipment and medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108304761A (en) * 2017-09-25 2018-07-20 腾讯科技(深圳)有限公司 Method for text detection, device, storage medium and computer equipment
CN109886997B (en) * 2019-01-23 2023-07-11 平安科技(深圳)有限公司 Identification frame determining method and device based on target detection and terminal equipment
CN109886174A (en) * 2019-02-13 2019-06-14 东北大学 A kind of natural scene character recognition method of warehouse shelf Sign Board Text region
CN109977943B (en) * 2019-02-14 2024-05-07 平安科技(深圳)有限公司 Image target recognition method, system and storage medium based on YOLO
CN110443140B (en) * 2019-07-05 2023-10-03 平安科技(深圳)有限公司 Text positioning method, device, computer equipment and storage medium
CN110414499B (en) * 2019-07-26 2021-06-04 第四范式(北京)技术有限公司 Text position positioning method and system and model training method and system
CN111582021B (en) * 2020-03-26 2024-07-05 平安科技(深圳)有限公司 Text detection method and device in scene image and computer equipment

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110135424A (en) * 2019-05-23 2019-08-16 阳光保险集团股份有限公司 Tilt text detection model training method and ticket image Method for text detection
CN110232713A (en) * 2019-06-13 2019-09-13 腾讯数码(天津)有限公司 A kind of image object positioning correction method and relevant device
CN110796082A (en) * 2019-10-29 2020-02-14 上海眼控科技股份有限公司 Nameplate text detection method and device, computer equipment and storage medium
CN110874618A (en) * 2020-01-19 2020-03-10 同盾控股有限公司 OCR template learning method and device based on small sample, electronic equipment and medium

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021189889A1 (en) * 2020-03-26 2021-09-30 平安科技(深圳)有限公司 Text detection method and apparatus in scene image, computer device, and storage medium
CN111932577A (en) * 2020-09-16 2020-11-13 北京易真学思教育科技有限公司 Text detection method, electronic device and computer readable medium
WO2022057471A1 (en) * 2020-09-17 2022-03-24 深圳壹账通智能科技有限公司 Bill identification method, system, computer device, and computer-readable storage medium
CN111931784A (en) * 2020-09-17 2020-11-13 深圳壹账通智能科技有限公司 Bill recognition method, system, computer device and computer-readable storage medium
CN112329765A (en) * 2020-10-09 2021-02-05 中保车服科技服务股份有限公司 Text detection method and device, storage medium and computer equipment
CN112329765B (en) * 2020-10-09 2024-05-24 中保车服科技服务股份有限公司 Text detection method and device, storage medium and computer equipment
CN112232340A (en) * 2020-10-15 2021-01-15 马婧 Method and device for identifying printed information on surface of object
CN112613561A (en) * 2020-12-24 2021-04-06 哈尔滨理工大学 EAST algorithm optimization method
CN112613561B (en) * 2020-12-24 2022-06-03 哈尔滨理工大学 EAST algorithm optimization method
CN112819937B (en) * 2021-04-19 2021-07-06 清华大学 Self-adaptive multi-object light field three-dimensional reconstruction method, device and equipment
CN112819937A (en) * 2021-04-19 2021-05-18 清华大学 Self-adaptive multi-object light field three-dimensional reconstruction method, device and equipment
CN113298079A (en) * 2021-06-28 2021-08-24 北京奇艺世纪科技有限公司 Image processing method and device, electronic equipment and storage medium
CN113298079B (en) * 2021-06-28 2023-10-27 北京奇艺世纪科技有限公司 Image processing method and device, electronic equipment and storage medium
CN114037826A (en) * 2021-11-16 2022-02-11 平安普惠企业管理有限公司 Text recognition method, device, equipment and medium based on multi-scale enhanced features
CN114495103A (en) * 2022-01-28 2022-05-13 北京百度网讯科技有限公司 Text recognition method, text recognition device, electronic equipment and medium

Also Published As

Publication number Publication date
WO2021189889A1 (en) 2021-09-30
CN111582021B (en) 2024-07-05

Similar Documents

Publication Publication Date Title
CN111582021B (en) Text detection method and device in scene image and computer equipment
CN110264444B (en) Damage detection method and device based on weak segmentation
US10783643B1 (en) Segmentation-based damage detection
CN110889446A (en) Face image recognition model training and face image recognition method and device
CN109840883B (en) Method and device for training object recognition neural network and computing equipment
CN113139543B (en) Training method of target object detection model, target object detection method and equipment
CN110516541B (en) Text positioning method and device, computer readable storage medium and computer equipment
US20230137337A1 (en) Enhanced machine learning model for joint detection and multi person pose estimation
WO2023050651A1 (en) Semantic image segmentation method and apparatus, and device and storage medium
CN109544516B (en) Image detection method and device
CN113591823B (en) Depth prediction model training and face depth image generation method and device
CN114038004A (en) Certificate information extraction method, device, equipment and storage medium
CN114266860B (en) Three-dimensional face model building method and device, electronic equipment and storage medium
US11804042B1 (en) Prelabeling of bounding boxes in video frames
CN114926766A (en) Identification method and device, equipment and computer readable storage medium
CN111639513A (en) Ship shielding identification method and device and electronic equipment
CN111292334A (en) Panoramic image segmentation method and device and electronic equipment
CN110969602B (en) Image definition detection method and device
CN113420727A (en) Training method and device of form detection model and form detection method and device
CN114549369A (en) Data restoration method and device, computer and readable storage medium
CN109492697B (en) Picture detection network training method and picture detection network training device
CN113378864B (en) Method, device and equipment for determining anchor frame parameters and readable storage medium
CN112634141A (en) License plate correction method, device, equipment and medium
CN109934185B (en) Data processing method and device, medium and computing equipment
CN115187995B (en) Document correction method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40032042

Country of ref document: HK

GR01 Patent grant
GR01 Patent grant