CN107545262B — Method and device for detecting text in natural scene image

Publication number: CN107545262B (grant); published earlier as application CN107545262A
Application number: CN201710642311.1A
Inventors: 王凯 (Wang Kai), 陈院林 (Chen Yuanlin), 乔宇 (Qiao Yu), 贺通 (He Tong)
Assignee: Huawei Technologies Co., Ltd.
Original language: Chinese (zh)
Legal status: Active

Abstract

A method and a device for detecting text in natural scene images are provided, for solving the prior-art problem of low text detection precision on natural scene images of different complexity. The method comprises the following steps: acquiring a natural scene image; performing a convolution operation on the acquired natural scene image through a fully convolutional network (FCN) model to obtain convolution features of the natural scene image; determining a text candidate region sequence included in the natural scene image according to the convolution features of the natural scene image; and, for each text candidate region, executing: extracting the convolution features of the text candidate region through a region-of-interest pooling layer, converting the convolution features of the text candidate region into a feature vector of fixed dimension k through a feature transformation, and determining the positions of the text lines included in the text candidate region according to a time recursive network model and the feature vector of fixed dimension k, where k is a positive integer.

Description

Method and device for detecting text in natural scene image
Technical Field
The present application relates to the field of text detection technologies, and in particular, to a method and an apparatus for detecting a text in a natural scene image.
Background
The natural scene image is an image of a scene that actually exists in life, captured directly by a photographing device (for example, a camera or a mobile phone with a photographing function) without specific constraints. Text in natural scene images can provide rich semantic information: for example, recognizing the text of street signs, license plates, and menus in a natural scene image helps people understand the scene conveniently. It is therefore necessary to detect text in natural scene images accurately. However, because text in natural scene images differs in font, color, and format, and the backgrounds are highly cluttered, detecting such text is a challenging task.
At present, methods for detecting text in natural scene images fall into two categories: sliding-window-based detection methods and connected-component-based detection methods. Specifically:
A sliding-window-based detection method works as follows: sliding windows of different scales scan the original natural scene image to obtain a series of sub-regions that may contain text; texture features are extracted from these sub-regions and used to train a classifier, which verifies whether each sub-region contains text.
A connected-component-based detection method works as follows: connected regions are extracted from the natural scene image using features such as the color of character pixels and the stroke width of characters; the features of the connected regions are analyzed, candidate text strings are formed by character combination rules, and the strings are verified to remove non-characters and produce the final detection result.
Both methods distinguish text from background in natural scene images using low-level features, such as character stroke width and image texture, and their detection precision is low. How to accurately detect text in natural scene images of different complexity is therefore an urgent problem to be solved.
Disclosure of Invention
The application provides a method and a device for detecting text in natural scene images, to solve the prior-art problem of low text detection precision on natural scene images of different complexity.
In a first aspect, a method for detecting text in a natural scene image is provided. The method comprises: first obtaining a natural scene image; performing a convolution operation on the obtained natural scene image through a fully convolutional network (FCN) model to obtain convolution features of the natural scene image; determining a text candidate region sequence included in the natural scene image according to the convolution features of the natural scene image; and executing, for each text candidate region in the text candidate region sequence: extracting the convolution features of the text candidate region through a region-of-interest pooling layer (roi-pooling), converting the convolution features of the text candidate region into a feature vector of fixed dimension k through a feature transformation, and determining the positions of the text lines included in the text candidate region according to a time recursive network model and the feature vector of fixed dimension k, where k is a positive integer. Each text candidate region in the text candidate region sequence comprises at least one text line, a text line being a single line of text included in the text candidate region.
In the embodiment of the application, the FCN model roughly detects the text candidate region sequence in the natural scene image, and for each text candidate region in the sequence, the time recursive network model determines the positions of the text lines included in that region, so that the text in the natural scene image is detected accurately. Unlike prior-art methods that distinguish text from background through low-level features, this method does not depend on low-level features such as character stroke width and image texture; through the deep learning capability of the FCN model and the time recursive network model, it fully exploits the context information in the natural scene image and the semantic information of the text, so the positions of text lines in the natural scene image can be determined accurately.
In one possible design, determining a text candidate region sequence included in the natural scene image according to a convolution feature of the natural scene image includes: determining convolution characteristics representing the text position in the natural scene image by combining the convolution characteristics of the natural scene image; mapping the natural scene image by using convolution characteristics representing text positions in the natural scene image, and labeling the text positions in the natural scene image and non-text positions in the natural scene image; and determining at least one region marked as a text position in the natural scene image as the text candidate region sequence.
In the above design, the text candidate region sequence included in the natural scene image is determined from the convolution features extracted by the FCN. Based on the pixel-level learning capability of the FCN, the context information in the natural scene image and the semantic information of the text are fully used to separate the text from the background, after which the text candidate region sequence included in the natural scene image is determined.
In the embodiment of the application, the text candidate regions determined from the convolution features of the natural scene image have different sizes. So that the sequence can be processed uniformly by the time recursive network model, the text candidate regions are normalized through roi-pooling: the convolution features of each text candidate region are converted into a feature vector of fixed dimension k. After this conversion, the positions of the text lines included in the text candidate region are determined according to the time recursive network model and the feature vector of fixed dimension k.
In one possible design, the time recursive network model includes N layers of long short-term memory (LSTM), where N is set to a positive integer greater than or equal to the number of text lines in the text candidate region that contains the most text lines in the sequence.
Based on a time recursive network model comprising N layers of LSTM, determining the positions of the text lines included in the text candidate region according to the model and the feature vector of fixed dimension k specifically comprises: inputting the feature vector of fixed dimension k as the time-frame input of the N LSTM layers, layer by layer, where only the first LSTM layer receives the feature vector of fixed dimension k alone, and each subsequent layer receives the output of the previous layer together with the feature vector of fixed dimension k; training the time recursive network model with the feature vector of fixed dimension k and pre-calibrated text positions to obtain a text line candidate box; regressing and detecting the upper, lower, left, and right edges of the text line candidate box and connecting them to determine the inclination angle of the candidate box; and determining the positions of the text lines in the text candidate region from the text line candidate box and its inclination angle.
In the above design, the feature vector of fixed dimension k is fed successively into the N LSTM layers of the time recursive network model, and every layer after the first also receives the detection result of the previous layer. With this recursive N-layer LSTM design, the model can use the information of previously determined text line candidate boxes when determining the current one, making the determination of the current candidate box more accurate. Furthermore, because the N-layer LSTM model determines the inclination angle of each text line candidate box, inclined text can also be detected.
In one possible design, after determining the positions of the text lines included in the text candidate region, the method further includes:
matching the positions of the text lines included in the determined text candidate regions with pre-calibrated text positions through a matching algorithm, and determining the text line with the highest degree of matching to the pre-calibrated text positions; and determining the error between that text line and the calibrated text position through an error algorithm, and updating the network parameters of the FCN model and the time recursive network model according to the error.
In this design, the matching algorithm determines the text line that best matches the pre-calibrated text position, and the process ensures that only one position, the best-matching one, is retained for each text line, which makes the text lines detected in the natural scene image more accurate.
In one possible design, N may be set to 5, but may be set to other values.
In a second aspect, the present application provides an apparatus for detecting a text in a natural scene image, where the apparatus for detecting a text in a natural scene image has a function of implementing the method in the first aspect, and the function may be implemented by hardware or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above-described functions. The modules may be software and/or hardware.
In a third aspect, the present application provides an apparatus that may include a memory and a processor. Wherein the memory is adapted to store a program and the processor is adapted to execute the program in the memory to perform the method of detecting text in images of natural scenes as referred to in the first aspect or any possible design of the first aspect.
In a fourth aspect, the present application further provides a computer-readable storage medium, on which some instructions are stored, and when the instructions are called by a computer, the instructions can make the computer perform the method for detecting text in a natural scene image, which is involved in any one of the possible designs of the first aspect and the first aspect.
In a fifth aspect, the present application provides a computer program product which, when invoked by a computer, performs the method for detecting text in images of natural scenes according to the first aspect and any possible design of the first aspect described above.
Drawings
Fig. 1 is a schematic diagram of a network structure for detecting text in a natural scene image according to the present application;
FIG. 2 is a flowchart of a method for detecting text in a natural scene image according to the present application;
FIG. 3 is a flowchart of a method for determining a sequence of candidate regions of text included in an image of a natural scene according to the present application;
FIG. 4 is a flowchart of a method for determining locations of lines of text included in candidate regions of text according to the present application;
FIG. 5 is a schematic diagram of another network structure for detecting text in a natural scene image according to the present application;
FIG. 6 is a schematic diagram of a text line matching provided by the present application;
FIG. 7 is a schematic diagram of an apparatus for detecting text in a natural scene image according to the present application;
fig. 8 is an apparatus for detecting text in a natural scene image according to the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described below with reference to the accompanying drawings.
The embodiment of the application provides a method and a device for detecting texts in natural scene images, which are used for solving the problem of low text detection precision in the natural scene images with different complexity degrees in the prior art. The method and the device are based on the same inventive concept, and because the principles of solving the problems of the method and the device are similar, the implementation of the device and the method can be mutually referred, and repeated parts are not repeated.
The method and the device for detecting the text in the natural scene image, provided by the embodiment of the application, can be applied to equipment for detecting the text in the natural scene image, such as a computer, a tablet computer, a smart phone, a server and the like.
The application fields of the embodiment of the application include, but are not limited to, a field of detecting text in a natural scene image, a field of detecting text-like small objects in a natural scene image, or a field of detecting other types of objects.
Fig. 1 shows a schematic diagram of a network structure for detecting text in a natural scene image according to an embodiment of the present application. Referring to Fig. 1, the network structure includes an FCN model, a time recursive network model, and a roi-pooling layer. The FCN model obtains the natural scene image and processes it to obtain the text candidate region sequence in the image; roi-pooling processes the text candidate region sequence to obtain feature vectors of fixed dimension; and the positions of the text lines included in each text candidate region are determined according to the time recursive network model and the fixed-dimension feature vectors.
In the embodiment of the present application, the network structure for detecting a text in a natural scene image includes, but is not limited to, the network structure shown in fig. 1.
In the embodiment of the present application, the FCN model may be reconstructed from an existing convolutional neural network structure; the embodiment does not limit which structure is used. For example, the FCN model may be built from the ResNet-101 structure of a deep residual network (ResNet): the fully connected layer in the ResNet-101 architecture is replaced by a deconvolution layer, and the numbers of convolution and pooling layers can be chosen according to the actual application. An FCN model reconstructed this way consists only of convolution and pooling layers and no longer contains a fully connected layer, so the input image can be of any size, low-resolution spatial position information is preserved, and end-to-end pixel-level prediction is possible.
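As an illustration, the following is a minimal PyTorch sketch of this construction; the single deconvolution layer, its kernel size, and the 1024-dimensional output are assumptions chosen for clarity, not the patent's exact architecture.

```python
import torch
import torch.nn as nn
from torchvision import models

class TextFCN(nn.Module):
    """Sketch of an FCN built from ResNet-101: the average-pooling and
    fully connected layers are dropped and a deconvolution (transposed
    convolution) layer is added, so the network accepts inputs of any
    size and produces dense convolution features."""
    def __init__(self, feat_dim: int = 1024):
        super().__init__()
        backbone = models.resnet101(weights=None)
        # Keep only the convolution/pooling stages; drop avgpool and fc.
        self.features = nn.Sequential(*list(backbone.children())[:-2])
        # Deconvolution layer (assumed to be a single layer) upsamples the
        # last conv feature map, preserving spatial position information.
        self.deconv = nn.ConvTranspose2d(2048, feat_dim, kernel_size=4,
                                         stride=2, padding=1)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        conv = self.features(image)   # (B, 2048, H/32, W/32)
        return self.deconv(conv)      # (B, feat_dim, H/16, W/16)

# No fully connected layer remains, so any input size works.
fcn = TextFCN()
feats = fcn(torch.randn(1, 3, 480, 640))
print(feats.shape)  # torch.Size([1, 1024, 30, 40])
```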
Fig. 2 is a flowchart illustrating a method for detecting text in a natural scene image according to an embodiment of the present application. Referring to Fig. 2, the method includes:
S101: acquiring a natural scene image. In the embodiment of the present application, the natural scene image is an image of a scene that actually exists in life, captured directly by a photographing device (for example, a camera or a mobile phone with a photographing function) without specific constraints.
It should be noted that the ways of acquiring the natural scene image include, but are not limited to: collecting the natural scene image through a sensing device, or obtaining it from a database in which natural scene images are stored in advance. The sensing devices include, but are not limited to: optical-fiber sensing equipment, camera equipment, acquisition equipment, and the like. The database includes, but is not limited to: a local database, a cloud database, a USB flash drive, a hard disk, and the like.
S102: performing a convolution operation on the natural scene image through the FCN model to obtain the convolution features of the natural scene image. In the embodiment of the application, the constructed FCN model performs the convolution operation on the acquired natural scene image, and the deconvolution layer of the FCN model recovers the convolution features of the last convolution layer, yielding the convolution features of the natural scene image.
S103: and determining a text candidate region sequence included in the natural scene image according to the convolution characteristics of the natural scene image. Each text candidate region in the text candidate region sequence at least comprises one text line, and the text line is a single line of text included in the text candidate region.
In the embodiment of the present application, each text candidate region determined from the convolution features of the natural scene image includes at least one text line, and the final goal of detecting text in the natural scene image is to output every independent text line. To determine accurately the text lines included in each candidate region, the following operations S104 and S105 may be performed for each text candidate region in the sequence.
S104: extracting the convolution features of the text candidate region through roi-pooling, and converting them into a feature vector of fixed dimension k through a feature transformation, where k is a positive integer. All later mentions of the feature vector of fixed dimension k refer to this same vector.
In the embodiment of the application, the text candidate regions determined from the convolution features of the natural scene image have different sizes. So that the text candidate region sequence can later be processed uniformly by the time recursive network model, the regions are normalized through roi-pooling, converting the convolution features of each text candidate region into a feature vector of fixed dimension k.
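The normalization step can be sketched with torchvision's roi_pool operator; the 7x7 output grid and the coordinates below are illustrative assumptions, not values fixed by the patent.

```python
import torch
from torchvision.ops import roi_pool

C = 1024
feats = torch.randn(1, C, 30, 40)  # FCN feature map for one image

# Text candidate regions of different sizes, as (batch_idx, x1, y1, x2, y2)
# in feature-map coordinates; the values here are made up for the example.
regions = torch.tensor([[0.,  2.,  5., 20., 10.],
                        [0., 12.,  8., 38., 25.]])

# roi_pool resamples every region onto the same output grid, so each
# candidate yields a feature vector of the same fixed dimension k.
pooled = roi_pool(feats, regions, output_size=(7, 7), spatial_scale=1.0)
k_vectors = pooled.flatten(start_dim=1)  # shape (num_regions, k)
print(k_vectors.shape)                   # k = C * 7 * 7 = 50176
```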
S105: and determining the positions of the text lines included in the text candidate region according to the time recursive network model and the feature vector of the fixed dimension k.
In the embodiment of the application, based on FCN and time recursive network technology, a network structure fusing an FCN model with a time recursive network model is designed to detect text in natural scene images. The FCN model and the time recursive network model learn effective feature expressions from a large number of natural scene image training samples, and a fusion network capable of detecting text lines in natural scene images is trained. Specifically, the FCN model roughly detects the text candidate region sequence in the natural scene image, and for each text candidate region in the sequence, the time recursive network model determines the positions of the text lines included in that region, so that the text in the natural scene image is detected accurately. Unlike prior-art methods that distinguish text from background through low-level features, this method does not depend on low-level features such as character stroke width and image texture; through the deep learning capability of the FCN model and the time recursive network model, it fully exploits the context information in the natural scene image and the semantic information of the text, so the positions of text lines in the natural scene image can be determined accurately.
Referring to fig. 3, a process of determining a text candidate region sequence included in a natural scene image according to a convolution feature of the natural scene image is specifically described:
S201: determining the convolution features representing the text positions in the natural scene image by merging the convolution features of the natural scene image.
In this embodiment of the application, the convolution feature of the natural scene image extracted through the FCN model may include features of multiple dimensions (for example, 1024 dimensions) of the natural scene image, and in order to determine the text candidate region sequence in the natural scene image, the convolution feature representing the text position in the natural scene image is determined in the natural scene image by merging the convolution features of the natural scene image.
S202: and mapping the natural scene image by using the convolution characteristic representing the text position in the natural scene image, and labeling the text position in the natural scene image and the non-text position in the natural scene image by using a classification function. In the embodiment of the present application, the classification function used for labeling the text position in the natural scene image and the non-text position in the natural scene image is not limited, and may be a logistic function, a softmax function, or the like.
S203: and determining at least one region marked as a text position in the natural scene image as a text candidate region sequence.
In this embodiment of the present application, after the text candidate region sequence in the natural scene image is determined by the FCN model, the convolution features of each text candidate region are extracted through roi-pooling and converted into a feature vector of fixed dimension k through a feature transformation (see S104). After this conversion, the positions of the text lines included in the text candidate region are determined according to the time recursive network model and the feature vector of fixed dimension k.
In an embodiment of the present application, the time recursive network model may include N layers of LSTM, where N is set to a positive integer greater than or equal to the number of text lines in the text candidate region that contains the most text lines in the sequence. For example, if four text candidate regions are determined in the natural scene image, marked as text candidate regions A, B, C, and D, and counting the text lines in the four regions shows they contain 2, 3, 1, and 2 text lines respectively, then N is set to a positive integer greater than or equal to 3.
In the embodiment of the present application, taking a time recursive network model that includes N layers of LSTM as an example, the process of determining the positions of the text lines included in a text candidate region according to the model and the feature vector of fixed dimension k is described with reference to Fig. 4:
S301: inputting the feature vector of fixed dimension k as the time-frame input of the N LSTM layers, feeding the LSTM layers of the time recursive network model one by one.
Only the first LSTM layer receives the feature vector of fixed dimension k alone; each subsequent layer receives the output of the previous layer together with the feature vector of fixed dimension k. The time recursive network model is trained with the feature vector of fixed dimension k and pre-calibrated text positions to obtain a text line candidate box.
In the embodiment of the application, the feature vector of fixed dimension k is fed successively into the N LSTM layers of the time recursive network model, and every layer after the first also receives the detection result of the previous layer. With this N-layer LSTM design, the model can use the information of previously determined text line candidate boxes when determining the current one, making the determination of the current candidate box more accurate. A sketch of this decoder follows.
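The sketch below shows one way to realize the N-layer recursive design in PyTorch; the hidden size, the box parameterization (five values, e.g. four edges plus an inclination angle), and the use of one LSTMCell per layer are assumptions for illustration.

```python
import torch
import torch.nn as nn

class LineDecoder(nn.Module):
    """N-layer recursive LSTM sketch: layer 1 sees only the fixed-
    dimension feature vector; every later layer sees that vector
    concatenated with the previous layer's output, so each text line
    prediction can use the lines found before it."""
    def __init__(self, k: int, hidden: int = 256, n_layers: int = 5):
        super().__init__()
        self.hidden = hidden
        self.first = nn.LSTMCell(k, hidden)
        self.rest = nn.ModuleList(
            nn.LSTMCell(k + hidden, hidden) for _ in range(n_layers - 1))
        # Each layer emits one text line candidate box (5 values assumed).
        self.box_head = nn.Linear(hidden, 5)

    def forward(self, feat_k: torch.Tensor) -> torch.Tensor:
        h = torch.zeros(feat_k.size(0), self.hidden)
        c = torch.zeros_like(h)
        h, c = self.first(feat_k, (h, c))         # first layer: vector only
        boxes = [self.box_head(h)]
        for cell in self.rest:                    # later layers: vector + prev
            h, c = cell(torch.cat([feat_k, h], dim=1), (h, c))
            boxes.append(self.box_head(h))
        return torch.stack(boxes, dim=1)          # (batch, N, 5)

decoder = LineDecoder(k=512)
lines = decoder(torch.randn(2, 512))  # one candidate box per LSTM layer
```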
S302: and performing regression, detection and communication on the upper edge, the lower edge, the left edge and the right edge of the text line candidate box to determine the inclination angle of the text line candidate box.
In the embodiment of the application, the upper edge, the lower edge, the left edge and the right edge of the text line candidate box are regressed, detected and communicated through the time recursive network model, and then the inclination angle of the text line candidate box can be determined, so that the method for detecting the text in the natural scene image provided by the embodiment of the application can support the detection of the inclined text.
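To make the angle step concrete: assuming the regressed upper edge is available as two endpoints (a representation chosen here only for illustration), the inclination angle follows from the edge's slope.

```python
import math

def inclination_degrees(p1: tuple, p2: tuple) -> float:
    """Inclination of a text line candidate box from the two endpoints
    of its regressed upper edge (assumed representation)."""
    (x1, y1), (x2, y2) = p1, p2
    return math.degrees(math.atan2(y2 - y1, x2 - x1))

# An upper edge that falls 12 pixels over a 100-pixel run: tilted text.
print(inclination_degrees((10, 40), (110, 28)))  # about -6.8 degrees
```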
S303: and determining the positions of the text lines included in the text candidate area according to the text line candidate box and the inclination angle of the text line candidate box.
In the embodiment of the application, the positions of single text lines in the text candidate regions are determined one by one through N LSTMs included in the time recursive network model, and the accurate detection of the text lines and the inclination angles of the text lines is realized by combining the characteristics of the text candidate regions extracted by the FCN.
In practical tests, the number of text lines included in a text candidate region determined in a natural scene image does not exceed 4. Therefore, in one possible design of the embodiment of the present application, the number of LSTM layers in the time recursive network model is set to 5, that is, N is set to 5, so that the positions of the text lines in all text candidate regions can be determined by the time recursive network model designed in the embodiment of the present application. A network structure with the number of LSTM layers N set to 5 is shown in Fig. 5.
It should be noted that, if the number of text lines included in the text candidate region is less than N, then after the positions of all text lines in the region have been determined by the first M LSTM layers (M being a positive integer less than N), the remaining N − M LSTM layers output null values.
In the embodiment of the present application, the positions of the text lines determined by the N LSTM layers may not be output in order. For example, if the text candidate region includes three lines of text, the N LSTM layers may output the second line's position, then the first line's, then the third line's, while the expected order is first, second, third. Moreover, the text line positions determined by the N LSTM layers may contain false detections; for example, a text candidate region may actually include three lines of text while the N LSTM layers determine four text line positions. Because of these problems, in the embodiment of the present application, after the positions of the text lines included in the text candidate region are determined, they are matched against pre-calibrated text positions through a matching algorithm; the text line with the highest degree of matching to the pre-calibrated positions is determined; the error between that text line and the calibrated text position is determined through an error algorithm; and the network parameters of the entire fusion network are updated according to the error.
In the embodiment of the application, while the matching algorithm determines the text line that best matches each pre-calibrated text position, only one position, the best-matching one, is retained per text line. Specifically, during matching, a matching score represents the degree of matching between the position of a text line included in the text candidate region and the pre-calibrated text position; the higher the score, the higher the degree of matching. Filtering out text line positions whose matching score is below a preset threshold yields the text line with the highest degree of matching to the pre-calibrated position.
The matching process is illustrated with an example; see Fig. 6. Suppose the currently detected text candidate region includes two text lines, indicated by solid boxes in Fig. 6, and the text line positions currently determined by the N LSTM layers are indicated by dashed boxes 1, 2, 3, and 4. The dashed boxes are matched against the solid boxes by the matching algorithm, and the dashed boxes with the highest degree of matching are determined; in Fig. 6 these are dashed boxes 2 and 4, so the text lines in the natural scene image can be determined from the positions corresponding to dashed boxes 2 and 4.
In the embodiment of the application, the matching algorithm used to match the positions of the text lines included in the determined candidate regions with the pre-calibrated text positions is not limited. For example, it may be a Hungarian loss algorithm (Hungarian-loss), where the Hungarian algorithm finds a maximum matching in a bipartite graph using augmenting paths; it can effectively determine the text line position with the highest degree of matching to a pre-calibrated text position.
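SciPy's linear_sum_assignment implements the Hungarian algorithm, so the matching step can be sketched as follows; using IoU as the matching score and 0.5 as the threshold are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

# Detected text line positions (dashed boxes) vs. pre-calibrated
# positions (solid boxes); the coordinates are illustrative.
pred = [(8, 10, 100, 30), (10, 12, 102, 32), (9, 50, 98, 70), (11, 52, 99, 71)]
gt   = [(10, 11, 101, 31), (10, 51, 99, 70)]

score = np.array([[iou(p, g) for g in gt] for p in pred])
rows, cols = linear_sum_assignment(-score)  # Hungarian: maximize total score
THRESH = 0.5                                # assumed score threshold
kept = [(r, c) for r, c in zip(rows, cols) if score[r, c] > THRESH]
print(kept)  # one best-matching detection retained per calibrated line
```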
In the embodiment of the application, the network parameters of the whole fusion network are adjusted by determining the error between the text line with the highest matching degree and the calibrated text position, so that the performance of the FCN fusion time recursive network designed in the embodiment of the application is improved.
In the embodiment of the present application, an error algorithm for determining an error between a text line with the highest matching degree and a calibrated text position is not limited, and for example, the error algorithm may be a cross entropy error algorithm.
Based on the same conception as the method embodiment, the embodiment of the application also provides a device for detecting text in natural scene images. Fig. 7 is a schematic logical structure diagram of such a device, applicable to equipment that detects text in natural scene images. Referring to Fig. 7, a device 100 for detecting text in a natural scene image includes an obtaining unit 101 and a processing unit 102. The obtaining unit 101 is configured to obtain a natural scene image; it may be a communication interface or transceiver provided in the device itself (for example, a remote device may transmit the natural scene image to the transceiver or communication interface wirelessly or by wire), or an input interface provided in the device itself (e.g., a keyboard, a USB interface, a touch screen) through which a user can input the natural scene image. The processing unit 102 is configured to perform a convolution operation on the natural scene image acquired by the obtaining unit 101 through an FCN model to obtain the convolution features of the natural scene image, determine the text candidate region sequence included in the natural scene image according to those convolution features, and execute, for each text candidate region in the sequence: extracting the convolution features of the text candidate region through a region-of-interest pooling layer (roi-pooling), converting the convolution features of the text candidate region into a feature vector of fixed dimension k through a feature transformation, where k is a positive integer, and determining the positions of the text lines included in the text candidate region according to a time recursive network model and the feature vector of fixed dimension k.
Wherein each text candidate region in the sequence of text candidate regions comprises at least one line of text, the line of text being a single line of text comprised in the text candidate region.
In a possible design, the processing unit 102 may determine a convolution feature representing a text position in the natural scene image, specifically by combining convolution features of the natural scene image; mapping the natural scene image by using convolution characteristics representing text positions in the natural scene image, and labeling the text positions in the natural scene image and non-text positions in the natural scene image through a classification function; and determining at least one region marked as a text position in the natural scene image as the text candidate region sequence.
In another possible design, the time recursive network model includes N layers of long short-term memory (LSTM), where N is set to a positive integer greater than or equal to the number of text lines in the text candidate region that contains the most text lines in the sequence. The processing unit 102 may specifically input the feature vector of fixed dimension k as the time-frame input of the N LSTM layers, feeding the LSTM layers of the time recursive network model one by one, where only the first LSTM layer receives the feature vector of fixed dimension k alone and each subsequent layer receives the output of the previous layer together with the feature vector of fixed dimension k, and train the time recursive network model with the feature vector of fixed dimension k and pre-calibrated text positions to obtain a text line candidate box; regress and detect the upper, lower, left, and right edges of the text line candidate box and connect them to determine the inclination angle of the candidate box; and determine the positions of the text lines in the text candidate region from the text line candidate box and its inclination angle.
In yet another possible design, after determining the positions of the text lines included in the text candidate region, the processing unit 102 may further match the determined positions of the text lines included in the text candidate region with pre-calibrated text positions through a matching algorithm, and determine a text line with a highest degree of matching with the pre-calibrated text positions; and determining the error between the text line with the highest matching degree and the calibrated text position through an error algorithm, and updating the network parameters according to the error.
The value of N mentioned in the above embodiments may be set to 5, but is not limited thereto.
The division of modules in the embodiments of the present application is schematic and is only one kind of logical function division; in actual implementation there may be other division manners. In addition, the functional modules in the embodiments of the present application may be integrated into one processor, may exist alone physically, or two or more modules may be integrated into one module. The integrated module can be implemented in the form of hardware or in the form of a software functional module.
When the integrated module is implemented in the form of hardware, as shown in Fig. 8, Fig. 8 is a schematic diagram of a device 1000 for detecting text in a natural scene image according to an embodiment of the present application. The device 1000 may be used to perform the methods referred to in Figs. 2-4. As shown in Fig. 8, the device 1000 includes a processor 1001 and a memory 1002. The memory 1002 stores computer programs, instructions, or code. The processor 1001 may call and execute the programs, instructions, or code stored in the memory 1002 to implement the steps and functions in the foregoing embodiments, which are not described again here. For the specific implementation of the processor 1001, reference may be made to the descriptions of the obtaining unit 101 and the processing unit 102 in the implementation of Fig. 7, which is not repeated here.
It will be appreciated that fig. 8 only shows a simplified design of an apparatus for detecting text in images of natural scenes. In practical applications, the device for detecting a text in a natural scene image is not limited to the above structure, and in practical applications, any number of interfaces, processors, memories, and the like may be included, and all devices that can detect a text in a natural scene image according to the embodiments of the present application are within the scope of the embodiments of the present application.
It can be further understood that the apparatus 100 for detecting a text in a natural scene image and the device 1000 for detecting a text in a natural scene image according to the embodiments of the present application can be used to implement corresponding functions in the foregoing method embodiments according to the embodiments of the present application, so that for places where descriptions of the embodiments of the present application are not detailed enough, reference may be made to descriptions of related method embodiments, and details of the embodiments of the present application are not repeated here.
It is further understood that the processor referred to in the embodiments of the present application may be a Central Processing Unit (CPU), and may also be other general purpose processors, Digital Signal Processors (DSP), Application Specific Integrated Circuits (ASIC), Field Programmable Gate Arrays (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, and so on. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory may include both read-only memory and random access memory, and provides instructions and data to the processor. A portion of the memory may also include non-volatile random access memory. For example, the memory may also store device type information.
The bus system may include a power bus, a control bus, a status signal bus, and the like, in addition to the data bus. For clarity of illustration, however, the various buses are labeled as a bus system in the figures.
In implementation, the steps involved in the above method embodiments may be performed by integrated logic circuits of hardware in a processor or by instructions in the form of software. The steps of the method for detecting text in a natural scene image disclosed in the embodiments of the present application may be directly implemented by a hardware processor, or implemented by a combination of hardware and software modules in the processor. The software module may be located in RAM, flash memory, ROM, PROM, EPROM, registers, or other storage media well known in the art. The storage medium is located in a memory, and a processor reads the information in the memory and completes the steps of the above method embodiments in combination with its hardware. To avoid repetition, details are not described again here.
Based on the same concept as the method embodiment, the embodiment of the present application further provides a computer-readable storage medium, on which some instructions are stored, and when the instructions are called by a computer and executed, the instructions may cause the computer to perform the method involved in any one of the possible designs of the method embodiment and the method embodiment.
Based on the same concept as the above method embodiments, the present application also provides a computer program product, which when called by a computer can perform the method as referred to in the method embodiments and any possible design of the above method embodiments.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (13)

1. A method for detecting text in an image of a natural scene, comprising:
acquiring a natural scene image;
performing a convolution operation on the natural scene image through a fully convolutional network (FCN) model to obtain convolution characteristics of the natural scene image;
determining a text candidate region sequence included in the natural scene image according to the convolution characteristics of the natural scene image, wherein each text candidate region in the text candidate region sequence at least comprises one text line, and the text line is a single line of text included in the text candidate region;
for each text candidate region in the sequence of text candidate regions, performing:
extracting convolution characteristics of the text candidate region through a region-of-interest pooling layer (roi-pooling), and converting the convolution characteristics of the text candidate region into a characteristic vector of a fixed dimension k through characteristic transformation, wherein k is a positive integer;
and determining the positions of the text lines in the text candidate region according to a time recursive network model and the feature vector of the fixed dimension k.
2. The method of claim 1, wherein determining the sequence of text candidate regions included in the natural scene image based on the convolution characteristics of the natural scene image comprises:
determining convolution characteristics representing the text position in the natural scene image by combining the convolution characteristics of the natural scene image;
mapping the natural scene image by using convolution characteristics representing text positions in the natural scene image, and labeling the text positions in the natural scene image and non-text positions in the natural scene image through a classification function;
and determining at least one region marked as a text position in the natural scene image as the text candidate region sequence.
3. The method of claim 2, wherein the time recursive network model comprises N layers of long short-term memory (LSTM), where N is set to a positive integer greater than or equal to the number of text lines in the text candidate region having the largest number of text lines in the sequence of text candidate regions;
determining the positions of the text lines included in the text candidate region according to the time recursive network model and the feature vector of the fixed dimension k, including:
inputting the feature vector of the fixed dimension k as the time-frame input of the N layers of LSTM, feeding the LSTM layers comprised in the time recursive network model one by one, wherein only the first LSTM layer in the time recursive network model receives the feature vector of the fixed dimension k for the first input, each subsequent LSTM layer receives the output of the previous LSTM layer together with the feature vector of the fixed dimension k, and the time recursive network model is trained with the feature vector of the fixed dimension k and pre-calibrated text positions to obtain a text line candidate box;
regressing and detecting the upper edge, the lower edge, the left edge, and the right edge of the text line candidate box and connecting them, and determining the inclination angle of the text line candidate box;
and determining the positions of the text lines in the text candidate region according to the text line candidate box and the inclination angle of the text line candidate box.
4. The method of claim 3, wherein after determining the locations of the lines of text included in the text candidate region, the method further comprises:
matching the positions of the text lines included in the determined text candidate regions with pre-calibrated text positions through a matching algorithm, and determining the text line with the highest degree of matching to the pre-calibrated text positions;
and determining the error between the text line with the highest matching degree and the calibrated text position through an error algorithm, and updating the network parameters according to the error.
5. The method of claim 3 or 4, wherein N is set to 5.
6. An apparatus for detecting text in an image of a natural scene, comprising:
an acquisition unit configured to acquire a natural scene image;
a processing unit, configured to perform a convolution operation on the natural scene image through a fully convolutional network (FCN) model to obtain convolution characteristics of the natural scene image, and determine a text candidate region sequence included in the natural scene image according to the convolution characteristics of the natural scene image, wherein each text candidate region in the text candidate region sequence at least comprises one text line, the text line is a single line of text included in the text candidate region, and for each text candidate region in the text candidate region sequence, execute: extracting convolution characteristics of the text candidate region through a region-of-interest pooling layer (roi-pooling), converting the convolution characteristics of the text candidate region into a characteristic vector of a fixed dimension k through characteristic transformation, wherein k is a positive integer, and determining the positions of text lines included in the text candidate region according to a time recursive network model and the characteristic vector of the fixed dimension k.
7. The apparatus according to claim 6, wherein the processing unit, when determining the sequence of text candidate regions included in the natural scene image according to the convolution feature of the natural scene image, is specifically configured to:
determining convolution characteristics representing the text position in the natural scene image by combining the convolution characteristics of the natural scene image;
mapping the natural scene image by using convolution characteristics representing text positions in the natural scene image, and labeling the text positions in the natural scene image and non-text positions in the natural scene image through a classification function;
and determining at least one region marked as a text position in the natural scene image as the text candidate region sequence.
8. The apparatus of claim 7, wherein the time recursive network model comprises N layers of long short-term memory (LSTM), where N is set to a positive integer greater than or equal to the number of text lines in the text candidate region having the largest number of text lines in the sequence of text candidate regions;
when the processing unit determines the position of the text line included in the text candidate region according to the time recursive network model and the feature vector of the fixed dimension k, the processing unit is specifically configured to:
inputting the feature vector of the fixed dimension k as the time-frame input of the N layers of LSTM, feeding the LSTM layers comprised in the time recursive network model one by one, wherein only the first LSTM layer in the time recursive network model receives the feature vector of the fixed dimension k for the first input, each subsequent LSTM layer receives the output of the previous LSTM layer together with the feature vector of the fixed dimension k, and the time recursive network model is trained with the feature vector of the fixed dimension k and pre-calibrated text positions to obtain a text line candidate box;
regressing and detecting the upper edge, the lower edge, the left edge, and the right edge of the text line candidate box and connecting them, and determining the inclination angle of the text line candidate box;
and determining the positions of the text lines in the text candidate region according to the text line candidate box and the inclination angle of the text line candidate box.
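To make the data flow of claim 8 concrete, here is a hedged sketch in which each of the N LSTM layers acts as one time frame and emits one text-line hypothesis: the first layer sees only the fixed-dimension-k vector, and every later layer sees the previous layer's output concatenated with that vector. The hidden size, the reuse of the previous layer's state, and the 5-value output (four edges plus an inclination angle) are assumptions for illustration.

```python
# Hypothetical sketch of claim 8's N-layer time recursive decoding.
import torch
import torch.nn as nn

class TextLineDecoder(nn.Module):
    def __init__(self, k=512, hidden=256, n_layers=5):  # claim 10 sets N = 5
        super().__init__()
        # First layer receives only the fixed-dimension-k feature vector.
        self.first = nn.LSTMCell(k, hidden)
        # Each later layer receives the previous output plus the k-vector.
        self.rest = nn.ModuleList(
            nn.LSTMCell(hidden + k, hidden) for _ in range(n_layers - 1))
        # Regress top/bottom/left/right edges plus the inclination angle.
        self.box_head = nn.Linear(hidden, 5)

    def forward(self, feat_k):
        # feat_k: (R, k), one vector per text candidate region.
        h, c = self.first(feat_k)
        lines = [self.box_head(h)]
        for cell in self.rest:
            h, c = cell(torch.cat([h, feat_k], dim=1), (h, c))
            lines.append(self.box_head(h))
        return torch.stack(lines, dim=1)  # (R, N, 5) line candidates
```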
9. The apparatus of claim 8, wherein the processing unit is further configured to:
after the positions of the text lines in the text candidate region are determined, match the determined positions of the text lines with pre-calibrated text positions through a matching algorithm, and determine the text line with the highest matching degree with the pre-calibrated text positions;
and determine, through an error algorithm, the error between the text line with the highest matching degree and the calibrated text position, and update the network parameters according to the error.
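Claims 4 and 9 recite the same training refinement: match the predicted text lines against pre-calibrated positions, take the best match, compute an error, and update the network parameters. The sketch below uses IoU as the matching algorithm and smooth-L1 as the error algorithm; both choices are assumptions, since the patent names neither.

```python
# Hypothetical sketch of the matching / error / update step of claims 4 and 9.
import torch
import torch.nn.functional as F
from torchvision.ops import box_iou

def training_step(pred_lines, gt_boxes, optimizer):
    # pred_lines: (N, 5) decoder output; the first 4 values are taken here
    # to be (x1, y1, x2, y2) edge coordinates, the 5th the inclination angle.
    # gt_boxes: (G, 4) pre-calibrated text positions.
    iou = box_iou(pred_lines[:, :4], gt_boxes)   # matching degrees, (N, G)
    best_gt = iou.argmax(dim=1)                  # highest-matching target
    loss = F.smooth_l1_loss(pred_lines[:, :4], gt_boxes[best_gt])
    optimizer.zero_grad()
    loss.backward()                              # the "error algorithm"
    optimizer.step()                             # update network parameters
    return loss.item()
```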
10. The apparatus of claim 8 or 9, wherein N is set to 5.
11. An apparatus comprising the device for detecting text in a natural scene image according to any one of claims 6 to 10.
12. A computer-readable storage medium having stored thereon computer instructions which, when executed on a computer, cause the computer to perform the method of any one of claims 1-5.
13. A computer program product, which, when called by a computer, causes the computer to perform the method of any one of claims 1 to 5.
CN201710642311.1A 2017-07-31 2017-07-31 Method and device for detecting text in natural scene image Active CN107545262B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710642311.1A CN107545262B (en) 2017-07-31 2017-07-31 Method and device for detecting text in natural scene image

Publications (2)

Publication Number Publication Date
CN107545262A CN107545262A (en) 2018-01-05
CN107545262B (en) 2020-11-06

Family

ID=60970281

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710642311.1A Active CN107545262B (en) 2017-07-31 2017-07-31 Method and device for detecting text in natural scene image

Country Status (1)

Country Link
CN (1) CN107545262B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant