WO2022262239A1 - Text identification method, apparatus and device, and storage medium - Google Patents

Text identification method, apparatus and device, and storage medium Download PDF

Info

Publication number
WO2022262239A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
picture
convolution
feature
text picture
Prior art date
Application number
PCT/CN2021/139972
Other languages
French (fr)
Chinese (zh)
Inventor
赵坤
杨争艳
吴嘉嘉
殷兵
胡金水
刘聪
胡国平
Original Assignee
科大讯飞股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 科大讯飞股份有限公司
Publication of WO2022262239A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Definitions

  • the present application relates to the technical field of natural language processing, and more specifically, to a text recognition method, apparatus, device and storage medium.
  • text recognition is increasingly widely used in everyday life, for example road sign recognition in autonomous driving, photo translation, and document scanning and recognition.
  • the present application is proposed to provide a text recognition method, apparatus, device and storage medium, so as to accurately perform text recognition on text pictures to be recognized whose directions are diverse.
  • the specific scheme is as follows:
  • a text recognition method comprising:
  • said extracting image features in at least two different directions from said text picture includes:
  • using the convolutional network to extract image features in at least two different directions of the text picture, wherein the feature map output by each convolutional layer in the convolutional network is formed by fusing at least two feature submaps, and the at least two feature submaps are obtained by performing a convolution operation on the feature map output by the previous convolutional layer with the same convolution kernel before rotation and after at least one rotation.
  • said utilizing said convolutional network to extract image features in at least two different directions of said text picture includes:
  • the convolution kernel of each convolutional layer includes the original convolution kernel and the convolution kernel obtained from it by at least one rotation;
  • the feature map output by the last convolutional layer of the convolutional network is used as the image feature of the text picture.
  • said utilizing said convolutional network to extract image features in at least two different directions of said text picture includes:
  • the feature map output by the last convolutional layer of the convolutional network is used as the image feature of the text picture.
  • the at least two feature submaps include:
  • the feature submap obtained by convolving the feature map output by the previous convolutional layer with the same convolution kernel before rotation;
  • the feature submap obtained by convolving the feature map output by the previous convolutional layer with the rotated convolution kernel.
  • the identifying the text content contained in the text picture based on the extracted image features in the at least two different directions includes:
  • the recognition network and the convolutional network form a text recognition model, and the text recognition model is trained by using sample picture training data labeled with text content recognition results.
  • said acquiring the text picture to be recognized includes:
  • if it is detected that the original text picture is tilted relative to the horizontal direction, the original text picture is rotated to the horizontal direction and used as the text picture to be recognized.
  • the method further includes:
  • if it is determined that the aspect ratio exceeds the set threshold, the original text picture in the horizontal direction is rotated by 90 degrees and used as the text picture to be recognized.
  • before inputting the text picture into the pre-built convolutional network, the method further includes:
  • then said inputting the text picture into the pre-built convolutional network includes:
  • of the text content contained in the forward text picture and the text content contained in the reverse text picture, the one with higher confidence is taken as the final recognition result.
  • a text recognition device comprising:
  • a picture acquisition unit configured to acquire a text picture to be recognized, where the text picture is an image area where the text to be recognized is located;
  • a feature extraction unit configured to extract image features in at least two different directions from the text picture
  • a text content identification unit configured to identify the text content included in the text picture based on the extracted image features in the at least two different directions.
  • a text recognition device comprising: a memory and a processor
  • the memory is used to store programs
  • the processor is configured to execute the program to implement each step of the above-mentioned text recognition method.
  • a storage medium on which a computer program is stored, and when the computer program is executed by a processor, each step of the above-mentioned text recognition method is realized.
  • the present application obtains the text picture corresponding to the image area where the text to be recognized is located, extracts image features in at least two different directions from the text picture to be recognized, and then identifies the text content contained in the text picture based on the extracted image features in the at least two different directions. It can be seen that, for the text picture to be recognized, given the diverse directions of its text content, the present application strengthens the direction information captured during image feature extraction, that is, feature extraction is performed on the text picture from two or more different directions, so that the extracted image features contain feature information in multiple directions of the text to be recognized in the text picture. On this basis, the text content contained in the text picture can be identified more accurately based on the extracted image features, improving the accuracy of text recognition.
  • Figure 1 illustrates schematic diagrams of several text images distributed in different directions
  • FIG. 2 is a schematic flow chart of a text recognition method provided in an embodiment of the present application.
  • Fig. 3 illustrates a schematic diagram of a process of rotating an original text image to a horizontal direction
  • Fig. 4 illustrates a schematic diagram of a process of rotating a text image to be placed horizontally
  • Figure 5 illustrates a schematic diagram of a feature extraction process in which two adjacent convolutional layers share a rotation convolution kernel
  • Fig. 6 illustrates a schematic diagram of a recognition network architecture with an encoder-decoder structure
  • FIG. 7 illustrates a schematic diagram of a process of performing a rotation operation on a text image with text flipping
  • FIG. 8 is a schematic structural diagram of an identification processing device provided by an embodiment of the present application.
  • FIG. 9 is a schematic structural diagram of a text recognition device provided by an embodiment of the present application.
  • the solution of this application can be implemented based on a terminal capable of data processing, and the terminal can be a mobile phone, a computer, a server, a cloud, and the like.
  • the text recognition method of the present application may include the following steps:
  • Step S100 acquiring a text image to be recognized.
  • the picture of the text to be recognized is the image area where the text to be recognized is located.
  • the text picture that needs to be recognized can be obtained directly, or the text region detection can be performed on the original picture containing the text to be recognized to obtain the image region where the text to be recognized is located.
  • the text picture acquired in this step may be a text line picture, that is, an image area where a line of text is located.
  • Step S110 extracting image features in at least two different directions from the text image.
  • the direction of the text content of the text picture to be recognized is not fixed.
  • in order to accommodate the recognition of text content in various directions, the direction information captured during image feature extraction is strengthened, that is, feature extraction is performed on the text picture from two or more different directions, so that the extracted image features include feature information in multiple different directions of the text to be recognized in the text picture.
  • Step S120 based on the extracted image features in the at least two different directions, identify the text content contained in the text picture.
  • based on the extracted image features, the text content contained in the text picture can be recognized more accurately, improving the accuracy of text recognition.
  • the text recognition method provided in the embodiment of the present application obtains the text picture corresponding to the image area where the text to be recognized is located, extracts image features in at least two different directions from the text picture to be recognized, and then identifies the text content contained in the text picture based on the extracted image features in the at least two different directions. It can be seen that, for the text picture to be recognized, given the diverse directions of its text content, the present application strengthens the direction information captured during image feature extraction, that is, feature extraction is performed on the text picture from two or more different directions, so that the extracted image features contain feature information in multiple directions of the text to be recognized in the text picture. On this basis, the text content contained in the text picture can be identified more accurately based on the extracted image features, improving the accuracy of text recognition.
  • the process of acquiring the text picture to be recognized in step S100 is further introduced below.
  • the original text picture to be recognized is first obtained, and the original text picture is a picture obtained through text region detection.
  • the original text picture may be a text line picture.
  • the shape of the original text picture may also be different.
  • the original text picture may be a rectangle, a parallelogram or another shape; typically, a rectangular original text picture is selected in this application.
  • the obtained original text picture may be inclined relative to the horizontal direction, that is, none of its sides is parallel to the horizontal direction. As shown in FIG. 3, the original text picture on the left of FIG. 3 is inclined relative to the horizontal direction.
  • the original text picture may be rotated to a horizontal direction as the text picture to be recognized.
  • the rotation direction of the original text picture can be counterclockwise or clockwise, as long as one side of the rotated text picture is parallel to the horizontal direction.
  • after this rotation, image features can be extracted along the four main directions defined by the horizontal and vertical central axes of the text, and the extracted image features are more convenient for subsequent text content recognition.
  • the text picture rotated in the horizontal direction may have two forms, one is that the text picture is placed vertically, and the other is that the text picture is placed horizontally.
  • the specific implementation process may include:
  • the aspect ratio of the original text image in the horizontal direction is further calculated.
  • the aspect ratio threshold may be preset, such as being set to 2 or other optional values.
  • if the calculated aspect ratio exceeds the set threshold, the text picture can be considered to be placed vertically, so the original text picture in the horizontal direction can be further rotated by 90 degrees and used as the text picture to be recognized; if the calculated aspect ratio does not exceed the set threshold, the text picture can be considered to be placed horizontally and can be used directly as the text picture to be recognized without rotation.
  • the above-mentioned process of rotating the original text image in the horizontal direction by 90 degrees may be performed clockwise or counterclockwise. As shown in FIG. 4 , for the text picture on the left side in FIG. 4 , it is placed vertically and can be rotated 90 degrees clockwise.
  • the process of extracting image features in at least two different directions from the text picture in step S110 is introduced below.
  • This embodiment introduces an optional solution for image feature extraction through convolutional networks. Specifically, this application can pre-train a convolutional network for extracting image features.
  • the convolutional network can adopt a Resnet29 network or another form of network with convolutional layers.
  • a convolutional network consists of several convolutional layers.
  • the convolution kernel of the convolution layer is set as a shared rotation convolution kernel.
  • the convolution kernel is the weight matrix, and the convolution operation is performed on the feature map output by the previous convolution layer through the convolution kernel.
  • the so-called shared rotation convolution kernel means keeping the parameters of the convolution kernel unchanged (that is, the weights in the weight matrix remain unchanged) and rotating the convolution kernel at least once in units of 90 degrees; the rotated convolution kernel and the convolution kernel before rotation are shared rotation convolution kernels of each other.
  • the feature map output by each convolutional layer is obtained by fusing at least two feature submaps, and the at least two feature submaps are obtained by convolving the feature map output by the previous convolutional layer with the same convolution kernel before rotation and after at least one rotation.
  • the above-mentioned at least two feature submaps may include:
  • the feature submap obtained by convolving the feature map output by the previous convolutional layer with the same convolution kernel before rotation;
  • the feature submap obtained by convolving the feature map output by the previous convolutional layer with the same convolution kernel after it has been rotated by 90, 180 and/or 270 degrees in a set direction.
  • the convolution kernel performs a convolution operation on the feature map output by the previous convolution layer before rotation, and a corresponding feature submap can be obtained.
  • on this basis, the convolution kernel can be rotated by 90, 180 and/or 270 degrees in the above-mentioned manner, and after each rotation the rotated convolution kernel is used to perform a convolution operation on the feature map output by the previous convolutional layer, yielding one additional feature submap. For example, rotating the convolution kernel once gives one additional feature submap, rotating it twice gives two additional feature submaps, and rotating it three times gives three additional feature submaps.
  • the feature submaps obtained before and after the rotation of the convolution kernel are fused to obtain the output of the current convolution layer.
  • FIG. 5 illustrates a schematic diagram of a feature extraction process in which two adjacent convolutional layers share a rotation convolution kernel.
  • the convolution kernel may be rotated in this embodiment, for example, rotated 90 degrees, 180 degrees and 270 degrees counterclockwise, respectively.
  • each convolution kernel before and after rotation is used to perform a convolution operation on the feature map output by the previous layer to obtain feature information in four main directions.
  • the feature submaps obtained after the convolution operation of each convolution kernel are F_{i,0}, F_{i,90}, F_{i,180} and F_{i,270}, respectively.
  • the output of the current layer is the fusion F_i of the above four feature submaps.
  • the calculation formula of each feature is as follows:
  • F_i = cat(F_{i,0}, F_{i,90}, F_{i,180}, F_{i,270})
  • cat() represents the concatenation (splicing) operation of features.
  • the size of the fused output feature F_i is 4C'*H'*W'.
  • C', H' and W' are, respectively, the number of channels, the height and the width of the features of the current convolutional layer.
  • the fused output feature F_i is input to the next convolutional layer for processing, and so on.
  • the features extracted by each convolutional layer contain different direction information, which strengthens the directionality of the extracted image features.
  • the features output by the last convolutional layer are used as the image feature F extracted by the convolutional network.
  • when extracting image features in at least two different directions from a text picture, the text picture can be input into the convolutional network.
  • the convolutional network is used to extract image features in at least two different directions of the text picture, wherein the feature map output by each convolutional layer in the convolutional network is obtained by fusing at least two feature submaps, and the at least two feature submaps are obtained by convolving the feature map output by the previous convolutional layer with the same convolution kernel before rotation and after at least one rotation.
  • two optional implementation architectures of the above-mentioned convolutional network and an implementation process of image feature extraction using the convolutional network are further introduced.
  • the convolution kernel of each convolution layer in the convolution network may include an original convolution kernel and a convolution kernel after at least one rotation of the original convolution kernel.
  • the process of using a convolutional network to extract image features in at least two different directions of the text picture may include:
  • since the convolution kernel of each convolutional layer includes the original convolution kernel and the convolution kernel after at least one rotation, the feature submaps extracted respectively by the original convolution kernel and by its rotated convolution kernels can be obtained.
  • the fusion of the feature submaps extracted by the original convolution kernel and its rotated convolution kernels can follow the earlier introduction, for example, the feature submaps are concatenated along the channel dimension.
  • the feature map output by the last convolutional layer of the convolutional network is used as the image feature of the text picture.
  • the convolution kernel of each convolution layer in the convolutional network includes the original convolution kernel, but does not include the rotated convolution kernel.
  • the process of using a convolutional network to extract image features in at least two different directions of the text picture may include:
  • the feature map output by the last convolutional layer of the convolutional network is used as the image feature of the text picture.
  • in the former architecture, the convolutional network is pre-configured with the convolution kernels both before and after rotation, so that each convolution kernel can be used directly for feature extraction when extracting image features; in the latter architecture, only the convolution kernel before rotation is configured, so when extracting image features the convolution kernel needs to be rotated at least once before the kernels before and after rotation can be used to extract features. Both implementations can realize the extraction of multi-directional image features, and technicians can choose between them according to actual needs.
  • the embodiment of the present application further introduces step S120, the process of identifying the text content contained in the text picture based on the extracted image features in the at least two different directions.
  • a neural network model can be selected to handle the text recognition task, that is, a recognition network can be pre-trained, and the recognition network and the convolutional network together form a text recognition model. Specifically, the output of the convolutional network is used as the input of the recognition network, and the convolutional network and the recognition network are jointly trained.
  • the recognition network can choose a variety of neural network architectures, for example, an Encoder-Decoder encoding and decoding architecture can be used, as shown in FIG. 6 .
  • the encoder can adopt a bidirectional LSTM (Long Short-Term Memory) structure, which takes the image feature F output by the convolutional network in the previous step as input and outputs the hidden state h_i of each frame of the encoder.
  • the decoder can adopt a GRU (Gated Recurrent Unit) or LSTM structure.
  • an attention mechanism can be used to calculate the correlation between the decoder hidden state s_t and the hidden state h_i of each frame of the encoder, so as to obtain the context feature vector c_t; in the corresponding calculation, o represents the dot multiplication operation and T represents the length of the encoder (a sketch of one concrete form of this encoder-decoder computation is given after this list).
  • during the above rotation processing, the text in the final text picture may also end up reversed (flipped).
  • to handle this, the text picture is used as a forward text picture, and the forward text picture is rotated by 180 degrees to obtain a reverse text picture.
  • of the text content recognized from the forward text picture and the text content recognized from the reverse text picture, the one with higher confidence is taken as the final recognition result.
  • the text recognition device provided by the embodiment of the present application is described below; the text recognition device described below and the text recognition method described above may be referred to in correspondence with each other.
  • FIG. 8 is a schematic structural diagram of a text recognition device disclosed in an embodiment of the present application.
  • the device may include:
  • the picture acquisition unit 11 is used to acquire the text picture to be recognized, and the text picture is the image area where the text to be recognized is located;
  • a feature extraction unit 12 configured to extract image features in at least two different directions for the text picture
  • the text content identification unit 13 is configured to identify the text content included in the text picture based on the extracted image features in the at least two different directions.
  • the above-mentioned feature extraction unit may include:
  • a convolutional network processing unit configured to input the text picture into a pre-built convolutional network and to use the convolutional network to extract image features in at least two different directions of the text picture, wherein the feature map output by each convolutional layer in the convolutional network is obtained by fusing at least two feature submaps, and the at least two feature submaps are obtained by convolving the feature map output by the previous convolutional layer with the same convolution kernel before rotation and after at least one rotation.
  • the embodiment of the present application provides two optional implementation structures of the convolutional network processing unit, which are as follows:
  • the first type, the convolutional network processing unit includes:
  • the first convolution operation unit is used to use the convolution kernel of each convolution layer in the convolution network to perform a convolution operation on the feature map output by the previous convolution layer to obtain the feature submap extracted by each convolution kernel.
  • the convolution kernel of each convolution layer includes the original convolution kernel and its convolution kernel after at least one rotation;
  • the first feature fusion unit is used to fuse the feature submaps extracted by the original convolution kernel and the rotated convolution kernels, and input the fused feature map into the next convolution layer;
  • the first convolutional output unit is used to use the feature map output by the last convolutional layer of the convolutional network as the image feature of the text picture.
  • the second type, the convolutional network processing unit includes:
  • a convolution kernel rotation unit used to rotate the convolution kernel of each convolution layer in the convolution network at least once
  • the second convolution operation unit is used to perform convolution operation on the feature map output by the previous convolution layer by using the convolution kernel before and after rotation, and obtain the feature submap extracted by each convolution kernel before and after rotation ;
  • the second feature fusion unit is used to fuse the feature submaps extracted by the convolution kernel and the rotated convolution kernels, and input the fused feature map into the next convolution layer;
  • the second convolution output unit is used to use the feature map output by the last convolution layer of the convolution network as the image feature of the text picture.
  • the above at least two feature subgraphs may include:
  • the feature submap obtained by convolving the feature map output by the previous convolution layer with the same convolution kernel before rotation;
  • the feature submap obtained by convolving the feature map output by the previous convolutional layer with the rotated convolution kernel.
  • the above-mentioned text content identification unit may include:
  • a recognition network processing unit configured to input the extracted image features in at least two different directions into a pre-built recognition network to obtain the text content contained in the text picture output by the recognition network; wherein, the recognition network and The convolutional network forms a text recognition model, and the text recognition model is trained by using sample image training data labeled with text content recognition results.
  • the above image acquisition unit may include:
  • An original picture acquisition unit configured to acquire an original text picture to be recognized, the original text picture being a rectangle;
  • the first rotating unit is configured to rotate the original text picture to the horizontal direction as the text picture to be recognized if it is detected that the original text picture is tilted relative to the horizontal direction.
  • the above picture acquisition unit may also include:
  • an aspect ratio calculation unit configured to calculate the aspect ratio of the original text image in the horizontal direction after the processing by the first rotation unit
  • the second rotation unit is configured to rotate the original text picture in the horizontal direction by 90 degrees as the text picture to be recognized if it is determined that the aspect ratio exceeds the set threshold.
  • the device of the present application may also include:
  • the third rotation unit is configured to use the text picture as a forward text picture and rotate the forward text picture by 180 degrees to obtain a reverse text picture before inputting the text picture into the pre-built convolutional network.
  • the above-mentioned convolutional network processing unit may include:
  • a forward and reverse text picture input unit, used to input the forward text picture and the reverse text picture respectively into the convolutional network in the text recognition model, to obtain the text content contained in the forward text picture and its confidence, and the text content contained in the reverse text picture and its confidence, as output by the text recognition model;
  • a confidence selection unit, used to take, of the text content contained in the forward text picture and the text content contained in the reverse text picture, the one with higher confidence as the final recognition result.
  • FIG. 9 shows a block diagram of the hardware structure of the text recognition device.
  • the hardware structure of the text recognition device may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4;
  • the numbers of the processor 1, the communication interface 2, the memory 3 and the communication bus 4 are each at least one, and the processor 1, the communication interface 2 and the memory 3 communicate with one another through the communication bus 4;
  • the processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention;
  • the memory 3 may include a high-speed RAM memory, and may also include a non-volatile memory, such as at least one magnetic disk memory;
  • the memory stores a program
  • the processor can call the program stored in the memory, and the program is used for:
  • the embodiment of the present application also provides a storage medium, which can store a program suitable for execution by a processor, and the program is used for:
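As an illustration of the encoder-decoder recognition network with attention described in the items above, the following is a minimal PyTorch sketch: a bidirectional LSTM encoder produces the per-frame hidden states h_i, and at each decoding step an attention weight over the T encoder frames yields a context vector c_t that is fed to a GRU decoder cell. The dot-product form of the attention score, the zero initialization, and all class and variable names are assumptions of this sketch rather than details taken from the patent.

```python
# Minimal sketch of an attention-based encoder-decoder recognition network:
# bidirectional LSTM encoder -> per-frame states h_i; GRU decoder cell with a
# context vector c_t computed by attention over the encoder states.
import torch
import torch.nn as nn

class AttentionRecognizer(nn.Module):
    def __init__(self, feat_dim: int, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.decoder = nn.GRUCell(2 * hidden_dim + vocab_size, 2 * hidden_dim)
        self.classifier = nn.Linear(2 * hidden_dim, vocab_size)

    def forward(self, image_features: torch.Tensor, max_steps: int = 25) -> torch.Tensor:
        # image_features: (batch, T, feat_dim), the frames of the feature map F.
        h, _ = self.encoder(image_features)           # h_i: (batch, T, 2*hidden_dim)
        s_t = h.new_zeros(h.size(0), h.size(2))       # decoder hidden state s_t
        y_t = h.new_zeros(h.size(0), self.classifier.out_features)
        outputs = []
        for _ in range(max_steps):
            # Attention: correlate s_t with every encoder state h_i, then form c_t.
            scores = torch.bmm(h, s_t.unsqueeze(2)).squeeze(2)   # (batch, T)
            alpha = torch.softmax(scores, dim=1)
            c_t = torch.bmm(alpha.unsqueeze(1), h).squeeze(1)    # context vector c_t
            s_t = self.decoder(torch.cat([c_t, y_t], dim=1), s_t)
            y_t = self.classifier(s_t)                           # per-step character scores
            outputs.append(y_t)
        return torch.stack(outputs, dim=1)            # (batch, max_steps, vocab_size)
```

In training, such a recognition network and the convolutional feature extractor would be trained jointly on sample pictures labeled with text content recognition results, as stated above; the sketch omits training details such as feeding the previous ground-truth character to the decoder.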

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)
  • Character Input (AREA)

Abstract

Disclosed in the present application are a text identification method, apparatus and device, and a storage medium. The present application comprises: obtaining a text picture corresponding to an image area where text to be identified is located, further extracting image features in at least two different directions from the text picture to be identified, and further identifying text content contained in the text picture on the basis of the extracted image features in the at least two different directions. Hence, for the text picture to be identified, in view of the diversification in text content directions, the extracted direction information is enhanced during image feature extraction in the present application, that is, feature extraction is performed on the text picture from two or more different directions, so that the extracted image features comprise feature information of the text in the text picture in a plurality of directions. On this basis, the text content comprised in the text picture can be identified more accurately from the extracted image features, and the accuracy of text identification is improved.

Description

Text recognition method, device, equipment and storage medium
This application claims priority to the Chinese patent application No. 202110666915.6, filed with the China Patent Office on June 16, 2021 and entitled "Text recognition method, device, equipment and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the technical field of natural language processing, and more specifically, to a text recognition method, device, equipment and storage medium.
Background Art
With the development of text recognition technology, text recognition is increasingly widely used in everyday life, for example road sign recognition in autonomous driving, photo translation, and document scanning and recognition.
In real life, text regions in scene pictures are distributed in a variety of directions, as shown in Figure 1, including horizontal text, oblique text, vertical text and so on. The diverse directionality of the text pictures to be recognized brings greater challenges to text recognition, and how to accurately perform text recognition on such pictures has become a problem that urgently needs to be solved in the industry.
Summary of the Invention
In view of the above problems, the present application is proposed to provide a text recognition method, device, equipment and storage medium, so as to accurately perform text recognition on text pictures to be recognized whose directions are diverse. The specific scheme is as follows:
A text recognition method, comprising:
acquiring a text picture to be recognized, where the text picture is the image area where the text to be recognized is located;
extracting image features in at least two different directions from the text picture;
identifying the text content contained in the text picture based on the extracted image features in the at least two different directions.
Preferably, said extracting image features in at least two different directions from the text picture includes:
inputting the text picture into a pre-built convolutional network;
using the convolutional network to extract image features in at least two different directions of the text picture, wherein the feature map output by each convolutional layer in the convolutional network is formed by fusing at least two feature submaps, and the at least two feature submaps are obtained by performing a convolution operation on the feature map output by the previous convolutional layer with the same convolution kernel before rotation and after at least one rotation.
Preferably, said using the convolutional network to extract image features in at least two different directions of the text picture includes:
performing a convolution operation on the feature map output by the previous convolutional layer with the convolution kernel of each convolutional layer in the convolutional network, to obtain the feature submap extracted by each convolution kernel, where the convolution kernel of each convolutional layer includes the original convolution kernel and the convolution kernel obtained from it by at least one rotation;
fusing the feature submaps extracted by the original convolution kernel and by each of its rotated convolution kernels, and inputting the fused feature map into the next convolutional layer;
taking the feature map output by the last convolutional layer of the convolutional network as the image feature of the text picture.
Preferably, said using the convolutional network to extract image features in at least two different directions of the text picture includes:
rotating the convolution kernel of each convolutional layer in the convolutional network at least once, and performing a convolution operation on the feature map output by the previous convolutional layer with the convolution kernels before and after rotation, to obtain the feature submap extracted by each convolution kernel before and after rotation;
fusing the feature submaps extracted by the convolution kernel and by each of its rotated convolution kernels, and inputting the fused feature map into the next convolutional layer;
taking the feature map output by the last convolutional layer of the convolutional network as the image feature of the text picture.
Preferably, the at least two feature submaps include:
the feature submap obtained by convolving the feature map output by the previous convolutional layer with the same convolution kernel before rotation; and
the feature submap obtained by convolving the feature map output by the previous convolutional layer with the same convolution kernel after it has been rotated by 90 degrees, 180 degrees and/or 270 degrees in a set direction.
Preferably, said identifying the text content contained in the text picture based on the extracted image features in the at least two different directions includes:
inputting the extracted image features in the at least two different directions into a pre-built recognition network, to obtain the text content contained in the text picture as output by the recognition network;
wherein the recognition network and the convolutional network form a text recognition model, and the text recognition model is trained with sample picture training data labeled with text content recognition results.
Preferably, said acquiring the text picture to be recognized includes:
acquiring an original text picture to be recognized;
if it is detected that the original text picture is tilted relative to the horizontal direction, rotating the original text picture to the horizontal direction as the text picture to be recognized.
Preferably, after said rotating the original text picture to the horizontal direction, the method further includes:
calculating the aspect ratio of the original text picture in the horizontal direction;
if it is determined that the aspect ratio exceeds a set threshold, rotating the original text picture in the horizontal direction by 90 degrees as the text picture to be recognized.
Preferably, before inputting the text picture into the pre-built convolutional network, the method further includes:
taking the text picture as a forward text picture, and rotating the forward text picture by 180 degrees to obtain a reverse text picture;
then said inputting the text picture into the pre-built convolutional network includes:
inputting the forward text picture and the reverse text picture respectively into the convolutional network in the text recognition model, to obtain the text content contained in the forward text picture and its confidence, and the text content contained in the reverse text picture and its confidence, as output by the text recognition model;
taking, of the text content contained in the forward text picture and the text content contained in the reverse text picture, the one with higher confidence as the final recognition result.
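Only as an illustration of the forward/reverse strategy in the preceding paragraphs, the following Python sketch assumes a recognize(picture) helper that runs the text recognition model and returns the recognized text together with a confidence score; the helper and its return convention are assumptions of this sketch, not defined by the patent.

```python
# Illustrative sketch: recognize both the forward text picture and the picture
# rotated by 180 degrees, and keep the result with the higher confidence.
import numpy as np

def recognize_with_flip(picture: np.ndarray, recognize) -> str:
    forward_text, forward_conf = recognize(picture)
    reverse_picture = np.rot90(picture, k=2)   # rotate the forward picture by 180 degrees
    reverse_text, reverse_conf = recognize(reverse_picture)
    # Take the text content with the higher confidence as the final recognition result.
    return forward_text if forward_conf >= reverse_conf else reverse_text
```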
A text recognition apparatus, comprising:
a picture acquisition unit, configured to acquire a text picture to be recognized, where the text picture is the image area where the text to be recognized is located;
a feature extraction unit, configured to extract image features in at least two different directions from the text picture;
a text content recognition unit, configured to identify the text content contained in the text picture based on the extracted image features in the at least two different directions.
A text recognition device, comprising: a memory and a processor;
the memory is used to store a program;
the processor is configured to execute the program to implement each step of the text recognition method described above.
A storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements each step of the text recognition method described above.
With the above technical solution, the present application obtains the text picture corresponding to the image area where the text to be recognized is located, extracts image features in at least two different directions from the text picture to be recognized, and then identifies the text content contained in the text picture based on the extracted image features in the at least two different directions. It can be seen that, for the text picture to be recognized, given the diverse directions of its text content, the present application strengthens the direction information captured during image feature extraction, that is, feature extraction is performed on the text picture from two or more different directions, so that the extracted image features contain feature information in multiple directions of the text to be recognized in the text picture. On this basis, the text content contained in the text picture can be identified more accurately based on the extracted image features, which improves the accuracy of text recognition.
Description of the Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered as limiting the application. Throughout the drawings, the same reference numerals denote the same components. In the drawings:
Figure 1 illustrates schematic diagrams of several text pictures distributed in different directions;
Figure 2 is a schematic flow chart of a text recognition method provided in an embodiment of the present application;
Figure 3 illustrates a schematic diagram of a process of rotating an original text picture to the horizontal direction;
Figure 4 illustrates a schematic diagram of a process of rotating a text picture to a horizontal placement;
Figure 5 illustrates a schematic diagram of a feature extraction process in which two adjacent convolutional layers share a rotation convolution kernel;
Figure 6 illustrates a schematic diagram of a recognition network architecture with an encoder-decoder structure;
Figure 7 illustrates a schematic diagram of a process of performing a rotation operation on a text picture whose text is flipped;
Figure 8 is a schematic structural diagram of a recognition processing apparatus provided by an embodiment of the present application;
Figure 9 is a schematic structural diagram of a text recognition device provided by an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments of the present application. Apparently, the described embodiments are only some of the embodiments of the present application, not all of them. Based on the embodiments in the present application, all other embodiments obtained by persons of ordinary skill in the art without creative effort fall within the scope of protection of the present application.
The solution of the present application can be implemented on a terminal with data processing capability, and the terminal can be a mobile phone, a computer, a server, a cloud, and the like.
Next, with reference to Figure 2, the text recognition method of the present application may include the following steps:
Step S100: acquiring a text picture to be recognized.
Specifically, the text picture to be recognized is the image area where the text to be recognized is located. In this step, a text picture that needs text recognition can be obtained directly, or text region detection can be performed on an original picture containing the text to be recognized to obtain the image area where the text to be recognized is located.
Further, in order to facilitate text recognition, the text picture acquired in this step may be a text line picture, that is, the image area where one line of text is located.
Step S110: extracting image features in at least two different directions from the text picture.
Specifically, the direction of the text content in the text picture to be recognized is not fixed. In order to accommodate the recognition of text content in various directions, this step strengthens the direction information captured during image feature extraction; that is, feature extraction is performed on the text picture from two or more different directions, so that the extracted image features contain feature information in multiple different directions of the text to be recognized in the text picture. Of course, the image features can be extracted in many different ways, which is not strictly limited in this step.
Step S120: identifying the text content contained in the text picture based on the extracted image features in the at least two different directions.
Specifically, after the image features in at least two different directions of the text picture to be recognized are extracted, the text content contained in the text picture can be recognized more accurately based on the extracted image features, which improves the accuracy of text recognition.
The text recognition method provided in the embodiment of the present application obtains the text picture corresponding to the image area where the text to be recognized is located, extracts image features in at least two different directions from the text picture to be recognized, and then identifies the text content contained in the text picture based on the extracted image features in the at least two different directions. It can be seen that, for the text picture to be recognized, given the diverse directions of its text content, the present application strengthens the direction information captured during image feature extraction, that is, feature extraction is performed on the text picture from two or more different directions, so that the extracted image features contain feature information in multiple directions of the text to be recognized in the text picture. On this basis, the text content contained in the text picture can be identified more accurately based on the extracted image features, which improves the accuracy of text recognition.
In some embodiments of the present application, the process of acquiring the text picture to be recognized in the foregoing step S100 is further introduced.
In this embodiment, the original text picture to be recognized is obtained first, and the original text picture is a picture obtained through text region detection. In general, the original text picture may be a text line picture. Depending on the detection method used for text region detection, the shape of the original text picture may differ; for example, the original text picture may be a rectangle, a parallelogram or another shape. Typically, a rectangular original text picture can be selected in the present application.
Since the position, orientation and text direction of the text region may differ in different scenes, the obtained original text picture may be inclined relative to the horizontal direction, that is, none of its sides is parallel to the horizontal direction. As shown in Figure 3, the original text picture on the left of Figure 3 is inclined relative to the horizontal direction.
In order to better support the extraction of image features in multiple different directions from the text picture in subsequent steps, in this embodiment the original text picture may be rotated to the horizontal direction and used as the text picture to be recognized.
The original text picture may be rotated counterclockwise or clockwise, as long as one side of the rotated text picture is parallel to the horizontal direction.
It can be understood that, after the above rotation processing, when multiple convolution kernels are used to extract image features, the features can be extracted along the four main directions defined by the horizontal and vertical central axes of the text, and the extracted image features are more convenient for subsequent text content recognition.
Furthermore, the text picture rotated to the horizontal direction may take two forms: either the text picture is placed vertically, or the text picture is placed horizontally. A vertically placed text picture is more difficult to process in the subsequent network; at the same time, in order to keep the data uniform, in this embodiment all text pictures can be adjusted to horizontal placement. The specific implementation process may include:
after the original text picture is rotated to the horizontal direction as described above, the aspect ratio of the original text picture in the horizontal direction is further calculated.
Specifically, by calculating the aspect ratio, it can be judged whether the text picture is placed vertically or horizontally. For example, an aspect ratio threshold can be preset, such as 2 or another value. When the calculated aspect ratio exceeds the set threshold, the text picture can be considered to be placed vertically, so the original text picture in the horizontal direction can be further rotated by 90 degrees and used as the text picture to be recognized; if the calculated aspect ratio does not exceed the set threshold, the text picture can be considered to be placed horizontally and is used directly as the text picture to be recognized without rotation.
The above rotation of the original text picture in the horizontal direction by 90 degrees may be performed clockwise or counterclockwise. As shown in Figure 4, the text picture on the left of Figure 4 is placed vertically and can be rotated 90 degrees clockwise.
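The following is a minimal preprocessing sketch of the rotation steps described above, written in Python with OpenCV purely as an illustration: the function name, the use of OpenCV, and the assumption that the tilt angle of the detected text region is already known are not part of the patent; the aspect-ratio threshold of 2 follows the example given above, and the canvas is not expanded during the deskew rotation for brevity.

```python
# Illustrative sketch (not the patent's reference implementation): rotate a tilted
# text picture to the horizontal direction, then use the height/width ratio to
# decide whether a further 90-degree rotation is needed.
import cv2
import numpy as np

def normalize_text_picture(image: np.ndarray, tilt_angle_deg: float,
                           aspect_ratio_threshold: float = 2.0) -> np.ndarray:
    # Step 1: rotate the original text picture to the horizontal direction.
    h, w = image.shape[:2]
    matrix = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), tilt_angle_deg, 1.0)
    horizontal = cv2.warpAffine(image, matrix, (w, h))

    # Step 2: if the height/width ratio exceeds the set threshold (e.g. 2), the
    # text picture is considered vertically placed and is rotated by 90 degrees.
    h, w = horizontal.shape[:2]
    if h / w > aspect_ratio_threshold:
        horizontal = cv2.rotate(horizontal, cv2.ROTATE_90_CLOCKWISE)
    return horizontal
```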
In some embodiments of the present application, the process of extracting image features in at least two different directions from the text picture in step S110 is introduced.
This embodiment introduces an optional scheme for image feature extraction through a convolutional network. Specifically, the present application can pre-train a convolutional network for extracting image features.
The convolutional network can adopt a Resnet29 network or another form of network with convolutional layers.
卷积网络包括若干个卷积层。为了实现对文本图片提取至少两个不同方向上的图像特征,本实施例中将卷积层的卷积核设置为共享旋转卷积核。这里,卷积核即权值矩阵,通过卷积核对前一卷积层输出的特征图进行卷积操作。所谓的共享旋转卷积核即,保持卷积核的参数不变(即权值矩阵内各权值不变),将卷积核以90度为单位进行至少一次旋转,旋转后的卷积核和旋转前的卷积核互为共享旋转卷积核。A convolutional network consists of several convolutional layers. In order to realize the extraction of image features in at least two different directions from the text picture, in this embodiment, the convolution kernel of the convolution layer is set as a shared rotation convolution kernel. Here, the convolution kernel is the weight matrix, and the convolution operation is performed on the feature map output by the previous convolution layer through the convolution kernel. The so-called shared rotation convolution kernel is to keep the parameters of the convolution kernel unchanged (that is, the weights in the weight matrix remain unchanged), and rotate the convolution kernel at least once in units of 90 degrees. The rotated convolution kernel The convolution kernel and the convolution kernel before rotation are shared rotation convolution kernels.
通过共享旋转卷积核,可以获取文本图片的不同方向上的特征信息。By sharing the rotation convolution kernel, feature information in different directions of the text image can be obtained.
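The weight sharing across rotations can be pictured with the short PyTorch-style snippet below (a sketch only; the tensor shape is an arbitrary example). All four orientations are views of one and the same parameter tensor, so no additional weights are introduced and gradients from every orientation flow back into the single shared kernel.

import torch

# One shared weight tensor of shape (out_channels, in_channels, 3, 3).
weight = torch.randn(8, 4, 3, 3, requires_grad=True)

# The same parameters viewed at 0/90/180/270 degrees over the spatial dimensions.
rotated_weights = [torch.rot90(weight, k=k, dims=(2, 3)) for k in range(4)]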
The feature map output by each convolutional layer is formed by fusing at least two feature submaps, where the at least two feature submaps are obtained by performing, with the same convolution kernel before rotation and after at least one rotation, a convolution operation on the feature map output by the previous convolutional layer.

Specifically, the at least two feature submaps may include:

a feature submap obtained by performing a convolution operation on the feature map output by the previous convolutional layer with the same convolution kernel before rotation; and

feature submaps obtained by rotating the same convolution kernel by 90 degrees, 180 degrees and/or 270 degrees in a set direction (e.g., clockwise or counterclockwise) and then performing a convolution operation on the feature map output by the previous convolutional layer with the rotated kernel.

It can be understood that performing a convolution operation on the feature map output by the previous convolutional layer with the kernel before rotation yields one corresponding feature submap. On this basis, the kernel may be rotated by 90 degrees, 180 degrees and/or 270 degrees in the manner described above, and after each rotation a convolution operation is performed on the feature map output by the previous convolutional layer with the rotated kernel, yielding one additional feature submap. For example, rotating the kernel once yields one additional feature submap, rotating it twice yields two additional feature submaps, and rotating it three times yields three additional feature submaps.

The feature submaps obtained before and after rotation of the convolution kernel are fused to obtain the output of the current convolutional layer.

As shown in FIG. 5, FIG. 5 illustrates a schematic diagram of a feature extraction process in which two adjacent convolutional layers share rotation convolution kernels.

Define the output feature of the previous layer of the convolutional network as F_{i-1}, with size C*H*W, where C is the number of channels, H is the height and W is the width of the feature.

The feature F_{i-1} is taken as the input of the current convolutional layer, and the convolution kernel size is set to 3*3 (FIG. 5 is only an example; the kernel may also have other sizes, which is not strictly limited in the present application).

To enable the convolution kernel to capture text features in different directions, in this embodiment the kernel may be rotated, for example counterclockwise by 90 degrees, 180 degrees and 270 degrees, respectively. The kernels before and after rotation are then each used to perform a convolution operation on the feature map output by the previous layer, so as to obtain feature information in the four main directions. The feature submaps obtained after convolution with the respective kernels are F_{i,0}, F_{i,90}, F_{i,180} and F_{i,270}. The output of the current layer is the fusion F_i of these four feature submaps. Each feature is computed as follows:
F_{i,0} = conv_0(F_{i-1})

F_{i,90} = conv_90(F_{i-1})

F_{i,180} = conv_180(F_{i-1})

F_{i,270} = conv_270(F_{i-1})

F_i = cat(F_{i,0}, F_{i,90}, F_{i,180}, F_{i,270})
where cat() denotes the concatenation of features.

The fused output feature F_i has size 4C′*H′*W′, where C′, H′ and W′ are respectively the number of channels, the height and the width of the features of the current convolutional layer.

The fused output feature F_i is input to the next convolutional layer for processing, and so on; the features extracted by each convolutional layer thus contain information from different directions, which strengthens the directionality of the extracted image features. The features output by the last convolutional layer serve as the image features F extracted by the convolutional network.
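A minimal sketch of one such layer is given below, assuming a PyTorch-style implementation with a 3*3 kernel shared across the four orientations of FIG. 5; the ReLU nonlinearity and the weight initialization are assumptions added for the example and are not specified in the text.

import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedRotationConv(nn.Module):
    """One convolutional layer whose 3x3 kernel is shared across 0/90/180/270 degrees;
    the four feature submaps F_{i,0..270} are concatenated along the channel dimension."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_channels, in_channels, 3, 3))
        nn.init.kaiming_normal_(self.weight)

    def forward(self, prev_feature_map: torch.Tensor) -> torch.Tensor:
        submaps = []
        for k in range(4):  # 0, 90, 180, 270 degrees, counterclockwise as in FIG. 5
            rotated_kernel = torch.rot90(self.weight, k=k, dims=(2, 3))
            submaps.append(F.conv2d(prev_feature_map, rotated_kernel, padding=1))
        # F_i = cat(F_{i,0}, F_{i,90}, F_{i,180}, F_{i,270})
        return torch.relu(torch.cat(submaps, dim=1))

For an input F_{i-1} of shape (N, C, H, W), the output has 4*out_channels channels, matching the 4C′*H′*W′ size noted above.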
Based on the convolutional network introduced above, when extracting image features in at least two different directions from the text picture, the text picture may be input into the convolutional network, and the convolutional network is used to extract image features of the text picture in at least two different directions, where the feature map output by each convolutional layer of the convolutional network is formed by fusing at least two feature submaps, and the at least two feature submaps are obtained by performing, with the same convolution kernel before rotation and after at least one rotation, a convolution operation on the feature map output by the previous convolutional layer.

It can be understood that each rotation of the convolution kernel allows image features in one direction to be extracted; with the rotation scheme illustrated in FIG. 5, image features in four directions can be extracted.

Of course, compared with performing no rotation of the convolution kernel at all, each additional rotation of the kernel allows image features in one more direction to be extracted, which in turn provides more accurate image features for subsequent text recognition and improves the accuracy of text recognition.

Some embodiments of the present application further describe two optional implementation architectures of the above convolutional network, as well as the process of extracting image features with the convolutional network.

In one optional manner, the convolution kernels of each convolutional layer of the convolutional network may include an original convolution kernel and the kernels obtained by rotating the original convolution kernel at least once.

On this basis, the process of extracting image features of the text picture in at least two different directions with the convolutional network may include:

S1. Performing a convolution operation on the feature map output by the previous convolutional layer with the convolution kernels of each convolutional layer of the convolutional network, to obtain the feature submap extracted by each convolution kernel.

Since the convolution kernels of each convolutional layer include the original convolution kernel and its kernels after at least one rotation, the feature submaps respectively extracted by the original convolution kernel and by each of its rotated kernels can be obtained.

S2. Fusing the feature submaps extracted by the original convolution kernel and by each of its rotated kernels, and inputting the fused feature map into the next convolutional layer.

Specifically, the fusion of the feature submaps extracted by the original convolution kernel and its rotated kernels may follow the foregoing description, for example by concatenating the feature submaps along the channel dimension.

S3. Taking the feature map output by the last convolutional layer of the convolutional network as the image features of the text picture.
In another optional manner, the convolution kernels of each convolutional layer of the convolutional network include only the original convolution kernels and do not include rotated kernels.

On this basis, the process of extracting image features of the text picture in at least two different directions with the convolutional network may include:

S1. Rotating the convolution kernel of each convolutional layer of the convolutional network at least once, and performing a convolution operation on the feature map output by the previous convolutional layer with the kernels before and after rotation, to obtain the feature submap extracted by each kernel before and after rotation.

It can be understood that, when the convolutional layers of the convolutional network contain only the original convolution kernels, in order to extract image features from multiple different directions, the convolution kernel of each convolutional layer first needs to be rotated at least once during feature extraction, and then a convolution operation is performed on the feature map output by the previous convolutional layer with the kernels before and after rotation, to obtain the feature submaps extracted by each kernel before and after rotation.

S2. Fusing the feature submaps extracted by the convolution kernel and by each of its rotated kernels, and inputting the fused feature map into the next convolutional layer.

S3. Taking the feature map output by the last convolutional layer of the convolutional network as the image features of the text picture.

Comparing the two convolutional network architectures, in the former the kernels before and after rotation are configured in advance, so that during image feature extraction each kernel can be used directly for feature extraction; in the latter only the kernels before rotation are configured, so that during image feature extraction the kernel first needs to be rotated at least once before the kernels before and after rotation can be used for feature extraction. Both implementations can extract multi-directional image features, and the choice may be made by a person skilled in the art according to actual needs.
On the basis of extracting image features in at least two different directions from the text picture with the convolutional network as described in the foregoing embodiments, the embodiments of the present application further describe step S120, namely the process of recognizing the text content contained in the text picture based on the extracted image features in the at least two different directions.

In this embodiment, a neural network model may be used to handle the text recognition task; that is, a recognition network may be trained in advance, and the recognition network and the convolutional network together form the text recognition model. Specifically, the output of the convolutional network serves as the input of the recognition network, and the convolutional network and the recognition network are trained jointly.

The text recognition model is trained with sample picture training data annotated with text content recognition results.
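The joint training can be sketched as follows; this is a minimal illustration under assumptions (a cross-entropy objective, the Adam optimizer, and the hypothetical helpers build_conv_net, build_recognition_net and dataloader), since the text only states that the two networks are trained jointly on annotated sample pictures.

import torch
import torch.nn as nn

# Hypothetical builders and data loader standing in for the networks described above;
# `dataloader` yields (picture, target_token_ids) pairs from the annotated samples.
conv_net, recognition_net = build_conv_net(), build_recognition_net()
params = list(conv_net.parameters()) + list(recognition_net.parameters())
optimizer = torch.optim.Adam(params, lr=1e-4)
criterion = nn.CrossEntropyLoss()

for picture, target in dataloader:
    image_features = conv_net(picture)        # multi-direction image features F
    logits = recognition_net(image_features)  # (batch, seq_len, vocab_size)
    loss = criterion(logits.flatten(0, 1), target.flatten())
    optimizer.zero_grad()
    loss.backward()                           # gradients flow through both networks jointly
    optimizer.step()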
On this basis, by inputting the text picture to be recognized into the convolutional network of the text recognition model, image features in at least two different directions can be extracted; the extracted image features are then input into the recognition network, and the recognition network outputs the text content contained in the text picture.

The recognition network may adopt any of various neural network architectures, for example an Encoder-Decoder architecture, as shown in FIG. 6.

The encoder may adopt a bidirectional LSTM (Long Short-Term Memory) structure, which takes the image features F output by the convolutional network in the previous step as input and outputs the hidden state h_i of each frame of the encoder.

The decoder may adopt a GRU (Gated Recurrent Unit) or LSTM structure. For the decoder hidden state s_t at the current time step, an attention mechanism may be used to compute the correlation between s_t and each encoder frame hidden state h_i, so as to obtain the context feature vector c_t. The computation is as follows:
e ti=o(s t,h i) e ti =o(s t ,h i )
Figure PCTCN2021139972-appb-000001
Figure PCTCN2021139972-appb-000001
Figure PCTCN2021139972-appb-000002
Figure PCTCN2021139972-appb-000002
where o denotes the dot-product operation and T denotes the encoder length.

Finally, the text prediction y_t of the decoder at the current time step is obtained from the current hidden state s_t and the context feature vector c_t, which are passed together through a linear classification layer W.
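A compact sketch of the attention step is given below; the dot-product score follows the o(s_t, h_i) form above, while the classification over the concatenated [s_t; c_t] (the helper linear_W) is an assumption about how the linear layer W is applied.

import torch

def attention_context(s_t: torch.Tensor, encoder_states: torch.Tensor) -> torch.Tensor:
    """s_t: decoder hidden state of shape (hidden,);
    encoder_states: frames h_1..h_T of shape (T, hidden)."""
    scores = encoder_states @ s_t          # e_{t,i} = s_t . h_i
    alphas = torch.softmax(scores, dim=0)  # alpha_{t,i} = exp(e_{t,i}) / sum_j exp(e_{t,j})
    return alphas @ encoder_states         # c_t = sum_i alpha_{t,i} * h_i

# The prediction y_t would then be, for example (linear_W is an assumed nn.Linear layer):
# y_t = torch.softmax(linear_W(torch.cat([s_t, c_t])), dim=-1)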
In some embodiments of the present application, another implementation of the text recognition method is described.

The acquired text picture to be recognized may suffer from flipped text; as shown in FIG. 7, the text in the upper picture of FIG. 7 is flipped.

In addition, with reference to the text picture rotation processes corresponding to FIG. 3 and FIG. 4, the final text picture may also end up with flipped text as a result of the rotation.

If a text picture with flipped text is input into the text recognition model for recognition, the recognized text content may be inaccurate, or the word order of the recognized text content may be reversed.

Therefore, in this embodiment, before the text picture is input into the pre-built convolutional network, the following processing steps are further added:

The text picture is taken as a forward text picture, and the forward text picture is rotated by 180 degrees to obtain a reverse text picture.

On this basis, the forward text picture and the reverse text picture are separately input into the convolutional network of the text recognition model, so as to obtain the text content contained in the forward text picture and its confidence, as well as the text content contained in the reverse text picture and its confidence, both output by the text recognition model.

Between the text content contained in the forward text picture and the text content contained in the reverse text picture, the one with the higher confidence is taken as the final recognition result.

By separately inputting the forward and reverse text pictures into the text recognition model and selecting the recognized text content with the higher confidence as the final recognition result, text pictures in different orientations can be handled and the final recognition result is more accurate.
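A sketch of this forward/reverse selection is shown below; recognize() stands in for a full pass through the text recognition model, and its returned (text, confidence) pair is an assumption made for the example.

import numpy as np

def recognize_with_flip(text_picture: np.ndarray, recognize) -> str:
    """Run the model on the forward picture and its 180-degree rotation,
    and keep the transcription with the higher confidence."""
    forward_picture = text_picture
    reverse_picture = np.rot90(text_picture, k=2)  # rotate 180 degrees
    forward_text, forward_conf = recognize(forward_picture)
    reverse_text, reverse_conf = recognize(reverse_picture)
    return forward_text if forward_conf >= reverse_conf else reverse_text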
The text recognition apparatus provided by the embodiments of the present application is described below; the text recognition apparatus described below and the text recognition method described above may be referred to in correspondence with each other.

Referring to FIG. 8, FIG. 8 is a schematic structural diagram of a text recognition apparatus disclosed in an embodiment of the present application.

As shown in FIG. 8, the apparatus may include:

a picture acquisition unit 11, configured to acquire a text picture to be recognized, the text picture being the image area where the text to be recognized is located;

a feature extraction unit 12, configured to extract image features in at least two different directions from the text picture; and

a text content recognition unit 13, configured to recognize the text content contained in the text picture based on the extracted image features in the at least two different directions.

Optionally, the above feature extraction unit may include:

a convolutional network processing unit, configured to input the text picture into a pre-built convolutional network and to extract, with the convolutional network, image features of the text picture in at least two different directions, where the feature map output by each convolutional layer of the convolutional network is formed by fusing at least two feature submaps, and the at least two feature submaps are obtained by performing, with the same convolution kernel before rotation and after at least one rotation, a convolution operation on the feature map output by the previous convolutional layer.
Optionally, the embodiments of the present application provide two optional implementation structures of the convolutional network processing unit, as follows.

In the first structure, the convolutional network processing unit includes:

a first convolution operation unit, configured to perform a convolution operation on the feature map output by the previous convolutional layer with the convolution kernels of each convolutional layer of the convolutional network, to obtain the feature submap extracted by each convolution kernel, where the convolution kernels of each convolutional layer include an original convolution kernel and its kernels after at least one rotation;

a first feature fusion unit, configured to fuse the feature submaps extracted by the original convolution kernel and by each of its rotated kernels, and to input the fused feature map into the next convolutional layer; and

a first convolution output unit, configured to take the feature map output by the last convolutional layer of the convolutional network as the image features of the text picture.

In the second structure, the convolutional network processing unit includes:

a convolution kernel rotation unit, configured to rotate the convolution kernel of each convolutional layer of the convolutional network at least once;

a second convolution operation unit, configured to perform a convolution operation on the feature map output by the previous convolutional layer with the kernels before and after rotation, to obtain the feature submap extracted by each kernel before and after rotation;

a second feature fusion unit, configured to fuse the feature submaps extracted by the convolution kernel and by each of its rotated kernels, and to input the fused feature map into the next convolutional layer; and

a second convolution output unit, configured to take the feature map output by the last convolutional layer of the convolutional network as the image features of the text picture.
Optionally, the above at least two feature submaps may include:

a feature submap obtained by performing a convolution operation on the feature map output by the previous convolutional layer with the same convolution kernel before rotation; and

feature submaps obtained by rotating the same convolution kernel by 90 degrees, 180 degrees and/or 270 degrees in a set direction and then performing a convolution operation on the feature map output by the previous convolutional layer with the rotated kernel.

Optionally, the above text content recognition unit may include:

a recognition network processing unit, configured to input the extracted image features in the at least two different directions into a pre-built recognition network to obtain the text content contained in the text picture as output by the recognition network, where the recognition network and the convolutional network form a text recognition model, and the text recognition model is trained with sample picture training data annotated with text content recognition results.
Optionally, the above picture acquisition unit may include:

an original picture acquisition unit, configured to acquire an original text picture to be recognized, the original text picture being rectangular; and

a first rotation unit, configured to rotate the original text picture to the horizontal direction as the text picture to be recognized if it is detected that the original text picture is tilted relative to the horizontal direction.

Further optionally, the above picture acquisition unit may also include:

an aspect ratio calculation unit, configured to calculate the aspect ratio of the horizontally oriented original text picture after the processing by the first rotation unit; and

a second rotation unit, configured to rotate the horizontally oriented original text picture by 90 degrees as the text picture to be recognized if it is determined that the aspect ratio exceeds a set threshold.
Optionally, the apparatus of the present application may further include:

a third rotation unit, configured to, before the text picture is input into the pre-built convolutional network, take the text picture as a forward text picture and rotate the forward text picture by 180 degrees to obtain a reverse text picture. On this basis, the above convolutional network processing unit may include:

a forward/reverse text picture input unit, configured to separately input the forward text picture and the reverse text picture into the convolutional network of the text recognition model, to obtain the text content contained in the forward text picture and its confidence, as well as the text content contained in the reverse text picture and its confidence, both output by the text recognition model; and

a confidence selection unit, configured to take, between the text content contained in the forward text picture and the text content contained in the reverse text picture, the one with the higher confidence as the final recognition result.
The text recognition apparatus provided by the embodiments of the present application may be applied to a text recognition device, such as a terminal: a mobile phone, a computer, etc. Optionally, FIG. 9 shows a block diagram of the hardware structure of the text recognition device. Referring to FIG. 9, the hardware structure of the text recognition device may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4.

In the embodiments of the present application, there is at least one of each of the processor 1, the communication interface 2, the memory 3 and the communication bus 4, and the processor 1, the communication interface 2 and the memory 3 communicate with one another through the communication bus 4.

The processor 1 may be a central processing unit (CPU), an application specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention.

The memory 3 may include a high-speed RAM memory, and may also include a non-volatile memory, for example at least one disk memory.

The memory stores a program, and the processor may call the program stored in the memory, the program being used for:

acquiring a text picture to be recognized, the text picture being the image area where the text to be recognized is located;

extracting image features in at least two different directions from the text picture; and

recognizing the text content contained in the text picture based on the extracted image features in the at least two different directions.

Optionally, for the refined and extended functions of the program, reference may be made to the description above.
An embodiment of the present application further provides a storage medium, which may store a program suitable for execution by a processor, the program being used for:

acquiring a text picture to be recognized, the text picture being the image area where the text to be recognized is located;

extracting image features in at least two different directions from the text picture; and

recognizing the text content contained in the text picture based on the extracted image features in the at least two different directions.

Optionally, for the refined and extended functions of the program, reference may be made to the description above.
Finally, it should also be noted that, in this document, relational terms such as first and second are only used to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "comprise", "include" or any other variation thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article or device that includes the element.

The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, the embodiments may be combined with one another as needed, and for the same or similar parts reference may be made between them.

The above description of the disclosed embodiments enables a person skilled in the art to implement or use the present application. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present application. Therefore, the present application is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (12)

  1. A text recognition method, characterized in that it comprises:
    acquiring a text picture to be recognized, the text picture being the image area where the text to be recognized is located;
    extracting image features in at least two different directions from the text picture; and
    recognizing the text content contained in the text picture based on the extracted image features in the at least two different directions.
  2. The method according to claim 1, characterized in that the extracting image features in at least two different directions from the text picture comprises:
    inputting the text picture into a pre-built convolutional network; and
    extracting image features of the text picture in at least two different directions with the convolutional network, wherein the feature map output by each convolutional layer of the convolutional network is formed by fusing at least two feature submaps, and the at least two feature submaps are obtained by performing, with the same convolution kernel before rotation and after at least one rotation, a convolution operation on the feature map output by the previous convolutional layer.
  3. The method according to claim 2, characterized in that the extracting image features of the text picture in at least two different directions with the convolutional network comprises:
    performing a convolution operation on the feature map output by the previous convolutional layer with the convolution kernels of each convolutional layer of the convolutional network, to obtain the feature submap extracted by each convolution kernel, the convolution kernels of each convolutional layer comprising an original convolution kernel and its kernels after at least one rotation;
    fusing the feature submaps extracted by the original convolution kernel and by each of its rotated kernels, and inputting the fused feature map into the next convolutional layer; and
    taking the feature map output by the last convolutional layer of the convolutional network as the image features of the text picture.
  4. The method according to claim 2, characterized in that the extracting image features of the text picture in at least two different directions with the convolutional network comprises:
    rotating the convolution kernel of each convolutional layer of the convolutional network at least once, and performing a convolution operation on the feature map output by the previous convolutional layer with the kernels before and after rotation, to obtain the feature submap extracted by each kernel before and after rotation;
    fusing the feature submaps extracted by the convolution kernel and by each of its rotated kernels, and inputting the fused feature map into the next convolutional layer; and
    taking the feature map output by the last convolutional layer of the convolutional network as the image features of the text picture.
  5. The method according to claim 2, characterized in that the at least two feature submaps comprise:
    a feature submap obtained by performing a convolution operation on the feature map output by the previous convolutional layer with the same convolution kernel before rotation; and
    feature submaps obtained by rotating the same convolution kernel by 90 degrees, 180 degrees and/or 270 degrees in a set direction and then performing a convolution operation on the feature map output by the previous convolutional layer with the rotated kernel.
  6. The method according to claim 2, characterized in that the recognizing the text content contained in the text picture based on the extracted image features in the at least two different directions comprises:
    inputting the extracted image features in the at least two different directions into a pre-built recognition network, to obtain the text content contained in the text picture as output by the recognition network,
    wherein the recognition network and the convolutional network form a text recognition model, and the text recognition model is trained with sample picture training data annotated with text content recognition results.
  7. The method according to any one of claims 1 to 6, characterized in that the acquiring a text picture to be recognized comprises:
    acquiring an original text picture to be recognized; and
    if it is detected that the original text picture is tilted relative to the horizontal direction, rotating the original text picture to the horizontal direction as the text picture to be recognized.
  8. The method according to claim 7, characterized in that, after the rotating the original text picture to the horizontal direction, the method further comprises:
    calculating the aspect ratio of the horizontally oriented original text picture; and
    if it is determined that the aspect ratio exceeds a set threshold, rotating the horizontally oriented original text picture by 90 degrees as the text picture to be recognized.
  9. The method according to claim 6, characterized in that, before the inputting the text picture into a pre-built convolutional network, the method further comprises:
    taking the text picture as a forward text picture, and rotating the forward text picture by 180 degrees to obtain a reverse text picture;
    and the inputting the text picture into a pre-built convolutional network comprises:
    separately inputting the forward text picture and the reverse text picture into the convolutional network of the text recognition model, to obtain the text content contained in the forward text picture and its confidence, as well as the text content contained in the reverse text picture and its confidence, both output by the text recognition model; and
    taking, between the text content contained in the forward text picture and the text content contained in the reverse text picture, the one with the higher confidence as the final recognition result.
  10. A text recognition apparatus, characterized in that it comprises:
    a picture acquisition unit, configured to acquire a text picture to be recognized, the text picture being the image area where the text to be recognized is located;
    a feature extraction unit, configured to extract image features in at least two different directions from the text picture; and
    a text content recognition unit, configured to recognize the text content contained in the text picture based on the extracted image features in the at least two different directions.
  11. A text recognition device, characterized in that it comprises: a memory and a processor;
    the memory being configured to store a program; and
    the processor being configured to execute the program to implement the steps of the text recognition method according to any one of claims 1 to 9.
  12. A storage medium on which a computer program is stored, characterized in that, when the computer program is executed by a processor, the steps of the text recognition method according to any one of claims 1 to 9 are implemented.
PCT/CN2021/139972 2021-06-16 2021-12-21 Text identification method, apparatus and device, and storage medium WO2022262239A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110666915.6 2021-06-16
CN202110666915.6A CN113392825B (en) 2021-06-16 2021-06-16 Text recognition method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2022262239A1 true WO2022262239A1 (en) 2022-12-22

Family

ID=77621485

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/139972 WO2022262239A1 (en) 2021-06-16 2021-12-21 Text identification method, apparatus and device, and storage medium

Country Status (2)

Country Link
CN (1) CN113392825B (en)
WO (1) WO2022262239A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113392825B (en) * 2021-06-16 2024-04-30 中国科学技术大学 Text recognition method, device, equipment and storage medium


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111222589B (en) * 2018-11-27 2023-07-18 中国移动通信集团辽宁有限公司 Image text recognition method, device, equipment and computer storage medium
CN111783756B (en) * 2019-04-03 2024-04-16 北京市商汤科技开发有限公司 Text recognition method and device, electronic equipment and storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105320961A (en) * 2015-10-16 2016-02-10 重庆邮电大学 Handwriting numeral recognition method based on convolutional neural network and support vector machine
CN110659633A (en) * 2019-08-15 2020-01-07 坎德拉(深圳)科技创新有限公司 Image text information recognition method and device and storage medium
CN111400497A (en) * 2020-03-19 2020-07-10 北京远鉴信息技术有限公司 Text recognition method and device, storage medium and electronic equipment
CN112101351A (en) * 2020-09-07 2020-12-18 凌云光技术股份有限公司 Projection-based text line rotation correction method and device
AU2021100391A4 (en) * 2021-01-22 2021-04-15 GRG Banking Equipment Co.,Ltd Natural Scene Text Recognition Method Based on Sequence Transformation Correction and Attention Mechanism
CN113392825A (en) * 2021-06-16 2021-09-14 科大讯飞股份有限公司 Text recognition method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN113392825B (en) 2024-04-30
CN113392825A (en) 2021-09-14

Similar Documents

Publication Publication Date Title
CN109977956B (en) Image processing method and device, electronic equipment and storage medium
CN109146892B (en) Image clipping method and device based on aesthetics
US10776671B2 (en) Joint blur map estimation and blur desirability classification from an image
TWI766855B (en) A character recognition method and device
US10134165B2 (en) Image distractor detection and processing
WO2019174130A1 (en) Bill recognition method, server, and computer readable storage medium
CN109241861B (en) Mathematical formula identification method, device, equipment and storage medium
US20220230282A1 (en) Image processing method, image processing apparatus, electronic device and computer-readable storage medium
CN108304775A (en) Remote sensing images recognition methods, device, storage medium and electronic equipment
CN111091123A (en) Text region detection method and equipment
JP2013522971A (en) Image feature detection based on the application of multiple feature detectors
WO2020097909A1 (en) Text detection method and apparatus, and storage medium
CN111539412B (en) Image analysis method, system, device and medium based on OCR
CN110533039A (en) A kind of true-false detection method of license plate, device and equipment
CN109271910A (en) A kind of Text region, character translation method and apparatus
WO2022002262A1 (en) Character sequence recognition method and apparatus based on computer vision, and device and medium
WO2022262239A1 (en) Text identification method, apparatus and device, and storage medium
WO2019148923A1 (en) Method and apparatus for searching for images with image, electronic device, and storage medium
CN111368632A (en) Signature identification method and device
WO2023178930A1 (en) Image recognition method and apparatus, training method and apparatus, system, and storage medium
KR20190080388A (en) Photo Horizon Correction Method based on convolutional neural network and residual network structure
US8270731B2 (en) Image classification using range information
US9665963B1 (en) Dynamic collage layout generation
CN114494775A (en) Video segmentation method, device, equipment and storage medium
CN116977336A (en) Camera defect detection method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21945815

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE