CN113392825B - Text recognition method, device, equipment and storage medium - Google Patents


Info

Publication number
CN113392825B
CN113392825B
Authority
CN
China
Prior art keywords
text
convolution
picture
feature
text picture
Prior art date
Legal status
Active
Application number
CN202110666915.6A
Other languages
Chinese (zh)
Other versions
CN113392825A (en)
Inventor
赵坤
杨争艳
吴嘉嘉
殷兵
胡金水
刘聪
胡国平
Current Assignee
University of Science and Technology of China USTC
iFlytek Co Ltd
Original Assignee
University of Science and Technology of China USTC
iFlytek Co Ltd
Priority date
Filing date
Publication date
Application filed by University of Science and Technology of China (USTC) and iFlytek Co Ltd
Priority to CN202110666915.6A
Publication of CN113392825A
PCT application PCT/CN2021/139972 (published as WO2022262239A1)
Application granted
Publication of CN113392825B
Legal status: Active


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
        • G06F 18/00: Pattern recognition
        • G06F 18/214: Generating training patterns; bootstrap methods, e.g. bagging or boosting
        • G06F 18/253: Fusion techniques of extracted features
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
        • G06N 3/00: Computing arrangements based on biological models
        • G06N 3/044: Recurrent networks, e.g. Hopfield networks
        • G06N 3/045: Combinations of networks
        • G06N 3/08: Learning methods


Abstract

The application discloses a text recognition method, apparatus, device, and storage medium. Because the direction of the text content in a picture to be recognized may vary, the method strengthens the directional information captured when extracting image features: features are extracted from the text picture along two or more different directions, so that the extracted image features contain feature information of the text picture in multiple directions. On this basis, the text content contained in the text picture can be recognized more accurately from the extracted image features, improving the accuracy of text recognition.

Description

Text recognition method, device, equipment and storage medium
Technical Field
The present application relates to the field of natural language processing technologies, and in particular, to a method, an apparatus, a device, and a storage medium for text recognition.
Background
With the development of text recognition technology, text recognition is increasingly used in real life, for example in landmark recognition for autonomous driving, photo translation, and document scanning and recognition.
In real-life scene pictures, text regions are distributed in various directions, as shown in fig. 1, including horizontal text, oblique text, vertical text, and the like. This diversity in the direction of the text pictures to be recognized poses a significant challenge to text recognition, and accurately recognizing such pictures is a problem the industry needs to solve.
Disclosure of Invention
In view of the above problems, the present application is directed to providing a text recognition method, apparatus, device, and storage medium, so as to accurately recognize text for a text picture to be recognized with diversified directions. The specific scheme is as follows:
A text recognition method, comprising:
acquiring a text picture to be identified, wherein the text picture is an image area where the text to be identified is located;
extracting image features in at least two different directions from the text picture;
and identifying text content contained in the text picture based on the extracted image features in the at least two different directions.
Preferably, the extracting image features in at least two different directions from the text picture includes:
inputting the text picture into a pre-constructed convolution network; and
extracting image features of the text picture in at least two different directions by using the convolution network, wherein the feature map output by each convolution layer in the convolution network is formed by fusing at least two feature subgraphs, and the at least two feature subgraphs include feature subgraphs obtained by the same convolution kernel performing convolution operations, before rotation and after at least one rotation, on the feature map output by the previous convolution layer.
Preferably, the extracting image features of the text picture in at least two different directions by using the convolution network includes:
performing a convolution operation on the feature map output by the previous convolution layer using the convolution kernels of each convolution layer in the convolution network, to obtain the feature subgraph extracted by each convolution kernel, wherein the convolution kernels of each convolution layer include an original convolution kernel and a convolution kernel that has been rotated at least once;
fusing the feature subgraphs extracted by the original convolution kernel and the rotated convolution kernels, and inputting the fused feature map into the next convolution layer; and
taking the feature map output by the last convolution layer of the convolution network as the image feature of the text picture.
Preferably, the extracting image features of the text picture in at least two different directions by using the convolution network includes:
rotating the convolution kernel of each convolution layer in the convolution network at least once, and performing convolution operations on the feature map output by the previous convolution layer using the convolution kernels before and after rotation, to obtain the feature subgraphs extracted by each convolution kernel before and after rotation;
fusing the feature subgraphs extracted by the convolution kernels before and after rotation, and inputting the fused feature map into the next convolution layer; and
taking the feature map output by the last convolution layer of the convolution network as the image feature of the text picture.
Preferably, the at least two feature subgraphs include:
a feature subgraph obtained by the same convolution kernel performing a convolution operation on the feature map output by the previous convolution layer before rotation; and
a feature subgraph obtained, after the same convolution kernel is rotated by 90, 180 and/or 270 degrees in a set direction, by the rotated convolution kernel performing a convolution operation on the feature map output by the previous convolution layer.
Preferably, the identifying text content contained in the text picture based on the extracted image features in the at least two different directions includes:
inputting the extracted image features in at least two different directions into a pre-constructed recognition network to obtain text content contained in the text picture output by the recognition network;
the recognition network and the convolution network form a text recognition model, and the text recognition model is obtained by training with sample picture training data labeled with text content recognition results.
Preferably, the obtaining the text picture to be identified includes:
acquiring an original text picture to be identified;
and, if the original text picture is detected to be inclined relative to the horizontal direction, rotating the original text picture to the horizontal direction to serve as the text picture to be recognized.
Preferably, after said rotating the original text picture to the horizontal direction, the method further comprises:
calculating the aspect ratio (height to width) of the original text picture in the horizontal direction; and
and if the aspect ratio exceeds the set threshold, rotating the original text picture in the horizontal direction by 90 degrees to serve as the text picture to be identified.
Preferably, before inputting the text picture into the pre-constructed convolutional network, the method further comprises:
taking the text picture as a forward text picture, and rotating the forward text picture by 180 degrees to obtain a reverse text picture;
the inputting the text picture into a pre-constructed convolutional network comprises:
inputting the forward text picture and the reverse text picture respectively into the convolution network in the text recognition model, to obtain the text content and confidence for the forward text picture output by the text recognition model and the text content and confidence for the reverse text picture output by the text recognition model; and
taking, of the text content contained in the forward text picture and the text content contained in the reverse text picture, the one with the higher confidence as the final recognition result.
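The forward/reverse disambiguation above can be sketched as a small helper (a hypothetical illustration; `recognize` and `rotate180` are stand-ins for the text recognition model and the 180-degree rotation step, not names from the patent):

```python
def recognize_with_flip(recognize, picture, rotate180):
    """Run recognition on the forward picture and on its 180-degree
    rotation, and keep whichever result has the higher confidence.

    `recognize` is assumed to return a (text, confidence) pair; both
    callables are hypothetical stand-ins for the trained model.
    """
    fwd_text, fwd_conf = recognize(picture)
    rev_text, rev_conf = recognize(rotate180(picture))
    return fwd_text if fwd_conf >= rev_conf else rev_text
```

This mirrors the claim: both orientations are scored, and the higher-confidence text is taken as the final recognition result.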
A text recognition device, comprising:
the picture acquisition unit is used for acquiring a text picture to be recognized, wherein the text picture is an image area where the text to be recognized is located;
the feature extraction unit is used for extracting image features in at least two different directions from the text picture; and
the text content recognition unit is used for recognizing text content contained in the text picture based on the extracted image features in the at least two different directions.
A text recognition device, comprising: a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement the steps of the text recognition method as described above.
A storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of a text recognition method as described above.
By means of the above technical scheme, the text picture corresponding to the image area where the text to be recognized is located is obtained, image features in at least two different directions are extracted from the text picture, and the text content contained in the text picture is recognized based on those features. Because the direction of the text content may vary, the method strengthens the directional information captured during feature extraction: the text picture undergoes feature extraction from two or more different directions, so the extracted image features contain feature information of the picture in multiple directions. On this basis, the text content contained in the text picture can be recognized more accurately, improving the accuracy of text recognition.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
FIG. 1 illustrates a text picture schematic of several different directional distributions;
FIG. 2 is a flow chart of a text recognition method according to an embodiment of the present application;
FIG. 3 illustrates a process diagram of rotating an original text picture to a horizontal orientation;
FIG. 4 illustrates a process diagram of rotating a text picture to landscape orientation;
FIG. 5 illustrates a schematic diagram of a process for feature extraction by two adjacent convolutional layers through a shared rotating convolution kernel;
FIG. 6 illustrates an identification network architecture diagram of a codec structure;
FIG. 7 illustrates a process diagram of a rotation operation for text pictures with text flipped;
FIG. 8 is a schematic structural diagram of a text recognition apparatus according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a text recognition device according to an embodiment of the present application.
Detailed Description
The following describes the embodiments of the present application clearly and completely with reference to the accompanying drawings. The embodiments described are only some, not all, of the embodiments of the present application. All other embodiments obtained by those skilled in the art based on the embodiments of the application without inventive effort fall within the scope of the application.
The scheme of the application can be realized based on the terminal with the data processing capability, and the terminal can be a mobile phone, a computer, a server, a cloud terminal and the like.
Next, as described in connection with fig. 2, the text recognition method of the present application may include the steps of:
and step S100, acquiring a text picture to be identified.
Specifically, the text image to be recognized is an image area where the text to be recognized is located. In this step, the text picture to be identified may be directly obtained, or the text region detection may be performed on the original picture containing the text to be identified, to obtain the image region where the text to be identified is located.
Further, in order to facilitate text recognition, the text picture obtained in this step may be a text line picture, that is, an image area where a line of text is located.
And step S110, extracting image features in at least two different directions from the text picture.
Specifically, the direction of the text content in the text picture to be recognized is not fixed. To adapt to recognizing text content in multiple different directions, this step strengthens the directional information extracted during image feature extraction: the text picture undergoes feature extraction from two or more different directions, so that the extracted image features contain feature information of the text to be recognized in multiple different directions. There are many ways to extract such image features, and this step does not strictly limit the method used.
Step S120, identifying text content contained in the text picture based on the extracted image features in the at least two different directions.
Specifically, after the image features of the text picture to be recognized in at least two different directions are extracted, text content contained in the text picture can be recognized more accurately based on the extracted image features, and the accuracy of text recognition is improved.
According to the text recognition method provided by this embodiment of the application, the text picture corresponding to the image area where the text to be recognized is located is obtained, image features in at least two different directions are extracted from it, and the text content contained in the text picture is recognized based on those features. Because the direction of the text content may vary, the method strengthens the directional information captured during feature extraction: the text picture undergoes feature extraction from two or more different directions, so the extracted image features contain feature information of the picture in multiple directions. On this basis, the text content contained in the text picture can be recognized more accurately, improving the accuracy of text recognition.
In some embodiments of the present application, the process of acquiring the text picture to be identified in the foregoing step S100 is further described.
In this embodiment, an original text picture to be identified is first obtained, where the original text picture is a picture obtained by text region detection. In general, the original text picture may be a text line picture. The shape of the original text picture may also be different depending on the detection means used in the text region detection, for example, the original text picture may be rectangular, parallelogram or other alternative shapes. In general, the application can select rectangular original text pictures.
Since the position, orientation, and text direction of text regions differ across scenes, the obtained original text picture may be inclined relative to the horizontal direction, that is, none of its edges is parallel to the horizontal direction. As shown in fig. 3, the left original text picture in fig. 3 is inclined relative to the horizontal direction.
In order to better extract image features of the text picture in a plurality of different directions in the subsequent step, in this embodiment, the original text picture may be rotated to a horizontal direction as the text picture to be identified.
The direction of rotation of the original text picture may be either counterclockwise or clockwise, as long as one edge of the rotated text picture is ensured to be parallel to the horizontal direction.
It can be understood that after the above rotation processing, when the image features are extracted by using a plurality of convolution kernels, the extraction of the image features can be performed along four main directions of the horizontal and vertical axes of the text, and the extracted image features are more convenient for the subsequent text content recognition.
Still further, a text picture rotated into the horizontal direction may take two forms: placed vertically or placed horizontally. A vertically placed text picture is more difficult for the subsequent network to process, and uniform data is easier to handle, so in this embodiment the text picture can be adjusted to be horizontally placed. The specific implementation may include:
after the original text picture has been rotated to the horizontal direction, further calculating its aspect ratio (height to width).
Specifically, the aspect ratio determines whether the text picture is placed vertically or horizontally. An aspect ratio threshold may be preset, for example 2 or another selectable value. If the calculated aspect ratio exceeds the set threshold, the text picture is considered vertically placed, so the horizontally aligned original text picture is further rotated by 90 degrees to serve as the text picture to be recognized; if the aspect ratio does not exceed the threshold, the text picture is considered horizontally placed and can be used directly as the text picture to be recognized without further rotation.
The process of rotating the original text picture in the horizontal direction by 90 degrees can be performed in a clockwise or counterclockwise direction. As shown in fig. 4, for the left text picture in fig. 4, which is vertically disposed, it may be rotated 90 degrees in a clockwise direction.
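The aspect-ratio check described above can be sketched as follows (a minimal illustration; the threshold value 2.0 is only the example value mentioned in the text, and the function name is hypothetical):

```python
def normalize_orientation(height, width, ratio_threshold=2.0):
    """Decide whether a horizontally aligned text picture should be
    rotated 90 degrees so that the text line is laid out horizontally.

    Returns the (height, width) after the optional rotation. A 90-degree
    rotation (clockwise or counter-clockwise) simply swaps the two
    dimensions.
    """
    if height / width > ratio_threshold:
        # Vertically placed text line: rotate by 90 degrees.
        return width, height
    return height, width
```

For example, a 300x50 crop (a vertical line of text) would be rotated into a 50x300 crop, while an already-horizontal 50x300 crop is left unchanged.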
In some embodiments of the present application, the process of extracting image features in at least two different directions from the text picture in step S110 is described.
This embodiment describes an optional way of extracting image features through a convolution network. Specifically, the application can pre-train a convolution network for extracting image features.
The convolution network may take the form of a ResNet or another network containing convolution layers.
The convolution network comprises a number of convolution layers. In order to extract image features in at least two different directions from a text picture, the convolution kernels of the convolution layers are set as shared rotation convolution kernels in this embodiment. Here, a convolution kernel is a weight matrix that performs a convolution operation on the feature map output by the previous convolution layer. A shared rotation convolution kernel is one whose parameters are kept unchanged (i.e., the weights in the weight matrix are unchanged) while the kernel is rotated by 90 degrees at least once.
By sharing the rotating convolution kernel, feature information in different directions of the text picture can be obtained.
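The "shared rotation" can be illustrated with plain lists (a sketch assuming counter-clockwise rotation; a real implementation would rotate tensor weights with a framework operation such as a rot90 primitive):

```python
def rot90_ccw(kernel):
    """Rotate a square convolution kernel 90 degrees counter-clockwise
    without changing its weights: the same weight matrix is reused in a
    new orientation, which is the 'shared rotation' idea."""
    n = len(kernel)
    return [[kernel[j][n - 1 - i] for j in range(n)] for i in range(n)]
```

Applying the rotation four times returns the original kernel, so the four orientations (0, 90, 180, 270 degrees) share one set of weights.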
The feature map output by each convolution layer is formed by fusing at least two feature subgraphs, which include the feature subgraphs obtained by the same convolution kernel performing convolution operations, before rotation and after at least one rotation, on the feature map output by the previous convolution layer.
Specifically, the at least two feature subgraphs may include:
a feature subgraph obtained by the same convolution kernel performing a convolution operation on the feature map output by the previous convolution layer before rotation; and
a feature subgraph obtained, after the same convolution kernel is rotated by 90, 180 and/or 270 degrees in a set direction (such as clockwise or counter-clockwise), by the rotated convolution kernel performing a convolution operation on the feature map output by the previous convolution layer.
It can be appreciated that, before rotation, the convolution kernel performs a convolution operation on the feature map output by the previous convolution layer, yielding a corresponding feature subgraph. On this basis, the convolution kernel can be rotated by 90, 180 and/or 270 degrees as described above; each time the kernel is rotated once, a convolution operation is performed on the feature map output by the previous convolution layer using the rotated kernel, yielding one additional feature subgraph. Thus one rotation yields one additional feature subgraph, two rotations yield two, and three rotations yield three.
All feature subgraphs obtained before and after rotation of the convolution kernel are fused to obtain the output of the current convolution layer.
Referring to fig. 5, fig. 5 illustrates a schematic diagram of a process of feature extraction by two adjacent convolution layers through a shared rotating convolution kernel.
The output feature of the previous layer in the convolution network is denoted F_(i-1), with size C×H×W, where C is the number of channels, H the height, and W the width of the feature.
With feature F_(i-1) as the input to the current convolution layer, the convolution kernel size is set to 3×3 (fig. 5 is merely an example; the convolution kernel may be of other sizes, and the application is not limited in this regard).
To enable the convolution kernel to capture text features in various directions, the kernel may be rotated in this embodiment, for example by 90, 180, and 270 degrees, respectively, in the counter-clockwise direction. A convolution operation is then performed on the feature map output by the previous layer with each kernel orientation, before and after rotation, obtaining feature information along four main directions. The feature subgraphs obtained from these convolution operations are F_(i,0), F_(i,90), F_(i,180), and F_(i,270), respectively, and the output of the current layer is their fusion F_i. Each feature is computed as follows:
F_(i,0) = conv_0(F_(i-1))
F_(i,90) = conv_90(F_(i-1))
F_(i,180) = conv_180(F_(i-1))
F_(i,270) = conv_270(F_(i-1))
F_i = cat(F_(i,0), F_(i,90), F_(i,180), F_(i,270))
where cat(·) denotes the concatenation operation on features.
The fused output feature F_i has size 4C′×H′×W′, where C′, H′, and W′ are respectively the channel count, height, and width of the current convolution layer's feature.
The fused output feature F_i is input into the next convolution layer for processing, and so on; the features extracted by each convolution layer thus contain information from different directions, strengthening the directionality of the extracted image features. The feature output by the last convolution layer is taken as the image feature F extracted by the convolution network.
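The per-layer computation above can be sketched end to end for a single-channel feature map (an illustrative sketch in plain Python; a real network would use batched, multi-channel tensor convolutions, and all function names here are hypothetical):

```python
def conv2d_valid(feat, kernel):
    """Valid-mode 2D cross-correlation of a single-channel feature map
    with a square kernel, both given as nested lists: a minimal
    stand-in for one convolution."""
    n = len(kernel)
    h, w = len(feat), len(feat[0])
    out = []
    for i in range(h - n + 1):
        row = []
        for j in range(w - n + 1):
            row.append(sum(kernel[a][b] * feat[i + a][j + b]
                           for a in range(n) for b in range(n)))
        out.append(row)
    return out

def rot90_ccw(k):
    """Rotate a square kernel 90 degrees counter-clockwise, keeping
    its weights unchanged."""
    n = len(k)
    return [[k[j][n - 1 - i] for j in range(n)] for i in range(n)]

def rotated_conv_layer(feat, kernel):
    """Compute the four feature subgraphs F_(i,0), F_(i,90), F_(i,180),
    F_(i,270) with a single shared kernel; returning the list of four
    sub-maps corresponds to concatenating along the channel dimension,
    F_i = cat(F_(i,0), F_(i,90), F_(i,180), F_(i,270))."""
    subs = []
    k = kernel
    for _ in range(4):
        subs.append(conv2d_valid(feat, k))
        k = rot90_ccw(k)
    return subs  # 4x the input channel count, as in 4C'xH'xW'
```

One kernel therefore produces four directional sub-maps, and the layer's output channel count is quadrupled, matching the 4C′×H′×W′ size given above.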
Based on the convolution network introduced above, a text picture may be input into the convolution network to extract its image features in at least two different directions. The convolution network extracts image features of the text picture in at least two different directions, where the feature map output by each convolution layer is formed by fusing at least two feature subgraphs, which include the feature subgraphs obtained by the same convolution kernel performing convolution operations, before rotation and after at least one rotation, on the feature map output by the previous convolution layer.
It will be appreciated that the convolution kernel, after one rotation, can extract image features in one direction, and if the rotation is performed as illustrated in fig. 5, then image features in four directions can be extracted.
Of course, compared with a convolution kernel that undergoes no rotation at all, each additional rotation of the kernel extracts image features in one more direction, which provides more accurate image features for subsequent text recognition and improves its accuracy.
In some embodiments of the present application, two alternative implementation architectures of the convolutional network described above are further described, as well as implementations of image feature extraction using the convolutional network.
Alternatively, the convolution kernel of each convolution layer in the convolution network may include an original convolution kernel and a convolution kernel from which the original convolution kernel has undergone at least one rotation.
On the basis, the process of extracting the image features in at least two different directions of the text picture by using a convolution network can comprise the following steps:
S1, performing a convolution operation on the feature map output by the previous convolution layer using the convolution kernels of each convolution layer in the convolution network, to obtain the feature subgraph extracted by each convolution kernel.
Since the convolution kernels of each convolution layer include an original convolution kernel and kernels that have been rotated at least once, the feature subgraphs extracted by the original kernel and by each rotated kernel can all be obtained.
S2, fusing the feature subgraphs extracted by the original and rotated convolution kernels, and inputting the fused feature map into the next convolution layer.
Specifically, this fusion proceeds as described above, for example by concatenating the feature subgraphs along the channel dimension.
And S3, taking the feature map output by the last convolution layer of the convolution network as the image feature of the text picture.
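For the architecture just described, where each convolution layer stores the original kernel together with its rotated copies, the rotated copies can be pre-computed once when the layer is built (an illustrative sketch; function names are hypothetical, and counter-clockwise rotation is assumed):

```python
def rot90_ccw(k):
    """Rotate a square kernel 90 degrees counter-clockwise, keeping
    its weights unchanged."""
    n = len(k)
    return [[k[j][n - 1 - i] for j in range(n)] for i in range(n)]

def build_rotated_kernels(kernel):
    """Pre-compute the four orientations of a kernel once, so that at
    feature-extraction time every orientation can be used directly
    without rotating on the fly."""
    kernels = [kernel]
    for _ in range(3):
        kernels.append(rot90_ccw(kernels[-1]))
    return kernels
```

This trades a little extra storage for avoiding the per-pass rotation of the second architecture, where only the original kernels are stored and rotation happens during extraction.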
Alternatively, the convolution kernel of each convolution layer in the convolution network includes the original convolution kernel and does not include the rotated convolution kernel.
On the basis, the process of extracting the image features in at least two different directions of the text picture by using a convolution network can comprise the following steps:
S1, rotating a convolution kernel of each convolution layer in a convolution network at least once, and performing convolution operation on a feature map output by a previous convolution layer by utilizing convolution kernels before and after rotation to obtain feature subgraphs extracted by each convolution kernel before and after rotation.
It can be understood that, when the convolution layers of the convolution network contain only the original convolution kernels, extracting image features from multiple different directions requires rotating the kernel of each convolution layer at least once during feature extraction; convolution operations are then performed on the feature map output by the previous convolution layer using the kernels before and after rotation, obtaining the feature subgraphs extracted by each kernel before and after rotation.
S2, fusing the convolution kernels and the feature subgraphs extracted by the rotated convolution kernels, and inputting the fused feature graph into a next convolution layer.
And S3, taking the feature map output by the last convolution layer of the convolution network as the image feature of the text picture.
Comparing the two convolution network architectures: the former is pre-configured with the convolution kernels both before and after rotation, so that each kernel can be used directly for feature extraction; the latter is configured only with the convolution kernels before rotation, so the kernels must be rotated at least once during feature extraction before the kernels before and after rotation can all be used. Both implementations realize multi-directional image feature extraction; a technician may choose between them according to actual needs.

Building on the multi-directional image feature extraction with a convolution network described in the foregoing embodiments, this embodiment further describes step S120: the process of identifying the text content contained in the text picture based on the extracted image features in the at least two different directions.

In this embodiment, a neural network model may be selected to process the text recognition task; that is, a recognition network may be trained in advance, where the recognition network and the convolutional network together form the text recognition model. Specifically, the output of the convolutional network serves as the input to the recognition network, and the two networks are trained jointly.

The text recognition model is trained on sample picture training data annotated with text content recognition results.
On the basis, the image features in at least two different directions can be extracted by inputting the text picture to be recognized into a convolution network of the text recognition model, and further the extracted image features are input into the recognition network, and the recognition network outputs text content contained in the text picture.
The recognition network may adopt a variety of neural network architectures, for example an Encoder-Decoder architecture, as shown in fig. 6.

The encoder may adopt a bidirectional LSTM (Long Short-Term Memory) structure; it takes the image feature F output by the convolutional network in the previous step as input and outputs the hidden state h_i of each encoder frame.

The decoder may employ a GRU (Gated Recurrent Unit) or LSTM structure. For the decoder's hidden state s_t at the current time, an attention mechanism can be used to compute the correlation between s_t and each encoder hidden state h_i, yielding the context feature vector c_t. The calculation proceeds as follows:
e_{t,i} = o(s_t, h_i)

α_{t,i} = exp(e_{t,i}) / Σ_{j=1}^{T} exp(e_{t,j})

c_t = Σ_{i=1}^{T} α_{t,i} h_i

where o represents a dot product operation, α_{t,i} is the attention weight of encoder frame i at decoding time t, and T represents the encoder length.
Finally, the text prediction y_t at the current time of the decoder is obtained from the current hidden state s_t and the context feature vector c_t together, through the linear classification layer W.
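One decoding step as just described (attention scores e_{t,i}, context vector c_t, and the linear classification layer W) can be sketched in a few lines. This is an illustrative stand-in with random weights, not the trained model; the dot-product score and the concatenation of [s_t; c_t] before the classifier are assumptions where the patent leaves the details open.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def attention_step(s_t, H, W):
    """One decoding step: score the decoder state against every encoder
    frame, build the context vector, and classify."""
    e = H @ s_t                               # e_{t,i} = dot(s_t, h_i), shape (T,)
    alpha = softmax(e)                        # attention weights over T encoder frames
    c_t = alpha @ H                           # context vector, shape (d,)
    logits = W @ np.concatenate([s_t, c_t])   # linear classification layer
    return softmax(logits)                    # distribution over the vocabulary

rng = np.random.default_rng(0)
T, d, vocab = 6, 8, 10
H = rng.normal(size=(T, d))          # encoder hidden states h_1..h_T
s_t = rng.normal(size=d)             # decoder hidden state at time t
W = rng.normal(size=(vocab, 2 * d))  # classification layer
y_t = attention_step(s_t, H, W)
print(y_t.shape)  # (10,)
```

The output y_t is a probability distribution over the character vocabulary; in a full decoder this step repeats, with s_t updated by the GRU or LSTM at each time step.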
In some embodiments of the present application, another implementation of a text recognition method is presented.
For the acquired text picture to be recognized, the problem of text inversion may occur, as shown in fig. 7, where the text in the upper diagram in fig. 7 is inverted.
In addition, in the picture rotation process corresponding to fig. 3 and fig. 4, the text of the final text picture may also end up flipped.

If a text picture with this text-flipping problem is input into the text recognition model, the recognized text content may be inaccurate, or its character order may be reversed.
For this purpose, in this embodiment, the following processing steps are further added before inputting the text picture into the previously constructed convolutional network:
Taking the text picture as the forward text picture, rotate the forward text picture by 180 degrees to obtain the reverse text picture.

On this basis, the forward text picture and the reverse text picture are respectively input into the convolution network of the text recognition model, so as to obtain, from the model, the text content contained in the forward text picture with its confidence, and the text content contained in the reverse text picture with its confidence.

Of the text content contained in the forward text picture and that contained in the reverse text picture, the one with the higher confidence is taken as the final recognition result.

By inputting the forward and reverse text pictures into the text recognition model separately and selecting the recognized text content with the higher confidence as the final result, the method can adapt to text pictures in different orientations, making the final recognition result more accurate.
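The forward/reverse selection logic can be sketched as follows. Here `recognize` is a hypothetical stand-in for the trained text recognition model, assumed to map an image to a (text, confidence) pair; the toy model below only illustrates the selection, it does not recognize anything.

```python
import numpy as np

def pick_orientation(image, recognize):
    """Run the recognizer on the picture and on its 180-degree rotation,
    and keep the result with the higher confidence."""
    forward = recognize(image)                 # (text, confidence)
    reverse = recognize(np.rot90(image, 2))    # 180-degree rotation
    return max([forward, reverse], key=lambda r: r[1])

# Toy stand-in model: pretends upright text is recognized more confidently.
def toy_recognize(img):
    upright = img[0, 0] < img[-1, -1]
    return ("hello", 0.9) if upright else ("olleh", 0.3)

img = np.arange(16, dtype=float).reshape(4, 4)
print(pick_orientation(img, toy_recognize))  # ('hello', 0.9)
```

If the input happened to be upside down, the 180-degree rotated copy would score higher and its text would be returned instead, which is exactly the adaptation to orientation described above.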
The text recognition device provided by the embodiment of the application is described below, and the text recognition device described below and the text recognition method described above can be referred to correspondingly.
Referring to fig. 8, fig. 8 is a schematic structural diagram of a text recognition device according to an embodiment of the present application.
As shown in fig. 8, the apparatus may include:
A picture obtaining unit 11, configured to obtain a text picture to be identified, where the text picture is an image area where a text to be identified is located;
a feature extraction unit 12, configured to extract image features in at least two different directions from the text picture;
a text content identifying unit 13 for identifying text content contained in the text picture based on the extracted image features in the at least two different directions.
Alternatively, the feature extraction unit may include:
the convolution network processing unit is used for inputting the text picture into a pre-constructed convolution network; and extracting image features in at least two different directions of the text picture by using the convolution network, wherein the feature map output by each convolution layer in the convolution network is formed by fusing at least two feature subgraphs, and the at least two feature subgraphs comprise a same convolution kernel obtained by carrying out convolution operation on the feature map output by the previous convolution layer before rotation and after at least one rotation.
Optionally, the embodiment of the present application provides two alternative implementation structures of the convolutional network processing unit, which are respectively as follows:
first, a convolutional network processing unit includes:
the first convolution operation unit is used for carrying out convolution operation on the feature image output by the previous convolution layer by utilizing the convolution kernel of each convolution layer in the convolution network to obtain a feature subgraph extracted by each convolution kernel, wherein the convolution kernel of each convolution layer comprises an original convolution kernel and a convolution kernel which is subjected to at least one rotation;
the first feature fusion unit is used for fusing the feature subgraphs extracted by the original convolution kernels and by the rotated convolution kernels, and inputting the fused feature map into a next convolution layer;
And the first convolution output unit is used for taking the characteristic diagram output by the last convolution layer of the convolution network as the image characteristic of the text picture.
Second, the convolutional network processing unit includes:
the convolution kernel rotating unit is used for rotating the convolution kernel of each convolution layer in the convolution network at least once;
the second convolution operation unit is used for carrying out convolution operation on the feature graphs output by the previous convolution layer by utilizing the convolution cores before and after rotation to obtain feature subgraphs extracted by each convolution core before and after rotation;
The second feature fusion unit is used for fusing the feature subgraphs extracted by the convolution kernels before and after rotation, and inputting the fused feature map into a next convolution layer;
And the second convolution output unit is used for taking the characteristic diagram output by the last convolution layer of the convolution network as the image characteristic of the text picture.
Optionally, the at least two feature subgraphs may include:
the same convolution kernel carries out a convolution operation on the feature diagram output by the previous convolution layer before rotation to obtain a feature subgraph; and
After the same convolution kernel is rotated by 90 degrees, 180 degrees and/or 270 degrees in the set direction, the rotated convolution kernel performs the convolution operation on the feature map output by the previous convolution layer to obtain a feature subgraph.
Alternatively, the text content recognition unit may include:
the recognition network processing unit is used for inputting the extracted image features in the at least two different directions into a pre-built recognition network to obtain text contents contained in the text pictures output by the recognition network; the recognition network and the convolution network form a text recognition model, and the text recognition model is obtained by training sample picture training data marked with text content recognition results.
Optionally, the above-mentioned picture obtaining unit may include:
The original picture acquisition unit is used for acquiring an original text picture to be identified, wherein the original text picture is rectangular;
And the first rotating unit is used for rotating the original text picture to the horizontal direction to serve as the text picture to be recognized if the original text picture is detected to incline relative to the horizontal direction.
Further optionally, the above-mentioned picture obtaining unit may further include:
the height-width ratio calculating unit is used for calculating the height-width ratio of the original text picture in the horizontal direction after the processing of the first rotating unit;
and the second rotating unit is used for rotating the original text picture in the horizontal direction by 90 degrees to serve as the text picture to be identified if the aspect ratio exceeds the set threshold value.
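The aspect-ratio check performed by these two rotating units can be sketched as follows. The threshold value is an assumption — the patent only speaks of a "set threshold" — and the function name is hypothetical.

```python
import numpy as np

ASPECT_THRESHOLD = 1.5  # assumed value; the patent only specifies "a set threshold"

def normalize_orientation(picture):
    """If the horizontal text picture is much taller than it is wide
    (likely vertical text), rotate it 90 degrees before recognition."""
    h, w = picture.shape[:2]
    if h / w > ASPECT_THRESHOLD:
        picture = np.rot90(picture)  # 90-degree rotation
    return picture

tall = np.zeros((120, 30))   # height-width ratio 4.0 -> rotated
wide = np.zeros((30, 120))   # height-width ratio 0.25 -> unchanged
print(normalize_orientation(tall).shape, normalize_orientation(wide).shape)
# (30, 120) (30, 120)
```

After this step both pictures lie horizontally, matching the orientation the recognition model was trained on.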
Optionally, the apparatus of the present application may further include:
And the third rotating unit is used for rotating the forward text picture by 180 degrees by taking the text picture as the forward text picture before inputting the text picture into the pre-constructed convolutional network to obtain the reverse text picture. On this basis, the convolutional network processing unit may include:
The forward and reverse text picture input unit is used for respectively inputting the forward text picture and the reverse text picture into a convolution network in the text recognition model to obtain text content and confidence coefficient of the text content contained in the forward text picture output by the text recognition model and text content and confidence coefficient of the text content contained in the reverse text picture output by the text recognition model;
and the confidence coefficient selection unit is used for taking one with high confidence coefficient among the text content contained in the forward text picture and the text content contained in the reverse text picture as a final recognition result.
The text recognition device provided by the embodiment of the application can be applied to text recognition equipment, such as a terminal: cell phones, computers, etc. Alternatively, fig. 9 shows a block diagram of a hardware structure of the text recognition apparatus, and referring to fig. 9, the hardware structure of the text recognition apparatus may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4;
In the embodiment of the application, the number of the processor 1, the communication interface 2, the memory 3 and the communication bus 4 is at least one, and the processor 1, the communication interface 2 and the memory 3 complete the communication with each other through the communication bus 4;
The processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention;

The memory 3 may comprise a high-speed RAM memory, and may further comprise a non-volatile memory, such as at least one magnetic disk memory;
wherein the memory stores a program, the processor is operable to invoke the program stored in the memory, the program operable to:
acquiring a text picture to be identified, wherein the text picture is an image area where the text to be identified is located;
extracting image features in at least two different directions from the text picture;
and identifying text content contained in the text picture based on the extracted image features in the at least two different directions.
Alternatively, the refinement function and the extension function of the program may be described with reference to the above.
The embodiment of the present application also provides a storage medium storing a program adapted to be executed by a processor, the program being configured to:
acquiring a text picture to be identified, wherein the text picture is an image area where the text to be identified is located;
extracting image features in at least two different directions from the text picture;
and identifying text content contained in the text picture based on the extracted image features in the at least two different directions.
Alternatively, the refinement function and the extension function of the program may be described with reference to the above.
Finally, it is further noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
In the present specification, each embodiment is described in a progressive manner, and each embodiment focuses on the difference from other embodiments, and may be combined according to needs, and the same similar parts may be referred to each other.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (12)

1. A method of text recognition, comprising:
acquiring a text picture to be identified, wherein the text picture is an image area where the text to be identified is located;
extracting image features in at least two different directions from the text picture, so that the extracted image features contain feature information in a plurality of different directions of the text to be identified in the text picture;
and identifying text content contained in the text picture based on the extracted image features in the at least two different directions.
2. The method of claim 1, wherein the extracting image features in at least two different directions for the text picture comprises:
Inputting the text picture into a pre-constructed convolution network;
And extracting image features in at least two different directions of the text picture by using the convolution network, wherein the feature map output by each convolution layer in the convolution network is formed by fusing at least two feature subgraphs, and the at least two feature subgraphs comprise a same convolution kernel obtained by carrying out convolution operation on the feature map output by the previous convolution layer before rotation and after at least one rotation.
3. The method of claim 2, wherein the extracting image features in at least two different directions of the text picture using the convolutional network comprises:
performing convolution operation on the feature map output by the previous convolution layer by utilizing the convolution kernel of each convolution layer in the convolution network to obtain feature subgraphs extracted by each convolution kernel, wherein the convolution kernel of each convolution layer comprises an original convolution kernel and a convolution kernel which is rotated at least once;
fusing the feature subgraphs extracted by the original convolution kernels and by the rotated convolution kernels, and inputting the fused feature map into a next convolution layer;
And taking the feature map output by the last convolution layer of the convolution network as the image feature of the text picture.
4. The method of claim 2, wherein the extracting image features in at least two different directions of the text picture using the convolutional network comprises:
rotating the convolution kernel of each convolution layer in the convolution network at least once, and performing convolution operation on the feature graph output by the previous convolution layer by utilizing the convolution kernels before and after rotation to obtain feature subgraphs extracted by each convolution kernel before and after rotation;
fusing the feature subgraphs extracted by the convolution kernels before and after rotation, and inputting the fused feature map into a next convolution layer;
And taking the feature map output by the last convolution layer of the convolution network as the image feature of the text picture.
5. The method of claim 2, wherein the at least two feature subgraphs comprise:
the same convolution kernel carries out a convolution operation on the feature diagram output by the previous convolution layer before rotation to obtain a feature subgraph; and
After the same convolution kernel is rotated by 90 degrees, 180 degrees and/or 270 degrees in the set direction, the rotated convolution kernel performs the convolution operation on the feature map output by the previous convolution layer to obtain a feature subgraph.
6. The method of claim 2, wherein the identifying text content contained in the text picture based on the extracted image features in the at least two different directions comprises:
inputting the extracted image features in at least two different directions into a pre-constructed recognition network to obtain text content contained in the text picture output by the recognition network;
the recognition network and the convolution network form a text recognition model, and the text recognition model is obtained by training sample picture training data marked with text content recognition results.
7. The method according to any one of claims 1-6, wherein the obtaining a text picture to be identified comprises:
acquiring an original text picture to be identified;
And if the original text picture is detected to incline relative to the horizontal direction, rotating the original text picture to the horizontal direction to serve as the text picture to be identified.
8. The method of claim 7, further comprising, after said rotating said original text picture to a horizontal orientation:
Calculating the height-width ratio of the original text picture in the horizontal direction;
and if the aspect ratio exceeds the set threshold, rotating the original text picture in the horizontal direction by 90 degrees to serve as the text picture to be identified.
9. The method of claim 6, wherein prior to inputting the text picture into the pre-constructed convolutional network, the method further comprises:
taking the text picture as a forward text picture, and rotating the forward text picture by 180 degrees to obtain a reverse text picture;
the inputting the text picture into a pre-constructed convolutional network comprises:
Respectively inputting the forward text picture and the reverse text picture into a convolution network in the text recognition model to obtain text content and confidence coefficient contained in the forward text picture output by the text recognition model and text content and confidence coefficient contained in the reverse text picture output by the text recognition model;
And taking one with high confidence as a final recognition result from the text content contained in the forward text picture and the text content contained in the reverse text picture.
10. A text recognition device, comprising:
The image acquisition unit is used for acquiring a text image to be identified, wherein the text image is an image area where the text to be identified is located;
the feature extraction unit is used for extracting image features in at least two different directions from the text picture, so that the extracted image features contain feature information in a plurality of different directions of the text to be identified in the text picture;
And the text content identification unit is used for identifying text content contained in the text picture based on the extracted image features in the at least two different directions.
11. A text recognition device, comprising: a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement the respective steps of the text recognition method according to any one of claims 1 to 9.
12. A storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the text recognition method according to any one of claims 1 to 9.
CN202110666915.6A 2021-06-16 2021-06-16 Text recognition method, device, equipment and storage medium Active CN113392825B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110666915.6A CN113392825B (en) 2021-06-16 2021-06-16 Text recognition method, device, equipment and storage medium
PCT/CN2021/139972 WO2022262239A1 (en) 2021-06-16 2021-12-21 Text identification method, apparatus and device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110666915.6A CN113392825B (en) 2021-06-16 2021-06-16 Text recognition method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113392825A CN113392825A (en) 2021-09-14
CN113392825B true CN113392825B (en) 2024-04-30

Family

ID=77621485

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110666915.6A Active CN113392825B (en) 2021-06-16 2021-06-16 Text recognition method, device, equipment and storage medium

Country Status (2)

Country Link
CN (1) CN113392825B (en)
WO (1) WO2022262239A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113392825B (en) * 2021-06-16 2024-04-30 中国科学技术大学 Text recognition method, device, equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111222589A (en) * 2018-11-27 2020-06-02 中国移动通信集团辽宁有限公司 Image text recognition method, device, equipment and computer storage medium
CN111400497A (en) * 2020-03-19 2020-07-10 北京远鉴信息技术有限公司 Text recognition method and device, storage medium and electronic equipment

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105320961A (en) * 2015-10-16 2016-02-10 重庆邮电大学 Handwriting numeral recognition method based on convolutional neural network and support vector machine
CN111783756B (en) * 2019-04-03 2024-04-16 北京市商汤科技开发有限公司 Text recognition method and device, electronic equipment and storage medium
CN110659633A (en) * 2019-08-15 2020-01-07 坎德拉(深圳)科技创新有限公司 Image text information recognition method and device and storage medium
CN112101351B (en) * 2020-09-07 2024-04-19 凌云光技术股份有限公司 Text line rotation correction method and device based on projection
AU2021100391A4 (en) * 2021-01-22 2021-04-15 GRG Banking Equipment Co.,Ltd Natural Scene Text Recognition Method Based on Sequence Transformation Correction and Attention Mechanism
CN113392825B (en) * 2021-06-16 2024-04-30 中国科学技术大学 Text recognition method, device, equipment and storage medium

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111222589A (en) * 2018-11-27 2020-06-02 中国移动通信集团辽宁有限公司 Image text recognition method, device, equipment and computer storage medium
CN111400497A (en) * 2020-03-19 2020-07-10 北京远鉴信息技术有限公司 Text recognition method and device, storage medium and electronic equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Accurate, data-efficient, unconstrained text recognition with convolutional neural networks; Mohamed Yousef et al; Pattern Recognition; vol. 108; 1-12 *
Text detection in natural scenes based on a lightweight network; Sun Jingjing et al; Electronic Measurement Technology; vol. 43, no. 8; 101-107 *

Also Published As

Publication number Publication date
WO2022262239A1 (en) 2022-12-22
CN113392825A (en) 2021-09-14

Similar Documents

Publication Publication Date Title
CN107704838B (en) Target object attribute identification method and device
CN109961009B (en) Pedestrian detection method, system, device and storage medium based on deep learning
CN109325954B (en) Image segmentation method and device and electronic equipment
US10134165B2 (en) Image distractor detection and processing
US20200380263A1 (en) Detecting key frames in video compression in an artificial intelligence semiconductor solution
CN109859096A (en) Image Style Transfer method, apparatus, electronic equipment and storage medium
WO2021208667A1 (en) Image processing method and apparatus, electronic device, and storage medium
CN108304775A (en) Remote sensing images recognition methods, device, storage medium and electronic equipment
JP2013522971A (en) Image feature detection based on the application of multiple feature detectors
CN105512220B (en) Image page output method and device
CN112101359B (en) Text formula positioning method, model training method and related device
WO2022002262A1 (en) Character sequence recognition method and apparatus based on computer vision, and device and medium
CN107644423B (en) Scene segmentation-based video data real-time processing method and device and computing equipment
US20200005689A1 (en) Generating three-dimensional user experience based on two-dimensional media content
CN113392825B (en) Text recognition method, device, equipment and storage medium
CN110298327A (en) A kind of visual effect processing method and processing device, storage medium and terminal
WO2022063321A1 (en) Image processing method and apparatus, device and storage medium
CN111597937B (en) Fish gesture recognition method, device, equipment and storage medium
CN109871814B (en) Age estimation method and device, electronic equipment and computer storage medium
WO2020228171A1 (en) Data enhancement method and device, and computer readable storage medium
CN108304838B (en) Picture information identification method and terminal
CN113255667B (en) Text image similarity evaluation method and device, electronic equipment and storage medium
CN113610864B (en) Image processing method, device, electronic equipment and computer readable storage medium
CN106504223B (en) The reference angle determination method and device of picture
CN113298098A (en) Fundamental matrix estimation method and related product

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20230506

Address after: 230026 Jinzhai Road, Baohe District, Hefei, Anhui Province, No. 96

Applicant after: University of Science and Technology of China

Applicant after: IFLYTEK Co.,Ltd.

Address before: NO.666, Wangjiang West Road, hi tech Zone, Hefei City, Anhui Province

Applicant before: IFLYTEK Co.,Ltd.

TA01 Transfer of patent application right
GR01 Patent grant
GR01 Patent grant