CN113392825A - Text recognition method, device, equipment and storage medium

Text recognition method, device, equipment and storage medium

Info

Publication number
CN113392825A
CN113392825A
Authority
CN
China
Prior art keywords
text
convolution
picture
feature
text picture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110666915.6A
Other languages
Chinese (zh)
Other versions
CN113392825B (en
Inventor
赵坤
杨争艳
吴嘉嘉
殷兵
胡金水
刘聪
胡国平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN202110666915.6A priority Critical patent/CN113392825B/en
Publication of CN113392825A publication Critical patent/CN113392825A/en
Priority to PCT/CN2021/139972 priority patent/WO2022262239A1/en
Application granted granted Critical
Publication of CN113392825B publication Critical patent/CN113392825B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The application discloses a text recognition method, a text recognition device, text recognition equipment and a storage medium. For a text picture to be recognized, in view of the diverse directions in which its text content may run, the direction information captured during image feature extraction is strengthened: features are extracted from the text picture in two or more different directions, so that the extracted image features contain feature information about the text to be recognized in multiple directions.

Description

Text recognition method, device, equipment and storage medium
Technical Field
The present application relates to the field of natural language processing technologies, and in particular, to a text recognition method, apparatus, device, and storage medium.
Background
With the development of text recognition technology, text recognition is applied ever more widely in real life, for example in road sign recognition for automatic driving, photo translation, and document scanning and recognition.
In real scenes, text regions are distributed in various directions within a picture, as shown in fig. 1, including horizontal text, oblique text, vertical text, and the like. This diversity of direction in the text pictures to be recognized makes text recognition considerably more challenging, and accurately recognizing the text in such pictures has become a problem the industry urgently needs to solve.
Disclosure of Invention
In view of the foregoing problems, the present application provides a text recognition method, apparatus, device and storage medium for accurately recognizing text in pictures whose text runs in diverse directions. The specific scheme is as follows:
a text recognition method, comprising:
acquiring a text picture to be recognized, wherein the text picture is an image area where a text to be recognized is located;
extracting image features in at least two different directions from the text picture;
identifying text content contained in the text picture based on the extracted image features in the at least two different directions.
Preferably, the extracting image features in at least two different directions from the text picture includes:
inputting the text picture into a pre-constructed convolution network;
and extracting image features of the text picture in at least two different directions by using the convolution network, wherein the feature graph output by each convolution layer in the convolution network is formed by fusing at least two feature subgraphs, and the at least two feature subgraphs are obtained by performing convolution operation on the feature graph output by the previous convolution layer before rotation and after at least one rotation of the same convolution kernel.
Preferably, the extracting, by using the convolutional network, image features in at least two different directions of the text picture includes:
performing convolution operation on a feature graph output by a previous convolution layer by utilizing a convolution kernel of each convolution layer in the convolution network to obtain a feature subgraph extracted by each convolution kernel, wherein the convolution kernel of each convolution layer comprises an original convolution kernel and a convolution kernel after at least one rotation;
fusing the feature subgraphs extracted by the original convolution kernels and by the rotated convolution kernels, and inputting the fused feature graph into the next convolution layer;
and taking the feature graph output by the last convolution layer of the convolution network as the image feature of the text picture.
Preferably, the extracting, by using the convolutional network, image features in at least two different directions of the text picture includes:
performing at least one rotation on the convolution kernel of each convolution layer in the convolution network, and performing convolution operation on the feature graph output by the previous convolution layer by using the convolution kernels before and after the rotation to obtain feature subgraphs extracted from each convolution kernel before and after the rotation;
fusing the feature subgraphs extracted by the convolution kernels before rotation and by the rotated convolution kernels, and inputting the fused feature graph into the next convolution layer;
and taking the feature graph output by the last convolution layer of the convolution network as the image feature of the text picture.
Preferably, the at least two feature subgraphs comprise:
a feature subgraph obtained by performing, with the same convolution kernel before rotation, a convolution operation on the feature graph output by the previous convolution layer; and
a feature subgraph obtained, after the same convolution kernel is rotated by 90 degrees, 180 degrees and/or 270 degrees in a set direction, by performing, with the rotated convolution kernel, a convolution operation on the feature graph output by the previous convolution layer.
Preferably, the identifying text content contained in the text picture based on the extracted image features in the at least two different directions includes:
inputting the extracted image characteristics in the at least two different directions into a pre-constructed identification network to obtain text contents contained in the text picture output by the identification network;
the recognition network and the convolution network form a text recognition model, and the text recognition model is obtained by training sample picture training data marked with a text content recognition result.
Preferably, the acquiring the text picture to be recognized includes:
acquiring an original text picture to be identified;
and if the original text picture is detected to be inclined relative to the horizontal direction, rotating the original text picture to the horizontal direction to be used as the text picture to be identified.
Preferably, after the rotating the original text picture to the horizontal direction, the method further comprises:
calculating the aspect ratio of the original text picture in the horizontal direction;
and if the aspect ratio is determined to exceed the set threshold value, rotating the original text picture in the horizontal direction by 90 degrees to serve as the text picture to be recognized.
Preferably, before inputting the text picture into a pre-constructed convolutional network, the method further comprises:
taking the text picture as a forward text picture, and rotating the forward text picture by 180 degrees to obtain a reverse text picture;
then, the inputting the text picture into a pre-constructed convolutional network includes:
respectively inputting the forward text picture and the reverse text picture into a convolution network in the text recognition model to obtain text content and confidence thereof contained in the forward text picture output by the text recognition model and text content and confidence thereof contained in the reverse text picture output by the text recognition model;
and taking, of the text content contained in the forward text picture and the text content contained in the reverse text picture, the one with the higher confidence as the final recognition result.
A text recognition apparatus comprising:
the image acquisition unit is used for acquiring a text image to be identified, wherein the text image is an image area where the text to be identified is located;
the feature extraction unit is used for extracting image features in at least two different directions from the text picture;
and the text content identification unit is used for identifying the text content contained in the text picture based on the extracted image features in the at least two different directions.
A text recognition apparatus comprising: a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement the steps of the text recognition method.
A storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the text recognition method as described above.
By means of the above technical scheme, the text picture corresponding to the image area where the text to be recognized is located is obtained, image features in at least two different directions are extracted from that text picture, and the text content contained in the text picture is recognized based on the extracted image features in the at least two different directions. For a text picture to be recognized, in view of the diverse directions in which its text content may run, the direction information captured during image feature extraction is strengthened: features are extracted from the text picture in two or more different directions, so that the extracted image features contain feature information about the text to be recognized in multiple directions.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 illustrates several text picture diagrams distributed in different directions;
fig. 2 is a schematic flow chart of a text recognition method according to an embodiment of the present application;
FIG. 3 illustrates a process diagram for rotating an original text picture to a horizontal orientation;
FIG. 4 illustrates a process diagram for rotating a text picture to a landscape orientation;
FIG. 5 illustrates a schematic diagram of a process for feature extraction by sharing a rotating convolution kernel between two adjacent convolution layers;
FIG. 6 illustrates a schematic diagram of a recognition network architecture for a codec structure;
FIG. 7 is a process diagram illustrating a rotation operation for a text picture whose characters are flipped upside down;
fig. 8 is a schematic structural diagram of a text recognition apparatus according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a text recognition device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The scheme can be realized based on a terminal with data processing capacity, and the terminal can be a mobile phone, a computer, a server, a cloud terminal and the like.
Next, as described in conjunction with fig. 2, the text recognition method of the present application may include the following steps:
and S100, acquiring a text picture to be identified.
Specifically, the text picture to be recognized is an image area where the text to be recognized is located. In this step, a text picture that needs to be subjected to text recognition may be directly obtained, or a text region detection may be performed on an original picture that includes a text to be recognized, so as to obtain an image region where the text to be recognized is located.
Further, in order to facilitate text recognition, the text picture acquired in this step may be a text line picture, that is, an image area where a line of text is located.
Step S110, extracting image characteristics in at least two different directions from the text picture.
Specifically, the direction of the text content of the text picture to be recognized is not fixed, and in order to adapt to recognition of text contents in multiple different directions, the extracted direction information is strengthened in the step of extracting the image features, that is, feature extraction is performed on the text picture from two or more different directions, so that the extracted image features include feature information of the text to be recognized in the text picture in multiple different directions. Of course, there may be many different implementation manners for the extraction manner of the image features, and this step is not strictly limited.
Step S120, identifying text content contained in the text picture based on the extracted image features in the at least two different directions.
Specifically, after the image features in at least two different directions of the text picture to be recognized are extracted, based on the extracted image features, the text content contained in the text picture can be recognized more accurately, and the accuracy of text recognition is improved.
The text recognition method provided by this embodiment of the application obtains the text picture corresponding to the image area where the text to be recognized is located, extracts image features in at least two different directions from that text picture, and recognizes the text content contained in the text picture based on the extracted image features in the at least two different directions. For a text picture to be recognized, in view of the diverse directions in which its text content may run, the direction information captured during image feature extraction is strengthened: features are extracted from the text picture in two or more different directions, so that the extracted image features contain feature information about the text to be recognized in multiple directions.
In some embodiments of the present application, a process of obtaining the text picture to be recognized in the foregoing step S100 is further described.
In this embodiment, an original text picture to be identified is first obtained, where the original text picture is obtained by text region detection. In general, the original text picture may be a text line picture. The shape of the original text picture may also be different according to the detection means used in the text region detection, for example, the original text picture may be a rectangle, a parallelogram, or other optional shape. In general, the application can select a rectangular original text picture.
Since text regions may differ in position, orientation and character direction across scenes, the obtained original text picture may be inclined with respect to the horizontal direction, that is, none of its edges is parallel to the horizontal direction. As shown in fig. 3, the original text picture on the left is inclined with respect to the horizontal direction.
In order to better extract image features in a plurality of different directions from the text image in the subsequent step, in this embodiment, the original text image may be rotated to the horizontal direction to be used as the text image to be recognized.
The direction of rotation for the original text picture may be counterclockwise or clockwise, as long as it is ensured that one edge of the text picture after rotation is parallel to the horizontal direction.
It can be understood that, after this rotation processing, when image features are extracted with multiple convolution kernels, they can be extracted along the four main directions of the characters' horizontal and vertical central axes, and the extracted image features are more convenient for subsequent text content recognition.
Still further, the text picture after this rotation may take two forms: vertically placed or horizontally placed. Vertically placed text pictures are harder for the subsequent network to process; also, to keep the data uniform, this embodiment may adjust such pictures to horizontal placement. The specific implementation process may include:
after the original text picture is rotated to the horizontal direction, the aspect ratio of the original text picture in the horizontal direction is further calculated.
Specifically, by calculating the aspect ratio, it is possible to determine whether the text picture is vertically placed or horizontally placed. For example, an aspect ratio threshold may be preset, such as set to 2 or other selectable value. When the calculated aspect ratio is determined to exceed the set threshold, the text picture can be considered to be vertically placed, and therefore the original text picture in the horizontal direction can be further rotated by 90 degrees to serve as the text picture to be identified; if the calculated aspect ratio is determined not to exceed the set threshold, the text picture can be considered to be transversely placed and can be directly used as the text picture to be recognized without rotation processing.
The process of rotating the original text picture in the horizontal direction by 90 degrees may be performed clockwise or counterclockwise. As shown in fig. 4, for the left text picture in fig. 4, the left text picture is vertically placed and can be rotated by 90 degrees in a clockwise direction.
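As an illustration, this preprocessing can be sketched with OpenCV as follows. This is a sketch only: the tilt angle is assumed to come from the text region detector (e.g. a cv2.minAreaRect angle), height/width is assumed as the aspect ratio of the leveled picture, and the threshold of 2 matches the example above rather than a value fixed by the application.
```python
import cv2
import numpy as np

def normalize_text_picture(img: np.ndarray, tilt_deg: float,
                           ratio_threshold: float = 2.0) -> np.ndarray:
    """Rotate a detected text picture to the horizontal, then lay it flat."""
    # Step 1: undo the tilt so one edge of the picture becomes horizontal
    # (counterclockwise by tilt_deg here; clockwise works equally well).
    # For simplicity the canvas size is kept; a production version would
    # enlarge the canvas so rotated corners are not cropped.
    h, w = img.shape[:2]
    matrix = cv2.getRotationMatrix2D((w / 2, h / 2), tilt_deg, 1.0)
    img = cv2.warpAffine(img, matrix, (w, h))

    # Step 2: if the picture is much taller than it is wide (aspect ratio
    # above the threshold), treat it as vertically placed and rotate it 90
    # degrees clockwise so the text line lies horizontally.
    h, w = img.shape[:2]
    if h / w > ratio_threshold:
        img = cv2.rotate(img, cv2.ROTATE_90_CLOCKWISE)
    return img
```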
In some embodiments of the present application, for the step S110, a process of extracting image features in at least two different directions from the text image is described.
An alternative approach to image feature extraction by convolutional networks is presented in this embodiment. Specifically, the method can train a convolution network in advance for extracting the image features.
The convolutional network may take the form of a Resnet29 network or other form of network with convolutional layers.
The convolutional network comprises several convolutional layers. To extract image features in at least two different directions from the text picture, this embodiment sets the convolution kernels of the convolution layers as shared rotating convolution kernels. Here a convolution kernel is a weight matrix that performs a convolution operation on the feature graph output by the previous convolution layer. A shared rotating convolution kernel keeps its parameters unchanged (i.e., the weights in the weight matrix stay the same) while the kernel is rotated by 90 degrees at least once; the kernel before rotation and the kernels after rotation thus share the same weights.
By sharing the convolution kernel, the feature information in different directions of the text picture can be acquired.
The feature graph output by each convolution layer is formed by fusing at least two feature subgraphs, and the at least two feature subgraphs are obtained by performing convolution operation on the feature graph output by the previous convolution layer before rotation and after at least one rotation of the same convolution kernel.
Specifically, the at least two feature subgraphs may include:
a feature subgraph obtained by performing, with the same convolution kernel before rotation, a convolution operation on the feature graph output by the previous convolution layer; and
a feature subgraph obtained, after the same convolution kernel is rotated by 90 degrees, 180 degrees and/or 270 degrees in a set direction (such as clockwise or counterclockwise), by performing, with the rotated convolution kernel, a convolution operation on the feature graph output by the previous convolution layer.
It can be understood that the convolution kernel, before any rotation, performs a convolution operation on the feature graph output by the previous convolution layer, yielding one feature subgraph. On this basis, the convolution kernel can be rotated by 90 degrees, 180 degrees and/or 270 degrees as described above; after each rotation the kernel convolves the feature graph output by the previous convolution layer again, so each rotation yields one additional feature subgraph: one rotation yields one additional feature subgraph, two rotations yield two, and three rotations yield three.
The feature subgraphs obtained before and after the rotations of the convolution kernel are then fused to obtain the output of the current convolution layer.
Fig. 5 illustrates the process by which two adjacent convolution layers extract features with a shared rotating convolution kernel.
Define the output feature of the previous layer in the convolution network as F_{i-1}, with dimensions C×H×W, where C is the number of channels of the feature, H is its height, and W is its width.
The feature F_{i-1} is taken as the input of the current convolution layer, and the size of the convolution kernel is set to 3×3 (fig. 5 is only an example; the convolution kernel may have other sizes, which is not strictly limited in this application).
To enable the convolution kernel to capture text features in different directions, the convolution kernel may be rotated, in this embodiment by 90, 180 and 270 degrees counterclockwise. Each convolution kernel, before and after rotation, then performs a convolution operation on the feature graph output by the previous layer, obtaining feature information in the four main directions. The feature subgraphs obtained from the convolution operation of each kernel are F_{i,0}, F_{i,90}, F_{i,180} and F_{i,270}, and the output of the current layer is their fusion F_i. Each feature is calculated as follows:
F_{i,0} = conv_0(F_{i-1})
F_{i,90} = conv_90(F_{i-1})
F_{i,180} = conv_180(F_{i-1})
F_{i,270} = conv_270(F_{i-1})
F_i = cat(F_{i,0}, F_{i,90}, F_{i,180}, F_{i,270})
where cat(·) denotes the splicing (concatenation) of features.
The fused output feature F_i has size 4C'×H'×W', where C' is the number of channels of the current convolution layer's features and H' and W' are their height and width.
The fused output feature F_i is input into the next convolution layer for processing, and so on; the features extracted by each convolution layer thus contain information from different directions, enhancing the directionality of the extracted image features, and the features output by the last convolution layer are taken as the image feature F extracted by the convolution network.
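A minimal PyTorch sketch of such a layer is given below; the class name, weight initialization and 'same' padding are our assumptions, but the forward pass follows the formulas above: one shared weight tensor is rotated by 0/90/180/270 degrees, each orientation convolves the input, and the four feature subgraphs are spliced on the channel dimension.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedRotatedConv2d(nn.Module):
    """One convolution layer with a shared rotating kernel: a single weight
    tensor is applied as-is and after 90/180/270-degree rotations, and the
    four feature subgraphs are concatenated on the channel dimension, i.e.
    F_i = cat(F_{i,0}, F_{i,90}, F_{i,180}, F_{i,270})."""

    def __init__(self, in_channels: int, out_channels: int, kernel_size: int = 3):
        super().__init__()
        self.weight = nn.Parameter(
            torch.empty(out_channels, in_channels, kernel_size, kernel_size))
        nn.init.kaiming_normal_(self.weight)
        self.padding = kernel_size // 2  # keeps H x W for odd kernel sizes

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        subgraphs = []
        for k in range(4):  # 0, 90, 180, 270 degrees counterclockwise
            w = torch.rot90(self.weight, k, dims=(2, 3))  # rotate spatial dims only
            subgraphs.append(F.conv2d(x, w, padding=self.padding))
        # Channel count becomes 4 * out_channels (the 4C' in the text).
        return torch.cat(subgraphs, dim=1)
```
Because all four orientations reuse self.weight, gradients from every direction flow back into the same parameters, which is what makes the kernel "shared".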
Based on the above-described convolutional network, when extracting image features in at least two different directions for a text picture, the text picture can be input into the convolutional network. The method comprises the steps of extracting image features of at least two different directions of a text picture by utilizing a convolution network, wherein a feature graph output by each convolution layer in the convolution network is formed by fusing at least two feature subgraphs, and the at least two feature subgraphs are obtained by performing convolution operation on the feature graph output by the previous convolution layer before rotation and after at least one rotation of the same convolution kernel.
It is understood that each rotation of the convolution kernel lets it extract image features in one more direction; with the rotation scheme illustrated in fig. 5, image features in four directions can be extracted.
Of course, compared with using the convolution kernel without any rotation, every additional rotation extracts image features in one more direction, which provides more accurate image features for subsequent text recognition and improves its accuracy.
In some embodiments of the present application, two alternative implementation architectures of the above convolutional network and an implementation process of image feature extraction using the convolutional network are further described.
In an alternative, the convolution kernels of each convolution layer in the convolution network may include the original convolution kernel and the convolution kernel after at least one rotation of the original convolution kernel.
On this basis, the process of extracting image features in at least two different directions of the text picture by using a convolutional network may include:
S1, performing a convolution operation on the feature graph output by the previous convolution layer with the convolution kernels of each convolution layer in the convolution network, to obtain the feature subgraph extracted by each convolution kernel.
The convolution kernel of each convolution layer comprises an original convolution kernel and convolution kernels rotated for at least one time, so that characteristic subgraphs extracted from the original convolution kernel and the convolution kernels rotated for at least one time can be obtained.
S2, fusing the feature subgraphs extracted by the original convolution kernels and by the rotated convolution kernels, and inputting the fused feature graph into the next convolution layer.
Specifically, the fusion process of the feature subgraphs extracted by the original convolution kernel and the rotated convolution kernels may be introduced in conjunction with the foregoing description, for example, the feature subgraphs are spliced together in the channel dimension.
S3, taking the feature graph output by the last convolution layer of the convolution network as the image feature of the text picture.
In the other alternative, the convolution kernels of each convolution layer in the convolution network include only the original convolution kernels, without the rotated convolution kernels.
On this basis, the process of extracting image features in at least two different directions of the text picture by using a convolutional network may include:
S1, rotating the convolution kernel of each convolution layer in the convolution network at least once, and performing convolution operations on the feature graph output by the previous convolution layer with the convolution kernels before and after rotation, to obtain the feature subgraphs extracted by each convolution kernel before and after rotation.
It can be understood that, when convolution layers of a convolution network only include original convolution kernels, in order to extract image features from multiple different directions, when feature extraction is performed by using the convolution network, it is necessary to first perform at least one rotation on the convolution kernels of each convolution layer, and then perform convolution operations on feature maps output by a previous convolution layer by using convolution kernels before and after rotation, so as to obtain feature subgraphs extracted by each convolution kernel before and after rotation.
S2, fusing the feature subgraphs extracted by the convolution kernels before rotation and by the rotated convolution kernels, and inputting the fused feature graph into the next convolution layer.
S3, taking the feature graph output by the last convolution layer of the convolution network as the image feature of the text picture.
Comparing the architectures of the two convolution networks, it can be known that the former convolution network is configured with a plurality of convolution kernels before and after rotation in advance, and further when image feature extraction is performed, each convolution kernel can be directly used for feature extraction. In the latter convolution network, only the convolution kernel before rotation is configured, and therefore, when image feature extraction is performed, at least one rotation needs to be performed on the convolution kernel first, and further, feature extraction can be performed by using each convolution kernel before and after rotation. Both implementations can achieve extraction of multi-directional image features, which can be specifically selected by a skilled person according to actual needs.
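Under the same assumptions as the SharedRotatedConv2d sketch above, the first variant (rotated kernels materialized in advance) can be approximated for inference as follows; folding the four orientations into one wide convolution is our simplification, equivalent to running the four convolutions and concatenating their outputs, since output channels preserve the stacking order.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrerotatedConv2d(nn.Module):
    """First variant from the text: the original kernel and its rotated
    copies are stored ahead of time, so the forward pass is a single (wide)
    convolution. Shown for inference; during training the rotated copies
    must keep tracking the shared weight, as in the sketch above."""

    def __init__(self, shared_weight: torch.Tensor, padding: int = 1):
        super().__init__()
        # Stack the original kernel and its 90/180/270-degree rotations on
        # the output-channel dimension.
        rotated = [torch.rot90(shared_weight, k, dims=(2, 3)) for k in range(4)]
        self.register_buffer("weight", torch.cat(rotated, dim=0))
        self.padding = padding

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.conv2d(x, self.weight, padding=self.padding)
```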
Based on the above-described process of extracting image features in at least two different directions from the text picture using a convolution network, this embodiment of the application further describes step S120, the process of recognizing the text content contained in the text picture based on the extracted image features in the at least two different directions.
In this embodiment, a neural network model may be selected to process the text recognition task, that is, a recognition network may be trained in advance, and the recognition network and the convolutional network jointly form a text recognition model. Specifically, the output of the convolutional network is used as the input of the recognition network, and the convolutional network and the recognition network are jointly trained.
And when the text recognition model is trained, training is carried out by utilizing the sample picture training data marked with the text content recognition result.
On the basis, the text picture to be recognized is input into the convolution network of the text recognition model, so that the image features in at least two different directions can be extracted, the extracted image features are further input into the recognition network, and the text content contained in the text picture is output by the recognition network.
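As a sketch, the composition of the two networks into one text recognition model might look as follows; the two sub-networks passed in are assumptions standing in for the application's convolution network and recognition network.
```python
import torch.nn as nn

class TextRecognitionModel(nn.Module):
    """Convolution network (feature extractor) feeding a recognition network
    (decoder); the two are trained jointly on sample pictures labeled with
    their text content recognition results."""

    def __init__(self, conv_network: nn.Module, recognition_network: nn.Module):
        super().__init__()
        self.conv_network = conv_network
        self.recognition_network = recognition_network

    def forward(self, picture):
        features = self.conv_network(picture)       # image features in >= 2 directions
        return self.recognition_network(features)   # predicted text content
```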
The recognition network may adopt various neural network architectures; for example, an Encoder-Decoder architecture may be adopted, as shown in fig. 6.
The encoder may adopt a bidirectional LSTM (Long Short-Term Memory) structure; taking the image feature F output by the convolution network in the previous step as input, it outputs a hidden state h_i for each frame.
The decoder may adopt a GRU (Gated Recurrent Unit) or LSTM structure. For the decoder's hidden state s_t at the current moment, an attention mechanism can be used to compute a context feature vector c_t from s_t and the encoder's per-frame hidden states h_i. The calculation process is as follows:
e_{t,i} = o(s_t, h_i)
α_{t,i} = exp(e_{t,i}) / Σ_k exp(e_{t,k})
c_t = Σ_i α_{t,i} · h_i
where o denotes a dot-product operation, T denotes the encoder length, and the indices i and k range over 1..T.
Finally, the decoder's text prediction y_t at the current moment is obtained jointly from the current hidden state s_t and the context feature vector c_t through the linear classification layer W.
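One attention decoding step can be sketched as follows, assuming (for the dot-product score e_{t,i} = o(s_t, h_i) to be well defined) that encoder and decoder hidden sizes match; the tensor layout and the linear classifier, which should be an nn.Linear(2 * d, vocabulary_size), are illustrative.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def attention_decode_step(s_t: torch.Tensor,   # decoder hidden state, (batch, d)
                          h: torch.Tensor,     # encoder states h_i, (batch, T, d)
                          classifier: nn.Linear) -> torch.Tensor:
    """e_{t,i} = s_t . h_i ; alpha_t = softmax(e_t) ; c_t = sum_i alpha_{t,i} h_i ;
    y_t = W[s_t; c_t], following the formulas above."""
    e_t = torch.bmm(h, s_t.unsqueeze(2)).squeeze(2)      # (batch, T) attention scores
    alpha_t = F.softmax(e_t, dim=1)                      # weights over the T frames
    c_t = torch.bmm(alpha_t.unsqueeze(1), h).squeeze(1)  # (batch, d) context vector
    y_t = classifier(torch.cat([s_t, c_t], dim=1))       # logits over the character set
    return y_t
```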
In some embodiments of the present application, another implementation of a text recognition method is presented.
The acquired text picture to be recognized may suffer from character inversion; as shown in fig. 7, the text in the upper picture of fig. 7 is upside down.
In addition, during the rotations corresponding to fig. 3 and fig. 4, the final text picture may also end up with its characters turned over.
If a text picture with this character-flipping problem is fed into the text recognition model, the recognized text content may be inaccurate, or its word order may come out reversed.
For this reason, in this embodiment, before the text picture is input into the pre-constructed convolutional network, the following processing steps are further added:
and taking the text picture as a forward text picture, and rotating the forward text picture by 180 degrees to obtain a reverse text picture.
On the basis, the forward text picture and the reverse text picture are respectively input into a convolution network in a text recognition model, and text content and confidence degree contained in the forward text picture output by the text recognition model and text content and confidence degree contained in the reverse text picture output by the text recognition model are obtained.
Of the text content contained in the forward text picture and the text content contained in the reverse text picture, the one with the higher confidence is taken as the final recognition result.
By inputting the forward and reverse text pictures into the text recognition model separately and selecting the recognized text content with the higher confidence as the final result, the method adapts to text pictures in different directions, and the final recognition result obtained is more accurate.
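A sketch of this forward/reverse selection, assuming a recognizer callable that returns a (text, confidence) pair for a picture tensor; the helper names are ours.
```python
import torch

def recognize_with_flip(recognizer, picture: torch.Tensor):
    """Run the model on the picture and on its 180-degree rotation, and keep
    whichever recognition the model is more confident about."""
    reverse = torch.rot90(picture, 2, dims=(-2, -1))  # 180-degree rotation
    text_fwd, conf_fwd = recognizer(picture)
    text_rev, conf_rev = recognizer(reverse)
    return text_fwd if conf_fwd >= conf_rev else text_rev
```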
The following describes the text recognition apparatus provided in the embodiments of the present application, and the text recognition apparatus described below and the text recognition method described above may be referred to correspondingly.
Referring to fig. 8, fig. 8 is a schematic structural diagram of a text recognition apparatus disclosed in the embodiment of the present application.
As shown in fig. 8, the apparatus may include:
the image obtaining unit 11 is configured to obtain a text image to be identified, where the text image is an image area where a text to be identified is located;
a feature extraction unit 12, configured to extract image features in at least two different directions from the text picture;
a text content identification unit 13, configured to identify text content included in the text picture based on the extracted image features in the at least two different directions.
Optionally, the feature extraction unit may include:
the convolution network processing unit is used for inputting the text picture into a pre-constructed convolution network; and extracting image features of the text picture in at least two different directions by using the convolution network, wherein the feature graph output by each convolution layer in the convolution network is formed by fusing at least two feature subgraphs, and the at least two feature subgraphs are obtained by performing convolution operation on the feature graph output by the previous convolution layer before rotation and after at least one rotation of the same convolution kernel.
Optionally, the embodiment of the present application provides two optional implementation structures of a convolution network processing unit, which are respectively as follows:
first, the convolutional network processing unit includes:
the first convolution operation unit is used for performing convolution operation on a feature map output by a previous convolution layer by using a convolution kernel of each convolution layer in the convolution network to obtain a feature subgraph extracted by each convolution kernel, wherein the convolution kernel of each convolution layer comprises an original convolution kernel and a convolution kernel after at least one rotation;
the first feature fusion unit is used for fusing the feature subgraphs extracted by the original convolution kernels and by the rotated convolution kernels, and inputting the fused feature graph into the next convolution layer;
and the first convolution output unit is used for taking a feature map output by the last convolution layer of the convolution network as the image feature of the text picture.
Second, the convolutional network processing unit includes:
the convolution kernel rotating unit is used for rotating the convolution kernel of each convolution layer in the convolution network at least once;
the second convolution operation unit is used for performing convolution operation on the feature graph output by the previous convolution layer by utilizing the convolution kernels before and after rotation to obtain feature subgraphs extracted by each convolution kernel before and after rotation;
the second feature fusion unit is used for fusing the feature subgraphs extracted by the convolution kernels before rotation and by the rotated convolution kernels, and inputting the fused feature graph into the next convolution layer;
and the second convolution output unit is used for taking the feature graph output by the last convolution layer of the convolution network as the image feature of the text picture.
Optionally, the at least two feature sub-graphs may include:
a feature subgraph obtained by performing, with the same convolution kernel before rotation, a convolution operation on the feature graph output by the previous convolution layer; and
a feature subgraph obtained, after the same convolution kernel is rotated by 90 degrees, 180 degrees and/or 270 degrees in a set direction, by performing, with the rotated convolution kernel, a convolution operation on the feature graph output by the previous convolution layer.
Optionally, the text content recognition unit may include:
the recognition network processing unit is used for inputting the extracted image characteristics in the at least two different directions into a pre-constructed recognition network to obtain text contents contained in the text picture output by the recognition network; the recognition network and the convolution network form a text recognition model, and the text recognition model is obtained by training sample picture training data marked with a text content recognition result.
Optionally, the picture acquiring unit may include:
the device comprises an original image acquisition unit, a recognition unit and a recognition unit, wherein the original image acquisition unit is used for acquiring an original text image to be recognized, and the original text image is rectangular;
and the first rotation unit is used for rotating the original text picture to the horizontal direction as the text picture to be identified if the original text picture is detected to be inclined relative to the horizontal direction.
Further optionally, the picture acquiring unit may further include:
an aspect ratio calculation unit configured to calculate an aspect ratio of the original text picture in the horizontal direction after the processing by the first rotation unit;
and the second rotating unit is used for rotating the original text picture in the horizontal direction by 90 degrees to serve as the text picture to be recognized if the aspect ratio is determined to exceed the set threshold.
Optionally, the apparatus of the present application may further include:
and the third rotation unit is used for taking the text picture as a forward text picture and rotating the forward text picture by 180 degrees to obtain a reverse text picture before inputting the text picture into the pre-constructed convolution network. On this basis, the convolution network processing unit may include:
a forward and reverse text picture input unit, configured to input the forward text picture and the reverse text picture into a convolution network in the text recognition model respectively, so as to obtain text content and confidence thereof included in the forward text picture output by the text recognition model, and text content and confidence thereof included in the reverse text picture output by the text recognition model;
and the confidence selection unit is used for taking, of the text content contained in the forward text picture and the text content contained in the reverse text picture, the one with the higher confidence as the final recognition result.
The text recognition device provided by the embodiment of the application can be applied to text recognition equipment, such as a terminal: mobile phones, computers, etc. Alternatively, fig. 9 shows a block diagram of a hardware structure of the text recognition apparatus, and referring to fig. 9, the hardware structure of the text recognition apparatus may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4;
in the embodiment of the application, the number of the processor 1, the communication interface 2, the memory 3 and the communication bus 4 is at least one, and the processor 1, the communication interface 2 and the memory 3 complete mutual communication through the communication bus 4;
the processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention;
the memory 3 may include high-speed RAM, and may further include non-volatile memory, such as at least one disk memory;
wherein the memory stores a program and the processor can call the program stored in the memory, the program for:
acquiring a text picture to be recognized, wherein the text picture is an image area where a text to be recognized is located;
extracting image features in at least two different directions from the text picture;
identifying text content contained in the text picture based on the extracted image features in the at least two different directions.
Alternatively, the detailed function and the extended function of the program may be as described above.
Embodiments of the present application further provide a storage medium, where a program suitable for execution by a processor may be stored, where the program is configured to:
acquiring a text picture to be recognized, wherein the text picture is an image area where a text to be recognized is located;
extracting image features in at least two different directions from the text picture;
identifying text content contained in the text picture based on the extracted image features in the at least two different directions.
Alternatively, the detailed function and the extended function of the program may be as described above.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, the embodiments may be combined as needed, and the same and similar parts may be referred to each other.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (12)

1. A text recognition method, comprising:
acquiring a text picture to be recognized, wherein the text picture is an image area where a text to be recognized is located;
extracting image features in at least two different directions from the text picture;
identifying text content contained in the text picture based on the extracted image features in the at least two different directions.
2. The method of claim 1, wherein the extracting image features in at least two different directions from the text picture comprises:
inputting the text picture into a pre-constructed convolution network;
and extracting image features of the text picture in at least two different directions by using the convolution network, wherein the feature graph output by each convolution layer in the convolution network is formed by fusing at least two feature subgraphs, and the at least two feature subgraphs are obtained by performing convolution operation on the feature graph output by the previous convolution layer before rotation and after at least one rotation of the same convolution kernel.
3. The method of claim 2, wherein said extracting image features in at least two different directions of the text picture using the convolutional network comprises:
performing convolution operation on a feature graph output by a previous convolution layer by utilizing a convolution kernel of each convolution layer in the convolution network to obtain a feature subgraph extracted by each convolution kernel, wherein the convolution kernel of each convolution layer comprises an original convolution kernel and a convolution kernel after at least one rotation;
fusing the feature subgraphs extracted by the original convolution kernels and by the rotated convolution kernels, and inputting the fused feature graph into the next convolution layer;
and taking the feature graph output by the last convolution layer of the convolution network as the image feature of the text picture.
4. The method of claim 2, wherein said extracting image features in at least two different directions of the text picture using the convolutional network comprises:
performing at least one rotation on the convolution kernel of each convolution layer in the convolution network, and performing convolution operation on the feature graph output by the previous convolution layer by using the convolution kernels before and after the rotation to obtain feature subgraphs extracted from each convolution kernel before and after the rotation;
fusing the feature subgraphs extracted by the convolution kernels before rotation and by the rotated convolution kernels, and inputting the fused feature graph into the next convolution layer;
and taking the feature graph output by the last convolution layer of the convolution network as the image feature of the text picture.
5. The method of claim 2, wherein the at least two feature subgraphs comprise:
a feature subgraph obtained by performing, with the same convolution kernel before rotation, a convolution operation on the feature graph output by the previous convolution layer; and
a feature subgraph obtained, after the same convolution kernel is rotated by 90 degrees, 180 degrees and/or 270 degrees in a set direction, by performing, with the rotated convolution kernel, a convolution operation on the feature graph output by the previous convolution layer.
6. The method according to claim 2, wherein the identifying text content contained in the text picture based on the extracted image features in the at least two different directions comprises:
inputting the extracted image characteristics in the at least two different directions into a pre-constructed identification network to obtain text contents contained in the text picture output by the identification network;
the recognition network and the convolution network form a text recognition model, and the text recognition model is obtained by training sample picture training data marked with a text content recognition result.
7. The method according to any one of claims 1 to 6, wherein the obtaining of the text picture to be recognized comprises:
acquiring an original text picture to be identified;
and if the original text picture is detected to be inclined relative to the horizontal direction, rotating the original text picture to the horizontal direction to be used as the text picture to be identified.
8. The method of claim 7, wherein after the rotating the original text picture to a horizontal orientation, the method further comprises:
calculating the aspect ratio of the original text picture in the horizontal direction;
and if the aspect ratio is determined to exceed the set threshold value, rotating the original text picture in the horizontal direction by 90 degrees to serve as the text picture to be recognized.
9. The method of claim 6, wherein prior to inputting the text picture into a pre-constructed convolutional network, the method further comprises:
taking the text picture as a forward text picture, and rotating the forward text picture by 180 degrees to obtain a reverse text picture;
then, the inputting the text picture into a pre-constructed convolutional network includes:
respectively inputting the forward text picture and the reverse text picture into a convolution network in the text recognition model to obtain text content and confidence thereof contained in the forward text picture output by the text recognition model and text content and confidence thereof contained in the reverse text picture output by the text recognition model;
and taking, of the text content contained in the forward text picture and the text content contained in the reverse text picture, the one with the higher confidence as the final recognition result.
10. A text recognition apparatus, comprising:
the image acquisition unit is used for acquiring a text image to be identified, wherein the text image is an image area where the text to be identified is located;
the feature extraction unit is used for extracting image features in at least two different directions from the text picture;
and the text content identification unit is used for identifying the text content contained in the text picture based on the extracted image features in the at least two different directions.
11. A text recognition apparatus, comprising: a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement the steps of the text recognition method according to any one of claims 1 to 9.
12. A storage medium having stored thereon a computer program, characterized in that the computer program, when being executed by a processor, carries out the steps of the text recognition method according to any one of claims 1 to 9.
CN202110666915.6A 2021-06-16 2021-06-16 Text recognition method, device, equipment and storage medium Active CN113392825B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110666915.6A CN113392825B (en) 2021-06-16 2021-06-16 Text recognition method, device, equipment and storage medium
PCT/CN2021/139972 WO2022262239A1 (en) 2021-06-16 2021-12-21 Text identification method, apparatus and device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110666915.6A CN113392825B (en) 2021-06-16 2021-06-16 Text recognition method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113392825A true CN113392825A (en) 2021-09-14
CN113392825B CN113392825B (en) 2024-04-30

Family

ID=77621485

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110666915.6A Active CN113392825B (en) 2021-06-16 2021-06-16 Text recognition method, device, equipment and storage medium

Country Status (2)

Country Link
CN (1) CN113392825B (en)
WO (1) WO2022262239A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022262239A1 (en) * 2021-06-16 2022-12-22 科大讯飞股份有限公司 Text identification method, apparatus and device, and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111222589A (en) * 2018-11-27 2020-06-02 中国移动通信集团辽宁有限公司 Image text recognition method, device, equipment and computer storage medium
CN111400497A (en) * 2020-03-19 2020-07-10 北京远鉴信息技术有限公司 Text recognition method and device, storage medium and electronic equipment
US20210042567A1 (en) * 2019-04-03 2021-02-11 Beijing Sensetime Technology Development Co., Ltd. Text recognition

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105320961A (en) * 2015-10-16 2016-02-10 重庆邮电大学 Handwriting numeral recognition method based on convolutional neural network and support vector machine
CN110659633A (en) * 2019-08-15 2020-01-07 坎德拉(深圳)科技创新有限公司 Image text information recognition method and device and storage medium
CN112101351B (en) * 2020-09-07 2024-04-19 凌云光技术股份有限公司 Text line rotation correction method and device based on projection
AU2021100391A4 (en) * 2021-01-22 2021-04-15 GRG Banking Equipment Co.,Ltd Natural Scene Text Recognition Method Based on Sequence Transformation Correction and Attention Mechanism
CN113392825B (en) * 2021-06-16 2024-04-30 中国科学技术大学 Text recognition method, device, equipment and storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111222589A (en) * 2018-11-27 2020-06-02 中国移动通信集团辽宁有限公司 Image text recognition method, device, equipment and computer storage medium
US20210042567A1 (en) * 2019-04-03 2021-02-11 Beijing Sensetime Technology Development Co., Ltd. Text recognition
CN111400497A (en) * 2020-03-19 2020-07-10 北京远鉴信息技术有限公司 Text recognition method and device, storage medium and electronic equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MOHAMED YOUSEF ET AL: "Accurate, data-efficient, unconstrained text recognition with convolutional neural networks", Pattern Recognition, vol. 108, pages 1-12 *
SUN Jingjing et al.: "Text Detection in Natural Scenes Based on a Lightweight Network" (基于轻量级网络的自然场景下的文本检测), Electronic Measurement Technology (电子测量技术), vol. 43, no. 8, pages 101-107 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022262239A1 (en) * 2021-06-16 2022-12-22 科大讯飞股份有限公司 Text identification method, apparatus and device, and storage medium

Also Published As

Publication number Publication date
WO2022262239A1 (en) 2022-12-22
CN113392825B (en) 2024-04-30

Similar Documents

Publication Publication Date Title
CN109961009B (en) Pedestrian detection method, system, device and storage medium based on deep learning
CN108805131B (en) Text line detection method, device and system
CN109146892B (en) Image clipping method and device based on aesthetics
CN108520229B (en) Image detection method, image detection device, electronic equipment and computer readable medium
US11475681B2 (en) Image processing method, apparatus, electronic device and computer readable storage medium
CN109241861B (en) Mathematical formula identification method, device, equipment and storage medium
CN112508975A (en) Image identification method, device, equipment and storage medium
CN109886330B (en) Text detection method and device, computer readable storage medium and computer equipment
CN109948533B (en) Text detection method, device and equipment and readable storage medium
RU2697649C1 (en) Methods and systems of document segmentation
CN112041851A (en) Text recognition method and terminal equipment
CN112926564B (en) Picture analysis method, system, computer device and computer readable storage medium
US8204889B2 (en) System, method, and computer-readable medium for seeking representative images in image set
WO2022002262A1 (en) Character sequence recognition method and apparatus based on computer vision, and device and medium
CN113496208B (en) Video scene classification method and device, storage medium and terminal
CN112686243A (en) Method and device for intelligently identifying picture characters, computer equipment and storage medium
CN110533039A (en) A kind of true-false detection method of license plate, device and equipment
CN114782412A (en) Image detection method, and training method and device of target detection model
CN110969641A (en) Image processing method and device
CN114494775A (en) Video segmentation method, device, equipment and storage medium
CN114005019B (en) Method for identifying flip image and related equipment thereof
CN113392825A (en) Text recognition method, device, equipment and storage medium
CN112651399B (en) Method for detecting same-line characters in inclined image and related equipment thereof
CN112257708A (en) Character-level text detection method and device, computer equipment and storage medium
CN109871814B (en) Age estimation method and device, electronic equipment and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20230506

Address after: 230026 Jinzhai Road, Baohe District, Hefei, Anhui Province, No. 96

Applicant after: University of Science and Technology of China

Applicant after: IFLYTEK Co.,Ltd.

Address before: NO.666, Wangjiang West Road, hi tech Zone, Hefei City, Anhui Province

Applicant before: IFLYTEK Co.,Ltd.

TA01 Transfer of patent application right
GR01 Patent grant
GR01 Patent grant