CN111931773B - Image recognition method, device, equipment and storage medium - Google Patents

Image recognition method, device, equipment and storage medium

Info

Publication number
CN111931773B
Authority
CN
China
Prior art keywords
image
feature
processed
sequence
feature map
Prior art date
Legal status
Active
Application number
CN202011012813.4A
Other languages
Chinese (zh)
Other versions
CN111931773A (en)
Inventor
刘水
李兵
宁亚光
Current Assignee
Beijing Yizhen Xuesi Education Technology Co Ltd
Original Assignee
Beijing Yizhen Xuesi Education Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Yizhen Xuesi Education Technology Co Ltd
Priority to CN202011012813.4A
Publication of CN111931773A
Application granted
Publication of CN111931773B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/32 Normalisation of the pattern dimensions

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)
  • Character Input (AREA)

Abstract

The embodiment of the invention provides an image recognition method, apparatus, device and storage medium. The image recognition method includes: acquiring an image to be processed; extracting feature information of the image to be processed in a first direction to obtain a first feature map; acquiring feature information of the image to be processed in a second direction based on the first feature map to obtain a second feature map, the first direction intersecting the second direction; acquiring, from the first feature map and the second feature map respectively, a first feature sequence corresponding to the first direction and a second feature sequence corresponding to the second direction; and predicting text information of the image to be processed based at least on the first feature sequence and the second feature sequence, so that image content containing complex long text can be recognized.

Description

Image recognition method, device, equipment and storage medium
Technical Field
The embodiment of the invention relates to the field of computers, and in particular to an image recognition method, apparatus, device and storage medium.
Background
With the development of computer and internet technology, online learning has become a trend; it can effectively improve the efficiency with which students learn and with which teachers teach and tutor.
In online learning, the review of students' homework or answer sheets is becoming increasingly intelligent thanks to image recognition technology. For example, after completing homework or an answer sheet, a student can use a student-side device to upload an image carrying the homework or answer sheet to an electronic device such as a server. The electronic device recognizes the image using image recognition technology and converts the image content into text information, so that it can automatically produce a homework or answer-sheet review result based on the recognized text information; alternatively, after recognizing the text information in the image, the electronic device transmits it to a teacher-side device, and the teacher manually produces the review result based on the text information displayed there.
However, such images may contain the image content of complex long text (such as long formulas), and because existing image recognition technology handles the image content of complex text poorly, the accuracy of the recognized text information is low.
Therefore, how to provide an image recognition method that recognizes the image content of complex long text and improves the accuracy of the recognized text information has become an urgent technical problem.
Disclosure of Invention
The embodiment of the invention provides an image recognition method, apparatus, device and storage medium for recognizing images of complex long text.
To solve the above problem, an embodiment of the present invention provides an image recognition method, including:
acquiring an image to be processed, wherein the image to be processed is a text image;
extracting feature information of the image to be processed in a first direction to obtain a first feature map;
extracting feature information of the image to be processed in a second direction to obtain a second feature map, wherein the first direction is intersected with the second direction;
acquiring, from the first feature map and the second feature map respectively, a first feature sequence corresponding to the first direction and a second feature sequence corresponding to the second direction;
and predicting text information of the image to be processed based at least on the first feature sequence and the second feature sequence.
To solve the above problem, an embodiment of the present invention provides an image recognition apparatus, including:
the image acquisition unit, adapted to acquire an image to be processed, the image to be processed being a text image;
the first feature map extraction unit, adapted to extract feature information of the image to be processed in a first direction to obtain a first feature map;
the second feature map acquisition unit, adapted to extract feature information of the image to be processed in a second direction to obtain a second feature map, the first direction intersecting the second direction;
the feature sequence acquisition unit, adapted to acquire, from the first feature map and the second feature map respectively, a first feature sequence corresponding to the first direction and a second feature sequence corresponding to the second direction;
and the text information prediction unit, adapted to predict text information of the image to be processed according to the first feature sequence and the second feature sequence.
In order to solve the above problem, an embodiment of the present invention provides a storage medium storing a program suitable for image recognition to implement the image recognition method according to any one of the preceding claims.
To solve the above problem, an embodiment of the present invention provides an apparatus, including at least one memory and at least one processor; the memory stores a program that the processor calls to perform the image recognition method of any one of the preceding claims.
Compared with the prior art, the technical scheme of the invention has the following advantages:
in the image recognition method, apparatus, device and storage medium, when image recognition is performed, feature information of the image to be processed in a first direction is extracted to obtain a first feature map; feature information of the image to be processed in a second direction is acquired based on the first feature map to obtain a second feature map; a first feature sequence corresponding to the first direction and a second feature sequence corresponding to the second direction are acquired from the first feature map and the second feature map respectively; and text information of the image to be processed is predicted according to the first feature sequence and the second feature sequence. It can be seen that, in the image recognition method provided by the embodiment of the present invention, feature information of the image to be processed is acquired in two intersecting directions, so that the image is recognized and processed in multiple directions, and complex long-text images can thus be recognized completely and accurately.
Moreover, in a preferred embodiment of the present invention, the size of the image to be processed in the first direction differs from its size in the second direction, which better matches the size characteristics of long-text images, so that complex long-text images can be recognized more accurately.
Drawings
FIG. 1 is a schematic flow chart of an image recognition method according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of the step of acquiring an image to be processed in the image recognition method provided by the embodiment of the present invention;
FIG. 3 and FIG. 4 are schematic diagrams of an optional way of cropping a text image from an original image according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating an example of the step of acquiring feature information of the image to be processed in the second direction in the image recognition method provided by the embodiment of the present invention;
FIG. 6 is a diagram illustrating an example of the step of performing size transformation on the second initial feature map in the image recognition method according to the embodiment of the present invention;
FIG. 7 is a diagram illustrating an example of the step of acquiring a first feature sequence corresponding to the first direction and a second feature sequence corresponding to the second direction in the image recognition method according to an embodiment of the present invention;
FIG. 8 is a schematic block diagram of the step of acquiring a first feature sequence corresponding to the first direction and a second feature sequence corresponding to the second direction in the image recognition method according to an embodiment of the present invention;
FIG. 9 is a schematic flow chart of the step of predicting text information of the image to be processed in the image recognition method provided by the embodiment of the present invention;
FIG. 10 is another schematic flow chart of the image recognition method according to an embodiment of the present invention;
FIG. 11 is a diagram illustrating an example of the step of extracting text position information of the image to be processed in the image recognition method according to the embodiment of the present invention;
FIG. 12 is a schematic flow chart of a step of predicting text information of the image to be processed in the image recognition method provided by the embodiment of the present invention;
FIG. 13 is a schematic flow chart of the step of obtaining a target feature sequence in the image recognition method provided by the embodiment of the present invention;
FIG. 14 is a block diagram of an image recognition apparatus according to an embodiment of the present invention;
FIG. 15 shows an optional hardware architecture of the device provided by the embodiment of the present invention.
Detailed Description
When facing the image content of complex long text whose row and column information is unclear (for example, a formula containing fractions, powers, summations, and integrals), existing techniques cannot accurately recognize the text information from the image.
Based on this, the embodiment of the present invention provides an image recognition method, apparatus, device, and storage medium, where the method includes: acquiring an image to be processed, wherein the image to be processed is a text image; extracting feature information of the image to be processed in a first direction to obtain a first feature map; extracting feature information of the image to be processed in a second direction to obtain a second feature map, wherein the first direction intersects the second direction; acquiring, from the first feature map and the second feature map respectively, a first feature sequence corresponding to the first direction and a second feature sequence corresponding to the second direction; and predicting text information of the image to be processed based at least on the first feature sequence and the second feature sequence.
In the image recognition method of the embodiment of the invention, a first feature map is obtained by extracting feature information of the image to be processed in the first direction; feature information in the second direction is acquired based on the first feature map to obtain a second feature map; a first feature sequence corresponding to the first direction and a second feature sequence corresponding to the second direction are acquired from the first feature map and the second feature map respectively; and the text information of the image to be processed is predicted according to the first feature sequence and the second feature sequence. It can be seen that feature information of the image to be processed is acquired in two intersecting directions, so that the image is recognized and processed in multiple directions, and complex long-text images can be recognized completely and accurately.
Moreover, in a preferred embodiment of the present invention, the size of the image to be processed in the first direction differs from its size in the second direction, which better matches the size characteristics of long-text images, so that complex long-text images can be recognized more accurately.
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, fig. 1 is a flow chart illustrating an image recognition method according to an embodiment of the present invention.
As shown in the figure, the image recognition method provided by the embodiment of the present invention, which implements image recognition of the image to be processed, includes the following steps:
Step S10: acquiring an image to be processed, the image to be processed being a text image.
To recognize the text information in an image to be processed, the image must first be acquired. The image to be processed is a text image, that is, an image containing text content; by recognizing the image to be processed, the text information corresponding to that text content can be identified.
It can be understood that the image to be processed may contain the image content of complex long text (for example, a long formula) or of ordinary text, so the embodiment of the present invention can recognize various kinds of text images. It should be noted that, in the embodiment of the present invention, text whose length is greater than or equal to 28 characters may be treated as long text and text shorter than 28 characters as short text; in other embodiments, the character threshold for long text may be set according to actual requirements, for example 20 characters or 35 characters.
The first direction may be a direction from left to right, right to left, top to bottom, or bottom to top of the image to be processed, and may be determined specifically according to a writing habit of a text. In this embodiment, referring to the schematic direction diagram of the image to be processed shown in fig. 4, the first direction may be set to be from left to right based on the left-to-right writing habit.
The second direction is another direction intersecting the first direction, and when the first direction is from left to right, the second direction may be from top to bottom, or from bottom to top. In this embodiment, the second direction is taken as an example from top to bottom as shown in fig. 4.
The size of the image to be processed in the first direction may differ from its size in the second direction, so as to match the size characteristics of long-text images; this avoids the feature loss caused by mismatched image sizes, so that complex long-text images can be recognized more accurately.
Specifically, the image to be processed has a preset size, and its size in the first direction may be larger than its size in the second direction; for example, the preset size may be 224 × 56, which accommodates most text images containing long text.
Of course, in order to ensure that the obtained to-be-processed image meets the identification requirement, please refer to fig. 2, and fig. 2 is a schematic flow chart of the step of obtaining the to-be-processed image according to the image identification method provided in the embodiment of the present invention.
As shown in fig. 2, the image recognition method provided in the embodiment of the present invention obtains an image to be processed by the following steps:
step S100: acquiring an original image, wherein at least one image area in the original image is a text image area.
It will be appreciated that the original image is an unprocessed image, for example one obtained by photographing printed or handwritten text information (such as a formula). Because it has not been processed, the original image may contain blank areas, and its size is not fixed.
Step S101: cropping the text image area from the original image to obtain a text image.
By cropping the text image out of the original image, non-text areas and blank areas can be removed, so that the text can be recognized directly in subsequent processing, improving image recognition efficiency.
Referring to FIG. 3 and FIG. 4, which schematically show an optional way of cropping the text image from the original image: the image in the dashed box in FIG. 3 can be understood as the text image to be cropped, and FIG. 4 as the text image obtained after cropping. It can be understood that the cropping may follow a preset rule, for example cropping line by line, or cropping at punctuation marks.
One or more text images may be cropped from the original image according to the preset rule; no specific limitation is imposed here.
Step S102: preprocessing the text image to obtain an image to be processed with a preset size.
Preprocessing the text images into images to be processed of the preset size allows them to be handled in batches, improving image recognition efficiency.
For example, the text image may be collectively processed to a size of 224 × 56 as an image to be processed.
It should be noted that the preset size may also be associated with an image processing step to be executed subsequently, and in an optional example, the preset size may be set according to a requirement for an image size in the subsequent image processing step.
Through the above processing, an image to be processed, that is, a text image of the preset size, can be obtained.
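As a concrete illustration of steps S100 to S102, the following is a minimal preprocessing sketch (assumptions for illustration only: OpenCV is used, images are 3-channel, the 224 × 56 example size above is the target, and the function name and white padding are not mandated by the text):

```python
import cv2
import numpy as np

TARGET_W, TARGET_H = 224, 56  # preset size: larger in the first direction

def preprocess(text_image: np.ndarray) -> np.ndarray:
    """Scale a cropped text image into the preset size, padding the
    remainder so that images can be batched for recognition."""
    h, w = text_image.shape[:2]
    scale = min(TARGET_W / w, TARGET_H / h)          # keep the aspect ratio
    resized = cv2.resize(text_image, (int(w * scale), int(h * scale)))
    canvas = np.full((TARGET_H, TARGET_W, 3), 255, dtype=np.uint8)  # white background
    canvas[: resized.shape[0], : resized.shape[1]] = resized
    return canvas
```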
Returning to FIG. 1, after the image to be processed is acquired, step S11 may be performed: extracting feature information of the image to be processed in the first direction to obtain a first feature map.
A feature map here refers to a convolution feature map: the convolution feature map corresponding to a text image is obtained by applying convolution processing to it. The size of the first feature map bears a preset proportional relationship to the size of the image to be processed.
After the to-be-processed image is obtained, the feature information of the to-be-processed image can be extracted in the first direction, so that a first feature map is obtained.
In the embodiment of the present invention, a convolutional neural network model may be used to extract the feature information of the image to be processed in the first direction. Optionally, the model may be a BCNN model or a ResNet model (for example, ResNet18 or ResNet45). In this embodiment, the ResNet18 model may be used: compared with the BCNN model, ResNet18 extracts image information more fully; compared with the ResNet45 model (which has more network layers), ResNet18 has fewer parameters and maintains higher efficiency while preserving accuracy.
Specifically, the image to be processed may be input into the ResNet18 model and processed by its first six modules (five convolution modules and one pooling module; for example, in sequence: a 7 × 7, 64, /2 convolution; a 3 × 3 max-pooling layer; [3 × 3, 64] × 2 convolution blocks; and [3 × 3, 128] × 2 convolution blocks) to obtain the first feature map.
It should be noted that the extracted first feature map has the same aspect ratio as the image to be processed; for example, when the image to be processed is 224 × 56, the extracted first feature map is 56 × 14. In the embodiment of the invention, the size of the first feature map may be set according to the computational load and the required precision, and the preset size of the image to be processed acquired earlier may in turn be determined from the chosen size of the first feature map.
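One way such a truncated backbone could be assembled is sketched below with torchvision (assumptions for illustration: the text names only "the first 6 modules" of ResNet18, and its reported 224 × 56 → 56 × 14 sizes imply an overall 4× downsampling, so layer2's stride is reduced to 1 here; sizes are read as height × width):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

backbone = resnet18(weights=None)
backbone.layer2[0].conv1.stride = (1, 1)          # keep the overall 4x downsampling
backbone.layer2[0].downsample[0].stride = (1, 1)

extractor = nn.Sequential(
    backbone.conv1, backbone.bn1, backbone.relu,  # 7x7, 64, /2 convolution
    backbone.maxpool,                             # 3x3 max pooling
    backbone.layer1,                              # [3x3, 64] x 2 blocks
    backbone.layer2,                              # [3x3, 128] x 2 blocks
)

x = torch.randn(1, 3, 224, 56)                    # image to be processed, 224 x 56
first_feature_map = extractor(x)
print(first_feature_map.shape)                    # torch.Size([1, 128, 56, 14])
```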
Step S12: extracting feature information of the image to be processed in the second direction to obtain a second feature map.
After the first feature map is obtained, feature information of the image to be processed in the second direction can be obtained through image transformation processing.
By acquiring feature information of the image to be processed in two intersecting directions, the image can be recognized in multiple directions, so that complex long-text images can be recognized completely and accurately.
The second feature map may also be extracted from the image to be processed directly, for example by using a convolutional neural network model to extract feature information in the second direction; for details, refer to the extraction process of the first feature map. In the embodiment of the present invention, however, the feature information of the image to be processed in the second direction is acquired based on the first feature map.
It should be noted that the second feature map may be matched in size to the first feature map, so that the two can subsequently be processed with the same steps.
In a specific embodiment, the feature information of the image to be processed in the second direction may be acquired by the following process:
Step S120: rotating the first feature map by a preset angle to obtain a second initial feature map, the preset angle matching the included angle between the first direction and the second direction.
The second initial feature map is obtained by rotating the first feature map, so that it characterizes the feature information of the image to be processed in the second direction.
The preset angle may be equal to the included angle between the first direction and the second direction, or may be derived from that included angle according to a preset rule. This embodiment is described with the preset angle equal to the included angle; correspondingly, the preset angle is 90°.
For example, referring to FIG. 5, when the first feature map is 56 × 14, the second initial feature map obtained after a 90° rotation is 14 × 56.
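A minimal sketch of this rotation (an assumption for illustration: feature maps are PyTorch NCHW tensors, so the 90° rotation is torch.rot90 over the two spatial axes):

```python
import torch

first_feature_map = torch.randn(1, 128, 56, 14)                   # 56 x 14
second_initial = torch.rot90(first_feature_map, k=1, dims=(2, 3))  # rotate H/W axes
print(second_initial.shape)                                        # torch.Size([1, 128, 14, 56])
```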
Step S121: performing size transformation on the second initial feature map to obtain a second feature map with the same size as the first feature map.
Since the size of the first feature map in the first direction differs from its size in the second direction, and with continued reference to FIG. 5, the embodiment of the present invention further performs size transformation on the second initial feature map so that the second feature map and the first feature map have the same size; the convolution module can then be shared when convolving the first and second feature maps.
Specifically, the size transformation may be implemented by pooling and expansion (padding) processing, or by deconvolution and pooling processing.
In this embodiment, the above size transformation is implemented by means of deconvolution processing and pooling processing, so that the feature loss possibly caused by means of expansion processing (padding) can be avoided. Specifically, the size transformation of the second initial feature map may be implemented by the following process:
step S121A: and performing deconvolution processing on the second initial feature map to obtain a second transition feature map, wherein the size of the second transition feature map is the same as that of the image to be processed after the image is rotated by a preset angle.
Through the deconvolution processing, the feature information in the image to be processed can be retained in the formed second transition feature map, thereby avoiding possible feature loss.
Specifically, the second transition feature map with the same size as the size of the image to be processed after the rotation by the preset angle can be obtained through two times of deconvolution processing.
For example, referring to fig. 6, the size of the second initial feature map is 14 × 56, and the size of the second transition feature map obtained by the deconvolution process is 56 × 224.
Because the second transition feature map is obtained by processing the first feature map, it contains the feature information of the image to be processed; even after the deconvolution processing, it remains a feature map carrying that feature information, and the deconvolution should not be understood as restoring the original image to be processed.
Step S121B: performing pooling processing on the second transition feature map to obtain a second feature map with the same size as the first feature map.
Specifically, the pooling may be performed twice: first, local pooling (RoI pooling) obtains a number of blocks corresponding to the target size; then, max pooling (max_pooling) is applied to each block to obtain the second feature map of the target size.
For example, when the second transition feature map is 56 × 224, local pooling yields 56 × 14 blocks of size 1 × 16, and max pooling over each block then gives the second feature map of size 56 × 14.
Referring to the example in FIG. 6, the second transition feature map has a size of 56 × 224, and the pooled second feature map has a size of 56 × 14.
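Steps S121A and S121B could be sketched as follows (assumptions for illustration: 128 channels, transposed-convolution parameters chosen so that 14 × 56 is exactly quadrupled to 56 × 224, and a single 1 × 16 max-pooling window standing in for the local-pooling-plus-max-pooling pair):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

second_initial = torch.randn(1, 128, 14, 56)   # 14 x 56, from the rotation step

upsample = nn.Sequential(
    nn.ConvTranspose2d(128, 128, kernel_size=4, stride=2, padding=1),  # 14x56 -> 28x112
    nn.ReLU(inplace=True),
    nn.ConvTranspose2d(128, 128, kernel_size=4, stride=2, padding=1),  # 28x112 -> 56x224
)
second_transition = upsample(second_initial)   # same size as the rotated image: 56 x 224

# Pooling over non-overlapping 1x16 blocks collapses 224 -> 14 and keeps 56.
second_feature_map = F.max_pool2d(second_transition, kernel_size=(1, 16))
print(second_feature_map.shape)                # torch.Size([1, 128, 56, 14])
```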
In an optional example, in a scenario with a higher requirement on the recognition speed and a lower requirement on the recognition accuracy, the step of size transformation may also be implemented by using a pooling process (pooling) and an expansion process (padding), and the specific process may include: pooling the second initial feature map to obtain a second transition feature map, wherein the size of the second transition feature map in one direction is the same as that of the first feature map, and the size of the second transition feature map in the other direction is smaller than that of the first feature map; and expanding the feature information of the second transition feature map in the other direction to obtain a second feature map with the same size as the first feature map. For example, the second initial feature map has a size of 14 × 56, and the pooling process may be used to obtain the second transition feature map having a size of 3 × 14, and the expansion process may be performed to obtain the second feature map having a size of 56 × 14.
Returning to FIG. 1, after the second feature map is obtained, step S13 may be executed: acquiring, from the first feature map and the second feature map respectively, a first feature sequence corresponding to the first direction and a second feature sequence corresponding to the second direction.
A feature sequence represents the positions, in a specific text information base, of the text information recognized from the image to be processed. The specific text information base is a pre-established database that stores text information which may appear in the image to be processed, together with the positional relationships of that text information within the database, so that the corresponding text information can be retrieved through position information. The first feature sequence represents the positions, in the specific text information base, of the text information recognized in the first direction; the second feature sequence represents the positions of the text information recognized in the second direction. For example, referring to FIG. 7, a first feature sequence corresponding to the first direction and a second feature sequence corresponding to the second direction may be acquired from the first feature map and the second feature map respectively, so that image recognition is performed in the first and second directions simultaneously.
In the embodiment of the present invention, the first and second feature sequences may be obtained by applying convolution processing and time-sequence conversion processing to the first and second feature maps. Specifically: the first feature map is convolved to obtain a first sequence feature map, which is converted by time-sequence conversion into the first feature sequence; the second feature map is convolved to obtain a second sequence feature map, which is converted by time-sequence conversion into the second feature sequence.
In an alternative example, the first feature map and the second feature map may be processed by different convolution modules, so as to obtain a first feature sequence based on the first feature map and a second feature sequence based on the second feature map. In the embodiment of the present invention, it is preferable that the first feature map and the second feature map are processed based on the same convolution module, and then the first feature sequence is obtained based on the first feature map, and the second feature sequence is obtained based on the second feature map, so that the amount of calculation is reduced, and the calculation efficiency is improved.
Specifically, in the embodiment of the present invention, the convolution processing may be implemented with a shared convolution module. For example, referring to the module diagram of this step shown in FIG. 8, the shared convolution module may include five convolution layers (for example, shared conv, 256, /(2,1,1,0); shared conv, 256, /(2,1,0,1); shared conv, 256, /(2,1,0,0); and shared conv, 256), with 256 convolution-kernel channels. The shared convolution module further extracts features from the first feature map and the second feature map, and a time-sequence conversion module (for example, a BLSTM module) converts the results into the first and second feature sequences representing the first and second directions respectively. The sequence length may be maxT × 1, where maxT is chosen to match the length of most texts. For example, in formula recognition, the lengths of sample formulas may be analyzed statistically; if a length of 28 matches most sample formulas, maxT is set to 28.
Through the above processing, a first feature sequence corresponding to the first direction and a second feature sequence corresponding to the second direction can be obtained.
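A sketch of this shared-convolution-plus-BLSTM step (assumptions for illustration: 128-channel 56 × 14 input maps, 256-channel convolutions, and maxT = 28; the exact strides and paddings are not fully legible above, so the ones below merely reproduce a 56 × 14 → 28 × 1 reduction):

```python
import torch
import torch.nn as nn

shared_conv = nn.Sequential(                            # shared by both feature maps
    nn.Conv2d(128, 256, 3, stride=(2, 2), padding=1),   # 56x14 -> 28x7
    nn.Conv2d(256, 256, 3, stride=(1, 2), padding=1),   # 28x7  -> 28x4
    nn.Conv2d(256, 256, 3, stride=(1, 2), padding=1),   # 28x4  -> 28x2
    nn.Conv2d(256, 256, 3, stride=(1, 2), padding=1),   # 28x2  -> 28x1
)
blstm = nn.LSTM(256, 128, bidirectional=True, batch_first=True)  # time-sequence conversion

def to_sequence(feature_map: torch.Tensor) -> torch.Tensor:
    f = shared_conv(feature_map)         # (N, 256, maxT=28, 1)
    f = f.squeeze(-1).permute(0, 2, 1)   # (N, 28, 256): one step per sequence position
    seq, _ = blstm(f)                    # (N, 28, 256)
    return seq

first_seq = to_sequence(torch.randn(1, 128, 56, 14))    # from the first feature map
second_seq = to_sequence(torch.randn(1, 128, 56, 14))   # same module: weights are shared
```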
Returning to FIG. 1, after the first feature sequence and the second feature sequence are obtained, step S14 may be executed: predicting text information of the image to be processed based at least on the first feature sequence and the second feature sequence.
It can be understood that the first and second feature sequences represent the text information of the image to be processed as recognized in different directions, and are therefore two different recognition results. To reflect both results at once, a target feature sequence that reflects the first and second feature sequences simultaneously may be obtained by fusion, and the text information of the image to be processed is then predicted from the target feature sequence.
Specifically, referring to FIG. 9, FIG. 9 is a schematic flow chart of the step of predicting text information of the image to be processed in the image recognition method according to the embodiment of the present invention. The step includes:
Step S141: fusing the first feature sequence and the second feature sequence to obtain a target feature sequence.
In the embodiment of the present invention, the first and second feature sequences may be fused based on preset weights. In an optional example, the weights of the two sequences may be equal; alternatively, based on the actual scene of the image to be processed, the weight of the first feature sequence in the first direction may be set greater than the weight of the second feature sequence in the second direction, to implement the fusion of the first and second feature sequences.
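A minimal sketch of such a preset-weight fusion (the 0.6/0.4 split favouring the first direction and the sequence shapes are purely illustrative):

```python
import torch

first_seq = torch.randn(1, 28, 256)    # first feature sequence (N, maxT, C)
second_seq = torch.randn(1, 28, 256)   # second feature sequence
w1, w2 = 0.6, 0.4                      # preset weights; equal weights are also possible
target_seq = w1 * first_seq + w2 * second_seq   # fused target feature sequence
```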
Step S142: decoding the target feature sequence to obtain the text information of the image to be processed.
In the embodiment of the present invention, the prediction for the image to be processed may be performed directly based on the target feature sequence.
In an optional example, the decoding may be implemented by a recurrent neural network (RNN). For example, the target feature sequence may be input directly and processed in sequence by a BLSTM module, an attention module, and a softmax module to obtain the prediction result.
Because the target feature sequence input in the embodiment of the present invention contains feature information in multiple directions, recognition in multiple directions is achieved, and special text such as powers, fractions, summations, and integrals can be recognized accurately when predicting complex long text, so that complex long text is recognized accurately.
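A much-simplified decoding sketch (assumptions for illustration: a vocabulary of num_classes symbols drawn from the specific text information base, and the attention module omitted, leaving a plain BLSTM-plus-softmax head; a real decoder would follow the BLSTM, attention, and softmax pipeline described above):

```python
import torch
import torch.nn as nn

num_classes = 100                                   # illustrative vocabulary size
target_seq = torch.randn(1, 28, 256)                # fused target feature sequence
decoder = nn.LSTM(256, 128, bidirectional=True, batch_first=True)
classifier = nn.Linear(256, num_classes)

dec, _ = decoder(target_seq)                        # (1, 28, 256)
pred_ids = classifier(dec).softmax(-1).argmax(-1)   # (1, 28): indices into the text base
```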
Text drift easily occurs when recognizing complex long text. To further improve the image recognition accuracy of the method described above, an embodiment of the present invention provides another image recognition method; please refer to FIG. 10, which is another schematic flow chart of the image recognition method provided by the embodiment of the present invention.
As shown in the figure, the image recognition method provided by the embodiment of the present invention includes:
Step S20: acquiring an image to be processed.
For details of step S20, please refer to the description of step S10 shown in FIG. 1, which is not repeated herein.
Step S21: extracting feature information of the image to be processed in the first direction to obtain a first feature map.
For details of step S21, please refer to the description of step S11 shown in FIG. 1, which is not repeated herein.
Step S22: extracting feature information of the image to be processed in a second direction to obtain a second feature map, the first direction intersecting the second direction.
For details of step S22, please refer to the description of step S12 shown in FIG. 1, which is not repeated herein.
Step S23: acquiring, from the first feature map and the second feature map respectively, a first feature sequence corresponding to the first direction and a second feature sequence corresponding to the second direction.
For details of step S23, please refer to the description of step S13 shown in FIG. 1, which is not repeated herein.
Step S24: extracting text position information of the image to be processed based on the first feature map to obtain a third feature sequence.
The text position information of the image to be processed is extracted so that it can be fused in during subsequent text information prediction, thereby avoiding the text drift that may occur when predicting complex long text.
Specifically, the extraction of the text position information of the image to be processed may be implemented by the following procedure:
Step S241: performing a preset number of convolution operations on the first feature map, where the feature map obtained after each convolution operation serves as a first intermediate feature map and is the input of the next convolution operation.
The preset number may be set according to the application scenario and computational requirements; in the embodiment of the present invention, it may be greater than or equal to 3. Specifically, referring to the example shown in FIG. 11, three convolution operations are performed: when the first feature map has a size of 56 × 14, it is fed through three convolution modules in sequence, yielding three first intermediate feature maps of sizes 28 × 7, 14 × 3, and 7 × 1.
Step S242: performing the preset number of deconvolution operations starting from the first intermediate feature map obtained by the last convolution operation, where the feature map obtained after each deconvolution operation serves as a second intermediate feature map, and the feature map obtained by adding or concatenating a second intermediate feature map with the first intermediate feature map of the same size is the input of the next deconvolution operation.
In the embodiment of the present invention, the preset number may be 3; correspondingly, the deconvolution is performed three times. Continuing the example of FIG. 11, with first intermediate feature maps of sizes 28 × 7, 14 × 3, and 7 × 1: the 7 × 1 map is fed into three consecutive deconvolution modules in sequence. The first deconvolution yields a second intermediate feature map of size 14 × 3, which is added to the first intermediate feature map of the same size, giving a feature map still of size 14 × 3 that is fed into the next deconvolution module. The second deconvolution yields a second intermediate feature map of size 28 × 7, which is added to (or concatenated with) the first intermediate feature map of the same size, giving a feature map still of size 28 × 7 that is fed into the next deconvolution module, yielding a second intermediate feature map of size 56 × 14. This second intermediate feature map, obtained by the last deconvolution, serves as the third intermediate feature map.
In this embodiment, the feature map input to the next deconvolution operation may be obtained by addition (add), which reduces the amount of computation and improves recognition efficiency compared with concatenation (concat).
Step S243: determining a third feature sequence based on the third intermediate feature map and the first feature map.
After the third intermediate feature map is obtained, scaled dot-product and summation may be performed on the third intermediate feature map and the first feature map, and the third feature sequence is obtained after a further time-sequence conversion.
The scaled dot-product and summation serve to learn the relationships between sample characters; the time-sequence conversion may be performed by the BLSTM module of the foregoing embodiment, yielding a third feature sequence with the same parameter features as the first and second feature sequences. In the embodiment of the invention, for example, a third feature sequence of length 28 × 1 is obtained.
In this embodiment, a Convolution Arrangement Module (CAM) may be adopted to extract the text position information of the image to be processed.
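The down-sampling and up-sampling path of steps S241 and S242 could be sketched as follows (assumptions for illustration: 128 channels throughout, the 56 × 14 → 28 × 7 → 14 × 3 → 7 × 1 sizes given above, additive skip connections, and kernel/padding choices picked solely to reproduce those exact sizes):

```python
import torch
import torch.nn as nn

down1 = nn.Conv2d(128, 128, 3, stride=2, padding=1)       # 56x14 -> 28x7
down2 = nn.Conv2d(128, 128, 3, stride=2, padding=(1, 0))  # 28x7  -> 14x3
down3 = nn.Conv2d(128, 128, 3, stride=2, padding=(1, 0))  # 14x3  -> 7x1

up1 = nn.ConvTranspose2d(128, 128, 3, stride=2, padding=(1, 0), output_padding=(1, 0))  # 7x1  -> 14x3
up2 = nn.ConvTranspose2d(128, 128, 3, stride=2, padding=(1, 0), output_padding=(1, 0))  # 14x3 -> 28x7
up3 = nn.ConvTranspose2d(128, 128, 3, stride=2, padding=(1, 1), output_padding=(1, 1))  # 28x7 -> 56x14

f0 = torch.randn(1, 128, 56, 14)   # first feature map
f1 = down1(f0)                     # first intermediate feature maps
f2 = down2(f1)
f3 = down3(f2)

u1 = up1(f3) + f2                  # additive skip of the same size (14x3)
u2 = up2(u1) + f1                  # additive skip of the same size (28x7)
third_intermediate = up3(u2)       # 56x14, same size as the first feature map
# Step S243 then applies the scaled dot-product with f0 and a BLSTM
# (as in the sequence-extraction step) to obtain the third feature sequence.
```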
Step S25: and predicting text information of the image to be processed according to the first characteristic sequence, the second characteristic sequence and the third characteristic sequence.
It can be understood that by adding the third feature sequence representing the text position information of the image to be processed, the text drift problem which may occur when complex long text prediction is performed can be avoided.
Specifically, a target feature sequence that simultaneously reflects the first feature sequence, the second feature sequence, and the third feature sequence may be obtained by fusing them, and the text information of the image to be processed is then predicted from the target feature sequence.
Because the target feature sequence input in the embodiment of the present invention contains feature information in multiple directions as well as feature information representing the text position information of the image to be processed, the text drift that may occur when predicting complex long text can be avoided.
In the recognition of complex long text, multiple feature sequences need to be fused. To better predict text information, an embodiment of the present invention provides another image recognition method in which the feature sequences are given certain weights. Specifically, the image recognition method provided by the embodiment of the present invention includes:
Step S30: acquiring an image to be processed.
For details of step S30, please refer to the description of step S10 shown in FIG. 1, which is not repeated herein.
Step S31: extracting feature information of the image to be processed in the first direction to obtain a first feature map.
For details of step S31, please refer to the description of step S11 shown in FIG. 1, which is not repeated herein.
Step S32: extracting feature information of the image to be processed in a second direction to obtain a second feature map, the first direction intersecting the second direction.
For details of step S32, please refer to the description of step S12 shown in FIG. 1, which is not repeated herein.
Step S33: acquiring, from the first feature map and the second feature map respectively, a first feature sequence corresponding to the first direction and a second feature sequence corresponding to the second direction.
For details of step S33, please refer to the description of step S13 shown in FIG. 1, which is not repeated herein.
Step S34: extracting text position information of the image to be processed based on the first feature map to obtain a third feature sequence.
For details of step S34, please refer to the description of step S24 shown in FIG. 10, which is not repeated herein.
Step S35: predicting text information of the image to be processed according to the first feature sequence, the second feature sequence, and the third feature sequence.
In the embodiment of the present invention, the first, second, and third feature sequences are considered to contribute differently to image recognition; corresponding weights are therefore set, and the text information of the image is predicted according to those weights, improving the accuracy of image recognition.
Specifically, referring to FIG. 12, the process of predicting text information of the image to be processed according to the first, second, and third feature sequences may include:
Step S350: determining the weights of the first feature sequence, the second feature sequence, and the third feature sequence.
In this embodiment, the weights of the three feature sequences may be determined by the ratio of the numbers of images containing the features characterized by each sequence: the more images contain the features characterized by a given sequence, the greater that sequence's weight.
In this embodiment of the present invention, determining the weights of the first, second, and third feature sequences may be implemented by the following process:
Step SA1: determining, according to the first feature map, the category to which the image to be processed belongs, the categories comprising: a first category of images containing short text information in the first direction, a second category of images containing valid text information in the second direction, and a third category of images containing long text information.
In the embodiment of the present invention, images containing short text information in the first direction are taken as the first category, corresponding to the features characterized by the first feature sequence; images containing valid text information in the second direction are taken as the second category, corresponding to the features characterized by the second feature sequence; and images containing long text information are taken as the third category, corresponding to the features characterized by the third feature sequence.
It can be understood that short text in the first direction (in this embodiment, fewer than 28 characters) represents ordinary text information; such images therefore correspond to the first feature map carrying first-direction feature information, and hence to the first feature sequence derived from it. Valid text information in the second direction indicates that there is text along the second direction, representing unclear row and column information; such images therefore correspond to the second feature map representing second-direction feature information, and hence to the second feature sequence derived from it. Text information containing long text (in this embodiment, 28 characters or more) is prone to text drift, so its position information must be determined; such images therefore correspond to the third feature sequence extracted from the text position information.
For example, in formula recognition, the first category may correspond to images of ordinary formulas of smaller length (e.g., less than 28) without superscripts or subscripts; the second category may correspond to images of smaller length containing superscripts or subscripts; and the third category may correspond to images of larger length (e.g., greater than or equal to 28) without superscripts or subscripts.
In an alternative example, the category to which the image to be processed belongs may be determined by building a corresponding learning model. Specifically, a category learning model may be established, and the category to which the image to be processed belongs may be determined by training the category learning model.
Specifically, the category learning model may be a three-class classification model, used to determine the respective proportions of the three feature sequences. In an optional example, the category learning model convolves the feature map of the image to be classified twice (for example, conv, 256, /(2,1,1,1)), extracting the corresponding classification feature information to obtain a classification feature map; converts the classification feature map into a feature map with one-dimensional features through flattening; then applies two fully connected layers, gradually converting the length of the one-dimensional feature map to maxT; and finally applies a softmax layer to obtain the category probabilities of the image, thereby determining the category of the image to be classified. In this embodiment, the category learning model has already been trained; accordingly, when it is applied in the recognition method of the embodiment of the present invention, the feature map of the image to be classified corresponds to the first feature map, so the category to which the image to be processed belongs can be determined from the first feature map.
Step SA2: adjusting, according to the category to which the image to be processed belongs, the ratio of the number of images of that category to the numbers of images of the other categories.
It can be understood that determining the category to which the image to be processed belongs amounts to adding one image sample, so the ratio of the number of images of that category to the numbers of images of the other categories should be adjusted accordingly.
In the embodiment of the present invention, the initial ratio of the numbers of first-, second-, and third-category images may be 6:3:1, which better matches the actual distribution of image categories and can significantly speed up the convergence of the subsequent computation.
It should be noted that this initial value is a ratio of quantities and does not directly represent the quantities themselves; in the category learning model, the number of samples is usually on the order of tens of thousands and accumulates continuously, so the actual distribution of image categories can be fitted progressively.
Step SA3: determining the weights of the first, second, and third feature sequences according to the ratio of the numbers of images in each category, where the weight of the first feature sequence corresponds to the proportion of first-category images, the weight of the second feature sequence to the proportion of second-category images, and the weight of the third feature sequence to the proportion of third-category images.
Based on the correspondence between categories and feature sequences, the weights of the first, second and third feature sequences can thus be determined from the quantity ratio of the images of each category.
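A minimal sketch of steps SA2 and SA3, assuming a simple running-count scheme; the category names, initial counts, and the normalization rule are illustrative assumptions:

```python
from collections import Counter

# hypothetical initial counts consistent with the 6:3:1 initial ratio
counts = Counter({"short_plain": 60000, "short_scripted": 30000, "long": 10000})

def update_and_weights(category: str) -> dict:
    """Step SA2: add one sample of `category`; step SA3: derive the
    per-category weights as normalized quantity proportions."""
    counts[category] += 1
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

# the image to be processed was classified as a second-type image
weights = update_and_weights("short_scripted")
# weights["short_plain"] weights the first feature sequence, and so on
```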
Step S351: fusing the first feature sequence, the second feature sequence and the third feature sequence according to their weights to obtain a target feature sequence.
Specifically, the fusion may be implemented using an activation function: the activation function can filter out some interfering features according to a certain threshold and suppress features that exceed the peak value, finally completing the fusion of the feature sequences.
For example, referring to fig. 13, let the weights of the first, second and third feature sequences be [C1, C2, C3], where C is a vector of dimension maxT and C1, C2 and C3 respectively denote the weights of the first, second and third feature sequences; and let the first, second and third feature sequences be T1, T2 and T3, where T is also a vector of dimension maxT. Fusion with an activation function H then computes H(C1 × T1 + C2 × T2 + C3 × T3), and the fused target feature sequence can be represented as (h1, h2, …, h_maxT). Optionally, the activation function may be a softsign activation function or a tanh activation function; preferably, the softsign activation function is used in this example. The softsign function is x/(1 + |x|), with derivative 1/(1 + |x|)²; it is antisymmetric, zero-centered and differentiable, returns values between -1 and 1, and has a flatter curve and a more slowly decaying derivative, which may enable more efficient learning.
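A minimal sketch of this weighted fusion with softsign as the activation H, assuming NumPy; the shapes and weight values are illustrative:

```python
import numpy as np

def softsign(x):
    """Softsign activation: x / (1 + |x|), returning values in (-1, 1)."""
    return x / (1.0 + np.abs(x))

maxT, d = 32, 256                                   # hypothetical dimensions
T1, T2, T3 = (np.random.randn(maxT, d) for _ in range(3))   # the three feature sequences
# maxT-dimensional weight vectors C1, C2, C3, broadcast across the feature dim
C1, C2, C3 = (np.full((maxT, 1), w) for w in (0.6, 0.3, 0.1))

# H(C1*T1 + C2*T2 + C3*T3): the fused target feature sequence (h1, ..., h_maxT)
target = softsign(C1 * T1 + C2 * T2 + C3 * T3)
```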
Step S352: decoding the target feature sequence to obtain the text information of the image to be processed.
In this embodiment, the text information of the image to be processed may be predicted directly by an RNN (recurrent neural network). Specifically, a bidirectional LSTM (long short-term memory) network may be selected: the target feature sequence (h1, h2, …, h_maxT) is input, and the text information of the image to be processed is predicted. For example, the target feature sequence may be processed by a BiLSTM module, an attention module and a softmax module in sequence to obtain the prediction result.
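A minimal sketch of such a decoder, assuming PyTorch; the hidden sizes, vocabulary, and the simple additive attention are assumptions rather than the patent's exact modules:

```python
import torch
import torch.nn as nn

class SequenceDecoder(nn.Module):
    """BiLSTM -> attention -> softmax over a vocabulary, as described above."""
    def __init__(self, feat_dim=256, hidden=256, vocab_size=100):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden, bidirectional=True, batch_first=True)
        self.attn = nn.Linear(2 * hidden, 1)           # simple additive attention scores
        self.classifier = nn.Linear(2 * hidden, vocab_size)

    def forward(self, seq):                            # seq: (B, maxT, feat_dim)
        ctx, _ = self.blstm(seq)                       # (B, maxT, 2*hidden)
        scores = torch.softmax(self.attn(ctx), dim=1)  # attention weights over time
        ctx = ctx * scores                             # re-weight each time step
        logits = self.classifier(ctx)                  # (B, maxT, vocab_size)
        return torch.log_softmax(logits, dim=-1)       # per-step symbol probabilities
```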
It can be understood that, when the above embodiment is applied to a formula recognition scenario in which the first direction is from left to right and the second direction is from top to bottom of the image to be processed, the input feature sequence contains feature information in both directions together with position information that enables alignment; as a result, the prediction will not suffer from character drift, and exponents, fractions and special mathematical symbols with upper-lower structures in the formula will not be recognized incorrectly.
According to the embodiment of the invention, the fusion of multiple feature sequences is realized by determining the weights of the feature sequences, so that the accuracy of image recognition can be improved.
In order to improve image recognition accuracy, referring to fig. 14, an embodiment of the present invention further provides another image recognition apparatus, including:
the image acquiring unit 400 is adapted to acquire an image to be processed, which is a text image;
the first feature map extracting unit 410 is adapted to extract feature information of the image to be processed in a first direction to obtain a first feature map;
the second feature map obtaining unit 420 is adapted to extract feature information of the image to be processed in a second direction to obtain a second feature map, where the first direction intersects with the second direction;
a feature sequence obtaining unit 430, adapted to obtain a first feature sequence corresponding to a first direction and a second feature sequence corresponding to a second direction according to the first feature map and the second feature map, respectively;
the text information prediction unit 440 is adapted to predict text information of the image to be processed according to the first feature sequence and the second feature sequence.
The first direction may be from left to right, right to left, top to bottom, or bottom to top of the image to be processed, and may be determined according to the writing habit of the text. The second direction is another direction intersecting the first direction; when the first direction is from left to right, the second direction may be from top to bottom or from bottom to top. In this embodiment, the second direction is taken to be from top to bottom as an example.
The size of the image to be processed in the first direction differs from its size in the second direction, which adapts to the size characteristics of long-text images; feature loss caused by mismatched image sizes can thus be avoided, and text images of complex long texts can be recognized more accurately.
The step of extracting feature information of the image to be processed in the second direction may include: acquiring the feature information of the image to be processed in the second direction based on the first feature map.
The step of acquiring feature information of the image to be processed in the second direction based on the first feature map may include: rotating the first feature map by a preset angle to obtain a second initial feature map, where the preset angle matches the included angle between the first direction and the second direction; and performing size transformation on the second initial feature map to obtain a second feature map with the same size as the first feature map.
The step of performing size transformation on the second initial feature map may include: performing deconvolution processing on the second initial feature map to obtain a second transition feature map, where the size of the second transition feature map is the same as that of the image to be processed after rotation by the preset angle; and performing pooling processing on the second transition feature map to obtain a second feature map with the same size as the first feature map.
In another optional example, the step of performing size transformation on the second initial feature map may include: pooling the second initial feature map to obtain a second transition feature map, where the size of the second transition feature map in one direction is the same as that of the first feature map and its size in the other direction is smaller than that of the first feature map; and expanding the feature information of the second transition feature map in the other direction to obtain a second feature map with the same size as the first feature map.
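A minimal sketch of the rotation followed by the second (pool-then-expand) size transformation, assuming PyTorch and a 90-degree angle between the two directions:

```python
import torch
import torch.nn.functional as F

def second_feature_map(first_fm: torch.Tensor) -> torch.Tensor:
    """first_fm: (B, C, H, W) feature map along the first direction.
    Returns a same-size feature map oriented along the second direction."""
    rotated = torch.rot90(first_fm, k=1, dims=(2, 3))    # preset angle: 90 degrees
    h, w = first_fm.shape[2], first_fm.shape[3]
    # pool so that one axis already matches the first feature map ...
    pooled = F.adaptive_avg_pool2d(rotated, (h, 1))      # second transition feature map
    # ... then expand (replicate) the feature information along the other axis
    return pooled.expand(-1, -1, h, w)
```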
Optionally, the step of predicting text information of the image to be processed according to the first feature sequence and the second feature sequence may include: fusing the first feature sequence and the second feature sequence to obtain a target feature sequence; and decoding the target feature sequence to obtain the text information of the image to be processed.
In this embodiment of the present invention, the apparatus may further include a third feature sequence extraction module 450, adapted to extract text position information of the image to be processed based on the first feature map to obtain a third feature sequence, after the step of extracting the feature information of the image to be processed in the first direction to obtain the first feature map and before the step of predicting the text information of the image to be processed according to the first feature sequence and the second feature sequence. Correspondingly, the step of predicting text information of the image to be processed according to the first feature sequence and the second feature sequence includes: predicting text information of the image to be processed according to the first feature sequence, the second feature sequence and the third feature sequence.
Optionally, the step of extracting text position information of the image to be processed based on the first feature map to obtain a third feature sequence includes: performing convolution processing on the first feature map a preset number of times, where the feature map obtained after each convolution operation serves as a first intermediate feature map and, within the convolution operations, the first intermediate feature map obtained by any convolution operation serves as the input of the next convolution operation; performing deconvolution processing the preset number of times on the first intermediate feature map obtained by the last convolution operation, where the feature map obtained after each deconvolution operation serves as a second intermediate feature map and, within the deconvolution operations, the feature map obtained by adding or combining the second intermediate feature map obtained by any deconvolution operation with the first intermediate feature map of the same size serves as the input of the next deconvolution operation; and taking the second intermediate feature map obtained by the last deconvolution operation as a third intermediate feature map, and determining a third feature sequence based on the third intermediate feature map and the first feature map. The preset number of times is greater than or equal to 3.
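A minimal sketch of this branch, assuming PyTorch, three as the preset number of times, addition as the merge operation, and hypothetical channel sizes; it works for feature maps whose height and width are divisible by 8:

```python
import torch
import torch.nn as nn

class TextPositionBranch(nn.Module):
    """Three convolutions, then three deconvolutions; each deconvolution
    output is added to the same-size convolution output (skip connection)."""
    def __init__(self, ch=256):
        super().__init__()
        conv = lambda: nn.Conv2d(ch, ch, 3, stride=2, padding=1)
        deconv = lambda: nn.ConvTranspose2d(ch, ch, 3, stride=2, padding=1, output_padding=1)
        self.convs = nn.ModuleList([conv() for _ in range(3)])
        self.deconvs = nn.ModuleList([deconv() for _ in range(3)])

    def forward(self, first_fm):              # first_fm: (B, ch, H, W)
        inters, x = [], first_fm
        for c in self.convs:                  # first intermediate feature maps
            x = torch.relu(c(x))
            inters.append(x)
        for i, d in enumerate(self.deconvs):  # second intermediate feature maps
            x = torch.relu(d(x))
            # add the same-size conv output; the final skip with the
            # input map is an assumption of this sketch
            skip = inters[-(i + 2)] if i + 2 <= len(inters) else first_fm
            x = x + skip
        return x                              # third intermediate feature map
```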
Correspondingly, predicting text information of the image to be processed according to the first feature sequence, the second feature sequence and the third feature sequence includes: determining the weights of the first feature sequence, the second feature sequence and the third feature sequence; fusing the three feature sequences according to their weights to obtain a target feature sequence; and decoding the target feature sequence to obtain the text information of the image to be processed.
Optionally, the step of determining the weights of the first feature sequence, the second feature sequence and the third feature sequence includes: determining, according to the first feature map, a category to which the image to be processed belongs, wherein the categories include a first type of image containing short text information in the first direction, a second type of image containing effective text information in the second direction, and a third type of image containing long text information; adjusting the quantity ratio between images of the category to which the image to be processed belongs and images of the other categories according to the determined category; and determining the weights of the first, second and third feature sequences according to the quantity ratio of the images of each category, wherein the weight of the first feature sequence corresponds to the proportion of the first type of images, the weight of the second feature sequence corresponds to the proportion of the second type of images, and the weight of the third feature sequence corresponds to the proportion of the third type of images. The initial value of the quantity ratio of the first type image, the second type image and the third type image is 6:3:1.
Optionally, the step of determining the category to which the image to be processed belongs according to the first feature map includes: establishing a category learning model, where the category learning model is trained and used to determine the category of the image to be processed.
Optionally, the first direction is a direction from left to right of the image to be processed, and the second direction is a direction from top to bottom of the image to be processed.
Optionally, the acquiring the image to be processed includes: acquiring an original image, where at least one image area in the original image is a text image area; cropping out the text image area in the original image to obtain a text image; and preprocessing the text image to obtain an image to be processed with a preset size.
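A minimal sketch of this acquisition step, assuming OpenCV, a known bounding box for the text area, and a hypothetical preset size:

```python
import cv2

def acquire_image(path: str, box: tuple, preset_hw=(32, 512)):
    """Crop the text image area from the original image and resize it
    to the preset size expected by the recognition network."""
    original = cv2.imread(path)
    x, y, w, h = box                              # text area, assumed known
    text_img = original[y:y + h, x:x + w]         # crop the text image area
    gray = cv2.cvtColor(text_img, cv2.COLOR_BGR2GRAY)
    resized = cv2.resize(gray, (preset_hw[1], preset_hw[0]))  # (W, H) order
    return resized / 255.0                        # normalize to [0, 1]
```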
Optionally, the extracting the feature information of the image to be processed in the first direction includes: and extracting the characteristic information of the image to be processed in the first direction by adopting a convolutional neural network model.
Optionally, the step of acquiring a first feature sequence corresponding to the first direction and a second feature sequence corresponding to the second direction according to the first feature map and the second feature map respectively includes: performing convolution processing on the first feature map to obtain a first sequence feature map; performing time-sequence conversion processing on the first sequence feature map to obtain a first feature sequence; performing convolution processing on the second feature map to obtain a second sequence feature map; and performing time-sequence conversion processing on the second sequence feature map to obtain a second feature sequence. Optionally, a shared convolution module may be used to perform the convolution processing.
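A minimal sketch of the shared convolution and time-sequence conversion, assuming PyTorch; collapsing the height axis and treating width as the time axis is an assumed reading of "time-sequence conversion":

```python
import torch
import torch.nn as nn

shared_conv = nn.Conv2d(256, 256, kernel_size=3, padding=1)  # shared by both maps

def to_feature_sequence(fm: torch.Tensor) -> torch.Tensor:
    """fm: (B, C, H, W) feature map -> (B, T, C) feature sequence,
    with the width axis used as the time axis T."""
    x = torch.relu(shared_conv(fm))        # sequence feature map
    x = x.mean(dim=2)                      # collapse height: (B, C, W)
    return x.permute(0, 2, 1)              # time-sequence form: (B, W, C)

# the same module converts both maps, sharing convolution weights:
# seq1 = to_feature_sequence(first_feature_map)
# seq2 = to_feature_sequence(second_feature_map)
```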
Optionally, the text information of the image to be processed is obtained by decoding with a recurrent neural network.
During image recognition, the image recognition device provided in the embodiment of the present invention extracts feature information of the image to be processed in a first direction to obtain a first feature map, acquires feature information of the image to be processed in a second direction based on the first feature map to obtain a second feature map, then acquires a first feature sequence corresponding to the first direction and a second feature sequence corresponding to the second direction based on the first feature map and the second feature map respectively, and predicts the text information of the image to be processed according to the first feature sequence and the second feature sequence. It can be seen that the device acquires feature information of the image to be processed in two intersecting directions, so that the image is recognized and processed in multiple directions, and complex long-text images can thus be recognized completely and accurately.
Of course, an embodiment of the present invention further provides an apparatus that may load the above program module architecture in program form to implement the image recognition method provided in the embodiment of the present invention. The hardware apparatus can be an electronic device with data processing capability, such as a terminal device or a server device.
Optionally, fig. 15 shows an optional hardware architecture of the apparatus provided in the embodiment of the present invention, which may include: at least one memory 3, at least one processor 1, at least one communication interface 2 and at least one communication bus 4; the memory stores a program, and the processor calls the program to execute the aforementioned image recognition method. The processor 1 and the memory 3 may be located in the same electronic device, for example in a server device or a terminal device; they may also be located in different electronic devices.
As an alternative implementation of the disclosure of the embodiment of the present invention, the memory 3 may store a program, and the processor 1 may call the program to execute the image recognition method provided by the above-described embodiment of the present invention.
In the embodiment of the present invention, the electronic device may be a tablet computer, a notebook computer, or the like capable of performing image recognition.
In the embodiment of the present invention, the number of the processor 1, the communication interface 2, the memory 3, and the communication bus 4 is at least one, and the processor 1, the communication interface 2, and the memory 3 complete mutual communication through the communication bus 4; it is obvious that the communication connection of the processor 1, the communication interface 2, the memory 3 and the communication bus 4 shown in fig. 15 is only an alternative way;
optionally, the communication interface 2 may be an interface of a communication module, such as an interface of a GSM module;
the processor 1 may be a central processing unit CPU or a Specific Integrated circuit asic (application Specific Integrated circuit) or one or more Integrated circuits configured to implement an embodiment of the invention.
The memory 3 may comprise a high-speed RAM memory and may also comprise a non-volatile memory, such as at least one disk memory.
It should be noted that the above-mentioned apparatus may also include other components (not shown) that are not necessary to understanding the disclosure of the embodiments of the present invention; these are not individually described herein.
Embodiments of the present invention further provide a computer-readable storage medium, where computer-executable instructions are stored, and when the instructions are executed by a processor, the image recognition method may be implemented.
When the computer-executable instructions stored in the storage medium provided in the embodiment of the present invention perform image recognition, feature information of the image to be processed in a first direction is extracted to obtain a first feature map; feature information of the image to be processed in a second direction is acquired based on the first feature map to obtain a second feature map; a first feature sequence corresponding to the first direction and a second feature sequence corresponding to the second direction are then acquired based on the first feature map and the second feature map respectively; and text information of the image to be processed is predicted according to the first feature sequence and the second feature sequence. It can be seen that the feature information of the image to be processed is acquired in two intersecting directions, so that the image is recognized and processed in multiple directions, and complex long-text images can thus be recognized completely and accurately.
The embodiments of the present invention described above are combinations of elements and features of the present invention. Unless otherwise mentioned, the elements or features may be considered optional. Each element or feature may be practiced without being combined with other elements or features. In addition, the embodiments of the present invention may be configured by combining some elements and/or features. The order of operations described in the embodiments of the present invention may be rearranged. Some configurations of any embodiment may be included in another embodiment, and may be replaced with corresponding configurations of the other embodiment. It is obvious to those skilled in the art that claims that are not explicitly cited in each other in the appended claims may be combined into an embodiment of the present invention or may be included as new claims in a modification after the filing of the present application.
Embodiments of the invention may be implemented by various means, such as hardware, firmware, software, or a combination thereof. In a hardware configuration, the method according to an exemplary embodiment of the present invention may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), processors, controllers, micro-controllers, microprocessors, and the like.
In a firmware or software configuration, embodiments of the present invention may be implemented in the form of modules, procedures, functions, and the like. The software codes may be stored in memory units and executed by processors. The memory unit is located inside or outside the processor, and may transmit and receive data to and from the processor via various known means.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Although the embodiments of the present invention have been disclosed, the present invention is not limited thereto. Various changes and modifications may be effected therein by one skilled in the art without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (18)

1. An image recognition method for recognizing text information in an image, comprising:
acquiring an image to be processed, wherein the image to be processed is a text image, and the size of the image to be processed in a first direction is different from the size of the image to be processed in a second direction;
extracting feature information of the image to be processed in a first direction to obtain a first feature map;
acquiring feature information of the image to be processed in a second direction based on the first feature map to obtain a second feature map, wherein the first direction intersects with the second direction; the first feature map and the second feature map are convolution feature maps, and the feature information of the image to be processed is acquired in two intersecting directions so as to recognize and process the image to be processed in multiple directions;
acquiring, according to the first feature map and the second feature map, a first feature sequence corresponding to the first direction and a second feature sequence corresponding to the second direction respectively; the first feature sequence is used for representing the positions, in a specific text information base, of the text information recognized in the first direction; the second feature sequence is used for representing the positions, in the specific text information base, of the text information recognized in the second direction; the specific text information base is a pre-established database storing text information that may appear in the image to be processed;
extracting text position information of the image to be processed based on the first feature map to obtain a third feature sequence;
and predicting text information of the image to be processed according to the first characteristic sequence, the second characteristic sequence and the third characteristic sequence.
2. The image recognition method according to claim 1, wherein the step of acquiring feature information of the image to be processed in the second direction based on the first feature map comprises:
rotating the first feature map by a preset angle to obtain a second initial feature map, wherein the preset angle matches the included angle between the first direction and the second direction;
and performing size transformation on the second initial feature map to obtain a second feature map with the same size as the first feature map.
3. The image recognition method of claim 2, wherein the step of transforming the size of the second initial feature map comprises:
performing deconvolution processing on the second initial feature map to obtain a second transition feature map, wherein the size of the second transition feature map is the same as that of the image to be processed after rotation by the preset angle;
and performing pooling processing on the second transition feature map to obtain a second feature map with the same size as the first feature map.
4. The image recognition method of claim 2, wherein the step of transforming the size of the second initial feature map comprises:
pooling the second initial feature map to obtain a second transition feature map, wherein the size of the second transition feature map in one direction is the same as that of the first feature map, and the size of the second transition feature map in the other direction is smaller than that of the first feature map;
and expanding the feature information of the second transition feature map in the other direction to obtain a second feature map with the same size as the first feature map.
5. The image recognition method according to claim 1, wherein the step of extracting text position information of the image to be processed based on the first feature map to obtain a third feature sequence comprises:
performing convolution processing on the first feature map a preset number of times, wherein the feature map obtained after each convolution operation is taken as a first intermediate feature map, and in the convolution operations, the first intermediate feature map obtained by any convolution operation is taken as the input of the next convolution operation;
performing deconvolution processing the preset number of times on the first intermediate feature map obtained by the last convolution operation, wherein the feature map obtained after each deconvolution operation is taken as a second intermediate feature map, and in the deconvolution operations, the feature map obtained by adding or combining the second intermediate feature map obtained by any deconvolution operation with the first intermediate feature map of the same size is taken as the input of the next deconvolution operation;
and taking the second intermediate feature map obtained by the last deconvolution operation as a third intermediate feature map, and determining a third feature sequence based on the third intermediate feature map and the first feature map.
6. The image recognition method of claim 1, wherein the predicting text information of the image to be processed according to the first feature sequence, the second feature sequence and the third feature sequence comprises:
determining the weights of the first feature sequence, the second feature sequence and the third feature sequence;
fusing the first feature sequence, the second feature sequence and the third feature sequence according to their weights to obtain a target feature sequence;
and decoding the target feature sequence to obtain the text information of the image to be processed.
7. The image recognition method of claim 6, wherein the step of determining the weight values of the first, second and third feature sequences comprises:
according to the first feature map, determining a category to which the image to be processed belongs, wherein the categories comprise: a first type of image containing short text information in the first direction, a second type of image containing effective text information in the second direction, and a third type of image containing long text information;
adjusting the quantity ratio between images of the category to which the image to be processed belongs and images of the other categories according to the determined category;
determining the weights of the first feature sequence, the second feature sequence and the third feature sequence according to the quantity ratio of the images of each category, wherein the weight of the first feature sequence corresponds to the proportion of the first type of images, the weight of the second feature sequence corresponds to the proportion of the second type of images, and the weight of the third feature sequence corresponds to the proportion of the third type of images.
8. The image recognition method of claim 7, wherein an initial value of the quantity ratio of the first type image, the second type image and the third type image is 6:3:1.
9. The image recognition method according to claim 7, wherein the step of determining the category to which the image to be processed belongs according to the first feature map comprises:
and establishing a category learning model, wherein the category learning model is trained and used to determine the category of the image to be processed.
10. The image recognition method according to claim 1, wherein the first direction is a direction from left to right of the image to be processed, and the second direction is a direction from top to bottom of the image to be processed.
11. The image recognition method of claim 1, wherein the obtaining the image to be processed comprises:
acquiring an original image, wherein at least one image area in the original image is a text image area;
cropping out the text image area in the original image to obtain a text image;
and preprocessing the text image to obtain an image to be processed with a preset size.
12. The image recognition method of claim 1, wherein the extracting the feature information of the image to be processed in the first direction comprises: and extracting the characteristic information of the image to be processed in the first direction by adopting a convolutional neural network model.
13. The image recognition method according to claim 1, wherein the step of obtaining a first feature sequence corresponding to a first direction and a second feature sequence corresponding to a second direction from the first feature map and the second feature map, respectively, comprises:
performing convolution processing on the first feature map to obtain a first sequence feature map;
performing time-sequence conversion processing on the first sequence feature map to obtain a first feature sequence;
and,
performing convolution processing on the second feature map to obtain a second sequence feature map;
and performing time-sequence conversion processing on the second sequence feature map to obtain a second feature sequence.
14. The image recognition method of claim 13, wherein the convolution processing is performed using a shared convolution module.
15. The image recognition method of claim 6, wherein the target feature sequence is decoded by using a recurrent neural network to obtain text information of the image to be processed.
16. An image recognition apparatus for recognizing text information in an image, comprising:
the image acquisition unit is suitable for acquiring an image to be processed, the image to be processed is a text image, and the size of the image to be processed in a first direction is different from the size of the image to be processed in a second direction;
the first feature map extraction unit is suitable for extracting feature information of the image to be processed in a first direction to obtain a first feature map;
the second feature map extraction unit is adapted to acquire feature information of the image to be processed in a second direction based on the first feature map to obtain a second feature map, wherein the first direction intersects with the second direction; the first feature map and the second feature map are convolution feature maps, and the feature information of the image to be processed is acquired in two intersecting directions so as to recognize and process the image to be processed in multiple directions;
the feature sequence acquisition unit is adapted to acquire, according to the first feature map and the second feature map, a first feature sequence corresponding to the first direction and a second feature sequence corresponding to the second direction respectively; the first feature sequence is used for representing the positions, in a specific text information base, of the text information recognized in the first direction; the second feature sequence is used for representing the positions, in the specific text information base, of the text information recognized in the second direction; the specific text information base is a pre-established database storing text information that may appear in the image to be processed;
the third feature sequence extraction module is suitable for extracting text position information of the image to be processed based on the first feature map to obtain a third feature sequence;
and the text information prediction unit is suitable for predicting text information of the image to be processed according to the first feature sequence, the second feature sequence and the third feature sequence.
17. A storage medium characterized in that it stores a program adapted for image recognition to realize the image recognition method according to any one of claims 1 to 15.
18. An apparatus comprising at least one memory and at least one processor; the memory stores a program that the processor calls to execute the image recognition method according to any one of claims 1 to 15.
CN202011012813.4A 2020-09-24 2020-09-24 Image recognition method, device, equipment and storage medium Active CN111931773B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011012813.4A CN111931773B (en) 2020-09-24 2020-09-24 Image recognition method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111931773A CN111931773A (en) 2020-11-13
CN111931773B true CN111931773B (en) 2022-01-28

Family

ID=73334106

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011012813.4A Active CN111931773B (en) 2020-09-24 2020-09-24 Image recognition method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111931773B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109389091A (en) * 2018-10-22 2019-02-26 重庆邮电大学 The character identification system and method combined based on neural network and attention mechanism
CN109684980A (en) * 2018-09-19 2019-04-26 腾讯科技(深圳)有限公司 Automatic marking method and device
CN110399845A (en) * 2019-07-29 2019-11-01 上海海事大学 Continuously at section text detection and recognition methods in a kind of image
CN111444834A (en) * 2020-03-26 2020-07-24 同盾控股有限公司 Image text line detection method, device, equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107704857B (en) * 2017-09-25 2020-07-24 北京邮电大学 End-to-end lightweight license plate recognition method and device

Also Published As

Publication number Publication date
CN111931773A (en) 2020-11-13

Similar Documents

Publication Publication Date Title
CN111931664B (en) Mixed-pasting bill image processing method and device, computer equipment and storage medium
CN109902622B (en) Character detection and identification method for boarding check information verification
CN111027563A (en) Text detection method, device and recognition system
CN110334585A (en) Table recognition method, apparatus, computer equipment and storage medium
EP3869385B1 (en) Method for extracting structural data from image, apparatus and device
US20200004815A1 (en) Text entity detection and recognition from images
EP3989104A1 (en) Facial feature extraction model training method and apparatus, facial feature extraction method and apparatus, device, and storage medium
CN112508975A (en) Image identification method, device, equipment and storage medium
CN107886082B (en) Method and device for detecting mathematical formulas in images, computer equipment and storage medium
CN112926565B (en) Picture text recognition method, system, equipment and storage medium
US11893773B2 (en) Finger vein comparison method, computer equipment, and storage medium
CN113763249A (en) Text image super-resolution reconstruction method and related equipment thereof
CN111666931B (en) Mixed convolution text image recognition method, device, equipment and storage medium
CN114170468B (en) Text recognition method, storage medium and computer terminal
CN113205047A (en) Drug name identification method and device, computer equipment and storage medium
CN115731422A (en) Training method, classification method and device of multi-label classification model
CN111931773B (en) Image recognition method, device, equipment and storage medium
CN110929013A (en) Image question-answer implementation method based on bottom-up entry and positioning information fusion
CN112801045B (en) Text region detection method, electronic equipment and computer storage medium
CN116912872A (en) Drawing identification method, device, equipment and readable storage medium
CN114187445A (en) Method and device for recognizing text in image, electronic equipment and storage medium
Zheng et al. Chinese/English mixed character segmentation as semantic segmentation
CN112884046A (en) Image classification method and device based on incomplete supervised learning and related equipment
CN113516148A (en) Image processing method, device and equipment based on artificial intelligence and storage medium
KR102604756B1 (en) Server, system, method and program providing essay scoring management service

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant