CN113159053A - Image recognition method and device and computing equipment - Google Patents

Image recognition method and device and computing equipment Download PDF

Info

Publication number
CN113159053A
Authority
CN
China
Prior art keywords
network
optical character
image
character recognition
semantic segmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110459094.9A
Other languages
Chinese (zh)
Inventor
范湉湉
卢永晨
黄灿
王长虎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Youzhuju Network Technology Co Ltd
Original Assignee
Beijing Youzhuju Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Youzhuju Network Technology Co Ltd filed Critical Beijing Youzhuju Network Technology Co Ltd
Priority to CN202110459094.9A priority Critical patent/CN113159053A/en
Publication of CN113159053A publication Critical patent/CN113159053A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 Character recognition
    • G06V 30/14 Image acquisition
    • G06V 30/148 Segmentation of character regions
    • G06V 30/153 Segmentation of character regions using recognition of characters or words
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 Character recognition

Abstract

The embodiments of the present application provide an image recognition method, an image recognition apparatus, and a computing device. The method includes: acquiring a first image to be processed; and inputting the first image into an optical character recognition network to obtain the character information in the first image, where the optical character recognition network is trained with the aid of a semantic segmentation network, the semantic segmentation network is used to predict the position information of characters in an image, and the semantic segmentation network and the optical character recognition network share the features extracted by an intermediate layer of the optical character recognition network. Because the semantic segmentation network assists in training the optical character recognition network, the optical character recognition network learns to attend to the character regions in the first image, which reduces the prediction difficulty of the optical character recognition network, improves its accuracy, and enables accurate recognition in complex scenes such as severely curved, rotated, or vertical characters.

Description

Image recognition method and device and computing equipment
Technical Field
The embodiments of the present application relate to the technical field of image processing, and in particular to an image recognition method, an image recognition apparatus, and a computing device.
Background
With the development of deep learning, object recognition technology has become increasingly mature. For example, Optical Character Recognition (OCR) analyzes and recognizes an image to obtain the text and layout information it contains.
However, in some scenes, such as restaurant signs, product labels, and company logos, the characters exhibit complex shape distortions, such as severe bending, rotation, or vertical layout, and current object recognition technology cannot accurately recognize the characters in such images.
Disclosure of Invention
The embodiments of the present application provide an image recognition method, an image recognition apparatus, and a computing device, which are used to achieve accurate recognition of characters in an image.
In a first aspect, an embodiment of the present application provides an image recognition method, including:
acquiring a first image to be processed;
inputting the first image into an optical character recognition network to obtain character information in the first image, where the optical character recognition network is trained with the aid of a semantic segmentation network, the semantic segmentation network is used to predict position information of characters in an image, and the semantic segmentation network and the optical character recognition network share the features extracted by an intermediate layer of the optical character recognition network.
In a second aspect, an embodiment of the present application provides an image recognition apparatus, including:
a first acquisition unit configured to acquire a first image to be processed;
and a recognition unit, configured to input the first image into an optical character recognition network to obtain character information in the first image, where the optical character recognition network is trained with the aid of a semantic segmentation network, the semantic segmentation network is used to predict position information of characters in an image, and the semantic segmentation network and the optical character recognition network share the features extracted by an intermediate layer of the optical character recognition network.
In a third aspect, embodiments of the present application provide a computing device, comprising a processor and a memory;
the memory for storing a computer program;
the processor is configured to execute the computer program to implement the method according to the first aspect.
In a fourth aspect, the embodiments of the present application provide a computer-readable storage medium including computer instructions which, when executed by a computer, cause the computer to implement the method according to the first aspect.
In a fifth aspect, embodiments of the present application provide a computer program product, which includes a computer program, the computer program being stored in a readable storage medium, from which the computer program can be read by at least one processor of a computer, and the execution of the computer program by the at least one processor causes the computer to implement the method of the first aspect.
According to the image recognition method, the image recognition apparatus, and the computing device provided by the embodiments of the present application, the optical character recognition network is trained with the aid of the semantic segmentation network, so that the optical character recognition network learns to attend to the character regions in the first image, which reduces the prediction difficulty of the optical character recognition network, improves its accuracy, and enables accurate recognition in complex scenes such as severely curved, rotated, or vertical characters.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
FIG. 1 is a schematic diagram of an optical character recognition network and a semantic segmentation network to be trained according to the present application;
FIG. 2 is a schematic flowchart of a training method for an optical character recognition network according to an embodiment of the present application;
FIG. 3 is a schematic flowchart of a training method for an optical character recognition network according to an embodiment of the present application;
FIG. 4 is a schematic diagram of another optical character recognition network and semantic segmentation network to be trained according to the present application;
FIG. 5 is a schematic structural diagram of a Transformer network according to an embodiment of the present application;
FIG. 6 is a schematic diagram of another optical character recognition network and semantic segmentation network to be trained according to the present application;
FIG. 7 is a schematic flowchart of an image recognition method according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a trained optical character recognition network according to an embodiment of the present application;
FIG. 9 is a schematic structural diagram of an image recognition apparatus according to an embodiment of the present application;
FIG. 10 is a schematic structural diagram of another image recognition apparatus according to an embodiment of the present application;
FIG. 11 is a schematic block diagram of a computing device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
The embodiments of the present application relate to the technical field of artificial intelligence, and in particular to an image recognition method, an image recognition apparatus, and a computing device.
In order to facilitate understanding of the embodiments of the present application, the related concepts related to the embodiments of the present application are first briefly described as follows:
Artificial Intelligence (AI) is a theory, method, technique, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines can perceive, reason, and make decisions.
Artificial intelligence technology is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
Computer Vision (CV) technology is the science of how to make machines "see": using cameras and computers instead of human eyes to identify, track, and measure targets, and performing further image processing so that the processed image becomes more suitable for the human eye to observe or for transmission to instruments for detection. As a scientific discipline, computer vision researches related theories and techniques in an attempt to build artificial intelligence systems that can capture information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technologies, virtual reality, augmented reality, simultaneous localization and mapping, and other technologies, and also include common biometric technologies such as face recognition and fingerprint recognition.
Machine Learning (ML) is a multi-disciplinary field that involves probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It specializes in studying how computers simulate or implement human learning behavior in order to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve performance. Machine learning is the core of artificial intelligence and the fundamental way to give computers intelligence, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and formal education learning.
It should be understood that, in the embodiments of the present application, "B corresponding to A" means that B is associated with A. In one implementation, B may be determined from A. It should also be understood that determining B from A does not mean determining B from A alone; B may also be determined from A and/or other information.
In the description of the present application, "plurality" means two or more than two unless otherwise specified.
In addition, to facilitate a clear description of the technical solutions of the embodiments of the present application, terms such as "first" and "second" are used to distinguish between identical or similar items that have substantially the same functions and effects. Those skilled in the art will appreciate that the terms "first", "second", and the like do not limit quantity or execution order, and do not indicate relative importance.
The technical solutions of the embodiments of the present application are described in detail below with reference to some embodiments. The following several embodiments may be combined with each other and may not be described in detail in some embodiments for the same or similar concepts or processes.
First, a training process of the neural network will be described.
FIG. 1 is a schematic diagram of an optical character recognition network and a semantic segmentation network to be trained in accordance with the present application.
The semantic segmentation network is used for predicting the position information of characters in the image.
The optical character recognition network is used for recognizing the characters in an image.
As shown in FIG. 1, the optical character recognition network includes an intermediate layer, and this intermediate layer is coupled to the input layer of the semantic segmentation network; that is, the output of the intermediate layer of the optical character recognition network is the input of the semantic segmentation network. The semantic segmentation network is used to assist in training the optical character recognition network, so that the trained optical character recognition network implicitly encodes the position information of characters in the image, which further improves the accuracy with which the optical character recognition network recognizes characters.
In some embodiments, the semantic segmentation network is a pluggable branch. In the training phase, the semantic segmentation network assists the optical character recognition network in learning the position information of characters; in the actual prediction phase, the semantic segmentation network may be removed, and the optical character recognition network, which has learned the position information of characters, is used to recognize the characters in the image. In this way, the accuracy of the optical character recognition network is improved without increasing prediction time or computational complexity.
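For illustration, the pluggable design described above might be organized as in the following PyTorch sketch; the class and module names (SharedBackboneOCR, backbone, recognizer, seg_head) are hypothetical and are not taken from the application:

```python
import torch
import torch.nn as nn

class SharedBackboneOCR(nn.Module):
    # Hypothetical sketch: an OCR recognizer plus a pluggable segmentation branch,
    # both reading the feature map produced by the shared intermediate layer.
    def __init__(self, backbone: nn.Module, recognizer: nn.Module, seg_head: nn.Module):
        super().__init__()
        self.backbone = backbone      # intermediate layer of the OCR network (e.g. a shallow CNN)
        self.recognizer = recognizer  # character recognition module (e.g. a Transformer)
        self.seg_head = seg_head      # semantic segmentation branch, used only during training

    def forward(self, images, with_segmentation: bool = False):
        features = self.backbone(images)          # first feature map, shared by both branches
        char_logits = self.recognizer(features)   # character information
        if with_segmentation:
            text_mask = self.seg_head(features)   # position information of characters
            return char_logits, text_mask
        return char_logits                        # at inference the branch is simply not called
```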
Fig. 2 is a flowchart illustrating a training method for an optical character recognition network according to an embodiment of the present disclosure. As shown in fig. 1 and 2, the method of the embodiment of the present application includes:
s201, acquiring a training image;
s202, end-to-end training is carried out on the optical character recognition network and the semantic segmentation network by using the training image, wherein the input of the semantic segmentation network is the output of the middle layer of the optical character recognition network.
The execution subject of the embodiments of the present application is a device having a model training function, for example, an image recognition apparatus. In some embodiments, the image recognition apparatus is a computing device. In some embodiments, the image recognition apparatus is a unit with a data processing function in the computing device, for example, a processor in the computing device. The embodiments of the present application are described by taking a computing device as the execution subject.
In some embodiments, the computing device may be a terminal device, such as a terminal server, a smart phone, a laptop, a tablet, a personal desktop, a smart camera, and the like.
The training image may be understood as one training image in a training image set, where the training process is the same for each training image in the set; for convenience of description, this embodiment is described by taking one training image as an example.
In some embodiments, one training image is input into the optical character recognition network during each training pass. After training with this image is completed, the next training image is input to continue training.
In some embodiments, multiple training images may be input during each training process, and the model shown in FIG. 1 may be trained simultaneously using the multiple training images.
As shown in FIG. 1, after a training image is acquired, the training image is input into the optical character recognition network, the intermediate layer of the optical character recognition network outputs a feature map, the feature map is input into the semantic segmentation network, and the optical character recognition network and the semantic segmentation network are trained end to end. Specifically, the training image is used to train the optical character recognition network, the feature map output by the intermediate layer of the optical character recognition network is used to train both the optical character recognition network and the semantic segmentation network, and the semantic segmentation network and the optical character recognition network share the intermediate layer of the optical character recognition network. As training proceeds, the optical character recognition network automatically and implicitly encodes the position information of characters, so that it can attend to the character regions in images, which reduces its prediction difficulty and improves its recognition accuracy.
The end-to-end training of the optical character recognition network and the semantic segmentation network using the training image in S202 is described in detail below with reference to fig. 3.
Fig. 3 is a schematic flowchart of a training method for an optical character recognition network according to an embodiment of the present application, where as shown in fig. 3, the step S202 includes:
s301, inputting a training image into an optical character recognition network to obtain a first characteristic diagram output by an intermediate layer of the optical character recognition network and character information in the training image predicted by the optical character recognition network; inputting the first characteristic diagram into a semantic segmentation network to obtain position information of characters in a training image predicted by the semantic segmentation network;
s302, performing end-to-end training on the optical character recognition network and the semantic segmentation network according to the difference between the character information predicted by the optical character recognition network and the real information of the characters in the training image and the difference between the position information of the characters predicted by the semantic segmentation network and the real position information of the characters in the training image.
In this embodiment, the feature map output by the intermediate layer of the optical character recognition network is referred to as the first feature map.
In some embodiments, before training, the real position information of the characters in the training image is labeled, and meanwhile, the real information of the characters in the training image is labeled.
During training, as shown in FIG. 1, a training image is input into the optical character recognition network, a first feature map is output by the intermediate layer of the optical character recognition network, and the optical character recognition network predicts the character information in the training image. In some embodiments, the character information includes the categories of the characters, the shapes of the characters, and the like. The semantic segmentation network predicts the position information of the characters in the training image.
And comparing the character information predicted by the optical character recognition network with the real information of the characters in the training image to obtain the difference between the character information predicted by the optical character recognition network and the real information of the characters in the training image. And comparing the position information of the characters predicted by the semantic segmentation network with the real position information of the characters in the training image to obtain the difference between the position information of the characters predicted by the semantic segmentation network and the real position information of the characters in the training image. And performing end-to-end training on the optical character recognition network and the semantic segmentation network according to the difference between the character information predicted by the optical character recognition network and the real information of the characters in the training image and the difference between the position information of the characters predicted by the semantic segmentation network and the real position information of the characters in the training image.
In some embodiments, the loss corresponding to the optical character recognition network is calculated according to the character information predicted by the optical character recognition network and the real information of the characters in the training image, the loss corresponding to the semantic segmentation network is calculated according to the position information of the characters predicted by the semantic segmentation network and the real position information of the characters in the training image, and the parameters in the optical character recognition network and the semantic segmentation network are adjusted according to the calculated loss corresponding to the optical character recognition network and the calculated loss corresponding to the semantic segmentation network, so that end-to-end training is realized.
In a possible implementation manner, the intermediate layer of the optical character recognition network is trained in advance, and when the parameters in the optical character recognition network and the semantic segmentation network are adjusted according to the loss corresponding to the optical character recognition network and the loss corresponding to the semantic segmentation network, the parameters in the intermediate layer of the optical character recognition network may not be adjusted.
Optionally, the loss function used by the optical character recognition network to calculate its loss includes any one of the following: a logarithmic loss function, a squared loss function, an exponential loss function, a cross-entropy loss function, or a mean square error loss function.
Optionally, the loss function used by the semantic segmentation network to calculate its loss includes any one of the following: a logarithmic loss function, a squared loss function, an exponential loss function, a cross-entropy loss function, or a mean square error loss function.
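Building on the sketch above, one end-to-end training step could be written roughly as follows; the choice of cross-entropy for the recognition branch and binary cross-entropy for the segmentation branch is just one of the options listed above, not a choice fixed by the application, and the tensor shapes are assumptions:

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, images, char_labels, text_masks):
    """One joint optimization step; `model` is the earlier sketch with both branches enabled."""
    optimizer.zero_grad()
    char_logits, mask_pred = model(images, with_segmentation=True)

    # Loss of the optical character recognition branch: predicted character information
    # vs. the real character labels, assuming char_logits is (batch, seq_len, vocab).
    ocr_loss = F.cross_entropy(char_logits.flatten(0, 1), char_labels.flatten())

    # Loss of the semantic segmentation branch: predicted text mask vs. the real
    # (binarized) position information of the characters, same spatial shape assumed.
    seg_loss = F.binary_cross_entropy_with_logits(mask_pred, text_masks)

    # End-to-end training: both losses back-propagate through the shared backbone.
    loss = ocr_loss + seg_loss
    loss.backward()
    optimizer.step()
    return loss.item()
```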
In some embodiments, the optical character recognition network includes a convolutional neural network, and the first feature map is a feature map of an output of the convolutional neural network. That is, the intermediate layer of the optical character recognition network is a convolutional neural network in the optical character recognition network.
A Convolutional Neural Network (CNN) is a class of Feedforward Neural Networks (FNN) that involve convolution computations and have a deep structure, and is one of the representative algorithms of deep learning. Convolutional neural networks have a representation learning capability and can perform shift-invariant classification of input information according to their hierarchical structure, and are therefore also called "Shift-Invariant Artificial Neural Networks (SIANN)".
In some embodiments, the convolutional neural network may be a shallow convolutional neural network (shallow CNN) that outputs a first feature map of the training image.
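Purely as an assumed example, a shallow convolutional backbone producing the first feature map might look like the following; the channel counts and layer depth are illustrative only:

```python
import torch.nn as nn

# Hypothetical shallow CNN: a few convolution blocks that downsample the input
# image and output the first feature map shared by both branches.
shallow_cnn = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.MaxPool2d(2),                               # H/2 x W/2
    nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.MaxPool2d(2),                               # H/4 x W/4
    nn.Conv2d(128, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
)
```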
In some embodiments, as shown in FIG. 4, the semantic segmentation network includes a deconvolution layer for converting the first feature map into the second feature map.
A deconvolution layer, also called a transposed convolution layer, can be regarded as the inverse process of a convolution layer: the feature map produced by a convolution layer is taken as input and deconvolved, and the deconvolution result can be used to verify whether the feature map extracted by each convolution layer is accurate.
Optionally, the size of the second feature map is consistent with the size of the training image. For example, if the size of the training image is 16 × 16, that is, the training image includes 16 rows of pixels with 16 pixels in each row, then the size of the second feature map is also 16 × 16.
In some embodiments, the second feature map may be understood as a training image obtained by deconvolving the first feature map.
In some embodiments, the deconvolution layer in FIG. 4 includes at least one deconvolution sub-layer. Optionally, the deconvolution sub-layers are connected in series, i.e., the output of the previous deconvolution sub-layer is the input of the next deconvolution sub-layer.
Continuing to refer to fig. 4, the semantic segmentation network further includes a character segmentation layer, the character segmentation layer is connected to the deconvolution layer, that is, the deconvolution layer performs deconvolution on the first feature map output by the convolutional neural network to obtain a second feature map, and inputs the second feature map into the character segmentation layer, and the character segmentation layer predicts position information of characters in the training image based on the second feature map.
In some embodiments, the position information of the text output by the semantic segmentation network is a binary mask of the position of the text in the training image.
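A possible sketch of the segmentation branch follows, assuming the feature map comes from a backbone like the one above that downsamples the image by a factor of 4; the two transposed convolutions restore the input resolution, and the final 1-channel map is thresholded into the binarized mask of character positions:

```python
import torch
import torch.nn as nn

class SegmentationHead(nn.Module):
    # Hypothetical semantic segmentation branch: stacked deconvolution (transposed
    # convolution) sub-layers followed by a character segmentation layer.
    def __init__(self, in_channels: int = 256):
        super().__init__()
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(in_channels, 128, kernel_size=2, stride=2), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2), nn.ReLU(inplace=True),
        )
        self.char_seg = nn.Conv2d(64, 1, kernel_size=1)  # per-pixel text / non-text score

    def forward(self, first_feature_map):
        second_feature_map = self.deconv(first_feature_map)  # same spatial size as the input image
        logits = self.char_seg(second_feature_map)
        return logits  # a binary mask is obtained by thresholding torch.sigmoid(logits)
```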
In some embodiments, the optical character recognition network further comprises a character recognition module in addition to the convolutional neural network, wherein an input end of the character recognition module is connected with an output end of the convolutional neural network, and the character recognition module recognizes characters in the training image based on the first feature map output by the convolutional neural network.
In some embodiments, the character recognition module may be a Transformer network.
FIG. 5 is a schematic structural diagram of a Transformer network according to an embodiment of the present application. It should be noted that FIG. 5 is only an example, and the structure of the Transformer network according to the embodiments of the present application includes, but is not limited to, what is shown in FIG. 5.
As shown in FIG. 5, the Transformer network comprises an encoding component and a decoding component.
The encoding component includes at least one encoder, and "Ne ×" on the left side of FIG. 5 indicates the number of encoders. In some embodiments, the encoding component includes 6 encoders. All encoders have the same structure, but they do not share parameters. Each encoder can be decomposed into 4 sub-layers: a multi-head attention layer, an addition & normalization (add & norm) layer, a position-aware feed-forward layer, and another addition & normalization layer. In some embodiments, the position-aware feed-forward network is a fully connected feed-forward neural network.
The convolutional neural network processes the current input to obtain a plurality of feature vectors. The current encoder obtains M input vectors from the layer above it. For each of the M input vectors, taking that input vector as the center, an intermediate vector corresponding to it is obtained based on the degree of association between it and each input vector within a preset attention window. In this way, an intermediate vector can be determined for each of the M input vectors. The M intermediate vectors are then combined into Q output vectors. If the current encoder is the last encoder in the encoding component, the Q output vectors it outputs are used as the feature representation of the current input.
The multi-head attention layer, the addition & normalization (add & norm) layer, and the position-aware feed-forward layer are described below with reference to specific examples.
(1) Multi-head attention layer (Multi-Head Attention)
The attention mechanism mimics the internal process of biological observation behavior, that is, a mechanism that aligns internal experience with external sensation to increase the fineness of observation of certain regions, and it can rapidly select high-value information from a large amount of information using limited attention resources. The attention mechanism can quickly extract important features of sparse data and is therefore widely used in natural language processing tasks, especially machine translation. The self-attention mechanism is an improvement of the attention mechanism that reduces the dependence on external information and is better at capturing the internal correlations of data or features. The essential idea of the attention mechanism can be expressed by the following formula:
Attention(Query, Source) = Σ_i Similarity(Query, Key_i) · Value_i
The formula means that the constituent elements in the Source can be imagined as a series of (Key, Value) data pairs. Given an element Query in the Target, the weight coefficient of the Value corresponding to each Key is obtained by calculating the similarity or correlation between the Query and that Key, and the Values are then weighted and summed to obtain the final Attention value. So, in essence, the attention mechanism performs a weighted summation over the Values of the elements in the Source, with the Query and the Keys used to calculate the weight coefficients of the corresponding Values. Conceptually, attention can be understood as selectively filtering a small amount of important information out of a large amount of information and focusing on it while ignoring most of the less important information. The focusing process is reflected in the calculation of the weight coefficients: the larger the weight, the more the focus falls on the corresponding Value; that is, the weight represents the importance of the information, and the Value is the corresponding information. The self-attention mechanism can be understood as internal attention: whereas the attention mechanism occurs between the Target element Query and all elements in the Source, the self-attention mechanism refers to attention occurring among the elements inside the Source (or inside the Target), and can be understood as the attention mechanism in the special case where Target equals Source.
In some embodiments, attention may be represented by the following equation (1):
attention_output=Attention(Q,K,V) (1)
the Multi-HeadAttention consists of a plurality of Self-Attentions. The input values into Self-orientation will form three vectors through three different layers: query (Q), keys (K), values (V). The Attention function can be seen as mapping a query and a series of key-value pairs, all vectors being query, keys, values and output, to an output. output is a weighted sum of values, and the weight set for each value is obtained by calculating a correlation function (compatibility function) of the query and its corresponding key. The multi-head attention projects Q, K, V through h different linear transformations, concatenating the different attention results according to the following equations (2) and (3):
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O    (2)
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)    (3)
where W^O, W_i^Q, W_i^K, and W_i^V are learned projection matrices.
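Equations (1) to (3) correspond to the standard scaled dot-product multi-head attention, which can be sketched as follows; this is a generic implementation rather than code from the application, and the head count h = 8 is an assumption:

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int = 512, h: int = 8):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h
        # Learned projection matrices W_i^Q, W_i^K, W_i^V (fused over heads) and W^O.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, q, k, v):
        b = q.size(0)
        # Project and split into h heads: (batch, h, seq_len, d_k).
        q = self.w_q(q).view(b, -1, self.h, self.d_k).transpose(1, 2)
        k = self.w_k(k).view(b, -1, self.h, self.d_k).transpose(1, 2)
        v = self.w_v(v).view(b, -1, self.h, self.d_k).transpose(1, 2)
        # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, computed per head.
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)
        heads = torch.softmax(scores, dim=-1) @ v
        # Concat(head_1, ..., head_h) W^O.
        return self.w_o(heads.transpose(1, 2).reshape(b, -1, self.h * self.d_k))
```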
(2) Addition and normalization layer (add & norm)
The output vector of the multi-head attention layer and the initial input vector pass through an Add layer and a Layer Normalization layer, where the Add layer adds the results of the two layers of the neural network and the Layer Normalization layer performs layer normalization. The addition & normalization layer can prevent the gradient from vanishing and accelerate convergence.
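As a small illustrative sketch, the add & norm step can be expressed as a residual addition followed by layer normalization around any sub-layer:

```python
import torch.nn as nn

class AddAndNorm(nn.Module):
    # Residual addition of the sub-layer output to its input, then layer normalization.
    def __init__(self, d_model: int = 512):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, sublayer):
        return self.norm(x + sublayer(x))  # Add (residual connection), then Norm
```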
(3) Position sensing feedforward layer (locality aware feed-forward)
The position-aware feed-forward layer mainly provides a non-linear transformation. The position-aware feed-forward network is a fully connected layer, applied separately and identically to each position. It consists of two linear transformations with a ReLU activation in between. Although the linear transformations are the same across positions, different parameters are used from layer to layer. Another way to describe the feed-forward layer is as two convolutions with kernel size 1. The dimensionality of the input and output is d_model = 512, and the dimensionality of the inner layer is d_ff = 2048.
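With the dimensions quoted above, the position-wise feed-forward sub-layer can be sketched as two linear transformations with a ReLU in between:

```python
import torch.nn as nn

class PositionwiseFeedForward(nn.Module):
    # FFN(x) = max(0, x W1 + b1) W2 + b2, applied identically at every position.
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(inplace=True),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):
        return self.net(x)
```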
The decoding component comprises at least one decoder. A decoder has almost the same structure as an encoder, except for one additional attention sub-layer. The input, output, and decoding process of the decoder are defined as follows:
Output: the probability distribution of the output word at position i;
Input: the output of the encoder and the output of the decoder at position i-1. The attention layer in the middle of the decoder is not self-attention: its K and V come from the encoder, and its Q comes from the output of the decoder at the previous position.
Decoding: note that training and prediction are not the same. During training, all positions are decoded at once, and the ground-truth value of the previous step is used for prediction; during prediction, since there is no ground truth, the output must be predicted position by position.
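The difference between decoding during training and decoding during prediction can be sketched as follows; `decoder`, `memory` (the encoder output), and BOS_ID are placeholder names, and greedy selection is used only for illustration:

```python
import torch

BOS_ID = 0  # hypothetical begin-of-sequence token id

def decode_train(decoder, memory, gt_tokens):
    # Teacher forcing: all positions are decoded in one pass, feeding the
    # ground-truth token of the previous step as input to the current step.
    shifted = torch.roll(gt_tokens, shifts=1, dims=1)
    shifted[:, 0] = BOS_ID
    return decoder(shifted, memory)  # logits for every position at once

@torch.no_grad()
def decode_predict(decoder, memory, max_len=32):
    # Autoregressive prediction: no ground truth exists, so tokens are predicted one by one.
    tokens = torch.full((memory.size(0), 1), BOS_ID, dtype=torch.long, device=memory.device)
    for _ in range(max_len):
        logits = decoder(tokens, memory)
        next_token = logits[:, -1].argmax(dim=-1, keepdim=True)  # greedy choice per step
        tokens = torch.cat([tokens, next_token], dim=1)
    return tokens[:, 1:]
```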
The decoding component also includes an embedding layer, which may be referred to as an input embedding layer. The current input may be a text input, for example, a piece of text or a sentence. The text can be Chinese text, English text, or text in other languages. After the current input is obtained, the embedding layer may perform embedding processing on each word in the current input to obtain a feature vector for each word. In the input embedding layer, word embedding processing may be performed on each word in the current input to obtain a word embedding vector for each word. In the position encoding layer, the position of each word in the current input may be obtained, and a position vector may be generated for the position of each word. In some examples, the position of a word may be its absolute position in the current input. Once the word embedding vector and the position vector of each word in the current input are obtained, the position vector of each word can be combined with the corresponding word embedding vector to obtain the feature vector of each word, yielding a plurality of feature vectors corresponding to the current input. The plurality of feature vectors may be represented as an embedding matrix with a preset dimension. If the number of feature vectors is M and the preset dimension is H, the feature vectors can be represented as an M × H embedding matrix.
Position coding (Positional Encoding)
The Transformer model itself lacks a way to interpret the order of words in the input sequence. To solve this problem, the Transformer adds an additional positional encoding to the input of the encoding and decoding components, which is used to learn the position of words, or the distance between different words, in a sentence, and outputs a position vector. This position vector can be calculated in various ways; for example, as shown in equations (4) and (5):
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))    (4)
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))    (5)
where pos refers to the position of the current word in the sentence, i refers to the index of each value in the vector, sine coding is used in even positions, and cosine coding is used in odd positions.
In some embodiments, the position coding of the encoding end may be adaptive 2D position coding.
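Equations (4) and (5) can be computed as in the following sketch of the 1D sinusoidal encoding (the adaptive 2D positional encoding mentioned above is not shown):

```python
import math
import torch

def sinusoidal_position_encoding(max_len: int, d_model: int = 512) -> torch.Tensor:
    """Returns a (max_len, d_model) matrix of position vectors per equations (4) and (5)."""
    pe = torch.zeros(max_len, d_model)
    pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)  # position of the word
    div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(pos * div)  # sine encoding at even indices
    pe[:, 1::2] = torch.cos(pos * div)  # cosine encoding at odd indices
    return pe
```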
The Transformer network shown in FIG. 5 has been described above. It should be noted that FIG. 5 is only an example, and the Transformer network of the present application may include more or fewer network layers than shown in FIG. 5.
In a specific embodiment, FIG. 6 is a schematic diagram of another optical character recognition network and semantic segmentation network to be trained according to the present application. As shown in FIG. 6, the optical character recognition network includes a convolutional neural network and a Transformer network, and the semantic segmentation network includes a deconvolution layer and a character segmentation layer, where the output end of the convolutional neural network is connected to the input end of the deconvolution layer and to the input end of the Transformer network, respectively.
Illustratively, the text in the training image consists of the characters A, B, and C. The training image is input into the convolutional neural network, which processes the training image and outputs a first feature map of the training image. On the one hand, the convolutional neural network inputs the first feature map into the Transformer network, and the Transformer network processes the first feature map to recognize the characters A, B, and C in the training image.
On the other hand, the convolutional neural network inputs the first feature map into the deconvolution layer in the semantic segmentation network. The deconvolution layer performs deconvolution on the first feature map and outputs a second feature map, the second feature map is input into the character segmentation layer, and the character segmentation layer performs character segmentation on the second feature map to predict the position information of the characters A, B, and C in the training image.
A first loss between the position information of the characters in the training image predicted by the semantic segmentation network and the real position information of the characters in the training image is calculated using the loss function corresponding to the semantic segmentation network, and a second loss between the character information in the training image recognized by the optical character recognition network and the real information of the characters in the training image is calculated using the loss function corresponding to the optical character recognition network. The semantic segmentation network and the convolutional neural network are then trained by back-propagating the first loss, and the recognition network and the convolutional neural network in the optical character recognition network are trained by back-propagating the second loss.
As training proceeds, the character position information predicted by the semantic segmentation network approaches the real position information of the characters, so the semantic segmentation network effectively encodes the position information of the characters. Because the semantic segmentation network and the optical character recognition network share the CNN part, the first feature map extracted by the CNN is used for both character recognition and character position prediction. As training proceeds, the CNN automatically and implicitly encodes the position information of the characters, and since the CNN is part of the optical character recognition network, the optical character recognition network can attend to the character regions in images, which reduces its prediction difficulty, improves its accuracy, and enables accurate recognition in complex scenes, such as severely curved, rotated, or vertical characters.
In addition, the semantic segmentation network and the optical character recognition network are trained end to end at the same time during the training phase: the Transformer network in the optical character recognition network helps the CNN extract the semantic information of the characters, the semantic segmentation network helps the CNN extract the position information of the characters, and the two kinds of information complement each other, which enhances the feature extraction capability of the CNN.
The training process of the optical character recognition network is described in detail above, and the prediction process of the optical character recognition network is described below.
FIG. 7 is a schematic flowchart of an image recognition method according to an embodiment of the present application; that is, this embodiment mainly introduces the process of performing character recognition on a first image using the trained optical character recognition network. As shown in FIG. 7, the method includes:
s701, acquiring a first image to be processed;
s702, inputting the first image into an optical character recognition network to obtain the character information in the first image.
The optical character recognition network is trained by the aid of a semantic segmentation network, the semantic segmentation network is used for predicting position information of characters in the images, and the semantic segmentation network and the optical character recognition network share features extracted from an intermediate layer of the optical character recognition network.
The first image to be processed comprises at least one character.
In some embodiments, the trained optical character recognition network remains connected to the trained semantic segmentation network, as shown in FIG. 6, for example. The first image is input into the optical character recognition network, and the optical character recognition network recognizes the characters in the first image; the CNN part of the optical character recognition network can attend to the character regions in the image, which reduces the prediction difficulty of the optical character recognition network and improves its accuracy. Meanwhile, the feature map output by the CNN part is input into the semantic segmentation network, and the semantic segmentation network identifies the position information of the characters in the first image.
In some embodiments, after the optical character recognition network is trained, the semantic segmentation network is removed, and the optical character recognition network shown in fig. 8 is obtained. The first image is input to an optical character recognition network that recognizes the text in the first image.
In the embodiments of the present application, the semantic segmentation network assists in training the optical character recognition network, so that the optical character recognition network attends to the character regions in the first image, which reduces the prediction difficulty of the optical character recognition network, improves its accuracy, and enables accurate recognition in complex scenes, such as severely curved, rotated, or vertical characters.
Fig. 9 is a schematic structural diagram of an image recognition apparatus according to an embodiment of the present application. The image recognition device may be a computing device or may be a component of a computing device (e.g., an integrated circuit, a chip, etc.). As shown in fig. 9, the image recognition apparatus 100 may include:
a first acquiring unit 110, configured to acquire a first image to be processed;
the recognition unit 120 is configured to input the first image into an optical character recognition network, so as to obtain text information in the first image, where the optical character recognition network is trained with the aid of a semantic segmentation network, the semantic segmentation network is configured to predict position information of a text in an image, and share features extracted by an intermediate layer of the optical character recognition network with the optical character recognition network.
Fig. 10 is a schematic structural diagram of an image recognition apparatus according to an embodiment of the present application. As shown in fig. 10, the image recognition apparatus 100 may further include:
a second obtaining unit 130, configured to obtain a training image;
a training unit 140, configured to perform end-to-end training on the optical character recognition network and the semantic segmentation network using the training image, where an input of the semantic segmentation network is an output of an intermediate layer of the optical character recognition network.
In some embodiments, the training unit 140 is specifically configured to input the training image into the optical character recognition network, to obtain a first feature map output by an intermediate layer of the optical character recognition network, and character information in the training image predicted by the optical character recognition network; inputting the first feature map into the semantic segmentation network to obtain position information of characters in the training image predicted by the semantic segmentation network; and performing end-to-end training on the optical character recognition network and the semantic segmentation network according to the difference between the character information predicted by the optical character recognition network and the real information of the characters in the training image and the difference between the position information of the characters predicted by the semantic segmentation network and the real position information of the characters in the training image.
In some embodiments, the optical character recognition network comprises a convolutional neural network, and the first feature map is a feature map of an output of the convolutional neural network.
In some embodiments, the semantic segmentation network includes a deconvolution layer for converting the first feature map into a second feature map.
Optionally, the size of the second feature map is consistent with the size of the training image.
In some embodiments, the semantic segmentation network further comprises a character segmentation layer that predicts location information of the text in the training image based on the second feature map.
Optionally, the output of the semantic segmentation network is a binarization mask of the position of the character in the training image.
In some embodiments, the optical character recognition network further comprises a recognition network, an input end of the recognition network is connected with an output end of the convolutional neural network, and the recognition network is used for recognizing characters in the training image according to the first feature map output by the convolutional neural network.
It is to be understood that apparatus embodiments and method embodiments may correspond to one another and that similar descriptions may refer to method embodiments. To avoid repetition, further description is omitted here. Specifically, the apparatus 100 shown in fig. 9 may correspond to a corresponding main body in executing the method of the embodiment of the present application, and the foregoing and other operations and/or functions of each unit in the apparatus 100 are respectively for implementing corresponding flows in each method such as the method, and are not described herein again for brevity.
The apparatus and system of embodiments of the present application are described above in terms of functional units in conjunction with the following figures. It is to be understood that the functional units may be implemented in hardware, by instructions in software, or by a combination of hardware and software units. Specifically, the steps of the method embodiments in the present application may be implemented by integrated logic circuits of hardware in a processor and/or instructions in the form of software, and the steps of the method disclosed in conjunction with the embodiments in the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software units in the decoding processor. Alternatively, the software elements may reside in random access memory, flash memory, read only memory, programmable read only memory, electrically erasable programmable memory, registers, or other storage medium known in the art. The storage medium is located in a memory, and a processor reads information in the memory and completes the steps in the above method embodiments in combination with hardware thereof.
Fig. 11 is a block diagram of a computing device according to an embodiment of the present application, where the computing device is configured to execute the image recognition method according to the foregoing embodiment, and refer to the description in the foregoing method embodiment specifically.
The computing device 200 shown in FIG. 11 includes a memory 201, a processor 202, and a communication interface 203, which are communicatively connected to each other, for example, through a network connection. Optionally, the computing device 200 may also include a bus 204, in which case the memory 201, the processor 202, and the communication interface 203 are connected to each other by the bus 204, as shown in FIG. 11.
The memory 201 may be a Read-Only Memory (ROM), a static storage device, a dynamic storage device, or a Random Access Memory (RAM). The memory 201 may store a program; when the program stored in the memory 201 is executed by the processor 202, the processor 202 and the communication interface 203 are used to perform the methods described above.
The processor 202 may be implemented as a general purpose Central Processing Unit (CPU), a microprocessor, an Application Specific Integrated Circuit (ASIC), a Graphics Processing Unit (GPU), or one or more Integrated circuits.
The processor 202 may also be an integrated circuit chip with signal processing capability. During implementation, the method of the present application may be completed by hardware integrated logic circuits in the processor 202 or by instructions in the form of software. The processor 202 may also be a general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The software module may be located in a storage medium well known in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory or an electrically erasable programmable memory, or a register. The storage medium is located in the memory 201, and the processor 202 reads the information in the memory 201 and completes the method of the embodiments of the present application in combination with its hardware.
The communication interface 203 enables communication between the computing device 200 and other devices or communication networks using transceiver modules such as, but not limited to, transceivers. For example, the data set may be acquired through the communication interface 203.
When computing device 200 includes bus 204, as described above, bus 204 may include a pathway to transfer information between various components of computing device 200 (e.g., memory 201, processor 202, communication interface 203).
According to an aspect of the present application, there is provided a computer storage medium having a computer program stored thereon, which, when executed by a computer, enables the computer to perform the method of the above-described method embodiments. In other words, the present application also provides a computer program product containing instructions, which when executed by a computer, cause the computer to execute the method of the above method embodiments.
According to another aspect of the application, a computer program product or computer program is provided, comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method of the above-described method embodiment.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed on a computer, cause the processes or functions described in accordance with the embodiments of the application to occur, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website site, computer, server, or data center to another website site, computer, server, or data center via wired (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that incorporates one or more of the available media. The usable medium may be a magnetic medium (e.g., floppy Disk, hard Disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
Those of ordinary skill in the art will appreciate that the various illustrative modules and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into modules is merely a division by logical function, and other divisions may be used in actual implementation; for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or modules, and may be electrical, mechanical, or in another form.
Modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical modules; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, the functional modules in the embodiments of the present application may be integrated into one processing module, each module may exist alone physically, or two or more modules may be integrated into one module.
The foregoing is only a specific embodiment of the present application, but the protection scope of the present application is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive of within the technical scope disclosed in the present application shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (11)

1. An image recognition method, comprising:
acquiring a first image to be processed;
inputting the first image into an optical character recognition network to obtain character information in the first image, wherein the optical character recognition network is trained by the aid of a semantic segmentation network, the semantic segmentation network is used for predicting position information of characters in an image, and the semantic segmentation network and the optical character recognition network share features extracted by an intermediate layer of the optical character recognition network.
2. The method of claim 1, further comprising:
acquiring a training image;
and performing end-to-end training on the optical character recognition network and the semantic segmentation network by using the training image, wherein the input of the semantic segmentation network is the output of an intermediate layer of the optical character recognition network.
3. The method of claim 2, wherein the training the optical character recognition network and the semantic segmentation network end-to-end using the training images comprises:
inputting the training image into the optical character recognition network to obtain a first characteristic diagram output by an intermediate layer of the optical character recognition network and character information in the training image predicted by the optical character recognition network;
inputting the first feature map into the semantic segmentation network to obtain position information of characters in the training image predicted by the semantic segmentation network;
and performing end-to-end training on the optical character recognition network and the semantic segmentation network according to the difference between the character information predicted by the optical character recognition network and the real information of the characters in the training image and the difference between the position information of the characters predicted by the semantic segmentation network and the real position information of the characters in the training image.
4. The method of claim 3, wherein the optical character recognition network comprises a convolutional neural network, and wherein the first feature map is a feature map of an output of the convolutional neural network.
5. The method of claim 4, wherein the semantic segmentation network comprises a deconvolution layer for converting the first feature map into a second feature map, and wherein a size of the second feature map is consistent with a size of the training image.
6. The method of claim 5, wherein the semantic segmentation network further comprises a character segmentation layer, and the character segmentation layer predicts position information of the characters in the training image based on the second feature map.
7. The method according to any one of claims 1-6, wherein the semantic segmentation network outputs a binary mask of the position of the text in the image.
8. The method of claim 4, wherein the optical character recognition network further comprises a text recognition module, an input of the text recognition module is connected to an output of the convolutional neural network, and the text recognition module is configured to recognize text in the training image according to the first feature map output by the convolutional neural network.
9. An image recognition apparatus, comprising:
a first acquisition unit configured to acquire a first image to be processed;
and the identification unit is used for inputting the first image into an optical character recognition network to obtain character information in the first image, wherein the optical character recognition network is trained by the aid of a semantic segmentation network, the semantic segmentation network is used for predicting position information of characters in an image, and the semantic segmentation network and the optical character recognition network share features extracted by an intermediate layer of the optical character recognition network.
10. A computing device, comprising: a memory, a processor;
the memory for storing a computer program;
the processor for executing the computer program to implement the image recognition method according to any one of the preceding claims 1 to 8.
11. A computer-readable storage medium having computer-executable instructions stored thereon, which when executed by a processor, are configured to implement the image recognition method according to any one of claims 1 to 8.
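To make the claimed arrangement easier to follow, the sketch below shows, in PyTorch-style Python, one possible realization of the network recited in claims 1 to 8: a convolutional backbone produces the first feature map, a text recognition module predicts character information from it, and an auxiliary semantic segmentation branch upsamples the same feature map with deconvolution layers to the input resolution and predicts a binary text-position mask, with both losses back-propagated end to end through the shared backbone. The backbone depth, channel counts, number of decoding steps, character vocabulary size, and the loss weight seg_weight are illustrative assumptions not specified by the patent, and the simple per-step classifier stands in for whatever recognition decoder (e.g., CTC- or attention-based) an actual implementation would use.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class OCRWithSegAux(nn.Module):
    """Optical character recognition network with an auxiliary semantic
    segmentation branch that shares the backbone (intermediate-layer) features."""

    def __init__(self, num_chars=37, hidden=256, steps=32):
        super().__init__()
        # Convolutional neural network; its output is the "first feature map"
        # shared by the recognition module and the segmentation branch.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, hidden, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # Text recognition module: collapses the feature map into a fixed
        # number of decoding steps and classifies a character at each step.
        self.pool = nn.AdaptiveAvgPool2d((1, steps))
        self.classifier = nn.Linear(hidden, num_chars)
        # Semantic segmentation branch: deconvolution layers restore the input
        # resolution ("second feature map"), then a 1x1 character segmentation
        # layer predicts a binary text/background mask.
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(hidden, 128, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.mask_head = nn.Conv2d(32, 1, kernel_size=1)

    def forward(self, images):
        feat = self.backbone(images)                       # first feature map
        seq = self.pool(feat).squeeze(2).permute(0, 2, 1)  # (N, steps, hidden)
        char_logits = self.classifier(seq)                 # character predictions
        mask_logits = self.mask_head(self.deconv(feat))    # text-position mask
        return char_logits, mask_logits


def train_step(model, optimizer, images, char_targets, mask_targets, seg_weight=1.0):
    """One end-to-end training step: both losses are computed in a single
    forward pass and back-propagated through the shared backbone together."""
    model.train()
    char_logits, mask_logits = model(images)
    # Difference between predicted character information and ground-truth text.
    rec_loss = F.cross_entropy(char_logits.flatten(0, 1), char_targets.flatten())
    # Difference between the predicted binary mask and the ground-truth
    # text-position mask.
    seg_loss = F.binary_cross_entropy_with_logits(mask_logits, mask_targets)
    loss = rec_loss + seg_weight * seg_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Example usage (shapes only, with hypothetical sizes):
#   model = OCRWithSegAux()
#   optim = torch.optim.Adam(model.parameters(), lr=1e-4)
#   imgs = torch.randn(2, 3, 64, 256)          # batch of training images
#   chars = torch.randint(0, 37, (2, 32))      # per-step character labels
#   masks = torch.rand(2, 1, 64, 256).round()  # binary text-position masks
#   train_step(model, optim, imgs, chars, masks)
```

At inference time only the recognition output would be used; the segmentation branch serves purely as an auxiliary training signal that pushes the shared backbone features toward text regions, which is the effect attributed to the semantic segmentation network in claim 1.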
CN202110459094.9A 2021-04-27 2021-04-27 Image recognition method and device and computing equipment Pending CN113159053A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110459094.9A CN113159053A (en) 2021-04-27 2021-04-27 Image recognition method and device and computing equipment

Publications (1)

Publication Number Publication Date
CN113159053A true CN113159053A (en) 2021-07-23

Family

ID=76871366

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110459094.9A Pending CN113159053A (en) 2021-04-27 2021-04-27 Image recognition method and device and computing equipment

Country Status (1)

Country Link
CN (1) CN113159053A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108549893A (en) * 2018-04-04 2018-09-18 华中科技大学 A kind of end-to-end recognition methods of the scene text of arbitrary shape
EP3570212A1 (en) * 2018-05-17 2019-11-20 Idemia Identity & Security France Character recognition method
US20200082218A1 (en) * 2018-09-06 2020-03-12 Sap Se Optical character recognition using end-to-end deep learning
CN110598703A (en) * 2019-09-24 2019-12-20 深圳大学 OCR (optical character recognition) method and device based on deep neural network
CN110765966A (en) * 2019-10-30 2020-02-07 哈尔滨工业大学 One-stage automatic recognition and translation method for handwritten characters
CN111626293A (en) * 2020-05-21 2020-09-04 咪咕文化科技有限公司 Image text recognition method and device, electronic equipment and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Jinrong Li et al., "A New Parallel Detection-Recognition Approach for End-to-End Scene Text Extraction", 2019 International Conference on Document Analysis and Recognition (ICDAR), pages 1359-1364 *
Lei Kang et al., "Pay Attention to What You Read: Non-recurrent Handwritten Text-Line Recognition", arXiv:2005.13044v1 [cs.CV], pages 1-11 *
Shi Zhenjun, "Research on Object Detection and Semantic Segmentation Algorithms in Driver Assistance Systems", China Master's Theses Full-text Database, Engineering Science and Technology II, no. 04, pages 14-15 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113792741A (en) * 2021-09-17 2021-12-14 平安普惠企业管理有限公司 Character recognition method, device, equipment and storage medium
CN113792741B (en) * 2021-09-17 2023-08-11 平安普惠企业管理有限公司 Character recognition method, device, equipment and storage medium
CN114186564A (en) * 2021-11-05 2022-03-15 北京百度网讯科技有限公司 Pre-training method and device of semantic representation model and electronic equipment
CN114186564B (en) * 2021-11-05 2023-11-24 北京百度网讯科技有限公司 Pre-training method and device for semantic representation model and electronic equipment

Similar Documents

Publication Publication Date Title
Li et al. Segmenting objects in day and night: Edge-conditioned CNN for thermal image semantic segmentation
CN109558832B (en) Human body posture detection method, device, equipment and storage medium
CN110020620B (en) Face recognition method, device and equipment under large posture
CN111709409B (en) Face living body detection method, device, equipment and medium
CN108804530B (en) Subtitling areas of an image
CN109522942B (en) Image classification method and device, terminal equipment and storage medium
Zhou et al. Salient object detection in stereoscopic 3D images using a deep convolutional residual autoencoder
Xiao et al. Heterogeneous knowledge distillation for simultaneous infrared-visible image fusion and super-resolution
CN113297975A (en) Method and device for identifying table structure, storage medium and electronic equipment
CN111738169B (en) Handwriting formula recognition method based on end-to-end network model
CN113836992B (en) Label identification method, label identification model training method, device and equipment
CN114119975A (en) Language-guided cross-modal instance segmentation method
Das et al. Automated Indian sign language recognition system by fusing deep and handcrafted feature
CN113159053A (en) Image recognition method and device and computing equipment
CN114419351A (en) Image-text pre-training model training method and device and image-text prediction model training method and device
CN114612902A (en) Image semantic segmentation method, device, equipment, storage medium and program product
CN116975350A (en) Image-text retrieval method, device, equipment and storage medium
CN114821736A (en) Multi-modal face recognition method, device, equipment and medium based on contrast learning
CN117315244A (en) Multi-scale feature fused medical image segmentation method, device and storage medium
CN117036658A (en) Image processing method and related equipment
Akagündüz Shape recognition using orientational and morphological scale‐spaces of curvatures
Wang et al. A fixed-point rotation-based feature selection method for micro-expression recognition
Bai et al. Parallel global convolutional network for semantic image segmentation
CN117173731B (en) Model training method, image processing method and related device
DAS et al. Occlusion Robust Sign Language Recognition System for Indian Sign Language Using CNN and Pose Features

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination