CN110232417B - Image recognition method and device, computer equipment and computer readable storage medium - Google Patents
- Publication number
- CN110232417B · Application CN201910523751.4A
- Authority
- CN
- China
- Prior art keywords
- image
- feature
- decoding
- feature map
- sequence
- Prior art date
- Legal status
- Active
Classifications
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06N3/045—Combinations of networks
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
- G06T9/002—Image coding using neural networks
- G06V40/1318—Fingerprints or palmprints; Sensors therefor using electro-optical elements or layers, e.g. electroluminescent sensing
- G06V40/1347—Fingerprints or palmprints; Preprocessing; Feature extraction
- G06V40/1365—Fingerprints or palmprints; Matching; Classification
- G06V30/10—Character recognition
Abstract
The invention discloses an image recognition method and apparatus, a computer device, and a computer-readable storage medium, belonging to the field of image technologies. The method extracts features from an image to be recognized to generate a first feature map, and decodes the first feature map based on the importance of the feature points in its sub-maps. During decoding, when the feature point of greatest importance in any sub-map and the feature point of greatest importance in the previous sub-map occupy the same position relative to the first feature map, it can be determined that the valid information contained in the image, such as characters, has been fully decoded; the computer device then terminates the decoding process and outputs the valid information, such as characters, according to the decoding result obtained. Because this recognition approach can judge during decoding whether the valid information in the image has been fully decoded, the decoding process can be terminated early, which reduces the amount of computation and improves image recognition efficiency.
Description
Technical Field
The present invention relates to the field of image technologies, and in particular, to an image recognition method, an image recognition apparatus, a computer device, and a computer-readable storage medium.
Background
With the development of machine learning technology, a computer device can recognize information such as characters contained in an image based on a deep neural network. At present, in an image recognition task, an image recognition model constructed on a deep neural network is generally used to extract features from the image to be recognized, obtain a feature map of the image, and decode the feature map to obtain the characters and other information contained in the image.
However, in this approach the image recognition model must decode every region of the image, including regions that contain no valid information such as characters. This increases the amount of computation required for recognition, lengthens the time it takes, and reduces recognition efficiency.
Disclosure of Invention
The embodiments of the present invention provide an image recognition method and apparatus, a computer device, and a computer-readable storage medium, which can solve the problem of low image recognition efficiency in the related art. The technical solution is as follows:
in one aspect, an image recognition method is provided, and the method includes:
acquiring an image to be identified;
inputting the image into an image recognition model, performing feature extraction on the image by the image recognition model to obtain a first feature map, decoding the first feature map based on the importance of the feature points in the first feature map, ending the decoding when it is detected during decoding that the feature point of greatest importance in any sub-map and the feature point of greatest importance in the previous sub-map occupy the same position in the first feature map, and outputting the feature vectors obtained by decoding;
and decoding the feature vector output by the image recognition model to obtain character information contained in the image.
In one aspect, an image recognition apparatus is provided, the apparatus including:
the acquisition module is used for acquiring an image to be identified;
the output module is used for inputting the image into an image recognition model, performing feature extraction on the image by the image recognition model to obtain a first feature map, decoding the first feature map based on the importance of the feature points in the first feature map, ending the decoding when it is detected during decoding that the feature point of greatest importance in any sub-map and the feature point of greatest importance in the previous sub-map occupy the same position in the first feature map, and outputting the feature vectors obtained by decoding;
and the decoding module is used for decoding the characteristic vector output by the image recognition model to obtain the character information contained in the image.
In one aspect, a computer device is provided that includes one or more processors and one or more memories having at least one program code stored therein, the at least one program code being loaded into and executed by the one or more processors to perform operations performed by the image recognition method.
In one aspect, a computer-readable storage medium having at least one program code stored therein is provided, the at least one program code being loaded into and executed by a processor to perform operations performed by the image recognition method.
According to the technical solution provided by the embodiments of the present invention, a first feature map is generated by extracting features from the image to be recognized, and the first feature map is decoded based on the importance of the feature points in its sub-maps. During decoding, when the feature point of greatest importance in any sub-map and the feature point of greatest importance in the previous sub-map occupy the same position relative to the first feature map, it can be determined that the valid information contained in the image, such as characters, has been fully decoded; the computer device then terminates the decoding process and outputs that information according to the decoding result obtained. Because this recognition approach can judge during decoding whether the valid information in the image has been fully decoded, the decoding process can be terminated early, which reduces the amount of computation and improves image recognition efficiency.
Drawings
In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. Evidently, the drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a block diagram of an image recognition system according to an embodiment of the present invention;
FIG. 2 is a flow chart of an image recognition method according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a long short-term memory network according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of an encoder according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating an encoding embedding method according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a decoder according to an embodiment of the present invention;
fig. 7 is a schematic diagram of a truncated decoding process provided in an embodiment of the present invention;
FIG. 8 is a diagram illustrating an image recognition result according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of a constructed sample image provided by embodiments of the present invention;
fig. 10 is a schematic structural diagram of an image recognition apparatus according to an embodiment of the present invention;
fig. 11 is a schematic structural diagram of a terminal according to an embodiment of the present invention;
fig. 12 is a schematic structural diagram of a server according to an embodiment of the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings.
In order to facilitate understanding of the technical process of the embodiment of the present invention, some terms related to the embodiment of the present invention are explained below:
attention Mechanism (Attention Mechanism): the method is a means for rapidly screening high-value information from a large amount of information by using limited attention resources. The visual attention mechanism is a brain signal processing mechanism unique to human vision. Human vision obtains a target area needing important attention, namely a focus of attention in general, by rapidly scanning a global image, and then puts more attention resources into the area to obtain more detailed information of the target needing attention, and suppresses other useless information. The attention mechanism is widely used in various deep learning tasks such as natural language processing, image recognition and speech recognition, and is one of the most important core techniques for deep learning techniques to pay attention and understand deeply.
In summary, the attention mechanism has two main aspects: first, deciding which parts of the input need attention; second, allocating limited information-processing resources to the important parts. The attention mechanism in deep learning is similar in nature to the selective visual attention mechanism of human beings; its core goal is to select, from a large amount of information, the information most critical to the current task.
Feature map: a numerical matrix used to represent the features of an image. In the feature extraction process, the computer device can convolve the image through at least one convolutional layer in a convolutional neural network; each convolutional layer outputs a convolution result, which serves as a feature map of the image. In the embodiments of the present invention, the feature map output by the last convolutional layer in the convolutional neural network is used as the first feature map of the image.
Sub-map: a set of feature points in the first feature map. When decoding the first feature map, the computer device may scan its regions in sequence and treat the set of feature points contained in each region as one sub-map of the first feature map; according to the scanning order, the sub-maps obtained by two adjacent scans are referred to as a sub-map and its previous sub-map, respectively.
Fig. 1 is a block diagram of an image recognition system according to an embodiment of the present invention. The image recognition system 100 includes: a terminal 110 and an image recognition platform 140.
The terminal 110 is connected to the image recognition platform 140 through a wireless network or a wired network. The terminal 110 may be at least one of a smartphone, a game console, a desktop computer, a tablet computer, an e-book reader, an MP3 player, an MP4 player, and a laptop portable computer. The terminal 110 is installed and operated with an application program supporting image recognition. The application may be a character recognition type application or the like. Illustratively, the terminal 110 is a terminal used by a user, and an application running in the terminal 110 has a user account registered therein.
The image recognition platform 140 includes at least one of a server, a plurality of servers, a cloud computing platform, and a virtualization center. The image recognition platform 140 is used to provide background services for applications that support image recognition. Alternatively, the image recognition platform 140 undertakes primary recognition work and the terminal 110 undertakes secondary recognition work; or, the image recognition platform 140 undertakes the secondary recognition work, and the terminal 110 undertakes the primary recognition work; alternatively, the image recognition platform 140 or the terminal 110 may be responsible for the recognition work separately.
Optionally, the image recognition platform 140 comprises: an access server, an image recognition server, and a database. The access server is used to provide access services for the terminal 110. The image recognition server is used to provide background services related to image recognition. There may be one or more image recognition servers; when there are multiple, at least two of them provide different services, and/or at least two of them provide the same service, for example in a load-balanced manner, which is not limited by the embodiment of the present application. The image recognition server may be provided with an image recognition model, and during model training and application the server may carry a Graphics Processing Unit (GPU) and support multithreaded parallel computation on the GPU. In the embodiment of the present application, the image recognition model is a recognition model constructed based on an attention mechanism.
The terminal 110 may be generally referred to as one of a plurality of terminals, and the embodiment is only illustrated by the terminal 110.
Those skilled in the art will appreciate that the number of terminals may be greater or less. For example, the number of the terminals may be only one, or several tens or hundreds of the terminals, or a larger number, and in this case, the image recognition system further includes other terminals. The number of terminals and the type of the device are not limited in the embodiments of the present application.
Fig. 2 is a flowchart of an image recognition method according to an embodiment of the present invention. The method may be applied to the terminal or the server, and both may be regarded as a computer device; therefore, the embodiment of the present invention is described with the computer device as the execution subject. Referring to fig. 2, the embodiment may specifically include the following steps:
201. The computer device obtains an image to be recognized.
The image to be recognized may include at least one character, such as a mathematical formula or text. It may be one of a group of images stored in the computer device, an image captured from a video by the computer device, or an image acquired in real time by a computer device having an image acquisition function.
202. The computer device inputs the image into an image recognition model.
The image recognition model is used for recognizing character information contained in an image, and the image recognition model may be a model designed based on a deep Neural Network, for example, the deep Neural Network may be RNN (Recurrent Neural Network), CNN (Convolutional Neural Network), or the like.
The computer device can input an image of any size into the image recognition model, or first adjust the image to a target size and then input it. In one possible implementation, before inputting the image into the image recognition model, the computer device may scale the image to the target size as needed. In another possible implementation, the widths and heights of the sample images may be counted during the training of the image recognition model, and either the most frequent width and height values or the averages of all counted width and height values may be used as the target size of the model's input images.
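As an illustrative sketch of the statistic just described (not part of the patent), the following Python snippet derives a target input size from the training set, either as the most frequent width and height values or as their means; the function and parameter names are hypothetical.

```python
from collections import Counter

def target_size(sample_sizes, use_mode=True):
    """sample_sizes: list of (width, height) pairs from the training set."""
    widths, heights = zip(*sample_sizes)
    if use_mode:
        w = Counter(widths).most_common(1)[0][0]   # most frequent width
        h = Counter(heights).most_common(1)[0][0]  # most frequent height
    else:
        w = round(sum(widths) / len(widths))       # mean width
        h = round(sum(heights) / len(heights))     # mean height
    return w, h

print(target_size([(120, 32), (128, 32), (128, 48)]))  # -> (128, 32)
```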
After the computer device inputs the image into the image recognition model, the model can preprocess the input image and convert it into a numerical matrix consisting of a plurality of pixel values, so that the computer device can carry out the subsequent operations.
203. The computer equipment performs feature extraction on the image through the image recognition model to obtain a first feature map.
In an embodiment of the present invention, the computer device may perform feature extraction on the image through one or more convolutional layers in the image recognition model to generate the first feature map. In one possible implementation, the image recognition model includes a plurality of convolutional layers: first, the computer device convolves the numerical matrix corresponding to the image with one convolutional layer to extract image features and takes that layer's convolution result as a feature map of the image; it then feeds the feature map into the next convolutional layer to continue the convolution; finally, the computer device generates the first feature map from the feature map output by the last convolutional layer.
Specifically, taking one convolutional layer as an example, a convolutional layer may include at least one convolution kernel, and each convolution kernel corresponds to a scanning window of the same size as the kernel. During the convolution operation, the scanning window slides over the feature map by a target step size, which can be set by the developer, and scans each region of the feature map in turn. Taking one convolution kernel as an example, when its scanning window slides to any region of the feature map, the computer device reads the value of each feature point in that region, multiplies it element-wise with the corresponding kernel value, accumulates the products, and takes the accumulated result as one output feature point. The scanning window then slides to the next region of the feature map by the target step size and the convolution is performed again, outputting another feature point, until all regions of the feature map have been scanned; all the output feature points are combined into a new feature map that serves as the input of the next convolutional layer.
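The scanning-window operation just described can be illustrated with a minimal numpy sketch; this is a didactic single-kernel version under our own naming, not the patent's implementation.

```python
import numpy as np

def conv2d_single(feature_map, kernel, stride=1):
    """Slide `kernel` over `feature_map` by `stride`; each stop yields one point."""
    kh, kw = kernel.shape
    fh, fw = feature_map.shape
    out = np.zeros(((fh - kh) // stride + 1, (fw - kw) // stride + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            region = feature_map[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(region * kernel)  # element-wise product, then accumulate
    return out

fmap = np.arange(16, dtype=float).reshape(4, 4)
print(conv2d_single(fmap, np.ones((3, 3))))  # 2x2 output feature map
```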
In the embodiment of the present invention, in order to achieve the best recognition effect of the image recognition model, the number of convolutional layers can be set to 5, the numbers of convolution kernels in the convolutional layers are 64, 128, 256, and 512 in sequence, and the size of each convolution kernel is 3×3.
In order to make the input data distribution of each convolutional layer more reasonable, improve the generalization performance of the image recognition model, and prevent overfitting, the computer device can also optimize the feature map output by each convolutional layer. In one possible implementation, each convolutional layer is followed in sequence by a batch processing unit, a pooling unit, and a linear rectification unit. The batch processing unit performs batch normalization on the feature map output by the convolutional layer so that it has a mean of 0 and a variance of 1 in each dimension, optimizing the distribution of the feature values. The pooling unit may include at least one pooling layer, each of which scans the regions of the feature map through a scanning window with a target step size and average-pools the feature values in each region, reducing the dimensionality of the feature map. The linear rectification unit may include an activation function used to apply a nonlinear transformation to the feature map. The scanning window and target step size of each pooling layer can be set by the developer.
In order to obtain a finer-grained feature map and improve image recognition accuracy, in one possible implementation, different pooling layers may be applied to images with different aspect ratios. Specifically, the computer device may adjust the sliding step size of the scanning window in the last pooling layer, reducing the step size along the longer side. In the embodiment of the present invention, when the number of convolutional layers is set to 5, the (row, column) step sizes of the pooling layers may be set as follows: the first pooling layer to (2, 2), the second to (2, 2), the third to (1, 2), and the fourth to (2, 1); for the fifth pooling layer, when the length of the image is greater than the width, the row step size may be set to 1, and when the length of the image is less than the width, the row step size may be set to 2 and the column step size to 1.
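A hedged PyTorch sketch of the configuration described in the preceding paragraphs follows. The patent lists only four kernel counts (64, 128, 256, 512) for five layers and leaves the fifth layer's column stride unstated when length exceeds width, so the repeated 512, that column stride, the input channel count, and the unit ordering (batch normalization, pooling, then activation, as the text reads) are assumptions, not the patented design.

```python
import torch.nn as nn

def make_extractor(length_greater_than_width: bool) -> nn.Sequential:
    channels = [1, 64, 128, 256, 512, 512]        # fifth kernel count assumed
    strides = [(2, 2), (2, 2), (1, 2), (2, 1),    # (row, col) pooling strides
               (1, 2) if length_greater_than_width else (2, 1)]  # col=2 assumed
    layers = []
    for i, stride in enumerate(strides):
        layers += [
            nn.Conv2d(channels[i], channels[i + 1], kernel_size=3, padding=1),
            nn.BatchNorm2d(channels[i + 1]),      # batch processing unit
            nn.AvgPool2d(kernel_size=2, stride=stride, ceil_mode=True),  # pooling unit
            nn.ReLU(inplace=True),                # linear rectification unit
        ]
    return nn.Sequential(*layers)
```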
204. The computer equipment decodes the first feature map according to the importance degree of the feature points in the first feature map.
The computer device can construct an encoder and a decoder based on a long short-term memory (LSTM) network to decode the first feature map; such a network can memorize the input information acquired at each step, store it in the network, and apply it to the current operation. Referring to fig. 3, which is a schematic structural diagram of a long short-term memory network according to an embodiment of the present invention, fig. 3 (a) shows the structure of the network and fig. 3 (b) shows the same network unrolled over time. The network may include an input unit 301, a hidden layer unit 302, and an output unit 303. The input sequence of the input unit may be denoted $\{x_0, x_1, \ldots, x_{t-1}, x_t\}$, the operation results of the hidden layer unit may be denoted $\{h_0, h_1, \ldots, h_{t-1}, h_t\}$, and the output results of the output unit may be denoted $\{y_0, y_1, \ldots, y_{t-1}, y_t\}$, where t is an integer greater than or equal to 0. As shown in (b), an input unit, an output unit, and at least one hidden layer unit may form a node 304, and the operation result of the hidden layer unit in node 304 may be passed to the hidden layer unit of the next node 305, so that the hidden layer unit of node 305 can operate based on the preceding input sequence.
The decoding of the first feature map by the computer device may specifically include the following steps one to three:

Step one, the computer device obtains a plurality of first sequences of the first feature map, where each first sequence represents the feature information of one sub-map of the first feature map together with the sub-maps that precede and follow it in scanning order.

The computer device may scan the regions of the first feature map in turn, taking the set of feature points contained in each region as a sub-map of the first feature map. In one possible implementation, the computer device inputs each sub-map of the first feature map into an encoder in sequence, where the encoder includes at least one first hidden layer unit; each first hidden layer unit performs a weighted operation on the received sub-map and the first sequence output by the previous first hidden layer unit to obtain a first sequence.
The encoder may include at least one bidirectional long short-term memory network, each of which may include a plurality of nodes, and each node may include at least one first hidden layer unit. The number of nodes of the bidirectional network can be set by the developer; in the embodiment of the present invention it equals the number of pixel values in the numerical matrix corresponding to the image to be recognized. Each bidirectional network performs a forward pass and a backward pass simultaneously. In the forward pass, a hidden layer unit performs a weighted operation on the currently input sub-map and the first sequence output by the previous hidden layer unit to generate a first sequence, so that the encoder fully considers the content of the earlier part of the first feature map; in the backward pass, a hidden layer unit performs a weighted operation on the currently input sub-map and the first sequence output by the next hidden layer unit to generate a first sequence, so that the encoder fully considers the content of the later part of the first feature map.
Referring to fig. 4, fig. 4 is a schematic structural diagram of an encoder according to an embodiment of the present invention. The encoder 400 includes a bidirectional long short-term memory network, whose operation is described using its nodes 401, 402, and 403 as an example. In the forward pass, the first hidden layer unit of node 402 generates the first sequence $h_t$ of node 402 based on the input sequence $x_t$ and the first sequence $h_{t-1}$ of the first hidden layer unit of the previous node 401, and feeds $h_t$ into the first hidden layer unit of the next node 403. In the backward pass, the first hidden layer unit of node 402 generates the first sequence $h_t$ of node 402 based on the input sequence $x_t$ and the first sequence $h_{t+1}$ of the first hidden layer unit of the following node 403, and feeds $h_t$ into the first hidden layer unit of the previous node 401.
In order to improve the accuracy of image recognition, the first hidden layer unit may further include multiple sub hidden layer units, and the multiple sub hidden layer units may be configured to perform a weighting operation on the content input by the first hidden layer unit. The number of the sub hidden layer units can be set by developers, and in the embodiment of the invention, in order to achieve the optimal recognition effect of the image recognition model, the number of the sub hidden layer units can be set to be 18.
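A minimal PyTorch sketch of such an encoder follows, assuming each sub-map is flattened into a 512-dimensional vector; the input and hidden sizes are illustrative, and the `bidirectional` flag provides the forward and backward passes of fig. 4.

```python
import torch
import torch.nn as nn

# Bidirectional LSTM encoder; input/hidden sizes are assumptions for illustration.
encoder = nn.LSTM(input_size=512, hidden_size=256,
                  bidirectional=True, batch_first=True)

subgraphs = torch.randn(1, 40, 512)      # (batch, number of sub-maps, features)
first_sequences, _ = encoder(subgraphs)  # forward and backward halves concatenated
print(first_sequences.shape)             # torch.Size([1, 40, 512])
```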
The encoder may acquire context information of the image to be recognized and encode and embed the convolution features output by the convolution layer, for example, when identifying an equation in the image to be recognized, the encoder may acquire structural features of the equation from left to right and from top to bottom, and encode the image to be recognized based on the structural features of the equation. Referring to fig. 5, fig. 5 is a schematic diagram of an encoding embedding method according to an embodiment of the present invention, 501 is a schematic diagram of an encoder, and 502 is a schematic diagram of a convolution feature.
And step two, the computer device acquires a plurality of attention matrices based on the plurality of first sequences, where each attention matrix represents the importance of the corresponding sub-map to the first feature map.
In one possible implementation, the computer device may decode the plurality of first sequences through a decoder to obtain a plurality of attention matrices, which may specifically include the following steps:

First, the computer device may input all the first sequences into a decoder, which includes at least one second hidden layer unit. The decoder may include at least one unidirectional long short-term memory network; each such network may include a plurality of nodes, each node may include at least one second hidden layer unit, and the number of nodes can be set by the developer.
Then, each second hidden layer unit compares the second sequence output by the previous second hidden layer unit against all the first sequences for similarity, obtaining a new second sequence; a group of elements in the second sequence indicates the similarity between the previous second hidden layer unit's second sequence and one first sequence, and the greater the similarity, the larger the values of that group of elements.
Finally, the computer device performs a weighted operation on the plurality of second sequences and all the first sequences to generate a plurality of attention matrices.
Specifically, the generation of the attention matrix is described with reference to fig. 6, which is a schematic structural diagram of a decoder according to an embodiment of the present invention. The decoder 600 includes a unidirectional long short-term memory network, and the acquisition of the attention matrix is described using its node 601 as an example. The decoder 600 acquires all the first sequences generated by the first hidden layer units of the encoder 400 in step one, and compares the second sequence $s_{i-1}$, generated by the second hidden layer unit of the node immediately preceding node 601, against all the first sequences. In one possible implementation, the computer device may obtain the similarity matrix $e_i$ between the second sequence and all the first sequences through an alignment model, which can be expressed as formula (1):

$$e_{ij} = a(s_{i-1}, h_j) \tag{1}$$

where $e_{ij}$ denotes the similarity between the second sequence $s_{i-1}$ generated by the previous node and the first sequence $h_j$ generated by the encoder, the nonlinear function $a$ is the alignment model, and i and j are both integers greater than 0.
After the computer device acquires the similarity matrix, the similarity matrix may be normalized by a softmax (normalized exponential) function to generate an attention weight matrix $\alpha$, which may be expressed as formula (2):

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{K} \exp(e_{ik})} \tag{2}$$

where $e_{ij}$ denotes the similarity matrix, $\exp(\cdot)$ denotes the exponential operation, K denotes the number of similarity values, and K is an integer greater than 0.
The attention weight matrix $\alpha$ may be used to indicate the importance of each sub-map of the first feature map in the current decoding step. The computer device performs a weighted operation on the attention weight matrix $\alpha$ and all the first sequences to generate the attention matrix $c$, which may be expressed as formula (3):

$$c_i = \sum_{j=1}^{T} \alpha_{ij} h_j \tag{3}$$

where $\alpha_{ij}$ is the attention weight matrix, $h_j$ is a first sequence, T is the number of first sequences, i, j, and T are integers greater than 0, and j ≤ T.
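Formulas (1) to (3) can be traced in a short numpy sketch. The alignment model below is an additive form with randomly initialized parameters W_s, W_h, and v; the patent does not specify the alignment model's internals, so those parameters and all sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 40, 512                           # number / dimension of first sequences
h = rng.normal(size=(T, d))              # all first sequences h_1..h_T
s_prev = rng.normal(size=d)              # previous second sequence s_{i-1}
W_s, W_h = rng.normal(size=(d, d)), rng.normal(size=(d, d))
v = rng.normal(size=d)                   # alignment parameters (assumed)

e = np.tanh(s_prev @ W_s + h @ W_h) @ v            # (1): e_ij = a(s_{i-1}, h_j)
alpha = np.exp(e - e.max()); alpha /= alpha.sum()  # (2): softmax over similarities
c = alpha @ h                                      # (3): c_i = sum_j alpha_ij * h_j
print(alpha.shape, c.shape)                        # (40,) (512,)
```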
Step three, the computer device decodes the first feature map based on the plurality of attention matrices.
In the decoding process, a scanning window can be slid on the first feature map, the region determined every time the scanning window is slid can be referred to as a sub-map of the first feature map, and for each sub-map, a decoder performs a weighting operation based on each feature point in the sub-map and each element in the attention matrix corresponding to the sub-map to obtain a decoding result, that is, a feature vector.
205. When the computer device detects during decoding that the feature point of greatest importance in any sub-map and the feature point of greatest importance in the previous sub-map occupy the same position in the first feature map, it ends the decoding and outputs the feature vectors obtained by decoding.

The computer device can judge whether all the valid information in the image has been recognized based on the position, in the first feature map, of the feature point of greatest importance in each sub-map: when that position is the same for any sub-map and its previous sub-map, the computer device determines that the valid information in the image has been fully recognized and ends the decoding.
In one possible implementation, the computer device generates the second sequence $s_i$ based on the attention matrix $c_i$ obtained in step 204, the decoder output sequence $y_{i-1}$, and the second sequence $s_{i-1}$, and generates the output sequence $y_i$ based on $s_i$. Determining, based on the second sequence $s_i$, the position of the feature point of greatest importance in the sub-map specifically includes the following steps:

Step one, the computer device acquires the position, relative to the first feature map, of the maximum element in the attention matrix of the sub-map.

Step two, when the maximum elements in the attention matrices of any sub-map and of the previous sub-map occupy the same position relative to the first feature map, the computer device determines that the image decoding is complete and ends the decoding.
Referring to fig. 7, fig. 7 is a schematic diagram of a truncated decoding process according to an embodiment of the present invention. When recognizing equation 700, the computer device may sequentially identify regions 701, 702, and 703 in the image, as shown in (a), (b), and (c) of fig. 7. When identifying region 703, the computer device can determine through the attention mechanism, based on the structural features of the equation, that the valid information in the image has been fully recognized, thereby truncating the decoding process early, avoiding the decoding of invalid information, and improving decoding efficiency.
In order to improve the accuracy of the image recognition result, in one possible implementation, the computer device may take the position, relative to the first feature map, of the maximum element in the attention matrix of one sub-map as a first position, and the positions, relative to the first feature map, of the maximum elements in the attention matrices of the previous N sub-maps as a group of second positions; when the first position is the same as every second position, the computer device determines that the image decoding is complete and ends the decoding. N is an integer greater than 1, and its specific value can be set by the developer.
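A minimal sketch of this truncation criterion follows, assuming the attention weight matrices produced so far are kept in a list; the helper name and the default N are ours, not the patent's.

```python
import numpy as np

def should_stop(attention_history, N=2):
    """attention_history: attention weight matrices so far, oldest first."""
    if len(attention_history) < N + 1:
        return False
    # Peak position = position of the feature point of greatest importance.
    peaks = [int(np.argmax(a)) for a in attention_history[-(N + 1):]]
    return len(set(peaks)) == 1  # unchanged for the current and previous N sub-maps
```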
206. The computer device decodes the feature vectors output by the image recognition model to obtain the character information contained in the image.
The computer device may obtain the decoder output sequence $y_i$ based on the second sequence $s_i$, and take the output sequence $y_i$ as a feature vector of the image.
The computer device compares the similarity between each of the feature vectors and a standard vector set, determines the standard vector most similar to each feature vector, and takes the characters indicated by those standard vectors as the characters contained in the image. The standard vector set includes the feature vector corresponding to each character in the character list. In one possible implementation, the computer device may compute the distance between each feature vector and every vector in the standard vector set, take the character indicated by the standard vector closest to the feature vector as that vector's decoding result, and take the decoding results of all the feature vectors as the character information contained in the image.
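As a hedged sketch of this lookup, each feature vector is matched to the closest entry in the standard vector set and the matched characters are concatenated; the patent says only "distance", so the Euclidean distance used here is an assumption.

```python
import numpy as np

def vectors_to_text(feature_vectors, standard_vectors, charlist):
    """standard_vectors: one row per character in `charlist`."""
    chars = []
    for fv in feature_vectors:
        dists = np.linalg.norm(standard_vectors - fv, axis=1)  # distance to each entry
        chars.append(charlist[int(np.argmin(dists))])          # smallest distance wins
    return "".join(chars)
```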
Referring to fig. 8, fig. 8 is a schematic diagram of an image recognition result according to an embodiment of the present invention, where (a) in fig. 8 is a schematic diagram of an image input by a user, and (b) is a schematic diagram of an image recognition result output by a computer device, the user may input the image to be recognized as shown in (a) into the computer device, the computer device performs recognition through the image recognition process, and outputs the image recognition result as shown in (b).
According to the technical solution provided by the embodiments of the present invention, a first feature map is generated by extracting features from the image to be recognized, and the first feature map is decoded based on the importance of the feature points in its sub-maps. During decoding, when the feature point of greatest importance in any sub-map and the feature point of greatest importance in the previous sub-map occupy the same position relative to the first feature map, it can be determined that the valid information contained in the image, such as characters, has been fully decoded; the computer device then terminates the decoding process and outputs that information according to the decoding result obtained. Because this recognition approach can judge during decoding whether the valid information in the image has been fully decoded, the decoding process can be terminated early, which reduces the amount of computation and improves image recognition efficiency. By applying the learning capability of a deep network and the localization capability of the attention mechanism, the technical solution can improve both the capability and the efficiency of a formula recognition network when applied to the field of formula recognition.
For example, when recognizing an equation in an image, whether recognition is complete is determined based on the structural features of the equation; when the attention stays at the lower-right corner of the image for a long time, the computer device can determine that the equation has been fully recognized and truncate the decoding early, avoiding the decoding of invalid information and improving the decoder's efficiency.
Of course, this image recognition method can also be fused with other image recognition methods, for example a single-character recognition method, with the final decoding result chosen from the different methods by a voting technique, so as to improve the recognition accuracy for the question.
The above embodiments mainly describe the image recognition process performed by a computer device. Before image recognition is performed, a training data set needs to be constructed to train the image recognition model; the training data set may include a plurality of labeled sample images. In practice, however, sample images are difficult to acquire and expensive to label, so only a small number can be collected, which hardly covers the common character distributions and cannot meet the model's training requirements. In one possible implementation, the computer device may construct sample images based on the image features learned by the image recognition model, which may specifically include the following steps:

Step one, the computer device constructs sample data based on the image features extracted by the image recognition model.
The computer device can train the image recognition model by using a training data set comprising real data, and can obtain image features of each image in the training data set by adjusting each parameter in the image recognition model in the training process, and the computer device constructs sample data based on the obtained image features. For example, when the training data set is a set of images containing an equation, after the image recognition model is trained through the training data set, structural features of the equation, which may include probabilities of numbers or operators appearing in the equation, may be obtained, and the computer device may construct sample data based on the obtained structural features.
Step two, the computer device renders the sample data to generate a sample image.
To optimize the effect of model training, in one possible implementation, after the computer device generates a sample image based on constructed sample data, the sample image may be transformed, for example, the computer device may perform processes such as text distortion, adding background noise, rotation, and font transformation on the sample image to enhance the diversity of images.
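A hedged Pillow/numpy sketch of two of the transformations mentioned above, rotation and background noise, follows; text distortion and font transformation need a text renderer and are omitted, and all parameter values are illustrative.

```python
import numpy as np
from PIL import Image

def augment(img: Image.Image, max_angle=5.0, noise_std=8.0) -> Image.Image:
    # Small random rotation; blank corners filled with white (assumes a
    # grayscale 'L' image).
    img = img.rotate(np.random.uniform(-max_angle, max_angle), fillcolor=255)
    arr = np.asarray(img, dtype=np.float32)
    arr += np.random.normal(0.0, noise_std, arr.shape)  # additive background noise
    return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))
```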
Referring to fig. 9, fig. 9 is a schematic diagram of a constructed sample image according to an embodiment of the present invention, where (a) the diagram in fig. 9 is a schematic diagram of performing font transformation on constructed sample data, and (b) the diagram is a schematic diagram of performing deformation on constructed sample data.
Fig. 10 is a schematic structural diagram of an image recognition apparatus according to an embodiment of the present invention, and referring to fig. 10, the apparatus includes:
an obtaining module 1001 configured to obtain an image to be identified;
an output module 1002, configured to input the image into an image recognition model, perform feature extraction on the image by the image recognition model to obtain a first feature map, decode the first feature map based on the importance of the feature points in the first feature map, end the decoding when it is detected during decoding that the feature point of greatest importance in any sub-map and the feature point of greatest importance in the previous sub-map occupy the same position in the first feature map, and output the feature vectors obtained by decoding;
the decoding module 1003 is configured to decode the feature vector output by the image recognition model to obtain character information included in the image.
In one possible implementation, the output module 1002 is configured to:
acquiring a plurality of first sequences of the first feature map, wherein each first sequence represents the feature information of one sub-map in the first feature map and of the sub-maps that precede and follow it in scanning order;
and acquiring a plurality of attention matrices based on the plurality of first sequences, wherein each attention matrix represents the importance of the corresponding sub-map to the first feature map.
In one possible implementation, the output module 1002 is configured to:
sequentially inputting each sub-map of the first feature map into an encoder, wherein the encoder comprises at least one first hidden layer unit;

for each first hidden layer unit, the first hidden layer unit performs a weighted operation on the received sub-map of the first feature map and the first sequence output by the previous first hidden layer unit to obtain a first sequence.
In one possible implementation, the output module 1002 is configured to:
inputting all the first sequences into a decoder, the decoder comprising at least one second hidden layer unit;
for each second hidden layer unit, the second hidden layer unit compares the second sequence output by the previous second hidden layer unit against all the first sequences for similarity to obtain a second sequence, wherein a group of elements in the second sequence indicates the similarity between the previous second hidden layer unit's second sequence and one first sequence, and the greater the similarity, the larger the values of that group of elements;
and performing a weighted operation on the plurality of second sequences and all the first sequences to generate a plurality of attention matrices.
In one possible implementation, the output module 1002 is configured to:
acquiring the position, relative to the first feature map, of the maximum element in the attention matrix of the sub-map;

and when the maximum elements in the attention matrices of any sub-map and of the previous sub-map occupy the same position relative to the first feature map, determining that the image decoding is complete, and ending the decoding.
In one possible implementation, the decoding module 1003 is configured to:
and comparing the similarity between each of the feature vectors and a standard vector set, determining the standard vector most similar to each feature vector, and taking the characters indicated by those standard vectors as the characters contained in the image.
All the above-mentioned optional technical solutions can be combined arbitrarily to form the optional embodiments of the present invention, and are not described herein again.
It should be noted that: in the image recognition apparatus provided in the above embodiment, only the division of the functional modules is illustrated when performing image recognition, and in practical applications, the functions may be distributed by different functional modules as needed, that is, the internal structure of the apparatus may be divided into different functional modules to complete all or part of the functions described above. In addition, the image recognition apparatus and the image recognition method provided by the above embodiments belong to the same concept, and specific implementation processes thereof are described in the method embodiments in detail and are not described herein again.
The computer device provided in the above technical solution may be implemented as a terminal or a server. For example, fig. 11 is a schematic structural diagram of a terminal provided in an embodiment of the present invention. The terminal 1100 may be: a smartphone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a notebook computer, or a desktop computer. Terminal 1100 may also be referred to by other names such as user equipment, portable terminal, laptop terminal, or desktop terminal.
In general, the terminal 1100 includes: one or more processors 1101 and one or more memories 1102.
In some embodiments, the terminal 1100 may further include: a peripheral interface 1103 and at least one peripheral. The processor 1101, memory 1102 and peripheral interface 1103 may be connected by a bus or signal lines. Various peripheral devices may be connected to the peripheral interface 1103 by buses, signal lines, or circuit boards. Specifically, the peripheral device includes: at least one of radio frequency circuitry 1104, display screen 1105, camera 1106, audio circuitry 1107, positioning component 1108, and power supply 1109.
The peripheral interface 1103 may be used to connect at least one peripheral associated with I/O (Input/Output) to the processor 1101 and the memory 1102. In some embodiments, the processor 1101, memory 1102, and peripheral interface 1103 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 1101, the memory 1102 and the peripheral device interface 1103 may be implemented on separate chips or circuit boards, which is not limited by this embodiment.
The Radio Frequency circuit 1104 is used to receive and transmit RF (Radio Frequency) signals, also known as electromagnetic signals. The radio frequency circuit 1104 communicates with communication networks and other communication devices via electromagnetic signals. The radio frequency circuit 1104 converts an electric signal into an electromagnetic signal to transmit, or converts a received electromagnetic signal into an electric signal. Optionally, the radio frequency circuit 1104 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 1104 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: metropolitan area networks, various generation mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the rf circuit 1104 may further include NFC (Near Field Communication) related circuit, which is not limited by the present invention.
The display screen 1105 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 1105 is a touch display screen, the display screen 1105 also has the ability to capture touch signals on or over the surface of the display screen 1105. The touch signal may be input to the processor 1101 as a control signal for processing. At this point, the display screen 1105 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, display 1105 may be one, providing the front panel of terminal 1100; in other embodiments, the display screens 1105 can be at least two, respectively disposed on different surfaces of the terminal 1100 or in a folded design; in still other embodiments, display 1105 can be a flexible display disposed on a curved surface or on a folded surface of terminal 1100. Even further, the display screen 1105 may be arranged in a non-rectangular irregular pattern, i.e., a shaped screen. The Display screen 1105 may be made of LCD (Liquid Crystal Display), OLED (Organic Light-Emitting Diode), and the like.
The audio circuitry 1107 may include a microphone and a speaker. The microphone is used for collecting sound waves of the user and the environment, converting them into electric signals, and inputting the electric signals to the processor 1101 for processing or to the radio frequency circuit 1104 for voice communication. For stereo capture or noise reduction purposes, multiple microphones may be provided, each at a different location of terminal 1100. The microphone may also be an array microphone or an omnidirectional acquisition microphone. The speaker converts electric signals from the processor 1101 or the radio frequency circuit 1104 into sound waves. The speaker may be a traditional thin-film speaker or a piezoelectric ceramic speaker. A piezoelectric ceramic speaker can convert an electric signal into sound waves audible to humans, or into sound waves inaudible to humans for purposes such as distance measurement. In some embodiments, the audio circuitry 1107 may also include a headphone jack.
Power supply 1109 is configured to supply power to the various components in terminal 1100. The power supply 1109 may be alternating current, direct current, a disposable battery, or a rechargeable battery. When the power supply 1109 includes a rechargeable battery, the battery may support wired or wireless charging, and may also support fast-charging technology.
In some embodiments, terminal 1100 can also include one or more sensors 1110. The one or more sensors 1110 include, but are not limited to: acceleration sensor 1111, gyro sensor 1112, pressure sensor 1113, fingerprint sensor 1114, optical sensor 1115, and proximity sensor 1116.
The acceleration sensor 1111 may detect the magnitudes of acceleration on the three coordinate axes of a coordinate system established with the terminal 1100. For example, the acceleration sensor 1111 may be configured to detect the components of the gravitational acceleration on the three coordinate axes. The processor 1101 may control the display screen 1105 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 1111. The acceleration sensor 1111 may also be used to collect motion data for games or users.
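The landscape/portrait decision can be illustrated with a minimal Python sketch; the axis convention, the comparison rule, and the function name choose_orientation are illustrative assumptions, not part of the patent.

```python
# Minimal sketch (illustrative, not from the patent): choosing a UI
# orientation from the gravity components reported by an accelerometer.

def choose_orientation(gx: float, gy: float) -> str:
    """Pick an orientation from the gravity components (m/s^2) along the
    device's x axis (short edge) and y axis (long edge)."""
    # Upright device: gravity acts mostly along the y axis (portrait).
    # Device on its side: gravity shifts to the x axis (landscape).
    return "portrait" if abs(gy) >= abs(gx) else "landscape"

print(choose_orientation(0.3, 9.7))  # portrait: device held upright
print(choose_orientation(9.6, 0.5))  # landscape: device on its side
```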
The gyro sensor 1112 may detect the body direction and rotation angle of the terminal 1100, and may cooperate with the acceleration sensor 1111 to collect the user's 3D actions on the terminal 1100. Based on the data collected by the gyro sensor 1112, the processor 1101 may implement the following functions: motion sensing (for example, changing the UI according to a tilting operation of the user), image stabilization during shooting, game control, and inertial navigation.
The pressure sensor 1113 may be disposed on a side frame of the terminal 1100 and/or under the display screen 1105. When the pressure sensor 1113 is disposed on the side frame of the terminal 1100, a holding signal of the user on the terminal 1100 can be detected, and the processor 1101 performs left/right-hand recognition or a shortcut operation according to the holding signal collected by the pressure sensor 1113. When the pressure sensor 1113 is disposed under the display screen 1105, the processor 1101 controls an operability control on the UI according to a pressure operation of the user on the display screen 1105. The operability control includes at least one of a button control, a scroll bar control, an icon control, and a menu control.
The fingerprint sensor 1114 is configured to collect a fingerprint of the user; the processor 1101 identifies the user according to the fingerprint collected by the fingerprint sensor 1114, or the fingerprint sensor 1114 identifies the user according to the collected fingerprint. Upon recognizing the user's identity as a trusted identity, the processor 1101 authorizes the user to perform relevant sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, making payments, changing settings, and the like. The fingerprint sensor 1114 may be disposed on the front, back, or side of the terminal 1100. When a physical button or a manufacturer logo is provided on the terminal 1100, the fingerprint sensor 1114 may be integrated with the physical button or the manufacturer logo.
Optical sensor 1115 is used to collect ambient light intensity. In one embodiment, the processor 1101 may control the display brightness of the display screen 1105 based on the ambient light intensity collected by the optical sensor 1115. Specifically, when the ambient light intensity is high, the display brightness of the display screen 1105 is increased; when the ambient light intensity is low, the display brightness of the display screen 1105 is reduced. In another embodiment, processor 1101 may also dynamically adjust the shooting parameters of camera assembly 1106 based on the ambient light intensity collected by optical sensor 1115.
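The brightness adjustment amounts to a monotone mapping from ambient light to display brightness; the sketch below uses linear interpolation with illustrative lux thresholds, neither of which is specified by the patent.

```python
# Minimal sketch (illustrative thresholds): mapping ambient light (lux)
# to a display brightness level in [0, 1].

def brightness_for(lux: float, lo: float = 10.0, hi: float = 1000.0) -> float:
    """Interpolate brightness between a dim floor and full brightness as
    the ambient light rises from `lo` to `hi` lux."""
    if lux <= lo:
        return 0.2  # dim floor in a dark environment
    if lux >= hi:
        return 1.0  # full brightness in strong light
    return 0.2 + 0.8 * (lux - lo) / (hi - lo)

print(brightness_for(5))    # 0.2
print(brightness_for(500))  # ~0.6
```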
The proximity sensor 1116, also referred to as a distance sensor, is typically disposed on the front panel of the terminal 1100. The proximity sensor 1116 is used to capture the distance between the user and the front face of the terminal 1100. In one embodiment, when the proximity sensor 1116 detects that the distance between the user and the front face of the terminal 1100 gradually decreases, the processor 1101 controls the display screen 1105 to switch from the screen-on state to the screen-off state; when the proximity sensor 1116 detects that the distance between the user and the front face of the terminal 1100 gradually increases, the processor 1101 controls the display screen 1105 to switch from the screen-off state to the screen-on state.
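This screen-state switching can be sketched as follows; the concrete distance thresholds, and the hysteresis band added here so that readings near the boundary do not toggle the screen repeatedly, are assumptions beyond what the patent states.

```python
# Minimal sketch (illustrative thresholds): toggling the screen state
# from proximity readings. A hysteresis band keeps readings near the
# boundary from flickering the screen on and off.

def screen_on(distance_cm: float, currently_on: bool,
              near: float = 3.0, far: float = 5.0) -> bool:
    """Return True if the screen should be on."""
    if distance_cm < near:
        return False         # user is close (e.g. during a call): screen off
    if distance_cm > far:
        return True          # user moved away: screen back on
    return currently_on      # inside the hysteresis band: keep current state

assert screen_on(2.0, True) is False
assert screen_on(8.0, False) is True
```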
Those skilled in the art will appreciate that the configuration shown in Fig. 11 does not constitute a limitation of the terminal 1100, which may include more or fewer components than those shown, combine certain components, or employ a different arrangement of components.
Fig. 12 is a schematic structural diagram of a server 1200 according to an embodiment of the present invention. The server 1200 may vary considerably in configuration or performance, and may include one or more processors (CPUs) 1201 and one or more memories 1202, where the one or more memories 1202 store at least one program code, and the at least one program code is loaded and executed by the one or more processors 1201 to implement the methods provided by the foregoing method embodiments. Certainly, the server 1200 may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for performing input and output, and may further include other components for implementing device functions, which are not described herein again.
In an exemplary embodiment, a computer-readable storage medium, such as a memory, is also provided that includes instructions executable by a processor to perform the image recognition methods in the above-described embodiments. For example, the computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a Compact Disc Read-Only Memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, and the like.
It will be understood by those skilled in the art that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing relevant hardware; the program may be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk, or an optical disk.
The above description is only exemplary of the present invention and should not be taken as limiting the invention, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (14)
1. An image recognition method, characterized in that the method comprises:
acquiring an image to be identified;
inputting the image into an image recognition model, performing feature extraction on the image by using the image recognition model to obtain a first feature map, decoding the first feature map based on the importance degrees of feature points in the first feature map, ending the decoding when it is detected, during the decoding, that the position in the first feature map of the feature point with the highest importance degree in any subgraph is the same as that in the previous subgraph, and outputting the feature vector obtained by the decoding;
and decoding the feature vector output by the image recognition model to obtain character information contained in the image.
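For orientation, the early-stop rule at the heart of claim 1 can be sketched in Python/NumPy; the shapes, the function name decode_with_early_stop, and the max_steps cap are illustrative assumptions, not the patented implementation.

```python
# Minimal sketch (not the patented implementation) of claim 1's early-stop
# rule: decoding ends once the most important feature point -- the argmax
# of the attention map -- stays at the same position in the first feature
# map for two consecutive subgraphs.
import numpy as np

def decode_with_early_stop(attention_maps, max_steps=32):
    """attention_maps: iterable of one attention matrix per decoding step.
    Returns the number of decoding steps actually performed."""
    prev_pos = None
    steps = 0
    for attn in attention_maps:
        pos = np.unravel_index(np.argmax(attn), attn.shape)
        if pos == prev_pos:        # most-important position repeated
            break                  # the image is fully decoded; stop
        prev_pos = pos
        steps += 1
        if steps >= max_steps:     # safety cap if the rule never fires
            break
    return steps

# Toy run: the attention peak moves for three steps, then repeats,
# so decoding stops after 3 steps.
maps = []
for i in range(3):
    m = np.zeros((4, 8))
    m[0, i] = 1.0                  # peak moves one column per step
    maps.append(m)
maps.append(maps[-1].copy())       # fourth map repeats the third's peak
print(decode_with_early_stop(maps))  # 3
```

The intuition matches the claim: once the most important feature point stops moving between consecutive subgraphs, further decoding steps would only repeat, so decoding ends.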
2. The method according to claim 1, wherein the decoding the first feature map based on the importance degree of the feature points in the first feature map comprises:
acquiring a plurality of first sequences of the first feature map, wherein each first sequence is used for representing feature information of one subgraph in the first feature map and of the subgraphs located before and after that subgraph in the scanning order;
acquiring a plurality of attention matrices based on the plurality of first sequences, wherein each attention matrix is used for representing the importance degree of the corresponding subgraph to the first feature map;
decoding the first feature map based on the plurality of attention matrices.
3. The method of claim 2, wherein the acquiring a plurality of first sequences of the first feature map comprises:
sequentially inputting each subgraph in the first feature map into an encoder, wherein the encoder comprises at least one first hidden layer unit;
for each first hidden layer unit, performing, by the first hidden layer unit, a weighting operation on the received subgraph of the first feature map and the first sequence output by the previous first hidden layer unit to obtain a first sequence.
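Claim 3 describes what amounts to a simple recurrent pass over the subgraphs; the sketch below is one such pass, with the weight shapes and the tanh nonlinearity as illustrative assumptions.

```python
# Minimal sketch (shapes, weights, and tanh are illustrative assumptions)
# of claim 3's encoder: each first hidden layer unit combines the current
# subgraph of the first feature map with the first sequence output by the
# previous unit.
import numpy as np

def encode_first_sequences(feature_map, W_x, W_h):
    """feature_map: (n_subgraphs, feat_dim) -- each row is one subgraph
    (e.g. one column of the CNN feature map, flattened).
    Returns one first sequence (hidden state) per subgraph."""
    h = np.zeros(W_h.shape[0])
    sequences = []
    for x in feature_map:          # scan subgraphs in order
        # weighted combination of the subgraph and the previous sequence
        h = np.tanh(W_x @ x + W_h @ h)
        sequences.append(h)
    return np.stack(sequences)

rng = np.random.default_rng(0)
fmap = rng.random((10, 16))        # 10 subgraphs, 16 features each
W_x = rng.random((32, 16)) * 0.1
W_h = rng.random((32, 32)) * 0.1
print(encode_first_sequences(fmap, W_x, W_h).shape)  # (10, 32)
```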
4. The method of claim 2, wherein obtaining a plurality of attention matrices based on the plurality of first sequences comprises:
inputting all of the first sequences into a decoder, wherein the decoder comprises at least one second hidden layer unit;
for each second hidden layer unit, performing, by the second hidden layer unit, a similarity comparison between the received second sequence output by the previous second hidden layer unit and all of the first sequences to obtain a second sequence, wherein each group of elements in the second sequence is used for indicating the similarity between the second sequence of the previous second hidden layer unit and one first sequence, and a larger similarity yields a larger value of the group of elements used for indicating the similarity;
and performing a weighting operation on the plurality of second sequences and all of the first sequences to generate a plurality of attention matrices.
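Claim 4's similarity-then-weighting steps resemble dot-product attention; in the sketch below, the dot-product score and the softmax normalization are assumptions, since the claim only requires that larger similarity produce larger element values.

```python
# Minimal sketch of claim 4's attention step. The dot-product scoring and
# softmax normalization are assumptions, not mandated by the claim.
import numpy as np

def attention_step(prev_second_sequence, first_sequences):
    """prev_second_sequence: (hidden_dim,) output of the previous second
    hidden layer unit; first_sequences: (n, hidden_dim).
    Returns the similarity weights and the weighted context vector."""
    scores = first_sequences @ prev_second_sequence  # one similarity per first sequence
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                         # normalize to sum to 1
    context = weights @ first_sequences              # weighted sum of first sequences
    return weights, context

rng = np.random.default_rng(1)
firsts = rng.standard_normal((10, 32))
state = rng.standard_normal(32)
weights, context = attention_step(state, firsts)
print(weights.sum().round(3), context.shape)         # 1.0 (32,)
```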
5. The method of claim 2, wherein the ending the decoding when it is detected, during the decoding, that the position in the first feature map of the feature point with the highest importance degree in any subgraph is the same as that in the previous subgraph comprises:
acquiring, for each subgraph, the position of the maximum element in the attention matrix of the subgraph relative to the first feature map;
and when the positions, relative to the first feature map, of the maximum elements in the attention matrices of any subgraph and of the previous subgraph are the same, determining that decoding of the image is complete and ending the decoding.
6. The method according to claim 1, wherein the decoding the feature vector output by the image recognition model to obtain the character information included in the image comprises:
and comparing each of the plurality of feature vectors with a standard vector set for similarity, determining, for each feature vector, the standard vector with the highest similarity, and taking the characters indicated by the determined standard vectors as the characters contained in the image.
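Claim 6's character lookup is a nearest-neighbor match against a standard vector set; in the sketch below, the cosine similarity measure, the 26-letter alphabet, and the random standard vectors are illustrative assumptions.

```python
# Minimal sketch of claim 6's character lookup: each decoded feature
# vector is matched against a standard vector set, and the character of
# the most similar standard vector is emitted.
import numpy as np

def vectors_to_text(feature_vectors, standard_vectors, alphabet):
    """feature_vectors: (t, d); standard_vectors: (k, d); alphabet: k chars."""
    f = feature_vectors / np.linalg.norm(feature_vectors, axis=1, keepdims=True)
    s = standard_vectors / np.linalg.norm(standard_vectors, axis=1, keepdims=True)
    sims = f @ s.T                  # cosine similarity, shape (t, k)
    best = sims.argmax(axis=1)      # most similar standard vector per step
    return "".join(alphabet[i] for i in best)

rng = np.random.default_rng(2)
standards = rng.standard_normal((26, 64))
alphabet = "abcdefghijklmnopqrstuvwxyz"
# Feature vectors built as noisy copies of the standards for 'c', 'a', 't'.
feats = standards[[2, 0, 19]] + 0.05 * rng.standard_normal((3, 64))
print(vectors_to_text(feats, standards, alphabet))   # "cat"
```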
7. An image recognition apparatus, characterized in that the apparatus comprises:
the acquisition module is used for acquiring an image to be identified;
the output module is used for inputting the image into an image recognition model, performing feature extraction on the image by using the image recognition model to obtain a first feature map, decoding the first feature map based on the importance degrees of feature points in the first feature map, ending the decoding when it is detected, during the decoding, that the position in the first feature map of the feature point with the highest importance degree in any subgraph is the same as that in the previous subgraph, and outputting the feature vector obtained by the decoding;
and the decoding module is used for decoding the feature vector output by the image recognition model to obtain the character information contained in the image.
8. The apparatus of claim 7, wherein the output module is configured to:
acquiring a plurality of first sequences of the first feature map, wherein each first sequence is used for representing feature information of one subgraph in the first feature map and of the subgraphs located before and after that subgraph in the scanning order;
acquiring a plurality of attention matrices based on the plurality of first sequences, wherein each attention matrix is used for representing the importance degree of the corresponding subgraph to the first feature map;
decoding the first feature map based on the plurality of attention matrices.
9. The apparatus of claim 8, wherein the output module is configured to:
sequentially inputting each subgraph in the first feature map into an encoder, wherein the encoder comprises at least one first hidden layer unit;
for each first hidden layer unit, performing, by the first hidden layer unit, a weighting operation on the received subgraph of the first feature map and the first sequence output by the previous first hidden layer unit to obtain a first sequence.
10. The apparatus of claim 8, wherein the output module is configured to:
inputting all of the first sequences into a decoder, wherein the decoder comprises at least one second hidden layer unit;
for each second hidden layer unit, performing, by the second hidden layer unit, a similarity comparison between the received second sequence output by the previous second hidden layer unit and all of the first sequences to obtain a second sequence, wherein each group of elements in the second sequence is used for indicating the similarity between the second sequence of the previous second hidden layer unit and one first sequence, and a larger similarity yields a larger value of the group of elements used for indicating the similarity;
and performing a weighting operation on the plurality of second sequences and all of the first sequences to generate a plurality of attention matrices.
11. The apparatus of claim 8, wherein the output module is configured to:
acquiring, for each subgraph, the position of the maximum element in the attention matrix of the subgraph relative to the first feature map;
and when the positions, relative to the first feature map, of the maximum elements in the attention matrices of any subgraph and of the previous subgraph are the same, determining that decoding of the image is complete and ending the decoding.
12. The apparatus of claim 7, wherein the decoding module is configured to:
and comparing each of the plurality of feature vectors with a standard vector set for similarity, determining, for each feature vector, the standard vector with the highest similarity, and taking the characters indicated by the determined standard vectors as the characters contained in the image.
13. A computer device comprising one or more processors and one or more memories having at least one program code stored therein, the at least one program code being loaded and executed by the one or more processors to perform operations performed by the image recognition method of any one of claims 1 to 6.
14. A computer-readable storage medium having at least one program code stored therein, the at least one program code being loaded and executed by a processor to perform operations performed by the image recognition method according to any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910523751.4A CN110232417B (en) | 2019-06-17 | 2019-06-17 | Image recognition method and device, computer equipment and computer readable storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110232417A CN110232417A (en) | 2019-09-13 |
CN110232417B true CN110232417B (en) | 2022-10-25 |
Family
ID=67860001
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910523751.4A Active CN110232417B (en) | 2019-06-17 | 2019-06-17 | Image recognition method and device, computer equipment and computer readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110232417B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111027390B (en) * | 2019-11-11 | 2023-10-10 | 北京三快在线科技有限公司 | Object class detection method and device, electronic equipment and storage medium |
CN113435530B (en) * | 2021-07-07 | 2023-10-10 | 腾讯科技(深圳)有限公司 | Image recognition method, device, computer equipment and computer readable storage medium |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150039637A1 (en) * | 2013-07-31 | 2015-02-05 | The Nielsen Company (Us), Llc | Systems Apparatus and Methods for Determining Computer Apparatus Usage Via Processed Visual Indicia |
US10354168B2 (en) * | 2016-04-11 | 2019-07-16 | A2Ia S.A.S. | Systems and methods for recognizing characters in digitized documents |
RU2691214C1 (en) * | 2017-12-13 | 2019-06-11 | ABBYY Production LLC | Text recognition using artificial intelligence |
Patent Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2018207059A1 (en) * | 2017-05-10 | 2018-11-15 | Sisvel Technology S.R.L. | Methods and apparatuses for encoding and decoding digital light field images |
WO2019002662A1 (en) * | 2017-06-26 | 2019-01-03 | Nokia Technologies Oy | An apparatus, a method and a computer program for omnidirectional video |
CN107527059A (en) * | 2017-08-07 | 2017-12-29 | 北京小米移动软件有限公司 | Character recognition method, device and terminal |
US10095977B1 (en) * | 2017-10-04 | 2018-10-09 | StradVision, Inc. | Learning method and learning device for improving image segmentation and testing method and testing device using the same |
CN108235058A (en) * | 2018-01-12 | 2018-06-29 | 广州华多网络科技有限公司 | Video quality processing method, storage medium and terminal |
CN108615036A (en) * | 2018-05-09 | 2018-10-02 | 中国科学技术大学 | A kind of natural scene text recognition method based on convolution attention network |
CN108777794A (en) * | 2018-06-25 | 2018-11-09 | 腾讯科技(深圳)有限公司 | The coding method of image and device, storage medium, electronic device |
CN109117846A (en) * | 2018-08-22 | 2019-01-01 | 北京旷视科技有限公司 | A kind of image processing method, device, electronic equipment and computer-readable medium |
CN109684980A (en) * | 2018-09-19 | 2019-04-26 | 腾讯科技(深圳)有限公司 | Automatic marking method and device |
CN109543667A (en) * | 2018-11-14 | 2019-03-29 | 北京工业大学 | A kind of text recognition method based on attention mechanism |
Non-Patent Citations (2)
Title |
---|
A new approach to high-capacity annotation watermarking based on digital fountain codes; Korus, P. et al.; Multimedia Tools and Applications; 2014-12-31; Vol. 68, No. 1; pp. 59-77 * |
Design and Implementation of an Image Recognition System Based on Deep Learning; Wang Delian; China Master's Theses Full-text Database, Information Science and Technology; 2018-12-15; No. 12; pp. I138-1152 * |
Also Published As
Publication number | Publication date |
---|---|
CN110232417A (en) | 2019-09-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111476306B (en) | Object detection method, device, equipment and storage medium based on artificial intelligence | |
CN110121118B (en) | Video clip positioning method and device, computer equipment and storage medium | |
CN110210571B (en) | Image recognition method and device, computer equipment and computer readable storage medium | |
US12094229B2 (en) | Character recognition method and apparatus, computer device, and storage medium | |
CN111079576B (en) | Living body detection method, living body detection device, living body detection equipment and storage medium | |
WO2020224479A1 (en) | Method and apparatus for acquiring positions of target, and computer device and storage medium | |
CN110544272B (en) | Face tracking method, device, computer equipment and storage medium | |
CN110807361B (en) | Human body identification method, device, computer equipment and storage medium | |
CN110083791B (en) | Target group detection method and device, computer equipment and storage medium | |
CN109086709A (en) | Feature Selection Model training method, device and storage medium | |
CN111091576A (en) | Image segmentation method, device, equipment and storage medium | |
CN110059652B (en) | Face image processing method, device and storage medium | |
CN112749613B (en) | Video data processing method, device, computer equipment and storage medium | |
CN110147533B (en) | Encoding method, apparatus, device and storage medium | |
CN110162604B (en) | Statement generation method, device, equipment and storage medium | |
CN112036331A (en) | Training method, device and equipment of living body detection model and storage medium | |
CN110991457B (en) | Two-dimensional code processing method and device, electronic equipment and storage medium | |
CN110570460A (en) | Target tracking method and device, computer equipment and computer readable storage medium | |
CN110503160B (en) | Image recognition method and device, electronic equipment and storage medium | |
CN110807769B (en) | Image display control method and device | |
CN110232417B (en) | Image recognition method and device, computer equipment and computer readable storage medium | |
CN113763931B (en) | Waveform feature extraction method, waveform feature extraction device, computer equipment and storage medium | |
CN110675473A (en) | Method, device, electronic equipment and medium for generating GIF dynamic graph | |
CN112001442B (en) | Feature detection method, device, computer equipment and storage medium | |
CN110728167A (en) | Text detection method and device and computer readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||