CN115376140A - Image processing method, apparatus, device and medium - Google Patents

Image processing method, apparatus, device and medium Download PDF

Info

Publication number
CN115376140A
Authority
CN
China
Prior art keywords
character
formula
module
prediction result
semantic
Prior art date
Legal status
Pending
Application number
CN202211032520.1A
Other languages
Chinese (zh)
Inventor
李兵
Current Assignee
Shenzhen Xingtong Technology Co ltd
Original Assignee
Shenzhen Xingtong Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Xingtong Technology Co ltd
Priority to CN202211032520.1A
Publication of CN115376140A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 Character recognition
    • G06V 30/18 Extraction of features or characteristics of the image
    • G06V 30/1801 Detecting partial patterns, e.g. edges or contours, or configurations, e.g. loops, corners, strokes or intersections
    • G06V 30/18019 Detecting partial patterns, e.g. edges or contours, or configurations, e.g. loops, corners, strokes or intersections by matching or filtering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 Character recognition
    • G06V 30/19 Recognition using electronic means
    • G06V 30/191 Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V 30/19173 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Character Discrimination (AREA)

Abstract

The present disclosure relates to an image processing method, apparatus, device, and medium. The method includes: acquiring a target image containing a target formula; inputting the target image into a pre-trained formula recognition model for formula recognition to obtain a first character prediction result and a first relation prediction result corresponding to each character in the target formula, where the formula recognition model is trained on formula sample images with labeling information, and the labeling information includes a character sequence, a semantic feature vector, and a character detection box corresponding to the formula sample; and obtaining the recognition result of the target formula in the target image according to the first character prediction result and the first relation prediction result corresponding to each character. The method and the device can improve the accuracy of formula recognition.

Description

Image processing method, apparatus, device and medium
Technical Field
The present disclosure relates to the field of image processing technologies, and in particular, to an image processing method, apparatus, device, and medium.
Background
Natural scene text recognition is the process of recognizing a sequence of characters from an image that contains textual content. The text content on images covers a very wide range, and formulas are among the most complex cases: multiple lines interleaved with single lines, superscripts and subscripts, varying font sizes, and so on. Because a formula has a more complex arrangement than conventional text, recognizing it with a conventional text recognition approach is difficult and yields poor results.
Disclosure of Invention
To solve the technical problems described above or at least partially solve the technical problems, the present disclosure provides an image processing method, apparatus, device, and medium.
According to an aspect of the present disclosure, there is provided an image processing method including:
acquiring a target image containing a target formula;
inputting the target image into a pre-trained formula recognition model for formula recognition to obtain a first character prediction result and a first relation prediction result corresponding to each character in the target formula; the formula identification model is obtained by training based on a formula sample image with labeling information, wherein the labeling information comprises: a character sequence, a semantic feature vector and a character detection box corresponding to the formula sample;
and obtaining the recognition result of the target formula in the target image according to the first character prediction result and the first relation prediction result corresponding to each character.
According to another aspect of the present disclosure, there is provided an image processing apparatus including:
the image acquisition module is used for acquiring a target image containing a target formula;
the formula recognition module is used for inputting the target image into a pre-trained formula recognition model for formula recognition to obtain a first character prediction result and a first relation prediction result corresponding to each character in the target formula; the formula recognition model is obtained by training based on a formula sample image with labeling information, wherein the labeling information comprises: a character sequence, a semantic feature vector and a character detection box corresponding to the formula sample;
and the result acquisition module is used for acquiring the recognition result of the target formula in the target image according to the first character prediction result and the first relation prediction result corresponding to each character.
According to another aspect of the present disclosure, there is provided an electronic device including: a processor; and a memory storing a program, wherein the program comprises instructions which, when executed by the processor, cause the processor to carry out the image processing method described above.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the image processing method described above.
Compared with the prior art, the technical scheme provided by the embodiment of the disclosure has the following advantages:
according to the image processing method, the image processing device, the image processing equipment and the image processing medium, the target image containing the target formula is obtained; inputting the target image into a pre-trained formula recognition model for formula recognition to obtain a first character prediction result and a first relation prediction result corresponding to each character in the target formula; and obtaining the recognition result of the target formula in the target image according to the first character prediction result and the first relation prediction result corresponding to each character. The technical method can improve the accuracy of formula identification.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure.
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below; it is obvious that those skilled in the art can obtain other drawings from these drawings without inventive effort.
Fig. 1 is a flowchart of an image processing method provided in an embodiment of the present disclosure;
FIG. 2 is a flowchart of a model training method provided by an embodiment of the present disclosure;
fig. 3 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present disclosure;
fig. 4 is a schematic structural diagram of an electronic device provided in an embodiment of the present disclosure.
Detailed Description
In order that the above objects, features and advantages of the present disclosure can be more clearly understood, embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more complete and thorough understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "including" and variations thereof as used herein is intended to be open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description. It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It is noted that references to "a" or "an" in this disclosure are intended to be illustrative rather than limiting, and that those skilled in the art will appreciate that references to "one or more" are intended to be exemplary and not limiting unless the context clearly indicates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
Character recognition in natural scenes is a very challenging task; besides factors such as complex picture backgrounds and illumination changes, the complexity of the recognition output space is also a difficulty. At present there are generally two kinds of recognition methods. One is based on a bottom-up strategy, which splits the recognition problem into character detection, character recognition, and character combination and solves them one by one. The other is based on a holistic strategy, namely a sequence-to-sequence method, which first encodes the image and then decodes the sequence to obtain the whole character string directly. However, the first method requires character-level labeling, that is, the position and content of every character on the input image must be annotated, which consumes a great deal of labor; the second method is simple to label, requiring only the transcribed character string, but the recognition result may contain extra characters or miss characters.
The text content on a text image is very broad: Chinese, English, numerals, or a combination of them, and of course formulas as well. Chinese and English characters or numbers have a simple arrangement structure that can be read as it appears; the characters do not differ much in size, and the overall structure is linear. The arrangement of a formula is more complex than that of general text content, for example multiple lines interleaved with single lines, superscripts and subscripts, and different font sizes, and the formula as a whole has a nonlinear structure. It can be understood that, compared with conventional text recognition, the recognition difficulty is greater for a formula with a complicated arrangement structure.
At present, formulas can be recognized with conventional text recognition approaches, but the recognition accuracy is low and the recognition effect is poor. For example, methods based on character-level labeling are costly, and the character detection stage struggles to detect characters accurately because of their varying sizes and positions; sequence-to-sequence methods cannot achieve good alignment and often miss characters or recognize extra ones.
The CRNN (Convolutional Recurrent Neural Network) model is a sequence-to-sequence text recognition method that strikes a good balance between accuracy and speed, but it still suffers from an excessive number of parameters and a long training period. The Transformer, as a very effective encoder-decoder network structure, has achieved very good results in tasks such as natural language processing and computer vision, and decoding can be performed in a CTC-based (Connectionist Temporal Classification) or Attention-based manner. For the recognition of complex formulas, neither the CTC-based nor the Attention-based manner can effectively align the predicted characters with their corresponding feature regions, so both perform poorly. Nevertheless, the accuracy of the Attention approach is higher than that of the CTC approach, and the inventors have found the reasons to be: first, the alignment of the Attention approach is more flexible and can make use of global information; second, the Attention approach can learn an implicit language model during decoding. It is further observed that the decoding structure of both CTC and Attention is linear, with only one character per vertical position when decoding from left to right, whereas for the nonlinear structure of a formula there may be one, two, or more characters at each position; with linear decoding, the network is therefore forced to learn an alignment that is either irregular, or regular but much more complex. For the network, the more complex the rule, the harder it is to learn, and the harder it is to guarantee the effect.
For a formula image, the following key information may be considered: the number of characters in the whole formula, the position information of each character, the structure information between the characters and the semantic information of the whole formula.
In view of the above, in order to improve the recognition accuracy for formulas with complex structures, the embodiments of the present disclosure provide an image processing method, apparatus, device, and medium. The embodiments construct a formula recognition model with a new network structure on the basis of existing network models, so that the formula recognition model can jointly use the number of characters, the positions of the characters, and semantic information to recognize formulas in images, thereby improving the accuracy of formula recognition. For ease of understanding, the embodiments of the present disclosure are described below.
Fig. 1 is a flowchart of an image processing method according to an embodiment of the present disclosure, which may be performed by an image processing apparatus configured in a terminal or a server, and the apparatus may be implemented by software and/or hardware. Referring to fig. 1, the method includes the steps of:
step S102, a target image containing a target formula is obtained.
In this embodiment, the target image may be acquired on the terminal through an image selection operation, an image shooting operation, an image uploading operation, or the like. The target formula contained in the target image may be a formula that expresses the relation between several quantities with mathematical symbols in natural sciences such as mathematics, physics, or chemistry; the characters in the target formula may include Chinese characters, English characters, numerals and/or operation symbols, and each position of the target formula contains at least one character.
Step S104, inputting the target image into a pre-trained formula recognition model for formula recognition, and obtaining a first character prediction result and a first relation prediction result corresponding to each character in the target formula; the formula recognition model is obtained by training on formula sample images with labeling information, and the labeling information includes: a character sequence, a semantic feature vector and a character detection box corresponding to the formula sample.
In this embodiment, the labeling information of the formula sample image includes: a character sequence, a semantic feature vector and a character detection box corresponding to the formula sample. The character sequence can represent the characters in the formula sample and their number, the semantic feature vector can represent the logical semantic information of the proposition defined or expressed by the formula sample, and the character detection box can represent the position of each character in the formula sample. The formula recognition model obtained by training on formula sample images with this labeling information can therefore learn the number of characters, the character positions and the semantic information of the formula samples. On this basis, the target image containing the target formula is input into the trained formula recognition model, and when the formula recognition model performs formula recognition on the target image it can make full use of multi-dimensional information such as the number of characters, the character positions and the semantic information of the target formula, thereby improving the accuracy of formula recognition.
The formula recognition model may include: a feature mapping module, a self-attention module, a semantic extraction module and a character prediction module, wherein the feature mapping module, the self-attention module and the semantic extraction module are connected in sequence, and the character prediction module is connected to the self-attention module and the semantic extraction module respectively. The input of the feature mapping module is the target image, and its output is a first mapping feature corresponding to the target image; the input of the self-attention module is the first mapping feature, and its output is a second mapping feature; the input of the semantic extraction module is the second mapping feature, and its output is a first global semantic vector; the input of the character prediction module is the second mapping feature and the first global semantic vector, and its output is the first character prediction result and the first relation prediction result corresponding to each character in the target formula, where the first relation prediction result indicates the relation between the character c_i at any position i and the character c_{i-1} at the adjacent previous position i-1, for example square, cube, superscript, subscript, or the like.
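The data flow through these four modules can be illustrated with a minimal sketch. The module internals, layer choices, and tensor shapes below are assumptions made for illustration only and do not reproduce the actual network described in this disclosure.

```python
import torch
import torch.nn as nn

class FormulaRecognizer(nn.Module):
    """Sketch of the four-module wiring: feature mapping -> self-attention ->
    semantic extraction, with the character prediction collapsed into per-position
    heads that consume both the attended features and the global semantic vector."""
    def __init__(self, feat_dim=256, vocab_size=500, rel_size=10):
        super().__init__()
        # Feature mapping module (stand-in for the ResNet-style backbone).
        self.feature_mapping = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, stride=2, padding=1), nn.ReLU())
        # Self-attention module operating on flattened spatial positions.
        self.self_attention = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)
        # Semantic extraction module: global pooling into one semantic vector.
        self.semantic_extraction = nn.AdaptiveMaxPool1d(1)
        # Character prediction heads: a character and a relation per position.
        self.char_head = nn.Linear(feat_dim, vocab_size)
        self.rel_head = nn.Linear(feat_dim, rel_size)

    def forward(self, image):
        f1 = self.feature_mapping(image)                       # first mapping feature
        tokens = f1.flatten(2).transpose(1, 2)                 # (B, H*W, C)
        f2, _ = self.self_attention(tokens, tokens, tokens)    # second mapping feature
        sem = self.semantic_extraction(f2.transpose(1, 2)).squeeze(-1)  # global semantic vector
        char_logits = self.char_head(f2 + sem.unsqueeze(1))
        rel_logits = self.rel_head(f2 + sem.unsqueeze(1))
        return char_logits, rel_logits

# Usage: per-position character and relation logits for a dummy 32x128 image.
chars, rels = FormulaRecognizer()(torch.randn(1, 3, 32, 128))
```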
And S106, obtaining the recognition result of the target formula in the target image according to the first character prediction result and the first relation prediction result corresponding to each character.
It can be understood that after the first character prediction result and the first relation prediction result corresponding to each character are obtained, the first character prediction results of each character may be arranged according to the first relation prediction result to obtain the recognition result of the target formula in the target image.
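As a rough illustration of how the per-character prediction pairs might be assembled into a final string, the sketch below maps relation labels to LaTeX-style connectors. The relation vocabulary and the connector mapping are assumptions for illustration, not the scheme defined in this disclosure.

```python
# Hypothetical relation labels and how each one joins the current character
# to the character at the adjacent previous position.
RELATION_CONNECTORS = {
    "right": "{prev}{cur}",           # plain left-to-right concatenation
    "superscript": "{prev}^{{{cur}}}",
    "subscript": "{prev}_{{{cur}}}",
}

def assemble_formula(char_preds, rel_preds):
    """Fold (character, relation) pairs into a single LaTeX-like string."""
    result = ""
    for char, rel in zip(char_preds, rel_preds):
        template = RELATION_CONNECTORS.get(rel, "{prev}{cur}")
        result = template.format(prev=result, cur=char)
    return result

# Example: x squared plus one  ->  "x^{2}+1"
print(assemble_formula(["x", "2", "+", "1"], ["right", "superscript", "right", "right"]))
```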
The image processing method provided by the embodiment of the disclosure comprises the steps of obtaining a target image containing a target formula; inputting the target image into a pre-trained formula recognition model for formula recognition to obtain a first character prediction result and a first relation prediction result corresponding to each character in the target formula; and obtaining the recognition result of the target formula in the target image according to the first character prediction result and the first relation prediction result corresponding to each character. According to the image processing method, the formula recognition model can fully utilize the number of characters, the positions of the characters and semantic information, so that formula recognition is carried out from the multiple dimensions when the target image is processed, and the accuracy of formula recognition is effectively improved.
Before the formula recognition model can be applied directly to formula recognition on images, it needs to be trained; the purpose of training is to determine model parameters that meet the requirements, and with the trained parameters the formula recognition model can achieve the expected recognition effect. As shown in fig. 2, this embodiment provides a method for training the formula recognition model, which includes the following steps:
step S202, a first training set and a second training set are obtained; the character sequence, the semantic feature vector and the character detection frame corresponding to the formula sample are marked on the formula sample image in the first training set, and the character sequence and the semantic feature vector corresponding to the formula sample are marked on the formula sample image in the second training set.
In one embodiment, a plurality of original formula sample images are acquired. Each original formula sample image contains a single line of formula text content, i.e., a formula sample. In order to increase the richness and diversity of the training samples, the formula sample in this embodiment may be straight text, inclined text, and/or curved text, and the original formula sample image may be a blurred image, a photocopied image, a high-definition image, or the like. Text character information is labeled on the original formula sample image to obtain the character sequence corresponding to the formula sample; specifically, for example, the LaTeX character sequence corresponding to the formula sample may be labeled manually.
In this embodiment, a dictionary may also be established from the character sequences labeled on the plurality of original formula sample images; the dictionary includes the characters appearing in these images. The specific establishing method is: collecting the individual characters of the labeled character sequences and taking their union to obtain a dictionary in which every character is unique and unrepeated. The dictionary established here can serve as the character database for subsequent character recognition based on the formula recognition model, where specific characters are selected from the database according to the recognition probability.
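A dictionary of this kind could be built along the following lines. The tokenisation of the labeled sequences into single characters is simplified here and is only an assumption about how the union of characters might be collected.

```python
def build_dictionary(labeled_sequences):
    """Collect the distinct characters of all labeled character sequences,
    so that every entry in the dictionary is unique and unrepeated."""
    vocab = {}
    for sequence in labeled_sequences:
        for ch in sequence:          # simplified: one symbol per Python character
            if ch not in vocab:
                vocab[ch] = len(vocab)
    return vocab

# Example with two labeled formula samples.
dictionary = build_dictionary(["x^{2}+1", "a_{i}-b"])
print(dictionary)  # each character maps to a unique index
```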
And marking a character detection frame on the original formula sample image. In a specific manner, the acquired original formula sample images can be divided into a first image set and a second image set according to a preset allocation ratio, for example, the number of images in the first image set is one ninth of the number of images in the second image set. Then, a character detection box labeling is carried out on each character on the original formula sample image of the first image set, and the character detection box can be a rectangular box.
And marking semantic feature vectors for the original formula sample images. An embodiment of obtaining semantic feature vectors is provided herein, and reference is made to the following.
A second formula sample image labeled with a character sequence and a character detection box is acquired; the second formula sample image may be any original formula sample image labeled with a character sequence and a character detection box.
Each character in the second formula sample image is encoded into a triple according to the labeled character sequence and character detection boxes, where the triple structure includes: the current character, the adjacent character at the previous position corresponding to the current character, and the relation between the current character and the adjacent character. In a specific example, every two adjacent characters in the sample formula are determined according to the character sequence and the character detection boxes; each pair of adjacent characters includes the current character c_i at any position i and the adjacent character c_{i-1} at the previous position i-1. In natural sciences such as mathematics and physics, there is a finite set of relations between two adjacent characters in a formula, such as square, cube, superscript, subscript, and the like; thus, the relation between each two adjacent characters can be determined according to the grammatical rules of the sample formula. Each pair of adjacent characters and the relation between them is encoded into a triple, which can be expressed as (node 1, node 2, relation), where node 1 and node 2 correspond to the current character c_i and the adjacent character c_{i-1} at the previous position, respectively.
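A possible encoding of adjacent character pairs into such triples is sketched below. The box format (x0, y0, x1, y1), the geometric rule used to infer the relation from box positions, and the relation names are illustrative assumptions, not the grammatical rules referred to above.

```python
def encode_triples(chars, boxes):
    """Encode each character and its adjacent previous character into
    (node1, node2, relation) triples, inferring the relation from the
    vertical offset between the two character detection boxes."""
    triples = []
    for i in range(1, len(chars)):
        x0, y0, x1, y1 = boxes[i]          # current character c_i
        px0, py0, px1, py1 = boxes[i - 1]  # previous character c_{i-1}
        cur_center = (y0 + y1) / 2
        prev_center = (py0 + py1) / 2
        half_prev_height = (py1 - py0) / 2
        if cur_center < prev_center - half_prev_height:
            relation = "superscript"
        elif cur_center > prev_center + half_prev_height:
            relation = "subscript"
        else:
            relation = "right"
        triples.append((chars[i], chars[i - 1], relation))
    return triples

# Example: "x" followed by a raised "2" yields a superscript triple.
print(encode_triples(["x", "2"], [(0, 10, 8, 20), (9, 2, 14, 9)]))
```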
Then, obtaining a word embedding vector corresponding to the triple structure by a word embedding method; and inputting the word embedding vector corresponding to each character into a pre-trained semantic information extraction model to obtain a semantic feature vector corresponding to the character sequence of the second formula sample image.
The semantic information extraction model can be constructed based on a Transformer model. Its main structure is basically the same as that of the Transformer, with the following two differences. First, the position encoding part of the encoder in the semantic information extraction model divides the characters in the second formula sample image with a grid, so that each character has a two-dimensional coordinate position; the input of the encoder is a one-dimensional sine-cosine vector, and the output is a two-dimensional vector. Second, the decoder part of the semantic information extraction model performs decoding only once; the input of the decoder is any character c_i in the character sequence, and the output is the prediction result for its adjacent character c_{i-1}. The Transformer can greatly shorten network training and inference time and can effectively improve the accuracy of various tasks, so extracting the semantic feature vectors with a semantic information extraction model constructed on a Transformer yields better precision and accuracy.
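The grid-based two-dimensional position encoding described above could look roughly like the following, where a standard one-dimensional sine-cosine encoding is computed separately for the row and column grid coordinates of each character and concatenated. The grid granularity, the dimension split, and the concatenation scheme are assumptions for illustration.

```python
import numpy as np

def sincos_1d(position, dim):
    """Standard 1-D sine/cosine positional encoding for a single coordinate."""
    i = np.arange(dim // 2)
    angles = position / np.power(10000.0, 2 * i / dim)
    return np.concatenate([np.sin(angles), np.cos(angles)])

def grid_position_encoding(row, col, dim=64):
    """Encode a character's (row, col) grid cell by concatenating the 1-D
    encodings of its two coordinates into one position-aware vector."""
    return np.concatenate([sincos_1d(row, dim // 2), sincos_1d(col, dim // 2)])

# Example: the character sitting in grid row 1, column 3.
vec = grid_position_encoding(1, 3)
print(vec.shape)  # (64,)
```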
The semantic information extraction model needs to be trained in advance, and possible training modes of the semantic information extraction model comprise: and training the model to be trained by using the first image set marked with the character sequence and the character detection box and the second image set marked with the character sequence, and only keeping the encoder when the training is finished to obtain the trained semantic information extraction model.
The word embedding vectors corresponding to all characters, combined with the one-dimensional sine-cosine position encoding, are input into the trained semantic information extraction model; the model converts the word embedding vector of each character into a two-dimensional vector, yielding the semantic feature vector corresponding to the character sequence of the second formula sample image. The semantic feature vector is then used to label the original formula sample image.
According to the embodiment, a plurality of original formula sample images are divided into a first image set and a second image set, wherein the images in the first image set are marked with a character sequence, a semantic feature vector and a character detection box, and the images in the second image set are marked with a character sequence and a semantic feature vector; then, the first image set labeled with the completion information is used as a first training set, and the second image set labeled with the completion information is used as a second training set.
Step S204, a first model to be trained is obtained; the first model includes: the device comprises a feature mapping module, a self-attention module, a semantic extraction module, a character prediction module, a convolution module and a quantity prediction module.
In one embodiment, the feature mapping module may use a Resnet network. The Resnet network effectively alleviates the degradation of model performance as the network grows deeper, so the number of network layers can be increased (deepened) to extract more complex feature patterns. The core of the Resnet network is a structure called a residual Block, whose main characteristic is the cross-layer skip connection: a Block contains several convolutional layers, and after the input passes through the Block, the output is added point-wise, channel by channel, to the input. Equivalently, the input follows two branches, one passing through the Block and the other bypassing it directly, and the two branches are merged at the end. The Resnet network achieves good results on natural scene image classification.
In one specific example, the feature mapping module may use a Resnet18 network. Resnet18 comprises N Blocks, each consisting of several convolution operations, and the output of each Block is the input of the next Block; here, the output of every Block is collected to obtain N groups of mapping features. In this embodiment it is considered that the earlier Blocks capture shallow detail features while the later Blocks capture high-level semantic features, and the features from the later Blocks may also lose a great deal of detail information; therefore the N groups of mapping features can be scaled to the same size and stacked (concatenated) so that information fusion yields a better representation, and finally the third mapping feature corresponding to the input image is obtained. In one example, to handle the case where several characters may share the same vertical position in the formula, the number of Blocks is not less than four in the height direction. The third mapping feature obtained from the feature mapping module serves as the input of the self-attention module.
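The multi-scale fusion described above could be sketched as follows, using torchvision's ResNet-18 residual stages as stand-ins for the N Blocks; rescaling every intermediate output to a common size and concatenating them channel-wise is the idea, while the concrete target size and the use of torchvision are assumptions.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet18  # assumes torchvision >= 0.13

class MultiScaleFeatureMapping(torch.nn.Module):
    """Collect the output of every residual stage, rescale all of them to the
    same spatial size, and stack them along the channel axis."""
    def __init__(self):
        super().__init__()
        backbone = resnet18(weights=None)
        self.stem = torch.nn.Sequential(backbone.conv1, backbone.bn1,
                                        backbone.relu, backbone.maxpool)
        self.blocks = torch.nn.ModuleList(
            [backbone.layer1, backbone.layer2, backbone.layer3, backbone.layer4])

    def forward(self, x):
        x = self.stem(x)
        outputs = []
        for block in self.blocks:
            x = block(x)                     # each Block feeds the next one
            outputs.append(x)
        target = outputs[0].shape[-2:]       # rescale everything to the first Block's size
        fused = torch.cat([F.interpolate(o, size=target, mode="bilinear",
                                         align_corners=False) for o in outputs], dim=1)
        return fused                         # fused mapping feature

feat = MultiScaleFeatureMapping()(torch.randn(1, 3, 64, 256))
print(feat.shape)  # channels: 64 + 128 + 256 + 512
```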
The self-attention module may further perform feature extraction on the third mapping feature extracted by the feature mapping module by using two self-attention layers to obtain a new set of feature mappings, which is a fourth mapping feature. The fourth mapped feature dimension is the same as the third mapped feature dimension.
The semantic extraction module mainly comprises a convolution layer and a maximum pooling layer; the pooling window has the same size as the spatial scale of the input mapping feature, so that a second global semantic vector is obtained.
The convolution module is parallel to the semantic extraction module; it may comprise three equal-width convolution layers and output a single-channel feature map. The convolution module is mainly used to predict the character center point of each character in the sample formula based on the fourth mapping feature.
The quantity prediction module is parallel to the convolution module and the semantic extraction module and may comprise two convolution layers and a fully connected layer; the two convolution layers extract features from the fourth mapping feature, and the fully connected layer integrates these features to predict the number of characters in the sample formula.
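The two auxiliary branches could be sketched roughly as below: a small convolutional head that produces a single-channel center-point map, and a count head with two convolutions and a fully connected layer. The channel sizes, the pooling step, and the maximum character count are assumptions for illustration.

```python
import torch
import torch.nn as nn

class CenterPointHead(nn.Module):
    """Convolution module: three equal-width conv layers ending in a
    single-channel map of character center-point scores."""
    def __init__(self, in_ch=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(in_ch, 1, 3, padding=1))

    def forward(self, feat):
        return self.net(feat)          # (B, 1, H, W) center-point heatmap

class CharacterCountHead(nn.Module):
    """Quantity prediction module: two conv layers plus a fully connected
    layer that classifies the number of characters in the formula."""
    def __init__(self, in_ch=256, max_chars=64):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        self.fc = nn.Linear(in_ch, max_chars + 1)

    def forward(self, feat):
        pooled = self.convs(feat).flatten(1)
        return self.fc(pooled)         # logits over possible character counts

feat = torch.randn(2, 256, 16, 64)
print(CenterPointHead()(feat).shape, CharacterCountHead()(feat).shape)
```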
The character prediction module comprises a word embedding layer, an attention layer and an LSTM (Long Short-Term Memory network) unit, wherein the LSTM unit is used as a decoder. The input of the character prediction module comprises a fourth mapping characteristic and a second global semantic vector, the second global semantic vector is used as an initial hidden state, the fourth mapping characteristic is used as a key and a value, and the prediction result of the current character and the relation between the current character and the character at the adjacent previous position are calculated and output.
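One decoding step of such a module could be sketched as follows: the previous character is embedded, the mapping features serve as the attention keys and values, and an LSTM cell whose initial hidden state is the global semantic vector emits logits for the current character and for its relation to the previous character. The layer sizes and the exact way attention and embedding are combined are assumptions.

```python
import torch
import torch.nn as nn

class CharacterPredictionStep(nn.Module):
    """One decoding step of an attention-LSTM character prediction module."""
    def __init__(self, feat_dim=256, vocab_size=500, rel_size=10):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, feat_dim)
        self.attention = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)
        self.lstm = nn.LSTMCell(2 * feat_dim, feat_dim)
        self.char_out = nn.Linear(feat_dim, vocab_size)
        self.rel_out = nn.Linear(feat_dim, rel_size)

    def forward(self, prev_char, feats, hidden, cell):
        # hidden starts from the global semantic vector at the first step.
        query = hidden.unsqueeze(1)                         # (B, 1, C)
        context, _ = self.attention(query, feats, feats)    # attend over mapping features
        emb = self.embed(prev_char)                         # (B, C)
        hidden, cell = self.lstm(torch.cat([emb, context.squeeze(1)], dim=-1),
                                 (hidden, cell))
        return self.char_out(hidden), self.rel_out(hidden), hidden, cell

step = CharacterPredictionStep()
feats = torch.randn(1, 128, 256)   # mapping feature, flattened over spatial positions
hidden = torch.randn(1, 256)       # global semantic vector as the initial hidden state
cell = torch.zeros(1, 256)
char_logits, rel_logits, hidden, cell = step(torch.tensor([0]), feats, hidden, cell)
```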
Step S206, training the first model according to the first training set. The specific training process includes the following.
Inputting a first formula sample image to a first model; the first formula sample image is a formula sample image in a first training set.
Outputting a third mapping characteristic corresponding to the first formula sample image through a characteristic mapping module; outputting, by the self-attention module, a fourth mapping feature based on the third mapping feature; outputting, by a semantic extraction module, a second global semantic vector based on the fourth mapping feature; outputting, by a convolution module, a character center point of each character in a formula sample of the first formula sample image based on the fourth mapping feature; outputting, by a quantity prediction module, a quantity of characters in a formula sample of the first formula sample image based on the fourth mapping feature; and outputting a second character prediction result and a second relation prediction result corresponding to each character in the formula samples of the first formula sample image through a character prediction module based on the fourth mapping characteristic and the second global semantic vector.
And training the first model according to a character sequence, a semantic feature vector and a character detection frame corresponding to the formula sample marked on the first formula sample image, a second character prediction result, a second relation prediction result, a second global semantic vector, a character central point, the number of characters and a preset loss function.
The method specifically includes the following. A first loss function value of the semantic extraction module is calculated from the semantic feature vector labeled on the first formula sample image, the second global semantic vector, and an L1 loss function. The L1 loss function, also called the least absolute deviation or absolute value loss function, minimizes the absolute difference between the target value (the labeled semantic feature vector) and the estimated value (the second global semantic vector).
A second loss function value of the convolution module is calculated from the character detection boxes corresponding to the formula sample labeled on the first formula sample image, the predicted character center points, and a Focal Loss classification loss function.
A third loss function value of the quantity prediction module is calculated from the character detection boxes corresponding to the formula sample labeled on the first formula sample image, the predicted number of characters, and a multi-class cross entropy loss function.
Obtaining a predicted character sequence of the formula sample in the first formula sample image according to the second character prediction result and the second relation prediction result; and calculating a fourth loss function value of the character prediction module according to the predicted character sequence, the character sequence corresponding to the formula sample marked on the first formula sample image and the multi-class cross entropy loss function.
The first model is trained based on the first, second, third, and fourth loss function values. For example, the first loss function value, the second loss function value, the third loss function value, and the fourth loss function value may be weighted according to a preset weighting coefficient to obtain a composite loss function value, and the first model may be trained according to the composite loss function value.
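A weighted combination of the four losses could be computed along the following lines. The weight values are placeholders, and the focal loss shown is a generic binary variant rather than the exact formulation used in this disclosure.

```python
import torch
import torch.nn.functional as F

def focal_loss(pred_heatmap, target_heatmap, gamma=2.0):
    """Generic binary focal loss for the character center-point heatmap."""
    p = torch.sigmoid(pred_heatmap)
    pt = torch.where(target_heatmap > 0.5, p, 1 - p)
    return (-(1 - pt) ** gamma * torch.log(pt.clamp(min=1e-6))).mean()

def composite_loss(sem_pred, sem_target,          # semantic extraction module
                   center_pred, center_target,    # convolution module
                   count_logits, count_target,    # quantity prediction module
                   char_logits, char_target,      # character prediction module
                   weights=(1.0, 1.0, 0.5, 1.0)): # placeholder weighting coefficients
    l1 = F.l1_loss(sem_pred, sem_target)                       # first loss function value
    l2 = focal_loss(center_pred, center_target)                # second loss function value
    l3 = F.cross_entropy(count_logits, count_target)           # third loss function value
    l4 = F.cross_entropy(char_logits.flatten(0, 1),            # fourth loss function value
                         char_target.flatten())
    w1, w2, w3, w4 = weights
    return w1 * l1 + w2 * l2 + w3 * l3 + w4 * l4
```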
And S208, after the training of the first model is finished, keeping the parameters of the convolution module unchanged to obtain a second model.
Step S210, training the second model according to the second training set. The manner of training the second model according to the second training set is similar to the above step S206, except that the convolution module does not participate in the training process, and other modules use the second training set for training.
And step S212, when the second model training is converged, determining the feature mapping module, the self-attention module, the semantic extraction module and the character prediction module as formula recognition models. And when the second model training is converged, removing the convolution module and the quantity prediction module, and determining the reserved feature mapping module, the self-attention module, the semantic extraction module and the character prediction module as formula recognition models which are directly used for formula recognition.
In the embodiment, the training process is divided into two stages, namely, the first stage of obtaining the second model by using the first training set for training and the second stage of obtaining the formula recognition model by using the second training set for training, so that the learning rate of the model can be effectively improved, and the training period of the model can be shortened. The formula recognition model obtained by training in the method can perform formula recognition on the input target image, and the accuracy of formula recognition is effectively improved.
Thus, a trained formula recognition model can be obtained according to the above embodiment. On this basis, the present embodiment provides an image processing method based on a formula recognition model, which specifically includes the following contents.
A target image containing a target formula is acquired. Inputting the target image into a formula recognition model trained in advance for formula recognition; the formula recognition model includes: the system comprises a feature mapping module, a self-attention module, a semantic extraction module and a character prediction module; wherein:
inputting the target image into a feature mapping module, and performing feature mapping on the target image through the feature mapping module to obtain a first mapping feature corresponding to the target image; performing feature extraction on the first mapping feature through a self-attention module to obtain a second mapping feature; extracting semantic features of the second mapping features through a semantic extraction module to obtain a first global semantic vector; obtaining a first character prediction result and a first relation prediction result corresponding to each character in the target formula based on the second mapping characteristic and the first global semantic characteristic through a character prediction module; wherein the first relation prediction result is used for indicating the relation between the character and the character at the adjacent previous position.
The character prediction module includes: a word embedding layer, an attention layer, and a decoder. In the prediction process of the character prediction module, the second mapping feature can be represented as vectors through the word embedding layer to obtain the semantic features of the second mapping feature; an attention vector is obtained through the attention layer based on the semantic features of the second mapping feature; and the decoder obtains the first character prediction result and the first relation prediction result corresponding to each character in the target formula based on the attention vector and the first global semantic vector. In order to improve recognition accuracy, the first character prediction result may be obtained according to the pre-established dictionary containing a plurality of characters.
And then, obtaining the recognition result of the target formula in the target image according to the first character prediction result and the first relation prediction result corresponding to each character.
In summary, in the image processing method provided by the above embodiment, the trained formula recognition model is used to perform formula recognition on the target image, and the formula recognition model fully utilizes multidimensional information, such as the number of characters, the positions of the characters, semantic information, and the like, included in the target image, so that the recognition accuracy and precision of the formula can be effectively improved.
Fig. 3 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present disclosure. The image processing apparatus provided by the embodiment of the present disclosure may execute the processing flow provided by the embodiment of the image processing method, as shown in fig. 3, the image processing apparatus 300 includes:
an image obtaining module 302, configured to obtain a target image containing a target formula;
a formula recognition module 304, configured to input the target image to a pre-trained formula recognition model for formula recognition, so as to obtain a first character prediction result and a first relationship prediction result corresponding to each character in the target formula; the formula recognition model is obtained by training based on a formula sample image with labeling information, wherein the labeling information comprises: a character sequence, a semantic feature vector and a character detection box corresponding to the formula sample;
and the result obtaining module 306 is configured to obtain an identification result of the target formula in the target image according to the first character prediction result and the first relation prediction result corresponding to each character.
In some embodiments, the formula identification model comprises: the system comprises a feature mapping module, a self-attention module, a semantic extraction module and a character prediction module; the formula identification module 304 is further configured to:
inputting the target image into the feature mapping module, and performing feature mapping on the target image through the feature mapping module to obtain a first mapping feature corresponding to the target image;
performing feature extraction on the first mapping feature through the self-attention module to obtain a second mapping feature;
extracting semantic features of the second mapping features through the semantic extraction module to obtain a first global semantic vector;
obtaining, by the character prediction module, a first character prediction result and a first relationship prediction result corresponding to each character in the target formula based on the second mapping feature and the first global semantic feature; wherein the first relation prediction result is used for indicating the relation between the character and the character at the adjacent previous position.
In some embodiments, the character prediction module comprises: a word embedding layer, an attention layer, and a decoder; the formula identification module 304 is further configured to:
vectorizing and representing the second mapping characteristic through the word embedding layer to obtain a semantic characteristic of the second mapping characteristic;
obtaining an attention vector based on the semantic features of the second mapping features through the attention layer;
and obtaining a first character prediction result and a first relation prediction result corresponding to each character in the target formula by the decoder based on the attention vector and the first global semantic feature.
In some embodiments, the image processing apparatus 300 further comprises a model training module for:
acquiring a first training set and a second training set; the formula sample images in the first training set are marked with character sequences, semantic feature vectors and character detection frames corresponding to the formula samples, and the formula sample images in the second training set are marked with character sequences and semantic feature vectors corresponding to the formula samples;
acquiring a first model to be trained; the first model includes: the system comprises a feature mapping module, a self-attention module, a semantic extraction module, a character prediction module, a convolution module and a quantity prediction module;
training the first model according to the first training set;
after the training of the first model is finished, keeping the parameters of the convolution module unchanged to obtain a second model;
training the second model according to the second training set;
when the second model training is converged, determining the feature mapping module, the self-attention module, the semantic extraction module and the character prediction module as the formula recognition model.
In some embodiments, the model training module is further to:
inputting a first formula sample image to the first model; wherein the first formula sample image is a formula sample image in the first training set;
outputting a third mapping characteristic corresponding to the first formula sample image through the characteristic mapping module;
outputting, by the self-attention module, a fourth mapping feature based on the third mapping feature;
outputting, by the semantic extraction module, a second global semantic vector based on the fourth mapped feature;
outputting, by the convolution module, a character center point for each character in a formula sample of the first formula sample image based on the fourth mapping feature;
outputting, by the quantity prediction module, a quantity of characters in a formula sample of the first formula sample image based on the fourth mapping feature;
outputting, by the character prediction module, a second character prediction result and a second relation prediction result corresponding to each character in formula samples of the first formula sample image based on the fourth mapping feature and the second global semantic vector;
and training the first model according to a character sequence, a semantic feature vector and a character detection frame which correspond to the formula sample marked on the first formula sample image, the second character prediction result, the second relation prediction result, the second global semantic vector, the character central point, the character quantity and a preset loss function.
In some embodiments, the model training module is further to:
calculating a first loss function value of the semantic extraction module according to the semantic feature vector, the second global semantic vector and the L1 loss function marked on the first formula sample image;
calculating a second Loss function value of the convolution module according to a character detection frame corresponding to the formula sample marked on the first formula sample image, the character center point and a classification Loss function Focal local Loss function;
calculating a third loss function value of the quantity prediction module according to a character detection frame corresponding to the formula sample marked on the first formula sample image, the character quantity and a multi-class cross entropy loss function;
obtaining a predicted character sequence of the formula sample in the first formula sample image according to the second character prediction result and the second relation prediction result;
calculating a fourth loss function value of the character prediction module according to the predicted character sequence, the character sequence corresponding to the formula sample marked on the first formula sample image and the multi-class cross entropy loss function;
training the first model according to the first, second, third, and fourth loss function values.
In some embodiments, the image processing apparatus 300 further comprises a semantic feature vector acquisition module for:
acquiring a second formula sample image marked with a character sequence and a character detection frame;
and carrying out three-tuple encoding on each character in the second formula sample image according to the marked character sequence and the character detection frame to obtain a triple structure, wherein the triple structure comprises: a current character, an adjacent character at a previous position corresponding to the current character, and a relationship between the current character and the adjacent character;
obtaining a word embedding vector corresponding to the triple structure by a word embedding method;
and inputting the word embedding vector corresponding to each character into a pre-trained semantic information extraction model to obtain a semantic feature vector corresponding to the character sequence of the second formula sample image.
The device provided by this embodiment has the same implementation principle and technical effect as the method embodiments; for brevity, where the device embodiment is not described in detail, reference may be made to the corresponding content in the method embodiments.
An exemplary embodiment of the present disclosure also provides an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor. The memory stores a computer program executable by the at least one processor, the computer program, when executed by the at least one processor, is for causing the electronic device to perform a method according to an embodiment of the disclosure.
The exemplary embodiments of the present disclosure also provide a computer program product comprising a computer program, wherein the computer program, when executed by a processor of a computer, is adapted to cause the computer to perform a method according to an embodiment of the present disclosure.
Referring to fig. 4, a block diagram of an electronic device 400, which may be a server or a client of the present disclosure and is an example of a hardware device applicable to aspects of the present disclosure, will now be described. The electronic device is intended to represent various forms of digital electronic computer devices, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant as examples only and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 4, the electronic device 400 includes a computing unit 401 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 402 or a computer program loaded from a storage unit 408 into a Random Access Memory (RAM) 403. In the RAM 403, various programs and data required for the operation of the device 400 can also be stored. The calculation unit 401, the ROM 402, and the RAM 403 are connected to each other via a bus 404. An input/output (I/O) interface 405 is also connected to bus 404.
A number of components in the electronic device 400 are connected to the I/O interface 405, including: an input unit 406, an output unit 407, a storage unit 408, and a communication unit 409. The input unit 406 may be any type of device capable of inputting information to the electronic device 400; it may receive input numeric or character information and generate key signal inputs related to user settings and/or function control of the electronic device. The output unit 407 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, a video/audio output terminal, a vibrator, and/or a printer. The storage unit 408 may include, but is not limited to, a magnetic disk and an optical disk. The communication unit 409 allows the electronic device 400 to exchange information/data with other devices via a computer network, such as the Internet, and/or various telecommunication networks, and may include, but is not limited to, modems, network cards, infrared communication devices, wireless communication transceivers and/or chipsets, such as Bluetooth devices, WiFi devices, WiMax devices, cellular communication devices, and/or the like.
Computing unit 401 may be a variety of general and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 401 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The calculation unit 401 executes the respective methods and processes described above. For example, in some embodiments, the image processing method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 408. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 400 via the ROM 402 and/or the communication unit 409. In some embodiments, the computing unit 401 may be configured to perform the image processing method by any other suitable means (e.g., by means of firmware).
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
As used in this disclosure, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic disks, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user may provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
The previous description is only for the purpose of describing particular embodiments of the present disclosure, so as to enable those skilled in the art to understand or implement the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. An image processing method, comprising:
acquiring a target image containing a target formula;
inputting the target image into a pre-trained formula recognition model for formula recognition to obtain a first character prediction result and a first relation prediction result corresponding to each character in the target formula; wherein the formula recognition model is obtained by training based on formula sample images with labeling information, the labeling information comprising: a character sequence, a semantic feature vector and a character detection box corresponding to the formula sample;
and obtaining the recognition result of the target formula in the target image according to the first character prediction result and the first relation prediction result corresponding to each character.
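For readability, a minimal non-claim inference sketch of the flow recited in claim 1 is given below. The framework (PyTorch), function names, and output shapes are illustrative assumptions only; the claim does not prescribe any of them.

```python
# Minimal inference sketch (hypothetical names; not the patented implementation).
# Assumes a model that returns, per decoding position, a character distribution
# and a relation distribution over a fixed vocabulary of relations.
import torch

def recognize_formula(model, image: torch.Tensor, charset: list, relations: list):
    """image: (1, C, H, W) tensor of the target image containing a target formula."""
    model.eval()
    with torch.no_grad():
        char_logits, rel_logits = model(image)     # (1, T, |charset|), (1, T, |relations|)
    char_ids = char_logits.argmax(dim=-1)[0]       # first character prediction result
    rel_ids = rel_logits.argmax(dim=-1)[0]         # first relation prediction result
    # Recognition result: each character paired with its relation to the previous character
    return [(charset[c], relations[r]) for c, r in zip(char_ids.tolist(), rel_ids.tolist())]
```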
2. The method of claim 1, wherein the formula recognition model comprises: a feature mapping module, a self-attention module, a semantic extraction module and a character prediction module;
wherein inputting the target image into the pre-trained formula recognition model for formula recognition to obtain the first character prediction result and the first relation prediction result corresponding to each character in the target formula comprises:
inputting the target image into the feature mapping module, and performing feature mapping on the target image through the feature mapping module to obtain a first mapping feature corresponding to the target image;
performing feature extraction on the first mapping feature through the self-attention module to obtain a second mapping feature;
extracting semantic features of the second mapping features through the semantic extraction module to obtain a first global semantic vector;
obtaining, by the character prediction module, the first character prediction result and the first relation prediction result corresponding to each character in the target formula based on the second mapping feature and the first global semantic vector; wherein the first relation prediction result is used for indicating the relation between a character and the character at the adjacent previous position.
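A minimal PyTorch-style sketch of how the four modules recited in claim 2 could be wired in the claimed order follows. Layer choices, dimensions, and the pooling used for the global semantic vector are assumptions added for illustration and are not part of the claim.

```python
# Illustrative module layout for claim 2 (a sketch under assumed shapes).
import torch
import torch.nn as nn

class FormulaRecognizer(nn.Module):
    def __init__(self, charset_size, relation_size, d_model=256):
        super().__init__()
        # Feature mapping module: a small CNN producing a feature map
        self.feature_mapping = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, d_model, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Self-attention module over flattened spatial positions
        self.self_attention = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=2)
        # Semantic extraction module: pool the second mapping feature to one global vector
        self.semantic_extraction = nn.Linear(d_model, d_model)
        # Character prediction heads (a fuller decoder sketch accompanies claim 3)
        self.char_head = nn.Linear(2 * d_model, charset_size)
        self.rel_head = nn.Linear(2 * d_model, relation_size)

    def forward(self, image):
        f1 = self.feature_mapping(image)                  # first mapping feature
        b, c, h, w = f1.shape
        f1 = f1.flatten(2).transpose(1, 2)                # (B, H*W, d_model)
        f2 = self.self_attention(f1)                      # second mapping feature
        g = self.semantic_extraction(f2.mean(dim=1))      # first global semantic vector
        ctx = torch.cat([f2, g.unsqueeze(1).expand_as(f2)], dim=-1)
        return self.char_head(ctx), self.rel_head(ctx)    # per-position char / relation logits
```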
3. The method of claim 2, wherein the character prediction module comprises: a word embedding layer, an attention layer, and a decoder; and wherein obtaining, by the character prediction module, the first character prediction result and the first relation prediction result corresponding to each character in the target formula based on the second mapping feature and the first global semantic vector comprises:
vectorizing the second mapping feature through the word embedding layer to obtain a semantic feature of the second mapping feature;
obtaining, by the attention layer, an attention vector based on the semantic feature of the second mapping feature;
and obtaining, by the decoder, the first character prediction result and the first relation prediction result corresponding to each character in the target formula based on the attention vector and the first global semantic vector.
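One possible reading of the claim-3 character prediction module is sketched below: a word embedding layer that vectorizes the second mapping feature, an attention layer producing an attention vector, and a decoder that also consumes the first global semantic vector. The specific layer types (linear projection, multi-head attention, GRU) are assumptions, not requirements of the claim.

```python
# Non-claim sketch of the character prediction module (assumed layer choices).
import torch
import torch.nn as nn

class CharacterPredictor(nn.Module):
    def __init__(self, d_model, charset_size, relation_size):
        super().__init__()
        self.word_embedding = nn.Linear(d_model, d_model)   # word embedding layer
        self.attention = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.decoder = nn.GRU(2 * d_model, d_model, batch_first=True)
        self.char_out = nn.Linear(d_model, charset_size)
        self.rel_out = nn.Linear(d_model, relation_size)

    def forward(self, second_mapping_feature, global_semantic_vector):
        sem = self.word_embedding(second_mapping_feature)   # semantic feature of the mapping feature
        attn, _ = self.attention(sem, sem, sem)              # attention vector per position
        g = global_semantic_vector.unsqueeze(1).expand_as(attn)
        dec_out, _ = self.decoder(torch.cat([attn, g], dim=-1))
        # First character prediction result and first relation prediction result (logits)
        return self.char_out(dec_out), self.rel_out(dec_out)
```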
4. The method of claim 1, wherein the training process of the formula recognition model comprises:
acquiring a first training set and a second training set; wherein each formula sample image in the first training set is labeled with a character sequence, a semantic feature vector and a character detection box corresponding to the formula sample, and each formula sample image in the second training set is labeled with a character sequence and a semantic feature vector corresponding to the formula sample;
acquiring a first model to be trained; wherein the first model comprises: a feature mapping module, a self-attention module, a semantic extraction module, a character prediction module, a convolution module and a quantity prediction module;
training the first model according to the first training set;
after the training of the first model is finished, keeping the parameters of the convolution module unchanged to obtain a second model;
training the second model according to the second training set;
when the training of the second model converges, determining the feature mapping module, the self-attention module, the semantic extraction module and the character prediction module as the formula recognition model.
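The two-stage schedule of claim 4 could be organized as in the following sketch. Module attribute names, epoch counts, and the training helper are hypothetical; only the pattern of training, freezing the convolution module, and continuing on the second training set reflects the claim.

```python
# Non-claim sketch of the two-stage training schedule (hypothetical names).
def train_two_stage(first_model, first_loader, second_loader, train_one_epoch):
    # Stage 1: train the full first model on the fully labeled first training set
    for epoch in range(10):                          # epoch count is a placeholder
        train_one_epoch(first_model, first_loader)

    # Keep the convolution module's parameters unchanged -> the "second model"
    for p in first_model.convolution.parameters():
        p.requires_grad = False

    # Stage 2: train on the second training set (no character detection boxes labeled)
    for epoch in range(10):
        train_one_epoch(first_model, second_loader)

    # The formula recognition model is the four inference-time modules
    return (first_model.feature_mapping, first_model.self_attention,
            first_model.semantic_extraction, first_model.character_predictor)
```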
5. The method of claim 4, wherein training the first model according to the first training set comprises:
inputting a first formula sample image to the first model; wherein the first formula sample image is a formula sample image in the first training set;
outputting, by the feature mapping module, a third mapping feature corresponding to the first formula sample image;
outputting, by the self-attention module, a fourth mapping feature based on the third mapping feature;
outputting, by the semantic extraction module, a second global semantic vector based on the fourth mapped feature;
outputting, by the convolution module, a character center point for each character in a formula sample of the first formula sample image based on the fourth mapping feature;
outputting, by the quantity prediction module, the number of characters in the formula sample of the first formula sample image based on the fourth mapping feature;
outputting, by the character prediction module, a second character prediction result and a second relation prediction result corresponding to each character in formula samples of the first formula sample image based on the fourth mapping feature and the second global semantic vector;
and training the first model according to the character sequence, the semantic feature vector and the character detection box labeled on the first formula sample image for the formula sample, the second character prediction result, the second relation prediction result, the second global semantic vector, the character center points, the number of characters, and a preset loss function.
6. The method according to claim 5, wherein training the first model according to the character sequence, the semantic feature vector and the character detection box labeled on the first formula sample image for the formula sample, the second character prediction result, the second relation prediction result, the second global semantic vector, the character center points, the number of characters, and the preset loss function comprises:
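The two training-only heads named in claim 5 could take the shapes sketched below: a convolution module predicting a character-center heatmap, and a quantity prediction module classifying the number of characters. Both architectures are assumptions for illustration; the claim only names the modules and their outputs.

```python
# Non-claim sketch of the training-only heads (assumed architectures).
import torch.nn as nn

class CenterPointHead(nn.Module):            # "convolution module"
    def __init__(self, d_model):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(d_model, d_model, 3, padding=1), nn.ReLU(),
            nn.Conv2d(d_model, 1, 1), nn.Sigmoid())   # character-center heatmap in [0, 1]

    def forward(self, fourth_mapping_feature_2d):      # (B, d_model, H, W)
        return self.conv(fourth_mapping_feature_2d)

class QuantityHead(nn.Module):               # "quantity prediction module"
    def __init__(self, d_model, max_chars=128):
        super().__init__()
        self.fc = nn.Linear(d_model, max_chars + 1)     # classes 0 .. max_chars

    def forward(self, fourth_mapping_feature_2d):
        pooled = fourth_mapping_feature_2d.mean(dim=(2, 3))   # global average pooling
        return self.fc(pooled)               # logits over the number of characters
```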
calculating a first loss function value of the semantic extraction module according to the semantic feature vector labeled on the first formula sample image, the second global semantic vector, and an L1 loss function;
calculating a second loss function value of the convolution module according to the character detection box labeled on the first formula sample image for the formula sample, the character center points, and a Focal Loss classification loss function;
calculating a third loss function value of the quantity prediction module according to the character detection box labeled on the first formula sample image for the formula sample, the number of characters, and a multi-class cross-entropy loss function;
obtaining a predicted character sequence of the formula sample in the first formula sample image according to the second character prediction result and the second relation prediction result;
calculating a fourth loss function value of the character prediction module according to the predicted character sequence, the character sequence labeled on the first formula sample image for the formula sample, and the multi-class cross-entropy loss function;
training the first model according to the first loss function value, the second loss function value, the third loss function value, and the fourth loss function value.
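The four loss terms of claim 6 could be assembled as in the following sketch. Tensor shapes, dictionary keys, the CenterNet-style focal-loss variant, and the plain summation of the terms are assumptions; the claim only names the loss types (L1, Focal Loss, multi-class cross-entropy) and the modules they supervise.

```python
# Non-claim loss-assembly sketch for claim 6 (assumed shapes and weighting).
import torch
import torch.nn.functional as F

def focal_loss(pred_heatmap, gt_heatmap, alpha=2.0, beta=4.0, eps=1e-6):
    """CenterNet-style focal loss over a character-center heatmap (an assumption)."""
    pos = gt_heatmap.eq(1).float()
    neg = 1.0 - pos
    pred = pred_heatmap.clamp(eps, 1 - eps)
    pos_loss = -((1 - pred) ** alpha) * torch.log(pred) * pos
    neg_loss = -((1 - gt_heatmap) ** beta) * (pred ** alpha) * torch.log(1 - pred) * neg
    num_pos = pos.sum().clamp(min=1.0)
    return (pos_loss.sum() + neg_loss.sum()) / num_pos

def total_loss(outputs, labels):
    l1 = F.l1_loss(outputs["global_semantic"], labels["semantic_vector"])        # first loss value
    l2 = focal_loss(outputs["center_heatmap"], labels["center_heatmap"])         # second loss value
    l3 = F.cross_entropy(outputs["char_count_logits"], labels["char_count"])     # third loss value
    l4 = F.cross_entropy(                                                        # fourth loss value
        outputs["char_logits"].flatten(0, 1), labels["char_sequence"].flatten())
    return l1 + l2 + l3 + l4
```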
7. The method of claim 4, further comprising:
acquiring a second formula sample image labeled with a character sequence and character detection boxes;
and performing triplet encoding on each character in the second formula sample image according to the labeled character sequence and character detection boxes to obtain a triplet structure, wherein the triplet structure comprises: a current character, the adjacent character at the previous position corresponding to the current character, and the relation between the current character and the adjacent character;
obtaining a word embedding vector corresponding to each triplet structure by a word embedding method;
and inputting the word embedding vector corresponding to each character into a pre-trained semantic information extraction model to obtain a semantic feature vector corresponding to the character sequence of the second formula sample image.
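A non-claim sketch of the triplet encoding and word embedding of claim 7 follows. The relation-inference helper, vocabulary handling, and the way the three embeddings are combined are illustrative assumptions; the downstream semantic information extraction model is assumed to be pre-trained and is not shown.

```python
# Illustrative triplet encoding: (current character, previous adjacent character, relation).
import torch
import torch.nn as nn

def encode_triplets(chars, boxes, relation_fn):
    """chars: labeled character sequence; boxes: labeled detection boxes;
    relation_fn infers the relation (e.g. right-of, superscript) from two boxes."""
    triplets = []
    prev_char, prev_box = "<sos>", None
    for ch, box in zip(chars, boxes):
        rel = relation_fn(prev_box, box) if prev_box is not None else "<start>"
        triplets.append((ch, prev_char, rel))
        prev_char, prev_box = ch, box
    return triplets

class TripletEmbedder(nn.Module):
    def __init__(self, char_vocab, rel_vocab, dim=128):
        super().__init__()
        self.char_emb = nn.Embedding(len(char_vocab), dim)
        self.rel_emb = nn.Embedding(len(rel_vocab), dim)
        self.char_vocab, self.rel_vocab = char_vocab, rel_vocab

    def forward(self, triplets):
        ids = [(self.char_vocab[c], self.char_vocab[p], self.rel_vocab[r])
               for c, p, r in triplets]
        cur, prev, rel = (torch.tensor(col) for col in zip(*ids))
        # One word embedding vector per character's triplet, fed to the pre-trained
        # semantic information extraction model to obtain the semantic feature vector.
        return self.char_emb(cur) + self.char_emb(prev) + self.rel_emb(rel)
```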
8. An image processing apparatus characterized by comprising:
the image acquisition module is used for acquiring a target image containing a target formula;
the formula recognition module is used for inputting the target image into a pre-trained formula recognition model for formula recognition to obtain a first character prediction result and a first relation prediction result corresponding to each character in the target formula; wherein the formula recognition model is obtained by training based on formula sample images with labeling information, the labeling information comprising: a character sequence, a semantic feature vector and a character detection box corresponding to the formula sample;
and the result acquisition module is used for acquiring the recognition result of the target formula in the target image according to the first character prediction result and the first relation prediction result corresponding to each character.
9. An electronic device, characterized in that the electronic device comprises:
a processor; and
a memory for storing a program, wherein the program is stored in the memory,
wherein the program comprises instructions which, when executed by the processor, cause the processor to carry out the image processing method according to any one of claims 1 to 7.
10. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to execute the image processing method according to any one of claims 1 to 7.
CN202211032520.1A 2022-08-26 2022-08-26 Image processing method, apparatus, device and medium Pending CN115376140A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211032520.1A CN115376140A (en) 2022-08-26 2022-08-26 Image processing method, apparatus, device and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211032520.1A CN115376140A (en) 2022-08-26 2022-08-26 Image processing method, apparatus, device and medium

Publications (1)

Publication Number Publication Date
CN115376140A true CN115376140A (en) 2022-11-22

Family

ID=84068432

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211032520.1A Pending CN115376140A (en) 2022-08-26 2022-08-26 Image processing method, apparatus, device and medium

Country Status (1)

Country Link
CN (1) CN115376140A (en)

Similar Documents

Publication Publication Date Title
CN113313022B (en) Training method of character recognition model and method for recognizing characters in image
CN111488985A (en) Deep neural network model compression training method, device, equipment and medium
CN115063875B (en) Model training method, image processing method and device and electronic equipment
CN113011420B (en) Character recognition method, model training method, related device and electronic equipment
CN113205160B (en) Model training method, text recognition method, model training device, text recognition device, electronic equipment and medium
WO2019226429A1 (en) Data compression by local entropy encoding
CN113343958B (en) Text recognition method, device, equipment and medium
CN114863437B (en) Text recognition method and device, electronic equipment and storage medium
CN114022887B (en) Text recognition model training and text recognition method and device, and electronic equipment
CN113177449A (en) Face recognition method and device, computer equipment and storage medium
CN116611491A (en) Training method and device of target detection model, electronic equipment and storage medium
CN114581926A (en) Multi-line text recognition method, device, equipment and medium
CN113837965A (en) Image definition recognition method and device, electronic equipment and storage medium
CN115565186B (en) Training method and device for character recognition model, electronic equipment and storage medium
JP2023062150A (en) Character recognition model training, character recognition method, apparatus, equipment, and medium
CN116363429A (en) Training method of image recognition model, image recognition method, device and equipment
CN113963358B (en) Text recognition model training method, text recognition device and electronic equipment
CN113887535B (en) Model training method, text recognition method, device, equipment and medium
CN114973224A (en) Character recognition method and device, electronic equipment and storage medium
CN114898376A (en) Formula identification method, device, equipment and medium
CN115376140A (en) Image processing method, apparatus, device and medium
CN114707017A (en) Visual question answering method and device, electronic equipment and storage medium
CN113837157A (en) Topic type identification method, system and storage medium
CN113822275A (en) Image language identification method and related equipment thereof
CN114973229B (en) Text recognition model training, text recognition method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination