CN113052156A - Optical character recognition method, device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN113052156A
Authority
CN
China
Prior art keywords: sequence, backward, decoding result, feature, generating
Prior art date
Legal status
Granted
Application number
CN202110270866.4A
Other languages
Chinese (zh)
Other versions
CN113052156B (en)
Inventor
吴亮
刘珊珊
章成全
姚锟
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202110270866.4A
Publication of CN113052156A
Application granted
Publication of CN113052156B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T9/00 Image coding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 Salient features, e.g. scale invariant feature transforms [SIFT]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Character Discrimination (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure provides an optical character recognition method and apparatus, an electronic device, and a storage medium, and relates to the field of artificial intelligence, in particular to computer vision and deep learning. The specific implementation scheme is as follows: acquire a target formula area in a picture to be recognized; perform feature extraction on the picture in the target formula area to obtain a first feature map with a preset height, where the preset height is greater than 1; unfold the first feature map to obtain a target feature; and generate a target formula according to the target feature. The optical character recognition method and apparatus, the electronic device, and the storage medium can improve the recognition of formulas in optical character recognition scenes and better solve the problem of recognizing formulas in picture scenes.

Description

Optical character recognition method, device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer vision and deep learning in the field of artificial intelligence technologies, and in particular, to an optical character recognition method, an optical character recognition device, an electronic device, and a storage medium.
Background
With the rapid development of computer technology, the deep integration of emerging Internet technologies with teaching activities has greatly improved the working and learning efficiency of education practitioners.
In the related art, Optical Character Recognition (OCR) is mainly used for image-text recognition. The technology is generally applied to general scenes such as street-view text and photographed text, and consists of two parts, detection and recognition: a text area is first detected in the input image, and the text-area image is then fed separately into a recognition network for recognition. However, this recognition approach is not ideal when recognizing formula data that has a spatial structure.
Disclosure of Invention
An optical character recognition method, an apparatus, an electronic device, and a storage medium are provided.
According to a first aspect, there is provided an optical character recognition method comprising: acquiring a target formula area in a picture to be identified; extracting features of the pictures in the target formula area to obtain a first feature map with a preset height, wherein the preset height is greater than 1; unfolding the first feature map to obtain a target feature; and generating a target formula according to the target characteristics.
According to a second aspect, there is provided an optical character recognition apparatus comprising: the acquisition module is used for acquiring a target formula area in the picture to be identified; the extraction module is used for extracting features of the pictures in the target formula area to obtain a first feature map with a preset height, and the preset height is greater than 1; the unfolding module is used for unfolding the first feature map to obtain a target feature; and the generating module is used for generating a target formula according to the target characteristics.
According to a third aspect, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of optical character recognition according to the first aspect of the present disclosure.
According to a fourth aspect, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the optical character recognition method according to the first aspect of the present disclosure.
According to a fifth aspect, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the optical character recognition method according to the first aspect of the present disclosure.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic flow chart diagram of an optical character recognition method according to a first embodiment of the present disclosure;
FIG. 2 is a schematic flow chart diagram of an optical character recognition method according to a second embodiment of the present disclosure;
FIG. 3 is a schematic flow chart diagram of an optical character recognition method according to a third embodiment of the present disclosure;
FIG. 4 is a schematic flow chart diagram of an optical character recognition method according to a fourth embodiment of the present disclosure;
FIG. 5 is a schematic flow chart diagram of an optical character recognition method according to a fifth embodiment of the present disclosure;
FIG. 6 is a schematic flow chart diagram of an optical character recognition method according to a sixth embodiment of the present disclosure;
FIG. 7 is a schematic flow chart diagram of an optical character recognition method according to a seventh embodiment of the present disclosure;
FIG. 8 is a schematic flow chart diagram of an optical character recognition method according to an eighth embodiment of the present disclosure;
FIG. 9 is a schematic flow chart diagram of an optical character recognition method according to a ninth embodiment of the present disclosure;
FIG. 10 is a diagram illustrating a detection stage of the optical character recognition method according to the embodiment of the disclosure;
FIG. 11 is a diagram illustrating a recognition stage in the optical character recognition method according to an embodiment of the present disclosure;
FIG. 12 is a block diagram of an optical character recognition device according to a first embodiment of the present disclosure;
FIG. 13 is a block diagram of an optical character recognition device according to a second embodiment of the present disclosure;
FIG. 14 is a block diagram of an electronic device for implementing the optical character recognition method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Artificial intelligence (AI) is a technical science that studies and develops theories, methods, techniques, and application systems for simulating, extending, and expanding human intelligence. At present, AI technology has the advantages of a high degree of automation, high accuracy, and low cost, and is widely applied.
Computer vision (also known as machine vision) is the simulation of biological vision using a computer and related equipment: a camera and a computer take the place of human eyes to identify, track, and measure targets, and further image processing turns the result into images that are better suited for human observation or for transmission to instruments for detection.
Deep learning (DL) is a new research direction in the field of machine learning (ML). It learns the intrinsic rules and representation levels of sample data, and the information obtained during learning helps interpret data such as text, images, and sound. Its ultimate goal is to give machines the same analytical and learning ability as humans, so that they can recognize data such as characters, images, and sounds. In terms of specific research content, it mainly includes neural network systems based on convolution operations, namely convolutional neural networks; self-encoding neural networks based on multilayer neurons; and deep belief networks that are pre-trained as multilayer self-encoding neural networks and then further optimize the network weights by combining discriminative information. Deep learning has achieved many results in search technology, data mining, machine learning, machine translation, natural language processing, multimedia learning, speech, recommendation and personalization technologies, and other related fields. Deep learning enables machines to imitate human activities such as seeing, hearing, and thinking, solves many complex pattern recognition problems, and has driven great progress in artificial-intelligence-related technologies.
An optical character recognition method, an apparatus, an electronic device, and a storage medium according to embodiments of the present disclosure are described below with reference to the accompanying drawings.
Fig. 1 is a schematic flow chart of an optical character recognition method according to a first embodiment of the present disclosure.
As shown in fig. 1, the optical character recognition method according to the embodiment of the present disclosure may specifically include the following steps:
s101, acquiring a target formula area in the picture to be identified.
Specifically, the executing body of the optical character recognition method according to the embodiment of the present disclosure may be the optical character recognition apparatus provided in the embodiment of the present disclosure, and the optical character recognition apparatus may be a hardware device having a data information processing capability and/or necessary software for driving the hardware device to operate. Alternatively, the execution body may include a workstation, a server, a computer, a user terminal, and other devices. The user terminal includes, but is not limited to, a mobile phone, a computer, an intelligent voice interaction device, an intelligent household appliance, a vehicle-mounted terminal, and the like.
In the embodiment of the present disclosure, the picture to be recognized may be a three-channel picture of a test paper, an exercise book, or the like obtained with a camera or an electronic device equipped with an image-capturing apparatus, and the picture may include, but is not limited to, at least one of a formula, text, and the like. The embodiment of the disclosure simply divides the content displayed in the picture into non-formula content, printed formula content, and handwritten formula content. The target formula area in the picture to be recognized is obtained by detecting the picture, extracting features, and processing the feature map. The target formula area is the area of the picture to be recognized in which formula content is detected and determined.
S102, extracting the features of the picture in the target formula area to obtain a first feature map with the preset height, wherein the preset height is larger than 1.
Specifically, according to the target formula area obtained in step S101, the image within the target formula area of the picture to be recognized is obtained. The picture in the target formula area is input into a deep learning network for feature extraction to obtain a first feature map with a preset height. In order to preserve the image information along the vertical axis, the extracted features are kept at a certain height rather than being directly compressed into a 1-dimensional sequence; that is, the preset height is set to a value greater than 1. For example, a 512 x 64 picture in the target formula area may be input into a convolutional neural network (CNN) to extract its features, yielding a first feature map of size 32 x 4, where the preset height is 4.
And S103, unfolding the first feature map to obtain the target feature.
Specifically, the first feature map with height greater than 1 obtained in step S102 for the picture in the target formula area is unfolded to obtain the target feature. For example, a first feature map of size 32 x 4 is unfolded into a target feature sequence of length 32 x 4 = 128.
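Purely for illustration, the following PyTorch-style sketch (the framework, layer configuration, and channel count are assumptions not specified by the disclosure) shows a backbone that downsamples a 512 x 64 formula crop to a 32 x 4 feature map and then unfolds that map into a sequence of 128 feature vectors, as in the example above:

```python
import torch
import torch.nn as nn

class FormulaEncoderBackbone(nn.Module):
    """Toy backbone: downsample a 512x64 formula crop to a 32x4 feature map.

    The exact architecture is an assumption; the point is that the output
    height is kept at 4 (greater than 1) instead of being pooled away.
    """
    def __init__(self, out_channels: int = 256):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),            # 512x64 -> 256x32
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),          # 256x32 -> 128x16
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(),         # 128x16 -> 64x8
            nn.Conv2d(256, out_channels, 3, stride=2, padding=1), nn.ReLU() # 64x8  -> 32x4
        )

    def forward(self, crop: torch.Tensor) -> torch.Tensor:
        # crop: (N, 3, 64, 512) -> feature map (N, C, 4, 32)
        fmap = self.features(crop)
        # Unfold the 32x4 map into a sequence of 32*4 = 128 feature vectors,
        # preserving the vertical-axis information instead of squeezing it away.
        n, c, h, w = fmap.shape
        seq = fmap.permute(0, 3, 2, 1).reshape(n, w * h, c)  # (N, 128, C)
        return seq

if __name__ == "__main__":
    seq = FormulaEncoderBackbone()(torch.randn(1, 3, 64, 512))
    print(seq.shape)  # torch.Size([1, 128, 256])
```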
And S104, generating a target formula according to the target characteristics.
Specifically, the target formula is the recognition result of the formula in the picture in the target formula area. The target formula is generated through operations such as encoding and decoding according to the target feature corresponding to the picture in the target formula area obtained in step S103.
It should be noted that, as those skilled in the art can understand, the optical character recognition method of the embodiment of the present disclosure can effectively recognize printed or handwritten formulas on test papers with good accuracy. The scheme can be applied directly to, for example, photographing a question to search for it: recognizing a mathematical, physical, or chemical formula and searching the question bank with it is more efficient than searching by image directly. The method can also be used for intelligent grading: the question-stem information is recognized to determine the answer, which is then compared with a student's handwritten answer for scoring, greatly relieving the current pressure on teachers.
In summary, in the optical character recognition method according to the embodiment of the present disclosure, a target formula area in a picture to be recognized is obtained, feature extraction is performed on the picture in the target formula area to obtain a first feature map with a preset height greater than 1, the first feature map is unfolded to obtain a target feature, and a target formula is generated according to the target feature. By determining the target formula area and obtaining a first feature map with a height greater than 1 from the picture in that area, the picture information along the vertical axis is preserved, the recognition of formulas in optical character recognition scenes is improved, and the problem of recognizing formulas in picture scenes is better solved.
Fig. 2 is a schematic flow chart of an optical character recognition method according to a second embodiment of the present disclosure.
As shown in fig. 2, on the basis of the embodiment shown in fig. 1, the optical character recognition method according to the embodiment of the present disclosure may specifically include the following steps:
the step S101 of acquiring the target formula area in the picture to be recognized in the above embodiment may specifically include the following steps S201 to S204.
S201, generating a binary image according to the image to be identified, wherein the binary image comprises a character area and a formula area.
Specifically, in the embodiment of the present disclosure, when the formula is detected, a binary picture is input in addition to the original three-channel picture to be recognized. In the embodiment of the present disclosure, the binary picture may be regarded as a two-dimensional matrix composed only of the values 0 and 1 whose size is consistent with the picture to be recognized: areas where text exists are set to 1 and areas without text are set to 0, and the text positions may be obtained from single-character position labels or from general text recognition. The binary picture is input into the detection network as a fourth channel, adding character-position information that helps the detection network distinguish text areas from formula areas and prevents an isolated letter from being falsely detected as a formula.
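A minimal sketch of assembling such a four-channel detector input is given below; the box format, normalization, and function name are illustrative assumptions:

```python
import numpy as np

def build_detector_input(image_rgb: np.ndarray, char_boxes) -> np.ndarray:
    """Stack a binary text-position mask onto the RGB picture as a 4th channel.

    image_rgb:  (H, W, 3) uint8 picture to be recognized.
    char_boxes: iterable of (x1, y1, x2, y2) boxes where text is known to exist,
                e.g. from single-character labels or a general text recognizer.
    """
    h, w, _ = image_rgb.shape
    mask = np.zeros((h, w), dtype=np.float32)      # 0 = no text
    for x1, y1, x2, y2 in char_boxes:
        mask[y1:y2, x1:x2] = 1.0                   # 1 = text present
    image = image_rgb.astype(np.float32) / 255.0
    return np.concatenate([image, mask[..., None]], axis=-1)   # (H, W, 4)
```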
S202, performing feature extraction on the picture to be recognized and the binary picture to obtain a second feature map.
Specifically, the picture to be recognized and the binary picture may be input into a convolutional neural network (CNN) for feature extraction. In the embodiment of the disclosure, a UNet-style convolutional neural network is used to extract image features; during extraction, multiple sampling operations are performed on the input three-channel picture and the binary picture to obtain the second feature map.
And S203, generating a score feature map and an offset feature map according to the second feature map.
Specifically, a score feature map (score map) and an offset feature map (geometry map) are generated from the second feature map obtained in step S202. Each pixel location on the score feature map and the offset feature map is regarded as a region. The score feature map is used to determine whether text exists in the corresponding region and, specifically, whether the region is a non-formula region, a printed formula region, or a handwritten formula region. The offset feature map is used to determine the character offsets of the corresponding region, specifically the offset of the bounding box of the target formula relative to that region.
And S204, performing non-maximum suppression processing on the score feature map and the offset feature map to obtain a target formula area.
Specifically, in the embodiment of the present disclosure, the score feature map and the offset feature map may be represented as multi-dimensional matrices whose dimensions are the length, the width, and the number of channels of the feature map. The score feature map has 3 channels, representing the probabilities that a given pixel in the feature map belongs to a non-formula, a printed formula, or a handwritten formula; the type of the pixel's region is determined by comparing the channel values. The offset feature map has 5 channels, representing the distances from a given pixel to the top, bottom, left, and right of the bounding box of the region where the target formula is located, together with the rotation angle of the box. Non-maximum suppression (NMS) is applied to the score feature map and the offset feature map to obtain the bounding-box coordinates of the formula, that is, the target formula area. Non-maximum suppression is a post-processing technique commonly used in text detection that filters out duplicate boxes; the specific process is not described again here.
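Purely as an illustration of how such maps could be post-processed (the map layout, stride, thresholds, and the simplified axis-aligned NMS are assumptions; the rotation-angle channel is ignored here for brevity), a sketch is given below:

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2, score, cls) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-6)

def nms(boxes, iou_thresh=0.2):
    """Greedy non-maximum suppression: keep the highest-scoring non-overlapping boxes."""
    kept = []
    for b in sorted(boxes, key=lambda x: x[4], reverse=True):
        if all(iou(b, k) < iou_thresh for k in kept):
            kept.append(b)
    return kept

def decode_formula_boxes(score_map, geo_map, score_thresh=0.8, stride=4):
    """Turn per-pixel predictions into candidate formula boxes.

    score_map: (H, W, 3) probabilities for [non-formula, printed formula, handwritten formula].
    geo_map:   (H, W, 5) distances to the top/bottom/left/right of the bounding box
               plus a rotation angle, predicted at every pixel.
    stride:    ratio between input-image and feature-map resolution (an assumption).
    """
    cls_map = score_map.argmax(axis=-1)               # per-pixel region type
    formula_score = score_map[..., 1:].max(axis=-1)   # confidence of being some formula
    boxes = []
    ys, xs = np.where((cls_map > 0) & (formula_score > score_thresh))
    for y, x in zip(ys, xs):
        top, bottom, left, right, _angle = geo_map[y, x]
        cx, cy = x * stride, y * stride
        boxes.append((cx - left, cy - top, cx + right, cy + bottom,
                      float(formula_score[y, x]), int(cls_map[y, x])))
    return nms(boxes)
```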
S205, extracting features of the picture in the target formula area to obtain a first feature map with a preset height, wherein the preset height is larger than 1.
And S206, unfolding the first feature map to obtain the target feature.
And S207, generating a target formula according to the target characteristics.
Specifically, steps S205 to S207 in this embodiment are the same as steps S102 to S104 in the above embodiment, and are not described again here.
Further, as shown in fig. 3, the step S207 "generating the target formula according to the target feature" in the embodiment shown in fig. 2 may specifically include the following steps:
s301, generating a sequence feature sequence and a reverse sequence feature sequence according to the target features.
Specifically, the target feature obtained in step S206 is the feature sequence obtained by unfolding the first feature map. The feature sequence in forward order is the sequential feature sequence, and the feature sequence in reverse order is the reverse feature sequence.
S302, generating a forward coding characteristic sequence and a backward coding characteristic sequence according to the sequence characteristic sequence and the reverse sequence characteristic sequence.
Specifically, the sequence feature sequence and the reverse sequence feature sequence generated in step S301 are encoded, so that the information on the width of the first feature map is fully fused, and a forward encoded feature sequence fw and a backward encoded feature sequence bw are obtained.
S303, generating a forward decoding result and a backward decoding result according to the forward coding characteristic sequence and the backward coding characteristic sequence.
Specifically, the forward encoding characteristic sequence and the backward encoding characteristic sequence generated in step S302 are subjected to bidirectional decoding processing, so as to obtain a forward decoding result and a backward decoding result.
S304, the forward decoding result and the backward decoding result are fused to obtain a target formula.
Specifically, the forward decoding result and the backward decoding result generated in step S303 are fused to obtain a target formula, for example, the choice of the character is determined according to the confidence of the character at each character position in the forward decoding result and the backward decoding result, so as to obtain the target formula.
Further, as shown in fig. 4, the step S302 "generating the forward encoding feature sequence and the backward encoding feature sequence according to the sequential feature sequence and the reverse sequential feature sequence" in the embodiment shown in fig. 3 may specifically include the following steps:
S401, inputting the sequence characteristic sequence and the reverse sequence characteristic sequence respectively into a gated recurrent unit network to obtain a sequence coding characteristic sequence and a reverse sequence coding characteristic sequence.
Specifically, the sequence characteristic sequence and the reverse sequence characteristic sequence are respectively input into a Gated Recurrent Unit (GRU) network for encoding to obtain a sequence coding characteristic sequence and a reverse sequence coding characteristic sequence.
S402, inputting the sequence coding characteristic sequence and the reverse sequence coding characteristic sequence into a full-connection network respectively to obtain a forward coding characteristic sequence and a backward coding characteristic sequence.
Specifically, the sequential coding feature sequence and the reverse coding feature sequence obtained in step S401 are respectively input to a Fully Connected (FC for short) network to obtain a forward coding feature sequence fw and a backward coding feature sequence bw.
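The following sketch roughly illustrates steps S401 and S402; the feature dimensions are illustrative, and whether the two directions share the GRU and FC weights is not specified by the disclosure (shared weights are assumed here):

```python
import torch
import torch.nn as nn

class BidirectionalSequenceEncoder(nn.Module):
    """Encode the sequential features and their reverse with a GRU, then an FC layer."""
    def __init__(self, feat_dim: int = 256, hidden: int = 256):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, hidden)

    def forward(self, seq: torch.Tensor):
        # seq: (N, T, feat_dim) target features obtained by unfolding the first feature map
        rev = torch.flip(seq, dims=[1])   # reverse-order feature sequence
        fw_enc, _ = self.gru(seq)         # sequential coding feature sequence
        bw_enc, _ = self.gru(rev)         # reverse coding feature sequence
        fw = self.fc(fw_enc)              # forward coding feature sequence fw
        bw = self.fc(bw_enc)              # backward coding feature sequence bw
        return fw, bw
```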
Further, as shown in fig. 5, the step S303 "generating a forward decoding result and a backward decoding result according to the forward encoding feature sequence and the backward encoding feature sequence" in the embodiment shown in fig. 3 may specifically include the following steps:
and S501, generating a forward attention map and a backward attention map according to the forward coding characteristic sequence and the backward coding characteristic sequence.
And S502, respectively decoding the forward attention map and the backward attention map to obtain a forward decoding result and a backward decoding result.
Further, as shown in fig. 6, the step S501 "generating a forward attention map and a backward attention map according to the forward coded signature sequence and the backward coded signature sequence" in the embodiment shown in fig. 5 may specifically include the following steps:
s601, generating a hidden state vector according to the forward coding characteristic sequence and the backward coding characteristic sequence.
Specifically, the forward coding feature sequence fw and the backward coding feature sequence bw are concatenated along the feature dimension and then input into a fully connected (FC) layer to compute the hidden state vector F.
And S602, generating a forward attention map according to the forward coding feature sequence and the hidden state vector.
Specifically, the forward coding feature sequence fw is input into a fully connected (FC) layer to obtain a feature encoding sequence with the same dimension as the hidden state vector F. The feature encoding sequence and the hidden state vector F then go through a series of calculations including direct addition, tanh, full connection, and softmax (normalization) to obtain the forward attention map. The forward attention map is two-dimensional: one dimension represents the length of the input sequence during decoding, and the other represents the length of the output sequence. The forward attention map stores the influence weights of all input-sequence states at each decoding time (at each position of the output sequence). The formulas are expressed as:
e_{i,j} = U tanh(W s_{i-1} + V f_j)    (1)
alpha_{i,j} = exp(e_{i,j}) / sum_{j'} exp(e_{i,j'})    (2)
In the above formulas, U, W, and V denote fully connected computations, j denotes a position of the input sequence, and i denotes a time step of the output. Formula (1) expresses the relation between the i-th time step of the output sequence and the j-th feature of the input sequence; s_{i-1} denotes the hidden state at time i-1 (its initial value is converted from fw, and subsequent values are computed by the decoding process), and f_j is the feature at position j of F. Formula (2) is the softmax calculation; alpha_{i,j} is the forward attention map and represents the weight that the j-th feature of the input sequence contributes to the decoding at the i-th time step.
And S603, generating a backward attention map according to the backward coding feature sequence and the hidden state vector.
Specifically, the specific process of this step is similar to the specific process of step S602, and is not described here again.
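As a rough illustration of equations (1) and (2), the sketch below computes one row of an attention map for a single decoding step; the layer sizes are assumptions, and the same module would be instantiated separately for the forward and backward directions:

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """One attention step: e_{i,j} = U tanh(W s_{i-1} + V f_j), alpha = softmax_j(e)."""
    def __init__(self, state_dim: int = 256, feat_dim: int = 256, attn_dim: int = 256):
        super().__init__()
        self.W = nn.Linear(state_dim, attn_dim, bias=False)
        self.V = nn.Linear(feat_dim, attn_dim, bias=False)
        self.U = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, s_prev: torch.Tensor, F: torch.Tensor) -> torch.Tensor:
        # s_prev: (N, state_dim) hidden state at decoding step i-1
        # F:      (N, T, feat_dim) hidden state vector over the input sequence
        e = self.U(torch.tanh(self.W(s_prev).unsqueeze(1) + self.V(F)))  # (N, T, 1)
        alpha = torch.softmax(e.squeeze(-1), dim=1)                      # (N, T)
        return alpha  # one row of the attention map, at decoding step i
```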
Further, as shown in fig. 7, the step S502 "decoding the forward attention map and the backward attention map respectively to obtain a forward decoding result and a backward decoding result" in the embodiment shown in fig. 5 may specifically include the following steps:
and S701, calculating to obtain an output result of the current time step according to the output result of the last time step in the decoding process, the hidden state of the last time step in the hidden state vector and the weight value of the current time step in the forward attention map, and concatenating the output results of each time step to obtain a forward decoding result.
Specifically, to calculate the output of the i-th time step, the influence of the whole hidden state vector F on the decoding is first computed:
c_i = sum_j alpha_{i,j} f_j    (3)
c_i is the picture feature taken into account when decoding the output of the i-th time step. It is combined with the output y_{i-1} and the hidden state s_{i-1} of the (i-1)-th time step:
s_i = tanh(A[y_{i-1}, s_{i-1}, c_i] + b)    (4)
This gives the hidden state s_i of the i-th time step, and the output y_i of the i-th time step is then obtained through a fully connected layer.
S702, calculating the output result of the current time step according to the output result of the previous time step in the decoding process, the hidden state of the previous time step in the hidden state vector, and the weight values of the current time step in the backward attention map, and concatenating the output results of all time steps to obtain a backward decoding result.
Specifically, the specific process of this step is similar to the specific process of step S701, and is not described herein again.
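For illustration, the sketch below performs one decoding time step following equations (3) and (4); the embedding of the previous output, the layer sizes, and the vocabulary size are assumptions, and the backward direction is symmetric:

```python
import torch
import torch.nn as nn

class AttentionDecoderStep(nn.Module):
    """One decoding time step: c_i = sum_j alpha_{i,j} f_j, s_i = tanh(A[y_{i-1}, s_{i-1}, c_i] + b)."""
    def __init__(self, vocab: int = 128, emb: int = 64, state_dim: int = 256, feat_dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)                       # previous output as a vector (assumption)
        self.A = nn.Linear(emb + state_dim + feat_dim, state_dim)   # A[.] + b in equation (4)
        self.out = nn.Linear(state_dim, vocab)                      # final fully connected layer -> y_i

    def forward(self, y_prev, s_prev, alpha, F):
        # y_prev: (N,) previous output token, s_prev: (N, state_dim)
        # alpha:  (N, T) attention weights for this step, F: (N, T, feat_dim)
        c = torch.bmm(alpha.unsqueeze(1), F).squeeze(1)             # equation (3): context c_i
        s = torch.tanh(self.A(torch.cat([self.embed(y_prev), s_prev, c], dim=-1)))  # equation (4)
        logits = self.out(s)
        return logits.argmax(dim=-1), s, logits                     # output y_i, new hidden state s_i
```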
Further, as shown in fig. 8, the step S304 "fusing the forward decoding result and the backward decoding result to obtain the target formula" in the embodiment shown in fig. 3 may specifically include the following steps:
s801, acquiring the character corresponding to the editing operation and the editing operation required when the editing distance between the forward decoding result and the backward decoding result is minimum.
Specifically, the edit distance is the minimum number of character-level operations, namely insertion, replacement, and deletion, needed to make two character strings equal; it is usually computed programmatically by dynamic programming. While computing the edit distance, the process by which the forward (or backward) decoding result is turned into the backward (or forward) decoding result through insertion, replacement, and deletion is recorded, including the editing operations and the characters corresponding to those operations.
And S802, fusing the forward decoding result and the backward decoding result according to the editing operation and the confidence coefficient of the character corresponding to the editing operation to obtain a target formula.
Specifically, the characters corresponding to the editing operations are handled according to the editing operations and the confidences of those characters following certain rules, so that the forward decoding result and the backward decoding result are fused to obtain the target formula.
Further, as shown in fig. 9, the step S802 "fusing the forward decoding result and the backward decoding result according to the editing operation and the confidence of the character corresponding to the editing operation to obtain the target formula" in the embodiment shown in fig. 8 may specifically include the following steps:
S901, if the editing operation is an insertion operation and the confidence of the character corresponding to the insertion operation is greater than the average confidence of the forward decoding result or the backward decoding result, the character corresponding to the insertion operation is written into the target formula; otherwise, it is not written into the target formula.
S902, if the editing operation is a deletion operation and the confidence of the character corresponding to the deletion operation is not less than the average confidence of the forward decoding result or the backward decoding result, or not less than a preset confidence threshold, the character corresponding to the deletion operation is written into the target formula; otherwise, it is not written into the target formula.
S903, if the editing operation is a replacement operation, the character with the higher confidence of the two characters corresponding to the replacement operation is written into the target formula.
S904, characters that do not correspond to any editing operation are written into the target formula.
Specifically, characters that are not edited are written into the new character string as they are to obtain the target formula.
For example, suppose the forward string in the forward decoding result is nappy, with character confidences 0.7, 0.8, 0.9, 0.8, and 0.9, and the backward string in the backward decoding result is hopply, with character confidences 0.9, 0.7, 0.8, 0.8, 0.6, and 0.9. The minimum sequence of operations that turns the backward string into the forward string is to replace h with n, replace o with a, and delete l, that is, 3 steps. The new character string is then obtained by judging these 3 steps:
(1) Replacement between h and n: the confidence of h in the backward string is 0.9 and the confidence of n in the forward string is 0.7; since 0.9 > 0.7, this position of the new string is set to h.
(2) Similarly, since the confidence of o in the backward string is 0.7 and the confidence of a in the forward string is 0.8, and 0.8 > 0.7, this position of the new string is set to a.
(3) Deletion of l: since the average confidence of the six characters in the backward string is 0.78, the confidence of l in the backward string is 0.6, and 0.6 < 0.78 and 0.6 < 0.7 (the preset confidence threshold), l is not written into the new string. The other positions are consistent with the original strings, so the new string obtained by the final fusion is happy.
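The sketch below reproduces this fusion procedure on the example strings; the edit-distance backtrace and the 0.7 confidence threshold follow the description above, while the handling of the insertion case (comparing against the average confidence of the forward string) is one possible reading of step S901:

```python
def edit_ops(src, dst):
    """Minimum-edit-distance alignment of src into dst with a backtrace of operations.

    Returns a list of (op, i, j) with op in {"match", "replace", "insert", "delete"},
    where i indexes src and j indexes dst.
    """
    n, m = len(src), len(dst)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if src[i - 1] == dst[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (0 if src[i - 1] == dst[j - 1] else 1):
            ops.append(("match" if src[i - 1] == dst[j - 1] else "replace", i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append(("delete", i - 1, j))     # character present only in src
            i -= 1
        else:
            ops.append(("insert", i, j - 1))     # character present only in dst
            j -= 1
    return list(reversed(ops))


def fuse(bw, bw_conf, fw, fw_conf, conf_thresh=0.7):
    """Fuse backward and forward decoding results by the confidence rules of S901-S904."""
    avg_bw = sum(bw_conf) / len(bw_conf)
    avg_fw = sum(fw_conf) / len(fw_conf)
    fused = []
    for op, i, j in edit_ops(bw, fw):
        if op == "match":
            fused.append(bw[i])
        elif op == "replace":                    # keep the higher-confidence character
            fused.append(bw[i] if bw_conf[i] > fw_conf[j] else fw[j])
        elif op == "delete":                     # character only in the backward string
            if bw_conf[i] >= avg_bw or bw_conf[i] >= conf_thresh:
                fused.append(bw[i])
        else:                                    # insert: character only in the forward string
            if fw_conf[j] > avg_fw:
                fused.append(fw[j])
    return "".join(fused)


print(fuse("hopply", [0.9, 0.7, 0.8, 0.8, 0.6, 0.9],
           "nappy",  [0.7, 0.8, 0.9, 0.8, 0.9]))   # -> "happy"
```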
In summary, in the optical character recognition method according to the embodiment of the present disclosure, a target formula area in a picture to be recognized is obtained, feature extraction is performed on the picture in the target formula area to obtain a first feature map with a preset height greater than 1, the first feature map is unfolded to obtain a target feature, and a target formula is generated according to the target feature. Adding the binary-picture channel allows the target formula area to be located more accurately; obtaining a first feature map with a height greater than 1 from the picture in the target formula area preserves the picture information along the vertical axis; and the bidirectional decoding method avoids the problem in unidirectional decoding where an error in one character prediction affects the predictions of the subsequent characters. This reduces the probability of false detections and missed detections when recognizing formulas in an optical character recognition scene, improves the recognition of formulas, and better solves the problem of recognizing formulas in picture scenes.
In order to clearly illustrate the optical character recognition method according to the embodiment of the present disclosure, the optical character recognition method according to the embodiment of the present disclosure is described in detail below with reference to fig. 10 to 11.
Fig. 10 is a schematic diagram of the detection stage of the optical character recognition method according to the embodiment of the disclosure. As shown in fig. 10, in the detection stage, the picture to be recognized and the binary picture generated from it are input into a convolutional neural network (CNN) to obtain the second feature map; the second feature map is used to predict the score feature map and the offset feature map; and non-maximum suppression (NMS) is applied to the score feature map and the offset feature map to obtain the target formula region.
Fig. 11 is a schematic diagram of the recognition stage of the optical character recognition method according to the embodiment of the disclosure. As shown in fig. 11, in the recognition stage, the picture in the target formula region obtained in the detection stage is input into a convolutional neural network (CNN) to obtain the first feature map, and the first feature map is unfolded to obtain the sequential feature sequence. The sequential feature sequence and the reverse feature sequence derived from it are input into a gated recurrent unit (GRU) network to obtain the sequential encoded feature sequence and the reverse encoded feature sequence, which then pass through a fully connected network to generate the forward coding feature sequence fw and the backward coding feature sequence bw. The forward attention map is generated from the forward coding feature sequence fw and the hidden state vector F, and the backward attention map is generated from the backward coding feature sequence bw and the hidden state vector F. The forward attention map and the backward attention map are decoded respectively to obtain a forward decoding result and a backward decoding result, and the two decoding results are fused to obtain the target formula.
Fig. 12 is a block diagram of an optical character recognition apparatus according to a first embodiment of the present disclosure.
As shown in fig. 12, the optical character recognition apparatus 1200 according to the embodiment of the present disclosure includes: an acquisition module 1201, an extraction module 1202, an expansion module 1203, and a generation module 1204.
An obtaining module 1201, configured to obtain a target formula area in a picture to be identified.
The extraction module 1202 is configured to perform feature extraction on the picture in the target formula region to obtain a first feature map with a preset height, where the preset height is greater than 1.
An unfolding module 1203 is configured to unfold the first feature map to obtain a target feature.
A generating module 1204, configured to generate a target formula according to the target feature.
It should be noted that the above explanation of the embodiment of the optical character recognition method is also applicable to the optical character recognition apparatus in the embodiment of the present disclosure, and the specific process is not repeated here.
In summary, the optical character recognition device according to the embodiment of the present disclosure obtains a target formula area in a picture to be recognized, performs feature extraction on the picture in the target formula area to obtain a first feature map with a preset height greater than 1, unfolds the first feature map to obtain a target feature, and generates a target formula according to the target feature. By determining the target formula area and obtaining a first feature map with a height greater than 1 from the picture in that area, the picture information along the vertical axis is preserved, the recognition of formulas in optical character recognition scenes is improved, and the problem of recognizing formulas in picture scenes is better solved.
Fig. 13 is a block diagram of an optical character recognition apparatus according to a second embodiment of the present disclosure.
As shown in fig. 13, the optical character recognition apparatus 1300 according to the embodiment of the disclosure may specifically include: an acquisition module 1301, an extraction module 1302, an expansion module 1303, and a generation module 1304.
The obtaining module 1301 has the same function and structure as the obtaining module 1201 in the foregoing embodiment, the extracting module 1302 has the same function and structure as the extracting module 1202 in the foregoing embodiment, the expanding module 1303 has the same function and structure as the expanding module 1203 in the foregoing embodiment, and the generating module 1304 has the same function and structure as the generating module 1204 in the foregoing embodiment.
The obtaining module 1301 may specifically include: a fourth generation submodule 1305, an extraction submodule 1306, a fifth generation submodule 1307, and a processing submodule 1308.
The fourth generating sub-module 1305 is configured to generate a binary picture according to the picture to be identified, where the binary picture includes a text region and a formula region.
And the extraction submodule 1306 is used for performing feature extraction on the picture to be identified and the binary picture to obtain a second feature map.
And a fifth generating submodule 1307 for generating a score feature map and an offset feature map according to the second feature map.
The processing submodule 1308 is configured to perform non-maximum suppression processing on the score feature map and the offset feature map to obtain a target formula area.
Further, the generating module 1304 may specifically include: the first generation submodule is used for generating a sequence characteristic sequence and a reverse sequence characteristic sequence according to the target characteristic; the second generation submodule is used for generating a forward coding characteristic sequence and a backward coding characteristic sequence according to the sequence characteristic sequence and the reverse sequence characteristic sequence; the third generation submodule is used for generating a forward decoding result and a backward decoding result according to the forward coding characteristic sequence and the backward coding characteristic sequence; and the fusion submodule is used for fusing the forward decoding result and the backward decoding result to obtain a target formula.
Further, the second generation submodule may specifically include: the first input unit is used for respectively inputting the sequence characteristic sequence and the reverse sequence characteristic sequence into the gated recurrent unit network to obtain a sequence coding characteristic sequence and a reverse sequence coding characteristic sequence; and the second input unit is used for respectively inputting the sequence coding characteristic sequence and the reverse sequence coding characteristic sequence into the full-connection network to obtain a forward coding characteristic sequence and a backward coding characteristic sequence.
Further, the third generation sub-module may specifically include: a generating unit, configured to generate a forward attention map and a backward attention map according to the forward coding feature sequence and the backward coding feature sequence; and the decoding unit is used for decoding the forward attention drawing and the backward attention drawing respectively to obtain a forward decoding result and a backward decoding result.
Further, the generating unit may specifically include: the first generating subunit is used for generating a hidden state vector according to the forward coding characteristic sequence and the backward coding characteristic sequence; a second generating subunit, configured to generate a forward attention map according to the forward coding feature sequence and the hidden state vector; and a third generating subunit, configured to generate a backward attention map according to the backward encoded feature sequence and the hidden state vector.
Further, the decoding unit may specifically include: the first calculating subunit is used for calculating the output result of the current time step according to the output result of the previous time step in the decoding process, the hidden state of the previous time step in the hidden state vector, and the weight value of the current time step in the forward attention map, and concatenating the output results of all time steps to obtain a forward decoding result; and the second calculating subunit is used for calculating the output result of the current time step according to the output result of the previous time step in the decoding process, the hidden state of the previous time step in the hidden state vector, and the weight value of the current time step in the backward attention map, and concatenating the output results of all time steps to obtain a backward decoding result.
Further, the fusion submodule may specifically include: the device comprises an acquisition unit, a decoding unit and a processing unit, wherein the acquisition unit is used for acquiring the editing operation needed when the editing distance between the forward decoding result and the backward decoding result is minimum and the character corresponding to the editing operation; and the fusion unit is used for fusing the forward decoding result and the backward decoding result according to the editing operation and the confidence coefficient of the character corresponding to the editing operation to obtain the target formula.
Further, the fusion unit may specifically include: the first writing word unit is used for writing the characters corresponding to the inserting operation into a target formula if the editing operation is the inserting operation and the confidence coefficient of the characters corresponding to the inserting operation is greater than the average confidence coefficient of the forward decoding result or the backward decoding result; the second writing word unit is used for writing the characters corresponding to the deletion operation into the target formula if the editing operation is the deletion operation and the confidence coefficient of the characters corresponding to the deletion operation is not less than the average confidence coefficient of the forward decoding result or the backward decoding result or not less than a preset confidence coefficient threshold; the third writing word unit is used for writing the character with higher reliability in the two characters corresponding to the replacement operation into the target formula if the editing operation is the replacement operation; and a fourth write word unit for writing characters not corresponding to the editing operation into the target formula.
It should be noted that the above explanation of the embodiment of the optical character recognition method is also applicable to the optical character recognition apparatus in the embodiment of the present disclosure, and the specific process is not repeated here.
In summary, the optical character recognition device according to the embodiment of the present disclosure obtains a target formula area in a picture to be recognized, performs feature extraction on the picture in the target formula area to obtain a first feature map with a preset height greater than 1, unfolds the first feature map to obtain a target feature, and generates a target formula according to the target feature. Adding the binary-picture channel allows the target formula area to be located more accurately; obtaining a first feature map with a height greater than 1 from the picture in the target formula area preserves the picture information along the vertical axis; and the bidirectional decoding method avoids the problem in unidirectional decoding where an error in one character prediction affects the predictions of the subsequent characters. This reduces the probability of false detections and missed detections when recognizing formulas in an optical character recognition scene, improves the recognition of formulas, and better solves the problem of recognizing formulas in picture scenes.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 14 shows a schematic block diagram of an example electronic device 1400 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 14, the electronic device 1400 includes a computing unit 1401 that can perform various appropriate actions and processes in accordance with a computer program stored in a read-only memory (ROM) 1402 or a computer program loaded from a storage unit 1408 into a random access memory (RAM) 1403. In the RAM 1403, various programs and data required for the operation of the electronic device 1400 can also be stored. The computing unit 1401, the ROM 1402, and the RAM 1403 are connected to each other via a bus 1404. An input/output (I/O) interface 1405 is also connected to the bus 1404.
A number of components in the electronic device 1400 are connected to the I/O interface 1405, including: an input unit 1406 such as a keyboard, a mouse, or the like; an output unit 1407 such as various types of displays, speakers, and the like; a storage unit 1408 such as a magnetic disk, optical disk, or the like; and a communication unit 1409 such as a network card, a modem, a wireless communication transceiver, and the like. The communication unit 1409 allows the electronic device 1400 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 1401 may be a variety of general purpose and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 1401 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and the like. The computing unit 1401 executes the respective methods and processes described above, such as the optical character recognition method described in figs. 1 to 11. For example, in some embodiments, the optical character recognition method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as storage unit 1408. In some embodiments, part or all of the computer program can be loaded and/or installed onto the electronic device 1400 via the ROM 1402 and/or the communication unit 1409. When the computer program is loaded into the RAM 1403 and executed by the computing unit 1401, one or more steps of the optical character recognition method described above may be performed. Alternatively, in other embodiments, the computing unit 1401 may be configured to perform the optical character recognition method by any other suitable means (e.g. by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also known as a cloud computing server or a cloud host, which is a host product in the cloud computing service system and overcomes the drawbacks of difficult management and weak service scalability found in conventional physical host and VPS ("Virtual Private Server") services. The server may also be a server of a distributed system, or a server incorporating a blockchain.
According to an embodiment of the present disclosure, there is also provided a computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the optical character recognition method according to the above-mentioned embodiment of the present disclosure.
It should be understood that steps may be reordered, added, or deleted in the various forms of flows shown above. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders; no limitation is imposed herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (21)

1. An optical character recognition method comprising:
acquiring a target formula area in a picture to be identified;
extracting features of the picture in the target formula area to obtain a first feature map with a preset height, wherein the preset height is greater than 1;
unfolding the first feature map to obtain a target feature; and
generating a target formula according to the target feature.
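By way of illustration only, the following Python sketch shows one possible reading of claim 1: a convolutional backbone produces a first feature map whose preset height is greater than 1, and the map is unfolded into a target feature sequence. The class name, layer sizes, and the use of PyTorch are assumptions of the sketch, not limitations of the claim.

```python
import torch
import torch.nn as nn

class FormulaFeatureExtractor(nn.Module):
    """Hypothetical backbone: the picture in the target formula area goes in,
    a first feature map with height H > 1 comes out, and the map is unfolded
    into a target feature sequence (one vector per spatial position)."""

    def __init__(self, in_channels: int = 3, feat_channels: int = 256):
        super().__init__()
        # Assumed convolutional backbone; the claim does not fix an architecture.
        self.backbone = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, feat_channels, 3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, formula_crop: torch.Tensor) -> torch.Tensor:
        # formula_crop: (N, C, H_img, W_img), the picture inside the target formula area.
        fmap = self.backbone(formula_crop)           # first feature map, preset height > 1
        n, c, h, w = fmap.shape
        # "Unfolding": flatten the H x W grid into a length H*W sequence of C-dim vectors.
        target_feature = fmap.permute(0, 2, 3, 1).reshape(n, h * w, c)
        return target_feature                        # (N, H*W, C) target feature
```

Keeping a height greater than 1 before unfolding lets the resulting sequence retain vertical context such as superscripts, subscripts, and fraction bars, which a single-row feature map would lose.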
2. The recognition method of claim 1, wherein the generating of the target formula according to the target feature comprises:
generating a sequential feature sequence and a reverse-order feature sequence according to the target feature;
generating a forward encoded feature sequence and a backward encoded feature sequence according to the sequential feature sequence and the reverse-order feature sequence;
generating a forward decoding result and a backward decoding result according to the forward encoded feature sequence and the backward encoded feature sequence; and
fusing the forward decoding result and the backward decoding result to obtain the target formula.
3. The recognition method of claim 2, wherein the generating of the forward encoded feature sequence and the backward encoded feature sequence according to the sequential feature sequence and the reverse-order feature sequence comprises:
respectively inputting the sequential feature sequence and the reverse-order feature sequence into a gated recurrent unit network to obtain a sequential coding feature sequence and a reverse-order coding feature sequence; and
respectively inputting the sequential coding feature sequence and the reverse-order coding feature sequence into a fully-connected network to obtain the forward encoded feature sequence and the backward encoded feature sequence.
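By way of illustration only, a minimal sketch of the encoding of claim 3, assuming a PyTorch GRU and one fully-connected layer shared between the two directions; whether the weights are shared, and all dimensions, are assumptions of the sketch rather than features of the claim.

```python
import torch
import torch.nn as nn

class BidirectionalSequenceEncoder(nn.Module):
    """Illustrative encoder: the sequential feature sequence and its reversed copy
    each pass through a gated recurrent unit (GRU) network and a fully-connected
    layer, yielding the forward and backward encoded feature sequences."""

    def __init__(self, feat_dim: int = 256, hidden_dim: int = 256):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, target_feature: torch.Tensor):
        # target_feature: (N, T, C), the unfolded target feature sequence.
        sequential_seq = target_feature
        reverse_seq = torch.flip(target_feature, dims=[1])   # reverse-order feature sequence

        seq_coding, _ = self.gru(sequential_seq)             # sequential coding feature sequence
        rev_coding, _ = self.gru(reverse_seq)                # reverse-order coding feature sequence

        forward_encoded = self.fc(seq_coding)                # forward encoded feature sequence
        backward_encoded = self.fc(rev_coding)               # backward encoded feature sequence
        return forward_encoded, backward_encoded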
4. The recognition method of claim 2, wherein the generating of the forward decoding result and the backward decoding result according to the forward encoded feature sequence and the backward encoded feature sequence comprises:
generating a forward attention map and a backward attention map according to the forward encoded feature sequence and the backward encoded feature sequence; and
decoding the forward attention map and the backward attention map respectively to obtain the forward decoding result and the backward decoding result.
5. The recognition method of claim 4, wherein the generating of the forward attention map and the backward attention map according to the forward encoded feature sequence and the backward encoded feature sequence comprises:
generating a hidden state vector according to the forward encoded feature sequence and the backward encoded feature sequence;
generating the forward attention map according to the forward encoded feature sequence and the hidden state vector; and
generating the backward attention map according to the backward encoded feature sequence and the hidden state vector.
6. The recognition method of claim 5, wherein the decoding of the forward attention map and the backward attention map respectively to obtain the forward decoding result and the backward decoding result comprises:
calculating an output result of a current time step according to an output result of a previous time step in a decoding process, a hidden state of the previous time step in the hidden state vector, and a weight value of the current time step in the forward attention map, and concatenating the output results of the time steps to obtain the forward decoding result; and
calculating an output result of a current time step according to an output result of a previous time step in the decoding process, a hidden state of the previous time step in the hidden state vector, and a weight value of the current time step in the backward attention map, and concatenating the output results of the time steps to obtain the backward decoding result.
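By way of illustration only, a sketch of the attention decoding of claims 5 and 6, assuming a GRU-cell decoder whose hidden state vector is initialized from the encoded features and a start symbol at index 0; the same module would be run once on the forward encoded feature sequence and once on the backward one. The layer names, shapes, and greedy argmax readout are assumptions of the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionDecoder(nn.Module):
    """Illustrative attention decoder: at each time step, attention weights over the
    encoded feature sequence, the previous output, and the previous hidden state
    produce the current output; the per-step outputs are concatenated."""

    def __init__(self, enc_dim: int = 256, hid_dim: int = 256, vocab_size: int = 128):
        super().__init__()
        self.init_h = nn.Linear(enc_dim, hid_dim)            # hidden state vector (assumed init)
        self.attn_score = nn.Linear(enc_dim + hid_dim, 1)    # produces the attention map
        self.embed = nn.Embedding(vocab_size, hid_dim)
        self.cell = nn.GRUCell(enc_dim + hid_dim, hid_dim)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, encoded_seq: torch.Tensor, max_steps: int = 64) -> torch.Tensor:
        n, t, _ = encoded_seq.shape
        hidden = torch.tanh(self.init_h(encoded_seq.mean(dim=1)))
        prev_token = torch.zeros(n, dtype=torch.long, device=encoded_seq.device)  # assumed start symbol
        outputs = []
        for _ in range(max_steps):
            # Attention weights of the current time step over the encoded feature sequence.
            scores = self.attn_score(
                torch.cat([encoded_seq, hidden.unsqueeze(1).expand(-1, t, -1)], dim=-1)
            ).squeeze(-1)                                    # (N, T)
            weights = F.softmax(scores, dim=-1)
            context = torch.bmm(weights.unsqueeze(1), encoded_seq).squeeze(1)
            # The current output depends on the previous output, the previous hidden
            # state, and the current attention weights (through the context vector).
            hidden = self.cell(torch.cat([context, self.embed(prev_token)], dim=-1), hidden)
            logits = self.out(hidden)
            prev_token = logits.argmax(dim=-1)
            outputs.append(logits)
        # Concatenating the per-step outputs gives the decoding result.
        return torch.stack(outputs, dim=1)                   # (N, max_steps, vocab_size)
```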
7. The recognition method of claim 2, wherein the fusing of the forward decoding result and the backward decoding result to obtain the target formula comprises:
acquiring an editing operation required when an editing distance between the forward decoding result and the backward decoding result is minimum, and a character corresponding to the editing operation; and
fusing the forward decoding result and the backward decoding result according to the editing operation and a confidence of the character corresponding to the editing operation, to obtain the target formula.
8. The recognition method according to claim 7, wherein the fusing of the forward decoding result and the backward decoding result according to the editing operation and the confidence of the character corresponding to the editing operation to obtain the target formula comprises:
if the editing operation is an inserting operation and the confidence of the character corresponding to the inserting operation is greater than an average confidence of the forward decoding result or the backward decoding result, writing the character corresponding to the inserting operation into the target formula;
if the editing operation is a deleting operation and the confidence of the character corresponding to the deleting operation is not less than the average confidence of the forward decoding result or the backward decoding result, or is not less than a preset confidence threshold, writing the character corresponding to the deleting operation into the target formula;
if the editing operation is a replacing operation, writing the character with the higher confidence of the two characters corresponding to the replacing operation into the target formula; and
writing characters which do not correspond to any editing operation into the target formula.
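By way of illustration only, a sketch of the fusion of claims 7 and 8 in plain Python. Per-character confidences are assumed to be available from the decoder, the backward decoding result is assumed to have been restored to reading order, and difflib's opcodes are used as a stand-in for a minimum-edit-distance alignment.

```python
from difflib import SequenceMatcher

def fuse_decoding_results(fwd_chars, fwd_conf, bwd_chars, bwd_conf, conf_thresh=0.9):
    """Illustrative fusion of a forward and a backward decoding result.
    fwd_chars/bwd_chars are character lists; fwd_conf/bwd_conf are per-character
    confidences in [0, 1]; conf_thresh is the preset confidence threshold."""
    total = len(fwd_conf) + len(bwd_conf)
    avg_conf = (sum(fwd_conf) + sum(bwd_conf)) / max(total, 1)   # average confidence
    fused = []
    for op, i1, i2, j1, j2 in SequenceMatcher(None, fwd_chars, bwd_chars).get_opcodes():
        if op == "equal":
            # Characters not touched by any editing operation are written directly.
            fused.extend(fwd_chars[i1:i2])
        elif op == "insert":
            # Inserting: keep the character only if its confidence beats the average.
            fused.extend(c for c, p in zip(bwd_chars[j1:j2], bwd_conf[j1:j2]) if p > avg_conf)
        elif op == "delete":
            # Deleting: keep the character if it is still confident enough.
            fused.extend(c for c, p in zip(fwd_chars[i1:i2], fwd_conf[i1:i2])
                         if p >= avg_conf or p >= conf_thresh)
        elif op == "replace":
            # Replacing: of the two aligned characters, keep the more confident one.
            for (cf, pf), (cb, pb) in zip(zip(fwd_chars[i1:i2], fwd_conf[i1:i2]),
                                          zip(bwd_chars[j1:j2], bwd_conf[j1:j2])):
                fused.append(cf if pf >= pb else cb)
    return "".join(fused)
```

With, say, a forward result "x^2+1" and a backward result "x^2+l", the replace branch keeps whichever of "1" and "l" the decoder was more confident about, so errors that only one decoding direction makes tend to be corrected by the other.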
9. The recognition method according to claim 1, wherein the acquiring of the target formula area in the picture to be identified comprises:
generating a binary picture according to the picture to be identified, wherein the binary picture comprises a character area and a formula area;
performing feature extraction on the picture to be identified and the binary picture to obtain a second feature map;
generating a score feature map and an offset feature map according to the second feature map; and
performing non-maximum suppression processing on the score feature map and the offset feature map to obtain the target formula area.
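By way of illustration only, a NumPy sketch of the post-processing in claim 9: pixels of the score feature map above a threshold are expanded into candidate boxes with the offset feature map, and ordinary IoU-based non-maximum suppression keeps the target formula area. The four-channel offset layout, the thresholds, and the omission of any feature-map-to-image scaling are assumptions of the sketch.

```python
import numpy as np

def formula_areas_from_maps(score_map, offset_map, score_thresh=0.8, iou_thresh=0.3):
    """Illustrative decoding of a score feature map (H, W) and an offset feature
    map (4, H, W) into formula boxes, followed by non-maximum suppression."""
    ys, xs = np.where(score_map > score_thresh)
    boxes, scores = [], []
    for y, x in zip(ys, xs):
        # offset_map[:, y, x] is assumed to hold distances to the left/top/right/bottom edges.
        l, t, r, b = offset_map[:, y, x]
        boxes.append([x - l, y - t, x + r, y + b])
        scores.append(score_map[y, x])
    boxes = np.asarray(boxes, dtype=float).reshape(-1, 4)
    scores = np.asarray(scores, dtype=float)

    keep = []
    order = scores.argsort()[::-1]           # highest score first
    while order.size > 0:
        i = order[0]
        keep.append(i)
        if order.size == 1:
            break
        # Intersection-over-union between the best box and the remaining ones.
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = ((boxes[order[1:], 2] - boxes[order[1:], 0])
                 * (boxes[order[1:], 3] - boxes[order[1:], 1]))
        iou = inter / (area_i + areas - inter + 1e-6)
        order = order[1:][iou < iou_thresh]  # suppress overlapping, lower-scoring boxes
    return boxes[keep], scores[keep]
```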
10. An optical character recognition apparatus comprising:
the acquisition module is used for acquiring a target formula area in the picture to be identified;
the extraction module is used for extracting features of the pictures in the target formula area to obtain a first feature map with a preset height, and the preset height is greater than 1;
the unfolding module is used for unfolding the first feature map to obtain a target feature; and
the generating module is used for generating a target formula according to the target feature.
11. The recognition apparatus of claim 10, wherein the generating module comprises:
the first generation submodule is used for generating a sequential feature sequence and a reverse-order feature sequence according to the target feature;
the second generation submodule is used for generating a forward encoded feature sequence and a backward encoded feature sequence according to the sequential feature sequence and the reverse-order feature sequence;
the third generation submodule is used for generating a forward decoding result and a backward decoding result according to the forward encoded feature sequence and the backward encoded feature sequence; and
the fusion submodule is used for fusing the forward decoding result and the backward decoding result to obtain the target formula.
12. The recognition apparatus of claim 11, wherein the second generation submodule comprises:
the first input unit is used for respectively inputting the sequential feature sequence and the reverse-order feature sequence into a gated recurrent unit network to obtain a sequential coding feature sequence and a reverse-order coding feature sequence; and
the second input unit is used for respectively inputting the sequential coding feature sequence and the reverse-order coding feature sequence into a fully-connected network to obtain the forward encoded feature sequence and the backward encoded feature sequence.
13. The recognition apparatus of claim 11, wherein the third generation submodule comprises:
a generating unit, configured to generate a forward attention map and a backward attention map according to the forward encoded feature sequence and the backward encoded feature sequence; and
a decoding unit, configured to decode the forward attention map and the backward attention map respectively to obtain the forward decoding result and the backward decoding result.
14. The recognition apparatus of claim 13, wherein the generating unit comprises:
a first generating subunit, configured to generate a hidden state vector according to the forward encoded feature sequence and the backward encoded feature sequence;
a second generating subunit, configured to generate the forward attention map according to the forward encoded feature sequence and the hidden state vector; and
a third generating subunit, configured to generate the backward attention map according to the backward encoded feature sequence and the hidden state vector.
15. The recognition apparatus of claim 14, wherein the decoding unit comprises:
a first calculating subunit, configured to calculate an output result of a current time step according to an output result of a previous time step in a decoding process, a hidden state of the previous time step in the hidden state vector, and a weight value of the current time step in the forward attention map, and concatenate the output results of the time steps to obtain the forward decoding result; and
a second calculating subunit, configured to calculate an output result of a current time step according to an output result of a previous time step in the decoding process, a hidden state of the previous time step in the hidden state vector, and a weight value of the current time step in the backward attention map, and concatenate the output results of the time steps to obtain the backward decoding result.
16. The recognition apparatus of claim 11, wherein the fusion submodule comprises:
an obtaining unit, configured to obtain an editing operation required when an editing distance between the forward decoding result and the backward decoding result is minimum, and a character corresponding to the editing operation; and
a fusion unit, configured to fuse the forward decoding result and the backward decoding result according to the editing operation and the confidence of the character corresponding to the editing operation to obtain the target formula.
17. The recognition apparatus of claim 16, wherein the fusion unit comprises:
a first writing unit, configured to write the character corresponding to the inserting operation into the target formula if the editing operation is an inserting operation and the confidence of the character corresponding to the inserting operation is greater than an average confidence of the forward decoding result or the backward decoding result;
a second writing unit, configured to write the character corresponding to the deleting operation into the target formula if the editing operation is a deleting operation and the confidence of the character corresponding to the deleting operation is not less than the average confidence of the forward decoding result or the backward decoding result, or is not less than a preset confidence threshold;
a third writing unit, configured to write, if the editing operation is a replacing operation, the character with the higher confidence of the two characters corresponding to the replacing operation into the target formula; and
a fourth writing unit, configured to write characters which do not correspond to any editing operation into the target formula.
18. The recognition apparatus of claim 10, wherein the acquisition module comprises:
the fourth generation submodule is used for generating a binary picture according to the picture to be identified, wherein the binary picture comprises a character area and a formula area;
the extraction submodule is used for extracting the features of the picture to be identified and the binary picture to obtain a second feature map;
a fifth generation submodule, configured to generate a score feature map and an offset feature map according to the second feature map; and
the processing submodule is used for performing non-maximum suppression processing on the score feature map and the offset feature map to obtain the target formula area.
19. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the optical character recognition method of any one of claims 1-9.
20. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the optical character recognition method according to any one of claims 1-9.
21. A computer program product comprising a computer program which, when executed by a processor, implements the optical character recognition method according to any one of claims 1-9.
CN202110270866.4A 2021-03-12 2021-03-12 Optical character recognition method, device, electronic equipment and storage medium Active CN113052156B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110270866.4A CN113052156B (en) 2021-03-12 2021-03-12 Optical character recognition method, device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110270866.4A CN113052156B (en) 2021-03-12 2021-03-12 Optical character recognition method, device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113052156A true CN113052156A (en) 2021-06-29
CN113052156B CN113052156B (en) 2023-08-04

Family

ID=76512085

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110270866.4A Active CN113052156B (en) 2021-03-12 2021-03-12 Optical character recognition method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113052156B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114693814A (en) * 2022-03-31 2022-07-01 北京字节跳动网络技术有限公司 Model decoding method, text recognition method, device, medium and equipment

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130080164A1 (en) * 2011-09-28 2013-03-28 Google Inc. Selective Feedback For Text Recognition Systems
CN110942057A (en) * 2018-09-25 2020-03-31 杭州海康威视数字技术股份有限公司 Container number identification method and device and computer equipment
US20200143191A1 (en) * 2018-11-02 2020-05-07 Iflytek Co., Ltd. Method, apparatus and storage medium for recognizing character
CN111428593A (en) * 2020-03-12 2020-07-17 北京三快在线科技有限公司 Character recognition method and device, electronic equipment and storage medium
CN111539410A (en) * 2020-04-16 2020-08-14 深圳市商汤科技有限公司 Character recognition method and device, electronic equipment and storage medium
CN111967391A (en) * 2020-08-18 2020-11-20 清华大学 Text recognition method and computer-readable storage medium for medical laboratory test reports
CN112200194A (en) * 2020-12-08 2021-01-08 北京易真学思教育科技有限公司 Formula identification method and device, electronic equipment and storage medium
CN112307820A (en) * 2019-07-29 2021-02-02 北京易真学思教育科技有限公司 Text recognition method, device, equipment and computer readable medium

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130080164A1 (en) * 2011-09-28 2013-03-28 Google Inc. Selective Feedback For Text Recognition Systems
CN103959282A (en) * 2011-09-28 2014-07-30 谷歌公司 Selective feedback for text recognition systems
CN110942057A (en) * 2018-09-25 2020-03-31 杭州海康威视数字技术股份有限公司 Container number identification method and device and computer equipment
US20200143191A1 (en) * 2018-11-02 2020-05-07 Iflytek Co., Ltd. Method, apparatus and storage medium for recognizing character
CN112307820A (en) * 2019-07-29 2021-02-02 北京易真学思教育科技有限公司 Text recognition method, device, equipment and computer readable medium
CN111428593A (en) * 2020-03-12 2020-07-17 北京三快在线科技有限公司 Character recognition method and device, electronic equipment and storage medium
CN111539410A (en) * 2020-04-16 2020-08-14 深圳市商汤科技有限公司 Character recognition method and device, electronic equipment and storage medium
CN111967391A (en) * 2020-08-18 2020-11-20 清华大学 Text recognition method and computer-readable storage medium for medical laboratory test reports
CN112200194A (en) * 2020-12-08 2021-01-08 北京易真学思教育科技有限公司 Formula identification method and device, electronic equipment and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
HAGIT SHATKAY et al.: "OCR-based image features for biomedical image and article classification: identifying documents relevant to cis-regulatory elements", PROCEEDINGS OF THE ACM CONFERENCE ON BIOINFORMATICS, COMPUTATIONAL BIOLOGY AND BIOMEDICINE *
罗昱成: "A Survey of Scene Character Recognition", Modern Computer, no. 04 *
黄国林; 郭丹; 胡学钢: "Approximate Pattern Matching Algorithm Based on Wildcards and Length Constraints", Journal of Computer Applications, no. 03 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114693814A (en) * 2022-03-31 2022-07-01 北京字节跳动网络技术有限公司 Model decoding method, text recognition method, device, medium and equipment
CN114693814B (en) * 2022-03-31 2024-04-30 北京字节跳动网络技术有限公司 Decoding method, text recognition method, device, medium and equipment for model

Also Published As

Publication number Publication date
CN113052156B (en) 2023-08-04

Similar Documents

Publication Publication Date Title
KR102266529B1 (en) Method, apparatus, device and readable storage medium for image-based data processing
CN110795543B (en) Unstructured data extraction method, device and storage medium based on deep learning
RU2691214C1 (en) Text recognition using artificial intelligence
CN112949415B (en) Image processing method, apparatus, device and medium
CN108960338B (en) Image automatic statement marking method based on attention feedback mechanism
CN104463101A (en) Answer recognition method and system for textual test question
CN113254654B (en) Model training method, text recognition method, device, equipment and medium
CN116152833B (en) Training method of form restoration model based on image and form restoration method
CN112257665A (en) Image content recognition method, image recognition model training method, and medium
CN113283336A (en) Text recognition method and system
CN115019142B (en) Image title generation method and system based on fusion characteristics and electronic equipment
CN114022887B (en) Text recognition model training and text recognition method and device, and electronic equipment
CN113420763B (en) Text image processing method and device, electronic equipment and readable storage medium
CN117033609B (en) Text visual question-answering method, device, computer equipment and storage medium
Kostopoulos et al. Haptic access to conventional 2D maps for the visually impaired
CN113052156A (en) Optical character recognition method, device, electronic equipment and storage medium
Vankadaru et al. Text Identification from Handwritten Data using Bi-LSTM and CNN with FastAI
CN113837157B (en) Topic type identification method, system and storage medium
CN114092931B (en) Scene character recognition method and device, electronic equipment and storage medium
CN113723367B (en) Answer determining method, question judging method and device and electronic equipment
Petkar et al. Real Time Sign Language Recognition System for Hearing and Speech Impaired People
CN114792423B (en) Document image processing method and device and storage medium
Rai et al. MyOcrTool: visualization system for generating associative images of Chinese characters in smart devices
Pan A Study of English Learning Vocabulary Detection Based on Image Semantic Segmentation Fusion Network
CN116798048A (en) Text recognition method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant