CN113052156B - Optical character recognition method, device, electronic equipment and storage medium - Google Patents

Optical character recognition method, device, electronic equipment and storage medium Download PDF

Info

Publication number
CN113052156B
CN113052156B (Application No. CN202110270866.4A)
Authority
CN
China
Prior art keywords
sequence
backward
decoding result
target
generating
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110270866.4A
Other languages
Chinese (zh)
Other versions
CN113052156A (en)
Inventor
吴亮
刘珊珊
章成全
姚锟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202110270866.4A
Publication of CN113052156A
Application granted
Publication of CN113052156B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/22Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T9/00Image coding
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462Salient features, e.g. scale invariant feature transforms [SIFT]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/62Text, e.g. of license plates, overlay texts or captions on TV images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Character Discrimination (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure provides an optical character recognition method and apparatus, an electronic device, and a storage medium, relating to the field of artificial intelligence technology and in particular to the fields of computer vision and deep learning. A specific implementation is as follows: acquiring a target formula area in a picture to be identified; extracting features from the picture in the target formula area to obtain a first feature map with a preset height, the preset height being greater than 1; expanding the first feature map to obtain target features; and generating a target formula according to the target features. The optical character recognition method and apparatus, the electronic device, and the storage medium can improve the recognition of formulas in optical character recognition scenes and better solve the problem of recognizing formulas in picture scenes.

Description

Optical character recognition method, device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer vision and deep learning in the field of artificial intelligence technology, and in particular, to an optical character recognition method, an optical character recognition device, an electronic apparatus, and a storage medium.
Background
With the rapid development of computer technology, the deep integration of emerging internet technologies with teaching activities has greatly improved the working and learning efficiency of education practitioners.
In the related art, image-text recognition mainly relies on optical character recognition (Optical Character Recognition, abbreviated as OCR). The technology is generally aimed at general scenes, such as street-view text and photographed text, and consists of two parts, detection and recognition: a text region is first detected in the input picture, and the picture of the text region is then fed separately into a recognition network for recognition. However, this recognition approach gives unsatisfactory results on formula data, which has a spatial structure.
Disclosure of Invention
Provided are an optical character recognition method, an optical character recognition apparatus, an electronic device, and a storage medium.
According to a first aspect, there is provided an optical character recognition method comprising: acquiring a target formula area in a picture to be identified; extracting features of the pictures in the target formula area to obtain a first feature image with a preset height, wherein the preset height is larger than 1; expanding the first feature map to obtain target features; and generating a target formula according to the target characteristics.
According to a second aspect, there is provided an optical character recognition device comprising: the acquisition module is used for acquiring a target formula area in the picture to be identified; the extraction module is used for extracting the characteristics of the pictures in the target formula area to obtain a first characteristic image with a preset height, wherein the preset height is larger than 1; the unfolding module is used for unfolding the first feature map to obtain target features; and the generation module is used for generating a target formula according to the target characteristics.
According to a third aspect, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the optical character recognition method of the first aspect of the present disclosure.
According to a fourth aspect, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the optical character recognition method according to the first aspect of the present disclosure.
According to a fifth aspect, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the optical character recognition method according to the first aspect of the present disclosure.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a flow chart of a method of optical character recognition according to a first embodiment of the present disclosure;
FIG. 2 is a flow chart of an optical character recognition method according to a second embodiment of the present disclosure;
FIG. 3 is a flow chart of an optical character recognition method according to a third embodiment of the present disclosure;
FIG. 4 is a flow chart of an optical character recognition method according to a fourth embodiment of the present disclosure;
FIG. 5 is a flow chart of an optical character recognition method according to a fifth embodiment of the present disclosure;
FIG. 6 is a flow chart of an optical character recognition method according to a sixth embodiment of the present disclosure;
FIG. 7 is a flow chart of an optical character recognition method according to a seventh embodiment of the present disclosure;
FIG. 8 is a flow chart of an optical character recognition method according to an eighth embodiment of the present disclosure;
FIG. 9 is a flow chart of an optical character recognition method according to a ninth embodiment of the present disclosure;
FIG. 10 is a schematic diagram of a detection stage in an optical character recognition method according to an embodiment of the present disclosure;
FIG. 11 is a schematic diagram of a recognition stage in an optical character recognition method according to an embodiment of the present disclosure;
FIG. 12 is a block diagram of an optical character recognition device according to a first embodiment of the present disclosure;
FIG. 13 is a block diagram of an optical character recognition device according to a second embodiment of the present disclosure;
fig. 14 is a block diagram of an electronic device for implementing the optical character recognition method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Artificial intelligence (Artificial Intelligence, AI for short) is a technical science that studies and develops theories, methods, techniques, and application systems for simulating, extending, and expanding human intelligence. At present, AI technology offers a high degree of automation, high accuracy, and low cost, and is widely applied.
Computer vision (Computer Vision, also called machine vision) is the simulation of biological vision using a computer and related equipment. It refers to using a camera and a computer in place of human eyes to recognize, track, and measure objects, and to further process the resulting images so that they are better suited for human observation or for transmission to instruments for inspection.
Deep learning (Deep Learning, DL for short) is a new research direction in the field of machine learning (Machine Learning, ML for short). It learns the inherent rules and representation levels of sample data, and the information obtained during learning is of great help in interpreting data such as text, images, and sound. Its ultimate goal is to give machines the same analytical learning ability as humans, so that they can recognize text, image, and sound data. Its main research content includes neural network systems based on convolution operations, namely convolutional neural networks; self-encoding neural networks based on multiple layers of neurons; and deep belief networks that are pre-trained as multi-layer self-encoding neural networks and then further optimize the network weights by combining label information. Deep learning has achieved many results in search technology, data mining, machine learning, machine translation, natural language processing, multimedia learning, speech, recommendation and personalization technologies, and other related fields. Deep learning enables machines to imitate human activities such as seeing, hearing, and thinking, solves many complex pattern recognition problems, and has greatly advanced artificial-intelligence-related technologies.
Optical character recognition methods, apparatuses, electronic devices, and storage media according to embodiments of the present disclosure are described below with reference to the accompanying drawings.
Fig. 1 is a flow chart of an optical character recognition method according to a first embodiment of the present disclosure.
As shown in fig. 1, the optical character recognition method according to the embodiment of the present disclosure may specifically include the following steps:
s101, acquiring a target formula area in a picture to be identified.
Specifically, the optical character recognition method according to the embodiment of the present disclosure may be implemented by an optical character recognition device according to the embodiment of the present disclosure, where the optical character recognition device may be a hardware device having a data information processing capability and/or software necessary for driving the hardware device to operate. Alternatively, the execution body may include a workstation, a server, a computer, a user terminal, and other devices. The user terminal comprises, but is not limited to, a mobile phone, a computer, intelligent voice interaction equipment, intelligent household appliances, vehicle-mounted terminals and the like.
In the embodiment of the disclosure, the picture to be identified may be a three-channel picture of a test paper, an exercise book, or the like, obtained through a camera or an electronic device with an image-capture function, and the picture may include, but is not limited to, at least one kind of information such as formulas and text. In the embodiment of the disclosure, the content displayed in the picture is simply divided into non-formula content, printed-formula content, and handwritten-formula content. The target formula area in the picture to be identified is obtained through detection of the picture, feature extraction, and feature-map processing. The target formula area is the area in the picture to be identified where the formula content is determined, through detection, to be located.
S102, extracting features of the pictures in the target formula area to obtain a first feature image with a preset height, wherein the preset height is larger than 1.
Specifically, according to the target formula area obtained in step S101, the picture within the target formula area of the picture to be identified is obtained. The picture in the target formula area is input into a deep learning network for feature extraction to obtain a first feature map with a preset height. In order to preserve the picture information along the vertical-axis dimension, the extracted features are not directly compressed into a 1-dimensional sequence but keep a certain height; that is, the preset height is set to a value greater than 1. The specific value is not restricted in the embodiments of the present disclosure and may be set as needed. For example, a picture of size 512×64 in the target formula area may be input into a convolutional neural network (Convolutional Neural Networks, abbreviated as CNN) to extract features of the picture, obtaining a first feature map of size 32×4, where the preset height is 4.
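For illustration only, the following is a minimal PyTorch-style sketch of a backbone that keeps the feature-map height at 4 instead of collapsing it to 1. The class name FormulaEncoderCNN, the layer configuration, and the channel counts are assumptions for this example, not the exact network of the embodiment.

```python
import torch
import torch.nn as nn

# Hypothetical backbone: four stride-2 stages reduce a 64x512 input to a 4x32
# feature map, keeping the height greater than 1 instead of collapsing it.
class FormulaEncoderCNN(nn.Module):
    def __init__(self, out_channels: int = 256):
        super().__init__()
        layers, in_ch = [], 3
        for ch in (32, 64, 128, out_channels):
            layers += [
                nn.Conv2d(in_ch, ch, kernel_size=3, padding=1),
                nn.BatchNorm2d(ch),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(2),  # halves both height and width
            ]
            in_ch = ch
        self.body = nn.Sequential(*layers)

    def forward(self, x):        # x: (N, 3, 64, 512)
        return self.body(x)      # -> (N, 256, 4, 32)


feat = FormulaEncoderCNN()(torch.randn(1, 3, 64, 512))
print(feat.shape)                # torch.Size([1, 256, 4, 32])
```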
S103, expanding the first feature map to obtain target features.
Specifically, the first feature map, whose height is greater than 1, obtained in step S102 for the picture in the target formula area is expanded to obtain the target features. For example, expanding the 32×4 first feature map yields 128-dimensional target features.
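A minimal sketch of this expansion, assuming the feature map is unrolled position by position into a sequence of 4×32 = 128 feature vectors (the ordering and the framework are assumptions):

```python
import torch

feat = torch.randn(1, 256, 4, 32)              # first feature map from the CNN
n, c, h, w = feat.shape
# Unroll the 4x32 grid into a sequence of 4*32 = 128 positions,
# one c-dimensional feature vector per position.
target_features = feat.permute(0, 2, 3, 1).reshape(n, h * w, c)
print(target_features.shape)                   # torch.Size([1, 128, 256])
```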
S104, generating a target formula according to the target characteristics.
Specifically, the target formula is a recognition result of a formula in the picture in the target formula area. And generating a target formula through operations such as encoding and decoding according to the target characteristics corresponding to the picture in the target formula area obtained in the step S103.
It should be noted that, as can be understood by those skilled in the art, the optical character recognition method of the embodiment of the present disclosure can effectively recognize printed or handwritten formulas on a test paper and achieve good accuracy.
In summary, according to the optical character recognition method of the embodiment of the present disclosure, a target formula area in a picture to be identified is obtained, features are extracted from the picture in the target formula area to obtain a first feature map with a preset height greater than 1, the first feature map is expanded to obtain target features, and a target formula is generated according to the target features. By determining the target formula area and obtaining a first feature map with a height greater than 1 from the picture in that area, the picture information in the vertical-axis dimension is preserved, the recognition of formulas in optical character recognition scenes is improved, and the problem of recognizing formulas in picture scenes is better solved.
Fig. 2 is a flow chart of an optical character recognition method according to a second embodiment of the present disclosure.
As shown in fig. 2, on the basis of the embodiment shown in fig. 1, the optical character recognition method according to the embodiment of the present disclosure may specifically include the following steps:
step S101 "acquiring the target formula area" in the picture to be identified in the above embodiment may specifically include the following steps S201 to S204.
S201, generating a binary image according to the image to be identified, wherein the binary image comprises a text area and a formula area.
Specifically, in the embodiment of the disclosure, during formula detection, besides the input of the original three-channel picture to be identified, a binary picture is additionally input. In the embodiment of the disclosure, the binary image can be regarded as a two-dimensional matrix, and only comprises 0 and 1 values, the size is consistent with that of the image to be identified, the area with characters is set to 1, the area without characters is set to 0, and the character position can be obtained by single character position marking or general character recognition. The binary image is used as a fourth channel to be input into the detection network, and text position information can be additionally added to the detection network so as to help the detection network to distinguish text areas from formula areas and prevent false detection caused by taking independent letters as formulas.
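A minimal sketch of constructing such a fourth channel from known character positions; the box format and the source of the character positions (single-character labels or a generic text recognizer) are assumptions for this example.

```python
import numpy as np

def add_text_mask_channel(image_rgb: np.ndarray, char_boxes) -> np.ndarray:
    """Append a 0/1 text mask as a fourth channel.

    image_rgb: (H, W, 3) array; char_boxes: iterable of (x1, y1, x2, y2)
    character positions (how these boxes are obtained is an assumption here).
    """
    h, w, _ = image_rgb.shape
    mask = np.zeros((h, w, 1), dtype=image_rgb.dtype)
    for x1, y1, x2, y2 in char_boxes:
        mask[y1:y2, x1:x2] = 1          # regions with characters are set to 1
    return np.concatenate([image_rgb, mask], axis=2)   # (H, W, 4)


four_channel = add_text_mask_channel(
    np.zeros((64, 512, 3), dtype=np.uint8), [(10, 5, 60, 40)])
print(four_channel.shape)               # (64, 512, 4)
```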
S202, extracting features of the picture to be identified and the binary picture to obtain a second feature map.
Specifically, the image to be identified and the binary image may be input into a convolutional neural network CNN for feature extraction, and in the embodiment of the present disclosure, image features are extracted by using a Unet convolutional neural network, and in the extraction process, multiple sampling operations are performed on the input three-channel image and binary image, so as to obtain a second feature image.
And S203, generating a score characteristic diagram and an offset characteristic diagram according to the second characteristic diagram.
Specifically, a score feature map (score map) and an offset feature map (geometry map) are generated according to the second feature map obtained in step S202. Each pixel location on the score feature map and the offset feature map is regarded as an area. The score feature map is used to determine whether text exists in the corresponding area, and specifically whether the area belongs to a non-formula area, a printed-formula area, or a handwritten-formula area. The offset feature map is used to determine the offset of the characters in the corresponding area, and specifically represents the offset of the bounding box of the target formula from that area.
S204, performing non-maximum value inhibition processing on the score characteristic diagram and the offset characteristic diagram to obtain a target formula area.
Specifically, in the embodiments of the present disclosure, the score feature map and the offset feature map may each be represented as a multi-dimensional matrix comprising the length, width, and number of channels of the feature map. The score feature map has 3 channels, representing the probabilities that a given pixel in the feature map belongs to a non-formula, a printed formula, or a handwritten formula, respectively; the type of the pixel's area is determined by comparing the channel values. The offset feature map has 5 channels, representing the distances from a given pixel to the upper, lower, left, and right edges of the bounding box of the area where the target formula is located, and the rotation angle of the box. Non-maximum suppression (NMS) processing is performed on the score feature map and the offset feature map to obtain the bounding-box coordinates of the formula and thereby the target formula area. Non-maximum suppression is a post-processing technique commonly used in text detection that filters out duplicate boxes, and its specific process is not repeated here.
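For illustration, a simplified sketch of turning the score feature map and offset feature map into candidate boxes and filtering them with NMS. The channel ordering, the thresholds, and the use of axis-aligned torchvision NMS (the rotation-angle channel is ignored) are assumptions, not the exact post-processing of the embodiment.

```python
import torch
from torchvision.ops import nms

def decode_formula_boxes(score_map, geo_map, score_thresh=0.8, iou_thresh=0.2):
    """score_map: (3, H, W) probabilities for non-formula / printed / handwritten
    (channel order is an assumption). geo_map: (5, H, W) distances to the upper,
    lower, left, and right box edges plus a rotation angle; the angle channel is
    ignored here and boxes are treated as axis-aligned, a simplification of the
    rotated-box NMS described above."""
    formula_prob = score_map[1:].max(dim=0).values        # best formula-class probability
    ys, xs = torch.nonzero(formula_prob > score_thresh, as_tuple=True)
    top, bottom = geo_map[0, ys, xs], geo_map[1, ys, xs]
    left, right = geo_map[2, ys, xs], geo_map[3, ys, xs]
    xf, yf = xs.float(), ys.float()
    boxes = torch.stack([xf - left, yf - top, xf + right, yf + bottom], dim=1)
    scores = formula_prob[ys, xs]
    keep = nms(boxes, scores, iou_thresh)                  # drop duplicate boxes
    return boxes[keep], scores[keep]


boxes, scores = decode_formula_boxes(torch.rand(3, 64, 128), torch.rand(5, 64, 128) * 4)
print(boxes.shape)                                         # (num_formula_boxes, 4)
```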
S205, extracting features of the pictures in the target formula area to obtain a first feature map with a preset height, wherein the preset height is larger than 1.
S206, expanding the first feature map to obtain target features.
S207, generating a target formula according to the target characteristics.
Specifically, steps S205 to S207 in this embodiment are the same as steps S102 to S104 in the above embodiment, and will not be described here again.
Further, as shown in fig. 3, step S207 "generating the target formula" according to the target feature in the embodiment shown in fig. 2 may specifically include the following steps:
s301, generating a sequential characteristic sequence and an inverse sequence characteristic sequence according to the target characteristic.
Specifically, the target feature obtained in step S206 is the feature sequence produced by expanding the first feature map. The feature sequence taken in forward order is the sequential feature sequence, and the feature sequence taken in reverse order is the reverse sequence feature sequence.
S302, generating a forward coding feature sequence and a backward coding feature sequence according to the sequential feature sequence and the reverse sequence feature sequence.
Specifically, the sequential feature sequence and the reverse sequence feature sequence generated in step S301 are subjected to coding processing, so that the information on the width of the first feature map is fully fused, and a forward coding feature sequence fw and a backward coding feature sequence bw are obtained.
S303, generating a forward decoding result and a backward decoding result according to the forward coding feature sequence and the backward coding feature sequence.
Specifically, bidirectional decoding is performed on the forward coding feature sequence and the backward coding feature sequence generated in step S302, so as to obtain a forward decoding result and a backward decoding result.
S304, fusing the forward decoding result and the backward decoding result to obtain a target formula.
Specifically, the forward decoding result and the backward decoding result generated in step S303 are fused to obtain a target formula, for example, the choice of the character is determined according to the confidence of the character at each character position in the forward decoding result and the backward decoding result, so as to obtain the target formula.
Further, as shown in fig. 4, step S302 "in the embodiment shown in fig. 3, generating the forward coding feature sequence and the backward coding feature sequence according to the sequential feature sequence and the reverse sequence feature sequence" may specifically include the following steps:
s401, respectively inputting the sequential characteristic sequence and the reverse sequence characteristic sequence into a gating circulation unit network to obtain a sequential coding characteristic sequence and a reverse sequence coding characteristic sequence.
Specifically, the sequential characteristic sequence and the reverse sequence characteristic sequence are respectively input into a gating circulation unit (gated recurrent unit, abbreviated as GRU) network for coding, so as to obtain the sequential coding characteristic sequence and the reverse sequence coding characteristic sequence.
S402, respectively inputting the sequence coding feature sequence and the reverse sequence coding feature sequence into a fully-connected network to obtain a forward coding feature sequence and a backward coding feature sequence.
Specifically, the sequential coding feature sequence and the reverse coding feature sequence obtained in step S401 are respectively input into a Full Connected (FC) network, so as to obtain a forward coding feature sequence fw and a backward coding feature sequence bw.
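A minimal sketch of steps S401 to S402; whether the two directions share one GRU and one fully connected layer, and all dimensions, are assumptions made for this example.

```python
import torch
import torch.nn as nn

class SequenceEncoder(nn.Module):
    """Encode the sequence and its reverse with a GRU, then project each
    through a fully connected layer to obtain fw and bw (steps S401-S402)."""
    def __init__(self, in_dim=256, hidden=256, out_dim=256):
        super().__init__()
        self.gru = nn.GRU(in_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, out_dim)

    def forward(self, seq):                                 # seq: (N, T, in_dim), T = 128
        fwd_enc, _ = self.gru(seq)                          # sequential coding feature sequence
        rev_enc, _ = self.gru(torch.flip(seq, dims=[1]))    # reverse-order coding feature sequence
        fw = self.fc(fwd_enc)                               # forward coding feature sequence
        bw = self.fc(rev_enc)                               # backward coding feature sequence
        return fw, bw


fw, bw = SequenceEncoder()(torch.randn(1, 128, 256))
print(fw.shape, bw.shape)                                   # (1, 128, 256) (1, 128, 256)
```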
Further, as shown in fig. 5, step S303 "in the embodiment shown in fig. 3, generating a forward decoding result and a backward decoding result according to the forward coding feature sequence and the backward coding feature sequence" may specifically include the following steps:
s501, generating a forward attention map and a backward attention map according to the forward coding feature sequence and the backward coding feature sequence.
S502, decoding the forward attention map and the backward attention map separately to obtain a forward decoding result and a backward decoding result.
Further, as shown in fig. 6, step S501 "of generating a forward attention map and a backward attention map according to the forward coding feature sequence and the backward coding feature sequence" in the embodiment shown in fig. 5 may specifically include the following steps:
s601, generating a hidden state vector according to the forward coding feature sequence and the backward coding feature sequence.
Specifically, the forward coding feature sequence fw and the backward coding feature sequence bw are spliced in feature dimensions, and then are input into a fully connected layer (FC layer) for calculation to generate a hidden state vector F.
S602, generating a forward attention diagram according to the forward coding feature sequence and the hidden state vector.
Specifically, the forward coding feature sequence fw is input into a fully connected layer (FC layer) to obtain a feature coding sequence whose dimension is consistent with that of the hidden state vector F, and the feature coding sequence and the hidden state vector F are subjected to a series of calculations including direct addition, tanh, full connection, and softmax (normalization) to obtain the forward attention map. The forward attention map is a two-dimensional map: one dimension represents the length of the input sequence at decoding, and the other represents the length of the output sequence. The forward attention map stores the influence weight of every input-sequence state at each decoding time (each position of the output sequence). The formulas are:
e_{i,j} = U \tanh(W s_{i-1} + V f_j)    (1)

\alpha_{i,j} = \frac{\exp(e_{i,j})}{\sum_{k} \exp(e_{i,k})}    (2)

In the above formulas, U, W, and V denote fully connected computations, j denotes a moment (position) of the input sequence, and i denotes a moment of the output. Formula (1) expresses the relation between the i-th time step of the output sequence and the j-th feature of the input sequence; s_{i-1} is the hidden state at moment i-1 (s_0 is converted from fw, and subsequent states are computed according to the decoding process), and f_j is the feature at each position of F. Formula (2) is a softmax computation; \alpha_{i,j} is the forward attention map and represents the weight that the j-th feature of the input sequence contributes to the decoding of the i-th time step.
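A minimal sketch of formulas (1) and (2) with illustrative dimensions; the extra fully connected layer over the concatenated fw and bw from step S601 is omitted here for brevity, and all sizes are assumptions.

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """e_{i,j} = U tanh(W s_{i-1} + V f_j); alpha_{i,j} = softmax_j(e_{i,j})."""
    def __init__(self, state_dim=256, feat_dim=512, attn_dim=256):
        super().__init__()
        self.W = nn.Linear(state_dim, attn_dim)   # acts on the decoder state s_{i-1}
        self.V = nn.Linear(feat_dim, attn_dim)    # acts on each feature f_j of F
        self.U = nn.Linear(attn_dim, 1)

    def forward(self, s_prev, F):                 # s_prev: (N, state_dim); F: (N, T, feat_dim)
        e = self.U(torch.tanh(self.W(s_prev).unsqueeze(1) + self.V(F)))  # (N, T, 1)
        return torch.softmax(e.squeeze(-1), dim=1)                       # alpha: (N, T)


fw, bw = torch.randn(1, 128, 256), torch.randn(1, 128, 256)
F = torch.cat([fw, bw], dim=-1)                   # hidden state vector F (the FC of S601 omitted)
alpha = AdditiveAttention()(torch.randn(1, 256), F)
print(alpha.shape)                                # torch.Size([1, 128]); weights along dim 1 sum to 1
```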
And S603, generating a backward attention map according to the backward coding feature sequence and the hidden state vector.
Specifically, the specific process of this step is similar to the specific process of step S602 described above, and will not be repeated here.
Further, as shown in fig. 7, step S502 "in the embodiment shown in fig. 5, decoding the forward attention map and the backward attention map to obtain a forward decoding result and a backward decoding result" may specifically include the following steps:
s701, calculating to obtain the output result of the current time step according to the output result of the previous time step in the decoding process, the hiding state of the previous time step of the hiding state vector and the weight value of the current time step in the forward attention map, and stringing the output result of each time step to obtain the forward decoding result.
Specifically, to calculate the output of the i-th time step, the influence of the entire hidden state vector F on the current decoding step is first calculated, i.e.

c_i = \sum_{j} \alpha_{i,j} f_j    (3)

where c_i can be regarded as the picture feature fed in when decoding the output of the i-th time step. The output y_{i-1} of the (i-1)-th time step and the hidden state s_{i-1} are then brought into the calculation together, formulated as:

s_i = \tanh(A[y_{i-1}, s_{i-1}, c_i] + b)    (4)

This yields the hidden state s_i of the i-th time step, and the output y_i of the i-th time step is then obtained through calculation by the fully connected layer.
S702, calculating the output result of the current time step according to the output result of the previous time step in the decoding process, the hidden state of the previous time step, the hidden state vector, and the weight values of the current time step in the backward attention map, and concatenating the output results of all time steps into a string to obtain the backward decoding result.
Specifically, the specific process of this step is similar to the specific process of step S701 described above, and will not be repeated here.
Further, as shown in fig. 8, step S304 "in the embodiment shown in fig. 3, to fuse the forward decoding result and the backward decoding result, to obtain the target formula" may specifically include the following steps:
s801, acquiring the forward decoding result and the backward decoding result, and the editing operation and the character corresponding to the editing operation required when the editing distance is minimum.
Specifically, the edit distance is the minimum number of character-level operations, namely insertion, replacement, and deletion, needed to make two strings equal, and it is generally computed by dynamic programming. While the edit distance is being computed, the process of turning the forward (or backward) decoding result into the backward (or forward) decoding result through insertion, replacement, and deletion steps is recorded, including each editing operation and the character corresponding to it.
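A generic sketch of computing the edit distance by dynamic programming while recording the operations, as described above; the function name edit_ops is illustrative and this is standard Levenshtein backtracking, not the exact implementation of the embodiment.

```python
def edit_ops(src: str, dst: str):
    """Return the Levenshtein distance from src to dst and the list of
    keep/replace/delete/insert operations realizing it."""
    m, n = len(src), len(dst)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if src[i - 1] == dst[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # delete src[i-1]
                          d[i][j - 1] + 1,         # insert dst[j-1]
                          d[i - 1][j - 1] + cost)  # keep or replace
    ops, i, j = [], m, n                           # backtrace the chosen operations
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (src[i - 1] != dst[j - 1]):
            ops.append(("keep" if src[i - 1] == dst[j - 1] else "replace", i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append(("delete", i - 1, None))
            i -= 1
        else:
            ops.append(("insert", None, j - 1))
            j -= 1
    return d[m][n], list(reversed(ops))


print(edit_ops("hopply", "nappy"))   # distance 3: two replacements and one deletion
```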
S802, fusing a forward decoding result and a backward decoding result according to the editing operation and the confidence coefficient of the characters corresponding to the editing operation to obtain a target formula.
Specifically, according to the editing operation and the confidence coefficient of the character corresponding to the editing operation, the character corresponding to the editing operation is operated according to a certain rule, so that the fusion of the forward decoding result and the backward decoding result is realized, and a target formula is obtained.
Further, as shown in fig. 9, step S802 "in the embodiment shown in fig. 8, according to the editing operation and the confidence level of the character corresponding to the editing operation, fuses the forward decoding result and the backward decoding result to obtain the target formula" may specifically include the following steps:
s901, editing operation is insertion operation, and if the confidence coefficient of the character corresponding to the insertion operation is larger than the average confidence coefficient of the forward decoding result or the backward decoding result, writing the character corresponding to the insertion operation into a target formula. Otherwise, the character corresponding to the insert operation is not written into the target formula.
S902, the editing operation is a deleting operation, and if the confidence coefficient of the character corresponding to the deleting operation is not smaller than the average confidence coefficient of the forward decoding result or the backward decoding result or is not smaller than a preset confidence coefficient threshold value, the character corresponding to the deleting operation is written into the target formula. Otherwise, the character corresponding to the deleting operation is not written into the target formula.
S903, if the editing operation is a replacement operation, writing the character with larger confidence in the two characters corresponding to the replacement operation into the target formula.
S904, writing characters which do not correspond to the editing operation into the target formula.
Specifically, writing the original character which is not subjected to editing operation into a new character string to obtain a target formula.
For example, the forward character string in the forward decoding result is nappy, and the confidences of its five characters are 0.7, 0.8, 0.9, 0.8, 0.9; the backward character string in the backward decoding result is hopply, and the confidences of its six characters are 0.9, 0.7, 0.8, 0.8, 0.6, 0.9. The minimum number of editing operations between the backward character string and the forward character string is 3: a replacement between n and h, a replacement between a and o, and a deletion of l. The new character string is then obtained by deciding each of the 3 operations:
(1) For the replacement between n and h: the confidence of h in the backward character string is 0.9 and the confidence of n in the forward character string is 0.7; since 0.9 > 0.7, the character at this position of the new character string is determined to be h.
(2) Similarly, for the replacement between a and o: the confidence of o in the backward character string is 0.7 and the confidence of a in the forward character string is 0.8; since 0.8 > 0.7, the character at this position is determined to be a.
(3) For the deletion of l: the average confidence of the six characters of the backward character string is 0.78, and the confidence of l in the backward character string is 0.6; since 0.6 < 0.78 and 0.6 < 0.7 (the preset confidence threshold), l is not written into the new character string. The other positions are consistent with the original character strings, so the new character string obtained by the final fusion is happy.
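A toy sketch that replays the fusion rules of steps S901 to S904 on this example; the three edit operations are written out by hand as produced by the edit-distance step, and the exact comparison rules reflect one reading of the rules above rather than the reference implementation.

```python
forward,  f_conf = "nappy",  [0.7, 0.8, 0.9, 0.8, 0.9]
backward, b_conf = "hopply", [0.9, 0.7, 0.8, 0.8, 0.6, 0.9]
b_avg, threshold = sum(b_conf) / len(b_conf), 0.7          # 0.78 and the preset threshold

# Operations keyed by backward-string index:
# ("replace", forward index) chooses between two candidate characters,
# ("delete",) keeps or drops a character present only in the backward string.
ops = {0: ("replace", 0), 1: ("replace", 1), 4: ("delete",)}

fused = []
for bi, ch in enumerate(backward):
    op = ops.get(bi)
    if op is None:                                         # S904: untouched characters
        fused.append(ch)
    elif op[0] == "replace":                               # S903: keep the more confident one
        fi = op[1]
        fused.append(ch if b_conf[bi] > f_conf[fi] else forward[fi])
    elif op[0] == "delete":                                # S902: keep only if confident enough
        if b_conf[bi] >= b_avg or b_conf[bi] >= threshold:
            fused.append(ch)
# S901 (insertions, i.e. characters only in the forward result) does not occur here.

print("".join(fused))                                      # happy
```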
In summary, according to the optical character recognition method of the embodiment of the present disclosure, a target formula area in a picture to be identified is obtained, features are extracted from the picture in the target formula area to obtain a first feature map with a preset height greater than 1, the first feature map is expanded to obtain target features, and a target formula is generated according to the target features. By adding the binary-picture channel, the target formula area is located more accurately; by obtaining a first feature map with a height greater than 1 from the picture in the target formula area, the picture information in the vertical-axis dimension is preserved; and by adopting bidirectional decoding, the problem that a prediction error on one character in the middle of unidirectional decoding affects the prediction of subsequent characters is avoided. This reduces the probability of false detection when recognizing formulas in optical character recognition scenes, improves the recognition of formulas, and better solves the problem of recognizing formulas in picture scenes.
In order to clearly illustrate the optical character recognition method of the embodiment of the present disclosure, the optical character recognition method of the embodiment of the present disclosure will be described in detail with reference to fig. 10 to 11.
Fig. 10 is a schematic diagram of a detection stage in an optical character recognition method according to an embodiment of the present disclosure, as shown in fig. 10, in the detection stage, a picture to be recognized and a binary picture generated according to the picture to be recognized are input into a convolutional neural network CNN to obtain a second feature map, the second feature map is predicted to obtain a fractional feature map and an offset feature map, and the fractional feature map and the offset feature map are subjected to non-maximum suppression NMS processing to obtain a target formula area.
Fig. 11 is a schematic diagram of the recognition stage in an optical character recognition method according to an embodiment of the present disclosure. As shown in fig. 11, in the recognition stage, the picture in the target formula area obtained in the detection stage is input into a convolutional neural network CNN to obtain the first feature map, and the first feature map is expanded to obtain the sequential feature sequence. The sequential feature sequence and the reverse sequence feature sequence obtained from it are input into a gated recurrent unit GRU network to obtain the sequential coding feature sequence and the reverse sequence coding feature sequence, which are passed through a fully connected network to generate the forward coding feature sequence fw and the backward coding feature sequence bw. A forward attention map is generated from the forward coding feature sequence fw and the hidden state vector F, and a backward attention map is generated from the backward coding feature sequence bw and the hidden state vector F. The two attention maps are decoded to obtain the forward decoding result and the backward decoding result respectively, and the forward decoding result and the backward decoding result are fused to obtain the target formula.
Fig. 12 is a block diagram of an optical character recognition device according to a first embodiment of the present disclosure.
As shown in fig. 12, an optical character recognition apparatus 1200 of an embodiment of the present disclosure includes: an acquisition module 1201, an extraction module 1202, an expansion module 1203, and a generation module 1204.
An obtaining module 1201 is configured to obtain a target formula area in the picture to be identified.
The extracting module 1202 is configured to perform feature extraction on a picture in the target formula area to obtain a first feature map with a preset height, where the preset height is greater than 1.
The expansion module 1203 is configured to expand the first feature map to obtain the target feature.
A generating module 1204, configured to generate a target formula according to the target feature.
It should be noted that the explanation of the embodiment of the optical character recognition method is also applicable to the optical character recognition device of the embodiment of the disclosure, and the specific process is not repeated here.
In summary, the optical character recognition device of the embodiment of the present disclosure obtains a target formula area in a picture to be identified, extracts features from the picture in the target formula area to obtain a first feature map with a preset height, expands the first feature map to obtain target features, and generates a target formula according to the target features. By determining the target formula area and obtaining a first feature map with a height greater than 1 from the picture in that area, the picture information in the vertical-axis dimension is preserved, the recognition of formulas in optical character recognition scenes is improved, and the problem of recognizing formulas in picture scenes is better solved.
Fig. 13 is a block diagram of an optical character recognition device according to a second embodiment of the present disclosure.
As shown in fig. 13, an optical character recognition device 1300 according to an embodiment of the disclosure may specifically include: an acquisition module 1301, an extraction module 1302, an expansion module 1303, and a generation module 1304.
Wherein the acquiring module 1301 has the same function and structure as the acquiring module 1201 in the above embodiment, the extracting module 1302 has the same function and structure as the extracting module 1202 in the above embodiment, the expanding module 1303 has the same function and structure as the expanding module 1203 in the above embodiment, and the generating module 1304 has the same function and structure as the generating module 1204 in the above embodiment.
The obtaining module 1301 may specifically include: a fourth generation sub-module 1305, an extraction sub-module 1306, a fifth generation sub-module 1307, and a processing sub-module 1308.
And the fourth generation submodule 1305 is used for generating a binary image according to the image to be identified, wherein the binary image comprises a text area and a formula area.
The extracting sub-module 1306 is configured to perform feature extraction on the picture to be identified and the binary picture, so as to obtain a second feature map.
A fifth generation sub-module 1307 is configured to generate a fractional feature map and an offset feature map from the second feature map.
A processing sub-module 1308 is configured to perform non-maximum suppression processing on the score feature map and the offset feature map, so as to obtain a target formula area.
Further, the generating module 1304 may specifically include: the first generation submodule is used for generating a sequential characteristic sequence and an inverse sequence characteristic sequence according to the target characteristic; the second generation submodule is used for generating a forward coding characteristic sequence and a backward coding characteristic sequence according to the sequence characteristic sequence and the reverse sequence characteristic sequence; the third generation sub-module is used for generating a forward decoding result and a backward decoding result according to the forward coding feature sequence and the backward coding feature sequence; and the fusion sub-module is used for fusing the forward decoding result and the backward decoding result to obtain a target formula.
Further, the second generating submodule may specifically include: the first input unit is used for respectively inputting the sequential characteristic sequence and the reverse sequence characteristic sequence into the gating circulation unit network to obtain a sequential coding characteristic sequence and a reverse sequence coding characteristic sequence; and the second input unit is used for respectively inputting the sequence coding characteristic sequence and the reverse sequence coding characteristic sequence into the fully-connected network to obtain a forward coding characteristic sequence and a backward coding characteristic sequence.
Further, the third generating submodule may specifically include: a generation unit for generating a forward attention map and a backward attention map from the forward coding feature sequence and the backward coding feature sequence; and a decoding unit for decoding the forward attention map and the backward attention map respectively to obtain a forward decoding result and a backward decoding result.
Further, the generating unit may specifically include: a first generation subunit, configured to generate a hidden state vector according to the forward coding feature sequence and the backward coding feature sequence; a second generation subunit for generating a forward attention map from the forward coding feature sequence and the hidden state vector; and a third generation subunit for generating a backward attention map from the backward coding feature sequence and the hidden state vector.
Further, the decoding unit may specifically include: a first calculating subunit for calculating the output result of the current time step according to the output result of the previous time step in the decoding process, the hidden state of the previous time step, the hidden state vector, and the weight values of the current time step in the forward attention map, and concatenating the output results of all time steps into a string to obtain the forward decoding result; and a second calculating subunit for calculating the output result of the current time step according to the output result of the previous time step in the decoding process, the hidden state of the previous time step, the hidden state vector, and the weight values of the current time step in the backward attention map, and concatenating the output results of all time steps into a string to obtain the backward decoding result.
Further, the fusion submodule may specifically include: an acquisition unit for acquiring the editing operations, and the characters corresponding to the editing operations, that are required when the edit distance between the forward decoding result and the backward decoding result is minimum; and a fusion unit for fusing the forward decoding result and the backward decoding result according to the editing operations and the confidences of the characters corresponding to the editing operations, to obtain the target formula.
Further, the fusion unit may specifically include: a first writing subunit for, when the editing operation is an insertion operation, writing the character corresponding to the insertion operation into the target formula if its confidence is greater than the average confidence of the forward decoding result or the backward decoding result; a second writing subunit for, when the editing operation is a deletion operation, writing the character corresponding to the deletion operation into the target formula if its confidence is not less than the average confidence of the forward decoding result or the backward decoding result, or not less than a preset confidence threshold; a third writing subunit for, when the editing operation is a replacement operation, writing the character with the higher confidence of the two characters corresponding to the replacement operation into the target formula; and a fourth writing subunit for writing the characters that do not correspond to any editing operation into the target formula.
It should be noted that the explanation of the embodiment of the optical character recognition method is also applicable to the optical character recognition device of the embodiment of the disclosure, and the specific process is not repeated here.
In summary, the optical character recognition device of the embodiment of the present disclosure obtains a target formula area in a picture to be identified, extracts features from the picture in the target formula area to obtain a first feature map with a preset height, expands the first feature map to obtain target features, and generates a target formula according to the target features. By adding the binary-picture channel, the target formula area is located more accurately; by obtaining a first feature map with a height greater than 1 from the picture in the target formula area, the picture information in the vertical-axis dimension is preserved; and by adopting bidirectional decoding, the problem that a prediction error on one character in the middle of unidirectional decoding affects the prediction of subsequent characters is avoided. This reduces the probability of false detection when recognizing formulas in optical character recognition scenes, improves the recognition of formulas, and better solves the problem of recognizing formulas in picture scenes.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 14 shows a schematic block diagram of an example electronic device 1400 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 14, the electronic device 1400 includes a computing unit 1401 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 1402 or a computer program loaded from a storage unit 1408 into a Random Access Memory (RAM) 1403. In the RAM 1403, various programs and data required for the operation of the electronic device 1400 can also be stored. The computing unit 1401, the ROM 1402, and the RAM 1403 are connected to each other through a bus 1404. An input/output (I/O) interface 1405 is also connected to the bus 1404.
A number of components in electronic device 1400 are connected to I/O interface 1405, including: an input unit 1406 such as a keyboard, a mouse, or the like; an output unit 1407 such as various types of displays, speakers, and the like; a storage unit 1408 such as a magnetic disk, an optical disk, or the like; and a communication unit 1409 such as a network card, a modem, a wireless communication transceiver, and the like. The communication unit 1409 allows the electronic device 1400 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunications networks.
The computing unit 1401 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 1401 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 1401 performs the respective methods and processes described above, such as the optical character recognition method described in fig. 1 to 11. For example, in some embodiments, the optical character recognition method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as storage unit 1408. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 1400 via the ROM 1402 and/or the communication unit 1409. When a computer program is loaded into RAM 1403 and executed by computing unit 1401, one or more steps of the optical character recognition method described above may be performed. Alternatively, in other embodiments, the computing unit 1401 may be configured to perform the optical character recognition method by any other suitable means (e.g. by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuit systems, field Programmable Gate Arrays (FPGAs), application Specific Integrated Circuits (ASICs), application Specific Standard Products (ASSPs), systems On Chip (SOCs), load programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs, the one or more computer programs may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include local area networks (LANs), wide area networks (WANs), and the Internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server (also called a cloud computing server or cloud host), a host product in a cloud computing service system that overcomes the drawbacks of difficult management and weak service scalability found in traditional physical hosts and VPS ("Virtual Private Server") services. The server may also be a server of a distributed system, or a server combined with a blockchain.
According to an embodiment of the present disclosure, the present disclosure further provides a computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the optical character recognition method according to the above-described embodiments of the present disclosure.
It should be appreciated that the various flows shown above may be used, with steps reordered, added, or deleted. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, as long as the desired results of the technical solutions disclosed herein can be achieved; no limitation is imposed here.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (16)

1. An optical character recognition method comprising:
acquiring a target formula area in a picture to be recognized;
extracting features from the picture in the target formula area to obtain a first feature map with a preset height, wherein the preset height is greater than 1;
expanding the first feature map to obtain a target feature; and
generating a target formula according to the target feature; wherein the generating a target formula according to the target feature comprises:
generating a sequential feature sequence and a reverse-order feature sequence according to the target feature;
generating a forward encoding feature sequence and a backward encoding feature sequence according to the sequential feature sequence and the reverse-order feature sequence;
generating a forward decoding result and a backward decoding result according to the forward encoding feature sequence and the backward encoding feature sequence; and
fusing the forward decoding result and the backward decoding result to obtain the target formula;
wherein the fusing the forward decoding result and the backward decoding result to obtain the target formula comprises:
acquiring an editing operation required when the edit distance between the forward decoding result and the backward decoding result is minimum, and a character corresponding to the editing operation; and
fusing the forward decoding result and the backward decoding result according to the editing operation and the confidence of the character corresponding to the editing operation, to obtain the target formula.
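As a non-limiting illustration of the alignment step recited in claim 1, the following Python sketch shows one way the editing operations and their characters could be recovered once the forward and backward decoding results are available as strings. The claim does not prescribe an algorithm; a standard Levenshtein dynamic program with a backtrace is assumed here, and all names are illustrative.

```python
def edit_operations(forward: str, backward: str):
    """Backtrace a Levenshtein table to list the operations (and the characters
    they touch) that turn `forward` into `backward` at minimum edit distance.
    Illustrative only; the patent does not fix a particular alignment algorithm."""
    n, m = len(forward), len(backward)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if forward[i - 1] == backward[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # delete from forward
                             dist[i][j - 1] + 1,         # insert from backward
                             dist[i - 1][j - 1] + cost)  # match / replace
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] and forward[i - 1] == backward[j - 1]:
            ops.append(("match", forward[i - 1], backward[j - 1])); i, j = i - 1, j - 1
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            ops.append(("replace", forward[i - 1], backward[j - 1])); i, j = i - 1, j - 1
        elif j > 0 and dist[i][j] == dist[i][j - 1] + 1:
            ops.append(("insert", None, backward[j - 1])); j -= 1
        else:
            ops.append(("delete", forward[i - 1], None)); i -= 1
    return dist[n][m], list(reversed(ops))
```

The returned operation list can then be paired with per-character confidences from the decoders and fed to the confidence-based fusion illustrated after claim 6.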
2. The recognition method of claim 1, wherein the generating a forward encoding feature sequence and a backward encoding feature sequence according to the sequential feature sequence and the reverse-order feature sequence comprises:
inputting the sequential feature sequence and the reverse-order feature sequence respectively into a gated recurrent unit (GRU) network to obtain a sequential encoding feature sequence and a reverse-order encoding feature sequence; and
inputting the sequential encoding feature sequence and the reverse-order encoding feature sequence respectively into a fully-connected network to obtain the forward encoding feature sequence and the backward encoding feature sequence.
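A minimal sketch of the encoding step in claim 2, assuming PyTorch and assuming the GRU and fully-connected weights are shared between the two directions (the claim leaves this open); the layer sizes are placeholders, not values from the patent.

```python
import torch
import torch.nn as nn

class BidirectionalEncoder(nn.Module):
    """Illustrative encoder: a GRU over the sequential and reverse-order
    feature sequences followed by a fully-connected projection."""
    def __init__(self, feat_dim: int = 512, hidden_dim: int = 256):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, target_feature: torch.Tensor):
        # target_feature: (batch, seq_len, feat_dim), the expanded first feature map
        forward_seq = target_feature                          # sequential feature sequence
        backward_seq = torch.flip(target_feature, dims=[1])   # reverse-order feature sequence
        fwd_enc, _ = self.gru(forward_seq)
        bwd_enc, _ = self.gru(backward_seq)
        # forward / backward encoding feature sequences
        return self.fc(fwd_enc), self.fc(bwd_enc)
```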
3. The recognition method of claim 1, wherein the generating a forward decoding result and a backward decoding result according to the forward encoding feature sequence and the backward encoding feature sequence comprises:
generating a forward attention map and a backward attention map according to the forward encoding feature sequence and the backward encoding feature sequence; and
decoding the forward attention map and the backward attention map respectively to obtain the forward decoding result and the backward decoding result.
4. The recognition method of claim 3, wherein the generating a forward attention map and a backward attention map according to the forward encoding feature sequence and the backward encoding feature sequence comprises:
generating a hidden state vector according to the forward encoding feature sequence and the backward encoding feature sequence;
generating the forward attention map according to the forward encoding feature sequence and the hidden state vector; and
generating the backward attention map according to the backward encoding feature sequence and the hidden state vector.
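An illustrative sketch of claim 4's attention-map generation, assuming PyTorch and an additive (Bahdanau-style) scoring function; the claims do not specify how the encoding features and the hidden state vector are combined, so this is only one plausible choice, and all dimensions are placeholders.

```python
import torch
import torch.nn as nn

class AttentionMap(nn.Module):
    """Illustrative additive attention: scores each encoding time step against
    a hidden state vector and normalizes the scores into attention weights."""
    def __init__(self, enc_dim: int = 256, hid_dim: int = 256, attn_dim: int = 128):
        super().__init__()
        self.enc_proj = nn.Linear(enc_dim, attn_dim)
        self.hid_proj = nn.Linear(hid_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, encoding_seq: torch.Tensor, hidden_state: torch.Tensor):
        # encoding_seq: (batch, seq_len, enc_dim); hidden_state: (batch, hid_dim)
        energy = torch.tanh(self.enc_proj(encoding_seq) + self.hid_proj(hidden_state).unsqueeze(1))
        weights = torch.softmax(self.score(energy).squeeze(-1), dim=1)  # (batch, seq_len)
        return weights  # one row of the forward or backward attention map
```

Applying the same module to the forward and to the backward encoding feature sequence (with the shared hidden state vector) would yield the forward and backward attention maps, respectively.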
5. The recognition method of claim 4, wherein the decoding the forward attention map and the backward attention map respectively to obtain the forward decoding result and the backward decoding result comprises:
calculating an output result of the current time step according to the output result of the previous time step in the decoding process, the hidden state of the previous time step in the hidden state vector, and the weight value of the current time step in the forward attention map, and concatenating the output results of all time steps to obtain the forward decoding result; and
calculating an output result of the current time step according to the output result of the previous time step in the decoding process, the hidden state of the previous time step in the hidden state vector, and the weight value of the current time step in the backward attention map, and concatenating the output results of all time steps to obtain the backward decoding result.
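A sketch of a single decoding step of claim 5, assuming PyTorch: the attention weights of the current time step first pool the encoding feature sequence into a context vector, which, together with the previous output and previous hidden state, yields the current output. The vocabulary size, dimensions, and the use of a GRU cell are assumptions.

```python
import torch
import torch.nn as nn

class AttentionDecoderStep(nn.Module):
    """Illustrative single decoding step: previous output + previous hidden
    state + current attention weights -> current output."""
    def __init__(self, enc_dim: int = 256, hid_dim: int = 256, vocab_size: int = 1000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hid_dim)
        self.cell = nn.GRUCell(enc_dim + hid_dim, hid_dim)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, prev_output, prev_hidden, attn_weights, encoding_seq):
        # prev_output: (batch,) token ids; prev_hidden: (batch, hid_dim)
        # attn_weights: (batch, seq_len), one row of the attention map
        # encoding_seq: (batch, seq_len, enc_dim)
        context = torch.bmm(attn_weights.unsqueeze(1), encoding_seq).squeeze(1)
        hidden = self.cell(torch.cat([self.embed(prev_output), context], dim=-1), prev_hidden)
        logits = self.out(hidden)
        return logits.argmax(dim=-1), logits.softmax(dim=-1), hidden
```

Running this step over all time steps and concatenating the per-step outputs would give the forward (or backward) decoding result, with the softmax probabilities retained as per-character confidences for the fusion of claim 6.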
6. The recognition method of claim 1, wherein the fusing the forward decoding result and the backward decoding result according to the editing operation and the confidence of the character corresponding to the editing operation to obtain the target formula comprises:
if the editing operation is an insertion operation and the confidence of the character corresponding to the insertion operation is greater than the average confidence of the forward decoding result or the backward decoding result, writing the character corresponding to the insertion operation into the target formula;
if the editing operation is a deletion operation and the confidence of the character corresponding to the deletion operation is not less than the average confidence of the forward decoding result or the backward decoding result, or is not less than a preset confidence threshold, writing the character corresponding to the deletion operation into the target formula;
if the editing operation is a replacement operation, writing the character with the larger confidence of the two characters corresponding to the replacement operation into the target formula; and
writing characters which do not correspond to any editing operation into the target formula.
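Continuing the illustration, the rules of claim 6 map directly onto the aligned operation list; the dictionary layout, the choice of average confidence, and the 0.9 threshold below are assumptions, not values taken from the patent.

```python
def fuse_results(ops, avg_conf, conf_threshold=0.9):
    """Illustrative fusion following claim 6. Each item in `ops` is a dict with
    an 'op' key ('match', 'insert', 'delete' or 'replace') plus the character(s)
    and confidence(s) it touches; `avg_conf` is the average confidence of the
    forward (or backward) decoding result."""
    target = []
    for op in ops:
        kind = op["op"]
        if kind == "match":                       # character outside any editing operation
            target.append(op["char"])
        elif kind == "insert":                    # present only in the backward result
            if op["conf"] > avg_conf:
                target.append(op["char"])
        elif kind == "delete":                    # present only in the forward result
            if op["conf"] >= avg_conf or op["conf"] >= conf_threshold:
                target.append(op["char"])
        elif kind == "replace":                   # keep the more confident candidate
            target.append(op["fwd_char"] if op["fwd_conf"] >= op["bwd_conf"] else op["bwd_char"])
    return "".join(target)
```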
7. The recognition method of claim 1, wherein the acquiring a target formula area in a picture to be recognized comprises:
generating a binary image according to the picture to be recognized, wherein the binary image comprises a text area and a formula area;
extracting features from the picture to be recognized and the binary image to obtain a second feature map;
generating a score feature map and an offset feature map according to the second feature map; and
performing non-maximum suppression on the score feature map and the offset feature map to obtain the target formula area.
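A sketch of the last two steps of claim 7, assuming the score feature map is a per-pixel probability of belonging to a formula and the offset feature map holds per-pixel distances to the four box sides (a common layout in detection heads, but not stated in the claim); the thresholds are placeholders.

```python
import numpy as np

def decode_formula_regions(score_map, offset_map, score_thresh=0.7, iou_thresh=0.3):
    """Illustrative decoding: pixels above a score threshold propose boxes via
    the offset map, then greedy non-maximum suppression keeps the best boxes."""
    ys, xs = np.where(score_map > score_thresh)
    boxes, scores = [], []
    for y, x in zip(ys, xs):
        top, bottom, left, right = offset_map[:, y, x]   # assumed 4-channel layout
        boxes.append([x - left, y - top, x + right, y + bottom])
        scores.append(score_map[y, x])
    boxes, scores = np.array(boxes), np.array(scores)
    keep = []
    order = scores.argsort()[::-1]
    while order.size > 0:
        i = order[0]
        keep.append(i)
        if order.size == 1:
            break
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter + 1e-6)
        order = order[1:][iou < iou_thresh]
    return boxes[keep], scores[keep]
```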
8. An optical character recognition device comprising:
an acquisition module configured to acquire a target formula area in a picture to be recognized;
an extraction module configured to extract features from the picture in the target formula area to obtain a first feature map with a preset height, wherein the preset height is greater than 1;
an expansion module configured to expand the first feature map to obtain a target feature; and
a generation module configured to generate a target formula according to the target feature;
wherein the generation module comprises:
a first generation submodule configured to generate a sequential feature sequence and a reverse-order feature sequence according to the target feature;
a second generation submodule configured to generate a forward encoding feature sequence and a backward encoding feature sequence according to the sequential feature sequence and the reverse-order feature sequence;
a third generation submodule configured to generate a forward decoding result and a backward decoding result according to the forward encoding feature sequence and the backward encoding feature sequence; and
a fusion submodule configured to fuse the forward decoding result and the backward decoding result to obtain the target formula;
wherein the fusion submodule comprises:
an acquisition unit configured to acquire an editing operation required when the edit distance between the forward decoding result and the backward decoding result is minimum, and a character corresponding to the editing operation; and
a fusion unit configured to fuse the forward decoding result and the backward decoding result according to the editing operation and the confidence of the character corresponding to the editing operation, to obtain the target formula.
9. The recognition device of claim 8, wherein the second generation submodule comprises:
a first input unit configured to input the sequential feature sequence and the reverse-order feature sequence respectively into a gated recurrent unit (GRU) network to obtain a sequential encoding feature sequence and a reverse-order encoding feature sequence; and
a second input unit configured to input the sequential encoding feature sequence and the reverse-order encoding feature sequence respectively into a fully-connected network to obtain the forward encoding feature sequence and the backward encoding feature sequence.
10. The recognition device of claim 8, wherein the third generation submodule comprises:
a generation unit configured to generate a forward attention map and a backward attention map according to the forward encoding feature sequence and the backward encoding feature sequence; and
a decoding unit configured to decode the forward attention map and the backward attention map respectively to obtain the forward decoding result and the backward decoding result.
11. The recognition device of claim 10, wherein the generation unit comprises:
a first generation subunit configured to generate a hidden state vector according to the forward encoding feature sequence and the backward encoding feature sequence;
a second generation subunit configured to generate the forward attention map according to the forward encoding feature sequence and the hidden state vector; and
a third generation subunit configured to generate the backward attention map according to the backward encoding feature sequence and the hidden state vector.
12. The recognition device of claim 11, wherein the decoding unit comprises:
a first calculation subunit configured to calculate an output result of the current time step according to the output result of the previous time step in the decoding process, the hidden state of the previous time step in the hidden state vector, and the weight value of the current time step in the forward attention map, and to concatenate the output results of all time steps to obtain the forward decoding result; and
a second calculation subunit configured to calculate an output result of the current time step according to the output result of the previous time step in the decoding process, the hidden state of the previous time step in the hidden state vector, and the weight value of the current time step in the backward attention map, and to concatenate the output results of all time steps to obtain the backward decoding result.
13. The recognition device of claim 8, wherein the fusion unit comprises:
a first writing subunit configured to write the character corresponding to an insertion operation into the target formula when the editing operation is the insertion operation and the confidence of the character corresponding to the insertion operation is greater than the average confidence of the forward decoding result or the backward decoding result;
a second writing subunit configured to write the character corresponding to a deletion operation into the target formula when the editing operation is the deletion operation and the confidence of the character corresponding to the deletion operation is not less than the average confidence of the forward decoding result or the backward decoding result, or is not less than a preset confidence threshold;
a third writing subunit configured to write, when the editing operation is a replacement operation, the character with the larger confidence of the two characters corresponding to the replacement operation into the target formula; and
a fourth writing subunit configured to write characters which do not correspond to any editing operation into the target formula.
14. The recognition device of claim 8, wherein the acquisition module comprises:
a fourth generation submodule configured to generate a binary image according to the picture to be recognized, wherein the binary image comprises a text area and a formula area;
an extraction submodule configured to extract features from the picture to be recognized and the binary image to obtain a second feature map;
a fifth generation submodule configured to generate a score feature map and an offset feature map according to the second feature map; and
a processing submodule configured to perform non-maximum suppression on the score feature map and the offset feature map to obtain the target formula area.
15. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the optical character recognition method of any one of claims 1-7.
16. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the optical character recognition method according to any one of claims 1-7.
CN202110270866.4A 2021-03-12 2021-03-12 Optical character recognition method, device, electronic equipment and storage medium Active CN113052156B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110270866.4A CN113052156B (en) 2021-03-12 2021-03-12 Optical character recognition method, device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110270866.4A CN113052156B (en) 2021-03-12 2021-03-12 Optical character recognition method, device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113052156A CN113052156A (en) 2021-06-29
CN113052156B true CN113052156B (en) 2023-08-04

Family

ID=76512085

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110270866.4A Active CN113052156B (en) 2021-03-12 2021-03-12 Optical character recognition method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113052156B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114693814B (en) * 2022-03-31 2024-04-30 北京字节跳动网络技术有限公司 Decoding method, text recognition method, device, medium and equipment for model

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103959282A (en) * 2011-09-28 2014-07-30 谷歌公司 Selective feedback for text recognition systems
CN110942057A (en) * 2018-09-25 2020-03-31 杭州海康威视数字技术股份有限公司 Container number identification method and device and computer equipment
CN111428593A (en) * 2020-03-12 2020-07-17 北京三快在线科技有限公司 Character recognition method and device, electronic equipment and storage medium
CN111539410A (en) * 2020-04-16 2020-08-14 深圳市商汤科技有限公司 Character recognition method and device, electronic equipment and storage medium
CN111967391A (en) * 2020-08-18 2020-11-20 清华大学 Text recognition method and computer-readable storage medium for medical laboratory test reports
CN112200194A (en) * 2020-12-08 2021-01-08 北京易真学思教育科技有限公司 Formula identification method and device, electronic equipment and storage medium
CN112307820A (en) * 2019-07-29 2021-02-02 北京易真学思教育科技有限公司 Text recognition method, device, equipment and computer readable medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10949701B2 (en) * 2018-11-02 2021-03-16 Iflytek Co., Ltd. Method, apparatus and storage medium for recognizing character

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103959282A (en) * 2011-09-28 2014-07-30 谷歌公司 Selective feedback for text recognition systems
CN110942057A (en) * 2018-09-25 2020-03-31 杭州海康威视数字技术股份有限公司 Container number identification method and device and computer equipment
CN112307820A (en) * 2019-07-29 2021-02-02 北京易真学思教育科技有限公司 Text recognition method, device, equipment and computer readable medium
CN111428593A (en) * 2020-03-12 2020-07-17 北京三快在线科技有限公司 Character recognition method and device, electronic equipment and storage medium
CN111539410A (en) * 2020-04-16 2020-08-14 深圳市商汤科技有限公司 Character recognition method and device, electronic equipment and storage medium
CN111967391A (en) * 2020-08-18 2020-11-20 清华大学 Text recognition method and computer-readable storage medium for medical laboratory test reports
CN112200194A (en) * 2020-12-08 2021-01-08 北京易真学思教育科技有限公司 Formula identification method and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Hagit Shatkay et al.; "OCR-based image features for biomedical image and article classification: identifying documents relevant to cis-regulatory elements"; Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine; full text *

Also Published As

Publication number Publication date
CN113052156A (en) 2021-06-29

Similar Documents

Publication Publication Date Title
CN110276316B (en) Human body key point detection method based on deep learning
JP2022515620A (en) Image area recognition method by artificial intelligence, model training method, image processing equipment, terminal equipment, server, computer equipment and computer program
CN112949415B (en) Image processing method, apparatus, device and medium
WO2019075130A1 (en) Image processing method and processing device
CN108734210B (en) Object detection method based on cross-modal multi-scale feature fusion
CN111563502A (en) Image text recognition method and device, electronic equipment and computer storage medium
CN112528976A (en) Text detection model generation method and text detection method
CN112036260B (en) Expression recognition method and system for multi-scale sub-block aggregation in natural environment
CN115359383B (en) Cross-modal feature extraction and retrieval and model training method, device and medium
JP7384943B2 (en) Training method for character generation model, character generation method, device, equipment and medium
CN110517270B (en) Indoor scene semantic segmentation method based on super-pixel depth network
CN115861462B (en) Training method and device for image generation model, electronic equipment and storage medium
CN116152833B (en) Training method of form restoration model based on image and form restoration method
CN113269089A (en) Real-time gesture recognition method and system based on deep learning
CN113343981A (en) Visual feature enhanced character recognition method, device and equipment
CN114022887B (en) Text recognition model training and text recognition method and device, and electronic equipment
CN116611491A (en) Training method and device of target detection model, electronic equipment and storage medium
CN113052156B (en) Optical character recognition method, device, electronic equipment and storage medium
CN117033609B (en) Text visual question-answering method, device, computer equipment and storage medium
CN113159053A (en) Image recognition method and device and computing equipment
CN112084788A (en) Automatic marking method and system for implicit emotional tendency of image captions
CN110851629A (en) Image retrieval method
Vankadaru et al. Text Identification from Handwritten Data using Bi-LSTM and CNN with FastAI
CN115937993A (en) Living body detection model training method, living body detection device and electronic equipment
CN114973333A (en) Human interaction detection method, human interaction detection device, human interaction detection equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant