CN110147785B - Image recognition method, related device and equipment - Google Patents

Info

Publication number: CN110147785B
Application number: CN201810274802.XA
Authority: CN (China)
Prior art keywords: characters; information; stroke; image; recognition
Legal status: Active (granted)
Other versions: CN110147785A (application publication, in Chinese)
Inventor: 李辉
Applicant and current assignee: Tencent Technology Shenzhen Co Ltd; Tencent Cloud Computing Beijing Co Ltd

Classifications

    • G06T 5/30 Erosion or dilatation, e.g. thinning (image enhancement or restoration using local operators)
    • G06V 10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; locating or processing of specific regions to guide the detection or recognition
    • G06V 10/28 Quantising the image, e.g. histogram thresholding for discrimination between background and foreground patterns
    • G06T 2207/20081 Training; learning (indexing scheme for image analysis or enhancement)
    • G06T 2207/20084 Artificial neural networks [ANN] (indexing scheme for image analysis or enhancement)
    • G06V 30/293 Character recognition specially adapted to the type of the alphabet, of characters other than Kanji, Hiragana or Katakana

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Character Discrimination (AREA)

Abstract

The invention discloses an image recognition method comprising the following steps: performing binarization processing on an image to obtain a binary image, the image comprising a plurality of characters; performing skeleton extraction on the binary image to extract skeleton information of the plurality of characters; extracting stroke information from the skeleton information, the stroke information comprising the number of stroke feature points and position information between adjacent stroke feature points; and analyzing the stroke information through a time sequence recognition engine based on a deep learning network to recognize the plurality of characters and the position relation information among the characters. The invention also discloses an image recognition apparatus and device. Because no features need to be designed manually and no character separation is required, the invention solves the prior-art problem of low recognition accuracy caused by separation algorithms that cannot process adhered (touching) characters well.

Description

Image recognition method, related device and equipment
Technical Field
The invention relates to the field of computers, and in particular to an image recognition method, a related apparatus and a device.
Background
Optical Character Recognition (OCR) refers to the process by which an electronic device (e.g., a scanner or a digital camera) examines characters printed on paper, determines their shapes by detecting dark and light patterns, and then translates those shapes into computer text using a character recognition method. The misrecognition rate, or conversely the recognition accuracy, is an important index for measuring OCR performance.
At present, OCR character recognition is applied very widely and can replace keyboard entry to complete high-speed character input on many occasions. For example, OCR is used for the recognition and entry of printed documents, one of the methods most frequently used by office departments; it can also perform automatic segmentation and recognition of complex layouts mixing graphics, images and text. Recognition of handwritten digits enables automatic mail sorting; and handwritten form data can be entered automatically, which is widely applicable to the input and processing of forms such as statements and questionnaires in industries including government, tax, insurance, commerce, medical treatment, finance, and factories and mines.
In the prior art, when characters in an image are recognized, and in particular when a mathematical formula is recognized, the image is typically binarized, the characters are then separated, single mathematical characters are extracted by segmentation, the features of each mathematical character are extracted, and the mathematical expression is finally deduced using stochastic context-free grammar rules according to the position relationships between the characters. However, for adhered (touching) characters, the separation algorithm cannot process them well, so the recognition accuracy is low.
Disclosure of Invention
The technical problem to be solved by the embodiments of the present invention is to provide an image recognition method, an image recognition apparatus, an image recognition device, and a computer-readable storage medium, which solve the prior-art problem of low recognition accuracy caused by separation algorithms that cannot process adhered (touching) characters well.
In order to solve the above technical problem, one aspect of the embodiments of the present invention discloses an image recognition method, including:
carrying out binarization processing on the image to obtain a binary image; the image comprises a plurality of characters;
performing skeleton extraction on the binary image to extract skeleton information of the characters;
extracting stroke information from the skeleton information; the stroke information comprises the number of stroke feature points and position information between adjacent stroke feature points;
and analyzing the stroke information through a time sequence recognition engine based on a deep learning network, and recognizing the characters and the position relation information among the characters.
With reference to the above image recognition method, the performing skeleton extraction on the binary image includes:
performing iterative erosion on the binary image until no new pixel point is eroded relative to the binary image after the previous erosion; each erosion iteration comprises traversing the pixel points in the binary image in sequence and eroding the pixel points that meet specified conditions.
With reference to the image recognition method, the pixel points meeting the specified conditions include target pixel points meeting any one of the following conditions:
the number of pixels with a binary value of 1 among the 8 pixels adjacent to the target pixel is greater than or equal to a first threshold and less than or equal to a second threshold, the first threshold being less than the second threshold;
checking the 8 pixels adjacent to the target pixel in clockwise order, the number of times the binary sequence of two consecutive pixels is 01 equals a third threshold;
among the 4 relatively nearest adjacent pixels, at least one pixel has a binary value of 0; here the distance refers to the distance from the center of a pixel adjacent to the target pixel to the center of the target pixel.
In combination with the above image recognition method, passing the stroke information through the time sequence recognition engine based on a deep learning network to recognize the plurality of characters and the position relation information among the characters includes:
performing feature extraction on the stroke information by a Convolutional Neural Network (CNN);
and inputting the extracted features into a Long Short-Term Memory network (LSTM) for character recognition, and recognizing the characters and the position relation information among the characters.
In combination with the above image recognition method, the long short-term memory (LSTM) network is a bidirectional LSTM.
With reference to the foregoing image recognition method, the performing binarization processing on the image includes:
performing binarization processing on the image by adopting the Maximally Stable Extremal Regions (MSER) algorithm.
In combination with the above-mentioned image recognition method, the plurality of characters include mathematical expressions;
after the plurality of characters and the position relation information among the characters are recognized, the method further comprises: outputting a LaTeX expression based on the recognized plurality of characters.
With reference to the foregoing image recognition method, the extracting stroke information from the skeleton information includes:
traversing the skeleton information by connected domain and extracting stroke feature points; in the case of a stroke bifurcation, preferentially extracting the stroke feature point whose direction angle relative to the previous stroke feature point is smaller.
Another aspect of an embodiment of the present invention discloses an image recognition apparatus, including:
the processing unit is used for carrying out binarization processing on the image to obtain a binary image; the image includes a plurality of characters;
an extraction unit, configured to perform skeleton extraction on the binary image, and extract skeleton information of the plurality of characters;
an information extraction unit for extracting stroke information from the skeleton information; the stroke information comprises the number of stroke feature points and position information between adjacent stroke feature points;
the recognition unit is used for analyzing the stroke information through a time sequence recognition engine based on a deep learning network, and recognizing the characters and the position relation information among the characters.
With reference to the image recognition apparatus, the extraction unit is specifically configured to perform iterative erosion on the binary image until no new pixel point is eroded relative to the binary image after the previous erosion; each erosion iteration comprises traversing the pixel points in the binary image in sequence and eroding the pixel points that meet the specified conditions.
With reference to the image recognition apparatus, the pixel points meeting the specified conditions include target pixel points meeting any one of the following conditions:
the number of pixels with a binary value of 1 among the 8 pixels adjacent to the target pixel is greater than or equal to a first threshold and less than or equal to a second threshold, the first threshold being less than the second threshold;
checking the 8 pixels adjacent to the target pixel in clockwise order, the number of times the binary sequence of two consecutive pixels is 01 equals a third threshold;
among the 4 relatively nearest adjacent pixels, at least one pixel has a binary value of 0; here the distance refers to the distance from the center of a pixel adjacent to the target pixel to the center of the target pixel.
In combination with the above image recognition apparatus, the recognition unit includes:
the characteristic extraction unit is used for extracting the characteristics of the stroke information through a Convolutional Neural Network (CNN);
and the character recognition unit is used for inputting the extracted features into the long short-term memory (LSTM) network for character recognition, recognizing the plurality of characters and the position relation information among the characters.
In combination with the above-mentioned image recognition apparatus, the plurality of characters include mathematical expressions;
the recognizing unit outputting the recognized characters includes: and outputting a LaTex expression according to the plurality of recognized characters.
With reference to the image recognition apparatus, the information extraction unit is specifically configured to traverse the skeleton information by connected domain and extract stroke feature points; in the case of a stroke bifurcation, the stroke feature point whose direction angle relative to the previous stroke feature point is smaller is extracted preferentially.
In another aspect of the embodiment of the present invention, an image recognition apparatus is disclosed, which includes a processor and a memory, where the processor and the memory are connected to each other, where the memory is used for storing application program codes, and the processor is configured to call the program codes to execute an image recognition method as described above.
Another aspect of the embodiments of the present invention discloses a computer-readable storage medium storing a computer program, the computer program comprising program instructions, which, when executed by a processor, cause the processor to execute an image recognition method as described above.
By implementing the embodiments of the invention, skeleton extraction is performed on the binary image to extract the skeleton information of the plurality of characters, stroke information is then extracted from the skeleton information, and the stroke information is passed through a time sequence recognition engine based on a deep learning network to recognize the plurality of characters and the position relation information among the characters. No features need to be designed manually and no character separation is required, which solves the prior-art problem of low recognition accuracy caused by separation algorithms that cannot handle adhered characters well. In particular, the embodiments recognize the characters through a time-sequence-based deep learning recognition model: the features extracted by the CNN are input into a bidirectional LSTM network, which outputs a LaTeX expression without segmenting the characters of the image or analyzing the spatial position relationships among them; this information is obtained by the deep learning recognition model itself, i.e., end-to-end recognition is achieved. The embodiments can therefore adapt to various complex scenes, and the recognition accuracy is greatly improved.
Drawings
To illustrate the embodiments of the present invention and the solutions in the prior art, the drawings used in their description are briefly introduced below.
FIG. 1 is a schematic flow chart of an image recognition method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an input image provided by an embodiment of the invention;
FIG. 3 is a schematic diagram of a binary map provided by an embodiment of the present invention;
FIG. 4 is a diagram illustrating image skeleton extraction provided by an embodiment of the invention;
fig. 5 is a schematic structural diagram of a pixel provided in an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a pixel point according to another embodiment of the present invention;
fig. 7 is a schematic structural diagram of an example of a pixel according to another embodiment of the present invention;
FIG. 8 is a schematic diagram of image skeleton extraction according to another embodiment of the present invention;
FIG. 9a is a schematic diagram of stroke information provided by an embodiment of the present invention;
FIG. 9b is a schematic diagram of stroke information according to another embodiment of the present invention;
FIG. 10 is a schematic diagram of a time sequence recognition engine provided by an embodiment of the invention;
fig. 11 is a schematic structural diagram of an LSTM network provided by an embodiment of the present invention;
FIG. 12 is a schematic diagram of a time sequence recognition engine according to another embodiment of the present invention;
fig. 13 is a schematic structural diagram of a bidirectional LSTM network provided by an embodiment of the present invention;
fig. 14 is a schematic structural diagram of an image recognition apparatus according to an embodiment of the present invention;
FIG. 15 is a schematic structural diagram of an identification unit provided in an embodiment of the present invention;
fig. 16 is a schematic structural diagram of an image recognition apparatus according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described below with reference to the drawings in the embodiments of the present invention.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items and includes such combinations.
In specific implementations, the terminal or device described in the embodiments of the present invention includes, but is not limited to, desktop computers, portable terminals such as laptop computers and tablet computers, and smart terminals such as smart phones, smart watches, smart glasses, and the like.
In order to better understand the image recognition method, apparatus and device provided in the embodiments of the present invention, an image recognition scene in the embodiments of the present invention is described first. Image recognition in the embodiment of the present invention is the process by which, after the image recognition apparatus or device acquires an image to be recognized that contains a plurality of characters (for example, a mathematical formula), the characters in the image are recognized and output. The output characters make it convenient for relevant personnel to enter information, for a postal system to sort letters, or to subsequently search for information matching the recognized characters.
An image recognition method, an image recognition apparatus, and an image recognition device according to embodiments of the present invention are described in detail below with reference to the accompanying drawings. Fig. 1 shows a flow chart of an image recognition method according to an embodiment of the present invention, which may include the following steps:
step S100: carrying out binarization processing on the image to obtain a binary image;
specifically, the image in the embodiment of the present invention may include a plurality of characters; binarization (Image Binarization) of an Image is a process of setting the gray value of a pixel point on the Image to be 0 or 255 so as to obtain a binary Image, i.e. the whole Image presents an obvious black-white effect. The embodiment of the invention can represent the binary value of the pixel point with the gray value of 0 as 0 after the binary value, and represent the binary value of the pixel point with the gray value of 255 as 1.
In one embodiment of the present invention, the binarization algorithm may adopt the Maximally Stable Extremal Regions (MSER) algorithm, which performs well for affine-invariant regions, to extract connected regions, filter out regions that are too small, too large, or of abnormal aspect ratio, and output a binary image. Referring specifically to fig. 2, which is a schematic diagram of an input image provided by an embodiment of the present invention, the image in fig. 2 includes a plurality of characters that form a mathematical expression; after the image is binarized in step S100, the binary image shown in fig. 3 is obtained, presenting an obvious black-and-white effect.
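For illustration, the following is a minimal sketch of such MSER-based binarization using OpenCV and NumPy; the helper name binarize_with_mser and the filtering thresholds (min_area, max_area, max_aspect) are illustrative assumptions, not values taken from the patent.

    import cv2
    import numpy as np

    def binarize_with_mser(gray, min_area=10, max_area=5000, max_aspect=10.0):
        """Extract connected regions with MSER, filter out regions that are
        too small, too large, or of abnormal aspect ratio, and paint the
        surviving regions into a 0/1 binary image (foreground = 1)."""
        mser = cv2.MSER_create()
        regions, _ = mser.detectRegions(gray)  # each region is an Nx2 array of (x, y) points
        binary = np.zeros(gray.shape, dtype=np.uint8)
        for pts in regions:
            x, y, w, h = cv2.boundingRect(pts.reshape(-1, 1, 2))
            aspect = max(w, h) / max(1, min(w, h))
            if min_area <= w * h <= max_area and aspect <= max_aspect:
                binary[pts[:, 1], pts[:, 0]] = 1  # mark the region's pixels as foreground
        return binary

    # usage sketch:
    # gray = cv2.imread("formula.png", cv2.IMREAD_GRAYSCALE)
    # binary = binarize_with_mser(gray)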
Step S102: performing skeleton extraction on the binary image to extract skeleton information of the characters;
specifically, as shown in fig. 4, the image skeleton extraction is provided in the embodiment of the present invention, and the image skeleton extraction is to extract a central pixel contour of the target on the image, that is, to refine the target with reference to the target center. The framework extraction algorithm can be divided into two categories of iteration and non-iteration, and in the iteration algorithm, the framework extraction algorithm is divided into two categories of parallel iteration and sequential iteration, and the like.
In one embodiment of the present invention, iterative erosion may be performed on the binary image until no new pixel point is eroded relative to the binary image after the previous erosion; each erosion iteration comprises traversing the pixel points in the binary image in sequence and eroding the pixel points that meet specified conditions.
It should be noted that erosion in the embodiment of the present invention refers, in the morphological sense, to removing parts of the image, specifically deleting certain pixels on the boundary of an object. Erosion on the binary image therefore means deleting pixel points whose binary value is 1, i.e., changing pixel points with binary value 1 into pixel points with binary value 0.
Specifically, the specified conditions may be set according to the skeletonization requirements. For example, a pixel point meeting the specified conditions in the present invention may be a target pixel point meeting any one of the following conditions:
the condition a is that the number of pixels with binary value 1 in 8 adjacent pixels around a target pixel is greater than or equal to a first threshold and less than or equal to a second threshold; the first threshold is less than the second threshold; specifically, the following formula 1 may be referred to:
formula 1 where the first threshold is less than or equal to B (P1) and less than or equal to the second threshold
Referring to fig. 5, which shows a schematic structural diagram of pixel points provided in the embodiment of the present invention, P1 is the target pixel point for which erosion (deletion) is to be decided, and the 8 pixels adjacent to P1 are marked P2, P3, P4, P5, P6, P7, P8 and P9. In the embodiment of the present invention, taking binary pixel values of 0 or 1 as an example, B(P1) is the number of pixels with binary value 1 among the 8 pixels adjacent to the central pixel point P1 (i.e., the target pixel point), that is, B(P1) = P2 + P3 + P4 + P5 + P6 + P7 + P8 + P9. In one embodiment, the first threshold may be 2 and the second threshold may be 6.
Condition b: checking the 8 pixels adjacent to the target pixel point in clockwise order, the number of times the binary sequence of two consecutive pixels is 01 equals a third threshold. Specifically, refer to the following Formula 2:
A(P1) = third threshold     (Formula 2)
Referring to fig. 6, which shows a schematic structural diagram of pixel points according to another embodiment of the present invention, the checking order is clockwise, i.e., from P2 to P3, from P3 to P4, from P4 to P5, from P5 to P6, and so on; A(P1) is the number of times, when the 8 pixels adjacent to the target pixel point are checked in clockwise order, that the binary sequence of two consecutive pixels is 01.
In one embodiment, the third threshold may be 1. Taking fig. 7 as an example, which shows an example structural diagram of pixel points according to another embodiment of the present invention, in the left example the binary sequence of two consecutive pixels is 01 twice (the sequence from P2 to P3 is 01, and the sequence from P6 to P7 is 01), so condition b is not met; in the right example the binary sequence of two consecutive pixels is 01 only once (only the sequence from P9 to P2 is 01), so condition b is met and the point P1 is eroded.
Condition c: among the 4 relatively nearest adjacent pixel points, at least one pixel point has a binary value of 0; here the distance refers to the distance from the center of a pixel adjacent to the target pixel to the center of the target pixel. Specifically, refer to the following Formula 3:
P2 × P4 × P6 × P8 = 0     (Formula 3)
With reference to the schematic diagram of pixel points shown in fig. 5, taking P1 as the target pixel point, the adjacent pixel points relatively nearest to P1 are P2, P4, P6 and P8; that is, the distances from the centers of P2, P4, P6 and P8 to the center of P1 are all smaller than the distances from the centers of P3, P5, P7 and P9 to the center of P1. In the ideal case, the distances from the centers of P2, P4, P6 and P8 to the center of P1 are equal, and all of them are nearest adjacent pixels; condition c in the embodiment of the present invention may therefore also be stated as: among the nearest adjacent pixel points, at least one pixel point has a binary value of 0. For example, if the binary value of P2 is 0, condition c is met and the point P1 is eroded; if none of the binary values of P2, P4, P6 and P8 is 0, condition c is not met.
Further, in odd-numbered iterations it may be determined whether P2 × P4 × P6 = 0 or P4 × P6 × P8 = 0 holds; if so, condition c is met and the point P1 is eroded. In even-numbered iterations it is determined whether P2 × P4 × P8 = 0 or P2 × P6 × P8 = 0 holds; if so, condition c is met and the point P1 is eroded.
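Conditions a, b and c, together with the alternation between odd and even iterations, closely resemble the classic Zhang-Suen thinning algorithm. The following Python sketch implements that classic variant under the thresholds mentioned above (first threshold 2, second threshold 6, third threshold 1); note that in the classic algorithm a pixel is deleted only when conditions a, b and c all hold at once, and both products of condition c must be zero in each subiteration, which is an assumption about the intended reading of this step.

    import numpy as np

    def thin_skeleton(binary):
        """Iteratively erode a 0/1 binary image until no new pixel point is
        eroded, following the Zhang-Suen-style conditions described above."""
        img = binary.astype(np.uint8).copy()
        changed = True
        while changed:
            changed = False
            for step in (0, 1):  # odd / even subiteration
                to_erase = []
                for r in range(1, img.shape[0] - 1):
                    for c in range(1, img.shape[1] - 1):
                        if img[r, c] != 1:
                            continue
                        # neighbors P2..P9, clockwise, starting directly above P1
                        p = [img[r-1, c], img[r-1, c+1], img[r, c+1], img[r+1, c+1],
                             img[r+1, c], img[r+1, c-1], img[r, c-1], img[r-1, c-1]]
                        b = sum(p)  # condition a: 2 <= B(P1) <= 6
                        # condition b: number of 01 patterns among consecutive neighbors
                        a = sum(p[i] == 0 and p[(i + 1) % 8] == 1 for i in range(8))
                        if step == 0:  # condition c, odd subiteration
                            c_ok = p[0] * p[2] * p[4] == 0 and p[2] * p[4] * p[6] == 0
                        else:          # condition c, even subiteration
                            c_ok = p[0] * p[2] * p[6] == 0 and p[0] * p[4] * p[6] == 0
                        if 2 <= b <= 6 and a == 1 and c_ok:
                            to_erase.append((r, c))
                if to_erase:
                    changed = True
                for r, c in to_erase:
                    img[r, c] = 0  # erode: binary value 1 becomes 0
        return img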
Taking the binary image shown in fig. 3 as an example, skeleton extraction is performed in step S102 to extract the skeleton information of the plurality of characters; the resulting effect can be seen in the schematic diagram of image skeleton extraction of another embodiment shown in fig. 8. Skeletonization of the character image is achieved through multiple erosion iterations, so that the targets in the image become thinner and thinner.
Step S104: extracting stroke information from the skeleton information;
specifically, in the embodiment of the present invention, the stroke information is extracted from the skeleton information by using a stroke extraction algorithm, for example, as shown in a schematic diagram of the stroke information provided in the embodiment of the present invention in fig. 9a, the stroke information in the embodiment of the present invention may include the number of stroke feature points and position information between adjacent stroke feature points; as shown in fig. 9a, each point is a brush-tip feature point, and a positional relationship exists between adjacent brush-tip feature points, for example, a positional relationship exists between a brush-tip feature point a and an adjacent brush-tip feature point b in fig. 9a, and a direction angle from the brush-tip feature point a to the adjacent brush-tip feature point b can be represented by vector information.
In one embodiment, extracting the stroke information from the skeleton information may include traversing the skeleton information by connected domain and extracting stroke feature points; in the case of a stroke bifurcation, the stroke feature point whose direction angle relative to the previous stroke feature point is smaller is extracted preferentially. A connected domain in the embodiment of the invention may be a region in which the stroke feature points are connected. A stroke bifurcation occurs when, while traversing the stroke feature points from a certain point along a certain direction, several next connected stroke feature points exist. The direction angle in the embodiment of the present invention refers to the angle between the current stroke feature point and the previous connected stroke feature point, specifically the included angle between the direction of traversing to the previous connected stroke feature point and the direction of traversing to the current stroke feature point. Specifically, fig. 9b shows an enlarged view of the stroke information at x in fig. 9a: starting from stroke feature point c, the next stroke feature point d is traversed according to the connected domain; when the traversal reaches stroke feature point e and bifurcates into stroke feature point f, stroke feature point g and stroke feature point h, the stroke feature point f with a direction angle of 0 degrees is traversed first, the stroke feature point g with a direction angle of 90 degrees second, and the stroke feature point h with a direction angle of 270 degrees last.
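For illustration, the traversal just described might be sketched as follows; the depth-first strategy, the seed direction and the exact angle computation are assumptions made for the sketch, since the patent only states that branches with smaller direction angles are extracted preferentially.

    import numpy as np

    def trace_strokes(skel):
        """Traverse a 0/1 skeleton image connected domain by connected
        domain, emitting stroke feature points in visiting order; at a
        bifurcation, the neighbor whose direction deviates least from the
        incoming direction is visited first."""
        visited = np.zeros(skel.shape, dtype=bool)
        offs = [(-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1)]
        strokes = []

        def deviation(inc, out):
            # angle between the incoming and the candidate outgoing direction
            dot = inc[0] * out[0] + inc[1] * out[1]
            norm = np.hypot(*inc) * np.hypot(*out)
            return np.arccos(np.clip(dot / norm, -1.0, 1.0))

        for r0, c0 in zip(*np.nonzero(skel)):
            if visited[r0, c0]:
                continue
            stack = [((r0, c0), (0, 1))]  # assumed seed direction: left to right
            stroke = []
            while stack:
                (r, c), inc = stack.pop()
                if visited[r, c]:
                    continue
                visited[r, c] = True
                stroke.append((r, c))
                nbrs = [(r + dr, c + dc) for dr, dc in offs
                        if 0 <= r + dr < skel.shape[0] and 0 <= c + dc < skel.shape[1]
                        and skel[r + dr, c + dc] and not visited[r + dr, c + dc]]
                # push the largest deviation first so the smallest is popped first
                nbrs.sort(key=lambda n: deviation(inc, (n[0] - r, n[1] - c)), reverse=True)
                for rr, cc in nbrs:
                    stack.append(((rr, cc), (rr - r, cc - c)))
            strokes.append(stroke)
        return strokes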
Step S106: analyzing the stroke information through a time sequence recognition engine based on a deep learning network to recognize the plurality of characters and the position relation information among the characters.
The time sequence recognition engine of the embodiment of the invention may adopt a deep learning network based on a Long Short-Term Memory (LSTM) network. Specifically, after the stroke information obtained in step S104 is input, the network may extract features with a Convolutional Neural Network (CNN), then input the extracted features into the LSTM network to complete the recognition of the plurality of characters and the position relation information among the characters, and finally output the plurality of recognized characters.
Referring to the schematic diagram of the time sequence recognition engine provided in the embodiment of the present invention as shown in fig. 10, the input stroke information includes the number of stroke feature points and the position information between adjacent stroke feature points. Features are extracted through the CNN network 10: two 3 × 3 convolution layers with 64 channels followed by a pooling layer, two 3 × 3 convolution layers with 128 channels followed by a pooling layer, two 3 × 3 convolution layers with 256 channels followed by a pooling layer, and finally two 3 × 3 convolution layers with 512 channels followed by a pooling layer, after which the extracted features are output. The embodiment of the present invention is not limited to the 3 × 3 convolutions of fig. 10; 5 × 5 convolutions and the like may also be used. The extracted features may be divided into stroke information of a number of time-sequence units, which is then input in order to the LSTM network to complete the recognition of the plurality of characters and the position relation information among the characters; the plurality of recognized characters are finally output. For the specific structure of the LSTM network, refer to the schematic diagram of the LSTM network shown in fig. 11. Taking the image in fig. 2 as an example, stroke information of 11 time-sequence units can be extracted by the CNN network; the stroke information of each time-sequence unit removes or adds information to the cell state, in time order, through a carefully designed structure called a 'gate', and the plurality of recognized characters are finally output.
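For illustration, the VGG-style stack of fig. 10 could be rendered in PyTorch as below; the input raster shape, the use of max pooling and the height-collapsing step that turns the feature map into a sequence of time-sequence units are assumptions, since the patent does not specify them.

    import torch.nn as nn

    class StrokeCNN(nn.Module):
        """Feature extractor following fig. 10: four blocks, each with two
        3x3 convolutions (64, 128, 256 and 512 channels) and a pooling layer."""
        def __init__(self, in_channels=1):
            super().__init__()
            layers, prev = [], in_channels
            for ch in (64, 128, 256, 512):
                layers += [nn.Conv2d(prev, ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
                           nn.Conv2d(ch, ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
                           nn.MaxPool2d(2)]
                prev = ch
            self.features = nn.Sequential(*layers)

        def forward(self, x):          # x: (batch, 1, H, W) raster of stroke information
            f = self.features(x)       # (batch, 512, H/16, W/16)
            f = f.mean(dim=2)          # collapse height: (batch, 512, W/16)
            return f.permute(0, 2, 1)  # (batch, T, 512): T time-sequence units for the LSTM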
By implementing the embodiments of the invention, skeleton extraction is performed on the binary image to extract the skeleton information of the plurality of characters, stroke information is then extracted from the skeleton information, and the stroke information is passed through a time sequence recognition engine based on a deep learning network to recognize the plurality of characters and the position relation information among the characters. No features need to be designed manually and no character separation is required, which solves the prior-art problem of low recognition accuracy caused by separation algorithms that cannot handle adhered characters well.
Still further, fig. 12 shows a schematic diagram of a time sequence recognition engine according to another embodiment of the present invention: the LSTM in step S106 may be a bidirectional LSTM, whose structure is shown in fig. 13. Again taking the image of fig. 2 as an example, stroke information of 11 time-sequence units can be extracted by the CNN network; the stroke information of each time-sequence unit removes or adds information to the cell state, in time order, through the carefully designed 'gate' structure, and the plurality of recognized characters are finally output.
In one embodiment, the plurality of characters in the embodiments of the present invention may include a mathematical expression, and outputting the recognized plurality of characters may include: outputting a LaTeX expression according to the plurality of recognized characters. The embodiment of the invention recognizes the characters through a time-sequence-based deep learning recognition model: the features extracted by the CNN are input into the bidirectional LSTM network, which outputs a LaTeX expression without segmenting the characters of the image or analyzing the spatial position relationships among them; this information is obtained by the deep learning recognition model itself, i.e., end-to-end recognition is achieved. The embodiment can therefore adapt to various complex scenes, and the recognition accuracy is greatly improved.
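Continuing the sketch above, a bidirectional LSTM head over the CNN feature sequence might look as follows; the hidden width and the size of the LaTeX token vocabulary are illustrative assumptions.

    import torch.nn as nn

    class BiLSTMRecognizer(nn.Module):
        """Bidirectional LSTM that consumes the CNN feature sequence and
        emits per-time-step scores over an assumed LaTeX token vocabulary."""
        def __init__(self, feat_dim=512, hidden=256, vocab_size=128):
            super().__init__()
            self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
            self.fc = nn.Linear(2 * hidden, vocab_size)  # forward + backward states

        def forward(self, feats):      # feats: (batch, T, feat_dim) from StrokeCNN
            out, _ = self.lstm(feats)  # (batch, T, 2 * hidden)
            return self.fc(out)        # (batch, T, vocab_size) token scores

Decoding the per-time-step scores into a final LaTeX string (for example with a CTC-style or attention decoder) is not described by the patent and is therefore left out of the sketch.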
In order to better implement the above scheme of the embodiment of the present invention, the present invention further provides an image recognition apparatus, which is described in detail below with reference to the accompanying drawings:
as shown in fig. 14, which is a schematic structural diagram of an image recognition apparatus provided in an embodiment of the present invention, the image recognition apparatus 14 may include: a processing unit 140, an extraction unit 142, an extracted information unit 144, and a recognition unit 146, wherein,
the processing unit 140 is configured to perform binarization processing on the image to obtain a binary image; the image includes a plurality of characters;
the extracting unit 142 is configured to perform skeleton extraction on the binary image, and extract skeleton information of the plurality of characters;
the information extraction unit 144 is configured to extract stroke information from the skeleton information; the stroke information comprises the number of stroke feature points and position information between adjacent stroke feature points;
the recognition unit 146 is configured to analyze the stroke information through a time sequence recognition engine based on a deep learning network, recognize the plurality of characters and the position relation information among the characters, and output the plurality of recognized characters.
The extraction unit 142 is specifically configured to perform iterative erosion on the binary image until no new pixel point is eroded relative to the binary image after the previous erosion; each erosion iteration comprises traversing the pixel points in the binary image in sequence and eroding the pixel points that meet the specified conditions.
The pixel points meeting the specified conditions in the embodiment of the invention may include target pixel points meeting any one of the following conditions:
condition a: the number of pixels with a binary value of 1 among the 8 pixels adjacent to the target pixel is greater than or equal to a first threshold and less than or equal to a second threshold, the first threshold being less than the second threshold;
condition b: checking the 8 pixels adjacent to the target pixel in clockwise order, the number of times the binary sequence of two consecutive pixels is 01 equals a third threshold;
condition c: among the 4 relatively nearest adjacent pixels, at least one pixel has a binary value of 0; here the distance refers to the distance from the center of a pixel adjacent to the target pixel to the center of the target pixel.
In one embodiment of the present invention, the information extraction unit 144 may be specifically configured to traverse the skeleton information by connected domain and extract stroke feature points; in the case of a stroke bifurcation, the stroke feature point whose direction angle relative to the previous stroke feature point is smaller is extracted preferentially.
Specifically, the information extraction unit 144 in the embodiment of the present invention may extract the stroke information from the skeleton information through a stroke extraction algorithm, as shown in the schematic diagram of stroke information in fig. 9a; the stroke information may include the number of stroke feature points and the position information between adjacent stroke feature points. As shown in fig. 9a, each point is a stroke feature point, and a positional relationship exists between adjacent stroke feature points; for example, a positional relationship exists between stroke feature point a and its adjacent stroke feature point b, and the direction angle from stroke feature point a to stroke feature point b can be represented by vector information.
In one embodiment, the information extraction unit 144 may extract the stroke information from the skeleton information by traversing the skeleton information by connected domain and extracting stroke feature points; in the case of a stroke bifurcation, the stroke feature point whose direction angle relative to the previous stroke feature point is smaller is extracted preferentially. Specifically, fig. 9b shows an enlarged view of the stroke information at x in fig. 9a: starting from stroke feature point c, the next stroke feature point d is traversed according to the connected domain; when the traversal reaches stroke feature point e and bifurcates into stroke feature point f, stroke feature point g and stroke feature point h, the stroke feature point f with a direction angle of 0 degrees is traversed first, the stroke feature point g with a direction angle of 90 degrees second, and the stroke feature point h with a direction angle of 270 degrees last.
In one embodiment of the present invention, as shown in fig. 15, which is a schematic structural diagram of the recognition unit provided in the embodiment of the present invention, the recognition unit 146 may include a feature extraction unit 1460 and a character recognition unit 1462, wherein,
the feature extraction unit 1460 is configured to perform feature extraction on the stroke information by using a convolutional neural network CNN;
the character recognition unit 1462 is configured to input the extracted features into the long short-term memory (LSTM) network for character recognition, recognizing the plurality of characters and the position relation information among the characters.
In one embodiment of the present invention, the long short term memory network LSTM may be a bidirectional LSTM.
In one embodiment of the present invention, the plurality of characters may include a mathematical expression;
the timing sequence recognition engine of the embodiment of the invention can adopt a deep learning network based on a Long Short-Term Memory network (LSTM). Specifically, after the stroke information obtained by the information unit 144 is extracted, the Network may extract features by a Convolutional Neural Network (CNN), and then input the extracted features into the LSTM Network to complete the recognition of the plurality of characters and the information of the position relationship between the characters, and finally output the plurality of recognized characters. FIG. 10 is a schematic diagram of a timing recognition engine according to an embodiment of the present invention
Referring to fig. 10, which is a schematic diagram of a time sequence recognition engine provided in an embodiment of the present invention, the input stroke information includes the number of stroke feature points and position information between adjacent stroke feature points, and features are extracted through the CNN network 10, the embodiment of the present invention is not limited to convolution with 3 × 3 in fig. 10, and may also be convolution with 5 × 5, and the features extracted by the feature extraction unit 1460 may be divided into stroke information of a plurality of time sequence units, and then the stroke information is sequentially input to the LSTM network to complete recognition of the plurality of characters and position relationship information between characters, and finally the plurality of recognized characters are output. For a specific structure of the LSTM network, referring to the schematic structural diagram of the LSTM network provided by the embodiment of the present invention shown in fig. 11, taking the image in fig. 2 as an example, the stroke information of 11 time sequence units can be extracted from the CNN network, the character recognition unit 1462 removes or adds information to the cell state according to the stroke information of each time sequence unit through a well-designed structure called "gate", and finally, the plurality of recognized characters can be output.
By implementing the embodiments of the invention, skeleton extraction is performed on the binary image to extract the skeleton information of the plurality of characters, stroke information is then extracted from the skeleton information, and the stroke information is passed through a time sequence recognition engine based on a deep learning network to recognize the plurality of characters and the position relation information among the characters. No features need to be designed manually and no character separation is required, which solves the prior-art problem of low recognition accuracy caused by separation algorithms that cannot handle adhered characters well.
Still further, fig. 12 shows a schematic diagram of a time sequence recognition engine according to another embodiment of the present invention: the LSTM of the embodiment of the present invention may be a bidirectional LSTM, whose structure is shown in fig. 13. Again taking the image of fig. 2 as an example, stroke information of 11 time-sequence units can be extracted by the CNN network; the character recognition unit 1462 removes or adds information to the cell state, in time order, from the stroke information of each time-sequence unit through the carefully designed 'gate' structure, and the plurality of recognized characters are finally output.
In one embodiment, the plurality of characters in the embodiment of the present invention may include mathematical expressions, and the recognition unit 146 outputting the recognized plurality of characters may include: outputting a LaTeX expression according to the plurality of recognized characters. The embodiment of the invention recognizes the characters through a time-sequence-based deep learning recognition model: the features extracted by the CNN are input into the bidirectional LSTM network, which can output a LaTeX expression without segmenting the characters of the image or analyzing the spatial position relationships among them; this information is obtained by the deep learning recognition model itself, i.e., end-to-end recognition is achieved.
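Putting the units together, an end-to-end run over a formula image could look like the following sketch, which reuses the hypothetical helpers introduced earlier (binarize_with_mser, thin_skeleton, trace_strokes, StrokeCNN, BiLSTMRecognizer); the rasterization of the traced stroke points back into a model input is a further assumption.

    import cv2
    import numpy as np
    import torch

    def recognize(image_path, cnn, recognizer):
        """Assumed pipeline: binarize -> skeletonize -> extract strokes ->
        rasterize -> CNN features -> bidirectional LSTM -> token scores."""
        gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
        binary = binarize_with_mser(gray)         # step S100: binarization
        skeleton = thin_skeleton(binary)          # step S102: skeleton extraction
        strokes = trace_strokes(skeleton)         # step S104: stroke information
        raster = np.zeros(skeleton.shape, dtype=np.float32)
        for stroke in strokes:                    # paint the stroke feature points
            for r, c in stroke:
                raster[r, c] = 1.0
        x = torch.from_numpy(raster)[None, None]  # shape (1, 1, H, W)
        feats = cnn(x)                            # step S106: CNN feature sequence
        return recognizer(feats)                  # per-time-step LaTeX token scores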
In order to better implement the above solution of the embodiment of the present invention, the present invention further provides an image recognition apparatus, which is described in detail below with reference to the accompanying drawings:
as shown in fig. 16, which is a schematic structural diagram of the image recognition apparatus provided in the embodiment of the present invention, the image recognition apparatus 16 may include a processor 161, an input unit 162, a recognition unit 163, a memory 164, and a communication unit 165, and the processor 161, the input unit 162, the recognition unit 163, the memory 164, and the communication unit 165 may be connected to each other by a bus 166. The memory 164 may be a high-speed RAM memory or a non-volatile memory (non-volatile memory), such as at least one disk memory, and the memory 704 includes a flash in an embodiment of the present invention. The memory 164 may optionally be at least one memory system located remotely from the processor 161. The memory 164 is used for storing application program codes and may include an operating system, a network communication module, a user interface module, and an image recognition program, and the communication unit 165 is used for information interaction with an external unit; processor 161 is configured to call the program code to perform the following steps:
carrying out binarization processing on an input image to obtain a binary image; the image includes a plurality of characters;
performing skeleton extraction on the binary image to extract skeleton information of the characters;
extracting stroke information from the skeleton information; the stroke information comprises the number of stroke feature points and position information between adjacent stroke feature points;
passing the stroke information through a time sequence recognition engine based on a deep learning network, recognizing the plurality of characters and the position relation information among the characters, and outputting the recognized characters.
In one embodiment, the processor 161 performs skeleton extraction on the binary image, which may include:
performing iterative erosion on the binary image until no new pixel point is eroded relative to the binary image after the previous erosion; each erosion iteration comprises traversing the pixel points in the binary image in sequence and eroding the pixel points that meet specified conditions.
In one embodiment, the pixel points meeting the specified condition include target pixel points meeting any one of the following conditions:
the number of pixels with a binary value of 1 among the 8 pixels adjacent to the target pixel is greater than or equal to a first threshold and less than or equal to a second threshold, the first threshold being less than the second threshold;
checking the 8 pixels adjacent to the target pixel in clockwise order, the number of times the binary sequence of two consecutive pixels is 01 equals a third threshold;
among the 4 relatively nearest adjacent pixels, at least one pixel has a binary value of 0; here the distance refers to the distance from the center of a pixel adjacent to the target pixel to the center of the target pixel.
In one embodiment, the processor 161 passing the stroke information through a time sequence recognition engine based on a deep learning network to recognize the plurality of characters and the position relation information among the characters may include:
extracting the characteristics of the stroke information by a Convolutional Neural Network (CNN);
and inputting the extracted features into a long-short term memory network (LSTM) for character recognition, and recognizing the characters and the position relation information among the characters.
In one embodiment, the long short-term memory (LSTM) network is a bidirectional LSTM.
In one embodiment thereof, the plurality of characters may comprise a mathematical expression;
the processor 161 outputting the recognized plurality of characters may include: outputting a LaTeX expression according to the plurality of recognized characters.
In one embodiment, the processor 161 extracting the stroke information from the skeleton information may include:
traversing the skeleton information by connected domain and extracting stroke feature points; in the case of a stroke bifurcation, preferentially extracting the stroke feature point whose direction angle relative to the previous stroke feature point is smaller.
By implementing the embodiments of the invention, skeleton extraction is performed on the binary image to extract the skeleton information of the plurality of characters, stroke information is then extracted from the skeleton information, and the stroke information is passed through a time sequence recognition engine based on a deep learning network to recognize the plurality of characters and the position relation information among the characters. No features need to be designed manually and no character separation is required, which solves the prior-art problem of low recognition accuracy caused by separation algorithms that cannot handle adhered characters well. In particular, the embodiments recognize the characters through a time-sequence-based deep learning recognition model: the features extracted by the CNN are input into a bidirectional LSTM network, which outputs a LaTeX expression without segmenting the characters of the image or analyzing the spatial position relationships among them; this information is obtained by the deep learning recognition model itself, i.e., end-to-end recognition is achieved. The embodiments can therefore adapt to various complex scenes, and the recognition accuracy is greatly improved.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
While the invention has been described in connection with what is presently considered to be the most practical and preferred embodiment, it is to be understood that the invention is not to be limited to the disclosed embodiment, but on the contrary, is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (15)

1. An image recognition method, comprising:
carrying out binarization processing on the image to obtain a binary image; the image comprises a plurality of characters;
performing skeleton extraction on the binary image to extract skeleton information of the characters;
extracting stroke information from the skeleton information; the stroke information comprises the number of stroke feature points and position information between adjacent stroke feature points;
analyzing the stroke information through a time sequence recognition engine based on a deep learning network, and recognizing the plurality of characters and the position relation information among the characters;
after obtaining the stroke information, extracting features through a convolutional neural network included in the time sequence recognition engine based on the deep learning network, and inputting the extracted features into a long-short term memory network included in the time sequence recognition engine based on the deep learning network to complete recognition of the characters and the position relation information among the characters.
2. The method of claim 1, wherein the skeleton extraction of the binary image comprises:
performing iterative erosion on the binary image until no new pixel point is eroded relative to the binary image after the previous erosion; each erosion iteration comprises traversing the pixel points in the binary image in sequence and eroding the pixel points that meet the specified conditions.
3. The method of claim 2, wherein the pixels meeting the specified condition comprise target pixels meeting any one of the following conditions:
among the 8 pixels adjacent to the target pixel, the number of pixels with binary value 1 is greater than or equal to a first threshold and less than or equal to a second threshold, the first threshold being less than the second threshold;
when the 8 pixels adjacent to the target pixel are checked in clockwise order, the number of times the binary values of two consecutive pixels form the sequence 01 equals a third threshold;
among the 4 nearest adjacent pixels, the binary value of at least one pixel is 0, where distance is measured from the center of an adjacent pixel to the center of the target pixel.
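The three conditions above are close to the classic Zhang-Suen thinning test. Below is a minimal single-pass sketch in Python/NumPy that assumes the conventional threshold values (first threshold 2, second threshold 6, third threshold 1), which the claims deliberately leave open; note that classical Zhang-Suen alternates two sub-passes with slightly different corner tests, so this is a simplified reading of the claims, not the patent's exact procedure.

import numpy as np

def thinning_pass(img):
    # One erosion pass over a 0/1 image; pixels meeting all three
    # neighbourhood conditions are collected first, then erased together.
    h, w = img.shape
    to_erase = []
    for y in range(1, h - 1):          # borders skipped for brevity
        for x in range(1, w - 1):
            if img[y, x] != 1:
                continue
            # 8-neighbourhood, clockwise from the pixel directly above
            n = [img[y-1, x], img[y-1, x+1], img[y, x+1], img[y+1, x+1],
                 img[y+1, x], img[y+1, x-1], img[y, x-1], img[y-1, x-1]]
            b = sum(n)                                                    # condition 1
            a = sum(n[i] == 0 and n[(i + 1) % 8] == 1 for i in range(8))  # condition 2
            four = (n[0], n[2], n[4], n[6])                               # N, E, S, W
            if 2 <= b <= 6 and a == 1 and 0 in four:                      # condition 3
                to_erase.append((y, x))
    for y, x in to_erase:
        img[y, x] = 0
    return len(to_erase)

def skeletonize(img):
    # Iterate until a pass erodes no new pixel -- the stop rule of claim 2.
    while thinning_pass(img):
        pass
    return img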
4. The method of claim 1, wherein the analyzing the stroke information through the time sequence recognition engine based on the deep learning network, and recognizing the plurality of characters and the positional relationship information among the characters, comprises:
performing feature extraction on the stroke information through a convolutional neural network (CNN);
inputting the extracted features into a long short-term memory (LSTM) network for character recognition, thereby recognizing the plurality of characters and the positional relationship information among the characters.
5. The method of claim 1, wherein the binarization processing on the image comprises:
performing binarization processing on the image by using a maximally stable extremal regions (MSER) algorithm.
6. The method of claim 4, wherein the plurality of characters comprise a mathematical expression;
after the plurality of characters and the positional relationship information among the characters are recognized, the method further comprises: outputting a LaTeX expression according to the plurality of recognized characters.
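A toy illustration of this final assembly step, turning recognized characters plus positional relations into a LaTeX string; the (character, relation) input format is an assumption of the sketch, and in the embodiment this mapping is produced end to end by the recognition model rather than hand-coded.

def to_latex(tokens):
    # tokens: list of (char, relation) pairs, with relation one of
    # 'inline', 'superscript', 'subscript' -- a simplified assumption.
    parts = []
    for ch, rel in tokens:
        if rel == 'superscript':
            parts.append('^{' + ch + '}')
        elif rel == 'subscript':
            parts.append('_{' + ch + '}')
        else:
            parts.append(ch)
    return ''.join(parts)

# to_latex([('x', 'inline'), ('2', 'superscript'),
#           ('+', 'inline'), ('1', 'inline')])  ->  'x^{2}+1'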
7. The method of claim 1, wherein the extracting stroke information from the skeleton information comprises:
traversing the skeleton information by connected component and extracting stroke feature points; in the case of a stroke bifurcation, preferentially extracting the stroke feature point whose direction forms the smaller angle with the previous stroke feature point.
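A sketch of the branch choice at a bifurcation: among unvisited skeleton neighbours of the current point, continue along the one whose direction deviates least from the incoming direction. The 8-neighbour traversal and the angle measure are one plausible reading of the claim, not the patent's exact procedure.

import numpy as np

def next_stroke_point(skel, prev, cur, visited):
    # Pick the next stroke feature point after cur, preferring the
    # neighbour with the smallest direction change relative to prev -> cur.
    in_angle = np.arctan2(cur[0] - prev[0], cur[1] - prev[1])
    best, best_dev = None, None
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            ny, nx = cur[0] + dy, cur[1] + dx
            if (dy, dx) == (0, 0) or (ny, nx) in visited:
                continue
            if not (0 <= ny < skel.shape[0] and 0 <= nx < skel.shape[1]):
                continue
            if not skel[ny, nx]:
                continue
            dev = abs(np.arctan2(dy, dx) - in_angle)
            dev = min(dev, 2 * np.pi - dev)     # wrap the difference to [0, pi]
            if best is None or dev < best_dev:
                best, best_dev = (ny, nx), dev
    return best                                 # None when the stroke ends

Traversing one connected component then amounts to calling next_stroke_point repeatedly, collecting one stroke-point sequence per stroke.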
8. An image recognition apparatus, comprising:
a processing unit, configured to perform binarization processing on an image to obtain a binary image, the image comprising a plurality of characters;
an extraction unit, configured to perform skeleton extraction on the binary image, and extract skeleton information of the plurality of characters;
an information extraction unit, configured to extract stroke information from the skeleton information, the stroke information comprising the number of stroke feature points and positional information between adjacent stroke feature points;
a recognition unit, configured to analyze the stroke information through a time sequence recognition engine based on a deep learning network, and to recognize the plurality of characters and the positional relationship information among the characters;
wherein, after the stroke information is obtained, the recognition unit is configured to extract features through a convolutional neural network included in the time sequence recognition engine, and to input the extracted features into a long short-term memory network included in the time sequence recognition engine to complete recognition of the plurality of characters and the positional relationship information among the characters.
9. The apparatus of claim 8, wherein the extraction unit is specifically configured to perform iterative erosion processing on the binary image until no new pixel is eroded relative to the binary image after the previous erosion pass; each erosion iteration comprises traversing the pixels of the binary image in order and eroding the pixels that meet specified conditions.
10. The apparatus of claim 9, wherein the pixels meeting the specified condition comprise target pixels meeting any of the following conditions:
among the 8 pixels adjacent to the target pixel, the number of pixels with binary value 1 is greater than or equal to a first threshold and less than or equal to a second threshold, the first threshold being less than the second threshold;
when the 8 pixels adjacent to the target pixel are checked in clockwise order, the number of times the binary values of two consecutive pixels form the sequence 01 equals a third threshold;
among the 4 nearest adjacent pixels, the binary value of at least one pixel is 0, where distance is measured from the center of an adjacent pixel to the center of the target pixel.
11. The apparatus of claim 8, wherein the identification unit comprises:
a feature extraction unit, configured to perform feature extraction on the stroke information through a convolutional neural network (CNN);
a character recognition unit, configured to input the extracted features into a long short-term memory (LSTM) network for character recognition, and to recognize the plurality of characters and the positional relationship information among the characters.
12. The apparatus of claim 11, wherein the plurality of characters comprise mathematical expressions;
the recognition unit is further configured to output a LaTeX expression according to the plurality of recognized characters.
13. The apparatus of claim 8, wherein the information extraction unit is specifically configured to traverse the skeleton information by connected component and extract stroke feature points; in the case of a stroke bifurcation, the stroke feature point whose direction forms the smaller angle with the previous stroke feature point is extracted preferentially.
14. An image recognition device comprising a processor and a memory, the processor and the memory being interconnected, wherein the memory is configured to store application program code, and wherein the processor is configured to invoke the program code to perform the method of any of claims 1-7.
15. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program comprising program instructions that, when executed by a processor, cause the processor to perform the method according to any one of claims 1-7.
CN201810274802.XA 2018-03-29 2018-03-29 Image recognition method, related device and equipment Active CN110147785B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810274802.XA CN110147785B (en) 2018-03-29 2018-03-29 Image recognition method, related device and equipment

Publications (2)

Publication Number Publication Date
CN110147785A (en) 2019-08-20
CN110147785B (en) 2023-01-10

Family

ID=67588309

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810274802.XA Active CN110147785B (en) 2018-03-29 2018-03-29 Image recognition method, related device and equipment

Country Status (1)

Country Link
CN (1) CN110147785B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111104945A (en) * 2019-12-17 2020-05-05 Shanghai Pateo Yuezhen Electronic Equipment Manufacturing Co., Ltd. Object identification method and related product
CN111428593A (en) * 2020-03-12 2020-07-17 Beijing Sankuai Online Technology Co., Ltd. Character recognition method and device, electronic equipment and storage medium
CN112800987B (en) * 2021-02-02 2023-07-21 China United Network Communications Group Co., Ltd. Chinese character processing method and device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1996347A (en) * 2006-09-14 2007-07-11 Zhejiang University Visualized reproduction method based on handwriting image
CN104408455A (en) * 2014-11-27 2015-03-11 University of Shanghai for Science and Technology Adherent character partition method
CN105512692A (en) * 2015-11-30 2016-04-20 South China University of Technology BLSTM-based online handwritten mathematical expression symbol recognition method
CN105654127A (en) * 2015-12-30 2016-06-08 Chengdu Business Big Data Technology Co., Ltd. End-to-end-based picture character sequence continuous recognition method
CN106407971A (en) * 2016-09-14 2017-02-15 Beijing Xiaomi Mobile Software Co., Ltd. Text recognition method and device
CN107273897A (en) * 2017-07-04 2017-10-20 Huazhong University of Science and Technology A character recognition method based on deep learning
CN107403180A (en) * 2017-06-30 2017-11-28 Guangzhou Guangdian Property Management Co., Ltd. A numeric equipment detection and recognition method and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8873813B2 (en) * 2012-09-17 2014-10-28 Z Advanced Computing, Inc. Application of Z-webs and Z-factors to analytics, search engine, learning, recognition, natural language, and other utilities

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
A Character Recognition Algorithm for Unhealthy-text Embedded in Web Images; Sun Yan et al.; Proceedings of 14th Youth Conference on Communication; 2009-12-31; pp. 453-458 *
Generic Text Recognition using Long Short-Term Memory Networks; Adnan Ul-Hasan; ResearchGate; 2016-01-11; pp. 1-179 *
Segmentation-free Handwritten Chinese Text Recognition with LSTM-RNN; Ronaldo Messina et al.; 2015 13th International Conference on Document Analysis and Recognition (ICDAR); 2015-08-23; pp. 171-175 *
Skeleton-Based Human Action Recognition with Global Context-Aware Attention LSTM Networks; Jun Liu et al.; IEEE Transactions on Image Processing; 2017-12-19; Vol. 27, No. 4; pp. 1-13 *
An Optimization Method for Skeleton Extraction of Calligraphy Characters; Zhang Jiulong et al.; Journal of Xi'an University of Technology; 2016-03-30; Vol. 32, No. 1; pp. 35-38 *
CAPTCHA Recognition Based on Two-Dimensional RNN; Chen Rui et al.; Journal of Chinese Computer Systems; 2014-03-15; pp. 504-508 *

Also Published As

Publication number Publication date
CN110147785A (en) 2019-08-20

Similar Documents

Publication Publication Date Title
CN110738207B (en) Character detection method for fusing character area edge information in character image
Bhowmik et al. Text and non-text separation in offline document images: a survey
Saba et al. Annotated comparisons of proposed preprocessing techniques for script recognition
CN106980856B (en) Formula identification method and system and symbolic reasoning calculation method and system
CN109685065B (en) Layout analysis method and system for automatically classifying test paper contents
WO2010092952A1 (en) Pattern recognition device
US20140193029A1 (en) Text Detection in Images of Graphical User Interfaces
CN110147785B (en) Image recognition method, related device and equipment
EP3539051A1 (en) System and method of character recognition using fully convolutional neural networks
CN112329779A (en) Method and related device for improving certificate identification accuracy based on mask
KR20110051374A (en) Apparatus and method for processing data in terminal having touch screen
CN111340023A (en) Text recognition method and device, electronic equipment and storage medium
CN112115921A (en) True and false identification method and device and electronic equipment
Ayesh et al. A robust line segmentation algorithm for Arabic printed text with diacritics
Amin et al. Hand printed Arabic character recognition system
US10217020B1 (en) Method and system for identifying multiple strings in an image based upon positions of model strings relative to one another
Wicht et al. Camera-based sudoku recognition with deep belief network
Selvi et al. Recognition of Arabic numerals with grouping and ungrouping using back propagation neural network
Shi et al. Image enhancement for degraded binary document images
CN116030472A (en) Text coordinate determining method and device
Nasiri et al. A new binarization method for high accuracy handwritten digit recognition of slabs in steel companies
CN116229098A (en) Image recognition method based on mask contour tracking and related products
CN111488870A (en) Character recognition method and character recognition device
Bouchakour et al. Printed arabic characters recognition using combined features and cnn classifier
Omachi et al. Structure extraction from decorated characters using multiscale images

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant