CN111476067B - Character recognition method and device for image, electronic equipment and readable storage medium

Info

Publication number: CN111476067B
Application number: CN201910065232.8A
Authority: CN (China)
Prior art keywords: image, pixel, text box, text, processed
Legal status: Active (granted)
Other languages: Chinese (zh)
Other versions: CN111476067A (en)
Inventors: 杨帆, 高文龙, 欧贫扶
Original and current assignee: Tencent Technology (Shenzhen) Co., Ltd.
Application filed by Tencent Technology (Shenzhen) Co., Ltd.; published as CN111476067A; granted as CN111476067B.

Classifications

    • G - PHYSICS
        • G06 - COMPUTING; CALCULATING OR COUNTING
            • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
                • G06V10/00 - Arrangements for image or video recognition or understanding
                    • G06V10/20 - Image preprocessing
                        • G06V10/26 - Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
                            • G06V10/267 - Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
                • G06V30/00 - Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
                    • G06V30/10 - Character recognition
                    • G06V30/40 - Document-oriented image-based pattern recognition
                        • G06V30/41 - Analysis of document content
                            • G06V30/414 - Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text

Landscapes

  • Engineering & Computer Science
  • Physics & Mathematics
  • Computer Vision & Pattern Recognition
  • General Physics & Mathematics
  • Multimedia
  • Theoretical Computer Science
  • Artificial Intelligence
  • Computer Graphics
  • Geometry
  • Character Discrimination
  • Character Input

Abstract

The application provides a character recognition method and device for an image, an electronic device, and a readable storage medium, wherein the method comprises the following steps: carrying out multi-scale detection on an image to be processed to obtain a pixel-level thermodynamic diagram (heat map); performing instance segmentation based on the pixel-level thermodynamic diagram, and extracting the information of each text box corresponding to the image to be processed; determining, according to the text box information, the text box images respectively corresponding to each piece of text box information in the image to be processed; and recognizing the text box images to obtain the corresponding character recognition results. By adopting pixel-level prediction and instance segmentation, texts at various angles and in various curved shapes can be segmented in the image to be processed, interference from noise, wrinkles, uneven lighting, and other factors in complex scenes is effectively resisted, recognition errors possibly caused by single-character segmentation are avoided by recognizing whole texts, and the accuracy and recall rate of character recognition in images are remarkably improved.

Description

Character recognition method and device for image, electronic equipment and readable storage medium
Technical Field
The present application relates to the field of OCR (Optical Character Recognition), and in particular, to a method and an apparatus for recognizing characters in an image, an electronic device, and a readable storage medium.
Background
OCR refers to a technique for recognizing text information in an image. Its essence is to detect images captured by optical devices and recognize the characters in them, extending vision and reading capability to machines. OCR technology is now widely used in fields such as medicine, insurance, finance, logistics, and traditional manufacturing. For example, in the appointment scenario of a medical health service, the characters in a clinical medical record picture uploaded by a user from a mobile phone need to be recognized to enable an accurate appointment. OCR technology saves considerable recognition and discrimination time, saves a large amount of manpower and material resources, and improves processing efficiency.
However, in common business scenes, the text information in photos is generally complex, and shadows, occlusion, wrinkles, distortion, multiple directions, multiple lines, and similar cases may exist. Image detection methods in the prior art cannot achieve an ideal recognition effect in such cases. For example, conventional object detection algorithms cannot handle a text line that spans an entire page. For another example, image detection based on semantic segmentation only delineates the whole region occupied by targets of the same type, so closely spaced lines of text may not be effectively distinguished, causing multiple lines to be detected as one line and affecting the subsequent recognition process.
Disclosure of Invention
In order to overcome the above technical problems or at least partially solve the above technical problems, the following technical solutions are proposed:
in a first aspect, the present application provides a method for recognizing characters in an image, including:
carrying out multi-scale detection on an image to be processed to obtain a pixel level thermodynamic diagram;
performing instance segmentation based on the pixel-level thermodynamic diagram, and extracting each piece of text box information corresponding to the image to be processed;
determining text box images respectively corresponding to the text box information in the image to be processed according to the text box information;
and recognizing each text box image to obtain the corresponding character recognition results.
In an optional implementation manner, the performing multi-scale detection on the image to be processed to obtain a pixel-level thermodynamic diagram includes:
zooming the image to be processed into images with a plurality of preset scales;
and respectively carrying out multi-scale detection on the images with the multiple preset scales to obtain the pixel-level thermodynamic diagrams with the multiple preset scales.
In an optional implementation manner, performing multi-scale detection on an image of any one preset scale to obtain a pixel-level thermodynamic diagram of any one preset scale includes:
extracting feature maps of a plurality of scale layers from the image with any preset scale;
fusing the feature maps of the multiple scale layers to obtain a fused feature map;
and classifying the fused feature maps to obtain the pixel-level thermodynamic diagram of any preset scale.
In an optional implementation manner, the pixel-level thermodynamic diagram includes two types of channel information for each pixel point, where the two types of channel information of any pixel point include:
pixel channel information, used for representing whether the pixel point is a character;
and connected channel information, used for representing whether the pixel point is connected to a predetermined number of surrounding pixels.
In an optional implementation manner, performing instance segmentation based on the pixel-level thermodynamic diagram and extracting the information of each text box corresponding to the image to be processed includes:
adjusting the scales of the pixel-level thermodynamic diagrams of the multiple preset scales to the maximum scale among the multiple preset scales;
respectively determining the average value at each pixel point over the adjusted pixel-level thermodynamic diagrams to obtain an average pixel-level thermodynamic diagram in which each pixel point takes that average value;
and performing instance segmentation based on the average pixel-level thermodynamic diagram, and extracting the information of each text box corresponding to the image to be processed.
In an optional implementation manner, the performing instance segmentation based on the average pixel-level thermodynamic diagram to extract the information of each text box corresponding to the image to be processed includes:
determining pixel points of which the pixel channel information is greater than or equal to a pixel threshold value in the average pixel level thermodynamic diagram as text pixel points;
determining a corresponding text connected domain according to the connected channel information of the text pixel points;
and extracting corresponding text box information in the image to be processed according to each text connected domain.
In an optional implementation manner, the text box information includes coordinate information of a text box in the image to be processed;
determining a text box image respectively corresponding to each text box information in the image to be processed according to each text box information, including:
and determining text box images respectively corresponding to the information of each text box from the image to be processed according to the coordinate information of each text box in the image to be processed.
In an optional implementation manner, recognizing any text box image to obtain a corresponding character recognition result includes:
extracting character features of any text box image, and coding the character features;
and decoding the coded character features based on the professional dictionary in the preset field to obtain a corresponding character recognition result.
In an optional implementation manner, the extracting text features of any text box image includes:
determining a feature vector sequence of any text box image;
and extracting corresponding character features according to the feature vector sequence.
In an alternative implementation manner, the determining a feature vector sequence of any text box image includes:
extracting semantic features of any text box image;
and converting the semantic features into a feature vector sequence.
In an optional implementation manner, the extracting, according to the feature vector sequence, corresponding text features and encoding the text features includes any one of:
extracting corresponding character features through a deep bidirectional recurrent neural network according to the feature vector sequence, and encoding the character features;
and extracting corresponding character features through a deep bidirectional recurrent neural network containing an attention mechanism according to the feature vector sequence, and encoding the character features.
In a second aspect, the present application provides an apparatus for recognizing characters of an image, the apparatus comprising:
the prediction module is used for carrying out multi-scale detection on the image to be processed to obtain a pixel level thermodynamic diagram;
the extraction module is used for performing instance segmentation based on the pixel-level thermodynamic diagram and extracting the information of each text box corresponding to the image to be processed;
the determining module is used for determining text box images respectively corresponding to the text box information in the image to be processed according to the text box information;
and the recognition module is used for recognizing each text box image to obtain corresponding character recognition results.
In an optional implementation manner, the prediction module is specifically configured to scale the image to be processed into images of multiple preset scales; and respectively carrying out multi-scale detection on the images with the multiple preset scales to obtain the pixel-level thermodynamic diagrams with the multiple preset scales.
In an optional implementation manner, the prediction module is specifically configured to extract feature maps of multiple scale layers for the image at any one preset scale; fusing the feature maps of the multiple scale layers to obtain a fused feature map; and classifying the fused feature maps to obtain the pixel-level thermodynamic diagram of any preset scale.
In an optional implementation manner, the pixel-level thermodynamic diagram includes two types of channel information for each pixel point, where the two types of channel information of any pixel point include:
pixel channel information, used for representing whether the pixel point is a character;
and connected channel information, used for representing whether the pixel point is connected to a predetermined number of surrounding pixels.
In an optional implementation manner, the extraction module is specifically configured to adjust the scales of the pixel-level thermodynamic diagrams of the multiple preset scales to the maximum scale among the multiple preset scales; respectively determine the average value at each pixel point over the adjusted pixel-level thermodynamic diagrams to obtain an average pixel-level thermodynamic diagram in which each pixel point takes that average value; and perform instance segmentation based on the average pixel-level thermodynamic diagram to extract the information of each text box corresponding to the image to be processed.
In an optional implementation manner, the extraction module is specifically configured to determine, as a text pixel, a pixel in the average pixel level thermodynamic diagram where the pixel channel information is greater than or equal to a pixel threshold; determining a corresponding text connected domain according to the connected channel information of the text pixel points; and extracting corresponding text box information in the image to be processed according to each text connected domain.
In an optional implementation manner, the text box information includes coordinate information of a text box in the image to be processed;
the determining module is specifically configured to determine, according to the coordinate information of each text box in the image to be processed, text box images corresponding to each text box information from the image to be processed.
In an optional implementation manner, the recognition module is specifically configured to extract character features of any one of the text box images, and encode the character features; and decoding the coded character features based on the professional dictionary in the preset field to obtain a corresponding character recognition result.
In an alternative implementation manner, the recognition module is specifically configured to determine a feature vector sequence of any text box image; and extract corresponding character features according to the feature vector sequence.
In an optional implementation manner, the recognition module is specifically configured to extract semantic features of any one of the text box images; and converting the semantic features into a feature vector sequence.
In an alternative implementation, the recognition module is specifically configured to perform any one of:
extracting corresponding character features through a deep bidirectional recurrent neural network according to the feature vector sequence, and encoding the character features;
and extracting corresponding character features through a deep bidirectional recurrent neural network containing an attention mechanism according to the feature vector sequence, and encoding the character features.
In a third aspect, the present application provides an electronic device comprising:
the computer-readable medium may be a computer-readable medium comprising a processor and a memory, the memory storing at least one instruction, at least one program, a set of codes, or a set of instructions, which is loaded and executed by the processor to implement the word recognition method as set forth in the first aspect of the present application.
In a fourth aspect, the present application provides a computer-readable storage medium for storing a computer instruction, program, code set, or instruction set which, when run on a computer, causes the computer to perform the character recognition method shown in the first aspect of the present application.
The character recognition method and device for an image, the electronic device, and the readable storage medium provided by the application adopt pixel-level prediction and instance segmentation, so that texts at various angles and in various curved shapes can be segmented in the image to be processed without depending on preset small boxes, and the various texts are then extracted for recognition. Interference from noise, wrinkles, uneven lighting, and other factors in complex scenes can be effectively resisted; the text in each text box image is recognized as a whole, avoiding recognition errors possibly caused by single-character segmentation; and the accuracy and recall rate of character recognition in images are remarkably improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below.
Fig. 1 is a schematic flowchart of a text recognition method for an image according to an embodiment of the present application;
FIG. 2a is an exemplary diagram for predicting a pixel level thermodynamic diagram provided by an embodiment of the present application;
fig. 2b is an exemplary diagram for extracting information of each text box according to the embodiment of the present application;
FIG. 2c is an exemplary diagram of determining a text box image provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of a detection method provided in an embodiment of the present application;
fig. 4 is a schematic diagram of another method for recognizing characters in an image according to an embodiment of the present disclosure;
fig. 5 is a schematic diagram of an identification method provided in an embodiment of the present application;
FIG. 6 is a schematic diagram of a medical health scenario OCR process provided by an embodiment of the present application;
fig. 7 is a schematic diagram of a precise reservation usage scenario provided in an embodiment of the present application;
fig. 8 is a schematic structural diagram of a character recognition device for an image according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.
To make the objects and advantages of the present application more clear, embodiments of the present application will be described below in conjunction with two prior arts.
In one prior art, conventional OCR methods may be used to identify text in an image. However, the inventor of the present application found through research that the technical solution has the following disadvantages:
Disadvantage 1-1: traditional OCR is based on black-and-white optical changes and places high requirements on image quality. Its preprocessing stage consists of traditional image-processing methods such as denoising, straight-line detection, and distortion correction, and fixed thresholds and parameters need to be set for the processing results. In an actual business scene, the images uploaded by users come from diverse sources; for example, in business scenes such as medical health, the uploaded images include laboratory test sheets and the like, which may come from different hospitals and differ in format, degree of distortion, evenness of black-and-white changes, and text distribution. A single set of parameters therefore cannot process all uploaded images, the image preprocessing method lacks intelligence and adaptivity, and an image processing method that includes such a preprocessing stage cannot be used universally in actual business scenes.
Disadvantage 1-2: the image cutting and column-dividing stage uses a layout analysis method. However, layout analysis depends on the preprocessing stage having removed interference and noise and corrected distortion; if the preprocessing stage cannot be applied to the image, the cutting and column-dividing stage cannot achieve an ideal effect either.
Disadvantage 1-3: in the recognition stage, the extracted text block is sent to the open-source engine Tesseract. Tesseract first segments the text block into individual characters and then feeds them into a trained single-character recognition and classification model. Such single-character segmentation suits Latin-script languages represented by English, because Latin letters are clearly separated from each other and internally connected, so the segmentation algorithm in Tesseract can segment words accurately. However, in an actual service scene the text may mix Chinese and English, and for structurally complex Chinese characters whose components are not mutually connected, a single-character segmentation algorithm first separates the components and mistakenly splits one character into two or more characters, greatly reducing accuracy.
Disadvantage 1-4: in the recognition stage, Tesseract recognizes each character individually. This is easy for the small number of digit classes (0-9) and English letter classes (a-z, A-Z), but accuracy drops greatly for the far more numerous Chinese character classes (more than 4,000).
In another prior art, a deep learning model may be used to perform semantic segmentation, so as to implement an OCR technology used in an industry general scenario. However, the inventor of the present application found through research that the technical solution has the following disadvantages:
Disadvantage 2-1: the CTPN detection model framework used in the existing scheme can only detect texts in the horizontal direction; if horizontal, vertical, or even multi-directional texts exist in a scene at the same time, a large number of missed detections will occur.
Disadvantage 2-2: CTPN performs target detection on text based on vertical small boxes (text proposals) and connects the small boxes predicted as characters into a text line. Its accuracy depends on the preset horizontal and vertical sizes of the small boxes; for example, if the characters in an image are small while the preset box size is large, accuracy suffers. This detection method therefore has low intelligence and adaptability.
Disadvantage 2-3: the CRNN recognition model used is mainly trained on everyday vocabulary, and its prediction accuracy is low for professional terms, such as medical terms like "creatinine" used in business scenes like medical health.
Based on this, the method and apparatus for recognizing characters of an image, an electronic device and a readable storage medium provided by the present application aim to solve the above technical problems in the prior art.
The following describes the technical solutions of the present application and how to solve the above technical problems in detail by using specific embodiments.
The embodiment of the application provides a character recognition method for an image, as shown in fig. 1, the method includes:
step S101: carrying out multi-scale detection on an image to be processed to obtain a pixel level thermodynamic diagram;
for the embodiment of the present application, the execution subject may be a terminal device, for example, a mobile terminal used by a user. Alternatively, the execution subject may be a server, and perform processing after receiving the to-be-processed image transmitted by the terminal device.
The image to be processed is an image to be subjected to character recognition. In practical application, the image to be processed may be selected and uploaded from stored images by a user, or may be a picture taken by the user in real time using a mobile terminal or a picture scanned in real time, and the embodiment of the present application is not limited herein.
Specifically, the images to be processed include, but are not limited to, tickets, books, reports, bills, and the like. In one example, in a medical health business scenario, the image to be processed may specifically be a medical-related image, such as a picture of paper material like a clinical medical record, a physical examination report, a routine blood test, or an imaging report. Illustratively, while the user uses the reservation service, the characters in the uploaded photos, together with their positions, need to be detected and recognized through OCR, and the OCR results are sent in structured form to the AI (Artificial Intelligence) engine of the corresponding hospital, so as to be accurately recommended to the corresponding department of the hospital for processing.
In the embodiment of the application, performing multi-scale detection on the image to be processed means performing feature detection on it at multiple scales. Because different images to be processed differ in size, the sizes of the characters and texts they contain also differ; multi-scale detection can detect characters or texts of different sizes, thereby improving the accuracy of detecting both small characters and large texts.
Subsequently, a pixel-level thermodynamic diagram is obtained. In the embodiment of the present application, one or more pixel-level thermodynamic diagrams may be obtained from one image to be processed. Each pixel-level thermodynamic diagram describes the prediction information of each pixel point in the image; as shown in fig. 2a, it displays the regions corresponding to the required prediction information in highlighted form, so that the distribution of text regions can be seen intuitively.
Step S102: performing instance segmentation based on the pixel-level thermodynamic diagram, and extracting the information of each text box corresponding to the image to be processed;
According to the embodiment of the application, an instance segmentation mode based on the pixel-level thermodynamic diagram is adopted. Since the pixel-level thermodynamic diagram describes the prediction information of each pixel point in the image, prediction and segmentation at the pixel level can detect characters of any scale in the image, as well as texts of various shapes and scales, such as texts at various angles and in various curved shapes, so that the information of all kinds of text boxes in the image to be processed can be extracted.
In practical application, since one or more pixel-level thermodynamic diagrams may be obtained from one image to be processed, in the embodiment of the present application the text box information corresponding to the image to be processed may be extracted by performing instance segmentation on a single corresponding pixel-level thermodynamic diagram, or on multiple corresponding pixel-level thermodynamic diagrams, as shown in fig. 2b.
Compared with a traditional target detection algorithm, which cannot detect a text line spanning the whole page, the embodiment of the application can recognize larger texts without requiring an enormous receptive field. Compared with image detection based on semantic segmentation, which cannot distinguish closely spaced multi-line text and detects multiple lines as one line, the instance segmentation mode adopted in the embodiment of the application can recognize texts of smaller scale; the detection effect is thus significantly better than both the traditional target detection algorithm and semantic-segmentation-based image detection.
Step S103: determining text box images respectively corresponding to the text box information in the image to be processed according to the text box information;
as shown in fig. 2c, several text box images are illustrated as being determined from the image to be processed.
Step S104: recognizing each text box image to obtain the respectively corresponding character recognition results.
In the embodiment of the application, the whole text of each text box image is recognized. Compared with the open-source engine Tesseract, which first divides a text block into single characters and then recognizes them, the method and the device can avoid recognition errors possibly caused by single-character division and improve recognition accuracy.
According to the character recognition method for an image, pixel-level prediction and instance segmentation are adopted, so that texts at various angles and in various curved shapes can be segmented in the image to be processed without depending on preset small boxes, and the various texts are then extracted for recognition. Interference from noise, wrinkles, uneven lighting, and other factors in complex scenes can be effectively resisted; recognizing the text in each text box image as a whole avoids recognition errors possibly caused by single-character segmentation; and the accuracy and recall rate of character recognition in images are remarkably improved.
In the embodiment of the present application, a possible implementation manner is provided for step S101, and specifically, the implementation manner includes:
step S1011: zooming an image to be processed into images with a plurality of preset scales;
Considering that the sources of the images to be processed are diversified (for example, in business scenes such as medical health, the uploaded images, including clinical medical records, physical examination reports, routine blood tests, and imaging reports, may come from different hospitals and have different sizes and formats), they are difficult to process uniformly.
Based on this, the embodiment of the application scales the image to be processed into the images with various preset scales, so as to improve the accuracy of detecting the images to be processed with different sizes.
Taking a medical health scene as an example: since the size of a typical image may be between 824 × 824 and 2440 × 2440, five scales can be set by scaling the picture, namely three squares of small (824 × 824), medium (1640 × 1640), and large (2440 × 2440) size, plus a horizontal rectangle (2440 × 824) and a vertical rectangle (824 × 2440). In practical applications, a person skilled in the art may set the scaling sizes and the number of scales of the image to be processed according to the actual situation, and the embodiment of the present application is not limited herein.
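For illustration, a minimal Python sketch of this scaling step, assuming OpenCV and using the example sizes above (the scale list is just the example from this paragraph, not a fixed requirement), might look as follows:

```python
import cv2

# Example scale set from the medical health scenario above: three squares
# (small, medium, large) plus a horizontal and a vertical rectangle.
PRESET_SCALES = [(824, 824), (1640, 1640), (2440, 2440), (2440, 824), (824, 2440)]

def scale_to_presets(image):
    """Resize one image to be processed into every preset (width, height) scale."""
    return [cv2.resize(image, (w, h), interpolation=cv2.INTER_LINEAR)
            for (w, h) in PRESET_SCALES]

# Usage: scaled = scale_to_presets(cv2.imread("record.jpg"))
```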
Step S1012: and respectively carrying out multi-scale detection on the images with various preset scales to obtain pixel-level thermodynamic diagrams with various preset scales.
In the embodiment of the application, multi-scale detection is performed on the images of the multiple preset scales respectively, so multiple pixel-level thermodynamic diagrams can be obtained from one image to be processed. For example, in the above example, performing multi-scale detection on the images of the 5 scales yields pixel-level thermodynamic diagrams of the 5 scales. In other words, each image of a preset scale yields, in one-to-one correspondence, a pixel-level thermodynamic diagram of the same preset scale, further improving detection accuracy.
In this embodiment of the present application, a possible implementation manner is provided for step S1012, as shown in fig. 3, specifically, a process of performing multi-scale detection on an image of any preset scale to obtain a pixel-level thermodynamic diagram of any preset scale includes:
Step SA: extracting feature maps of a plurality of scale layers from an image of any preset scale (hereinafter referred to as the target image for convenience of description);
extracting feature maps of a plurality of scale layers refers to extracting features of a target image based on different scales, and the scale layers can limit the scale of the extracted feature maps. In practical application, the target image can be input into a neural network to extract feature maps of multiple scale layers.
Wherein the feature map of a smaller scale layer contains more detail features, and the feature map of a larger scale layer contains more classification features. Multi-scale detection is therefore highly accurate in detecting both small characters and large texts in the target image, and can effectively resist interference from noise, wrinkles, uneven lighting, and other factors in complex scenes.
In one possible implementation, feature maps of multiple scale layers are extracted through a VGG-16 network. Specifically, the target image is passed through a plurality of convolution and pooling layers until the feature map size is 1/16 of the original image.
Step SB: fusing the feature maps of the multiple scale layers to obtain a fused feature map;
Specifically, this step may be performed by a feature-fusion fully convolutional deep network, such as a U-shaped deep fully convolutional neural network (e.g., U-Net), ResNet (Residual Neural Network), DenseNet (Dense Convolutional Network), or Inception Net (a convolutional neural network proposed by Google), and so on.
In the embodiment of the application, taking a U-shaped fully convolutional deep neural network as an example: in a feasible implementation, a VGG-16 basic network framework is used, the extracted feature maps of the multiple scale layers are up-sampled layer by layer in order from small to large, and each up-sampled deep feature map is fused with the shallow feature map of the same size. The U-shaped network structure can effectively learn the semantic information of the image corresponding to both the shallow and deep feature maps, thereby improving the model's accuracy in detecting small characters and large texts.
In other implementation manners, other multi-scale fusion manners may also be adopted to fuse the feature maps of multiple scale layers to obtain a fused feature map, which is not limited herein in this embodiment of the application.
Step SC: and classifying the fused feature maps to obtain a pixel-level thermodynamic diagram with any preset scale.
The feature map after the last layer of fusion is passed through a convolution layer (such as a 1x1 convolution) and a classification layer (softmax), and the pixel-level thermodynamic diagram of the target image is output.
It can be understood that, for each image of the preset scale in step S1011, the above steps SA-SC are respectively required to obtain the pixel-level thermodynamic diagram corresponding to the preset scale.
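For illustration, steps SA-SC might be sketched as follows with PyTorch and torchvision; the VGG-16 stage boundaries, the 128-channel fusion width, and the PixelLink-style output layout (2 pixel channels plus 16 link channels) are assumptions of this sketch, not values fixed by the application:

```python
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16

class DetectionNetSketch(nn.Module):
    """U-shaped fusion over VGG-16 feature maps (steps SA-SC), ending in a 1x1
    convolution and per-task normalisation that yield the pixel-level map."""
    def __init__(self):
        super().__init__()
        f = vgg16(weights=None).features
        self.s1, self.s2, self.s3 = f[:10], f[10:17], f[17:24]  # 1/4, 1/8, 1/16 scale
        self.p1 = nn.Conv2d(128, 128, 1)   # lateral 1x1 convs unify channel counts
        self.p2 = nn.Conv2d(256, 128, 1)
        self.p3 = nn.Conv2d(512, 128, 1)
        self.head = nn.Conv2d(128, 2 + 16, 1)  # 2 pixel + 16 link logits (assumed)

    def forward(self, x):
        c1 = self.s1(x)
        c2 = self.s2(c1)
        c3 = self.s3(c2)
        # Step SB: upsample deeper maps layer by layer and fuse with shallower ones.
        u = self.p3(c3)
        u = F.interpolate(u, size=c2.shape[2:], mode="bilinear") + self.p2(c2)
        u = F.interpolate(u, size=c1.shape[2:], mode="bilinear") + self.p1(c1)
        # Step SC: classify the fused map into the two kinds of channel information.
        logits = self.head(u)
        pixel = logits[:, :2].softmax(dim=1)  # text / non-text scores per pixel
        link = logits[:, 2:].sigmoid()        # connectivity scores to 8 neighbours
        return pixel, link
```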
The embodiment of the present application provides a feasible implementation manner, wherein the pixel-level thermodynamic diagram includes two types of channel information of each pixel, where the two types of channel information of any pixel include pixel channel information (pixel) and connected channel information (link), that is, as shown in fig. 3, each pixel in the pixel-level thermodynamic diagram is divided into two types of channels, namely pixel and link, for output.
The pixel channel information is used for representing whether the pixel point is a character;
and the connected channel information is used for representing whether the pixel point is connected to a predetermined number of surrounding pixels. For example, it may represent whether the pixel point is connected to its 8 surrounding pixels; in practical application, it may also represent whether the pixel point is connected to its 4 surrounding pixels. The predetermined number can be set by a person skilled in the art according to the actual situation, and the embodiments of the present application are not limited herein.
In other implementations, the pixel-level thermodynamic diagram may also be output in other forms of channels, and the embodiments of the present application are not limited herein.
In the embodiment of the present application, a possible implementation manner is provided for the process of performing instance segmentation on the multiple pixel-level thermodynamic diagrams corresponding to an image to be processed (the pixel-level thermodynamic diagrams of multiple preset scales obtained in step S1012) and extracting the information of each text box corresponding to the image to be processed. Here, step S102 includes:
step SL: adjusting the scales of the pixel-level thermodynamic diagrams of various preset scales to the maximum scale of the various preset scales;
for example, in the above example, the pixel-level thermodynamic diagrams of the above 5 scales can be all scaled up to the height and width of the largest scale of the 5 scales (e.g., 2440 × 2440).
Step SM: respectively determining the average value of the same pixel point of each adjusted pixel-level thermodynamic diagram to obtain an average pixel-level thermodynamic diagram with each pixel point as the average value;
the values of all scales are averaged for the same pixel point of the adjusted thermodynamic diagrams of different pixel levels, and an average pixel level thermodynamic diagram is obtained.
Step SN: performing instance segmentation based on the average pixel-level thermodynamic diagram, and extracting the information of each text box corresponding to the image to be processed.
In a possible implementation manner, the step SN may include:
step SN1: determining pixel points of which the pixel channel information is greater than or equal to a pixel threshold value in the average pixel level thermodynamic diagram as text pixel points;
specifically, by setting a proper pixel threshold, the pixel points in the average pixel level thermodynamic diagram where the pixel channel information is lower than the pixel threshold are classified as non-text pixel points, and the pixel points in the average pixel level thermodynamic diagram where the pixel channel information is greater than or equal to the pixel threshold are classified as text pixel points. The pixel threshold may be set by a person skilled in the art according to practical situations, and the embodiment of the present application is not limited herein.
Step SN2: determining a corresponding text connected domain according to the connected channel information of the text pixel points;
Further, for each text pixel point, whether it is connected to each of the predetermined number (for example, 8) of surrounding adjacent pixels is determined according to the connected channel information, so as to determine each text connected domain.
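As an illustrative sketch of this merging step, assuming a PixelLink-style layout in which each text pixel carries one link score per neighbour (the channel ordering and the 0.7 threshold are assumptions), connected domains can be formed with a union-find pass:

```python
import numpy as np

def link_connected_domains(text_mask, link_scores, link_threshold=0.7):
    """Merge text pixels into connected domains (step SN2): two neighbouring
    text pixels join the same domain when their link score passes the threshold.
    text_mask: H x W bool; link_scores: H x W x 8."""
    h, w = text_mask.shape
    parent = {p: p for p in zip(*np.nonzero(text_mask))}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    def union(p, q):
        rp, rq = find(p), find(q)
        if rp != rq:
            parent[rp] = rq

    # Neighbour offsets in the (assumed) order of the 8 link channels.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
               (0, 1), (1, -1), (1, 0), (1, 1)]
    for (y, x) in list(parent):
        for k, (dy, dx) in enumerate(offsets):
            ny, nx = y + dy, x + dx
            if (0 <= ny < h and 0 <= nx < w and text_mask[ny, nx]
                    and link_scores[y, x, k] >= link_threshold):
                union((y, x), (ny, nx))

    labels = np.zeros((h, w), dtype=np.int32)
    roots = {}
    for p in parent:
        labels[p] = roots.setdefault(find(p), len(roots) + 1)
    return labels  # 0 = background, 1..N = text connected domains
```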
Step SN3: and extracting the information of each text box in the corresponding image to be processed according to each text connected domain.
Specifically, the minAreaRect function in OpenCV (Open Source Computer Vision Library) may be used to extract the minimum bounding rectangle of each text connected domain, that is, the corresponding text box (also referred to as a prediction box), and output the information of each text box.
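Continuing the sketch, step SN3 can then read the text boxes off the label map:

```python
import numpy as np
import cv2

def boxes_from_domains(labels):
    """Extract the minimum bounding rectangle of each text connected domain
    with OpenCV's minAreaRect and return its four corner points."""
    boxes = []
    for label in range(1, int(labels.max()) + 1):
        ys, xs = np.nonzero(labels == label)
        pts = np.stack([xs, ys], axis=1).astype(np.float32)
        rect = cv2.minAreaRect(pts)        # (centre, (w, h), rotation angle)
        boxes.append(cv2.boxPoints(rect))  # -> (x1, y1) ... (x4, y4)
    return boxes
```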
In addition, another possible implementation manner is provided in the embodiments of the present application: instead of scaling the image to be processed, multi-scale detection may be performed on it directly, so that one pixel-level thermodynamic diagram is obtained from one image to be processed. In scenes where the size of the image to be processed is standard, this can improve the efficiency of character recognition.
Then, for a scheme in which the image to be processed is directly subjected to multi-scale detection without scaling, a pixel-level thermodynamic diagram corresponding to the image to be processed can also be obtained through the following processes: extracting feature maps of a plurality of scale layers from an image to be processed; fusing the feature maps of the multiple scale layers to obtain a fused feature map; and classifying the fused feature maps to obtain a pixel-level thermodynamic diagram of the image to be processed. The specific implementation manner may refer to the description of the above step SA-SC, and is not described herein again.
Similarly, the pixel-level thermodynamic diagram contains two kinds of channel information of each pixel, and the two kinds of channel information of any pixel include pixel channel information (pixel) and connected channel information (link), that is, each pixel of the pixel-level thermodynamic diagram is divided into two kinds of channels, namely pixel and link, for output.
The pixel channel information is used for representing whether any pixel point is a character or not;
and the connected channel information is used for representing whether the pixel point is connected to a predetermined number of surrounding pixels. For example, it may be set to represent whether the pixel point is connected to its 8 surrounding pixels or, in practical application, to its 4 surrounding pixels. The predetermined number may be set by a person skilled in the art according to the actual situation, and the embodiments of the present application are not limited herein.
Then, for the scheme that the image to be processed is directly subjected to multi-scale detection without scaling to obtain a pixel-level thermodynamic diagram corresponding to the image to be processed, the subsequent step S102 includes: determining pixel points of which the pixel channel information is greater than or equal to a pixel threshold value in the pixel-level thermodynamic diagram as text pixel points; determining a corresponding text connected domain according to the connected channel information of the text pixel points; and extracting the information of each text box in the corresponding image to be processed according to each text connected domain. The specific implementation manner can refer to the introduction of the steps SN1-SN3, and is not described herein again.
In the embodiment of the application, the text box information comprises coordinate information of the text box in the image to be processed. In practical applications, the format of the coordinate information may be (x 1, y1, x2, y2, x3, y3, x4, y 4), i.e. the coordinates representing the four vertices of the text box.
In the embodiment of the present application, as shown in fig. 4, the processes of steps S101 to S102 may be executed by a detection network; that is, the image to be processed is input into the detection network, and after the coordinate information of the N text boxes of various shapes and scales in the image to be processed is extracted, the work of the detection network ends.
The detection network provided by the embodiment of the application combines the technical means of scaling the image to be processed into images of multiple preset scales with multi-scale detection, and does not need to execute a preprocessing stage with a single set of parameters, overcoming the technical problem of Disadvantage 1-1 in the prior art, so that image detection is more intelligent and adaptive. In particular, multi-scale detection of the image to be processed can detect characters or texts of different sizes, improving the accuracy of detecting small characters and large texts and effectively resisting interference from noise, wrinkles, uneven lighting, and other factors in complex scenes. Scaling the image to be processed into images of multiple preset scales further improves detection accuracy for images of different sizes.
Further combined with the technical means of pixel-level instance segmentation, characters of any scale and texts of various shapes and scales, such as texts at various angles and in various curved shapes, can be detected in the image through instance segmentation based on the pixel-level thermodynamic diagram. This overcomes the technical problems of Disadvantages 1-2, 2-1, and 2-2 in the prior art, making image detection more accurate, intelligent, and adaptive.
Compared with a traditional target detection algorithm, which cannot detect a text line spanning the whole page, the embodiment of the application can recognize larger texts without requiring an enormous receptive field. Compared with image detection based on semantic segmentation, which cannot distinguish closely spaced multi-line text and detects multiple lines as one line, the instance segmentation mode adopted in the embodiment of the application can recognize texts of smaller scale; the detection effect is thus significantly better than both the traditional target detection algorithm and semantic-segmentation-based image detection.
In the embodiment of the present application, when the text box information is coordinate information of the text box in the image to be processed, a possible implementation manner is provided for step S103, and specifically, each text box image is determined from the image to be processed according to the coordinate information of each text box in the image to be processed.
In practical application, each text box image is cut out of the image to be processed according to the coordinate information of each text box in the image to be processed, and sent to the recognition network for recognition.
In this embodiment of the application, as shown in fig. 4, the process of obtaining N character strings corresponding to N text boxes may be executed by the recognition network.
Optionally, to simplify the parameter configuration of the recognition network, the text box images may all be scaled to the same input size and fed into the recognition network in batches for recognition. The input size can be set by a person skilled in the art according to the actual situation, for example, a rectangle with a width of 100 and a height of 32; the embodiment of the present application is not limited herein.
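For illustration, cropping and normalising one text box might be sketched as follows; the corner ordering (top-left, top-right, bottom-right, bottom-left) and the 100 x 32 input size are assumptions taken from the example above:

```python
import numpy as np
import cv2

def crop_text_box(image, corners, out_w=100, out_h=32):
    """Cut a text box image out of the image to be processed from its four
    vertex coordinates and rectify it to the recognition input size."""
    src = np.asarray(corners, dtype=np.float32)          # 4 x 2 corner points
    dst = np.float32([[0, 0], [out_w, 0], [out_w, out_h], [0, out_h]])
    m = cv2.getPerspectiveTransform(src, dst)
    return cv2.warpPerspective(image, m, (out_w, out_h))
```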
In the embodiment of the present application, a possible implementation manner is provided for step S104 so as to cope with complex text situations such as mixed Chinese and English and hard-to-recognize professional vocabulary. As shown in fig. 5, the process of recognizing any text box image to obtain the corresponding character recognition result includes:
step S1041: extracting character features of any text box image, and coding the character features;
step S1042: and decoding the coded character features based on the professional dictionary in the preset field to obtain a corresponding character recognition result.
Wherein the predetermined-domain professional dictionary usually covers a large number of professional terms and can assist in predicting professional-term words or texts. Taking a medical health scene as an example, the predetermined-domain professional dictionary can be a medical professional dictionary, currently covering more than 3,000 medical professional indexes, thereby improving recognition accuracy.
In practical application, the encoded character features may be decoded by a Connectionist Temporal Classification (CTC) algorithm to obtain the corresponding character recognition result.
Based on a predetermined-domain professional dictionary, decoding can be performed by combining the connectionist temporal classification algorithm with context, so that words or character strings of any length in the dictionary can be obtained.
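As a simple baseline for comparison, plain best-path CTC decoding (without the dictionary) collapses repeats and removes blanks; the dictionary-constrained word beam search described further below replaces it when professional terms matter. A sketch, assuming the blank symbol occupies index 0 of the character set:

```python
import numpy as np

def ctc_best_path_decode(probs, charset, blank=0):
    """probs: T x C per-time-step class probabilities; charset[0] is a
    placeholder for the blank. Take the argmax per step, collapse repeats,
    drop blanks."""
    best = np.argmax(probs, axis=1)
    chars, prev = [], blank
    for idx in best:
        if idx != blank and idx != prev:
            chars.append(charset[idx])
        prev = idx
    return "".join(chars)
```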
In a possible implementation manner, the process of extracting text features of any text box image includes:
step SP: determining a feature vector sequence of any text box image;
In practical applications, this step may be performed by a convolutional network such as a CNN (Convolutional Neural Network), ResNet, DenseNet, or Inception Net; a person skilled in the art may select the convolutional network according to the actual situation, which is not limited herein.
In the embodiment of the present application, extracting the feature vector sequence of a text box image using a standard CNN is taken as an example. The convolutional component is constructed from the convolution layers and max-pooling layers of a standard CNN (with the fully connected layers removed), and the feature vector sequence is extracted from the final-layer feature map that it generates. The feature vector sequence is taken as the input of the recurrent layer, and the semantic features of the text box image are extracted; converting the semantic features into a feature vector sequence means converting the semantic features in the text box image into a feature vector sequence. Since the convolution operation is translation-invariant, each column of the feature map corresponds to a rectangular region of the text box image, i.e., a receptive field; the receptive fields have the same left-to-right order as their corresponding columns in the feature map. Thus, each vector in the feature vector sequence is associated with one receptive field and can be regarded as an image descriptor of the corresponding region.
Step SQ: and extracting corresponding character features according to the feature vector sequence.
The corresponding character features are then extracted from the feature vector sequence obtained in step SP, and the character features are encoded.
In a feasible implementation manner, according to the feature vector sequence, a BiLSTM (Bidirectional Long Short-Term Memory) network, i.e., a deep bidirectional recurrent neural network (RNN), is used to extract the corresponding character features and encode them.
This is because an RNN has a strong ability to capture context information in a sequence. For character recognition in images, recognition using contextual cues is more stable and helpful than the traditional approach of segmenting single characters and processing each symbol independently. For example, a wide Chinese character may need several consecutive frames to be fully described, and blurry, low-resolution characters are easier to distinguish when their context is visible.
In practice, an RNN can back-propagate its loss to its input convolutional layer, so the recurrent layer and convolutional layers can be trained jointly, end to end, in the same network. An RNN can also operate on sequences of arbitrary length from beginning to end. To overcome the limited context range caused by the vanishing-gradient problem of conventional RNNs, the BiLSTM network used in the embodiment of the present application is composed of LSTMs in two directions, each consisting of a memory cell, an input gate, an output gate, and a forget gate, which can capture long-range sequence information. Thus, the BiLSTM can pass information both forward and backward to capture past and future context.
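For illustration, the CNN feature-sequence extraction of steps SP-SQ followed by a deep BiLSTM encoder might be sketched as below; the layer sizes are illustrative assumptions, and the attention variant discussed next would sit on top of the BiLSTM outputs:

```python
import torch.nn as nn

class CRNNSketch(nn.Module):
    """Convolution and max-pooling layers (no fully connected layer) produce a
    final feature map whose columns form the feature vector sequence; a 2-layer
    BiLSTM then encodes contextual character features per time step."""
    def __init__(self, num_classes, img_h=32):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1), (2, 1)),  # halve height only, keep time steps
        )
        feat_h = img_h // 8
        self.rnn = nn.LSTM(256 * feat_h, 256, num_layers=2,
                           bidirectional=True, batch_first=True)
        self.fc = nn.Linear(512, num_classes)

    def forward(self, x):                     # x: (B, 1, 32, 100)
        f = self.cnn(x)                       # (B, 256, 4, 25)
        b, c, h, w = f.shape
        seq = f.permute(0, 3, 1, 2).reshape(b, w, c * h)  # columns -> sequence
        out, _ = self.rnn(seq)                # past and future context per step
        return self.fc(out)                   # (B, 25, num_classes) logits
```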
In another possible implementation manner, the corresponding character features are extracted from the feature vector sequence through a BiLSTM network containing an attention mechanism, and the character features are encoded.
The embodiment of the application creatively introduces an attention mechanism on top of the BiLSTM structure. By combining the global weight information thus obtained with the short-range local correlations captured during BiLSTM encoding, the accuracy of whole-line recognition can be improved.
In the embodiment of the present application, a possible implementation manner is provided for step S1042. Specifically, the CTC decoding based on the predetermined-domain professional dictionary may adopt word beam search, best path decoding, beam search, a language model, and the like.
In the embodiment of the present application, a decoding method of word beam search based on a predetermined-domain professional dictionary is described as an example. A prefix tree (Trie) can be constructed offline from the predetermined-domain professional dictionary, and decoding is performed online.
The word beam search uses breadth-first search to build its search tree. At each level of the search tree, a series of candidate solutions is generated; these candidates are sorted and matched against the predetermined-domain professional dictionary, and the best K are kept, where K is also called the beam width. A person skilled in the art can set it according to the actual situation, and the embodiment of the present application is not limited herein. In this way, words or character strings of arbitrary length in the predetermined-domain professional dictionary can be predicted.
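A much-simplified sketch of the offline Trie construction and the online, dictionary-constrained search follows; real word beam search additionally merges blank and repeat paths and allows non-word characters between words, and the epsilon guard is just numerical hygiene:

```python
import numpy as np

def build_trie(dictionary):
    """Offline: build a prefix tree from the predetermined-domain professional
    dictionary; each node maps a character to its child node."""
    root = {}
    for word in dictionary:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True  # end-of-word marker
    return root

def dict_beam_search(probs, charset, trie, beam_width=5, blank=0):
    """Online: extend beams only with characters the Trie admits and keep the
    best K candidates per time step. probs: T x C; charset[0] is the blank."""
    beams = [("", trie, 0.0)]  # (text so far, current Trie node, log score)
    for t in range(probs.shape[0]):
        candidates = []
        for text, node, score in beams:
            # Staying on blank keeps the beam text unchanged.
            candidates.append((text, node, score + np.log(probs[t, blank] + 1e-12)))
            for idx, ch in enumerate(charset):
                if idx != blank and ch in node:  # Trie-admissible extension
                    candidates.append((text + ch, node[ch],
                                       score + np.log(probs[t, idx] + 1e-12)))
        candidates.sort(key=lambda b: b[2], reverse=True)
        beams = candidates[:beam_width]  # K best solutions, K = beam width
    return beams[0][0]
```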
The recognition network provided by the embodiment of the application uses a BiLSTM network containing an attention mechanism to extract and encode the corresponding character features, and can recognize text as a whole in combination with its context. Compared with the traditional recognition mode of segmenting single characters and processing each symbol independently, this is more stable and more helpful, and can also effectively avoid misrecognition caused by Chinese character segmentation. This overcomes the technical problems of Disadvantages 1-3 and 1-4 in the prior art, making image recognition more accurate.
Further combined with the decoding mode based on the predetermined-domain professional dictionary, recognition of professional terms can be greatly improved. This overcomes the technical problem of Disadvantage 2-3 in the prior art and further improves the accuracy of image recognition.
In the embodiment of the present application, as can be seen from the above, the provided technical solution can be applied to a medical health scene; for example, it can specifically serve a user using an appointment service. As shown in fig. 6, the characters in the photos and their positions are detected and recognized by OCR, and the OCR results are sent in structured form to the AI (Artificial Intelligence) engine of the corresponding hospital, so as to be accurately recommended to the corresponding department of the hospital for processing.
Specifically, as shown in Fig. 7, the hospital allocates a certain proportion of appointment slots from its slot pool for precise reservation. When booking, the patient uploads medical record data, and AI screening matches a suitable doctor with a suitable patient, improving the matching efficiency of the outpatient service. The steps of a precise reservation are as follows:
(1) Allocating the precise-reservation appointment slots;
(2) Collecting medical records;
(3) Identifying medical records: recognizing the medical record information through OCR and formatting the data;
(4) AI screening of the patient;
(5) Submitting the reservation registration online.
The OCR processing in step (3) is the technical solution protected by the embodiments of the present application, and is a key basic capability in the precise reservation scenario.
For this business scenario, in the training stage the model can be trained step by step, from easy samples to difficult ones, on large-scale labeled data from the medical profession.
In preparing the training data, real data are collected from the business scenario, and manual labeling with secondary verification ensures that the training data and the test data are independent and identically distributed (i.i.d.).
In training the detection model, the objective function of the pixel part is a cross-entropy loss, and the objective function of the link part is a class-balanced cross-entropy loss.
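The exact weighting used for the class-balanced loss is not spelled out here; as one plausible sketch (in PyTorch, with inverse-frequency weights assumed for illustration), the link part could be balanced as follows:

import torch
import torch.nn.functional as F

def class_balanced_link_loss(link_logits, link_labels):
    # Weight each class by the frequency of the opposite class in the
    # batch, so that rare positive (connected) pixels are not drowned
    # out by the abundant negatives. The weighting scheme is an
    # illustrative assumption, not the embodiment's exact formula.
    pos = (link_labels == 1).float()
    neg = (link_labels == 0).float()
    w_pos = neg.sum() / (pos.sum() + neg.sum() + 1e-6)
    w_neg = 1.0 - w_pos
    loss = F.binary_cross_entropy_with_logits(
        link_logits, link_labels.float(), reduction='none')
    return ((w_pos * pos + w_neg * neg) * loss).mean()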
In training the recognition model, the objective function is the CTC loss, which can efficiently compute, based on conditional probability, the error between variable-length output sequences and the ground truth (correctly labeled data).
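For illustration, computing such a loss with PyTorch's built-in nn.CTCLoss could look as follows; the shapes and sizes below are arbitrary examples, not the embodiment's actual dimensions.

import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
T, N, C = 40, 2, 100                      # time steps, batch size, classes
log_probs = torch.randn(T, N, C).log_softmax(2)   # stand-in network outputs
targets = torch.randint(1, C, (N, 12))    # ground-truth label indices
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 12, dtype=torch.long)
loss = ctc(log_probs, targets, input_lengths, target_lengths)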
By adopting the character recognition method for images described above, the character recognition accuracy and recall on clinical medical record pictures can be effectively improved, which in turn improves the accuracy of the subsequent structuring of the OCR results.
Those skilled in the art will appreciate that the above business scenario is merely an example; suitable variations based on it for other scenarios also fall within the spirit and scope of the present application.
An embodiment of the present application further provides a character recognition apparatus for images. As shown in Fig. 8, the character recognition apparatus 80 may include: a prediction module 801, an extraction module 802, a determining module 803 and a recognition module 804, wherein:
the prediction module 801 is configured to perform multi-scale detection on an image to be processed to obtain pixel-level thermodynamic diagrams;
the extraction module 802 is configured to perform instance segmentation based on the pixel-level thermodynamic diagrams and extract the information of each text box corresponding to the image to be processed;
the determining module 803 is configured to determine, according to the information of each text box, the text box image corresponding to each piece of text box information in the image to be processed;
and the recognition module 804 is configured to recognize each text box image to obtain the corresponding character recognition results.
In an alternative implementation, the prediction module 801 is specifically configured to scale the image to be processed into images of multiple preset scales, and to perform multi-scale detection on the images of the multiple preset scales respectively to obtain pixel-level thermodynamic diagrams of the multiple preset scales.
In an alternative implementation manner, the prediction module 801 is specifically configured to, for an image of any one preset scale: extract feature maps of multiple scale layers; fuse the feature maps of the multiple scale layers to obtain a fused feature map; and classify the fused feature map to obtain the pixel-level thermodynamic diagram of that preset scale.
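A minimal sketch of this fuse-then-classify step is given below, assuming PyTorch. The backbone channel counts and the bilinear upsampling are illustrative assumptions; the two output channels correspond to the pixel and connection channels described next.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FusePredict(nn.Module):
    # Sketch: fuse feature maps from several scale layers and classify
    # each pixel into the two heat-map channels.
    def __init__(self, in_channels=(64, 128, 256)):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, 64, 1) for c in in_channels)
        self.head = nn.Conv2d(64, 2, 1)   # pixel channel + connection channel

    def forward(self, feats):             # feature maps from fine to coarse
        size = feats[0].shape[-2:]        # largest spatial size
        fused = sum(F.interpolate(l(f), size=size, mode='bilinear',
                                  align_corners=False)
                    for l, f in zip(self.lateral, feats))
        return torch.sigmoid(self.head(fused))   # pixel-level thermodynamic diagram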
In an optional implementation manner, each pixel-level thermodynamic diagram includes two kinds of channel information for each pixel point, where the two kinds of channel information of any pixel point include:
pixel channel information, used for representing whether the pixel point is a character; and
connection channel information, used for representing whether the pixel point is connected with a predetermined number of surrounding pixel points.
In an alternative implementation manner, the extraction module 802 is specifically configured to adjust the scales of the pixel-level thermodynamic diagrams of the multiple preset scales to the maximum of the multiple preset scales; determine, for each pixel position, the average value across the adjusted pixel-level thermodynamic diagrams, obtaining an average pixel-level thermodynamic diagram in which each pixel point takes that average value; and perform instance segmentation based on the average pixel-level thermodynamic diagram to extract the information of each text box corresponding to the image to be processed.
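For illustration, the rescale-and-average step could be sketched as follows, assuming NumPy and OpenCV and an H x W x 2 layout (pixel channel and connection channel) for each map:

import numpy as np
import cv2

def average_heatmaps(heatmaps):
    # Resize every pixel-level map to the largest scale, then average
    # the maps pixel-wise to obtain the average pixel-level map.
    h_max = max(m.shape[0] for m in heatmaps)
    w_max = max(m.shape[1] for m in heatmaps)
    resized = [cv2.resize(m.astype(np.float32), (w_max, h_max),
                          interpolation=cv2.INTER_LINEAR)
               for m in heatmaps]
    return np.mean(resized, axis=0)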
In an optional implementation manner, the extraction module 802 is specifically configured to determine, as text pixel points, the pixel points in the average pixel-level thermodynamic diagram whose pixel channel information is greater than or equal to a pixel threshold; determine the corresponding text connected domains according to the connection channel information of the text pixel points; and extract the corresponding text box information in the image to be processed according to each text connected domain.
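A simplified sketch of this instance-segmentation step is shown below. It thresholds the pixel channel and labels connected regions with SciPy, returning axis-aligned bounding boxes; the embodiment additionally consults the connection channel to decide which neighbouring text pixels belong together, which is omitted here.

import numpy as np
from scipy import ndimage

def extract_text_boxes(avg_map, pixel_thr=0.5):
    # avg_map: H x W x 2 average pixel-level map (assumed layout).
    text_mask = avg_map[..., 0] >= pixel_thr      # threshold the pixel channel
    labels, n = ndimage.label(text_mask)          # text connected domains
    boxes = []
    for i in range(1, n + 1):
        ys, xs = np.nonzero(labels == i)
        boxes.append((xs.min(), ys.min(), xs.max(), ys.max()))
    return boxes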
In an alternative implementation, the text box information includes coordinate information of the text box in the image to be processed;
the determining module 803 is specifically configured to determine, according to the coordinate information of each text box in the image to be processed, a text box image corresponding to each text box information from the image to be processed.
In an optional implementation manner, the recognition module 804 is specifically configured to extract the character features of any text box image and encode the character features, and to decode the encoded character features based on the professional dictionary of the predetermined domain to obtain the corresponding character recognition result.
In an alternative implementation, the recognition module 804 is specifically configured to determine a feature vector sequence of any text box image, and to extract the corresponding character features according to the feature vector sequence.
In an alternative implementation manner, the recognition module 804 is specifically configured to extract the semantic features of any text box image and convert the semantic features into a feature vector sequence.
In an alternative implementation, the recognition module 804 is specifically configured to perform any one of:
extracting the corresponding character features through a deep bidirectional recurrent neural network according to the feature vector sequence, and encoding the character features;
and extracting the corresponding character features through a deep bidirectional recurrent neural network containing an attention mechanism according to the feature vector sequence, and encoding the character features.
The embodiment of the application provides a character recognition apparatus for images which adopts pixel-level prediction and instance segmentation. Without relying on preset anchor boxes, it can segment texts at various angles and with various curved shapes in the image to be processed and extract them for recognition, effectively resisting interference from noise, wrinkles, uneven lighting and other factors in complex scenes. It then recognizes the text in each text box image as a whole, avoiding the recognition errors that single-character segmentation may cause, and thereby remarkably improving the accuracy and recall of character recognition on the image.
It can be clearly understood by those skilled in the art that the implementation principle and the technical effects of the character recognition apparatus for images provided in the embodiment of the present application are the same as those of the foregoing method embodiment. For convenience and brevity of description, for anything not mentioned in this apparatus embodiment, reference may be made to the corresponding content of the foregoing method embodiment, which is not repeated here.
An embodiment of the present application further provides an electronic device, including: a processor and a memory, the memory storing at least one instruction, at least one program, set of codes or set of instructions, which is loaded and executed by the processor to implement the respective content of the aforementioned method embodiments.
Optionally, the electronic device may further comprise a transceiver. The processor and the transceiver are connected, for example via a bus. It should be noted that in practical applications the number of transceivers is not limited to one, and the structure of the electronic device does not constitute a limitation of the embodiments of the present application.
The processor may be a CPU, a general-purpose processor, a DSP, an ASIC, an FPGA or another programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or execute the various illustrative logical blocks, modules and circuits described in connection with the present disclosure. The processor may also be a combination that realizes computing functions, for example one or more microprocessors, or a combination of a DSP and a microprocessor.
A bus may include a path that carries information between the components. The bus may be a PCI bus, an EISA bus, or the like, and may be divided into an address bus, a data bus, a control bus, etc. The memory may be, but is not limited to, a ROM or another type of static storage device that can store static information and instructions, a RAM or another type of dynamic storage device that can store information and instructions, an EEPROM, a CD-ROM or other optical disk storage (including compact discs, laser discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
The electronic device provided by the embodiment of the application adopts pixel-level prediction and instance segmentation. Without relying on preset anchor boxes, it can segment texts at various angles and with various curved shapes in the image to be processed and extract them for recognition, effectively resisting interference from noise, wrinkles, uneven lighting and other factors in complex scenes. It then recognizes the text in each text box image as a whole, avoiding the recognition errors that single-character segmentation may cause, and thereby remarkably improving the accuracy and recall of character recognition on the image.
The present application also provides a readable storage medium, for example a computer-readable storage medium, which stores computer instructions that, when run on a computer, cause the computer to execute the corresponding content of the foregoing method embodiments.
It should be understood that, although the steps in the flowcharts of the figures are shown in an order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the execution of these steps is not strictly limited in order, and they may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be executed at different times, and whose order of execution is not necessarily sequential; they may be performed in turn or alternately with other steps, or with at least part of the sub-steps or stages of other steps.
The foregoing describes only some embodiments of the present application. It should be noted that those skilled in the art can make several improvements and refinements without departing from the principle of the present application, and such improvements and refinements shall also fall within the protection scope of the present application.

Claims (11)

1. A character recognition method for an image, comprising:
scaling an image to be processed into images of a plurality of preset scales;
performing multi-scale detection on the images of the multiple preset scales respectively to obtain pixel-level thermodynamic diagrams of multiple preset scales, wherein each pixel-level thermodynamic diagram comprises two kinds of channel information for each pixel point, the two kinds of channel information of any pixel point comprising pixel channel information for representing whether the pixel point is a character, and connection channel information for representing whether the pixel point is connected with a predetermined number of surrounding pixel points;
adjusting the scales of the pixel-level thermodynamic diagrams of the multiple preset scales to the maximum scale in the multiple preset scales;
determining, for each pixel position, the average value across the adjusted pixel-level thermodynamic diagrams to obtain an average pixel-level thermodynamic diagram in which each pixel point takes the average value;
performing instance segmentation based on the average pixel-level thermodynamic diagram, and extracting each piece of text box information corresponding to the image to be processed;
determining text box images respectively corresponding to the text box information in the image to be processed according to the text box information;
and identifying each text box image to obtain respectively corresponding character identification results.
2. The character recognition method of claim 1, wherein performing multi-scale detection on an image of any one preset scale to obtain a pixel-level thermodynamic diagram of any one preset scale comprises:
extracting feature maps of a plurality of scale layers from the image with any one preset scale;
fusing the feature maps of the multiple scale layers to obtain a fused feature map;
and classifying the fused feature maps to obtain the pixel-level thermodynamic diagram of any preset scale.
3. The character recognition method according to claim 1, wherein the performing instance segmentation based on the average pixel-level thermodynamic diagram and extracting each piece of text box information corresponding to the image to be processed comprises:
determining pixel points of which the pixel channel information is greater than or equal to a pixel threshold value in the average pixel level thermodynamic diagram as text pixel points;
determining corresponding text connected domains according to the connection channel information of the text pixel points;
and extracting corresponding text box information in the image to be processed according to each text connected domain.
4. The character recognition method according to any one of claims 1 to 3, wherein the text box information includes coordinate information of the text box in the image to be processed;
determining a text box image respectively corresponding to each text box information in the image to be processed according to each text box information, including:
and determining text box images respectively corresponding to the information of each text box from the image to be processed according to the coordinate information of each text box in the image to be processed.
5. The method of any one of claims 1-3, wherein recognizing any one of the text box images to obtain a corresponding text recognition result comprises:
extracting character features of any text box image, and coding the character features;
and decoding the coded character features based on a professional dictionary in a preset field to obtain a corresponding character recognition result.
6. The character recognition method of claim 5, wherein the extracting the character features of any one of the text box images comprises:
determining a feature vector sequence of any text box image;
and extracting corresponding character features according to the feature vector sequence.
7. The method of claim 6, wherein the determining the sequence of feature vectors for any of the text box images comprises:
extracting semantic features of any text box image;
and converting the semantic features into a feature vector sequence.
8. The method of claim 6, wherein the extracting corresponding text features according to the feature vector sequence and encoding the text features comprises any one of:
extracting corresponding character features through a deep bidirectional recurrent neural network according to the feature vector sequence, and encoding the character features;
and extracting corresponding character features through a deep bidirectional recurrent neural network containing an attention mechanism according to the feature vector sequence, and encoding the character features.
9. An apparatus for recognizing characters in an image, comprising:
the prediction module is used for scaling the image to be processed into images with various preset scales;
the prediction module is further configured to perform multi-scale detection on the images of the multiple preset scales respectively to obtain pixel-level thermodynamic diagrams of multiple preset scales, wherein each pixel-level thermodynamic diagram comprises two kinds of channel information for each pixel point, the two kinds of channel information of any pixel point comprising pixel channel information for representing whether the pixel point is a character, and connection channel information for representing whether the pixel point is connected with a predetermined number of surrounding pixel points;
the extraction module is used for adjusting the scales of the pixel-level thermodynamic diagrams of the multiple preset scales to the maximum scale in the multiple preset scales;
the extraction module is further configured to determine, for each pixel position, the average value across the adjusted pixel-level thermodynamic diagrams to obtain an average pixel-level thermodynamic diagram in which each pixel point takes the average value;
the extraction module is further configured to perform instance segmentation based on the average pixel-level thermodynamic diagram and extract each piece of text box information corresponding to the image to be processed;
the determining module is used for determining text box images respectively corresponding to the text box information in the image to be processed according to the text box information;
and the recognition module is used for recognizing each text box image to obtain corresponding character recognition results.
10. An electronic device, comprising: a processor and a memory,
the memory stores at least one instruction, at least one program, set of codes, or set of instructions that is loaded and executed by the processor to implement the method of any of claims 1-8.
11. A computer-readable storage medium storing computer instructions, a program, a code set, or an instruction set which, when run on a computer, cause the computer to perform the character recognition method according to any one of claims 1-8.
CN201910065232.8A 2019-01-23 2019-01-23 Character recognition method and device for image, electronic equipment and readable storage medium Active CN111476067B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910065232.8A CN111476067B (en) 2019-01-23 2019-01-23 Character recognition method and device for image, electronic equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN111476067A CN111476067A (en) 2020-07-31
CN111476067B true CN111476067B (en) 2023-04-07

Family

ID=71743823

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910065232.8A Active CN111476067B (en) 2019-01-23 2019-01-23 Character recognition method and device for image, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN111476067B (en)

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111860525B (en) * 2020-08-06 2022-10-21 宁夏宁电电力设计有限公司 Bottom-up optical character recognition method suitable for terminal block
CN112149523B (en) * 2020-09-04 2021-05-28 开普云信息科技股份有限公司 Method and device for identifying and extracting pictures based on deep learning and parallel-searching algorithm
CN112287936A (en) * 2020-09-24 2021-01-29 深圳市智影医疗科技有限公司 Optical character recognition test method and device, readable storage medium and terminal equipment
CN112101386B (en) * 2020-09-25 2024-04-23 腾讯科技(深圳)有限公司 Text detection method, device, computer equipment and storage medium
CN112508023A (en) * 2020-10-27 2021-03-16 重庆大学 Deep learning-based end-to-end identification method for code-spraying characters of parts
CN112364873A (en) * 2020-11-20 2021-02-12 深圳壹账通智能科技有限公司 Character recognition method and device for curved text image and computer equipment
CN112560861B (en) * 2020-12-10 2022-11-18 上海亿保健康管理有限公司 Bill processing method, device, equipment and storage medium
CN112818949A (en) * 2021-03-09 2021-05-18 浙江天派科技有限公司 Method and system for identifying delivery certificate characters
CN113269197B (en) * 2021-04-25 2024-03-08 南京三百云信息科技有限公司 Certificate image vertex coordinate regression system and identification method based on semantic segmentation
CN113449716B (en) * 2021-05-27 2024-02-13 众安在线财产保险股份有限公司 Field positioning and classifying method, text image recognition method, device and equipment
CN113420757B (en) * 2021-08-23 2021-11-30 北京每日优鲜电子商务有限公司 Text auditing method and device, electronic equipment and computer readable medium
CN113743416B (en) * 2021-08-24 2024-03-05 的卢技术有限公司 Data enhancement method for non-real sample situation in OCR field
CN114155530A (en) * 2021-11-10 2022-03-08 北京中科闻歌科技股份有限公司 Text recognition and question-answering method, device, equipment and medium
CN114037826A (en) * 2021-11-16 2022-02-11 平安普惠企业管理有限公司 Text recognition method, device, equipment and medium based on multi-scale enhanced features
CN114362834B (en) * 2021-12-09 2023-11-24 重庆邮电大学 Physical layer attack detection and identification method based on BILSTM
CN114495132A (en) * 2021-12-15 2022-05-13 深圳前海微众银行股份有限公司 Character recognition method, device, equipment and storage medium
CN114357174B (en) * 2022-03-18 2022-06-10 北京创新乐知网络技术有限公司 Code classification system and method based on OCR and machine learning
CN114863437B (en) * 2022-04-21 2023-04-07 北京百度网讯科技有限公司 Text recognition method and device, electronic equipment and storage medium
CN116863456B (en) * 2023-05-30 2024-03-22 中国科学院自动化研究所 Video text recognition method, device and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4366011B2 (en) * 2000-12-21 2009-11-18 キヤノン株式会社 Document processing apparatus and method
KR102033411B1 (en) * 2016-08-12 2019-10-17 한국전자통신연구원 Apparatus and Method for Recognizing speech By Using Attention-based Context-Dependent Acoustic Model

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017148265A1 (en) * 2016-02-29 2017-09-08 阿里巴巴集团控股有限公司 Word segmentation method and apparatus
JP2017157138A (en) * 2016-03-04 2017-09-07 キヤノン株式会社 Image recognition device, image recognition method and program
CN108108731A (en) * 2016-11-25 2018-06-01 中移(杭州)信息技术有限公司 Method for text detection and device based on generated data
CN107610117A (en) * 2017-09-18 2018-01-19 广州慧扬健康科技有限公司 The automatic segmenting system of coronary artery lumen image
CN108549860A (en) * 2018-04-09 2018-09-18 深源恒际科技有限公司 A kind of ox face recognition method based on deep neural network
CN108805131A (en) * 2018-05-22 2018-11-13 北京旷视科技有限公司 Text line detection method, apparatus and system
CN109064304A (en) * 2018-08-03 2018-12-21 四川长虹电器股份有限公司 Finance reimbursement bill automated processing system and method
CN109003601A (en) * 2018-08-31 2018-12-14 北京工商大学 A kind of across language end-to-end speech recognition methods for low-resource Tujia language

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Fabian Isensee et al., "nnU-Net: Self-adapting Framework for U-Net-Based Medical Image Segmentation", https://arxiv.org/pdf/1809.10486.pdf, 2018, pp. 1-11. *
代具亭 et al., "基于深度学习的语义分割网络" [Semantic segmentation networks based on deep learning], 《红外》 (Infrared), 2018, pp. 33-38, 48. *
刘小丹 et al., "河流湿地遥感图像分层次多尺度分割" [Hierarchical multi-scale segmentation of remote sensing images of river wetlands], 《网络新媒体技术》 (Network New Media Technology), 2016, pp. 51-59. *

Also Published As

Publication number Publication date
CN111476067A (en) 2020-07-31

Similar Documents

Publication Publication Date Title
CN111476067B (en) Character recognition method and device for image, electronic equipment and readable storage medium
CN110399798B (en) Discrete picture file information extraction system and method based on deep learning
CN109902622B (en) Character detection and identification method for boarding check information verification
CN106960206B (en) Character recognition method and character recognition system
KR101617681B1 (en) Text detection using multi-layer connected components with histograms
US11475681B2 (en) Image processing method, apparatus, electronic device and computer readable storage medium
CN111488826A (en) Text recognition method and device, electronic equipment and storage medium
CN115994230A (en) Intelligent archive construction method integrating artificial intelligence and knowledge graph technology
CN114155527A (en) Scene text recognition method and device
CN112434690A (en) Method, system and storage medium for automatically capturing and understanding elements of dynamically analyzing text image characteristic phenomena
Dang et al. End-to-end information extraction by character-level embedding and multi-stage attentional u-net
CN114550158A (en) Scene character recognition method and system
CN113177435A (en) Test paper analysis method and device, storage medium and electronic equipment
CN114596566A (en) Text recognition method and related device
CN116189162A (en) Ship plate detection and identification method and device, electronic equipment and storage medium
CN112990142A (en) Video guide generation method, device and equipment based on OCR (optical character recognition), and storage medium
CN113807218B (en) Layout analysis method, device, computer equipment and storage medium
US9378428B2 (en) Incomplete patterns
CN116311276A (en) Document image correction method, device, electronic equipment and readable medium
CN110147516A (en) The intelligent identification Method and relevant device of front-end code in Pages Design
CN115880702A (en) Data processing method, device, equipment, program product and storage medium
CN115690795A (en) Resume information extraction method and device, electronic equipment and storage medium
Yang et al. Intelligent digitization of substation one-line diagrams based on computer vision
CN115512340A (en) Intention detection method and device based on picture
CN115410185A (en) Method for extracting specific name and unit name attributes in multi-modal data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40025927

Country of ref document: HK

GR01 Patent grant