CN113887375A

CN113887375A - Text recognition method, device, equipment and storage medium

Info

Publication number: CN113887375A
Application number: CN202111137451.6A
Authority: CN
Inventors: 刘秩铭; 邵明
Original assignee: Zhejiang Dahua Technology Co Ltd
Current assignee: Zhejiang Dahua Technology Co Ltd
Priority date: 2021-09-27
Filing date: 2021-09-27
Publication date: 2022-01-04

Abstract

The application provides a text recognition method, a text recognition device, text recognition equipment and a storage medium, relates to the technical field of image processing, and is used for improving the accuracy of text recognition. The method comprises the following steps: determining a plurality of target text regions of the image to be recognized according to the trained text detection model; performing text recognition on the plurality of target text regions according to the trained first text recognition model to obtain first text recognition results corresponding to the plurality of target text regions; performing text recognition on the plurality of target text regions according to the trained second text recognition model to obtain second text recognition results corresponding to the plurality of target text regions; and determining a plurality of target text recognition results corresponding to the image to be recognized according to the first confidence degrees contained in the first text recognition results and the second text recognition results.

Description

Text recognition method, device, equipment and storage medium

Technical Field

The application relates to the technical field of image processing, and provides a text recognition method, a text recognition device, text recognition equipment and a storage medium.

Background

In everyday office work, text recognition is often needed. For example, when text in pictures, scanned documents or PDF files has to be entered quickly, the characters cannot be copied and pasted directly, and manual input is laborious and time-consuming, so text recognition is used to capture the text quickly. At present, in existing text recognition methods, a Region-based Convolutional Neural Network (R-CNN) is often used to detect text regions in an image, and a Back Propagation (BP) neural network is then used to recognize the text characters in those regions.

However, when R-CNN is used for text detection, the spacing between texts may be too large or too small and text lines may run in arbitrary directions, so missed detections and over-detections are likely, the edges of text lines cannot be detected accurately, and the accuracy of text detection is ultimately low. When a BP neural network is used for text recognition, Chinese has a large number of character categories, simplified and traditional Chinese characters differ, and Chinese and English punctuation marks resemble one another. As a result, the recognition of Chinese characters and traditional Chinese characters is not good enough, only a small, specific set of text characters can be recognized, the applicable scenarios are few, and special symbols and punctuation marks are difficult to recognize. The accuracy of text recognition is therefore low, which affects normal use.

Therefore, how to improve the accuracy of text recognition is an urgent problem to be solved.

Disclosure of Invention

The embodiment of the application provides a text recognition method, a text recognition device, text recognition equipment and a storage medium, which are used for improving the accuracy of text recognition.

In one aspect, a text recognition method is provided, and the method includes:

determining a plurality of target text regions of the image to be recognized according to the trained text detection model;

performing text recognition on the plurality of target text regions according to the trained first text recognition model to obtain first text recognition results corresponding to the plurality of target text regions; the first text recognition model determines the first text recognition result according to text semantic information;

performing text recognition on the plurality of target text regions according to the trained second text recognition model to obtain second text recognition results corresponding to the plurality of target text regions; the second text recognition model determines the second text recognition result according to the text length and the text semantic information;

determining a plurality of target text recognition results corresponding to the image to be recognized according to first confidence degrees contained in the first text recognition results and the second text recognition results; the first confidence coefficient is used for indicating the probability that a specific character exists in a target text region corresponding to the text recognition result; and one target text area corresponds to a plurality of recognition results, and the recognition result with the highest confidence degree in the plurality of recognition results is the target text recognition result of the target text area.

In one aspect, an apparatus for text recognition is provided, the apparatus comprising:

the text region determining unit is used for determining a plurality of target text regions of the image to be recognized according to the trained text detection model;

a first recognition result determining unit, configured to perform text recognition on the multiple target text regions according to a trained first text recognition model, and obtain first text recognition results corresponding to the multiple target text regions; the first text recognition model determines the first text recognition result according to text semantic information;

a second recognition result determining unit, configured to perform text recognition on the multiple target text regions according to a trained second text recognition model, and obtain second text recognition results corresponding to the multiple target text regions; the second text recognition model determines the second text recognition result according to the text length and the text semantic information;

the target recognition result determining unit is used for determining a plurality of target text recognition results corresponding to the image to be recognized according to first confidence degrees contained in the plurality of first text recognition results and the plurality of second text recognition results; the first confidence coefficient is used for indicating the probability that a specific character exists in a target text region corresponding to the text recognition result; and one target text area corresponds to a plurality of recognition results, and the recognition result with the highest confidence degree in the plurality of recognition results is the target text recognition result of the target text area.

Optionally, the text region determining unit is specifically configured to:

determining a first probability that each pixel point in the image to be recognized is a central point of a single character and a second probability that each pixel point is a central point between any two adjacent characters according to the trained text detection model;

obtaining a plurality of local image areas according to the first probability;

for each local image region, segmenting each local image region according to the second probability, and determining a plurality of candidate text regions corresponding to each local image region;

and determining a plurality of target text regions of the image to be recognized according to a plurality of candidate text regions corresponding to the plurality of local image regions respectively.

Optionally, the text region determining unit is further specifically configured to:

determining a second confidence degree corresponding to each of the candidate text regions; wherein the second confidence level is used to indicate a probability that text is present in the candidate text region;

determining, for one candidate text region of the plurality of candidate text regions, whether a second confidence corresponding to the one candidate text region is greater than a set second confidence threshold;

determining the one candidate text region as the target text region upon determining that the second confidence corresponding to the one candidate text region is greater than the set second confidence threshold.

when the second confidence is determined to be greater than the set second confidence threshold, performing binarization processing on the candidate text region to obtain a first candidate text region;

performing connected domain analysis on the first candidate text region, and determining whether the first candidate text region is a connected region; the connected region is an image region which has the same pixel value and is formed by non-background pixel points adjacent in position;

and if the first candidate text region is determined to be the connected region, determining the first candidate text region as a target text region.

after the connected region is determined, determining a plurality of included angles between a plurality of text sub-regions in the first candidate text region and a preset first coordinate axis; wherein one included angle corresponds to one text subarea; when any two adjacent included angles in the plurality of included angles are different, determining that a text region part formed by text subregions corresponding to any two adjacent included angles in the first candidate text region has a bending phenomenon;

sequentially determining whether the difference value between two adjacent included angles in the plurality of included angles is larger than a set angle threshold value;

when the difference between two adjacent included angles is greater than the set angle threshold, determining, as a dividing line, the boundary line in the first candidate text region between the text sub-regions corresponding to those two adjacent included angles;

and acquiring a plurality of target text sub-regions according to the dividing lines, and determining the target text sub-regions as target text regions.
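As a non-limiting illustration, the angle-based splitting described above might be sketched as follows. The function name `split_by_angle`, the representation of each text sub-region by a single angle in degrees, and the example threshold are assumptions made for this sketch only and are not taken from the application.

```python
from typing import List


def split_by_angle(angles: List[float], angle_threshold: float) -> List[List[int]]:
    """Group consecutive text sub-region indices into target text sub-regions.

    angles[i] is the included angle (in degrees) between sub-region i and the
    preset first coordinate axis. A dividing line is placed between two
    adjacent sub-regions whose angle difference exceeds angle_threshold.
    """
    if not angles:
        return []
    groups = [[0]]
    for i in range(1, len(angles)):
        if abs(angles[i] - angles[i - 1]) > angle_threshold:
            # The boundary between sub-regions i-1 and i becomes a dividing line.
            groups.append([i])
        else:
            groups[-1].append(i)
    return groups


# Hypothetical angles for five sub-regions of one bent candidate text region.
print(split_by_angle([0.0, 2.0, 3.0, 40.0, 42.0], angle_threshold=15.0))
# prints [[0, 1, 2], [3, 4]]
```

Each returned group lists the sub-regions that would together form one target text sub-region.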

Optionally, the apparatus further includes a text recognition preprocessing unit, where the text recognition preprocessing unit is configured to:

based on the text direction classification function of the trained first text recognition model, performing text direction classification on the plurality of target text regions to obtain a plurality of first target text regions;

based on the text correction sub-function of the trained first text recognition model, performing text correction on the plurality of first target text regions to obtain a plurality of second target text regions;

performing text typesetting direction classification on the plurality of second target text regions based on the text typesetting direction classification function of the trained first text recognition model to obtain a plurality of third target text regions;

and respectively inputting the plurality of third target text regions into the trained first text recognition model for text recognition, and/or respectively inputting them into the trained second text recognition model for text recognition.

Optionally, the target recognition result determining unit is specifically configured to:

respectively performing text recognition on the plurality of target text regions according to a plurality of trained second text recognition models to obtain second text recognition results corresponding to the plurality of target text regions in each trained second text recognition model in the plurality of trained second text recognition models;

and determining a plurality of target text recognition results corresponding to the image to be recognized according to the respective corresponding confidence degrees of the plurality of first text recognition results and the respective corresponding confidence degrees of the plurality of second text recognition results of each trained second text recognition model.

Optionally, the target recognition result determining unit is further specifically configured to:

determining whether the same text recognition result exists in a first text recognition result and a plurality of second text recognition results corresponding to one target text region in the target text regions;

when the same text recognition result is determined to exist, increasing the confidence of the same text recognition result;

determining the text recognition result corresponding to the maximum confidence as the target text recognition result of the target text region according to the confidence of the same text recognition result and the confidence of the other text recognition results of the target text region;

and determining a plurality of target text recognition results corresponding to the image to be recognized according to the target text recognition results corresponding to the plurality of target text regions respectively.
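As a non-limiting illustration, the steps above might be sketched as follows. The application states only that the confidence of an identical text recognition result is increased; the additive `bonus` per extra agreeing model, the cap at 1.0, and the function name `pick_with_agreement` are assumptions of this sketch.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def pick_with_agreement(
    results: List[Tuple[str, float]], bonus: float = 0.1
) -> Tuple[str, float]:
    """Pick the target text recognition result for one target text region.

    results holds (recognized text, confidence) pairs from the first model
    and from each second model. Texts returned by more than one model get
    their best confidence raised by `bonus` per extra agreeing model
    (an assumed rule), capped at 1.0.
    """
    grouped: Dict[str, List[float]] = defaultdict(list)
    for text, conf in results:
        grouped[text].append(conf)

    scored: Dict[str, float] = {}
    for text, confs in grouped.items():
        score = max(confs)
        if len(confs) > 1:
            score = min(1.0, score + bonus * (len(confs) - 1))
        scored[text] = score

    best = max(scored, key=lambda t: scored[t])
    return best, scored[best]


# Hypothetical results for one region: two models agree on "invoice".
text, conf = pick_with_agreement(
    [("invoice", 0.80), ("invoice", 0.78), ("lnvoice", 0.85)]
)
print(text)  # prints "invoice"
```

In this example the agreed result overtakes a single higher-confidence result only because of the assumed bonus; with `bonus=0.0` the function reduces to a plain maximum-confidence selection.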

In one aspect, a computer device is provided, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method of the above aspect when executing the computer program.

In one aspect, a computer storage medium is provided having computer program instructions stored thereon that, when executed by a processor, implement the steps of the method of the above aspect.

In the embodiment of the application, a plurality of target text regions of an image to be recognized can be determined according to a trained text detection model. Text recognition can then be performed on the plurality of target text regions according to the trained first text recognition model to obtain the corresponding first text recognition results, and according to the trained second text recognition model to obtain the corresponding second text recognition results. The target text recognition results corresponding to the image to be recognized can then be determined according to the first confidence degrees contained in the first text recognition results and the second text recognition results. Because the first text recognition model determines the first text recognition result according to text semantic information, it can infer the characters in a target text region, which improves the recognition accuracy for Chinese and English punctuation marks and words. Because the second text recognition model determines the second text recognition result according to the text length and the text semantic information, it can also solve the alignment problem of variable-length sequences. Furthermore, when the target text recognition result is determined jointly through the first text recognition model and the second text recognition model, more text recognition results with different confidence degrees are obtained for the same target text region, and selecting the recognition result with the highest confidence as the target text recognition result of that region can further improve the accuracy of text recognition.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments or related technologies, the drawings needed to be used in the description of the embodiments or related technologies are briefly introduced below, it is obvious that the drawings in the following description are only the embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.

FIG. 1 is a schematic view of an application scenario provided in an embodiment of the present application;

FIG. 2 is a schematic flowchart of a text recognition method according to an embodiment of the present application;

FIG. 3 is a schematic diagram of a network structure of a text detection model in which a VGG 16 is adopted in a backbone network;

FIG. 4 is a flowchart illustrating a process of screening text regions according to an embodiment of the present application;

FIG. 5 is a schematic flow chart illustrating screening a text region according to an embodiment of the present application;

FIG. 6 is a schematic view of an irregular quadrilateral provided in accordance with an embodiment of the present application;

FIG. 7 is a schematic diagram of a text sample provided in an embodiment of the present application;

FIG. 8 is a flowchart illustrating a text region splitting process according to an embodiment of the present disclosure;

FIG. 9 is a flowchart illustrating a text recognition preprocessing provided by an embodiment of the present application;

FIG. 10 is a schematic flow chart illustrating text recognition provided by an embodiment of the present application;

FIG. 11 is a schematic flow chart illustrating text recognition according to an embodiment of the present application;

FIG. 12 is a schematic structural diagram of a text recognition apparatus according to an embodiment of the present application;

FIG. 13 is a schematic structural diagram of a computer device according to an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions in the embodiments of the present application will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application. In the present application, the embodiments and features of the embodiments may be arbitrarily combined with each other without conflict. Also, while a logical order is shown in the flow diagrams, in some cases, the steps shown or described may be performed in an order different than here.

At present, in existing text recognition methods, R-CNN is often used to detect text regions in an image, and a BP neural network is then used to recognize the text characters in those regions. However, when R-CNN is used for text detection, the spacing between texts may be too large or too small and text lines may run in arbitrary directions, so missed detections and over-detections are likely, the edges of text lines cannot be detected accurately, and the accuracy of text detection is ultimately low. When a BP neural network is used for text recognition, Chinese has a large number of character categories, simplified and traditional Chinese characters differ, and Chinese and English punctuation marks resemble one another. As a result, the recognition of Chinese characters and traditional Chinese characters is not good enough, only a small, specific set of text characters can be recognized, the applicable scenarios are few, and special symbols and punctuation marks are difficult to recognize. The accuracy of text recognition is therefore low, which affects normal use.

Based on this, in the embodiment of the application, a plurality of target text regions of the image to be recognized may be determined according to the trained text detection model. Text recognition can then be performed on the plurality of target text regions according to the trained first text recognition model to obtain the corresponding first text recognition results, and according to the trained second text recognition model to obtain the corresponding second text recognition results. The target text recognition results corresponding to the image to be recognized can then be determined according to the first confidence degrees contained in the first text recognition results and the second text recognition results. Because the first text recognition model determines the first text recognition result according to text semantic information, it can infer the characters in a target text region, which improves the recognition accuracy for Chinese and English punctuation marks and words. Because the second text recognition model determines the second text recognition result according to the text length and the text semantic information, it can also solve the alignment problem of variable-length sequences. Furthermore, when the target text recognition result is determined jointly through the first text recognition model and the second text recognition model, more text recognition results with different confidence degrees are obtained for the same target text region, and selecting the recognition result with the highest confidence as the target text recognition result of that region can further improve the accuracy of text recognition.

After introducing the design concept of the embodiment of the present application, some simple descriptions are provided below for application scenarios to which the technical solution of the embodiment of the present application can be applied, and it should be noted that the application scenarios described below are only used for describing the embodiment of the present application and are not limited. In a specific implementation process, the technical scheme provided by the embodiment of the application can be flexibly applied according to actual needs.

As shown in fig. 1, a schematic view of an application scenario provided in the embodiment of the present application is provided, where the application scenario for text recognition may include a text recognition device 10 and another device 11.

The other device 11 may be a device storing the image to be recognized, for example a device containing a database. Alternatively, the other device 11 may also be a device for generating an image to be recognized, such as a mobile phone, a camera, or the like.

The text recognition apparatus 10 may be a computer apparatus having a certain processing capability, and may be, for example, a Personal Computer (PC), a notebook computer, a server, or the like. The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, middleware service, a domain name service, a security service, a CDN, and a big data and artificial intelligence platform, but is not limited thereto. The text recognition device 10 may include one or more processors 101, memory 102, and I/O interfaces 103 to interact with other devices, among other things. In addition, the text recognition device 10 may further configure a database 104, and the database 104 may be used to store data such as network model parameters, confidence degrees, and the like involved in the scheme provided by the embodiment of the present application. The memory 102 of the text recognition device 10 may store therein program instructions of the text recognition method provided in the embodiment of the present application, and when the program instructions are executed by the processor 101, the program instructions can be used to implement the steps of the text recognition method provided in the embodiment of the present application, so as to improve the accuracy of text recognition.

In the embodiment of the present application, when the I/O interface 103 detects an image to be recognized input from another device 11, the program instructions of the text recognition method stored in the memory 102 are called, and the processor 101 executes the program instructions, so as to perform text recognition on the image to be recognized, and while obtaining a text recognition result, the accuracy of the text recognition is improved, and data such as confidence level generated during the execution of the program instructions and the text recognition result are stored in the database 104.

Of course, the method provided in the embodiment of the present application is not limited to be used in the application scenario shown in fig. 1, and may also be used in other possible application scenarios, and the embodiment of the present application is not limited. The functions that can be implemented by each device in the application scenario shown in fig. 1 will be described in the following method embodiments, and will not be described in detail herein. Hereinafter, the method of the embodiment of the present application will be described with reference to the drawings.

As shown in fig. 2, a flowchart of a text recognition method provided in an embodiment of the present application, which can be executed by the text recognition apparatus 10 in fig. 1, is described as follows.

Step 201: determining a plurality of target text regions of the image to be recognized according to the trained text detection model.

In the embodiment of the present application, to facilitate the subsequent screening of text regions and to better handle text boundary regions that are not strictly enclosed, the image to be recognized may be a processed saliency image, for example an image similar to a heat map (Heatmap) or a Gaussian map. The text detection model may be a text detection model based on a Visual Geometry Group (VGG) network. For example, FIG. 3 is a schematic diagram of the network structure of a text detection model, provided in an embodiment of the present application, whose backbone network is VGG 16. The image features of the image to be recognized may be extracted by the VGG 16 network structure in the trained text detection model and then regressed through alternating deconvolution (UpConv Block) and upsampling (UpSample) operations, so that feature maps with 2 channels at 1/2 the size of the image to be recognized are obtained. In other words, a first probability that each pixel point in the image to be recognized is the center point of a single character and a second probability that each pixel point is the center point between any two adjacent characters can be determined; these are the Region score and the Affinity score shown in FIG. 3. Specifically, when the image to be recognized is a saliency image, the Region score may be represented as a character-level Gaussian heat map, and the Affinity score may be represented as a Gaussian heat map of the connections between characters.

Furthermore, a plurality of local image regions containing text lines can be obtained according to the determined first probability (Region score). Each local image region is then segmented according to the determined second probability (Affinity score), thereby determining a plurality of candidate text regions corresponding to each local image region. A plurality of target text regions of the image to be recognized may then be determined according to the candidate text regions corresponding to the respective local image regions.

In this way, single characters (Region score) and the connection relationships between characters (Affinity score) are detected first, and the final text lines are then determined according to those connection relationships. Because only character-level content needs attention, rather than the entire text instance, even a small receptive field can predict large text and long text.
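As a non-limiting illustration, the idea of first finding characters and then linking them through their connection relationships might be sketched on a single row of scores as follows. The one-dimensional simplification, the thresholds of 0.5, and the function names `runs_above` and `group_characters` are assumptions of this sketch; the text detection model itself produces two-dimensional score maps.

```python
from typing import List, Sequence, Tuple

Span = Tuple[int, int]  # [start, end) column span


def runs_above(scores: Sequence[float], threshold: float) -> List[Span]:
    """Return the column spans in which scores exceed the threshold."""
    spans: List[Span] = []
    start = None
    for i, s in enumerate(scores):
        if s > threshold and start is None:
            start = i
        elif s <= threshold and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(scores)))
    return spans


def group_characters(
    region: Sequence[float],
    affinity: Sequence[float],
    region_thr: float = 0.5,
    affinity_thr: float = 0.5,
) -> List[Span]:
    """Link neighbouring characters whose gap has a high affinity score."""
    chars = runs_above(region, region_thr)  # single characters
    if not chars:
        return []
    lines = [chars[0]]
    for start, end in chars[1:]:
        prev_start, prev_end = lines[-1]
        gap = affinity[prev_end:start]
        if len(gap) > 0 and max(gap) > affinity_thr:
            lines[-1] = (prev_start, end)  # connected: same text region
        else:
            lines.append((start, end))  # not connected: new text region
    return lines


# Hypothetical scores: three characters, the first two connected.
region = [0.9, 0.8, 0.1, 0.9, 0.9, 0.1, 0.1, 0.1, 0.8, 0.9]
affinity = [0.0, 0.1, 0.8, 0.1, 0.0, 0.1, 0.0, 0.1, 0.0, 0.0]
print(group_characters(region, affinity))  # prints [(0, 5), (8, 10)]
```

Only character-level spans and the scores in the gaps between them are inspected, which mirrors why attention to the whole text instance is not required.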

Step 202: performing text recognition on the plurality of target text regions according to the trained first text recognition model to obtain first text recognition results corresponding to the plurality of target text regions.

In the embodiment of the application, the first text recognition model determines a first text recognition result according to the text semantic information. For example, the first text recognition model may be a text recognition model that includes an attention-based Transformer network structure, and its main structure may be the residual network ResNet34, so that text such as punctuation categories, English words and Chinese words can be accurately inferred from the semantic information of the text shown in a target text region.

In practical application, the target text region may be input into the trained first text recognition model, and the trained first text recognition model may perform text recognition on the input target text region, so as to obtain a first text recognition result corresponding to the target text region.

Step 203: performing text recognition on the plurality of target text regions according to the trained second text recognition model to obtain second text recognition results corresponding to the plurality of target text regions.

In the embodiment of the application, the second text recognition model determines a second text recognition result according to the text length and the text semantic information. For example, the second text recognition model is a recognition model formed by the first text recognition model and a third text recognition model, where the network structure of the third text recognition model may sequentially include a Residual Network (ResNet) 34, a Long Short-Term Memory (LSTM) artificial neural network, a fully connected layer, and a Connectionist Temporal Classification loss (CTC loss). The CTC loss can solve the alignment problem of variable-length sequences. The horizontally typeset text images and the vertically typeset text images are scaled, and a recognition result is obtained through the text recognition model, where the recognition result contains the confidence of each character.
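As a non-limiting illustration of how a CTC-trained model can yield a variable-length text together with a confidence for each character, a greedy (best-path) decoding might be sketched as follows. The application does not state which decoding is used; greedy decoding, the label list, the placement of the blank class at index 0, and the choice to keep the highest frame probability as the character confidence are assumptions of this sketch.

```python
from typing import List, Sequence, Tuple


def ctc_greedy_decode(
    frames: Sequence[Sequence[float]], labels: Sequence[str], blank: int = 0
) -> List[Tuple[str, float]]:
    """Greedy CTC decoding returning (character, confidence) pairs.

    frames[t][k] is the probability of class k at time step t, and
    labels[k] is the symbol of class k (labels[blank] is the CTC blank).
    Consecutive repeats of a class are merged and blanks are dropped.
    """
    decoded: List[Tuple[str, float]] = []
    prev = blank
    for frame in frames:
        idx = max(range(len(frame)), key=lambda k: frame[k])
        if idx != blank:
            if idx == prev:
                # Same character continues: keep its highest frame probability.
                ch, conf = decoded[-1]
                decoded[-1] = (ch, max(conf, frame[idx]))
            else:
                decoded.append((labels[idx], frame[idx]))
        prev = idx
    return decoded


# Hypothetical per-frame probabilities over the classes [blank, "h", "i"].
labels = ["<blank>", "h", "i"]
frames = [
    [0.10, 0.80, 0.10],  # "h"
    [0.20, 0.70, 0.10],  # "h" again, merged with the previous frame
    [0.90, 0.05, 0.05],  # blank
    [0.10, 0.10, 0.80],  # "i"
]
result = ctc_greedy_decode(frames, labels)
print("".join(ch for ch, _ in result))  # prints "hi"
```

Four input frames produce a two-character result, which shows how the number of time steps need not match the text length.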

Specifically, when the second text recognition model is used, the text recognition results of the first text recognition model and of the third text recognition model can be determined separately and then combined as a weighted sum to determine the text recognition result of the second text recognition model. The second text recognition model can thus both solve the alignment problem of variable-length sequences and determine the text recognition result according to the semantic information of the text; considering both aspects improves the accuracy of text recognition.

In a possible implementation manner, to further improve the accuracy of text recognition, the embodiment of the present application may use model fusion to improve the recognition accuracy of the text recognition model; that is, the second text recognition model may be obtained by fusing a plurality of models. For example, 5 trained second text recognition models with different weight parameters may be selected, and a new weight parameter may be obtained by averaging their weight parameters, yielding a new second text recognition model. Performing text recognition with the new second text recognition model can further improve model performance and the accuracy of text recognition.
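As a non-limiting illustration, averaging the weight parameters of several trained models might be sketched as follows. Representing each model as a dictionary from a parameter name to a flat list of floats, and the names `average_weights` and `fc.weight`, are assumptions of this sketch; all models are assumed to share the same structure.

```python
from typing import Dict, List


def average_weights(models: List[Dict[str, List[float]]]) -> Dict[str, List[float]]:
    """Return new weight parameters that are the element-wise mean of the inputs."""
    if not models:
        raise ValueError("at least one model is required")
    averaged: Dict[str, List[float]] = {}
    for name in models[0]:
        # Pair up the same element of the same parameter across all models.
        columns = zip(*(m[name] for m in models))
        averaged[name] = [sum(col) / len(models) for col in columns]
    return averaged


# Two hypothetical models with a single parameter tensor each.
model_a = {"fc.weight": [1.0, 2.0]}
model_b = {"fc.weight": [3.0, 4.0]}
print(average_weights([model_a, model_b]))  # prints {'fc.weight': [2.0, 3.0]}
```

The same function applies unchanged to 5 models, as in the example given above.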

Step 204: determining a plurality of target text recognition results corresponding to the image to be recognized according to the first confidence degrees contained in the first text recognition results and the second text recognition results.

In this embodiment of the present application, the first confidence may be used to indicate a probability that a specific character exists in a target text region corresponding to the text recognition result; one target text area may correspond to a plurality of recognition results, and the recognition result with the highest confidence level among the plurality of recognition results is the target text recognition result of one target text area.

In practical application, when a text recognition result is obtained through a text recognition model, a first confidence is obtained at the same time and is included in the text recognition result. Since the first confidence indicates the probability that a specific character exists in the target text region corresponding to the text recognition result, that is, the recognition accuracy of the text, in the embodiment of the present application the target text recognition result corresponding to each target text region may be determined by comparing the first confidences included in the respective text recognition results. For example, target text region 1 corresponds to 1 first text recognition result and 1 second text recognition result. If the confidence that the character "a" exists is 0.9 in the first text recognition result and 0.8 in the second text recognition result, the first text recognition result may be used as the target text recognition result of target text region 1.
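The comparison in the above example can be sketched as follows. The tuple layout `(model name, text, confidence)` and the function name `pick_target_result` are hypothetical.

```python
def pick_target_result(results):
    """Return the recognition result with the highest first confidence.

    results: list of (model_name, text, confidence) for one target text region.
    """
    return max(results, key=lambda item: item[2])


region_1 = [("first_model", "a", 0.9), ("second_model", "a", 0.8)]
target = pick_target_result(region_1)
```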

In one possible embodiment, erroneous detection may occur when text region detection is performed; for example, a region that contains no characters may be detected as a text region. In this case, if text recognition is performed directly on the detected text regions, the erroneously detected regions may lower the accuracy of text recognition.

Therefore, in the embodiment of the application, the text regions can be screened by determining the probability that text exists in each text region, so as to further improve the accuracy of text recognition. Since the screening process for each candidate text region is the same, the following description takes the candidate text region A as an example. Fig. 4 is a schematic flow chart for screening the text region provided in the embodiment of the present application, and the specific flow is described as follows.

Step 401: determining a second confidence degree corresponding to each of the plurality of candidate text regions.

In implementations of the present application, the second confidence level may be used to indicate a probability that text is present in the candidate text region.

In practical applications, when the image to be recognized is a saliency image, such as a Gaussian image, for each text region, the central point of the text region may be determined first, and then a confidence map (e.g., a Gaussian-distributed confidence) corresponding to the text region may be calculated according to the central point and the length and width of the text region.
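A Gaussian confidence map of this kind might be computed as in the following sketch. The application does not give the exact formula; the sketch assumes an axis-aligned region, a peak of 1.0 at the central point, and standard deviations proportional to the width and height through a hypothetical `sigma_scale` parameter.

```python
import math


def gaussian_confidence_map(width, height, sigma_scale=0.25):
    """Return a height x width grid of confidences peaking at the region center."""
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    sigma_x = max(width * sigma_scale, 1e-6)
    sigma_y = max(height * sigma_scale, 1e-6)
    grid = []
    for y in range(height):
        row = []
        for x in range(width):
            # 2-D Gaussian with independent spread along each axis.
            exponent = ((x - cx) ** 2) / (2 * sigma_x ** 2) \
                + ((y - cy) ** 2) / (2 * sigma_y ** 2)
            row.append(math.exp(-exponent))
        grid.append(row)
    return grid


confidence = gaussian_confidence_map(width=5, height=3)
```

The confidence falls off from the center toward the region border, so pixels near the edge of a detected box contribute less to the second confidence than pixels near its center.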

Step 402: for the candidate text region A among the plurality of candidate text regions, determining whether the second confidence corresponding to the candidate text region A is greater than a set second confidence threshold.

Step 403: when the second confidence is determined to be greater than the set second confidence threshold, determining the candidate text region A as the target text region.

In practical applications, for example, the second confidence threshold may be set to 0.6; when the second confidence of the candidate text region A is greater than 0.6, the candidate text region A may be determined as the target text region. Otherwise, the candidate text region A is more likely to be a text region obtained by erroneous detection and therefore cannot be determined as the target text region; that is, the candidate text region A can be removed from all the candidate text regions, so that the text regions are filtered and the accuracy of subsequent text recognition is further improved.
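Steps 401 to 403 can be sketched as a simple filter. The threshold of 0.6 is the example value given above; the pair layout `(region name, second confidence)` and the function name are hypothetical.

```python
def filter_candidate_regions(candidates, threshold=0.6):
    """Keep only candidates whose second confidence exceeds the threshold.

    candidates: list of (region_name, second_confidence) pairs.
    """
    return [name for name, confidence in candidates if confidence > threshold]


candidates = [("A", 0.75), ("B", 0.60), ("C", 0.20)]
targets = filter_candidate_regions(candidates)
```

Note that the comparison is strict, so a candidate whose second confidence equals the threshold is removed.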

In one possible implementation, the accuracy of text recognition may be reduced due to the larger spacing between individual characters in the text region. Therefore, in the embodiment of the present application, in order to further improve the accuracy of text recognition, after determining that the second confidence of the candidate text region is greater than the second confidence threshold, the text region may be further screened according to connected domain analysis, that is, whether the text region is connected is analyzed, so as to further improve the accuracy of subsequent text recognition. Since the screening process for each candidate text region is the same, the following description also takes the candidate text region a as an example, and as shown in fig. 5, another schematic flow chart for screening the text region provided in the embodiment of the present application is provided, and a specific flow chart is described as follows.

Step 501: when the second confidence is determined to be greater than the set second confidence threshold, performing binarization processing on the candidate text region A to obtain a first candidate text region.

In this embodiment of the application, after determining that the second confidence of the candidate text region is greater than the second confidence threshold, the candidate text region may be subjected to binarization processing, for example, the pixel value of each pixel point corresponding to the character may be set to 1, and the pixel value of each pixel point corresponding to the background image may be set to 0, so as to convert the original candidate text region into the first candidate text region in the form of a binarized image.

Step 502: performing connected domain analysis on the first candidate text region to determine whether the first candidate text region is a connected region.

In this embodiment, the connected region may be an image region composed of non-background pixels having the same pixel value and adjacent positions.

Following the above example, since the pixel value of each pixel in the first candidate text region is 0 or 1, after the first candidate text region in the form of a binarized image is obtained, connected domain analysis can be performed on it to determine whether it is a connected region, that is, whether it is an image region composed of a plurality of pixels that have the same pixel value (for example, all pixels having a pixel value of 1) and are adjacent in position.

Step 503: if the first candidate text region is determined to be a connected region, determining the first candidate text region as the target text region.

In an embodiment of the present application, when the first candidate text region is determined to be a connected region, the first candidate text region may be determined as the target text region. Otherwise, the distance between the characters in the candidate text region is large, which is likely to affect the accuracy of text recognition, so the first candidate text region cannot be determined as the target text region; that is, the first candidate text region can be removed from all the candidate text regions, so that the text regions are filtered and the accuracy of subsequent text recognition is further improved.
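Steps 501 to 503 can be illustrated with the following sketch. It assumes a grayscale grid in which pixel values at or above a hypothetical threshold of 128 belong to characters, and it treats pixels as adjacent under 8-connectivity; the application fixes neither choice.

```python
from collections import deque


def binarize(gray, threshold=128):
    """Set character pixels to 1 and background pixels to 0 (assumed polarity)."""
    return [[1 if value >= threshold else 0 for value in row] for row in gray]


def is_connected_region(binary):
    """Return True if all pixels equal to 1 form a single 8-connected component."""
    foreground = {
        (y, x)
        for y, row in enumerate(binary)
        for x, value in enumerate(row)
        if value == 1
    }
    if not foreground:
        return False
    start = next(iter(foreground))
    seen = {start}
    queue = deque([start])
    while queue:  # breadth-first flood fill from one foreground pixel
        y, x = queue.popleft()
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                neighbor = (y + dy, x + dx)
                if neighbor in foreground and neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
    return len(seen) == len(foreground)


touching = binarize([[0, 200, 200], [0, 0, 255]])
separated = binarize([[255, 0, 0, 255]])
```

Here `touching` would be kept as a target text region, while `separated` contains two isolated components and would be removed.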

In the embodiment of the present application, in order to improve the accuracy of text recognition, in the model training, the shape of the text region marked in the sample image is an irregular quadrilateral, as shown in fig. 6, which is a schematic diagram of an irregular quadrilateral provided in the embodiment of the present application, and the irregular quadrilateral is determined by at least 4 coordinate points. Furthermore, the bounding box of the text region after the connected component filtering is not necessarily a text box determined by four coordinate points, but may be an irregular quadrilateral text box determined by a plurality of coordinate points as shown in fig. 6.

In one possible implementation, as shown in fig. 7, which is a schematic view of a text sample provided in the embodiment of the present application, text encountered in daily life may be displayed in any direction, for example, in an inclined form or in a curved form, in addition to the conventional horizontal display and vertical display. Among these display forms, curved text regions are the most difficult to recognize, and therefore the accuracy of recognizing curved text regions is low.

Furthermore, in the embodiment of the present application, a curved text region may be split into a plurality of non-curved text regions to further improve the accuracy of text recognition. Fig. 8 is a schematic flow chart of splitting a text region according to an embodiment of the present application, and a specific flow is described as follows.

Step 801: after the connected region is determined, a plurality of included angles between a plurality of text sub-regions in the first candidate text region and a preset first coordinate axis are determined.

In the embodiment of the present application, the direction indicated by the first coordinate axis may be the horizontal direction. As shown in fig. 6, the text region corresponding to the curved text "STOP" can be divided into a plurality of text sub-regions according to the degree of curvature. For example, the text region corresponding to "STOP" has 3 different inclinations, that is, 3 different included angles with the horizontal direction, namely included angle 1, included angle 2, and included angle 3. The text region corresponding to "STOP" can therefore be divided into 3 text sub-regions according to the 3 different included angles, namely text sub-region 1 corresponding to "S", text sub-region 2 corresponding to "TO", and text sub-region 3 corresponding to "P".

Furthermore, it can be seen that each included angle of the text region corresponding to "STOP" corresponds to one text sub-region, and when any two adjacent included angles are different, the part of the text region formed by the text sub-regions corresponding to the two adjacent included angles is curved.

Step 802: sequentially determining whether the difference between two adjacent included angles among the plurality of included angles is greater than a set angle threshold.

In the embodiment of the present application, the set angle threshold may be set to 10°.

In order to further improve the accuracy of text recognition, when it is determined that the degree of curvature between two adjacent text sub-regions is large, the two adjacent text sub-regions need to be split, that is, it is determined whether the two adjacent text sub-regions need to be split by determining whether a difference value between included angles corresponding to the two adjacent text sub-regions is greater than a set angle threshold.

Continuing with the above example, when splitting the text region corresponding to "STOP", it may be determined whether the difference between the included angle 1 and the included angle 2 is greater than the set angle threshold, and then it may be determined whether the difference between the included angle 2 and the included angle 3 is greater than the set angle threshold.

Step 803: when the difference between two adjacent included angles is determined to be greater than the set angle threshold, determining the boundary, in the first candidate text region, between the text sub-regions respectively corresponding to those two adjacent included angles as a dividing line.

In this embodiment of the application, when the difference between two adjacent included angles is determined to be greater than the set angle threshold, the part of the text region jointly formed by the text sub-regions corresponding to the two adjacent included angles is curved, and the degree of curvature affects the accuracy of text recognition. In this case, the text sub-regions corresponding to the two adjacent included angles need to be split, and the boundary between them can be determined as a dividing line.

As shown in fig. 6, if included angle 1 is 45°, included angle 2 is 0°, and the set angle threshold is 10°, the difference between included angle 1 and included angle 2 is 45°, which is greater than the angle threshold of 10°. Therefore, text sub-region 1 corresponding to "S" and text sub-region 2 corresponding to "TO" need to be split. Since the dashed line between "S" and "TO" in fig. 6 is the boundary between text sub-region 1 and text sub-region 2, this boundary can be determined as the dividing line between text sub-region 1 and text sub-region 2 when splitting is performed.

Step 804: acquiring a plurality of target text sub-regions according to the dividing lines, and determining the plurality of target text sub-regions as target text regions.

In the implementation of the present application, after a dividing line corresponding to a first candidate text region is determined, the first candidate text region may be divided based on the dividing line, and then, a plurality of target text sub-regions may be obtained, and the plurality of target text sub-regions may be determined as target text regions, so that text recognition may be performed based on the determined target text regions.

For example, as shown in fig. 6, assuming that included angle 1 is 45°, included angle 2 is 0°, included angle 3 is -45°, and the set angle threshold is 10°, the text region corresponding to "STOP" may finally be divided into 3 target text sub-regions, that is, target text sub-region 1 corresponding to "S", target text sub-region 2 corresponding to "TO", and target text sub-region 3 corresponding to "P". Target text sub-region 1, target text sub-region 2, and target text sub-region 3 may then be determined as the target text regions to be subjected to text recognition.
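Steps 801 to 804 can be sketched as follows, using the "STOP" example and the angle threshold of 10° given above. The sketch assumes that the included angle of each text sub-region is already known; the function name `split_by_angle` is hypothetical.

```python
def split_by_angle(sub_regions, angles, angle_threshold=10.0):
    """Group adjacent sub-regions, cutting wherever neighbouring angles differ too much.

    sub_regions: sub-region labels in reading order.
    angles: included angle (degrees) of each sub-region with the horizontal axis.
    """
    if len(sub_regions) != len(angles):
        raise ValueError("one angle is required per sub-region")
    if not sub_regions:
        return []
    groups = [[sub_regions[0]]]
    for index in range(1, len(sub_regions)):
        if abs(angles[index] - angles[index - 1]) > angle_threshold:
            groups.append([sub_regions[index]])   # dividing line here
        else:
            groups[-1].append(sub_regions[index])  # same straight segment
    return groups


stop_groups = split_by_angle(["S", "TO", "P"], [45.0, 0.0, -45.0])
gentle_groups = split_by_angle(["A", "B", "C"], [0.0, 5.0, 40.0])
```

In the second call, the 5° difference between "A" and "B" is below the threshold, so they remain in one target text sub-region and only "C" is split off.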

In a possible implementation, the typesetting directions of the characters are not necessarily all standard typesetting directions; that is, the characters are not necessarily displayed horizontally or vertically, but may be displayed in any direction. In addition, the characters in the text may be sheared or rotated; for example, when a picture is taken, a rectangle may be captured as a parallelogram, that is, an upright object may be captured as an inclined object. Therefore, in order to further improve the accuracy of text recognition, in the embodiment of the present application, the characters in the target text region may be preprocessed before text recognition is formally performed. Fig. 9 is a schematic flow chart of text recognition preprocessing provided in the embodiment of the present application, and a specific flow is described as follows.

Step 901: performing text direction classification on the plurality of target text regions based on a text direction classification function of the trained first text recognition model to obtain a plurality of first target text regions.

In the embodiment of the present application, a text direction detection model composed of a Convolutional Neural Network (CNN) and fully connected layers may be used to classify the text directions in the target text region, specifically, the text directions may be classified into 4 types, i.e., 0 °, 90 °, 180 °, and 270 °, or the text directions may be classified into 8 types, i.e., 0 °, 45 °, 90 °, 135 °, 180 °, 225 °, 270 °, and 315 °. Of course, how to perform the direction classification can be set according to the user's needs.

Step 902: performing text correction on the plurality of first target text regions based on a text correction sub-function of the trained first text recognition model to obtain a plurality of second target text regions.

In the embodiment of the application, a text correction network can be constructed based on affine transformation and interpolation principles. Based on the text correction network, a text image can have spatial invariance, so that a text character which has been sheared or rotated is corrected into a normal typesetting form while the text content is kept unchanged. That is, an angled text region may be rectified into a horizontal or vertical text box. At this time, a text region composed of multiple points may be processed such that the coordinates of the text box are represented by four corner points.
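The affine transformation and interpolation on which such a correction relies can be illustrated with the following sketch. It is not the text correction network itself: in the network the transformation parameters would be predicted, whereas here three corner points of a parallelogram are assumed to be given. The function names and the grayscale grid representation are hypothetical.

```python
def bilinear_sample(image, x, y):
    """Sample a grayscale grid at fractional (x, y) by bilinear interpolation."""
    height, width = len(image), len(image[0])
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    fx, fy = x - x0, y - y0
    top = image[y0][x0] * (1 - fx) + image[y0][x1] * fx
    bottom = image[y1][x0] * (1 - fx) + image[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


def rectify_parallelogram(image, top_left, top_right, bottom_left, out_w, out_h):
    """Warp a parallelogram, given by three (x, y) corners, to an upright rectangle."""
    output = []
    for v in range(out_h):
        ty = v / (out_h - 1) if out_h > 1 else 0.0
        row = []
        for u in range(out_w):
            tx = u / (out_w - 1) if out_w > 1 else 0.0
            # Affine map from output coordinates back into the source image.
            sx = top_left[0] + tx * (top_right[0] - top_left[0]) \
                + ty * (bottom_left[0] - top_left[0])
            sy = top_left[1] + tx * (top_right[1] - top_left[1]) \
                + ty * (bottom_left[1] - top_left[1])
            row.append(bilinear_sample(image, sx, sy))
        output.append(row)
    return output


# A two-pixel-wide pattern sheared one pixel to the right on the second row.
sheared = [[5, 6, 0],
           [0, 5, 6]]
upright = rectify_parallelogram(sheared, (0, 0), (1, 0), (1, 1), out_w=2, out_h=2)
```

Sampling by inverse mapping, as done here, guarantees that every output pixel receives a value, which is why interpolation is needed when the mapped coordinates are fractional.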

Step 903: performing text typesetting direction classification on the plurality of second target text regions based on the text typesetting direction classification function of the trained first text recognition model to obtain a plurality of third target text regions.

In the embodiment of the present application, a text typesetting direction classification detection model formed by a convolutional neural network and a fully connected layer may be used to perform text typesetting direction detection on a corrected text region. Specifically, the text typesetting directions may be classified into 2 types: horizontal typesetting and vertical typesetting (similar to the vertical typesetting of ancient texts).

Furthermore, when text recognition is carried out subsequently, the horizontally typeset text box can be sent to a horizontal text recognition network for text recognition, and the vertically typeset text box can be sent to a vertical text recognition network for text recognition. The network structure of the horizontal text recognition network is the same as that of the vertical text recognition network, and only the size of the input image is different: the input image of the horizontal text recognition network may have a width of 320 and a height of 40, and the input image of the vertical text recognition network may have a width of 40 and a height of 320.
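The routing by typesetting direction can be sketched as a lookup. The input sizes are the example values given above; the layout labels, the recognizer names, and the function name `route_region` are hypothetical.

```python
# Input sizes as (width, height), taken from the example values above.
INPUT_SIZES = {"horizontal": (320, 40), "vertical": (40, 320)}


def route_region(layout):
    """Return the recognizer name and input size for a typesetting direction."""
    if layout not in INPUT_SIZES:
        raise ValueError(f"unknown typesetting direction: {layout!r}")
    return f"{layout}_recognizer", INPUT_SIZES[layout]


horizontal_route = route_region("horizontal")
vertical_route = route_region("vertical")
```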

Step 904: inputting the plurality of third target text regions respectively into the trained first text recognition model for text recognition, and/or respectively into the trained second text recognition model for text recognition.

In the embodiment of the present application, after the target text region is subjected to text typesetting direction classification detection to obtain a third target text region, the third target text region may be input into the trained first text recognition model, so as to accurately infer texts such as punctuation marks, English words, and Chinese words according to the semantic information of the text shown in the third target text region.

And/or, the third target text region can be input into the trained second text recognition model to determine a second text recognition result according to the text length and the text semantic information, so that the alignment problem of indefinite-length sequences is solved while the text recognition result is determined according to the text semantic information, and the accuracy of text recognition is improved.

In one possible implementation, in order to further improve the accuracy of text recognition, when the target text recognition result is determined according to the confidence degrees corresponding to the first text recognition result and the second text recognition result, more than one trained second text recognition model may be used, that is, the target text recognition result may be determined from a plurality of second text recognition results corresponding to the plurality of trained second text recognition models and a plurality of first text recognition results of one trained first text recognition model. Fig. 10 is a schematic view of another flow of text recognition provided in the embodiment of the present application, and a specific flow is described as follows.

Step 1001: performing text recognition on the plurality of target text regions respectively according to the plurality of trained second text recognition models to obtain second text recognition results corresponding to the plurality of target text regions in each of the plurality of trained second text recognition models.

In practical application, for example, text recognition needs to be performed on a target text region A, and there are currently 3 trained second text recognition models with different weight parameters and 1 trained first text recognition model. For the target text region A, each of the 3 second text recognition models obtains 1 second text recognition result, so that 3 second text recognition results are obtained in total, and the first text recognition model obtains 1 first text recognition result. That is, 4 text recognition results can be obtained in total for the target text region A.

Step 1002: determining a plurality of target text recognition results corresponding to the image to be recognized according to the confidence degrees corresponding to the first text recognition results and the confidence degrees corresponding to the second text recognition results of each trained second text recognition model.

In practical application, a preferred voting mode can be adopted for the character results to further improve the accuracy of the text recognition result. Since the preferred voting process for all the target text regions is the same, the following description takes as an example a target text region A for which the same text recognition result B appears more than once. Fig. 11 is another schematic flow chart of text recognition provided in the embodiment of the present application, and a specific flow is described as follows.

Step 1101: for a target text region a of the plurality of target text regions, it is determined whether the same text recognition result B exists in the first text recognition result and the plurality of second text recognition results corresponding to the target text region a.

Step 1102: when it is determined that the same text recognition result B exists, the confidence of the same text recognition result B is increased.

In the embodiment of the present application, the confidence of the same text recognition result may be increased, that is, the same text recognition result may be given a higher weight. Specifically, assuming that 2 of the text recognition results for the target text region A are both the text recognition result B, that is, the same text recognition result exists, the confidence of the text recognition result B may be increased according to the following formula:

$$P = (p_1 + p_2 + \dots + p_n) \times 1.1^{\,n-1}$$

where $P$ is the increased confidence of the text recognition result B, $p_n$ is the confidence of the n-th text recognition result B, and $n$ is the number of times the text recognition result B occurs.

Step 1103: determining, according to the increased confidence of the same text recognition result B and the confidences of the remaining text recognition results of the target text region A, the text recognition result corresponding to the maximum confidence as the target text recognition result of the target text region A.

Step 1104: determining a plurality of target text recognition results corresponding to the image to be recognized according to the target text recognition results corresponding to the plurality of target text regions respectively.
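Steps 1101 to 1103 can be sketched for a single target text region as follows. The sketch reads the exponent in the above formula as n-1, so that a result occurring only once keeps its original confidence; the pair layout `(text, confidence)` and the function name `vote` are hypothetical.

```python
from collections import defaultdict


def vote(results, boost=1.1):
    """Pick the target recognition result of one region by preferred voting.

    results: list of (text, confidence) pairs from all recognition models.
    Identical texts are merged as (p_1 + ... + p_n) * boost ** (n - 1).
    """
    grouped = defaultdict(list)
    for text, confidence in results:
        grouped[text].append(confidence)
    scores = {
        text: sum(confidences) * boost ** (len(confidences) - 1)
        for text, confidences in grouped.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Two models agree on "B"; a third proposes "C" with a higher single confidence.
best_text, best_score = vote([("B", 0.5), ("B", 0.6), ("C", 0.9)])
```

Because the increased confidence is a boosted sum rather than an average, it may exceed 1; it serves only as a ranking score among the candidates of one region.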

In one possible embodiment, as many kinds of text as possible should be recognized, for example, simplified Chinese characters, traditional Chinese characters, English, and Chinese and English punctuation marks. Because the fonts and typesetting directions of the characters in network images are variable, the texts in the data set include data in the forms of horizontal typesetting, vertical typesetting, inclined typesetting, curved typesetting (such as circular typesetting), and the like. Therefore, before text recognition is performed, a basic database for model training needs to be established, which provides the samples for model training. In the embodiment of the present application, network images may be collected as a training set, where the annotation label of a network image may include text region coordinates and text content. In the labeling mode, texts at similar distances within a text region are labeled as one text box; for example, for a semantically coherent text, when the spacing exceeds one character, the characters on either side may be regarded as belonging to two text regions and labeled as two text boxes. In practical application, the labels of the training set text may be drawn from a dictionary library of more than 6,000 entries, which contains simplified Chinese characters, traditional Chinese characters, English letters, numbers, Chinese and English punctuation marks, special characters, and the like. Further, after text recognition by the text recognition model, labels from the above-described dictionary library may be output.

In summary, in the embodiment of the present application, the first text recognition model determines the first text recognition result according to the text semantic information, so the characters in the target text region can be inferred to improve the recognition accuracy of punctuation marks and words in Chinese and English. The second text recognition model determines the second text recognition result according to the text length and the text semantic information, so the alignment problem of indefinite-length sequences can be solved. Further, when the target text recognition result is determined jointly by the first text recognition model and the second text recognition model, a relatively large number of text recognition results with different confidences can be obtained for the same target text region; on this basis, selecting the recognition result with the highest confidence as the target text recognition result of the target text region can further improve the accuracy of text recognition.

As shown in fig. 12, based on the same inventive concept, an embodiment of the present application provides a text recognition apparatus 120, including:

a text region determining unit 1201, configured to determine a plurality of target text regions of the image to be recognized according to the trained text detection model;

a first recognition result determining unit 1202, configured to perform text recognition on the multiple target text regions according to the trained first text recognition model, and obtain first text recognition results corresponding to the multiple target text regions; the first text recognition model determines a first text recognition result according to the text semantic information;

a second recognition result determining unit 1203, configured to perform text recognition on the multiple target text regions according to the trained second text recognition model, and obtain second text recognition results corresponding to the multiple target text regions respectively; the second text recognition model determines a second text recognition result according to the text length and the text semantic information;

a target recognition result determining unit 1204, configured to determine, according to first confidence levels included in the plurality of first text recognition results and the plurality of second text recognition results, a plurality of target text recognition results corresponding to the image to be recognized; the first confidence coefficient is used for indicating the probability that a specific character exists in a target text region corresponding to the text recognition result; and one target text area corresponds to a plurality of recognition results, and the recognition result with the highest confidence degree in the plurality of recognition results is the target text recognition result of one target text area.

Optionally, the text region determining unit 1201 is specifically configured to:

obtaining a plurality of local image areas according to the first probability;

Optionally, the text region determining unit 1201 is further specifically configured to:

determining a second confidence degree corresponding to each of the candidate text regions; wherein the second confidence level is used for indicating the probability of text existence in the candidate text region;

determining whether a second confidence corresponding to one candidate text region in the plurality of candidate text regions is greater than a set second confidence threshold;

and determining a candidate text region as the target text region when the determination is larger than the set second confidence threshold.

when the confidence coefficient is determined to be larger than the set second confidence coefficient threshold value, performing binarization processing on a candidate text region to obtain a first candidate text region;

performing connected domain analysis on the first candidate text region to determine whether the first candidate text region is a connected region; the connected region is an image region which has the same pixel value and is formed by non-background pixel points adjacent in position;

and if the first candidate text region is determined to be the connected region, determining the first candidate text region as the target text region.

when the difference value between two adjacent included angles is larger than a set angle threshold value, determining a boundary between text sub-regions corresponding to the two adjacent included angles corresponding to the difference value larger than the set angle threshold value in the first candidate text region as a dividing line;

and acquiring a plurality of target text sub-regions according to the dividing lines, and determining the plurality of target text sub-regions as target text regions.

Optionally, the apparatus further comprises a text recognition preprocessing unit 1205, configured to:

based on a text direction classification function of the trained first text recognition model, performing text direction classification on the plurality of target text regions to obtain a plurality of first target text regions;

based on a text correction sub-function of the trained first text recognition model, performing text correction on the plurality of first target text regions to obtain a plurality of second target text regions;

performing text typesetting direction classification on the plurality of second target text regions based on a text typesetting direction classification function of the trained first text recognition model to obtain a plurality of third target text regions;

Optionally, the target recognition result determining unit 1204 is specifically configured to:

performing text recognition on the plurality of target text regions with each of the plurality of trained second text recognition models, to obtain, for each trained second text recognition model, second text recognition results corresponding to the plurality of target text regions;

and determining a plurality of target text recognition results corresponding to the image to be recognized according to the confidence degrees corresponding to the first text recognition results and the confidence degrees corresponding to the second text recognition results of each trained second text recognition model.

Optionally, the target recognition result determining unit 1204 is further specifically configured to:

when the same text recognition result is determined to exist, increasing the confidence coefficient of the same text recognition result;
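One possible way to combine the results of the first model and several second models for a single target text region is sketched below. The application states only that the confidence of an identical result is increased. The fixed increment `boost`, the cap at 1.0, the final selection of the highest-confidence text, and the name `fuse_results` are assumptions made for illustration.

```python
def fuse_results(candidates, boost=0.1):
    """Fuse (text, confidence) pairs produced by several recognition models
    for ONE target text region and return the selected (text, confidence).

    Assumption: each time a text repeats, its confidence becomes the larger
    of the stored and new confidences plus `boost`, capped at 1.0.
    """
    if not candidates:
        raise ValueError("at least one recognition result is required")
    merged = {}
    for text, conf in candidates:
        if text in merged:
            # The same text came from another model: raise its confidence.
            merged[text] = min(1.0, max(merged[text], conf) + boost)
        else:
            merged[text] = conf
    best = max(merged, key=merged.get)
    return best, merged[best]


# Example: two of three models agree on "ABC123".
text, conf = fuse_results([("ABC123", 0.80), ("ABC128", 0.85), ("ABC123", 0.78)])
```

In the example, the agreeing result "ABC123" is selected even though one disagreeing model reported a higher individual confidence. The size of the increment is a tuning choice that the text above does not specify.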

The apparatus may be configured to execute the methods described in the embodiments shown in fig. 2 to 11. For the functions that can be realized by each functional module of the apparatus, reference may therefore be made to the description of the embodiments shown in fig. 2 to 11, which is not repeated here. It should be noted that the functional units shown by the dashed boxes in fig. 12 are optional functional units of the apparatus.

Referring to fig. 13, based on the same technical concept, an embodiment of the present application further provides a computer device 130, which may include a memory 1301 and a processor 1302.

The memory 1301 is used for storing computer programs executed by the processor 1302. The memory 1301 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, an application program required for at least one function, and the like; the data storage area may store data created according to use of the computer device, and the like. The processor 1302 may be a Central Processing Unit (CPU), a digital processing unit, or the like. The specific connection medium between the memory 1301 and the processor 1302 is not limited in this embodiment. In the embodiment of the present application, the memory 1301 and the processor 1302 are connected through a bus 1303 in fig. 13, the bus 1303 is shown by a thick line in fig. 13, and the connection manner between other components is merely an illustrative description and is not limited thereto. The bus 1303 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in fig. 13, but this does not mean that there is only one bus or one type of bus.

The memory 1301 may be a volatile memory, such as a random-access memory (RAM); the memory 1301 may also be a non-volatile memory, such as, but not limited to, a read-only memory (ROM), a flash memory, a hard disk drive (HDD) or a solid-state drive (SSD); or the memory 1301 may be any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory 1301 may also be a combination of the above.

The processor 1302 is configured to execute the method performed by the apparatus in the embodiments shown in fig. 2 to fig. 11 when calling the computer program stored in the memory 1301.

In some possible embodiments, various aspects of the methods provided herein may also be implemented in the form of a program product including program code. When the program product is run on a computer device, the program code causes the computer device to perform the steps of the methods according to the various exemplary embodiments of the present application described above in this specification. For example, the computer device may perform the methods described in the embodiments shown in fig. 2 to 11.

Those of ordinary skill in the art will understand that all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the method embodiments. The aforementioned storage medium includes: a mobile storage device, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program code. Alternatively, the integrated unit of the present application may be stored in a computer-readable storage medium if it is implemented in the form of a software functional module and sold or used as a separate product. Based on such understanding, the technical solutions of the embodiments of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present application. Such a storage medium likewise includes a removable storage device, a ROM, a RAM, a magnetic or optical disk, or various other media that can store program code.

While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims

1. A method of text recognition, the method comprising:

2. The method of claim 1, wherein determining a plurality of target text regions of an image to be recognized according to the trained text detection model comprises:

obtaining a plurality of local image areas according to the first probability;

3. The method of claim 2, wherein determining a plurality of target text regions of the image to be recognized based on the plurality of candidate text regions comprises:

4. The method of claim 3, wherein determining the one candidate text region as the target text region upon determining that the confidence is greater than the set second confidence threshold comprises:

5. The method of claim 4, wherein determining the first candidate text region as a target text region if determined as a connected region comprises:

6. The method of claim 1, wherein before performing text recognition on the plurality of target text regions according to the trained first text recognition model to obtain the first text recognition result corresponding to each of the plurality of target text regions, and/or before performing text recognition on the plurality of target text regions according to the trained second text recognition model to obtain the second text recognition result corresponding to each of the plurality of target text regions, the method further comprises:

7. The method of claim 1, wherein, when there are a plurality of trained second text recognition models, determining a plurality of target text recognition results corresponding to the image to be recognized according to the respective confidence degrees of the plurality of first text recognition results and the plurality of second text recognition results comprises:

8. The method of claim 7, wherein determining a plurality of target text recognition results corresponding to the image to be recognized according to the confidence degrees corresponding to the first text recognition results and the confidence degrees corresponding to the second text recognition results of each trained second text recognition model comprises:

9. A text recognition apparatus, characterized in that the apparatus comprises:

10. Computer apparatus comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein,

the processor, when executing the computer program, realizes the steps of the method of any one of claims 1 to 8.

11. A computer storage medium having computer program instructions stored thereon, wherein,

the computer program instructions, when executed by a processor, implement the steps of the method of any one of claims 1 to 8.