CN111242129A - Method and device for end-to-end character detection and identification

Method and device for end-to-end character detection and identification

Info

Publication number
CN111242129A
CN111242129A
Authority
CN
China
Prior art keywords
character
network
detection result
result
recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010006312.9A
Other languages
Chinese (zh)
Inventor
连庆
宋彦
王咏刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Innovation Workshop (Guangzhou) Artificial Intelligence Research Co Ltd
Original Assignee
Innovation Workshop (Guangzhou) Artificial Intelligence Research Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Innovation Workshop (Guangzhou) Artificial Intelligence Research Co Ltd
Priority to CN202010006312.9A
Publication of CN111242129A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/148 Segmentation of character regions
    • G06V30/153 Segmentation of character regions using recognition of characters or words
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition

Abstract

The application provides a method and a device for end-to-end character detection and recognition, wherein the method comprises the following steps: inputting a target picture into a feature extraction network to obtain shared feature information corresponding to the target picture; inputting the shared feature information into a character detection network to obtain a character detection result output by the character detection network, wherein the character detection result comprises a character area detection result and a character connection area detection result; inputting the shared feature information and the character detection result into a character recognition network to obtain a character recognition result output by the character recognition network; and generating a final recognition result according to the character detection result and the character recognition result. The scheme of the application can solve the misalignment between predicted regions and actual regions that may arise in existing attention networks, and can better fuse the character detection system and the character recognition system.

Description

Method and device for end-to-end character detection and identification
Technical Field
The application relates to the technical field of computers, in particular to a technical scheme for end-to-end character detection and identification.
Background
Automatic text detection and recognition requires a model that can simultaneously locate text and recognize the characters in a natural image. This task plays a crucial role in many practical applications: text detection and recognition are needed in many scenes in fields such as autonomous driving, image retrieval, and industrial automation. In the prior art, researchers have proposed many methods to solve this task, which can be divided into two categories: end-to-end and Two-stage. However, in conventional schemes, the detection and recognition of text are generally treated as two independent tasks executed separately, with the following specific steps: a detection model first locates text instances in the image, after which a recognition model decodes the detected text regions.
FIG. 1 shows the flow of a conventional Two-stage character recognition system: a picture (Image) is input; a first feature encoder performs first feature extraction; text detection (Text detection) is performed to obtain detection results (Detection results); a second feature encoder then performs second feature extraction; and finally text recognition (Text recognition) is performed. Although the Two-stage approach has been popular for some time, it suffers from the following disadvantages: 1) a non-end-to-end system allows errors to propagate from the detection network to the recognition network, making the system unstable; 2) the two-step operation needs two independent feature extraction models (that is, the two feature extractions are independent of each other), which greatly increases the computational burden; 3) optimizing the two systems separately causes the final text recognition algorithm to suffer from a local optimization problem. Therefore, to meet the requirements of fast execution and high performance in real-world applications, the industry has focused on end-to-end text detection and recognition, and has achieved certain improvements. Fig. 2 shows the flow of an existing end-to-end character recognition system; as can be seen from Fig. 2, compared with the Two-stage scheme shown in Fig. 1, the end-to-end character recognition system merely shares the feature extraction branch between the detection and recognition systems. Although end-to-end character detection and recognition can improve the capability of the model to a certain extent, this scheme still has some problems: for example, irregularly shaped text cannot be handled well, and, lacking character position information, the attention network used for text recognition may fail to align the predicted region of interest with the actual region.
Disclosure of Invention
The application aims to provide a technical scheme for end-to-end character detection and identification.
According to one embodiment of the application, a method for end-to-end text detection and identification is provided, wherein the method comprises the following steps:
inputting a target picture into a feature extraction network to obtain shared feature information corresponding to the target picture;
inputting the shared feature information into a character detection network, and obtaining a character detection result output by the character detection network, wherein the character detection result comprises a character area detection result and a character connection area detection result;
inputting the shared feature information and the character detection result into a character recognition network to obtain a character recognition result output by the character recognition network;
and generating a final recognition result according to the character detection result and the character recognition result.
According to another embodiment of the present application, there is also provided an apparatus for end-to-end text detection and recognition, wherein the apparatus includes:
the device is used for inputting a target picture into a feature extraction network and obtaining shared feature information corresponding to the target picture;
the device is used for inputting the shared feature information into a character detection network and obtaining a character detection result output by the character detection network, wherein the character detection result comprises a character area detection result and a character connection area detection result;
the device is used for inputting the shared feature information and the character detection result into a character recognition network to obtain a character recognition result output by the character recognition network;
and the device is used for generating a final recognition result according to the character detection result and the character recognition result.
There is also provided, in accordance with another embodiment of the present application, a computer apparatus, wherein the computer apparatus includes: a memory for storing one or more programs; one or more processors coupled with the memory, the one or more programs, when executed by the one or more processors, causing the one or more processors to perform operations comprising:
inputting a target picture into a feature extraction network to obtain shared feature information corresponding to the target picture;
inputting the shared feature information into a character detection network, and obtaining a character detection result output by the character detection network, wherein the character detection result comprises a character area detection result and a character connection area detection result;
inputting the shared feature information and the character detection result into a character recognition network to obtain a character recognition result output by the character recognition network;
and generating a final recognition result according to the character detection result and the character recognition result.
According to another embodiment of the present application, there is also provided a computer-readable storage medium having a computer program stored thereon, the computer program being executable by a processor to:
inputting a target picture into a feature extraction network to obtain shared feature information corresponding to the target picture;
inputting the shared feature information into a character detection network, and obtaining a character detection result output by the character detection network, wherein the character detection result comprises a character area detection result and a character connection area detection result;
inputting the shared feature information and the character detection result into a character recognition network to obtain a character recognition result output by the character recognition network;
and generating a final recognition result according to the character detection result and the character recognition result.
There is further provided, in accordance with another embodiment of the present application, a computer program product which, when executed by an apparatus, causes the apparatus to perform operations comprising:
inputting a target picture into a feature extraction network to obtain shared feature information corresponding to the target picture;
inputting the shared feature information into a character detection network, and obtaining a character detection result output by the character detection network, wherein the character detection result comprises a character area detection result and a character connection area detection result;
inputting the shared feature information and the character detection result into a character recognition network to obtain a character recognition result output by the character recognition network;
and generating a final recognition result according to the character detection result and the character recognition result.
Compared with the prior art, the application has the following advantages: 1) character detection is performed at the character level, so the problems of character deformation and indefinite text length can be better solved; 2) by providing the character-level detection result predicted by the character detection network to the character recognition network, the attention network in the character recognition network can predict using the character detection result, which solves the misalignment between predicted regions and actual regions that may exist in conventional attention networks; 3) the prediction result of the character detection network can be improved based on the intermediate result information generated by the character recognition network during recognition, thereby achieving more accurate character recognition and better fusing the character detection system and the character recognition system.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 illustrates a conventional Two-stage word recognition system flow;
FIG. 2 illustrates a prior art end-to-end text recognition system flow;
FIG. 3 is a flow diagram illustrating a method for end-to-end text detection and recognition according to one embodiment of the present application;
FIG. 4 illustrates a flow chart for improving character detection results according to an example of the present application;
FIG. 5 illustrates an architectural diagram of a system for end-to-end text detection and recognition according to an example of the present application;
FIG. 6 shows a schematic structural diagram of an apparatus for end-to-end text detection and recognition according to an example of the present application;
FIG. 7 illustrates an exemplary system that can be used to implement the various embodiments described in this application.
The same or similar reference numbers in the drawings identify the same or similar elements.
Detailed Description
Before discussing exemplary embodiments in more detail, it should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart may describe the operations as a sequential process, many of the operations can be performed in parallel, concurrently, or simultaneously. In addition, the order of the operations may be re-arranged. The process may be terminated when its operations are completed, but may have additional steps not included in the figure. The processes may correspond to methods, functions, procedures, subroutines, and the like.
The term "device" in this context refers to an intelligent electronic device that can perform predetermined processes such as numerical calculations and/or logic calculations by executing predetermined programs or instructions, and may include a processor and a memory, wherein the predetermined processes are performed by the processor executing program instructions prestored in the memory, or performed by hardware such as an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), or performed by a combination of the above two.
The technical scheme of the application is mainly realized by computer equipment, where the computer equipment comprises network devices and user devices. A network device includes, but is not limited to, a single network server, a server group consisting of a plurality of network servers, or a Cloud Computing based cloud consisting of a large number of computers or network servers; Cloud Computing is a form of distributed computing in which a super virtual computer is composed of a collection of loosely coupled computers. User equipment includes but is not limited to PCs, tablets, smart phones, IPTV, PDAs, wearable devices, and the like. The computer equipment can operate independently to realize the application, or can access a network and realize the application through interactive operation with other computer equipment in the network. The network in which the computer device is located includes, but is not limited to, the internet, a wide area network, a metropolitan area network, a local area network, a VPN, a wireless Ad Hoc network, and the like.
It should be noted that the above-mentioned computer devices are only examples, and other computer devices that are currently available or that may come into existence in the future, such as may be applicable to the present application, are also included within the scope of the present application and are incorporated herein by reference.
The methodologies discussed hereinafter, some of which are illustrated by flow diagrams, may be implemented by hardware, software, firmware, middleware, microcode, hardware description languages, or any combination thereof. When implemented in software, firmware, middleware or microcode, the program code or code segments to perform the necessary tasks may be stored in a machine or computer readable medium such as a storage medium. The processor(s) may perform the necessary tasks.
Specific structural and functional details disclosed herein are merely representative and are provided for purposes of describing example embodiments of the present application. This application may, however, be embodied in many alternate forms and should not be construed as limited to only the embodiments set forth herein.
It will be understood that, although the terms first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first element may be termed a second element, and, similarly, a second element may be termed a first element, without departing from the scope of example embodiments. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used herein, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be noted that, in some alternative implementations, the functions/acts noted may occur out of the order noted in the figures. For example, two figures shown in succession may, in fact, be executed substantially concurrently, or the figures may sometimes be executed in the reverse order, depending upon the functionality/acts involved.
The present application is described in further detail below with reference to the attached figures.
Fig. 3 is a flowchart illustrating a method for end-to-end character detection and recognition according to an embodiment of the present application. The method according to the present embodiment includes step S11, step S12, step S13, and step S14. In step S11, the computer device inputs the target picture into the feature extraction network and obtains shared feature information corresponding to the target picture; in step S12, the computer device inputs the shared feature information into a character detection network and obtains a character detection result output by the character detection network, where the character detection result includes a character area detection result and a character connection area detection result; in step S13, the computer device inputs the shared feature information and the character detection result into a character recognition network and obtains a character recognition result output by the character recognition network; in step S14, the computer device generates a final recognition result according to the character detection result and the character recognition result. Optionally, the method of this embodiment is implemented based on Anchor-free (no anchor boxes required) target detection.
In step S11, the computer device inputs the target picture into the feature extraction network and obtains shared feature information corresponding to the target picture. The target picture is the picture to be recognized and contains text content; in some embodiments, the computer device may obtain the target picture in any feasible way, such as by shooting it, reading it locally, or receiving it from another device. The feature extraction network may be any network model for extracting features from pictures; for example, the backbone of the feature extraction network may adopt VGG (Visual Geometry Group) with batch normalization. In some embodiments, a data set is collected and trained with a deep learning neural network to obtain the feature extraction network. In some embodiments, the shared feature information includes any shared-feature-related information that can be used in both the character detection network and the character recognition network. It should be noted that the feature extraction network and the shared feature information are not specifically limited in the present application: any network model for performing feature extraction falls within the scope of the feature extraction network described in the present application, and any shared-feature-related information usable in the character detection network and the character recognition network falls within the scope of the shared feature information described in the present application.
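A minimal sketch of such a shared backbone follows (written in PyTorch, which the application does not mandate); the depth and channel sizes are illustrative assumptions, since the text specifies only a VGG-style backbone with batch normalization.
```python
import torch
import torch.nn as nn

class SharedFeatureExtractor(nn.Module):
    """VGG-style backbone with batch normalization that produces the
    shared feature map consumed by both the detection and the
    recognition branches. Depth and widths are illustrative only."""
    def __init__(self, out_channels: int = 256):
        super().__init__()
        def conv_bn(c_in, c_out):
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                nn.BatchNorm2d(c_out),
                nn.ReLU(inplace=True),
            )
        self.features = nn.Sequential(
            conv_bn(3, 64), conv_bn(64, 64), nn.MaxPool2d(2),
            conv_bn(64, 128), conv_bn(128, 128), nn.MaxPool2d(2),
            conv_bn(128, 256), conv_bn(256, out_channels),
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (B, 3, H, W) -> shared features: (B, C, H/4, W/4)
        return self.features(image)
```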
In step S12, the computer device inputs the shared feature information into the character detection network and obtains the character detection result output by the character detection network, where the character detection result includes a character area detection result and a character connection area detection result. In this context, the branch in which the character detection network is located may also be referred to as the "character detection branch", which in turn may be called the "character detection system" and is used for performing character detection. In some embodiments, the character detection network includes any network model that enables character-level text region detection. In some embodiments, the character detection network uses a segmentation model to achieve character-level text region detection; in some embodiments, a VGG network plus a fully convolutional network is used as the segmentation model, which outputs the character area detection result and the character connection area detection result. The character area detection result indicates the detection result related to the character areas detected in the target picture, such as each detected character area and the probability of each single-character center region; the character connection area detection result indicates the detection result related to the character connection areas detected in the target picture, such as each detected character connection area and the probability of its center, where one character connection area represents the connection region between two adjacent characters. In some embodiments, the character detection network uses Gaussian heat maps to generate the character area detection result and the character connection area detection result: the character area detection result represents the probability of each character center region by a Gaussian heat map, and the character connection area detection result represents the probability of the center of each adjacent-character connection region by a Gaussian heat map. In some embodiments, the character area detection result includes a character area Gaussian map and optionally character area borders, and the character connection area detection result includes a character connection area Gaussian map and optionally character connection area borders; using Gaussian heat maps has the advantage of handling well border regions that are not strictly enclosed. It should be noted that the character area detection result and the character connection area detection result may also be represented in other forms, which is not limited in this application: any manner capable of representing the probability of a character center region is a feasible representation of the character area detection result, and any manner capable of representing the probability of the center of an adjacent-character connection region is a feasible representation of the character connection area detection result.
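As an illustration of such a segmentation head, the sketch below maps the shared features to the two heatmaps; it is a plain fully convolutional stack, and the channel widths and the absence of upsampling layers are simplifying assumptions.
```python
import torch
import torch.nn as nn

class CharacterDetectionHead(nn.Module):
    """Fully convolutional head emitting two per-pixel probability maps:
    channel 0 - character (center) region, channel 1 - character
    connection region. Widths are illustrative; a real head would
    typically add upsampling and skip connections."""
    def __init__(self, in_channels: int = 256):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_channels, 128, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 2, kernel_size=1),
        )

    def forward(self, shared_feats: torch.Tensor):
        maps = torch.sigmoid(self.head(shared_feats))
        char_region_map, char_link_map = maps[:, 0], maps[:, 1]
        return char_region_map, char_link_map
```
During training, these two outputs would be regressed against the Gaussian label maps described further below.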
In step S13, the computer device inputs the shared feature information and the character detection result into the character recognition network and obtains the character recognition result output by the character recognition network. In this context, the branch in which the character recognition network is located may also be referred to as the "character recognition branch", which in turn may be called the "character recognition system" and is used for performing character recognition. In some embodiments, the character area detection result is used to guide the attention network in the character recognition network in predicting character regions (that is, to guide the attention network on which regions it needs to attend to), and the character connection area detection result is used by the character recognition network to assemble the recognized characters into corresponding text content (for example, determining from the character connection area detection result which characters form continuous text, so as to recognize at least one continuous text). In some embodiments, the method further comprises, before step S11, training to obtain the character recognition network. In some embodiments, the character recognition network first uses a Bidirectional Long Short-Term Memory network (BiLSTM) to capture character timing information, and then uses an attention mechanism to predict the character regions and character content based on the character area detection result from the character detection network.
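The timing-capture step can be sketched as follows: the shared feature map is collapsed along its height and fed column by column to a BiLSTM. Collapsing by mean pooling and the hidden size are assumptions; the application states only that a BiLSTM captures character timing information.
```python
import torch
import torch.nn as nn

class SequenceEncoder(nn.Module):
    """Collapses the shared feature map over its height and runs a
    BiLSTM across the width so each column carries left and right
    context (the 'character timing information')."""
    def __init__(self, in_channels: int = 256, hidden: int = 256):
        super().__init__()
        self.rnn = nn.LSTM(in_channels, hidden, num_layers=2,
                           bidirectional=True, batch_first=True)

    def forward(self, shared_feats: torch.Tensor) -> torch.Tensor:
        # (B, C, H, W) -> mean over H -> (B, W, C)
        seq = shared_feats.mean(dim=2).permute(0, 2, 1)
        out, _ = self.rnn(seq)   # (B, W, 2 * hidden)
        return out
```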
In step S14, the computer device generates the final recognition result according to the character detection result and the character recognition result. In some embodiments, the final recognition result includes at least one text box and the text content in each text box, where each text box contains a recognized string of consecutive characters. In some embodiments, the computer device locates the position of each text box according to the character detection result and determines the text content in each text box by combining the character recognition result.
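One plausible way to locate the text boxes from the two heatmaps is sketched below (using SciPy's connected-component labelling): binarize the character map and the connection map, take their union so that linked characters merge, and read one box per connected component. The thresholds and the axis-aligned boxes are assumptions for illustration.
```python
import numpy as np
from scipy import ndimage

def group_characters(char_map: np.ndarray, link_map: np.ndarray,
                     char_thresh: float = 0.6, link_thresh: float = 0.4):
    """Turns the character-region and connection-region heatmaps into
    word-level boxes (x0, y0, x1, y1). Thresholds are illustrative."""
    # Union of both masks: connection pixels bridge adjacent characters,
    # so the characters of one word fall into one connected component.
    text_mask = (char_map >= char_thresh) | (link_map >= link_thresh)
    labels, n = ndimage.label(text_mask)
    boxes = []
    for k in range(1, n + 1):
        ys, xs = np.nonzero(labels == k)
        boxes.append((xs.min(), ys.min(), xs.max(), ys.max()))
    return boxes
```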
In some embodiments, the method further comprises, before step S11: the computer device trains the character detection network according to a plurality of items of sample data with marked character positions and the label information corresponding to each item of sample data, where the label information corresponding to each item of sample data includes the character area detection result and the character connection area detection result corresponding to that item. The label information represents the labels generated for the marked data in the sample data; in some embodiments, the label information corresponding to an item of sample data with marked character positions includes the character area Gaussian map and the character connection area Gaussian map obtained for that item, that is, the originally marked bounding boxes are converted into Gaussian maps. In some embodiments, the computer device obtains a plurality of items of sample data with marked character positions and the label information corresponding to each item (for example, including the character center region probabilities and the character connection region center probabilities corresponding to each item), and trains on the sample data and the corresponding label information with a deep learning neural network to obtain the character detection network. In some embodiments, each item of sample data is marked with the character area corresponding to each character, that is, the bounding box of each character; in some embodiments, each item of sample data is marked with a text area and the character area corresponding to each character in the text area. With the trained character detection network, the character area detection result and the character connection area detection result corresponding to a target picture without marked data can be output. In some embodiments, the computer device collects a plurality of items of sample data with marked character positions and generates the label information corresponding to each item. As an example, after collecting a plurality of items of sample data with marked character positions (i.e., the bounding boxes of the individual characters), the following operations are performed for each item: for each character, the four vertices of the character form a quadrilateral, and an inside-to-outside thermodynamic diagram for the character is constructed with the quadrilateral as the border and the center point of the character as the basis, yielding the character area Gaussian map; for each character connection area joining two adjacent characters, the center point between the two characters is first located, the corresponding top and bottom points are found to form a new quadrilateral, and an inside-to-outside thermodynamic diagram is constructed with the new quadrilateral as the border and the center point between the two characters as the basis, yielding the character connection area Gaussian map corresponding to the two adjacent characters.
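The label-generation step ("convert the marked bounding box into a Gaussian map") can be sketched as follows: render an isotropic 2D Gaussian on a square template and warp it into each annotated quadrilateral with a perspective transform. The template size and sigma are illustrative assumptions; the connection-area label would be produced the same way from the quadrilateral built between two adjacent characters.
```python
import cv2
import numpy as np

def render_gaussian_label(canvas: np.ndarray, quad: np.ndarray,
                          size: int = 64, sigma_ratio: float = 0.25) -> np.ndarray:
    """Warps a square isotropic Gaussian into one annotated quadrilateral
    (4 points, clockwise from top-left) and accumulates it onto the
    float32 label canvas. Template size and sigma are illustrative."""
    ax = np.arange(size, dtype=np.float32) - (size - 1) / 2.0
    xx, yy = np.meshgrid(ax, ax)
    sigma = size * sigma_ratio
    gauss = np.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2)).astype(np.float32)

    src = np.float32([[0, 0], [size - 1, 0],
                      [size - 1, size - 1], [0, size - 1]])
    M = cv2.getPerspectiveTransform(src, quad.astype(np.float32))
    h, w = canvas.shape
    warped = cv2.warpPerspective(gauss, M, (w, h))
    np.maximum(canvas, warped, out=canvas)  # keep the stronger response
    return canvas
```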
In some embodiments, step S13 further includes: inputting the shared feature information into a bidirectional long short-term memory network in the character recognition network to obtain character timing information corresponding to the target picture; and inputting the character timing information and the character detection result into an attention network in the character recognition network, so that the attention network predicts the character content in each character area according to at least one character area indicated by the character area detection result, and generates the character recognition result according to the character connection area detection result corresponding to the target picture and the character content in each character area. In this way, the character detection result obtained in the character detection network guides the attention network in the character recognition branch on which character areas to predict character content from; that is, the character area detection result provides a potential prediction range for the attention network, which effectively solves the misalignment between predicted regions and actual regions that may exist in the attention network of a conventional text recognition system.
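A sketch of one such guided attention step is given below: the character-region heatmap, collapsed over height into a per-column prior, is added in log space to the attention scores, so that columns inside detected character areas receive most of the weight. Exactly how the guidance enters the score function is an assumption; the application states only that the detection result guides the attention network.
```python
import torch
import torch.nn as nn

class GuidedAttentionStep(nn.Module):
    """One attention decoding step whose weights are biased by a
    per-column character-region prior derived from the detection
    branch. enc_dim matches the BiLSTM output (2 * hidden)."""
    def __init__(self, enc_dim: int = 512, hidden: int = 256,
                 num_classes: int = 100):
        super().__init__()
        self.score = nn.Linear(enc_dim + hidden, 1)
        self.cell = nn.GRUCell(enc_dim, hidden)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, enc_seq, state, region_prior):
        # enc_seq: (B, W, enc_dim); state: (B, hidden);
        # region_prior: (B, W), character-region map averaged over height.
        B, W, _ = enc_seq.shape
        expanded = state.unsqueeze(1).expand(B, W, -1)
        e = self.score(torch.cat([enc_seq, expanded], dim=-1)).squeeze(-1)
        # Bias scores toward detected character areas (log-space prior).
        alpha = torch.softmax(e + torch.log(region_prior + 1e-6), dim=1)
        context = (alpha.unsqueeze(-1) * enc_seq).sum(dim=1)
        state = self.cell(context, state)
        return self.classifier(state), state, alpha
```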
In some embodiments, the method further comprises: in the character recognition network, judging whether at least one character area indicated by the character area detection result is reliable, and if not, sending intermediate result information corresponding to the target picture from the character recognition network to the character detection network. In this case, generating the final recognition result according to the character detection result and the character recognition result comprises: adjusting the character detection result in the character detection network according to the intermediate result information to obtain a new character detection result; and generating the final recognition result according to the new character detection result and the character recognition result. In some embodiments, the intermediate result information includes any information related to the intermediate results generated in the character recognition network before the character recognition result is obtained; in some embodiments, the intermediate result information includes first data information output by the bidirectional long short-term memory network and second data information output by the attention network. Whether the at least one character area indicated by the character area detection result is reliable is, in effect, whether the potential prediction range provided by the character detection result is reliable: if the at least one character area is reliable, the character detection result is accurate; if not, the character detection result may contain errors. By adjusting the character detection result in the character detection network according to the intermediate result information from the character recognition network, the character detection result initially output by the character detection network can be optimized and improved, yielding a more accurate new character detection result and thus a more accurate final recognition result.
In some embodiments, judging whether the at least one character area indicated by the character area detection result is reliable includes: judging whether the at least one character area is accurate, and if not, determining that the at least one character area is not reliable; if it is accurate, judging whether the character content in the at least one character area can be recognized in sequence; if so, determining that the at least one character area is reliable, and otherwise, determining that it is not reliable. In some embodiments, if the at least one character area indicated by the character area detection result is reliable, no further operation is needed, that is, the character detection result output by the character detection network does not need to be improved or optimized. In some embodiments, if the at least one character area indicated by the character area detection result is reliable, the character detection result is accurate, and therefore the character detection network can be optimized according to the target picture and the character detection result.
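This reliability test can be sketched as two simple checks mirroring the text above: the regions are trusted only if every attention peak lands near a detected character center (the region is accurate) and every decoded character clears a confidence threshold (the content can be recognized in sequence). The concrete criteria and thresholds are assumptions for illustration.
```python
def regions_are_reliable(attn_peaks, char_confidences, region_centers,
                         dist_thresh: float = 4.0,
                         conf_thresh: float = 0.5) -> bool:
    """attn_peaks: column index of each attention maximum;
    char_confidences: max class probability of each decoded character;
    region_centers: column centers of the detected character areas."""
    for peak, conf in zip(attn_peaks, char_confidences):
        if min(abs(peak - c) for c in region_centers) > dist_thresh:
            return False  # predicted and detected regions misaligned
        if conf < conf_thresh:
            return False  # character cannot be recognized in sequence
    return True
```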
In some embodiments, adjusting the character detection result in the character detection network according to the intermediate result information to obtain a new character detection result includes: adjusting the character detection result in the character detection network according to the intermediate result information and the shared feature information to obtain the new character detection result. Fig. 4 is a schematic diagram illustrating a flow for improving a character detection result according to an example of the present application. The specific flow is as follows: first, shared feature information is extracted from the target picture by a feature extractor (i.e., the feature extraction model), and the obtained shared feature information is input into the character detection network and the character recognition network respectively; the character detection network performs character localization (Localization) to obtain a coarse character detection result (Coarse result), which is then input into the character recognition network as a guide (Guide) indicating which regions to predict in; the character recognition network performs character recognition (Recognition), first using a BiLSTM to capture character timing information and then using an attention mechanism to perform attention prediction, i.e., sequentially predicting the character content in each character area according to the at least one character area indicated by the coarse character detection result, while judging whether the at least one character area indicated by the coarse character detection result is accurate; then, according to the shared feature information, the coarse character detection result generated in the character detection network, and the intermediate result information generated by the BiLSTM and the attention prediction mechanism in the character recognition network, localization refinement (Localization refinement) can be performed to obtain an improved new character detection result.
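The refinement step can be sketched as a small fusion network: the shared features, the coarse heatmaps, and the intermediate recognition features (broadcast back over the spatial grid) are concatenated and convolved into refined heatmaps. This concatenate-and-convolve fusion is an assumption; the application specifies only which inputs the refinement consumes.
```python
import torch
import torch.nn as nn

class LocalizationRefiner(nn.Module):
    """Fuses shared features, the two coarse detection maps, and a
    vector of intermediate recognition features into refined character
    and connection heatmaps."""
    def __init__(self, feat_ch: int = 256, interm_ch: int = 512):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(feat_ch + 2 + interm_ch, 128, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 2, kernel_size=1),
        )

    def forward(self, shared_feats, coarse_maps, interm_feats):
        # shared_feats: (B, feat_ch, H, W); coarse_maps: (B, 2, H, W);
        # interm_feats: (B, interm_ch), broadcast to every location.
        B, _, H, W = shared_feats.shape
        interm = interm_feats.view(B, -1, 1, 1).expand(B, -1, H, W)
        x = torch.cat([shared_feats, coarse_maps, interm], dim=1)
        return torch.sigmoid(self.fuse(x))  # refined region + link maps
```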
In some embodiments, the method further comprises: optimizing the character detection network and/or the character recognition network according to the character detection result, the character recognition result, and the final recognition result. Based on these results, the character detection network and/or the character recognition network can be further optimized, so that more accurate recognition results can be obtained in subsequent character detection and recognition.
Fig. 5 shows an architecture diagram of a system for end-to-end character detection and recognition according to an example of the present application. In this system, after a target picture (the Input image in Fig. 5) is input, shared feature information is extracted by the feature extraction model and input into the character detection branch for detection (Detection) and into the character recognition branch for recognition (Recognition). In the character detection branch, the character detection result (Character detection result) corresponding to the target picture, such as the character area Gaussian map and the character connection area Gaussian map, is obtained by detection and is input into the character recognition branch to guide (Guide) the attention prediction mechanism (Attention prediction) as to which regions to predict in, yielding the character recognition result. The character recognition network may judge from the prediction results whether the character detection result is accurate, and feed back (Feedback) intermediate result information from the character recognition network to the character detection network to refine (Refine) the character detection result into a new character detection result (New character detection result). The final recognition result (the Text spotting results in Fig. 5) is then obtained based on the new character detection result and the character recognition result.
Fig. 6 shows a schematic structural diagram of an apparatus for end-to-end text detection and recognition according to an example of the present application. The device for end-to-end character detection and recognition (hereinafter referred to as "end-to-end recognition device 1") includes a feature extraction device 11, a character detection device 12, a character recognition device 13, and a generation device 14. The feature extraction device 11 is configured to input a target picture to a feature extraction network, and obtain shared feature information corresponding to the target picture; the character detection device 12 is configured to input the shared feature information to a character detection network, and obtain a character detection result output by the character detection network, where the character detection result includes a character area detection result and a character connection area detection result; the character recognition device 13 is configured to input the shared feature information and the character detection result to a character recognition network, and obtain a character recognition result output by the character recognition network; the generating device 14 is configured to generate a final recognition result according to the character detection result and the character recognition result.
The feature extraction device 11 is configured to input the target picture into the feature extraction network and obtain shared feature information corresponding to the target picture. The target picture is the picture to be recognized and contains text content; in some embodiments, the computer device may obtain the target picture in any feasible way, such as by shooting it, reading it locally, or receiving it from another device. The feature extraction network may be any network model for extracting features from pictures; for example, the backbone of the feature extraction network may adopt VGG (Visual Geometry Group) with batch normalization. In some embodiments, a data set is collected and trained with a deep learning neural network to obtain the feature extraction network. In some embodiments, the shared feature information includes any shared-feature-related information that can be used in both the character detection network and the character recognition network. It should be noted that the feature extraction network and the shared feature information are not specifically limited in the present application: any network model for performing feature extraction falls within the scope of the feature extraction network described in the present application, and any shared-feature-related information usable in the character detection network and the character recognition network falls within the scope of the shared feature information described in the present application.
The character detection device 12 is configured to input the shared feature information into the character detection network and obtain the character detection result output by the character detection network, where the character detection result includes a character area detection result and a character connection area detection result. In this context, the branch in which the character detection network is located may also be referred to as the "character detection branch", which in turn may be called the "character detection system" and is used for performing character detection. In some embodiments, the character detection network includes any network model that enables character-level text region detection. In some embodiments, the character detection network uses a segmentation model to achieve character-level text region detection; in some embodiments, a VGG network plus a fully convolutional network is used as the segmentation model, which outputs the character area detection result and the character connection area detection result. The character area detection result indicates the detection result related to the character areas detected in the target picture, such as each detected character area and the probability of each single-character center region; the character connection area detection result indicates the detection result related to the character connection areas detected in the target picture, such as each detected character connection area and the probability of its center, where one character connection area represents the connection region between two adjacent characters. In some embodiments, the character detection network uses Gaussian heat maps to generate the character area detection result and the character connection area detection result: the character area detection result represents the probability of each character center region by a Gaussian heat map, and the character connection area detection result represents the probability of the center of each adjacent-character connection region by a Gaussian heat map. In some embodiments, the character area detection result includes a character area Gaussian map and optionally character area borders, and the character connection area detection result includes a character connection area Gaussian map and optionally character connection area borders; using Gaussian heat maps has the advantage of handling well border regions that are not strictly enclosed. It should be noted that the character area detection result and the character connection area detection result may also be represented in other forms, which is not limited in this application: any manner capable of representing the probability of a character center region is a feasible representation of the character area detection result, and any manner capable of representing the probability of the center of an adjacent-character connection region is a feasible representation of the character connection area detection result.
The character recognition device 13 is configured to input the shared feature information and the character detection result into the character recognition network and obtain the character recognition result output by the character recognition network. In this context, the branch in which the character recognition network is located may also be referred to as the "character recognition branch", which in turn may be called the "character recognition system" and is used for performing character recognition. In some embodiments, the character area detection result is used to guide the attention network in the character recognition network in predicting character regions (that is, to guide the attention network on which regions it needs to attend to), and the character connection area detection result is used by the character recognition network to assemble the recognized characters into corresponding text content (for example, determining from the character connection area detection result which characters form continuous text, so as to recognize at least one continuous text). In some embodiments, the end-to-end recognition device 1 further comprises a device for training to obtain the character recognition network. In some embodiments, the character recognition network first uses a Bidirectional Long Short-Term Memory network (BiLSTM) to capture character timing information, and then uses an attention mechanism to predict the character regions and character content based on the character area detection result from the character detection network.
The generating device 14 is configured to generate the final recognition result according to the character detection result and the character recognition result. In some embodiments, the final recognition result includes at least one text box and the text content in each text box, where each text box contains a recognized string of consecutive characters. In some embodiments, the generating device 14 locates the position of each text box according to the character detection result and determines the text content in each text box by combining the character recognition result.
In some embodiments, the end-to-end recognition device 1 further includes a device (hereinafter referred to as the "training device", not shown) for training the character detection network according to a plurality of items of sample data with marked character positions and the label information corresponding to each item of sample data, where the label information corresponding to each item of sample data includes the character area detection result and the character connection area detection result corresponding to that item. The label information represents the labels generated for the marked data in the sample data; in some embodiments, the label information corresponding to an item of sample data with marked character positions includes the character area Gaussian map and the character connection area Gaussian map obtained for that item, that is, the originally marked bounding boxes are converted into Gaussian maps. In some embodiments, the training device obtains a plurality of items of sample data with marked character positions and the label information corresponding to each item (for example, including the character center region probabilities and the character connection region center probabilities corresponding to each item), and trains on the sample data and the corresponding label information with a deep learning neural network to obtain the character detection network. In some embodiments, each item of sample data is marked with the character area corresponding to each character, that is, the bounding box of each character; in some embodiments, each item of sample data is marked with a text area and the character area corresponding to each character in the text area. With the trained character detection network, the character area detection result and the character connection area detection result corresponding to a target picture without marked data can be output. In some embodiments, the training device collects a plurality of items of sample data with marked character positions and generates the label information corresponding to each item. As an example, after collecting a plurality of items of sample data with marked character positions (i.e., the bounding boxes of the individual characters), the following operations are performed for each item: for each character, the four vertices of the character form a quadrilateral, and an inside-to-outside thermodynamic diagram for the character is constructed with the quadrilateral as the border and the center point of the character as the basis, yielding the character area Gaussian map; for each character connection area joining two adjacent characters, the center point between the two characters is first located, the corresponding top and bottom points are found to form a new quadrilateral, and an inside-to-outside thermodynamic diagram is constructed with the new quadrilateral as the border and the center point between the two characters as the basis, yielding the character connection area Gaussian map corresponding to the two adjacent characters.
In some embodiments, the character recognition device 13 is configured to: input the shared feature information into a bidirectional long short-term memory network in the character recognition network to obtain character timing information corresponding to the target picture; and input the character timing information and the character detection result into an attention network in the character recognition network, so that the attention network predicts the character content in each character area according to at least one character area indicated by the character area detection result, and generates the character recognition result according to the character connection area detection result corresponding to the target picture and the character content in each character area. In this way, the character detection result obtained in the character detection network guides the attention network in the character recognition branch on which character areas to predict character content from; that is, the character area detection result provides a potential prediction range for the attention network, which effectively solves the misalignment between predicted regions and actual regions that may exist in the attention network of a conventional text recognition system.
In some embodiments, the end-to-end recognition device 1 further includes a determining device (not shown) configured to judge, in the character recognition network, whether at least one character area indicated by the character area detection result is reliable, and if not, to send intermediate result information corresponding to the target picture from the character recognition network to the character detection network. In this case, the generating device 14 is configured to: adjust the character detection result in the character detection network according to the intermediate result information to obtain a new character detection result; and generate the final recognition result according to the new character detection result and the character recognition result. In some embodiments, the intermediate result information includes any information related to the intermediate results generated in the character recognition network before the character recognition result is obtained; in some embodiments, the intermediate result information includes first data information output by the bidirectional long short-term memory network and second data information output by the attention network. Whether the at least one character area indicated by the character area detection result is reliable is, in effect, whether the potential prediction range provided by the character detection result is reliable: if the at least one character area is reliable, the character detection result is accurate; if not, the character detection result may contain errors. By adjusting the character detection result in the character detection network according to the intermediate result information from the character recognition network, the character detection result initially output by the character detection network can be optimized and improved, yielding a more accurate new character detection result and thus a more accurate final recognition result.
In some embodiments, judging whether the at least one character area indicated by the character area detection result is reliable includes: judging whether the at least one character area is accurate, and if not, determining that the at least one character area is not reliable; if it is accurate, judging whether the character content in the at least one character area can be recognized in sequence; if so, determining that the at least one character area is reliable, and otherwise, determining that it is not reliable. In some embodiments, if the at least one character area indicated by the character area detection result is reliable, no further operation is needed, that is, the character detection result output by the character detection network does not need to be improved or optimized. In some embodiments, if the at least one character area indicated by the character area detection result is reliable, the character detection result is accurate, and therefore the character detection network can be optimized according to the target picture and the character detection result.
In some embodiments, adjusting the character detection result in the character detection network according to the intermediate result information to obtain a new character detection result includes: adjusting the character detection result in the character detection network according to the intermediate result information and the shared feature information to obtain the new character detection result. Fig. 4 is a schematic diagram illustrating a flow for improving a character detection result according to an example of the present application. The specific flow is as follows: first, shared feature information is extracted from the target picture by a feature extractor (i.e., the feature extraction model), and the obtained shared feature information is input into the character detection network and the character recognition network respectively; the character detection network performs character localization (Localization) to obtain a coarse character detection result (Coarse result), which is then input into the character recognition network as a guide (Guide) indicating which regions to predict in; the character recognition network performs character recognition (Recognition), first using a BiLSTM to capture character timing information and then using an attention mechanism to perform attention prediction, i.e., sequentially predicting the character content in each character area according to the at least one character area indicated by the coarse character detection result, while judging whether the at least one character area indicated by the coarse character detection result is accurate; then, according to the shared feature information, the coarse character detection result generated in the character detection network, and the intermediate result information generated by the BiLSTM and the attention prediction mechanism in the character recognition network, localization refinement (Localization refinement) can be performed to obtain an improved new character detection result.
In some embodiments, the end-to-end recognition apparatus 1 further comprises an optimization device (not shown) for optimizing the character detection network and/or the character recognition network according to the character detection result, the character recognition result, and the final recognition result. On this basis, the character detection network and/or the character recognition network can be further optimized, so that more accurate recognition results are obtained in subsequent character detection and recognition.
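One way to realize such joint optimization is a combined loss over both branches, as in the hedged sketch below; the choice of MSE for the detection heat maps, cross-entropy for recognition, and the weighting factor are assumptions, not taken from the application.

```python
import torch.nn.functional as F

def joint_loss(coarse_det, refined_det, det_target, logits, text_target, w=1.0):
    """Combine detection and recognition losses so one backward pass
    optimizes both networks (shapes follow the earlier pipeline sketch)."""
    det_loss = (F.mse_loss(coarse_det, det_target)
                + F.mse_loss(refined_det, det_target))
    recog_loss = F.cross_entropy(logits.flatten(0, 1), text_target.flatten())
    return det_loss + w * recog_loss
```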
The scheme of the present application has the following advantages: performing detection at the character level better handles deformed text and text of indefinite length; providing the character-level detection result predicted by the character detection network to the character recognition network lets the attention network in the character recognition network predict using the character detection result, which solves the misalignment between the predicted region and the actual region that may exist in conventional attention networks; and the prediction result of the character detection network can be improved based on the intermediate result information generated by the character recognition network during recognition, thereby achieving more accurate character recognition and a better fusion of the character detection system and the character recognition system.
The present application further provides a computer device, wherein the computer device includes: a memory for storing one or more programs; and one or more processors coupled to the memory, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method for end-to-end text detection and recognition described herein.
The present application also provides a computer-readable storage medium having stored thereon a computer program executable by a processor to perform the method for end-to-end text detection and recognition described herein.
The present application also provides a computer program product which, when executed by a device, causes the device to perform the method for end-to-end text detection and recognition described herein.
FIG. 7 illustrates an exemplary system that can be used to implement the various embodiments described in this application.
In some embodiments, system 1000 can be implemented as any of the processing devices in the embodiments of the present application. In some embodiments, system 1000 may include one or more computer-readable media (e.g., system memory or NVM/storage 1020) having instructions and one or more processors (e.g., processor(s) 1005) coupled with the one or more computer-readable media and configured to execute the instructions to implement modules to perform the actions described herein.
For one embodiment, system control module 1010 may include any suitable interface controllers to provide any suitable interface to at least one of the processor(s) 1005 and/or to any suitable device or component in communication with system control module 1010.
The system control module 1010 may include a memory controller module 1030 to provide an interface to the system memory 1015. Memory controller module 1030 may be a hardware module, a software module, and/or a firmware module.
System memory 1015 may be used to load and store data and/or instructions, for example, for system 1000. For one embodiment, system memory 1015 may include any suitable volatile memory, such as suitable DRAM. In some embodiments, the system memory 1015 may include a double data rate type four synchronous dynamic random access memory (DDR4 SDRAM).
For one embodiment, system control module 1010 may include one or more input/output (I/O) controllers to provide an interface to NVM/storage 1020 and communication interface(s) 1025.
For example, NVM/storage 1020 may be used to store data and/or instructions. NVM/storage 1020 may include any suitable non-volatile memory (e.g., flash memory) and/or may include any suitable non-volatile storage device(s) (e.g., one or more hard disk drive(s) (HDD(s)), one or more Compact Disc (CD) drive(s), and/or one or more Digital Versatile Disc (DVD) drive(s)).
NVM/storage 1020 may include storage resources that are physically part of a device on which system 1000 is installed or may be accessed by the device and not necessarily part of the device. For example, NVM/storage 1020 may be accessed over a network via communication interface(s) 1025.
Communication interface(s) 1025 may provide an interface for system 1000 to communicate over one or more networks and/or with any other suitable device. System 1000 may communicate wirelessly with one or more components of a wireless network according to any of one or more wireless network standards and/or protocols.
For one embodiment, at least one of the processor(s) 1005 may be packaged together with logic for one or more controller(s) of the system control module 1010, e.g., memory controller module 1030. For one embodiment, at least one of the processor(s) 1005 may be packaged together with logic for one or more controller(s) of the system control module 1010 to form a System In Package (SiP). For one embodiment, at least one of the processor(s) 1005 may be integrated on the same die with logic for one or more controller(s) of the system control module 1010. For one embodiment, at least one of the processor(s) 1005 may be integrated on the same die with logic of one or more controllers of the system control module 1010 to form a system on a chip (SoC).
In various embodiments, system 1000 may be, but is not limited to being: a server, a workstation, a desktop computing device, or a mobile computing device (e.g., a laptop computing device, a handheld computing device, a tablet, a netbook, etc.). In various embodiments, system 1000 may have more or fewer components and/or different architectures. For example, in some embodiments, system 1000 includes one or more cameras, a keyboard, a Liquid Crystal Display (LCD) screen (including a touch screen display), a non-volatile memory port, multiple antennas, a graphics chip, an Application Specific Integrated Circuit (ASIC), and speakers.
It will be evident to those skilled in the art that the present application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the system claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.
While exemplary embodiments have been particularly shown and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the claims. The protection sought herein is as set forth in the claims below. These and other aspects of the various embodiments are specified in the following numbered clauses:
1. a method for end-to-end text detection and recognition, wherein the method comprises:
inputting a target picture into a feature extraction network to obtain shared feature information corresponding to the target picture;
inputting the shared characteristic information into a character detection network, and obtaining a character detection result output by the character detection network, wherein the character detection result comprises a character area detection result and a character connection area detection result;
inputting the shared characteristic information and the character detection result into a character recognition network to obtain a character recognition result output by the character recognition network;
and generating a final recognition result according to the character detection result and the character recognition result.
2. The method of clause 1, wherein the method further comprises:
training the character detection network according to a plurality of items of sample data with marked character positions and label information corresponding to each item of sample data, wherein the label information corresponding to each item of sample data comprises a character area detection result and a character connection area detection result corresponding to the item of sample data.
3. The method of clause 1, wherein the character detection network employs a Gaussian heat map to generate the character region detection result and the character connection region detection result.
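A hedged sketch of how such Gaussian heat-map targets could be rendered for clauses 2 and 3 follows. The box format (center, width, height), the isotropic Gaussian, and the midpoint construction of the connection regions are illustrative assumptions.

```python
import numpy as np

def gaussian_heatmap(h, w, boxes, sigma_scale=0.35):
    """Render one Gaussian blob per box; boxes are (cx, cy, bw, bh)."""
    heat = np.zeros((h, w), dtype=np.float32)
    ys, xs = np.mgrid[0:h, 0:w]
    for cx, cy, bw, bh in boxes:
        sigma = sigma_scale * max(bw, bh)
        blob = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
        heat = np.maximum(heat, blob)  # overlapping characters keep the max
    return heat

def connection_boxes(char_boxes):
    """One connection box per pair of adjacent characters (their midpoint)."""
    return [((a[0] + b[0]) / 2, (a[1] + b[1]) / 2,
             (a[2] + b[2]) / 2, (a[3] + b[3]) / 2)
            for a, b in zip(char_boxes, char_boxes[1:])]

# Character region map: gaussian_heatmap(h, w, char_boxes).
# Character connection region map: gaussian_heatmap(h, w, connection_boxes(char_boxes)).
```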
4. The method of clause 1, wherein the inputting the shared characteristic information and the character detection result into a character recognition network and obtaining a character recognition result output by the character recognition network comprises:
inputting the shared characteristic information into a bidirectional long short-term memory (BiLSTM) network in the character recognition network to obtain character time sequence information corresponding to the target picture;
inputting the character time sequence information and the character detection result into an attention network in the character recognition network, so that the attention network predicts character content in each character area according to at least one character area indicated by the character area detection result, and generates a character recognition result according to a character connection area detection result corresponding to the target picture and the character content in each character area.
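A minimal sketch of this guided attention step is given below for a single picture. Treating each detected character region as a hard mask on the attention scores, and the shapes involved, are assumptions made for illustration.

```python
import torch

def guided_attention_decode(timing, region_masks, attn_score, classifier):
    """timing: (W, D) BiLSTM output; region_masks: one (W,) 0/1 prior per
    detected character region, in reading order (regions assumed non-empty)."""
    chars = []
    for mask in region_masks:                         # one step per region
        scores = attn_score(timing).squeeze(-1)       # (W,) attention scores
        scores = scores.masked_fill(mask == 0, float("-inf"))
        attn = torch.softmax(scores, dim=0)           # attends inside the region
        context = (attn.unsqueeze(-1) * timing).sum(dim=0)  # (D,) context
        chars.append(classifier(context).argmax().item())   # character content
    return chars  # joined into text using the character connection region result
```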
5. The method of clause 4, wherein the method further comprises:
judging whether at least one character area indicated by the character area detection result has reliability, and if not, sending intermediate result information corresponding to the target picture from the character recognition network to the character detection network;
wherein the generating a final recognition result according to the character detection result and the character recognition result comprises:
adjusting the character detection result in the character detection network according to the intermediate result information to obtain a new character detection result;
and generating a final recognition result according to the new character detection result and the character recognition result.
6. The method according to clause 5, wherein the determining whether at least one character region indicated by the character region detection result has reliability includes:
judging whether the at least one character area is accurate, and if not, determining that the at least one character area does not have reliability; if so, judging whether the character content in the at least one character area can be recognized in sequence; if it can, determining that the at least one character area has reliability, otherwise determining that the at least one character area does not have reliability.
7. The method of clause 5, wherein the intermediate result information includes first data information output by a bidirectional long short-term memory (BiLSTM) network and second data information output by the attention network.
8. The method according to clause 5, wherein the adjusting the character detection result in the character detection network according to the intermediate result information to obtain a new character detection result comprises:
and adjusting the character detection result in the character detection network according to the intermediate result information and the shared characteristic information to obtain a new character detection result.
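The refinement of clause 8 can be sketched as a small PyTorch module that fuses all three inputs. Broadcasting the sequence-shaped intermediate results back over the feature map and the channel counts are assumptions; the BiLSTM sequence length is taken to match the feature-map width, as in the earlier pipeline sketch.

```python
import torch
import torch.nn as nn

class LocalizationRefiner(nn.Module):
    """Fuse shared features, the coarse detection maps, and the recognition
    network's intermediate results into a new 2-channel detection result."""
    def __init__(self, feat_dim=256, lstm_dim=512):
        super().__init__()
        self.head = nn.Conv2d(feat_dim + 2 + lstm_dim + 1, 2, 1)

    def forward(self, shared_feat, coarse_det, lstm_out, attn):
        # shared_feat: (B, C, H, W); coarse_det: (B, 2, H, W)
        # lstm_out: (B, W, D) first data information; attn: (B, W) second.
        B, C, H, W = shared_feat.shape
        lstm_map = lstm_out.permute(0, 2, 1).unsqueeze(2).expand(B, -1, H, W)
        attn_map = attn.unsqueeze(1).unsqueeze(2).expand(B, 1, H, W)
        x = torch.cat([shared_feat, coarse_det, lstm_map, attn_map], dim=1)
        return torch.sigmoid(self.head(x))  # new character detection result
```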
9. The method of clause 1, wherein the method further comprises:
and optimizing the character detection network and/or the character recognition network according to the character detection result, the character recognition result and the final recognition result.
10. An apparatus for end-to-end text detection and recognition, wherein the apparatus comprises:
a device for inputting a target picture into a feature extraction network and obtaining shared feature information corresponding to the target picture;
a device for inputting the shared characteristic information into a character detection network and obtaining a character detection result output by the character detection network, wherein the character detection result comprises a character area detection result and a character connection area detection result;
a device for inputting the shared characteristic information and the character detection result into a character recognition network to obtain a character recognition result output by the character recognition network;
and a device for generating a final recognition result according to the character detection result and the character recognition result.
11. The apparatus of clause 10, wherein the apparatus further comprises:
a device for training the character detection network according to a plurality of items of sample data with marked character positions and label information corresponding to each item of sample data, wherein the label information corresponding to each item of sample data comprises a character area detection result and a character connection area detection result corresponding to the item of sample data.
12. The apparatus of clause 10, wherein the character detection network employs a Gaussian heat map to generate the character region detection result and the character connection region detection result.
13. The apparatus according to clause 10, wherein the means for inputting the shared characteristic information and the character detection result into a character recognition network to obtain the character recognition result output by the character recognition network is configured to:
inputting the shared characteristic information into a bidirectional long short-term memory (BiLSTM) network in the character recognition network to obtain character time sequence information corresponding to the target picture;
inputting the character time sequence information and the character detection result into an attention network in the character recognition network, so that the attention network predicts character content in each character area according to at least one character area indicated by the character area detection result, and generates a character recognition result according to a character connection area detection result corresponding to the target picture and the character content in each character area.
14. The apparatus of clause 13, wherein the apparatus further comprises:
a device for judging whether at least one character area indicated by the character area detection result has reliability, and if not, sending intermediate result information corresponding to the target picture from the character recognition network to the character detection network;
wherein the means for generating a final recognition result from the character detection result and the character recognition result is configured to:
adjusting the character detection result in the character detection network according to the intermediate result information to obtain a new character detection result;
and generating a final recognition result according to the new character detection result and the character recognition result.
15. The apparatus according to clause 14, wherein the determining whether at least one character region indicated by the character region detection result has reliability includes:
judging whether the at least one character area is accurate, and if not, determining that the at least one character area does not have reliability; if so, judging whether the character content in the at least one character area can be recognized in sequence; if it can, determining that the at least one character area has reliability, otherwise determining that the at least one character area does not have reliability.
16. The apparatus of clause 14, wherein the intermediate result information includes first data information output by a bidirectional long short-term memory (BiLSTM) network and second data information output by the attention network.
17. The apparatus according to clause 14, wherein the adjusting the character detection result in the character detection network according to the intermediate result information to obtain a new character detection result includes:
and adjusting the character detection result in the character detection network according to the intermediate result information and the shared characteristic information to obtain a new character detection result.
18. The apparatus of clause 10, wherein the apparatus further comprises:
means for optimizing the character detection network and/or the character recognition network based on the character detection result, the character recognition result, and the final recognition result.
19. A computer device, wherein the computer device comprises:
a memory for storing one or more programs;
one or more processors coupled to the memory,
the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method of any of clauses 1-9.
20. A computer-readable storage medium having stored thereon a computer program executable by a processor to perform the method of any of clauses 1-9.
21. A computer program product which, when executed by an apparatus, causes the apparatus to perform the method of any of clauses 1 to 9.

Claims (15)

1. A method for end-to-end text detection and recognition, wherein the method comprises:
inputting a target picture into a feature extraction network to obtain shared feature information corresponding to the target picture;
inputting the shared characteristic information into a character detection network, and obtaining a character detection result output by the character detection network, wherein the character detection result comprises a character area detection result and a character connection area detection result;
inputting the shared characteristic information and the character detection result into a character recognition network to obtain a character recognition result output by the character recognition network;
and generating a final recognition result according to the character detection result and the character recognition result.
2. The method of claim 1, wherein the method further comprises:
training the character detection network according to a plurality of items of sample data with marked character positions and label information corresponding to each item of sample data, wherein the label information corresponding to each item of sample data comprises a character area detection result and a character connection area detection result corresponding to the item of sample data.
3. The method of claim 1, wherein the character detection network employs a Gaussian heat map to generate the character region detection result and the character connection region detection result.
4. The method of claim 1, wherein the inputting the shared characteristic information and the character detection result into a character recognition network and obtaining a character recognition result output by the character recognition network comprises:
inputting the shared characteristic information into a bidirectional long short-term memory (BiLSTM) network in the character recognition network to obtain character time sequence information corresponding to the target picture;
inputting the character time sequence information and the character detection result into an attention network in the character recognition network, so that the attention network predicts character content in each character area according to at least one character area indicated by the character area detection result, and generates a character recognition result according to a character connection area detection result corresponding to the target picture and the character content in each character area.
5. The method of claim 4, wherein the method further comprises:
judging whether at least one character area indicated by the character area detection result has reliability, and if not, sending intermediate result information corresponding to the target picture from the character recognition network to the character detection network;
wherein the generating a final recognition result according to the character detection result and the character recognition result comprises:
adjusting the character detection result in the character detection network according to the intermediate result information to obtain a new character detection result;
and generating a final recognition result according to the new character detection result and the character recognition result.
6. The method of claim 5, wherein the determining whether the at least one character region indicated by the character region detection result has reliability comprises:
judging whether the at least one character area is accurate, and if not, determining that the at least one character area does not have reliability; if so, judging whether the character content in the at least one character area can be recognized in sequence; if it can, determining that the at least one character area has reliability, otherwise determining that the at least one character area does not have reliability.
7. The method of claim 5, wherein the intermediate result information includes first data information output by a bidirectional long short-term memory (BiLSTM) network and second data information output by the attention network.
8. The method of claim 5, wherein the adjusting the character detection result in the character detection network according to the intermediate result information to obtain a new character detection result comprises:
and adjusting the character detection result in the character detection network according to the intermediate result information and the shared characteristic information to obtain a new character detection result.
9. The method of claim 1, wherein the method further comprises:
and optimizing the character detection network and/or the character recognition network according to the character detection result, the character recognition result and the final recognition result.
10. An apparatus for end-to-end text detection and recognition, wherein the apparatus comprises:
the device is used for inputting a target picture into a feature extraction network and obtaining shared feature information corresponding to the target picture;
the device is used for inputting the shared characteristic information into a character detection network and obtaining a character detection result output by the character detection network, wherein the character detection result comprises a character area detection result and a character connection area detection result;
the device is used for inputting the shared characteristic information and the character detection result into a character recognition network to obtain a character recognition result output by the character recognition network;
and the device is used for generating a final recognition result according to the character detection result and the character recognition result.
11. The apparatus of claim 10, wherein the means for inputting the shared characteristic information and the character detection result into a character recognition network to obtain a character recognition result output by the character recognition network is configured to:
inputting the shared characteristic information into a bidirectional long short-term memory (BiLSTM) network in the character recognition network to obtain character time sequence information corresponding to the target picture;
inputting the character time sequence information and the character detection result into an attention network in the character recognition network, so that the attention network predicts character content in each character area according to at least one character area indicated by the character area detection result, and generates a character recognition result according to a character connection area detection result corresponding to the target picture and the character content in each character area.
12. The apparatus of claim 11, wherein the apparatus further comprises:
a device for judging whether at least one character area indicated by the character area detection result has reliability, and if not, sending intermediate result information corresponding to the target picture from the character recognition network to the character detection network;
wherein the means for generating a final recognition result from the character detection result and the character recognition result is configured to:
adjusting the character detection result in the character detection network according to the intermediate result information to obtain a new character detection result;
and generating a final recognition result according to the new character detection result and the character recognition result.
13. The apparatus of claim 12, wherein the determining whether the at least one character region indicated by the character region detection result has reliability comprises:
judging whether the at least one character area is accurate, and if not, determining that the at least one character area does not have reliability; if so, judging whether the character content in the at least one character area can be recognized in sequence; if it can, determining that the at least one character area has reliability, otherwise determining that the at least one character area does not have reliability.
14. A computer device, wherein the computer device comprises:
a memory for storing one or more programs;
one or more processors coupled to the memory,
the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method recited by any of claims 1-9.
15. A computer-readable storage medium, on which a computer program is stored, which computer program can be executed by a processor to perform the method according to any one of claims 1 to 9.
CN202010006312.9A 2020-01-03 2020-01-03 Method and device for end-to-end character detection and identification Pending CN111242129A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010006312.9A CN111242129A (en) 2020-01-03 2020-01-03 Method and device for end-to-end character detection and identification

Publications (1)

Publication Number Publication Date
CN111242129A true CN111242129A (en) 2020-06-05

Family

ID=70865739

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010006312.9A Pending CN111242129A (en) 2020-01-03 2020-01-03 Method and device for end-to-end character detection and identification

Country Status (1)

Country Link
CN (1) CN111242129A (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10423827B1 (en) * 2017-07-05 2019-09-24 Amazon Technologies, Inc. Image text recognition
CN109034155A (en) * 2018-07-24 2018-12-18 百卓网络科技有限公司 A kind of text detection and the method and system of identification

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
陈桂安: "Research and Implementation of an End-to-End Neural Network for Natural Scene Text Detection and Recognition" *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111798480A (en) * 2020-07-23 2020-10-20 北京思图场景数据科技服务有限公司 Character detection method and device based on single character and character connection relation prediction
CN111860525A (en) * 2020-08-06 2020-10-30 宁夏宁电电力设计有限公司 Bottom-up optical character recognition method suitable for terminal block
CN112070649A (en) * 2020-09-09 2020-12-11 焦点科技股份有限公司 Method and system for removing specific character string watermark
CN112070649B (en) * 2020-09-09 2022-07-22 焦点科技股份有限公司 Method and system for removing specific character string watermark
CN113361432A (en) * 2021-06-15 2021-09-07 电子科技大学 Video character end-to-end detection and identification method based on deep learning
CN113361432B (en) * 2021-06-15 2022-03-15 电子科技大学 Video character end-to-end detection and identification method based on deep learning
CN115439700A (en) * 2022-11-03 2022-12-06 深圳比特微电子科技有限公司 Image processing method and device and machine-readable storage medium
CN115439700B (en) * 2022-11-03 2023-03-14 深圳比特微电子科技有限公司 Image processing method and device and machine-readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (application publication date: 20200605)