WO2018054326A1 - Character detection method and device, and character detection training method and device - Google Patents


Info

Publication number
WO2018054326A1
Authority
WO
WIPO (PCT)
Prior art keywords
training
text
area
neural network
convolutional neural
Prior art date
Application number
PCT/CN2017/102679
Other languages
French (fr)
Chinese (zh)
Inventor
向东来
郭强
夏炎
梁鼎
Original Assignee
北京市商汤科技开发有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京市商汤科技开发有限公司 filed Critical 北京市商汤科技开发有限公司
Publication of WO2018054326A1 publication Critical patent/WO2018054326A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features

Definitions

  • the present application relates to text detection, and in particular to a text detection method and apparatus, and a text detection training method and apparatus.
  • RPN: Region Proposal Network
  • the present application provides a technical solution for text detection.
  • the present application provides a text detection method, including: extracting a feature map from an image including a text region using a convolutional neural network; laterally intercepting the feature map with a plurality of anchor rectangles to obtain a plurality of suggested regions; classifying and regressing each of the suggested regions through the convolutional neural network, wherein the classification determines whether each suggested region corresponds to a region including text, and the regression determines the position of each suggested region in the image; and horizontally splicing the suggested regions determined by the classification to correspond to regions including text, according to the positions in the image determined by the regression, to obtain the text region detection result.
  • the present application provides a text detection training method, including: extracting a feature map from a training image including a text region using a convolutional neural network; laterally intercepting the feature map of the training image with a plurality of anchor rectangles to obtain a plurality of suggested regions; classifying and regressing the suggested regions intercepted by each anchor rectangle through the convolutional neural network, wherein the classification determines whether each suggested region corresponds to a region including text, and the regression determines the position of each suggested region in the training image; and iteratively training the convolutional neural network according to the difference between the known real text region corresponding to the training image and the predicted text region obtained by the classification and regression, until the training result satisfies a predetermined convergence condition.
  • the present application provides a text detecting apparatus, including: an image feature extraction module, which extracts a feature map from an image including a text region using a convolutional neural network; a suggestion region intercepting module, which laterally intercepts the feature map with a plurality of anchor rectangles to obtain a plurality of suggested regions; a classification module, which classifies each suggested region through the convolutional neural network to determine whether each suggested region corresponds to a region including text; a regression module, which regresses each suggested region through the convolutional neural network to determine the position of each suggested region in the image; and a detection result splicing module, which horizontally splices the suggested regions determined by the classification module to correspond to regions including text, according to the positions in the image determined by the regression module, to obtain a text region detection result.
  • the present application provides a text detection training apparatus, including: an image feature extraction module, which extracts a feature map from a training image including a text region using a convolutional neural network; a suggestion region intercepting module, which laterally intercepts the feature map of the training image with a plurality of anchor rectangles to obtain a plurality of suggested regions; a classification module, which classifies each suggested region through the convolutional neural network to determine whether each suggested region corresponds to a region including text; a regression module, which regresses each suggested region through the convolutional neural network to determine the position of each suggested region in the training image; and a training module, which iteratively trains the convolutional neural network according to the difference between the known real text region corresponding to the training image and the predicted text region obtained by the classification and regression, until the training result satisfies a predetermined convergence condition.
  • the present application provides a text detecting apparatus, including: a memory storing executable instructions; and one or more processors in communication with the memory to execute the executable instructions to perform the following operations: extracting a feature map from an image including a text region using a convolutional neural network; laterally intercepting the feature map with a plurality of anchor rectangles to obtain a plurality of suggested regions; classifying and regressing each suggested region through the convolutional neural network, wherein the classification determines whether each suggested region corresponds to a region including text, and the regression determines the position of each suggested region in the image; and horizontally splicing the suggested regions determined by the classification to correspond to regions including text, according to the positions in the image determined by the regression, to obtain a text region detection result.
  • the present application provides a text detection training apparatus, including: a memory storing executable instructions; and one or more processors in communication with the memory to execute the executable instructions to perform the following operations: extracting a feature map from a training image including a text region using a convolutional neural network; laterally intercepting the feature map of the training image with a plurality of anchor rectangles to obtain a plurality of suggested regions; classifying and regressing the suggested regions intercepted by each anchor rectangle through the convolutional neural network, wherein the classification determines whether each suggested region corresponds to a region including text, and the regression determines the position of each suggested region; and iteratively training the convolutional neural network according to the difference between the known real text region corresponding to the training image and the predicted text region obtained by the classification and regression, until the training result satisfies a predetermined convergence condition.
  • the present application also provides a computer readable medium having computer executable instructions stored thereon; when a processor executes the computer executable instructions stored in the computer readable medium, the processor performs any of the above text detection methods and/or text detection training methods.
  • Feature extraction and the subsequent classification and regression are performed using a plurality of laterally spliced anchor rectangles, each of which intercepts only the suggested region corresponding to a lateral portion of the region to be detected in the image. Therefore, when detecting a text region having a large width, the convolutional neural network used for text detection only needs to see the area near a single anchor rectangle corresponding to a lateral portion of the region to be detected, and does not require a very large receptive field, thereby reducing the difficulty of network design.
  • FIG. 1 is a flow chart showing a text detecting method according to an embodiment of the present application.
  • FIG. 2 shows an architectural diagram of a text detecting apparatus according to an exemplary embodiment.
  • FIG. 3 shows a schematic diagram of an exemplary application example according to the present application.
  • FIG. 4 shows a flow chart of a training method for a convolutional neural network in accordance with an exemplary embodiment.
  • FIG. 5 shows an architectural diagram of a text detection training device according to an exemplary embodiment.
  • FIG. 6 is a block diagram showing the structure of a computer system suitable for implementing an embodiment of the present application.
  • Embodiments of the present application can be applied to electronic devices such as terminal devices, computer systems, and servers, which can operate together with numerous other general-purpose or special-purpose computing system environments or configurations. Examples of well-known terminal devices, computing systems, environments, and/or configurations suitable for use with electronic devices such as terminal devices, computer systems, and servers include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, handheld or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, networked personal computers, minicomputer systems, mainframe computer systems, and distributed cloud computing environments including any of the above, and the like.
  • Electronic devices such as terminal devices, computer systems, servers, etc., can be described in the general context of computer system executable instructions (such as program modules) being executed by a computer system.
  • program modules may include routines, programs, object programs, components, logic, data structures, and the like that perform particular tasks or implement particular abstract data types.
  • the computer system/server can be implemented in a distributed cloud computing environment where tasks are performed by remote processing devices that are linked through a communication network.
  • program modules may be located on a local or remote computing system storage medium including storage devices.
  • FIG. 1 shows a flow chart 1000 of a text detection method in accordance with an embodiment of the present application.
  • a feature map is extracted from the image including the text region using a convolutional neural network.
  • the feature map obtained by the convolutional neural network contains the feature information of the image.
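The idea of a feature map can be illustrated with a minimal single-channel convolution. The sketch below is not the patent's actual network; it only shows, under that simplification, how a convolution kernel turns an image into a feature map whose cells summarize local image content:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Single-channel 2D 'valid' cross-correlation: each output cell is
    the sum of the element-wise product of the kernel and the image
    patch under it (the standard CNN convention)."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

# A toy "feature map": respond to vertical edges in a 6x8 image.
image = np.zeros((6, 8))
image[:, 4:] = 1.0                # right half bright
kernel = np.array([[-1.0, 1.0]])  # horizontal gradient filter
feature_map = conv2d_valid(image, kernel)
```

The feature map peaks exactly at the brightness edge, which is the sense in which it "contains the feature information of the image".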
  • the feature map is laterally intercepted with a plurality of anchor rectangles to obtain a plurality of suggested regions (for example, at least two suggested regions are obtained). Since the feature map is laterally intercepted with a plurality of anchor rectangles, each suggested region corresponds to a lateral portion of the region to be detected, and the suggested regions together correspond to the entire lateral length of the text region to be detected.
  • each suggested region is classified and regressed by the convolutional neural network, wherein the classification determines whether each suggested region corresponds to a region including text, and the regression determines the position of each suggested region in the image to be detected.
  • each suggested region determined by the classification to correspond to a region including text is horizontally spliced according to its position in the image determined by the regression, to obtain the text region detection result.
  • the suggested regions that are adjacent and/or intersect may be connected according to the positions in the image to be detected determined by the regression for the respective suggested regions, thereby obtaining the text region detection result.
  • alternatively, the anchor rectangles corresponding to the suggested regions that are adjacent and/or intersect may be connected according to the positions in the image to be detected determined by the regression for the respective suggested regions, thereby obtaining the text region detection result.
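The horizontal splicing step can be sketched as follows; the `(x1, y1, x2, y2)` box format and the merge rule for adjacent or intersecting regions are illustrative assumptions, not the patent's exact procedure:

```python
def splice_regions(boxes):
    """Horizontally splice suggested regions classified as text.
    Each box is (x1, y1, x2, y2); boxes that touch or overlap in the
    width direction are merged into one detected text region."""
    if not boxes:
        return []
    boxes = sorted(boxes)               # sort by left edge
    merged = [list(boxes[0])]
    for x1, y1, x2, y2 in boxes[1:]:
        last = merged[-1]
        if x1 <= last[2]:               # adjacent or intersecting
            last[2] = max(last[2], x2)  # extend to the right
            last[1] = min(last[1], y1)  # widen vertical extent
            last[3] = max(last[3], y2)
        else:
            merged.append([x1, y1, x2, y2])
    return [tuple(b) for b in merged]

# Three overlapping/touching regions plus one separate region.
regions = [(0, 10, 16, 30), (16, 9, 32, 30), (31, 10, 48, 31), (80, 10, 96, 30)]
result = splice_regions(regions)
```

Here the first three suggested regions are spliced into one wide text region, while the isolated fourth region stays separate.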
  • since the processing object of the classification and regression is a suggested region intercepted by an anchor rectangle and corresponding to only a lateral portion of the image to be detected, the convolutional neural network used for text detection does not require a very large receptive field even when detecting a text region having a large width.
  • the plurality of anchor rectangles may be anchor rectangles continuously spliced in the lateral direction (i.e., the width direction), whereby the suggested regions intercepted by the anchor rectangles together correspond to the entire width of the image to be detected.
  • the plurality of anchor rectangles may overlap slightly in the width direction.
  • for example, two adjacent anchor rectangles may overlap by one pixel in the width direction; thus, the suggested regions intercepted by the anchor rectangles correspond to the entire width of the image to be detected with a small amount of overlap, which avoids gaps between adjacent anchor rectangles or adjacent suggested regions due to errors in actual use that would otherwise miss some intermediate width of the image to be detected.
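The lateral tiling of anchor rectangles with a one-pixel overlap can be sketched as below; `tile_anchors` and its parameters are hypothetical names for illustration:

```python
def tile_anchors(total_width, anchor_width, overlap=1):
    """Lay out anchor rectangles side by side along the width direction,
    overlapping each neighbor by `overlap` pixels so rounding errors
    cannot leave gaps. Returns a list of (x_start, x_end) spans."""
    anchors = []
    step = anchor_width - overlap
    x = 0
    while x < total_width:
        end = min(x + anchor_width, total_width)
        anchors.append((x, end))
        if end == total_width:
            break
        x += step
    return anchors

# Cover a 40-pixel-wide region with 16-pixel anchors overlapping by 1 px.
spans = tile_anchors(total_width=40, anchor_width=16, overlap=1)
```

Every adjacent pair of spans shares at least one pixel, and together the spans cover the full width, matching the behavior described above.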
  • the text detection method provided by the embodiment of the present application may be performed by any suitable device having data processing capability, including but not limited to: a terminal device, a server, and the like.
  • the text detection method provided by the embodiment of the present application may be executed by a processor, for example, the processor executes the text detection method mentioned in the embodiment of the present application by calling a corresponding instruction stored in the memory. This will not be repeated below.
  • the foregoing program may be stored in a computer readable storage medium; when the program is executed, the steps of the foregoing method embodiments are performed. The foregoing storage medium includes any medium that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disk.
  • FIG. 2 shows an architectural diagram of a text detecting device 2000 according to an exemplary embodiment.
  • the text detection device 2000 is implemented in the form of an RPN.
  • the text detecting device 2000 includes: an image feature extraction module 2010, a suggestion region intercepting module 2030, a classification module 2040, a regression module 2050, and a detection result splicing module 2070. The image feature extraction module 2010 extracts the feature map from the image including the text region using a convolutional neural network; the suggestion region intercepting module 2030 laterally intercepts the feature map with a plurality of anchor rectangles to obtain a plurality of suggested regions; the classification module 2040 classifies each suggested region through the convolutional neural network to determine whether each suggested region corresponds to a region including text; the regression module 2050 regresses each suggested region through the convolutional neural network to determine the position of each suggested region in the image; and the detection result splicing module 2070 horizontally splices the suggested regions determined by the classification module 2040 to correspond to regions including text, according to the positions in the image determined by the regression module 2050, to obtain the text region detection result.
  • the image including the text is first input into the image feature extraction module 2010, and the image feature extraction module 2010 extracts the feature map from the image including the text region using a convolutional neural network.
  • the feature map obtained by convolution contains the feature information of the image.
  • the feature map extracted by the image feature extraction module 2010 is input into the suggestion region intercepting module 2030.
  • in the suggestion region intercepting module 2030, the feature map is laterally intercepted with a plurality of anchor rectangles to obtain a plurality of suggested regions.
  • the obtained suggested regions are respectively input into the classification module 2040 and the regression module 2050 for classification and regression; the classification determines whether each suggested region corresponds to a region including text, and the regression determines the position of each suggested region in the image.
  • the detection result splicing module 2070 horizontally splices the suggested regions determined by the classification module 2040 to correspond to regions including text, according to the positions in the image determined by the regression module 2050, to obtain the text region detection result.
  • in an optional example, the detection result splicing module 2070 connects the suggested regions that are adjacent and/or intersect according to the positions in the image determined by the regression for the respective suggested regions, thereby obtaining the text region detection result; in another optional example, the detection result splicing module 2070 connects the anchor rectangles corresponding to the suggested regions that are adjacent and/or intersect, thereby obtaining the text region detection result.
  • FIG. 3 shows a schematic diagram of an exemplary application example in accordance with the present application.
  • the image 10 containing the text area is the object to be detected.
  • the anchor rectangle employed is, for example, a single anchor rectangle 110 corresponding to the entire lateral width of the text area to be detected.
  • the detection of the text area can only be achieved if the lateral width of the anchor rectangle employed corresponds to the entire lateral width of the text area to be detected.
  • the RPN often requires a large receptive field for such processing, which brings great difficulty to the design of the network. Therefore, existing region proposal networks are often not suitable for direct application to text detection.
  • a plurality of laterally spliced anchor rectangles 120 are used instead of a single anchor rectangle 110, and the sum of the widths of the plurality of laterally spliced anchor rectangles 120 corresponds to the entire lateral width of the text region to be detected. For example, the sum of the widths of the plurality of laterally spliced anchor rectangles 120 may be equal to, or slightly larger than, the entire lateral width of the text region to be detected.
  • in the former case, the plurality of anchor rectangles 120 abut each other so as to correspond to the entire lateral width of the text region to be detected.
  • in the latter case, where the sum of the widths of the plurality of laterally spliced anchor rectangles 120 is greater than the entire lateral width of the text region to be detected, at least some of the adjacent anchor rectangles 120 partially overlap, and the width of the area formed by connecting the plurality of anchor rectangles 120 corresponds to the entire lateral width of the text region to be detected.
  • FIG. 3 exemplarily shows a portion 20 of the resulting feature map.
  • in the suggestion region intercepting module 2030, the feature map is intercepted with a plurality of laterally spliced anchor rectangles to obtain a plurality of suggested regions, so that the suggested region intercepted by each anchor rectangle is processed separately.
  • the suggested area intercepted by each anchor rectangle takes, for example, the form of a sliding window as shown in FIG. 3.
  • the suggested area intercepted by the anchor rectangle may be further processed by one or more convolution layers 40.
  • the suggested regions processed by the convolution layer 40 (or the suggested regions not processed by the convolution layer) are input to the classifier 50 and the regression unit 60. The classifier 50 identifies whether each suggested region is a text region, and the regression unit 60 determines the position of each suggested region in the image to be detected 10. Finally, the detection result splicing module 2070 splices the suggested regions determined by the classifier 50 to correspond to the text region, according to the positions determined by the regression unit 60, to form the final text detection result.
  • for example, the detection result splicing module 2070 connects the suggested regions that are adjacent and/or intersect, thereby obtaining the text region detection result; or, for example, the detection result splicing module 2070 connects the anchor rectangles corresponding to the suggested regions that are adjacent and/or intersect, thereby obtaining the text region detection result.
  • a step of training the convolutional neural network in advance is further included.
  • a trained text detecting device, such as the above-described text detecting device 2000, is obtained by the training described below.
  • FIG. 4 illustrates a training method 4000 for a convolutional neural network in accordance with an exemplary embodiment.
  • the training method 4000 for the convolutional neural network may include: in step S4010, extracting the feature map from the training image including the text region using the convolutional neural network; in a subsequent step, laterally intercepting the feature map of the training image with a plurality of anchor rectangles to obtain a plurality of suggested regions; in step S4050, classifying and regressing the suggested regions intercepted by each anchor rectangle through the convolutional neural network, wherein the classification determines whether each suggested region corresponds to a region including text, and the regression determines the position of each suggested region; and in step S4070, iteratively training the convolutional neural network according to the difference between the known real text region corresponding to the training image and the predicted text region obtained by the classification and regression, until the training result satisfies a predetermined convergence condition.
  • the predetermined convergence condition may be, for example, that the error value of the most recent iteration falls within an allowable range, or that the error value is less than a predetermined value, or that the error value is minimized, or that the number of iterations reaches a predetermined number.
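An iterative-training skeleton with such stopping conditions might look like the following; the `step_fn` stand-in and the threshold value are illustrative assumptions, not the patent's training procedure:

```python
def train_until_converged(step_fn, max_iters=100, error_threshold=1e-3):
    """Generic iterative-training skeleton: step_fn() runs one training
    iteration and returns the current error (the difference between the
    real and predicted text regions). Training stops when the error
    drops below the threshold or the iteration budget is exhausted --
    two of the convergence conditions named above."""
    history = []
    for _ in range(max_iters):
        error = step_fn()
        history.append(error)
        if error < error_threshold:
            break
    return history

# Toy stand-in for one training step: the error halves each iteration.
state = {"error": 1.0}
def fake_step():
    state["error"] *= 0.5
    return state["error"]

errors = train_until_converged(fake_step, max_iters=50, error_threshold=1e-3)
```

With this toy step, training stops after 10 iterations, when the error first falls below the threshold rather than when the iteration budget runs out.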
  • in each iterative training of the convolutional neural network, whether a suggested region is a positive or negative sample is determined according to the intersection ratio of the predicted text region and the corresponding real text region in the vertical direction, and the difference between the real text region and the predicted text region is determined according to a smooth L1 loss function.
  • One form of difference can be an error.
  • when the intersection ratio exceeds a preset threshold, the suggested region corresponding to the predicted text region is determined to be a positive sample; otherwise, the suggested region corresponding to the predicted text region is determined to be a negative sample.
  • the classifier may use the softmax loss function as a training objective function to predict whether the suggested region is a text region.
  • the classifier determines whether each suggested region is a positive or negative sample according to the intersection ratio, in the vertical direction, of the suggested region and the horizontal portion of the corresponding real text region.
  • the regressor can use the smooth L1 loss function in the RPN as a training objective function to minimize the difference between the real text region and the predicted text region.
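The vertical-direction intersection ratio used for labeling, and the smooth L1 loss used for regression, can be sketched as follows; the 0.7 threshold is an illustrative assumption, as the text does not fix a value:

```python
def vertical_iou(span_a, span_b):
    """Intersection-over-union of two vertical intervals (y1, y2).
    The classifier labels a suggested region using only the vertical
    overlap with the horizontal slice of the real text region."""
    inter = max(0.0, min(span_a[1], span_b[1]) - max(span_a[0], span_b[0]))
    union = (span_a[1] - span_a[0]) + (span_b[1] - span_b[0]) - inter
    return inter / union if union > 0 else 0.0

def label_region(pred_span, true_span, threshold=0.7):
    """1 (positive sample) if the vertical IoU exceeds the threshold,
    else 0 (negative sample). The threshold value is illustrative."""
    return 1 if vertical_iou(pred_span, true_span) > threshold else 0

def smooth_l1(x):
    """Smooth L1 loss on a scalar difference, as used for regression:
    quadratic near zero, linear for |x| >= 1."""
    return 0.5 * x * x if abs(x) < 1.0 else abs(x) - 0.5
```

Because the intersection ratio is taken only in the vertical direction, an anchor covering just a narrow horizontal slice of a tall text line can still be labeled positive, which is the behavior described later in this section.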
  • the parameters of the convolutional neural network are thereby adapted to identify the text regions in the image using a plurality of horizontally spliced anchor rectangles.
  • the difference between the real text area and the predicted text area is determined by the following formula:
  • L({c_i}, {r_i}) = (1/N_cls) · Σ_i L_cls(c_i, c_i*) + λ · (1/N_reg) · Σ_i c_i* · L_reg(r_i, r_i*), where L_reg(r_i, r_i*) = Σ_j smooth_L1(r_i^j − r_i^{*j});
  • L is the target error function;
  • i is the number of the suggested region intercepted by an anchor rectangle;
  • c_i is the category marker of the i-th suggested region;
  • r_i is the position vector of the i-th suggested region;
  • a variable marked with * denotes the corresponding real (ground-truth) value;
  • L_cls is the classification loss function;
  • L_reg is the loss function of the regression position;
  • N_cls and N_reg represent the numbers of selected classification and regression training samples, respectively;
  • λ is a preset empirical value; and
  • j is any of x, y, w, and h, where x and y are the abscissa and the ordinate of the center point of the corresponding suggested region, respectively, and w and h are the width and height of the corresponding suggested region, respectively.
  • when the intersection ratio, in the vertical direction, of the i-th suggested region and the horizontal portion of the corresponding real text region is greater than a preset threshold, c_i is equal to 1, indicating that the i-th suggested region is a positive sample; when that intersection ratio is less than or equal to the preset threshold, c_i is equal to 0, indicating that the i-th suggested region is a negative sample.
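Under the definitions above, the target error function can be sketched as follows, assuming a log loss for L_cls and summing the smooth L1 loss over the position components for L_reg; all function and parameter names here are illustrative:

```python
import math

def smooth_l1(x):
    """Smooth L1 loss on a scalar difference."""
    return 0.5 * x * x if abs(x) < 1.0 else abs(x) - 0.5

def target_error(probs, labels, preds, truths, lam=1.0):
    """L = (1/N_cls) sum_i L_cls(c_i, c_i*) + lam*(1/N_reg) sum_i c_i* L_reg(r_i, r_i*)
    probs        : predicted text probability c_i per suggested region
    labels       : ground-truth marker c_i* (1 positive, 0 negative)
    preds/truths : position vectors r_i, r_i* = (x, y, w, h)
    Regression is only counted for positive samples (c_i* = 1)."""
    n_cls = len(probs)
    # Classification term: log loss against the 0/1 marker.
    l_cls = sum(-math.log(p) if c == 1 else -math.log(1.0 - p)
                for p, c in zip(probs, labels)) / n_cls
    # Regression term: smooth L1 over (x, y, w, h), positives only.
    pos = [k for k, c in enumerate(labels) if c == 1]
    n_reg = max(len(pos), 1)
    l_reg = sum(smooth_l1(a - b)
                for k in pos
                for a, b in zip(preds[k], truths[k])) / n_reg
    return l_cls + lam * l_reg

# One positive sample (slightly off in x) and one negative sample.
loss = target_error(
    probs=[0.9, 0.2],
    labels=[1, 0],
    preds=[(10.0, 5.0, 16.0, 12.0), (0.0, 0.0, 0.0, 0.0)],
    truths=[(10.5, 5.0, 16.0, 12.0), (0.0, 0.0, 0.0, 0.0)],
)
```

The negative sample contributes only to the classification term; its regression error is gated out by c_i* = 0, matching the role of the marker in the formula.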
  • the classifier 50 determines, according to the intersection ratio of the suggested region intercepted by the anchor rectangle and the real region, whether each suggested region corresponds to a region including text (positive sample) or a region not including text (negative sample). Therefore, when an anchor rectangle coincides with the real region in the vertical direction but covers only a small part of the real region in the horizontal direction, the anchor rectangle will still be considered to correspond to the text region and thus be selected as a positive sample.
  • by contrast, if the intersection ratio were computed over the full region, such an anchor rectangle would not be selected as a positive sample even though it indeed falls within a text area.
  • the trained convolutional neural network, that is, the above-described text detecting device 2000, is obtained by adjusting the network parameters in an iterative training process so as to reduce the difference, represented by the training objective function, between the real text region and the predicted text region.
  • each anchor rectangle (or the suggested region intercepted by the anchor rectangle) corresponds to a lateral portion of the area to be detected.
  • the classifier in the convolutional neural network also considers the characteristics of the suggested region in the vertical direction to predict whether each suggested region corresponds to the text region.
  • the text region detection result is obtained after the suggested regions determined by the classification to correspond to regions including text are horizontally spliced according to the positions in the image determined by the regression. Based on such a technical solution, the problem that the actual real region corresponding to the text region cannot be correctly recognized when the anchor rectangle width is smaller than the real region width is avoided.
  • the training method for the convolutional neural network provided by the embodiment of the present application may be performed by any suitable device having data processing capability, including but not limited to: terminal devices, servers, and the like. Alternatively, the training method for the convolutional neural network provided by the embodiment of the present application may be performed by a processor, for example, the processor performs the training method for the convolutional neural network mentioned in the embodiment of the present application by calling corresponding instructions stored in the memory. This will not be repeated below.
  • the foregoing program may be stored in a computer readable storage medium; when the program is executed, the steps of the foregoing method embodiments are performed. The foregoing storage medium includes any medium that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disk.
  • FIG. 5 shows an architectural diagram of a text detection training device 5000 in accordance with an exemplary embodiment.
  • Each module of the text detection training device 5000 executes a corresponding step of the above-described text detection training method 4000.
  • the text detection training device 5000 is implemented in the form of an RPN.
  • the text detection training apparatus 5000 includes an image feature extraction module 5010, a suggestion region intercepting module 5030, a classification module 5040, a regression module 5050, and a training module 5060. The image feature extraction module 5010 extracts the feature map from the training image including the text region using a convolutional neural network; the suggestion region intercepting module 5030 laterally intercepts the feature map of the training image with a plurality of anchor rectangles to obtain a plurality of suggested regions; the classification module 5040 classifies each suggested region through the convolutional neural network to determine whether each suggested region corresponds to a region including text; the regression module 5050 regresses each suggested region through the convolutional neural network to determine the position of each suggested region in the training image; and the training module 5060 iteratively trains the convolutional neural network according to the difference between the known real text region corresponding to the training image and the predicted text region obtained by the classification and regression, until the training result satisfies the predetermined convergence condition.
  • the training image including the text is first input to the image feature extraction module 5010, which uses the convolutional neural network to extract a feature map from the training image including the text region.
  • the feature map obtained by convolution contains the feature information of the training image.
  • the feature map extracted by the image feature extraction module 5010 is input into the suggestion region intercepting module 5030.
  • in the suggestion region intercepting module 5030, the feature map is laterally intercepted by a plurality of anchor rectangles to obtain a plurality of suggestion regions.
  • the obtained suggestion regions are respectively input to the classification module 5040 and the regression module 5050 for classification and regression: the classification determines whether each suggestion region corresponds to a region including text, and the regression determines the position in the training image to which each suggestion region corresponds.
  • the training module 5060 iteratively trains the convolutional neural network according to the known difference between the real text region corresponding to the training image and the predicted text region obtained by the classification and regression until the training result satisfies a predetermined convergence condition.
  • the predetermined convergence condition may be, for example, that the error value of the most recent iteration falls within an allowable range, that the error value is less than a predetermined value, that the error value is minimized, or that the number of iterations reaches a predetermined number.
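The iterative training guarded by these convergence conditions can be sketched as a simple loop. This is an illustrative sketch only: `train_step`, the allowable range, and the iteration budget are hypothetical placeholders, not values stated by the present application.

```python
def converged(errors, allowed_range=(0.0, 0.05), max_iters=10000):
    """Predetermined convergence check as described above: the most recent
    error value falls within the allowable range, or the number of
    iterations reaches a predetermined number. Thresholds are illustrative."""
    if len(errors) >= max_iters:
        return True  # iteration count reached the predetermined number
    if errors and allowed_range[0] <= errors[-1] <= allowed_range[1]:
        return True  # latest error value falls within the allowable range
    return False


def train(train_step):
    """Iteratively train until a convergence condition is met.
    `train_step` is a hypothetical callable that runs one training
    iteration and returns its error value."""
    errors = []
    while not converged(errors):
        errors.append(train_step())
    return errors
```

For example, with a step whose error halves each iteration starting from 1.0, the loop stops as soon as the error drops to 0.03125, the first value inside the allowable range.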
  • the training module 5060 determines the difference between the real text region and the predicted text region according to the overlap ratio of the predicted text region and the corresponding real text region in the vertical direction.
  • regression module 5050 determines a difference between the real text region and the predicted text region based on a smooth L1 loss function.
  • One form of difference can be an error.
  • the suggestion region corresponding to the predicted text region is determined as a positive sample by the training module 5060; otherwise, the suggestion region corresponding to the predicted text region is determined as a negative sample by the training module 5060.
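The two quantities described above can be sketched minimally: the smooth L1 loss used by the regression module, and labeling a suggestion region as a positive or negative sample from the vertical overlap ratio between the predicted and real text regions. The function names and the 0.7 threshold are our illustrative assumptions, not values given by the present application.

```python
def smooth_l1(x):
    """Smooth L1 loss: 0.5*x^2 when |x| < 1, |x| - 0.5 otherwise."""
    ax = abs(x)
    return 0.5 * ax * ax if ax < 1.0 else ax - 0.5


def vertical_overlap_ratio(pred, real):
    """Overlap ratio in the vertical direction between a predicted text
    region and the corresponding real text region, each given as a
    (top, bottom) pair with top < bottom. Ratio = intersection / union."""
    inter = max(0.0, min(pred[1], real[1]) - max(pred[0], real[0]))
    union = max(pred[1], real[1]) - min(pred[0], real[0])
    return inter / union if union > 0 else 0.0


def label_proposal(pred, real, threshold=0.7):
    """Mark the suggestion region as a positive or negative sample based
    on the vertical overlap ratio (threshold value is hypothetical)."""
    return "positive" if vertical_overlap_ratio(pred, real) >= threshold else "negative"
```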
  • the various features of the text detection training method 4000 described above in connection with FIG. 4 are applicable to the text detection training apparatus 5000 shown in FIG. 5.
  • any number of the various features of the text detection training method 4000 described above in connection with FIG. 4 can be incorporated, in any combination, into the text detection training apparatus 5000 shown in FIG. 5.
  • the width of the anchor rectangle employed may be fixed, thereby reducing the size and number of anchor rectangles required for matching, thereby reducing the amount of calculation.
  • the width of the anchor rectangle used may be equal to the step size of the convolutional neural network, so that the detection results, when laterally spliced, form a detection result covering the entire detection area.
  • the width of the anchor rectangle used may be slightly larger than the step size of the convolutional neural network; for example, the width of the anchor rectangle may be the step size of the convolutional neural network plus 1. The detection result formed by laterally splicing then corresponds to the entire width of the detection area with a small amount of overlap, which avoids gaps between adjacent anchor rectangles caused by factors such as errors in actual use, so that no intermediate width of the detection area is missed.
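As a sketch of the anchor layout described above, the following lays out fixed-width anchor rectangles whose width equals the network step size plus one, so adjacent anchors overlap by one pixel while covering the whole width of the detection area. The function names are ours, and the stride value in the example is illustrative only.

```python
def anchor_spans(area_width, stride):
    """Lay out fixed-width anchor rectangles across the detection area.
    Each anchor starts at a multiple of the stride and is (stride + 1)
    pixels wide, so neighbours overlap by one pixel and no intermediate
    width of the area is missed."""
    spans = []
    for x in range(0, area_width, stride):
        spans.append((x, min(x + stride + 1, area_width)))
    return spans


def splice(spans):
    """Laterally splice adjacent or overlapping spans into whole regions."""
    merged = [list(spans[0])]
    for start, end in spans[1:]:
        if start <= merged[-1][1]:        # adjacent or overlapping
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(m) for m in merged]
```

With an area width of 64 and a stride of 16, the anchors are (0, 17), (16, 33), (32, 49), (48, 64), which splice back into the single full-width span (0, 64).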
  • the character detecting method and apparatus and the character detecting training method and apparatus described with reference to Figs. 1 to 5 can be implemented by a computer system.
  • the computer system can include a memory that stores executable instructions and a processor.
  • the processor is in communication with the memory to execute executable instructions to implement the text detection method and apparatus and text detection training method and apparatus described with reference to Figures 1 through 5.
  • the text detection method and apparatus and text detection training method and apparatus described with reference to Figures 1 through 5 may be implemented by a non-transitory computer storage medium.
  • the medium stores computer readable instructions that, when executed, cause the processor to perform the text detection method and apparatus and text detection training method and apparatus described with reference to Figures 1 through 5.
  • FIG. 6 there is shown a block diagram of a computer system 6000 suitable for implementing embodiments of the present application.
  • computer system 6000 can include a processing unit (such as a central processing unit (CPU) 6001, a graphics processing unit (GPU), etc.) that can perform various appropriate actions and processes according to a program stored in read only memory (ROM) 6002 or a program loaded from a storage portion 6008 into random access memory (RAM) 6003. Various programs and data required for the operation of the system 6000 can also be stored in the RAM 6003.
  • the CPU 6001, the ROM 6002, and the RAM 6003 are connected to each other through a bus 6004.
  • the input/output I/O interface 6005 is also connected to the bus 6004.
  • the communication section 6009 can perform communication processing through a network such as the Internet.
  • the driver 6010 can also be connected to the I/O interface 6005 as needed.
  • a removable medium 6011 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory or the like can be mounted on the drive 6010 so that a computer program read therefrom can be installed into the storage portion 6008 as needed.
  • the text detection method and apparatus and text detection training method and apparatus described above with reference to FIGS. 1 through 5 may be implemented as a computer software program in accordance with an embodiment of the present disclosure.
  • embodiments of the present disclosure can include a computer program product comprising a computer program tangibly embodied in a machine readable medium.
  • the computer program includes program code for performing the text detection method and the text detection training method described with reference to FIGS. 1 through 5.
  • the computer program can be downloaded and installed from the network via the communication portion 6009, and/or can be installed from the removable medium 6011.
  • the text detection technology of the present application can be used in a company badge identification product. For example, when an employee wearing a company badge passes the camera of the company access control system, the data processing device of the access control system (such as a computer or server connected to the camera through a network) can obtain an image of the employee wearing the badge from the camera. The data processing device can then locate the text area on the badge in the image using the text detection technology of the present application, and by performing text recognition on that text area, obtain information such as the employee's name and department marked on the badge.
  • the text detection technology of the present application can also be used in various applications involving text box positioning, for example, text box positioning for formatted texts such as medical bills, express orders, and invoices, so as to facilitate text recognition of the positioned text box.
  • the result of the text box positioning or the result of the text recognition may be stored or displayed locally, or may be transmitted to a server or a peer in a peer-to-peer network. This application does not limit the specific application scenario of the text box after positioning.
  • each block of the flowcharts or block diagrams can represent a module, a program segment, or a portion of code that includes one or more executable instructions for implementing the specified logical function. It should also be noted that the functions noted in the blocks may occur in an order different from that illustrated in the drawings; for example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending upon the functionality involved.
  • each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified function or operation, or by a combination of dedicated hardware and computer instructions.
  • the units or modules involved in the embodiments of the present application may be implemented by software or hardware.
  • the described unit or module can also be provided in the processor.
  • the names of these units or modules should not be construed as limiting these units or modules.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Analysis (AREA)
  • Character Discrimination (AREA)

Abstract

Disclosed are a character detection method and device, and a character detection training method and device. An exemplary character detection method comprises: extracting a feature map from an image comprising a character region by using a convolutional neural network; transversely clipping the feature map by using a plurality of anchors separately, to obtain a plurality of suggestion regions; classifying and regressing each suggestion region by means of the convolutional neural network, wherein whether each suggestion region corresponds to a region comprising characters is determined by means of the classification, and the position in the image corresponding to each suggestion region is determined by means of the regression; and transversely splicing, according to the positions in the image that respectively correspond to the suggestion regions and are determined by means of the regression, the suggestion regions that correspond to the regions comprising characters and are determined by means of the classification, to obtain a character region detection result.

Description

Text detection method and device, and text detection training method and device
The present disclosure claims priority to the Chinese patent application filed with the Chinese Patent Office on September 22, 2016, with application number 201610842572.3 and entitled "Text detection method and apparatus, and text detection training method and apparatus", the entire contents of which are incorporated herein by reference.
Technical field
The present application relates to text detection and, in particular, to a text detection method and apparatus, and a text detection training method and apparatus.
Background
In recent years, general object detection methods based on convolutional neural networks have been tried in the field of text detection and have achieved good results. The Region Proposal Network (RPN) is one of the best performing algorithms among convolutional neural networks; how to apply the region proposal network to text detection has attracted widespread attention and research enthusiasm in the industry.
Summary of the invention
The present application provides technical solutions for text detection.
In one aspect, the present application provides a text detection method, including: extracting a feature map from an image including a text region using a convolutional neural network; laterally intercepting the feature map with a plurality of anchor rectangles to obtain a plurality of suggestion regions; classifying and regressing each suggestion region through the convolutional neural network, wherein the classification determines whether each suggestion region corresponds to a region including text, and the regression determines the position in the image to which each suggestion region corresponds; and laterally splicing the suggestion regions determined by the classification to correspond to regions including text, according to the positions in the image determined by the regression for the respective suggestion regions, to obtain a text region detection result.
In another aspect, the present application provides a text detection training method, including: extracting a feature map from a training image including a text region using a convolutional neural network; laterally intercepting the feature map of the training image with a plurality of anchor rectangles to obtain a plurality of suggestion regions; classifying and regressing the suggestion region intercepted by each anchor rectangle through the convolutional neural network, wherein the classification determines whether each suggestion region corresponds to a region including text, and the regression determines the position in the training image to which each suggestion region corresponds; and iteratively training the convolutional neural network according to the difference between the known real text region corresponding to the training image and the predicted text region obtained by the classification and regression, until the training result satisfies a predetermined convergence condition.
In yet another aspect, the present application provides a text detection apparatus, including: an image feature extraction module that extracts a feature map from an image including a text region using a convolutional neural network; a suggestion region interception module that laterally intercepts the feature map with a plurality of anchor rectangles to obtain a plurality of suggestion regions; a classification module that classifies each suggestion region through the convolutional neural network to determine whether each suggestion region corresponds to a region including text; a regression module that regresses each suggestion region through the convolutional neural network to determine the position in the image to which each suggestion region corresponds; and a detection result splicing module that laterally splices the suggestion regions determined by the classification module to correspond to regions including text, according to the positions in the image determined by the regression module for the respective suggestion regions, to obtain a text region detection result.
In still another aspect, the present application provides a text detection training apparatus, including: an image feature extraction module that extracts a feature map from a training image including a text region using a convolutional neural network; a suggestion region interception module that laterally intercepts the feature map of the training image with a plurality of anchor rectangles to obtain a plurality of suggestion regions; a classification module that classifies each suggestion region through the convolutional neural network to determine whether each suggestion region corresponds to a region including text; a regression module that regresses each suggestion region through the convolutional neural network to determine the position in the training image to which each suggestion region corresponds; and a training module that iteratively trains the convolutional neural network according to the difference between the known real text region corresponding to the training image and the predicted text region obtained by the classification and regression, until the training result satisfies a predetermined convergence condition.
In still another aspect, the present application provides a text detection apparatus, including: a memory storing executable instructions; and one or more processors in communication with the memory to execute the executable instructions so as to perform the following operations: extracting a feature map from an image including a text region using a convolutional neural network; laterally intercepting the feature map with a plurality of anchor rectangles to obtain a plurality of suggestion regions; classifying and regressing each suggestion region through the convolutional neural network, wherein the classification determines whether each suggestion region corresponds to a region including text, and the regression determines the position in the image to which each suggestion region corresponds; and laterally splicing the suggestion regions determined by the classification to correspond to regions including text, according to the positions in the image determined by the regression for the respective suggestion regions, to obtain a text region detection result.
In still another aspect, the present application provides a text detection training apparatus, including: a memory storing executable instructions; and one or more processors in communication with the memory to execute the executable instructions so as to perform the following operations: extracting a feature map from a training image including a text region using a convolutional neural network; laterally intercepting the feature map of the training image with a plurality of anchor rectangles to obtain a plurality of suggestion regions; classifying and regressing the suggestion region intercepted by each anchor rectangle through the convolutional neural network, wherein the classification determines whether each suggestion region corresponds to a region including text, and the regression determines the position of each suggestion region; and iteratively training the convolutional neural network according to the difference between the known real text region corresponding to the training image and the predicted text region obtained by the classification and regression, until the training result satisfies a predetermined convergence condition.
The present application also provides a computer readable medium storing computer executable instructions; when a processor executes the computer executable instructions stored in the computer readable medium, the processor performs any of the text detection methods and/or text detection training methods provided by the embodiments of the present application.
By performing feature extraction and the subsequent classification and regression with a plurality of laterally spliced anchor rectangles, each anchor rectangle intercepts only the suggestion region corresponding to a lateral portion of the region to be detected in the image. Therefore, when detecting a text region with a large width, the convolutional neural network used for text detection only needs to see the area near a single anchor rectangle corresponding to a lateral portion of the region to be detected, rather than requiring a very large receptive field, which reduces the difficulty of network design.
Brief description of the drawings
The accompanying drawings, which constitute a part of the specification, describe embodiments of the present application and, together with the description, serve to explain the principles of the present application.
The present application can be understood more clearly from the following detailed description with reference to the accompanying drawings, in which:
FIG. 1 is a flowchart showing a text detection method according to an embodiment of the present application;
FIG. 2 shows an architectural diagram of a text detection apparatus according to an exemplary embodiment;
FIG. 3 shows a schematic diagram of an exemplary application example according to the present application;
FIG. 4 shows a flowchart of a method of training a convolutional neural network according to an exemplary embodiment;
FIG. 5 shows an architectural diagram of a text detection training apparatus according to an exemplary embodiment; and
FIG. 6 is a block diagram showing the structure of a computer system suitable for implementing an embodiment of the present application.
Detailed description of embodiments
Various exemplary embodiments of the present application will now be described in detail with reference to the accompanying drawings. It should be noted that, unless otherwise specified, the relative arrangement of the components and steps, the numerical expressions, and the numerical values set forth in these embodiments do not limit the scope of the present application.
It should also be understood that, for ease of description, the dimensions of the various parts shown in the drawings are not drawn to actual scale.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended as a limitation of the present application or of its applications or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail, but where appropriate, such techniques, methods, and apparatus should be considered part of the specification.
It should be noted that similar reference numerals and letters denote similar items in the following figures; therefore, once an item is defined in one figure, it need not be further discussed in subsequent figures.
Embodiments of the present application can be applied to electronic devices such as terminal devices, computer systems, and servers, which can operate with numerous other general purpose or special purpose computing system environments or configurations. Examples of well-known terminal devices, computing systems, environments, and/or configurations suitable for use with electronic devices such as terminal devices, computer systems, or servers include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, handheld or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, networked personal computers, small computer systems, mainframe computer systems, and distributed cloud computing environments including any of the above systems, and the like.
Electronic devices such as terminal devices, computer systems, and servers can be described in the general context of computer system executable instructions (such as program modules) being executed by a computer system. Generally, program modules may include routines, programs, target programs, components, logic, data structures, and the like that perform particular tasks or implement particular abstract data types. The computer system/server can be implemented in a distributed cloud computing environment where tasks are performed by remote processing devices that are linked through a communication network. In a distributed cloud computing environment, program modules may be located on local or remote computing system storage media including storage devices.
FIG. 1 shows a flowchart 1000 of a text detection method according to an embodiment of the present application. In step S1010, a feature map is extracted from an image including a text region using a convolutional neural network. The feature map obtained through the convolutional neural network contains the feature information of the image. In step S1030, the feature map is laterally intercepted with a plurality of anchor rectangles to obtain a plurality of suggestion regions (for example, at least two suggestion regions). Since the feature map is laterally intercepted by a plurality of anchor rectangles, each obtained suggestion region corresponds to a lateral portion of the image to be detected rather than to the entire lateral length of the text region to be detected. In step S1050, each suggestion region is classified and regressed through the convolutional neural network, wherein the classification determines whether each suggestion region corresponds to a region including text, and the regression determines the position in the image to be detected to which each suggestion region corresponds.
In step S1070, the suggestion regions determined by the classification to correspond to regions including text are laterally spliced according to the positions in the image determined by the regression for the respective suggestion regions, to obtain a text region detection result. In one optional example, suggestion regions whose positions are adjacent and/or intersecting may be connected according to the positions in the image to be detected determined by the regression, thereby obtaining the text region detection result; in another optional example, the anchor rectangles corresponding to suggestion regions whose positions are adjacent and/or intersecting may be connected according to those positions, thereby obtaining the text region detection result.
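The splicing of step S1070 can be sketched as a one-dimensional merge over the horizontal extents of the regions classified as text. This is a simplified illustration of the idea under our own representation (left/right x-coordinates plus a classification flag); the apparatus itself operates on full rectangles in the image.

```python
def splice_text_regions(proposals):
    """proposals: list of (x_left, x_right, is_text) tuples, where is_text
    is the classification result for the suggestion region and the
    x-extent comes from the regression. Adjacent and/or intersecting
    regions classified as text are laterally spliced into whole text
    regions, which form the detection result."""
    text = sorted((left, right) for left, right, is_text in proposals if is_text)
    regions = []
    for left, right in text:
        if regions and left <= regions[-1][1]:   # adjacent or intersecting
            regions[-1][1] = max(regions[-1][1], right)
        else:
            regions.append([left, right])
    return [tuple(r) for r in regions]
```

For instance, proposals at (0, 17), (16, 33), (56, 73), and (70, 90) classified as text, with (40, 57) classified as non-text, splice into the two text regions (0, 33) and (56, 90).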
Since the objects processed by the classification and regression are suggestion regions, intercepted by anchor rectangles, that each correspond to a lateral portion of the image to be detected, the convolutional neural network used for text detection only needs to see the area near a single anchor rectangle corresponding to a lateral portion of the text region when detecting a text region with a large width, rather than requiring a very large receptive field, which reduces the difficulty of network design.
In the above text detection method, the plurality of anchor rectangles may be anchor rectangles continuously spliced in the lateral direction (that is, the width direction), so that the suggestion regions intercepted by the anchor rectangles can correspond to the entire width of the image to be detected. Optionally, the anchor rectangles may overlap slightly in the width direction; for example, two adjacent anchor rectangles may overlap by one pixel in the width direction. The suggestion regions intercepted by the anchor rectangles then correspond to the entire width of the image to be detected with a small amount of overlap, which avoids gaps between adjacent anchor rectangles or adjacent suggestion regions caused by errors in actual use, so that no intermediate width of the image to be detected is missed.
本领域普通技术人员可以理解:实现上述方法实施例的全部或部分步骤可以通过程序指令相关的硬件来完成,前述的程序可以存储于一计算机可读取存储介质中,该程序在执行时,执行包括上述方法实施例的步骤;而前述的存储介质包括:ROM、RAM、磁碟或者光盘等各种可以存储程序代码的介质。A person skilled in the art can understand that all or part of the steps of implementing the above method embodiments may be completed by using hardware related to the program instructions. The foregoing program may be stored in a computer readable storage medium, and the program is executed when executed. The foregoing steps include the steps of the foregoing method embodiments; and the foregoing storage medium includes: a medium that can store program codes, such as a ROM, a RAM, a magnetic disk, or an optical disk.
FIG. 2 shows an architectural diagram of a text detection device 2000 according to an exemplary embodiment. In an optional example, the text detection device 2000 is implemented in the form of an RPN (region proposal network). As shown in FIG. 2, the text detection device 2000 includes an image feature extraction module 2010, a proposal region interception module 2030, a classification module 2040, a regression module 2050, and a detection result splicing module 2070. The image feature extraction module 2010 uses a convolutional neural network to extract a feature map from an image that includes a text region; the proposal region interception module 2030 laterally intercepts the feature map with a plurality of anchor rectangles to obtain a plurality of proposal regions; the classification module 2040 classifies each proposal region through the convolutional neural network to determine whether it corresponds to a region that includes text; the regression module 2050 regresses each proposal region through the convolutional neural network to determine its position in the image; and the detection result splicing module 2070 laterally splices the proposal regions that the classification module 2040 determined to include text, according to the positions in the image determined by the regression module 2050, to obtain a text region detection result.
In an optional example, in combination with the above, when detecting text in an image, the image including the text is first input into the image feature extraction module 2010, which uses a convolutional neural network to extract a feature map from the image; the feature map obtained by convolution contains the feature information of the image. The feature map extracted by the image feature extraction module 2010 is then input into the proposal region interception module 2030, which laterally intercepts the feature map with a plurality of anchor rectangles to obtain a plurality of proposal regions. The obtained proposal regions are input into the classification module 2040 and the regression module 2050 for classification and regression: classification determines whether each proposal region corresponds to a region that includes text, and regression determines the position of each proposal region in the image. The detection result splicing module 2070 laterally splices the proposal regions that the classification module 2040 determined to include text, according to the positions in the image determined by the regression module 2050, to obtain the text region detection result. In one optional example, the detection result splicing module 2070 connects proposal regions that are adjacent in position and/or intersect, according to the positions in the image determined by regression, to obtain the text region detection result; in another optional example, the detection result splicing module 2070 connects the anchor rectangles corresponding to such adjacent and/or intersecting proposal regions, to obtain the text region detection result.
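The flow through the modules described above can be illustrated as a simple composition. The callables below are hypothetical stand-ins (assumptions for illustration), not the patent's actual implementations:

```python
# Illustrative composition of the detection pipeline: feature extraction,
# proposal interception, classification, regression, and splicing.
def detect_text(image, extract_features, crop_proposals, classify, regress, splice):
    feature_map = extract_features(image)            # module 2010
    proposals = crop_proposals(feature_map)          # module 2030
    text_flags = [classify(p) for p in proposals]    # module 2040
    positions = [regress(p) for p in proposals]      # module 2050
    kept = [pos for pos, is_text in zip(positions, text_flags) if is_text]
    return splice(kept)                              # module 2070

# Dummy callables, for illustration only.
result = detect_text(
    image="img",
    extract_features=lambda img: img,
    crop_proposals=lambda fm: [0, 1, 2, 3],
    classify=lambda p: p % 2 == 0,
    regress=lambda p: p * 10,
    splice=sorted,
)
```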
An exemplary application example is described below in conjunction with the above text detection method and text detection device. FIG. 3 shows a schematic diagram of an exemplary application example according to the present application.
As shown in FIG. 3, the image 10 containing a text region is the object to be detected. In an existing RPN, the anchor rectangle employed is, for example, the illustrated single anchor rectangle 110 corresponding to the entire lateral width of the text region to be detected. Detection of the text region can only succeed when the lateral width of the anchor rectangle employed corresponds to the entire lateral width of the text region to be detected. Consequently, when the text is wide, the RPN often requires a very large receptive field to process it, which makes the network very difficult to design. Existing region proposal networks are therefore often unsuitable for direct application to text detection.
As shown in FIG. 3, according to an exemplary embodiment of the present application, a plurality of laterally spliced anchor rectangles 120 is used instead of the single anchor rectangle 110, and the sum of the widths of the laterally spliced anchor rectangles 120 corresponds to the entire lateral width of the text region to be detected. For example, the sum of the widths may be equal to, or slightly greater than, the entire lateral width of the text region to be detected. When the sum of the widths of the laterally spliced anchor rectangles 120 equals the entire lateral width of the text region to be detected, the anchor rectangles 120 abut one another so as to correspond to the entire lateral width of the text region. When the sum of the widths is greater than the entire lateral width, at least some adjacent anchor rectangles 120 partially overlap, and the width of the region formed by connecting the anchor rectangles 120 corresponds to the entire lateral width of the text region to be detected. In the above text detection method, the image feature extraction module 2010 in the convolutional neural network first extracts a feature map from the image to be detected 10; FIG. 3 shows a portion 20 of the resulting feature map by way of example. In the proposal region interception module 2030, the feature map is intercepted with a plurality of laterally spliced anchor rectangles to obtain a plurality of proposal regions, so that the proposal region intercepted by each anchor rectangle can be processed separately; each such proposal region takes, for example, the form of the sliding window shown in FIG. 3. Optionally, a proposal region intercepted by an anchor rectangle may be further processed by one or more convolutional layers 40. The proposal regions processed by the convolutional layer 40 (or proposal regions not processed by a convolutional layer) are input to a classifier 50 and a regressor 60: the classifier 50 identifies whether each proposal region is a text region, and the regressor 60 determines the position of each proposal region in the image to be detected 10. Finally, the detection result splicing module 2070 splices the proposal regions that the classifier 50 determined to correspond to text regions, according to the positions determined by the regressor 60, to form the text detection result. As mentioned above, splicing may optionally be performed by the detection result splicing module 2070 connecting proposal regions that are adjacent in position and/or intersect, or by connecting the anchor rectangles corresponding to such adjacent and/or intersecting proposal regions, to obtain the text region detection result.
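The lateral splicing step can be sketched minimally as merging proposal extents that are adjacent or intersecting along the width into single text regions. The helper name and the one-dimensional (x0, x1) extent representation are assumptions for illustration:

```python
# Sketch (assumed helper): horizontally splicing proposal boxes classified
# as text. Extents that touch or intersect along the width are merged into
# one text-region detection.
def splice_text_boxes(boxes):
    """boxes: list of (x0, x1) horizontal extents of text proposals."""
    merged = []
    for x0, x1 in sorted(boxes):
        if merged and x0 <= merged[-1][1]:  # adjacent or overlapping
            merged[-1][1] = max(merged[-1][1], x1)
        else:
            merged.append([x0, x1])
    return [tuple(m) for m in merged]

# Three overlapping proposals and one separate proposal yield two regions.
regions = splice_text_boxes([(0, 17), (16, 33), (32, 49), (80, 97)])
```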
According to an exemplary embodiment, the above text detection method 1000 further includes a step of training the convolutional neural network in advance. Through the training described below, a trained text detection device, such as the text detection device 2000 described above, is obtained.
FIG. 4 shows a training method 4000 for a convolutional neural network according to an exemplary embodiment. In an optional example, as shown in FIG. 4, the training method 4000 for the convolutional neural network may include: in step S4010, using the convolutional neural network to extract a feature map from a training image that includes a text region; in step S4030, laterally intercepting the feature map of the training image with a plurality of anchor rectangles to obtain a plurality of proposal regions; in step S4050, classifying and regressing the proposal region intercepted by each anchor rectangle through the convolutional neural network, where the classification determines whether each proposal region corresponds to a region that includes text and the regression determines the position of each proposal region; and in step S4070, iteratively training the convolutional neural network according to the difference between the known real text region corresponding to the training image and the predicted text region obtained by the classification and regression, until the training result satisfies a predetermined convergence condition. The predetermined convergence condition may be, for example, that the error value of the most recent iteration falls within an allowable range, that the error value is less than a predetermined value, that the error value is minimal, or that the number of iterations reaches a predetermined number.
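The iterative-training outer loop with its predetermined convergence conditions can be sketched generically. `train_step` below is a hypothetical stand-in for one forward/backward pass of the convolutional neural network, not the patent's actual implementation:

```python
# Sketch of the iterative-training loop described above, stopping when the
# error falls within tolerance or the iteration budget is exhausted (two of
# the convergence conditions named above).
def train_until_converged(train_step, tol=1e-3, max_iters=1000):
    error = float("inf")
    for it in range(1, max_iters + 1):
        error = train_step(it)
        if error < tol:
            break
    return it, error

# Dummy step whose error halves each iteration, for illustration only.
iters, err = train_until_converged(lambda it: 1.0 / 2 ** it)
```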
According to an embodiment of the present application, in each iteration of training the convolutional neural network, the difference between the real text region and the predicted text region is determined according to the intersection-over-union of the predicted text region and the corresponding real text region in the vertical direction. For example, in each iteration of training the convolutional neural network, the difference between the real text region and the predicted text region is determined according to a smooth L1 loss function. One form the difference may take is an error value.
According to an embodiment of the present application, when the intersection-over-union of a predicted text region and the corresponding real text region in the vertical direction is greater than a preset threshold, the proposal region corresponding to that predicted text region is determined to be a positive sample; otherwise, the proposal region corresponding to that predicted text region is determined to be a negative sample.
In an optional example, the classifier may use a softmax loss function as the training objective function to predict whether a proposal region is a text region. According to an exemplary embodiment, during training, when computing the error value of the convolutional neural network, the classifier determines whether each proposal region is a positive or a negative sample according to the intersection-over-union, in the vertical direction, of the proposal region and the lateral portion of the corresponding real text region. The regressor may use the smooth L1 loss function from the RPN as the training objective function to minimize the difference between the real text region and the predicted text region. After the convolutional neural network is iteratively trained until the training result satisfies the predetermined convergence condition, the parameters of the convolutional neural network are adapted to recognizing text regions in an image using a plurality of laterally spliced anchor rectangles.
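The vertical-direction intersection-over-union used above to label proposal regions as positive or negative samples can be sketched as an interval IoU. The one-dimensional (y0, y1) representation and the 0.5 threshold value are assumptions for illustration:

```python
# Sketch (assumed form): intersection-over-union computed only along the
# vertical (y) axis, as used above to label proposals positive or negative.
def vertical_iou(box_a, box_b):
    """Each box is (y0, y1). Returns interval IoU along the height."""
    inter = max(0.0, min(box_a[1], box_b[1]) - max(box_a[0], box_b[0]))
    union = (box_a[1] - box_a[0]) + (box_b[1] - box_b[0]) - inter
    return inter / union if union > 0 else 0.0

def label_proposal(pred, gt, threshold=0.5):
    """1 = positive sample, 0 = negative sample (threshold value assumed)."""
    return 1 if vertical_iou(pred, gt) > threshold else 0
```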
In an optional example, when the smooth L1 loss function from the RPN is used as the training objective function, the difference between the real text region and the predicted text region is determined by the following formula:
$$L(\{c_i\},\{r_i\}) \;=\; \frac{1}{N_{cls}} \sum_i L_{cls}\big(c_i, c_i^*\big) \;+\; \lambda\,\frac{1}{N_{reg}} \sum_i c_i^* \, L_{reg}\big(r_i, r_i^*\big),$$

$$L_{reg}\big(r_i, r_i^*\big) \;=\; \sum_{j \in \{x,y,w,h\}} \mathrm{smooth}_{L1}\big(r_i^j - r_i^{*j}\big), \qquad \mathrm{smooth}_{L1}(x) \;=\; \begin{cases} 0.5\,x^2 & |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$$
where L is the target error function; i is the index of a proposal region intercepted by an anchor rectangle; c<sub>i</sub> is the category label of the i-th proposal region; r<sub>i</sub> is the position vector of the i-th proposal region; a variable with superscript * denotes the target real value of the corresponding variable; L<sub>cls</sub> is the classification loss function; L<sub>reg</sub> is the loss function for the regression position; N<sub>cls</sub> and N<sub>reg</sub> respectively denote the numbers of selected classification and regression training samples; λ is a preset empirical value; and j is any of x, y, w, and h, where x and y are respectively the abscissa and the ordinate of the center point of the corresponding proposal region, and w and h are respectively its width and height.
When the intersection-over-union, in the vertical direction, of the i-th proposal region and the lateral portion of the corresponding real text region is greater than the preset threshold, c<sub>i</sub> equals 1, indicating that the i-th proposal region is a positive sample; and when that intersection-over-union is less than or equal to the preset threshold, c<sub>i</sub> equals 0, indicating that the i-th proposal region is a negative sample.
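The smooth L1 term of the objective above follows its standard RPN definition, which the text says is reused here; the function names below are illustrative:

```python
# Sketch of the smooth L1 regression term in the objective above.
def smooth_l1(x):
    """Piecewise-quadratic/linear smooth L1, per the standard definition."""
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5

def reg_loss(r, r_star):
    """Sum of smooth L1 over j in {x, y, w, h} for one proposal region."""
    return sum(smooth_l1(a - b) for a, b in zip(r, r_star))
```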
Because, in the above training process, the classifier 50 determines from the intersection-over-union of the proposal region intercepted by the anchor rectangle and the real region whether each proposal region corresponds to a region that includes text (a positive sample) or a region that does not (a negative sample), an anchor rectangle that coincides with the real region in the vertical direction but covers only a small lateral portion of it will still be considered to correspond to the text region and will therefore be selected as a positive sample. In an existing RPN, by contrast, such an anchor rectangle would not be selected as a positive sample even though it does lie within a text region.
The trained convolutional neural network, i.e., the above text detection device 2000, is obtained by adjusting the system parameters during the iterative training process so as to reduce the difference, expressed by the training objective function, between the real text region and the predicted text region.
After this training, in the subsequent detection process, a plurality of laterally spliced anchor rectangles may be used for feature extraction followed by classification and regression, with each anchor rectangle (or the proposal region it intercepts) corresponding to a lateral portion of the region to be detected. Because features in the vertical direction were taken into account while training the convolutional neural network, during detection the classifier in the convolutional neural network likewise considers the vertical features of a proposal region when predicting whether it corresponds to a text region. After the proposal regions determined by classification to include text are laterally spliced according to the positions in the image determined by regression, the text region detection result is obtained. This technical solution avoids the problem that, when the anchor rectangle is narrower than the real region, parts of the real region that actually correspond to the text region cannot be correctly recognized.
The training method for the convolutional neural network provided by the embodiments of the present application may be performed by any suitable device with data processing capability, including but not limited to terminal devices and servers. Alternatively, the training method for the convolutional neural network provided by the embodiments of the present application may be executed by a processor, for example by the processor calling corresponding instructions stored in a memory. This will not be repeated below.
FIG. 5 shows an architectural diagram of a text detection training device 5000 according to an exemplary embodiment. The modules of the text detection training device 5000 perform the respective steps of the text detection training method 4000 described above. In an optional example, the text detection training device 5000 is implemented in the form of an RPN. As shown in FIG. 5, the text detection training device 5000 includes an image feature extraction module 5010, a proposal region interception module 5030, a classification module 5040, a regression module 5050, and a training module 5060. The image feature extraction module 5010 uses a convolutional neural network to extract a feature map from a training image that includes a text region; the proposal region interception module 5030 laterally intercepts the feature map of the training image with a plurality of anchor rectangles to obtain a plurality of proposal regions; the classification module 5040 classifies each proposal region through the convolutional neural network to determine whether it corresponds to a region that includes text; the regression module 5050 regresses each proposal region through the convolutional neural network to determine its position in the training image; and the training module 5060 iteratively trains the convolutional neural network according to the difference between the known real text region corresponding to the training image and the predicted text region obtained by the classification and regression, until the training result satisfies a predetermined convergence condition.
In an optional example, in combination with the above, when detecting text in a training image, the training image including the text is first input into the image feature extraction module 5010, which uses a convolutional neural network to extract a feature map from the training image; the feature map obtained by convolution contains the feature information of the training image. The feature map extracted by the image feature extraction module 5010 is then input into the proposal region interception module 5030, which laterally intercepts the feature map with a plurality of anchor rectangles to obtain a plurality of proposal regions. The obtained proposal regions are input into the classification module 5040 and the regression module 5050 for classification and regression: classification determines whether each proposal region corresponds to a region that includes text, and regression determines the position of each proposal region in the training image. The training module 5060 iteratively trains the convolutional neural network according to the difference between the known real text region corresponding to the training image and the predicted text region obtained by the classification and regression, until the training result satisfies a predetermined convergence condition. The predetermined convergence condition may be, for example, that the error value of the most recent iteration falls within an allowable range, that the error value is less than a predetermined value, that the error value is minimal, or that the number of iterations reaches a predetermined number.
According to an embodiment of the present application, in each iteration of training the convolutional neural network, the training module 5060 determines the difference between the real text region and the predicted text region according to the intersection-over-union of the predicted text region and the corresponding real text region in the vertical direction. In each iteration of training the convolutional neural network, the regression module 5050 determines the difference between the real text region and the predicted text region according to a smooth L1 loss function; one form the difference may take is an error value. When the intersection-over-union of a predicted text region and the corresponding real text region in the vertical direction is greater than a preset threshold, the training module 5060 determines the proposal region corresponding to that predicted text region to be a positive sample; otherwise, the training module 5060 determines it to be a negative sample.
In addition, each feature of the text detection training method 4000 described above with reference to FIG. 4 is applicable to the text detection training device 5000 shown in FIG. 5. In different embodiments, any number and combination of the features of the text detection training method 4000 described with reference to FIG. 4 may be incorporated into the text detection training device 5000 shown in FIG. 5.
According to an exemplary embodiment, in the training and text detection described above, the width of the anchor rectangles may be fixed, which reduces the sizes and number of anchor rectangles required for matching and thus reduces the amount of computation.
According to an exemplary embodiment, in the training and text detection described above, the width of the anchor rectangles may be equal to the stride of the convolutional neural network, so that the laterally spliced detection results correspond exactly to the entire width of the detection region. Optionally, the width of the anchor rectangles may be slightly greater than the stride of the convolutional neural network; for example, the width of an anchor rectangle may be the stride of the convolutional neural network plus 1. The laterally spliced detection results then correspond to the entire width of the detection region with a small amount of overlap, which prevents gaps from appearing between adjacent anchor rectangles due to factors such as errors in actual use, and thus prevents some intermediate widths of the detection region from being missed.
The text detection method and device and the text detection training method and device described with reference to FIGS. 1 to 5 may be implemented by a computer system. The computer system may include a memory storing executable instructions and a processor. The processor communicates with the memory to execute the executable instructions so as to implement the text detection method and device and the text detection training method and device described with reference to FIGS. 1 to 5. Alternatively or additionally, the text detection method and device and the text detection training method and device described with reference to FIGS. 1 to 5 may be implemented by a non-transitory computer storage medium. The medium stores computer-readable instructions which, when executed, cause a processor to carry out the text detection method and text detection training method described with reference to FIGS. 1 to 5.
Referring now to FIG. 6, FIG. 6 shows a schematic structural diagram of a computer system 6000 suitable for implementing the embodiments of the present application.
As shown in FIG. 6, the computer system 6000 may include a processing unit (such as a central processing unit (CPU) 6001 or a graphics processing unit (GPU)) that can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 6002 or a program loaded from a storage portion 6008 into a random access memory (RAM) 6003. The RAM 6003 may also store various programs and data required for the operation of the system 6000. The CPU 6001, the ROM 6002, and the RAM 6003 are connected to one another through a bus 6004. An input/output (I/O) interface 6005 is also connected to the bus 6004.
The following components can be connected to the I/O interface 6005: an input portion 6006 including a keyboard, a mouse, and the like; an output portion 6007 including a cathode-ray tube (CRT), a liquid crystal display (LCD), a speaker, and the like; a storage portion 6008 including a hard disk and the like; and a communication portion 6009 including a network interface card such as a LAN card or a modem. The communication portion 6009 can perform communication processing through a network such as the Internet. A drive 6010 can also be connected to the I/O interface 6005 as needed. A removable medium 6011, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, can be mounted on the drive 6010 so that a computer program read from it can be installed into the storage portion 6008 as needed.
在一个可选示例中，根据本公开的实施例，以上参照图1至图5描述的文字检测方法和装置及文字检测训练方法和装置可实施为计算机软件程序。例如，本公开的实施例可包括计算机程序产品，该产品包括有形地体现在机器可读介质中的计算机程序。该计算机程序包括用于执行参照图1至图5描述的文字检测方法和装置及文字检测训练方法和装置。在这种实施例中，计算机程序可通过通信部分6009从网络上下载并进行安装，和/或可从可拆卸介质6011安装。In an optional example, according to an embodiment of the present disclosure, the text detection method and apparatus and the text detection training method and apparatus described above with reference to FIGS. 1 through 5 may be implemented as computer software programs. For example, an embodiment of the present disclosure may include a computer program product comprising a computer program tangibly embodied in a machine-readable medium, the computer program containing program code for performing the text detection method or the text detection training method described with reference to FIGS. 1 through 5. In such an embodiment, the computer program may be downloaded and installed from a network via the communication portion 6009, and/or installed from the removable medium 6011.
本申请的文字检测技术可以用于公司卡证识别产品中，例如，在员工佩带公司卡证经过公司门禁系统的摄像头时，公司门禁系统的数据处理设备(如与摄像头通过网络连接的计算机或者服务器等)可以通过摄像头获得佩带公司卡证的员工的图像，公司门禁系统的数据处理设备通过利用本申请的文字检测技术，可以获得图像中的公司卡证上的文字区域，通过对该文字区域进行文字识别，可以获得公司卡证上标注的员工姓名以及部门等信息。The text detection technology of the present application may be used in company badge recognition products. For example, when an employee wearing a company badge passes a camera of the company's access control system, a data processing device of the access control system (such as a computer or server connected to the camera through a network) may obtain an image of the employee wearing the badge through the camera. By applying the text detection technology of the present application, the data processing device may locate the text region on the badge in the image, and by performing text recognition on that region, obtain information such as the employee's name and department printed on the badge.
本申请的文字检测技术还可以用于涉及文本框定位的多种应用中,例如,针对医疗票据、快递单以及发票等格式文本进行文本框定位,以便于对定位的文本框进行文字识别。文本框定位的结果或者文字识别的结果可以本地存储或者显示,也可以传输给服务器或者对等网络中的对等者等。本申请不限制定位后的文本框的具体应用场景。The text detection technology of the present application can also be used in various applications involving text box positioning, for example, text box positioning for formatted texts such as medical bills, express orders, and invoices, so as to facilitate text recognition of the positioned text box. The result of the text box positioning or the result of the text recognition may be stored or displayed locally, or may be transmitted to a server or a peer in a peer-to-peer network. This application does not limit the specific application scenario of the text box after positioning.
附图中的流程图和框图，图示了按照本申请各种实施例的系统、方法和计算机程序产品的可能实现的体系架构、功能和操作。在这点上，流程图或框图中的每个方框可以代表一个模块、程序段、或代码的一部分，所述模块、程序段、或代码的一部分包含一个或多个用于实现规定的逻辑功能的可执行指令。也应当注意，在有些作为替换的实现中，方框中所标注的功能也可以以不同于附图中所标注的顺序发生。例如，两个接连地表示的方框实际上可以基本并行地执行，它们有时也可以按相反的顺序执行，这依所涉及的功能而定。也要注意的是，框图和/或流程图中的每个方框、以及框图和/或流程图中的方框的组合，可以用执行规定的功能或操作的专用的基于硬件的系统来实现，或者可以用专用硬件与计算机指令的组合来实现。The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products in accordance with various embodiments of the present application. In this regard, each block of the flowcharts or block diagrams may represent a module, a program segment, or a portion of code that contains one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that shown in the figures. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending upon the functionality involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
本申请的实施例所涉及的单元或模块可通过软件或硬件实施。所描述的单元或模块也可设置在处理器中。这些单元或模块的名称不应被视为限制这些单元或模块。 The units or modules involved in the embodiments of the present application may be implemented by software or hardware. The described unit or module can also be provided in the processor. The names of these units or modules should not be construed as limiting these units or modules.
以上描述仅为本申请的示例性实施例以及对所运用技术原理的说明。本领域技术人员应当理解，本申请中所涉及的范围，并不限于上述技术特征的特定组合而成的技术方案，同时也应涵盖在不背离所述申请构思的情况下，由上述技术特征或其等同特征进行任意组合而形成的其它技术方案。例如上述特征与本申请中公开的具有类似功能的技术特征进行互相替换而形成的技术方案。The above description is merely illustrative of exemplary embodiments of the present application and of the technical principles applied. Those skilled in the art should understand that the scope of the present application is not limited to technical solutions formed by the specific combination of the above technical features, but also covers other technical solutions formed by any combination of the above technical features or their equivalents without departing from the concept of the application — for example, technical solutions formed by replacing the above features with technical features of similar function disclosed in the present application.

Claims (36)

  1. 一种文字检测方法,包括:A text detection method comprising:
    使用卷积神经网络从包括文字区域的图像提取特征图;Extracting a feature map from an image including a text region using a convolutional neural network;
    采用多个锚矩形对所述特征图分别进行横向截取,得到多个建议区域;The feature maps are separately intercepted by using a plurality of anchor rectangles to obtain a plurality of suggested regions;
    将每个建议区域通过所述卷积神经网络进行分类和回归，其中，通过所述分类来确定每个建议区域是否对应于包括文字的区域，通过所述回归来确定每个建议区域对应所述图像中的位置；以及classifying and regressing each suggested region by the convolutional neural network, wherein the classification determines whether each suggested region corresponds to a region including text, and the regression determines the position in the image to which each suggested region corresponds; and
    将通过分类确定的对应于包括文字的区域的各建议区域根据通过回归确定的所述各建议区域分别对应所述图像中的位置进行区域横向拼接，以得到文字区域检测结果。horizontally stitching the suggested regions determined by the classification to correspond to regions including text, according to the positions in the image determined by the regression for the respective suggested regions, to obtain a text region detection result.
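For illustration, the classify-and-regress step of the claimed method can be sketched in Python; `classify` and `regress` below are plain callables standing in for the trained convolutional-network heads (assumptions for illustration, not the patent's actual network):

```python
def classify_and_regress(proposals, classify, regress):
    """Per-proposal step of the claimed method: keep each anchor slice
    the classifier marks as text, replaced by its regressed position
    in the image. `classify` and `regress` stand in for CNN heads."""
    text_boxes = []
    for box in proposals:
        if classify(box):                        # region includes text?
            text_boxes.append(tuple(regress(box)))  # refined position
    return text_boxes
```

The surviving boxes are then horizontally stitched into full text regions, which is the subject of claim 2.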
  2. 根据权利要求1所述的文字检测方法,所述区域横向拼接包括:The character detecting method according to claim 1, wherein the horizontal splicing of the area comprises:
    根据通过回归确定的所述各建议区域分别对应所述图像中的位置，将位置相邻的和/或位置有交集的建议区域进行连接，由此得到所述文字区域检测结果；或者connecting, according to the positions in the image determined by the regression for the respective suggested regions, suggested regions whose positions are adjacent and/or intersect, thereby obtaining the text region detection result; or
    根据通过回归确定的所述各建议区域分别对应所述图像中的位置，将位置相邻的和/或位置有交集的建议区域对应的锚矩形进行连接，由此得到所述文字区域检测结果。connecting, according to the positions in the image determined by the regression for the respective suggested regions, the anchor rectangles corresponding to suggested regions whose positions are adjacent and/or intersect, thereby obtaining the text region detection result.
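For illustration, the stitching of claim 2 (connecting suggested regions whose positions are adjacent or intersect) can be sketched as follows, assuming each regressed region is an axis-aligned box (x1, y1, x2, y2); the `gap` tolerance is an illustrative parameter, not one fixed by the claim:

```python
def stitch_regions(boxes, gap=0):
    """Horizontally stitch proposal boxes (x1, y1, x2, y2): boxes whose
    horizontal extents are adjacent (within `gap`) or intersecting are
    merged into a single text-region box."""
    if not boxes:
        return []
    boxes = sorted(boxes, key=lambda b: b[0])   # scan left to right
    merged = [list(boxes[0])]
    for x1, y1, x2, y2 in boxes[1:]:
        last = merged[-1]
        if x1 <= last[2] + gap:                 # adjacent or overlapping
            last[1] = min(last[1], y1)          # grow vertical extent
            last[2] = max(last[2], x2)          # extend to the right
            last[3] = max(last[3], y2)
        else:
            merged.append([x1, y1, x2, y2])
    return [tuple(b) for b in merged]
```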
  3. 根据权利要求1或2所述的文字检测方法,还包括预先对所述卷积神经网络进行训练,其中,对所述卷积神经网络的训练包括:The character detecting method according to claim 1 or 2, further comprising training the convolutional neural network in advance, wherein training on the convolutional neural network comprises:
    使用所述卷积神经网络从包括文字区域的训练图像提取特征图;Extracting a feature map from the training image including the text region using the convolutional neural network;
    采用多个锚矩形对所述训练图像的特征图进行横向截取,得到多个建议区域;The feature map of the training image is laterally intercepted by using multiple anchor rectangles to obtain a plurality of suggested regions;
    将每个锚矩形截取的建议区域通过所述卷积神经网络进行分类和回归，其中所述分类确定每个建议区域是否对应于包括文字的区域，所述回归确定每个建议区域对应所述训练图像中的位置；以及classifying and regressing the suggested region intercepted by each anchor rectangle by the convolutional neural network, wherein the classification determines whether each suggested region corresponds to a region including text, and the regression determines the position in the training image to which each suggested region corresponds; and
    根据已知的与所述训练图像对应的真实文字区域以及所述分类和回归得到的预测文字区域的差异，迭代训练所述卷积神经网络直至训练结果满足预定收敛条件。iteratively training the convolutional neural network according to the difference between the known real text region corresponding to the training image and the predicted text region obtained by the classification and regression, until the training result satisfies a predetermined convergence condition.
  4. 根据权利要求3所述的文字检测方法，其中，在所述卷积神经网络的每次迭代训练中，根据所述预测文字区域与所述对应的真实文字区域在竖直方向上的交并比，确定所述真实文字区域和所述预测文字区域之间的差异。The text detection method according to claim 3, wherein, in each iterative training of the convolutional neural network, the difference between the real text region and the predicted text region is determined according to the intersection-over-union of the predicted text region and the corresponding real text region in the vertical direction.
  5. 根据权利要求3或4所述的文字检测方法,其中,在所述卷积神经网络的每次迭代训练中,根据smooth L1损失函数确定所述真实文字区域和所述预测文字区域之间的差异。The character detecting method according to claim 3 or 4, wherein in each iterative training of the convolutional neural network, the difference between the real text region and the predicted text region is determined according to a smooth L1 loss function .
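For illustration, the smooth L1 loss named in the claim above is quadratic for small residuals and linear for large ones; one common parameterization (the claims do not fix the constants) is:

```python
def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 loss summed over box coordinates:
    0.5 * d**2 / beta when |d| < beta, |d| - 0.5 * beta otherwise
    (beta = 1.0 is one common choice, assumed here)."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total
```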
  6. 根据权利要求3-5任一所述的文字检测方法，其中，当预测文字区域与对应的真实文字区域在竖直方向上的交并比大于预先设定的阈值时，该预测文字区域对应的建议区域被确定为正样本；否则，该预测文字区域对应的建议区域被确定为负样本。The text detection method according to any one of claims 3-5, wherein, when the intersection-over-union of a predicted text region and the corresponding real text region in the vertical direction is greater than a preset threshold, the suggested region corresponding to the predicted text region is determined to be a positive sample; otherwise, the suggested region corresponding to the predicted text region is determined to be a negative sample.
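For illustration, the vertical intersection-over-union test of claims 4 and 6 can be sketched with spans given as (top, bottom) in image coordinates; the 0.7 threshold below is an illustrative value, not one fixed by the claims:

```python
def vertical_iou(span_a, span_b):
    """Intersection-over-union along the vertical axis only, for
    spans given as (top, bottom)."""
    inter = max(0.0, min(span_a[1], span_b[1]) - max(span_a[0], span_b[0]))
    union = (span_a[1] - span_a[0]) + (span_b[1] - span_b[0]) - inter
    return inter / union if union > 0 else 0.0

def label_proposal(pred_span, true_span, threshold=0.7):
    """Positive sample (1) when the vertical IoU exceeds the preset
    threshold, negative sample (0) otherwise."""
    return 1 if vertical_iou(pred_span, true_span) > threshold else 0
```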
  7. 根据权利要求1-6任一所述的文字检测方法,其中,所述锚矩形的宽度是固定的。The character detecting method according to any one of claims 1 to 6, wherein the width of the anchor rectangle is fixed.
  8. 根据权利要求1-7任一所述的文字检测方法,其中,所述锚矩形的宽度根据所述卷积神经网络的步长确定。The character detecting method according to any one of claims 1 to 7, wherein the width of the anchor rectangle is determined according to a step size of the convolutional neural network.
  9. 根据权利要求8所述的文字检测方法,其中,所述锚矩形的宽度等于或大于所述卷积神经网络的步长。The character detecting method according to claim 8, wherein a width of the anchor rectangle is equal to or larger than a step size of the convolutional neural network.
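Claims 7-9 above can be illustrated by the anchor generator below: the anchor width is fixed and tied to the network stride (here width equals the stride), while the heights vary per anchor. The particular stride and height set are assumptions borrowed from similar fixed-width detectors, not values specified by the claims:

```python
def make_anchors(feature_map_width, stride=16,
                 heights=(11, 16, 23, 33, 48), y_center=50.0):
    """One column of fixed-width anchors per feature-map position:
    width equals the network stride; only the heights vary."""
    anchors = []
    for col in range(feature_map_width):
        x1 = col * stride                 # each column maps back by the stride
        for h in heights:
            anchors.append((x1, y_center - h / 2, x1 + stride, y_center + h / 2))
    return anchors
```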
  10. 一种文字检测训练方法,包括:A text detection training method includes:
    使用卷积神经网络从包括文字区域的训练图像提取特征图;Extracting a feature map from a training image including a text region using a convolutional neural network;
    采用多个锚矩形对所述训练图像的特征图进行横向截取,得到多个建议区域;The feature map of the training image is laterally intercepted by using multiple anchor rectangles to obtain a plurality of suggested regions;
    将每个锚矩形截取的建议区域通过所述卷积神经网络进行分类和回归，其中所述分类确定每个建议区域是否对应于包括文字的区域，所述回归确定每个建议区域对应所述训练图像中的位置；以及classifying and regressing the suggested region intercepted by each anchor rectangle by the convolutional neural network, wherein the classification determines whether each suggested region corresponds to a region including text, and the regression determines the position in the training image to which each suggested region corresponds; and
    根据已知的与所述训练图像对应的真实文字区域以及所述分类和回归得到的预测文字区域的差异，迭代训练所述卷积神经网络直至训练结果满足预定收敛条件。iteratively training the convolutional neural network according to the difference between the known real text region corresponding to the training image and the predicted text region obtained by the classification and regression, until the training result satisfies a predetermined convergence condition.
  11. 根据权利要求10所述的文字检测训练方法，其中，在所述卷积神经网络的每次迭代训练中，根据所述预测文字区域与所述对应的真实文字区域在竖直方向上的交并比，确定所述真实文字区域和所述预测文字区域之间的差异。The text detection training method according to claim 10, wherein, in each iterative training of the convolutional neural network, the difference between the real text region and the predicted text region is determined according to the intersection-over-union of the predicted text region and the corresponding real text region in the vertical direction.
  12. 根据权利要求10或11所述的文字检测训练方法，其中，在所述卷积神经网络的每次迭代训练中，根据smooth L1损失函数确定所述真实文字区域和所述预测文字区域之间的差异。The text detection training method according to claim 10 or 11, wherein, in each iterative training of the convolutional neural network, the difference between the real text region and the predicted text region is determined according to a smooth L1 loss function.
  13. 根据权利要求10-12任一所述的文字检测训练方法，其中，当预测文字区域与对应的真实文字区域在竖直方向上的交并比大于预先设定的阈值时，该预测文字区域对应的建议区域被确定为正样本；否则，该预测文字区域对应的建议区域被确定为负样本。The text detection training method according to any one of claims 10-12, wherein, when the intersection-over-union of a predicted text region and the corresponding real text region in the vertical direction is greater than a preset threshold, the suggested region corresponding to the predicted text region is determined to be a positive sample; otherwise, the suggested region corresponding to the predicted text region is determined to be a negative sample.
  14. 根据权利要求10-13任一所述的文字检测训练方法,其中,所述锚矩形的宽度是固定的。A character detection training method according to any one of claims 10-13, wherein the width of the anchor rectangle is fixed.
  15. 根据权利要求10-14任一所述的文字检测训练方法,其中,所述锚矩形的宽度根据所述卷积神经网络的步长确定。The character detection training method according to any one of claims 10-14, wherein the width of the anchor rectangle is determined according to a step size of the convolutional neural network.
  16. 根据权利要求15所述的文字检测训练方法,其中,所述锚矩形的宽度等于或大于所述卷积神经网络的步长。The character detection training method according to claim 15, wherein a width of the anchor rectangle is equal to or larger than a step size of the convolutional neural network.
  17. 一种文字检测装置,包括:A text detecting device comprising:
    图像特征提取模块,使用卷积神经网络从包括文字区域的图像提取特征图; An image feature extraction module for extracting a feature map from an image including a text region using a convolutional neural network;
    建议区域截取模块，采用多个锚矩形对所述特征图分别进行横向截取，得到多个建议区域；a suggested region interception module that horizontally intercepts the feature map using a plurality of anchor rectangles to obtain a plurality of suggested regions;
    分类模块,将每个建议区域通过所述卷积神经网络进行分类,以确定每个建议区域是否对应于包括文字的区域;a classification module that classifies each suggested region by the convolutional neural network to determine whether each suggested region corresponds to an area including text;
    回归模块,将每个建议区域通过所述卷积神经网络进行回归,以确定每个建议区域对应所述图像中的位置;以及a regression module that regresses each suggested region through the convolutional neural network to determine that each suggested region corresponds to a location in the image;
    检测结果拼接模块，将所述分类模块确定的对应于包括文字的区域的各建议区域根据所述回归模块确定的所述各建议区域分别对应所述图像中的位置进行区域横向拼接，以得到文字区域检测结果。a detection result stitching module that horizontally stitches the suggested regions determined by the classification module to correspond to regions including text, according to the positions in the image determined by the regression module for the respective suggested regions, to obtain a text region detection result.
  18. 根据权利要求17所述的文字检测装置,所述区域横向拼接包括:The character detecting device according to claim 17, wherein the horizontal splicing of the region comprises:
    根据通过回归确定的所述各建议区域分别对应所述图像中的位置，将位置相邻的和/或位置有交集的建议区域进行连接，由此得到所述文字区域检测结果；或者connecting, according to the positions in the image determined by the regression for the respective suggested regions, suggested regions whose positions are adjacent and/or intersect, thereby obtaining the text region detection result; or
    根据通过回归确定的所述各建议区域分别对应所述图像中的位置，将位置相邻的和/或位置有交集的建议区域对应的锚矩形进行连接，由此得到所述文字区域检测结果。connecting, according to the positions in the image determined by the regression for the respective suggested regions, the anchor rectangles corresponding to suggested regions whose positions are adjacent and/or intersect, thereby obtaining the text region detection result.
  19. 根据权利要求17或18所述的文字检测装置,还包括预先对所述卷积神经网络进行训练的训练模块,其中,在对所述卷积神经网络的预先训练过程中:A character detecting apparatus according to claim 17 or 18, further comprising a training module for training said convolutional neural network in advance, wherein in a pre-training process for said convolutional neural network:
    所述图像特征提取模块从包括文字区域的训练图像提取特征图;The image feature extraction module extracts a feature map from a training image including a text region;
    所述建议区域截取模块采用多个锚矩形对所述训练图像的特征图进行横向截取,得到多个建议区域;The suggested area intercepting module uses a plurality of anchor rectangles to perform horizontal interception on the feature image of the training image to obtain a plurality of suggested areas;
    所述分类模块将每个建议区域通过所述卷积神经网络进行分类，以确定每个建议区域是否对应于包括文字的区域，所述回归模块将每个建议区域通过所述卷积神经网络进行回归，以确定每个建议区域对应所述训练图像中的位置；以及the classification module classifies each suggested region by the convolutional neural network to determine whether each suggested region corresponds to a region including text, and the regression module regresses each suggested region by the convolutional neural network to determine the position in the training image to which each suggested region corresponds; and
    所述训练模块根据已知的与所述训练图像对应的真实文字区域以及所述分类和回归得到的预测文字区域的差异，迭代训练所述卷积神经网络直至训练结果满足预定收敛条件。the training module iteratively trains the convolutional neural network according to the difference between the known real text region corresponding to the training image and the predicted text region obtained by the classification and regression, until the training result satisfies a predetermined convergence condition.
  20. 根据权利要求19所述的文字检测装置，其中，在所述卷积神经网络的每次迭代训练中，训练模块根据所述预测文字区域与所述对应的真实文字区域在竖直方向上的交并比，确定所述真实文字区域和所述预测文字区域之间的差异。The text detection apparatus according to claim 19, wherein, in each iterative training of the convolutional neural network, the training module determines the difference between the real text region and the predicted text region according to the intersection-over-union of the predicted text region and the corresponding real text region in the vertical direction.
  21. 根据权利要求19或20所述的文字检测装置，其中，在所述卷积神经网络的每次迭代训练中，回归模块根据smooth L1损失函数确定所述真实文字区域和所述预测文字区域之间的差异。The text detection apparatus according to claim 19 or 20, wherein, in each iterative training of the convolutional neural network, the regression module determines the difference between the real text region and the predicted text region according to a smooth L1 loss function.
  22. 根据权利要求19-21任一所述的文字检测装置，其中，当预测文字区域与对应的真实文字区域在竖直方向上的交并比大于预先设定的阈值时，该预测文字区域对应的建议区域被训练模块确定为正样本；否则，该预测文字区域对应的建议区域被训练模块确定为负样本。The text detection apparatus according to any one of claims 19-21, wherein, when the intersection-over-union of a predicted text region and the corresponding real text region in the vertical direction is greater than a preset threshold, the suggested region corresponding to the predicted text region is determined by the training module to be a positive sample; otherwise, the suggested region corresponding to the predicted text region is determined by the training module to be a negative sample.
  23. 根据权利要求17-22任一所述的文字检测装置,其中,所述锚矩形的宽度是固定的。A character detecting device according to any one of claims 17 to 22, wherein the width of the anchor rectangle is fixed.
  24. 根据权利要求17-23任一所述的文字检测装置,其中,所述锚矩形的宽度根据所述卷积神经网络的步长确定。A character detecting apparatus according to any one of claims 17 to 23, wherein a width of said anchor rectangle is determined according to a step size of said convolutional neural network.
  25. 根据权利要求24所述的文字检测装置,其中,所述锚矩形的宽度等于或大于所述卷积神经网络的步长。The character detecting device according to claim 24, wherein a width of said anchor rectangle is equal to or larger than a step size of said convolutional neural network.
  26. 一种文字检测训练装置,包括:A text detection training device includes:
    图像特征提取模块,使用卷积神经网络从包括文字区域的训练图像提取特征图;An image feature extraction module for extracting a feature map from a training image including a text region using a convolutional neural network;
    建议区域截取模块，采用多个锚矩形对所述训练图像的特征图进行横向截取，得到多个建议区域；a suggested region interception module that horizontally intercepts the feature map of the training image using a plurality of anchor rectangles to obtain a plurality of suggested regions;
    分类模块,将每个建议区域通过所述卷积神经网络进行分类,以确定每个建议区域是否对应于包括文字的区域;a classification module that classifies each suggested region by the convolutional neural network to determine whether each suggested region corresponds to an area including text;
    回归模块,将每个建议区域通过所述卷积神经网络进行回归,以确定每个建议区域对应所述训练图像中的位置;以及a regression module that regresses each suggested region through the convolutional neural network to determine that each suggested region corresponds to a location in the training image;
    训练模块，根据已知的与所述训练图像对应的真实文字区域以及所述分类和回归得到的预测文字区域的差异，迭代训练所述卷积神经网络直至训练结果满足预定收敛条件。a training module that iteratively trains the convolutional neural network according to the difference between the known real text region corresponding to the training image and the predicted text region obtained by the classification and regression, until the training result satisfies a predetermined convergence condition.
  27. 根据权利要求26所述的文字检测训练装置，其中，在所述卷积神经网络的每次迭代训练中，训练模块根据所述预测文字区域与所述对应的真实文字区域在竖直方向上的交并比，确定所述真实文字区域和所述预测文字区域之间的差异。The text detection training apparatus according to claim 26, wherein, in each iterative training of the convolutional neural network, the training module determines the difference between the real text region and the predicted text region according to the intersection-over-union of the predicted text region and the corresponding real text region in the vertical direction.
  28. 根据权利要求26或27所述的文字检测训练装置，其中，在所述卷积神经网络的每次迭代训练中，回归模块根据smooth L1损失函数确定所述真实文字区域和所述预测文字区域之间的差异。The text detection training apparatus according to claim 26 or 27, wherein, in each iterative training of the convolutional neural network, the regression module determines the difference between the real text region and the predicted text region according to a smooth L1 loss function.
  29. 根据权利要求26-28任一所述的文字检测训练装置，其中，当预测文字区域与对应的真实文字区域在竖直方向上的交并比大于预先设定的阈值时，该预测文字区域对应的建议区域被训练模块确定为正样本；否则，该预测文字区域对应的建议区域被训练模块确定为负样本。The text detection training apparatus according to any one of claims 26-28, wherein, when the intersection-over-union of a predicted text region and the corresponding real text region in the vertical direction is greater than a preset threshold, the suggested region corresponding to the predicted text region is determined by the training module to be a positive sample; otherwise, the suggested region corresponding to the predicted text region is determined by the training module to be a negative sample.
  30. 根据权利要求26-29任一所述的文字检测训练装置,其中,所述锚矩形的宽度是固定的。A character detecting training device according to any one of claims 26-29, wherein the width of the anchor rectangle is fixed.
  31. 根据权利要求26-30任一所述的文字检测训练装置，其中，所述锚矩形的宽度根据所述卷积神经网络的步长确定。The text detection training apparatus according to any one of claims 26-30, wherein the width of the anchor rectangle is determined according to the step size of the convolutional neural network.
  32. 根据权利要求31所述的文字检测训练装置,其中,所述锚矩形的宽度等于或大于所述卷积神经网络的步长。The character detecting training apparatus according to claim 31, wherein a width of said anchor rectangle is equal to or larger than a step size of said convolutional neural network.
  33. 一种文字检测装置,包括:A text detecting device comprising:
    存储器,存储有可执行指令;以及a memory that stores executable instructions;
    一个或多个处理器,与所述存储器通信,以执行所述可执行指令,从而执行如权利要求1-9中的任一权利要求所述的文字检测方法中的操作。One or more processors in communication with the memory to execute the executable instructions to perform the operations in the text detection method of any of claims 1-9.
  34. 一种文字检测训练装置,包括:A text detection training device includes:
    存储器,存储有可执行指令;以及a memory that stores executable instructions;
    一个或多个处理器,与所述存储器通信,以执行所述可执行指令,从而执行如权利要求10-16中的任一权利要求所述的文字检测训练方法中的操作。One or more processors in communication with the memory to execute the executable instructions to perform the operations in the text detection training method of any of claims 10-16.
  35. 一种计算机程序，包括计算机可读代码，当所述计算机可读代码在设备中运行时，所述设备中的处理器执行用于实现权利要求1-9中的任一权利要求所述的文字检测方法或者用于实现权利要求10-16中的任一权利要求所述的文字检测训练方法中的步骤的可执行指令。A computer program comprising computer-readable code which, when run in a device, causes a processor in the device to execute executable instructions for implementing the steps of the text detection method according to any one of claims 1-9 or of the text detection training method according to any one of claims 10-16.
  36. 一种计算机可读介质，存储有计算机可读代码，当所述计算机可读代码在设备中运行时，所述设备中的处理器执行用于实现权利要求1-9中的任一权利要求所述的文字检测方法或者用于实现权利要求10-16中的任一权利要求所述的文字检测训练方法中的步骤的可执行指令。A computer-readable medium storing computer-readable code which, when run in a device, causes a processor in the device to execute executable instructions for implementing the steps of the text detection method according to any one of claims 1-9 or of the text detection training method according to any one of claims 10-16.
PCT/CN2017/102679 2016-09-22 2017-09-21 Character detection method and device, and character detection training method and device WO2018054326A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610842572.3 2016-09-22
CN201610842572.3A CN106446899A (en) 2016-09-22 2016-09-22 Text detection method and device and text detection training method and device

Publications (1)

Publication Number Publication Date
WO2018054326A1 true WO2018054326A1 (en) 2018-03-29

Family

ID=58166338

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/102679 WO2018054326A1 (en) 2016-09-22 2017-09-21 Character detection method and device, and character detection training method and device

Country Status (2)

Country Link
CN (1) CN106446899A (en)
WO (1) WO2018054326A1 (en)



Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7499588B2 (en) * 2004-05-20 2009-03-03 Microsoft Corporation Low resolution OCR for camera acquired documents
CN104463209B (en) * 2014-12-08 2017-05-24 福建坤华仪自动化仪器仪表有限公司 Method for recognizing digital code on PCB based on BP neural network
CN105447529B (en) * 2015-12-30 2020-11-03 商汤集团有限公司 Method and system for detecting clothes and identifying attribute value thereof

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105868758A (en) * 2015-01-21 2016-08-17 阿里巴巴集团控股有限公司 Method and device for detecting text area in image and electronic device
CN105608454A (en) * 2015-12-21 2016-05-25 上海交通大学 Text structure part detection neural network based text detection method and system
CN105809164A (en) * 2016-03-11 2016-07-27 北京旷视科技有限公司 Character identification method and device
CN106446899A (en) * 2016-09-22 2017-02-22 北京市商汤科技开发有限公司 Text detection method and device and text detection training method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SHAOQING REN ET AL.: "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks", arXiv:1506.01497v3, vol. 39, no. 6, 6 January 2016 (2016-01-06), pages 1137-1149, XP055583592, Retrieved from the Internet <URL:https://arxiv.org/pdf/1506.01497v3> *
ZHUOYAO ZHONG ET AL.: "DeepText: A Unified Framework for Text Proposal Generation and Text Detection in Natural Images", arXiv:1605.07314v1 [cs.CV], 24 May 2016 (2016-05-24), pages 1-12, XP080703156, Retrieved from the Internet <URL:https://arxiv.org/pdf/1605.07314v1> *

Cited By (43)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110619325A (en) * 2018-06-20 2019-12-27 北京搜狗科技发展有限公司 Text recognition method and device
CN110619325B (en) * 2018-06-20 2024-03-08 北京搜狗科技发展有限公司 Text recognition method and device
CN111615702B (en) * 2018-12-07 2023-10-17 华为云计算技术有限公司 Method, device and equipment for extracting structured data from image
CN111615702A (en) * 2018-12-07 2020-09-01 华为技术有限公司 Method, device and equipment for extracting structured data from image
CN111325194A (en) * 2018-12-13 2020-06-23 杭州海康威视数字技术股份有限公司 Character recognition method, device and equipment and storage medium
CN111325194B (en) * 2018-12-13 2023-12-29 杭州海康威视数字技术股份有限公司 Character recognition method, device and equipment and storage medium
CN109840524B (en) * 2019-01-04 2023-07-11 平安科技(深圳)有限公司 Text type recognition method, device, equipment and storage medium
CN109840524A (en) * 2019-01-04 2019-06-04 平安科技(深圳)有限公司 Text type identification method, device, equipment and storage medium
US20220189190A1 (en) * 2019-03-28 2022-06-16 Nielsen Consumer Llc Methods and apparatus to detect a text region of interest in a digital image using machine-based analysis
CN110210478A (en) * 2019-06-04 2019-09-06 天津大学 Commodity outer packaging character recognition method
CN112541489A (en) * 2019-09-23 2021-03-23 顺丰科技有限公司 Image detection method and device, mobile terminal and storage medium
CN110991440A (en) * 2019-12-11 2020-04-10 易诚高科(大连)科技有限公司 Pixel-driven mobile phone operation interface text detection method
CN110991440B (en) * 2019-12-11 2023-10-13 易诚高科(大连)科技有限公司 Pixel-driven mobile phone operation interface text detection method
CN111046866B (en) * 2019-12-13 2023-04-18 哈尔滨工程大学 Method for detecting RMB crown word number region by combining CTPN and SVM
CN111046866A (en) * 2019-12-13 2020-04-21 哈尔滨工程大学 Method for detecting RMB crown word number region by combining CTPN and SVM
CN111191695B (en) * 2019-12-19 2023-05-23 杭州安恒信息技术股份有限公司 Website picture tampering detection method based on deep learning
CN111191695A (en) * 2019-12-19 2020-05-22 杭州安恒信息技术股份有限公司 Website picture tampering detection method based on deep learning
CN113012029B (en) * 2019-12-20 2023-12-08 北京搜狗科技发展有限公司 Curved surface image correction method and device and electronic equipment
CN113012029A (en) * 2019-12-20 2021-06-22 北京搜狗科技发展有限公司 Curved surface image correction method and device and electronic equipment
CN111340023A (en) * 2020-02-24 2020-06-26 创新奇智(上海)科技有限公司 Text recognition method and device, electronic equipment and storage medium
CN111340023B (en) * 2020-02-24 2022-09-09 创新奇智(上海)科技有限公司 Text recognition method and device, electronic equipment and storage medium
CN111339995B (en) * 2020-03-16 2024-02-20 合肥闪捷信息科技有限公司 Sensitive image recognition method based on neural network
CN111339995A (en) * 2020-03-16 2020-06-26 合肥闪捷信息科技有限公司 Sensitive image identification method based on neural network
US20220245954A1 (en) * 2020-03-25 2022-08-04 Tencent Technology (Shenzhen) Company Limited Image recognition method, apparatus, terminal, and storage medium
US12014556B2 (en) * 2020-03-25 2024-06-18 Tencent Technology (Shenzhen) Company Limited Image recognition method, apparatus, terminal, and storage medium
CN111461304B (en) * 2020-03-31 2023-09-15 北京小米松果电子有限公司 Training method of classification neural network, text classification method, device and equipment
CN111461304A (en) * 2020-03-31 2020-07-28 北京小米松果电子有限公司 Training method for classifying neural network, text classification method, text classification device and equipment
CN111639566A (en) * 2020-05-19 2020-09-08 浙江大华技术股份有限公司 Method and device for extracting form information
CN111738326A (en) * 2020-06-16 2020-10-02 中国工商银行股份有限公司 Sentence granularity marking training sample generation method and device
CN111767867B (en) * 2020-06-30 2022-12-09 创新奇智(北京)科技有限公司 Text detection method, model training method and corresponding devices
CN111767867A (en) * 2020-06-30 2020-10-13 创新奇智(北京)科技有限公司 Text detection method, model training method and corresponding devices
CN111967391A (en) * 2020-08-18 2020-11-20 清华大学 Text recognition method and computer-readable storage medium for medical laboratory test reports
CN112418216A (en) * 2020-11-18 2021-02-26 湖南师范大学 Method for detecting characters in complex natural scene image
CN112418216B (en) * 2020-11-18 2024-01-05 湖南师范大学 Text detection method in complex natural scene image
CN112861045A (en) * 2021-02-20 2021-05-28 北京金山云网络技术有限公司 Method and device for displaying file, storage medium and electronic device
CN112966690B (en) * 2021-03-03 2023-01-13 中国科学院自动化研究所 Scene character detection method based on anchor-free frame and suggestion frame
CN112966690A (en) * 2021-03-03 2021-06-15 中国科学院自动化研究所 Scene character detection method based on anchor-free frame and suggestion frame
CN113158862B (en) * 2021-04-13 2023-08-22 哈尔滨工业大学(深圳) Multitasking-based lightweight real-time face detection method
CN113158862A (en) * 2021-04-13 2021-07-23 哈尔滨工业大学(深圳) Lightweight real-time face detection method based on multiple tasks
CN113313066A (en) * 2021-06-23 2021-08-27 Oppo广东移动通信有限公司 Image recognition method, image recognition device, storage medium and terminal
CN113762109B (en) * 2021-08-23 2023-11-07 北京百度网讯科技有限公司 Training method of character positioning model and character positioning method
CN113762109A (en) * 2021-08-23 2021-12-07 北京百度网讯科技有限公司 Training method of character positioning model and character positioning method
CN113887282A (en) * 2021-08-30 2022-01-04 中国科学院信息工程研究所 Detection system and method for any-shape adjacent text in scene image

Also Published As

Publication number Publication date
CN106446899A (en) 2017-02-22

Similar Documents

Publication Publication Date Title
WO2018054326A1 (en) Character detection method and device, and character detection training method and device
US20240037926A1 (en) Segmenting objects by refining shape priors
US11775574B2 (en) Method and apparatus for visual question answering, computer device and medium
CN108304835B (en) character detection method and device
US11768876B2 (en) Method and device for visual question answering, computer apparatus and medium
US11816710B2 (en) Identifying key-value pairs in documents
US20210103763A1 (en) Method and apparatus for processing laser radar based sparse depth map, device and medium
CN108038880B (en) Method and apparatus for processing image
CN108304775B (en) Remote sensing image recognition method and device, storage medium and electronic equipment
US10657359B2 (en) Generating object embeddings from images
CN110874618B (en) OCR template learning method and device based on small sample, electronic equipment and medium
CN109564575A (en) Classifying images using machine learning models
US10748217B1 (en) Systems and methods for automated body mass index calculation
US11335093B2 (en) Visual tracking by colorization
CN113822428A (en) Neural network training method and device and image segmentation method
CN109345460B (en) Method and apparatus for rectifying image
US11881044B2 (en) Method and apparatus for processing image, device and storage medium
CN113313114B (en) Certificate information acquisition method, device, equipment and storage medium
CN117523586A (en) Check seal verification method and device, electronic equipment and medium
CN113515920B (en) Method, electronic device and computer readable medium for extracting formulas from tables
CN113392215A (en) Training method of production problem classification model, and production problem classification method and device
CN114240780A (en) Seal identification method, device, equipment, medium and product
CN115565152B (en) Traffic sign extraction method integrating vehicle-mounted laser point cloud and panoramic image
CN114328891A (en) Training method of information recommendation model, information recommendation method and device
CN114627481A (en) Form processing method, device, equipment, medium and program product

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17852394

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17852394

Country of ref document: EP

Kind code of ref document: A1