WO2022105120A1 - Text detection method and apparatus from image, computer device and storage medium - Google Patents

Text detection method and apparatus from image, computer device and storage medium Download PDF

Info

Publication number
WO2022105120A1
WO2022105120A1 (PCT/CN2021/090512)
Authority
WO
WIPO (PCT)
Prior art keywords
text
picture
coordinates
target detection
preset
Prior art date
Application number
PCT/CN2021/090512
Other languages
French (fr)
Chinese (zh)
Inventor
左彬靖
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2022105120A1 publication Critical patent/WO2022105120A1/en

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F 16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/583 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F 16/5846 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using extracted text
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/40 Document-oriented image-based pattern recognition

Definitions

  • the present application relates to the technical field of artificial intelligence, and in particular, to a method, device, computer equipment and storage medium for detecting text in pictures.
  • text detection methods based on target detection technology are used in more and more fields, such as Alipay code scanning and ID card recognition.
  • the information in the picture can be extracted.
  • the algorithm based on the FPN (feature pyramid network)
  • the pixel-level algorithm has relatively high accuracy, but the processing time of the model is long, which makes it difficult to meet the requirements of industrialization.
  • image text detection mainly extracts useful information from the image, such as name, address, account information and other fields, so as to facilitate subsequent storage of these parameters and provide data for the subsequent risk control system.
  • a picture may contain a lot of information, and a more complex picture may contain more than one hundred fields.
  • the purpose of the embodiments of the present application is to provide a method, device, computer equipment and storage medium for detecting text in pictures, so as to solve the technical problem of low efficiency in detecting text in pictures.
  • the embodiment of the present application provides a method for detecting text in pictures, which adopts the following technical solutions:
  • the feature vector of the target detection picture is obtained according to the first labeling model in the preset labeling model, and the target text coordinates of the first text box in the target detection picture are calculated according to the feature vector;
  • the center coordinates of the target detection picture are calculated according to the target text coordinates; first text boxes whose center-coordinate difference is less than or equal to a preset error value are merged into a new text box, and a first text box whose center-coordinate difference is greater than the preset error value is determined to be a fixed text box;
  • the embodiments of the present application also provide a picture and text detection device, which adopts the following technical solutions:
  • a detection module configured to calculate the complexity of the target detection picture according to a preset detection model when the target detection picture is received
  • the labeling module is configured to, when the complexity is low complexity, obtain the feature vector of the target detection picture according to the first labeling model in the preset labeling model, and calculate, according to the feature vector, the target text coordinates of the first text box in the target detection picture;
  • a confirmation module, configured to calculate the center coordinates of the target detection picture according to the target text coordinates, fuse first text boxes whose center-coordinate difference is less than or equal to a preset error value into a new text box, and determine a first text box whose center-coordinate difference is greater than the preset error value as a fixed text box;
  • the extraction module is used to extract the text information in the new text box and the fixed text box, and determine that the text information is the detection text of the target detection picture.
  • an embodiment of the present application further provides a computer device, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor; when executing the computer-readable instructions, the processor implements the following steps:
  • the feature vector of the target detection picture is obtained according to the first labeling model in the preset labeling model, and the target text coordinates of the first text box in the target detection picture are calculated according to the feature vector;
  • the center coordinates of the target detection picture are calculated according to the target text coordinates; first text boxes whose center-coordinate difference is less than or equal to a preset error value are merged into a new text box, and a first text box whose center-coordinate difference is greater than the preset error value is determined to be a fixed text box;
  • an embodiment of the present application further provides a computer-readable storage medium, where the computer-readable storage medium stores computer-readable instructions, and when the computer-readable instructions are executed by a processor, the processor performs the following steps:
  • the feature vector of the target detection picture is obtained according to the first labeling model in the preset labeling model, and the target text coordinates of the first text box in the target detection picture are calculated according to the feature vector;
  • the center coordinates of the target detection picture are calculated according to the target text coordinates; first text boxes whose center-coordinate difference is less than or equal to a preset error value are merged into a new text box, and a first text box whose center-coordinate difference is greater than the preset error value is determined to be a fixed text box;
  • the complexity of the target detection picture is calculated according to the preset detection model, and the model for the target detection picture can be selected according to the complexity, enabling targeted detection of the target detection picture.
  • the feature vector of the target detection picture is obtained according to the first labeling model in the preset labeling model, and the target text coordinates of the first text box in the target detection picture are calculated according to the feature vector, so that the text information of the target detection picture can be accurately located; then the center coordinates of the target detection picture are calculated according to the target text coordinates, first text boxes whose center-coordinate difference is less than or equal to the preset error value are merged into a new text box, and first text boxes whose center-coordinate difference is greater than the preset error value are determined as fixed text boxes, thereby avoiding wrong splitting of picture text at low complexity and improving the accuracy of picture text detection; finally, the text information in the new text box and the fixed text box is extracted and determined as the detection text of the target detection picture, which realizes text detection for pictures of different complexity, reduces the cost of manual annotation, saves the response time of model processing, and further improves the efficiency and accuracy of picture text detection.
  • FIG. 1 is an exemplary system architecture diagram to which the present application can be applied;
  • FIG. 2 is a flowchart of an embodiment of a picture text detection method according to the present application.
  • FIG. 3 is a schematic structural diagram of an embodiment of a picture and text detection device according to the present application.
  • FIG. 4 is a schematic structural diagram of an embodiment of a computer device according to the present application.
  • Reference numerals: picture and text detection device 300, detection module 301, labeling module 302, confirmation module 303 and extraction module 304.
  • the system architecture 100 may include terminal devices 101 , 102 , and 103 , a network 104 and a server 105 .
  • the network 104 is a medium used to provide a communication link between the terminal devices 101 , 102 , 103 and the server 105 .
  • the network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
  • the user can use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to receive or send messages and the like.
  • Various communication client applications may be installed on the terminal devices 101 , 102 and 103 , such as web browser applications, shopping applications, search applications, instant messaging tools, email clients, social platform software, and the like.
  • the terminal devices 101, 102, and 103 can be various electronic devices that have a display screen and support web browsing, including but not limited to smart phones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers and desktop computers, etc.
  • the server 105 may be a server that provides various services, such as a background server that provides support for the pages displayed on the terminal devices 101 , 102 , and 103 .
  • the image and text detection methods provided in the embodiments of the present application are generally executed by a server/terminal device, and correspondingly, the image and text detection apparatus is generally set in the server/terminal device.
  • terminal devices, networks and servers in FIG. 1 are merely illustrative. There can be any number of terminal devices, networks and servers according to implementation needs.
  • the image and text detection method includes the following steps:
  • Step S201: when receiving the target detection picture, calculate the complexity of the target detection picture according to a preset detection model.
  • the target detection picture is a detection picture including target text
  • the complexity of the target detection picture is calculated according to a preset detection model
  • the preset detection model is a preset picture complexity detection model, such as a lightweight convolutional neural network discriminant model based on VGG16.
  • input the target detection picture into the preset detection model, compute over the length, width and channel number of the target detection picture through the convolution layer, pooling layer and fully connected layer of the preset detection model, and output the detection result value of the target detection picture; the detection result value is then evaluated with the binary classification (two-class) loss function, that is, the complexity of the current target detection picture is obtained.
  • Step S202: when the complexity is low complexity, obtain the feature vector of the target detection picture according to the first labeling model in the preset labeling model, and calculate, according to the feature vector, the target text coordinates of the first text box in the target detection picture.
  • the complexity can be divided into low complexity and high complexity according to a preset value: complexity less than or equal to the preset value is low complexity, and complexity greater than the preset value is high complexity.
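  • as an illustrative sketch only (the patent gives no code; the sigmoid output and the 0.5 preset value below are assumptions, not part of the disclosure), the complexity split described above might look like:

```python
import math

def complexity_score(logit: float) -> float:
    # Assumption: the detection result value is a raw score that the
    # binary classification (two-class) loss maps into (0, 1).
    return 1.0 / (1.0 + math.exp(-logit))

def classify_complexity(p: float, preset_value: float = 0.5) -> str:
    # Complexity less than or equal to the preset value is low
    # complexity; complexity greater than the preset value is high.
    return "low" if p <= preset_value else "high"
```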
  • the preset labeling model is a preset text coordinate detection model, including a first labeling model and a second labeling model. Detecting a low-complexity target detection picture according to the first labeling model yields the target text coordinates of the target detection picture; detecting a high-complexity target detection picture according to the second labeling model yields the detected text coordinates of the target detection picture.
  • the detected texts of the low-complexity and high-complexity target detection images can be obtained respectively.
  • the coordinates of the target text and the coordinates of the detected text are composed of the coordinates of the lower left corner, the lower right corner, the upper left corner and the upper right corner of each text box in the target detection picture.
  • a feature map of the target detection picture and a preset detection feature frame are acquired; the feature map and the detection feature frame are computed with the first labeling model to obtain the feature vector of the target detection picture. The feature vector is then passed through the bidirectional long short-term memory network, the fully connected layer and the regression layer in the first labeling model, and the target text coordinates of the current target detection picture are output.
  • Step S203: calculate the center coordinates of the target detection picture according to the target text coordinates, fuse first text boxes whose center-coordinate difference is less than or equal to a preset error value into a new text box, and determine a first text box whose center-coordinate difference is greater than the preset error value as a fixed text box.
  • the first text box is a text box obtained by detecting the target detection picture according to the first labeling model.
  • the center coordinates are the mean coordinates of the first text boxes in each target detection picture. Calculate the x mean value and the y mean value of the target text coordinates of each first text box in the target detection image, and use the x mean value and the y mean value as the center coordinates of the corresponding first text box.
  • after the center coordinates corresponding to each first text box are obtained, the first text boxes whose center-coordinate difference is less than or equal to the preset error value are merged into a new text box.
  • the coordinates of the lower left corner of the new text box take the minimum x value and the minimum y value of the target text coordinates in the fused first text boxes; the coordinates of the upper right corner take the maximum x value and the maximum y value; the coordinates of the lower right corner take the maximum x value and the minimum y value; and the coordinates of the upper left corner take the minimum x value and the maximum y value.
  • a first text box whose center-coordinate difference is greater than the preset error value is determined as a fixed text box.
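  • a minimal sketch of the center computation and corner selection described above (the four-corner box representation and the function names are assumptions made for illustration):

```python
def box_center(box):
    # box: list of four (x, y) corners in the order lower-left,
    # lower-right, upper-left, upper-right. The center is the mean
    # of the four x values and the mean of the four y values.
    xs = [p[0] for p in box]
    ys = [p[1] for p in box]
    return (sum(xs) / 4.0, sum(ys) / 4.0)

def fuse_boxes(box_a, box_b):
    # New text box per the description: lower-left = (min x, min y),
    # upper-right = (max x, max y), lower-right = (max x, min y),
    # upper-left = (min x, max y) over both boxes' corners.
    xs = [p[0] for p in box_a + box_b]
    ys = [p[1] for p in box_a + box_b]
    lo_x, hi_x = min(xs), max(xs)
    lo_y, hi_y = min(ys), max(ys)
    return [(lo_x, lo_y), (hi_x, lo_y), (lo_x, hi_y), (hi_x, hi_y)]
```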
  • Step S204: extract the text information in the new text box and the fixed text box, and determine that the text information is the detection text of the target detection picture.
  • the text information in the new text box and the fixed text box is extracted, and the text information is arranged in the order of the text boxes; that is, the detection text of the target detection picture is obtained.
  • the above detection text can also be stored in a node of a blockchain.
  • the blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • Blockchain, essentially a decentralized database, is a series of data blocks associated using cryptographic methods; each data block contains a batch of network transaction information, used to verify the validity of its information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
  • This embodiment realizes text detection for pictures of different complexity, reduces the cost of manual labeling, saves the response time of model processing, and further improves the efficiency and accuracy of text detection in pictures.
  • the preset error value includes a first error value and a second error value
  • the fusion of the first text box whose center coordinates are less than or equal to the preset error value into a new text box includes:
  • a first text box whose first pixel difference value is less than or equal to the first error value and whose second pixel difference value is less than or equal to the second error value is merged into a new text box.
  • the preset error value includes a first error value and a second error value.
  • the first pixel difference value between the y-axis coordinates of two adjacent center coordinates, and the second pixel difference value between the x-axis coordinates of the two center coordinates, are obtained in sequence.
  • the first pixel difference value is the pixel difference value between the y-axis coordinates of the two center point coordinates
  • the second pixel difference value is the pixel difference value between the x-axis coordinates of the two center point coordinates.
  • a new text box is obtained by fusing the first text boxes with the first pixel difference between the center coordinates less than or equal to the first error value and the second pixel difference less than or equal to the second error value.
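  • the two-threshold fusion test above can be sketched as follows (the function and parameter names are illustrative, not from the patent):

```python
def should_fuse(center_a, center_b, first_error, second_error):
    # First pixel difference: absolute difference between the two
    # centers' y-axis coordinates; second pixel difference: absolute
    # difference between their x-axis coordinates. The boxes are fused
    # only when both differences are within their error values.
    first_diff = abs(center_a[1] - center_b[1])
    second_diff = abs(center_a[0] - center_b[0])
    return first_diff <= first_error and second_diff <= second_error
```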
  • this embodiment combines texts with small errors by fusing the text boxes, avoids wrong splitting of text in the process of text detection for low-complexity pictures, and further improves the accuracy of text detection in pictures.
  • after the complexity of the target detection picture is calculated according to the preset detection model, the method further includes:
  • the minimum text coordinates are mapped in parallel to the maximum picture corresponding to the target detection picture to obtain the detected text coordinates of the target detection picture, and the detected text corresponding to the target detection picture is calculated according to the detected text coordinates.
  • the minimum text coordinates of the second text box in the minimum picture corresponding to the target detection picture are obtained according to the second labeling model in the preset labeling model.
  • the second text box is a text box obtained by detecting the target detection picture according to the second labeling model
  • the second labeling model is a pre-trained high-complexity labeling model.
  • the minimum picture corresponding to the target detection picture is obtained according to the second labeling model, and the minimum picture is the minimum picture after scaling the target detection picture.
  • the second labeling model can perform pixel scaling on the target detection picture to obtain the minimum picture.
  • the second text box in the minimum picture is detected based on the second labeling model, thereby obtaining the minimum text coordinates corresponding to the second text box in the minimum picture.
  • after the minimum text coordinates are obtained, they are mapped to the maximum picture corresponding to the target detection picture; that is, all the obtained minimum text coordinates are simultaneously enlarged according to the preset mapping ratio between the minimum picture and the maximum picture to obtain the detected text coordinates. Obtaining the text content at the detected text coordinates yields the detection text of the target detection picture.
  • This embodiment uses the second labeling model to perform text detection on pictures with high complexity, thereby realizing targeted detection of text in pictures with high complexity, and further improving the detection efficiency and accuracy of pictures with high complexity.
  • the above-mentioned parallel mapping of the minimum text coordinates to the maximum picture corresponding to the target detection picture, and obtaining the detected text coordinates of the target detection picture includes:
  • a preset mapping ratio is acquired, and the minimum text coordinates are enlarged in parallel according to the preset mapping ratio to obtain the detected text coordinates of the target detection image.
  • when obtaining the text coordinates corresponding to a high-complexity target detection picture, the detected text coordinates can be obtained by acquiring a preset mapping ratio and mapping the minimum text coordinates to the maximum picture in parallel according to that ratio. Specifically, the preset mapping ratio is the ratio preset when the second labeling model scales the target detection picture, and ranges from 0 to 1; for example, the preset mapping ratio is 0.4.
  • the preset mapping ratio is obtained, all the obtained minimum text coordinates are simultaneously enlarged according to the preset mapping ratio, that is, the detected text coordinates of the target detection image are obtained.
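  • one reading of the mapping step, under the assumption that a preset mapping ratio of 0.4 means the minimum picture is 0.4 times the size of the original, so mapping back divides every coordinate by the ratio:

```python
def map_to_maximum(min_coords, preset_ratio):
    # min_coords: (x, y) points detected on the scaled-down minimum
    # picture; preset_ratio in (0, 1) is the scale the second labeling
    # model applied. Enlarging all points in parallel restores
    # coordinates on the maximum (original-size) picture.
    return [(x / preset_ratio, y / preset_ratio) for (x, y) in min_coords]
```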
  • accurate acquisition of the detected text coordinates of the target detection picture is realized, so that the text information of the target detection picture can be accurately located from the detected text coordinates, avoiding the text confusion that may occur when text detection is performed on pictures with high complexity.
  • the above calculation of the complexity of the target detection picture according to the preset detection model includes:
  • the detection result value is predicted according to a preset two-class loss function to obtain the complexity of the target detection picture.
  • the preset detection model includes a convolution layer, a pooling layer, and a fully connected layer.
  • the length, width and channel number of the target detection picture are obtained.
  • after the detection result value is obtained, it is evaluated with the preset binary classification loss function; that is, the complexity of the current target detection picture is obtained.
  • the complexity can be represented by p, and the range of p is between 0 and 1.
  • the larger p is, the smaller the text in the target detection picture, the smaller the interval between words, and the higher the complexity of the target detection picture; the smaller p is, the larger the text in the target detection picture, the larger the interval between words, and the lower the complexity of the target detection picture.
  • the complexity of the target detection picture is calculated, so as to realize the classification and detection of the target detection picture according to the complexity, and further improve the detection efficiency of the target detection picture.
  • before the above step of acquiring the feature vector of the target detection picture according to the first labeling model in the preset labeling model, the method further includes:
  • the trained basic labeling model is verified according to the test pictures, and when the verification pass rate of the trained basic labeling model on the test pictures is greater than or equal to a preset pass rate, the trained basic labeling model is determined to be the preset labeling model.
  • a basic annotation model needs to be established in advance, and the basic annotation model is trained to obtain the preset annotation model.
  • the preset labeling model includes a first labeling model and a second labeling model.
  • the first labeling model is used for processing low-complexity target detection pictures
  • the second labeling model is used for processing high-complexity target detection pictures.
  • the first labeling model and the second labeling model have different network structures, but both the first labeling model and the second labeling model can be trained by the same training method.
  • initial text pictures are obtained, where the initial text pictures are a plurality of pre-collected text pictures, and they are divided into training pictures and test pictures.
  • the initial text coordinates of the training picture are detected based on a basic labeling model, where the basic labeling model may be the network structure of the first labeling model or the network structure of the second labeling model.
  • the initial text coordinates of the training picture are obtained, and at the same time, the training picture is labelled according to the preset labeling tool, and the labelled text coordinates of the training picture are obtained.
  • the basic labeling model is trained according to the initial text coordinates and the labeling text coordinates, that is, the loss function of the basic labeling model is calculated according to the labeling text coordinates and the initial text coordinates. When the loss function converges, the trained basic labeling model is obtained.
  • the trained basic labeling model is tested according to the test pictures: if the similarity between the initial text coordinates detected by the trained basic labeling model and the labeled text coordinates corresponding to a test picture is greater than or equal to a preset similarity threshold, the trained basic labeling model is determined to have passed verification on that test picture. When the verification pass rate of the trained basic labeling model on the test pictures is greater than or equal to the preset pass rate, the trained basic labeling model is determined to be the preset labeling model.
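  • a sketch of the acceptance check described above (the similarity measure and both thresholds are placeholders; the patent does not define how similarity is computed):

```python
def model_accepted(similarities, similarity_threshold, preset_pass_rate):
    # A test picture passes when its predicted-vs-labeled coordinate
    # similarity meets the threshold; the trained model becomes the
    # preset labeling model when the pass rate over all test pictures
    # reaches the preset pass rate.
    passed = sum(1 for s in similarities if s >= similarity_threshold)
    return passed / len(similarities) >= preset_pass_rate
```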
  • the basic labeling model is trained in advance, so that the preset labeling model obtained by training can accurately detect the text of the picture, save the labeling time of the picture and text detection, and improve the efficiency of the picture and text detection.
  • the above-mentioned calculation of the loss function of the basic labeling model according to the labeled text coordinates includes:
  • the training picture is labeled according to a preset labeling tool, and the labeled text coordinates of the training picture are obtained. Calculate the squared difference between the initial text coordinates and the labeled text coordinates, and then calculate the loss function of the basic labeling model according to the squared difference.
  • the loss function of the basic labeling model is calculated as L = Σ_k (x_k − x̂_k)², where x_k is the initial text coordinate and x̂_k is the labeled text coordinate.
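  • assuming the loss is the sum of squared coordinate differences, as the description suggests, a minimal sketch:

```python
def labeling_loss(initial_coords, labeled_coords):
    # Squared difference between each initial text coordinate predicted
    # by the basic labeling model and its manually labeled counterpart,
    # summed over all coordinates; training stops when this converges.
    return sum((a - b) ** 2 for a, b in zip(initial_coords, labeled_coords))
```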
  • the training time of the basic labeling model is saved, and the training efficiency of the basic labeling model is improved.
  • the aforementioned storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, a read-only memory (Read-Only Memory, ROM), or a random access memory (Random Access Memory, RAM) or the like.
  • the present application provides an embodiment of a picture and text detection device, and the device embodiment corresponds to the method embodiment shown in FIG. 2.
  • the image and text detection apparatus 300 in this embodiment includes: a detection module 301, a labeling module 302, a confirmation module 303, and an extraction module 304, wherein:
  • a detection module 301 configured to calculate the complexity of the target detection picture according to a preset detection model when the target detection picture is received;
  • the detection module 301 includes:
  • a first computing unit used for inputting the target detection picture to the convolution layer of the preset detection model, and outputting the detection result value through the pooling layer and the fully connected layer;
  • the second calculation unit is configured to predict the detection result value according to the preset binary classification loss function, and obtain the complexity of the target detection picture.
  • the target detection picture is a detection picture including target text
  • the complexity of the target detection picture is calculated according to a preset detection model
  • the preset detection model is a preset picture complexity detection model, such as a lightweight convolutional neural network discriminant model based on VGG16.
  • input the target detection picture into the preset detection model, compute over the length, width and channel number of the target detection picture through the convolution layer, pooling layer and fully connected layer of the preset detection model, and output the detection result value of the target detection picture; the detection result value is then evaluated with the binary classification (two-class) loss function, that is, the complexity of the current target detection picture is obtained.
  • the labeling module 302 is configured to obtain the feature vector of the target detection picture according to the first labeling model in the preset labeling model when the complexity is low complexity, and calculate the target detection picture according to the feature vector The target text coordinates of the first text box in ;
  • the complexity can be divided into low complexity and high complexity according to a preset value: complexity less than or equal to the preset value is low complexity, and complexity greater than the preset value is high complexity.
  • the preset labeling model is a preset text coordinate detection model, including a first labeling model and a second labeling model. Detecting a low-complexity target detection picture according to the first labeling model yields the target text coordinates of the target detection picture; detecting a high-complexity target detection picture according to the second labeling model yields the detected text coordinates of the target detection picture.
  • the detected texts of the low-complexity and high-complexity target detection images can be obtained respectively.
  • the coordinates of the target text and the coordinates of the detected text are composed of the coordinates of the lower left corner, the lower right corner, the upper left corner and the upper right corner of each text box in the target detection picture.
  • a feature map of the target detection picture and a preset detection feature frame are acquired; the feature map and the detection feature frame are calculated based on the first labeling model to obtain the feature vector of the target detection picture; the feature vector is then passed through the bidirectional long short-term memory network, the fully connected layer and the regression layer in the first labeling model to output the target text coordinates of the current target detection picture.
  • the confirmation module 303 is configured to calculate the center coordinates of the target detection picture according to the target text coordinates, fuse the first text boxes whose center coordinates differ by no more than a preset error value into a new text box, and determine the first text boxes whose center coordinates differ by more than the preset error value as fixed text boxes;
  • the preset error value includes a first error value and a second error value
  • the confirmation module 303 includes:
  • an acquisition unit configured to acquire the first pixel difference between the y-axis coordinates of two adjacent center coordinates, and the second pixel difference between the x-axis coordinates of the two adjacent center coordinates;
  • a confirmation unit configured to merge a first text box whose first pixel difference value is less than or equal to the first error value and the second pixel difference value is less than or equal to the second error value into a new text box.
  • the first text box is a text box obtained by detecting the target picture according to the first annotation model
  • the center coordinates are the mean coordinates of the first text boxes in each target detection picture: the x mean value and the y mean value of the target text coordinates of each first text box in the target detection picture are calculated, and the x mean value and the y mean value are used as the center coordinates of the corresponding first text box.
  • after the center coordinates corresponding to each first text box are obtained, the first text boxes whose center coordinates differ by no more than the preset error value are merged into a new text box.
  • the coordinates of the lower left corner of the new text box take the minimum x value and the minimum y value of the target text coordinates in the fused first text boxes; the coordinates of the upper right corner take the maximum x value and the maximum y value; the coordinates of the lower right corner take the maximum x value and the minimum y value; and the coordinates of the upper left corner take the minimum x value and the maximum y value of the target text coordinates in the fused first text boxes.
  • the first text box whose center coordinates are greater than the preset error value is determined as a fixed text box.
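The fusion rule described above — mean-of-corners centers, per-axis pixel-difference thresholds between adjacent centers, and min/max corner construction for the new text box — might be sketched as below. The representation of a box as four (x, y) corners, the ordering of boxes by center, and the concrete error values are illustrative assumptions, not details fixed by the embodiment.

```python
def box_center(box):
    """Center of a text box: mean x and mean y of its corner coordinates."""
    xs = [p[0] for p in box]
    ys = [p[1] for p in box]
    return (sum(xs) / len(xs), sum(ys) / len(ys))

def fuse_boxes(boxes, err_x=10.0, err_y=5.0):
    """Merge adjacent first text boxes whose center coordinates differ by at
    most the preset error values (second pixel difference err_x on the x axis,
    first pixel difference err_y on the y axis); all other boxes stay as
    single-member groups, i.e. fixed text boxes."""
    groups = []
    for box in sorted(boxes, key=box_center):
        cx, cy = box_center(box)
        if groups:
            px, py = box_center(groups[-1][-1])
            if abs(cx - px) <= err_x and abs(cy - py) <= err_y:
                groups[-1].append(box)
                continue
        groups.append([box])
    fused = []
    for group in groups:
        xs = [p[0] for b in group for p in b]
        ys = [p[1] for b in group for p in b]
        # corner order: lower-left, lower-right, upper-right, upper-left
        fused.append([(min(xs), min(ys)), (max(xs), min(ys)),
                      (max(xs), max(ys)), (min(xs), max(ys))])
    return fused
```

Two side-by-side boxes on the same text line are merged into one wide box, while a distant box survives unchanged as a fixed text box.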
  • the extraction module 304 is configured to extract the text information in the new text box and the fixed text box, and determine that the text information is the detection text of the target detection picture.
  • the text information in the new text box and the fixed text box is extracted, and the text information is arranged in the order of the text boxes, that is, the detection text of the target detection picture is obtained.
  • the above detection text can also be stored in a node of a blockchain.
  • the blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • Blockchain, essentially a decentralized database, is a series of data blocks associated with one another by cryptographic methods. Each data block contains a batch of network transaction information, which is used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
  • the obtaining module is configured to, when the complexity is high complexity, obtain the minimum picture corresponding to the target detection picture according to the second labeling model in the preset labeling model, and the minimum text coordinates of the second text box in the minimum picture;
  • the mapping module is used to map the minimum text coordinates in parallel to the maximum picture corresponding to the target detection picture to obtain the detected text coordinates of the target detection picture, and to obtain the corresponding detection text of the target detection picture according to the detected text coordinates.
  • the mapping module includes:
  • the mapping unit is configured to obtain a preset mapping ratio, and enlarge the minimum text coordinates in parallel according to the preset mapping ratio to obtain the detected text coordinates of the target detection image.
  • the minimum text coordinates of the second text box in the minimum picture corresponding to the target detection picture are obtained according to the second labeling model in the preset labeling model.
  • the second text box is a text box obtained by detecting the target detection picture according to the second labeling model
  • the second labeling model is a pre-trained high-complexity labeling model.
  • the minimum picture corresponding to the target detection picture is obtained according to the second labeling model, and the minimum picture is the minimum picture after scaling the target detection picture.
  • the second labeling model can perform pixel scaling on the target detection picture to obtain the minimum picture.
  • the second text box in the minimum picture is detected based on the second labeling model, thereby obtaining the minimum text coordinates corresponding to the second text box in the minimum picture.
  • after the minimum text coordinates are obtained, the minimum text coordinates are mapped to the maximum picture corresponding to the target detection picture, that is, all the obtained minimum text coordinates are simultaneously enlarged according to the preset mapping ratio between the minimum picture and the maximum picture to obtain the detected text coordinates; obtaining the text content within the detected text coordinates yields the detection text of the target detection picture.
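The parallel coordinate mapping above reduces to elementwise scaling of every corner by the preset mapping ratio. The sketch below assumes boxes are lists of (x, y) tuples and that a single scalar ratio relates the minimum picture to the maximum picture; the embodiment does not fix these representational details.

```python
def map_to_max_picture(min_text_coords, mapping_ratio):
    """Enlarge every minimum text coordinate by the preset mapping ratio
    between the minimum picture and the maximum picture, yielding the
    detected text coordinates in the maximum picture."""
    return [[(x * mapping_ratio, y * mapping_ratio) for (x, y) in box]
            for box in min_text_coords]
```

For example, with a mapping ratio of 4, a box detected at (1, 2)–(3, 4) in the minimum picture maps to (4, 8)–(12, 16) in the maximum picture.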
  • a division module configured to obtain an initial text picture, divide the initial text picture into a training picture and a test picture, input the training picture into a preset basic labeling model, and obtain the labeling text coordinates of the training picture;
  • a training module configured to calculate the loss function of the basic labeling model according to the coordinates of the labeling text, and when the loss function converges, determine that the basic labeling model is the trained basic labeling model;
  • a verification module configured to verify the trained basic labeling model according to the test picture, and determine that the trained basic labeling model is the preset labeling model.
  • the training module includes:
  • a labeling unit configured to label the training picture based on a preset labeling tool to obtain initial text coordinates of the training picture
  • the third calculation unit is configured to calculate the squared difference between the initial text coordinates and the labeled text coordinates, and calculate the loss function of the basic labeling model according to the squared difference.
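The squared-difference computation of the third calculation unit might be sketched as below. Averaging the squared differences over all coordinate values is an assumption for illustration; the embodiment only states that the loss function is calculated from the squared differences between the two sets of text coordinates.

```python
def squared_difference_loss(initial_coords, labeled_coords):
    """Loss of the basic labeling model: mean of the squared differences
    between the initial (tool-labeled) and the model's labeled text
    coordinates, taken over every x and y value of every box."""
    diffs = [(p - q) ** 2
             for pred_box, label_box in zip(initial_coords, labeled_coords)
             for (px, py), (qx, qy) in zip(pred_box, label_box)
             for p, q in ((px, qx), (py, qy))]
    return sum(diffs) / len(diffs)
```

Training then iterates until this loss converges, at which point the trained basic labeling model is obtained.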
  • a basic annotation model needs to be established in advance, and the basic annotation model is trained to obtain the preset annotation model.
  • the preset labeling model includes a first labeling model and a second labeling model.
  • the first labeling model is used for processing low-complexity target detection pictures
  • the second labeling model is used for processing high-complexity target detection pictures.
  • the first labeling model and the second labeling model have different network structures, but both can be trained by the same training method. Specifically, an initial text picture is obtained, where the initial text picture comprises a plurality of pre-collected text pictures, and the initial text picture is divided into a training picture and a test picture.
  • the initial text coordinates of the training image are detected based on a basic labeling model, where the basic labeling model may be the network structure of the first labeling model or the network structure of the second labeling model.
  • the initial text coordinates of the training picture are obtained, and at the same time, the training picture is labelled according to the preset labeling tool, and the labelled text coordinates of the training picture are obtained.
  • the basic labeling model is trained according to the initial text coordinates and the labeling text coordinates, that is, the loss function of the basic labeling model is calculated according to the labeling text coordinates and the initial text coordinates. When the loss function converges, the trained basic labeling model is obtained.
  • the trained basic labeling model is tested according to the test picture. If the similarity between the text coordinates detected by the trained basic labeling model and the labeled text coordinates corresponding to the test picture is greater than or equal to the preset similarity threshold, it is determined that the trained basic labeling model passes verification for that test picture. When the verification pass rate of the trained basic labeling model over the test pictures is greater than or equal to the preset pass rate, the trained basic labeling model is determined to be the preset labeling model.
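The two-level verification described above — a per-picture similarity threshold, then an overall pass rate compared against a preset pass rate — might be sketched as follows. The similarity measure and the concrete threshold values are illustrative assumptions; the embodiment leaves them as preset parameters.

```python
def verify_model(similarities, sim_threshold=0.9, pass_rate_threshold=0.95):
    """A test picture passes when the similarity between the detected and
    labeled text coordinates reaches the similarity threshold; the trained
    model becomes the preset labeling model when the fraction of passing
    test pictures reaches the preset pass rate."""
    passed = sum(1 for s in similarities if s >= sim_threshold)
    pass_rate = passed / len(similarities)
    return pass_rate >= pass_rate_threshold
```

A model that passes on every test picture is accepted; one that fails on most of them is sent back for further training.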
  • the image text detection device proposed in this embodiment realizes text detection on images of different complexity, reduces manual labeling costs, saves the response time of model processing, and further improves the efficiency and accuracy of image text detection.
  • FIG. 4 is a block diagram of a basic structure of a computer device according to this embodiment.
  • the computer device 6 includes a memory 61 , a processor 62 , and a network interface 63 that communicate with each other through a system bus. It should be pointed out that only the computer device 6 with components 61-63 is shown in the figure, but it should be understood that it is not required to implement all of the shown components, and more or less components may be implemented instead.
  • the computer device here is a device that can automatically perform numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, microprocessors, application-specific integrated circuits (ASIC), field-programmable gate arrays (FPGA), digital signal processors (DSP), embedded devices, and the like.
  • the computer equipment may be a desktop computer, a notebook computer, a palmtop computer, a cloud server and other computing equipment.
  • the computer device can perform human-computer interaction with the user through a keyboard, a mouse, a remote control, a touch pad or a voice control device.
  • the memory 61 includes at least one type of readable storage medium, and the readable storage medium includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory, etc.), random access memory (RAM), static Random Access Memory (SRAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), Programmable Read Only Memory (PROM), Magnetic Memory, Magnetic Disk, Optical Disk, etc.
  • the computer-readable storage medium may be non-volatile or volatile.
  • the memory 61 may be an internal storage unit of the computer device 6 , such as a hard disk or a memory of the computer device 6 .
  • the memory 61 may also be an external storage device of the computer device 6, such as a plug-in hard disk, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, flash memory card (Flash Card), etc.
  • the memory 61 may also include both the internal storage unit of the computer device 6 and its external storage device.
  • the memory 61 is generally used to store the operating system and various application software installed on the computer device 6 , such as computer-readable instructions of a picture and text detection method.
  • the memory 61 can also be used to temporarily store various types of data that have been output or will be output.
  • the processor 62 may be a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor, or other data processing chips. This processor 62 is typically used to control the overall operation of the computer device 6 . In this embodiment, the processor 62 is configured to execute computer-readable instructions or process data stored in the memory 61, for example, computer-readable instructions for executing the image and text detection method.
  • the network interface 63 may include a wireless network interface or a wired network interface, and the network interface 63 is generally used to establish a communication connection between the computer device 6 and other electronic devices.
  • the computer device proposed in this embodiment realizes text detection for pictures of different complexity, reduces manual labeling costs, saves response time for model processing, and further improves the efficiency and accuracy of picture text detection.
  • the present application also provides another embodiment, that is, to provide a computer-readable storage medium, where the computer-readable storage medium stores computer-readable instructions, and the computer-readable instructions can be executed by at least one processor to The at least one processor is caused to execute the steps of the above-mentioned picture text detection method.
  • the computer-readable storage medium proposed in this embodiment realizes text detection for pictures of different complexity, reduces manual labeling costs, saves the response time of model processing, and further improves the efficiency and accuracy of picture text detection.
  • the method of the above embodiment can be implemented by means of software plus a necessary general hardware platform, and can of course also be implemented by hardware, but in many cases the former is the better implementation.
  • the technical solution of the present application can be embodied in the form of a software product in essence or in a part that contributes to the prior art, and the computer software product is stored in a storage medium (such as ROM/RAM, magnetic disk, CD-ROM), including several instructions to make a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) execute the methods described in the various embodiments of this application.


Abstract

A text detection method and apparatus from an image, a computer device and a storage medium, relating to the field of artificial intelligence. The method comprises: when receiving a target detection image, calculating the complexity of the target detection image according to a preconfigured detection model; when the complexity is a low complexity, calculating target text coordinates of first text boxes in the target detection image according to a first marking model in preconfigured marking models; calculating center coordinates of the target detection image according to the target text coordinates, fusing the first text boxes of which the center coordinates are less than or equal to a preset error value into a new text box, and determining the first text boxes of which the center coordinates are greater than the preset error value as fixed text boxes; and extracting text information from the new text box and the fixed text boxes, and determining the text information as detected texts. The method realizes efficient text detection from images.

Description

Image text detection method, apparatus, computer device and storage medium

This application claims priority to the Chinese patent application filed with the Chinese Patent Office on November 17, 2020, with application number 202011286320.X and entitled "Image text detection method, apparatus, computer device and storage medium", the entire content of which is incorporated herein by reference.

Technical Field

The present application relates to the technical field of artificial intelligence, and in particular to an image text detection method, apparatus, computer device and storage medium.

Background

With the rapid development of target detection technology, text detection methods from target detection technology are used in more and more fields, such as Alipay's Fu-character scanning, ID card recognition and so on. By recognizing the text in a picture, the information in the picture can be extracted.

At present, algorithms based on FPN (feature pyramid networks) have a poor detection effect on small and dense text, while pixel-level algorithms have relatively high accuracy but long model processing times, which makes it difficult to meet industrial needs. In addition, the inventor realized that image text detection is mainly intended to extract useful information from a picture, such as name, address and account information fields, so as to facilitate the subsequent storage of these parameters in a database and provide data for a subsequent risk control system. However, a picture may contain a great deal of information, and a more complex picture may contain more than one hundred fields; when text detection is performed on such pictures by the existing technology, the problem of low text detection efficiency often arises.
SUMMARY OF THE INVENTION

The purpose of the embodiments of the present application is to provide an image text detection method, apparatus, computer device and storage medium, so as to solve the technical problem of low efficiency of image text detection.

In order to solve the above technical problem, an embodiment of the present application provides an image text detection method, which adopts the following technical solution:

when a target detection picture is received, calculating the complexity of the target detection picture according to a preset detection model;

when the complexity is low complexity, obtaining the feature vector of the target detection picture according to a first labeling model in preset labeling models, and calculating the target text coordinates of first text boxes in the target detection picture according to the feature vector;

calculating the center coordinates of the target detection picture according to the target text coordinates, fusing first text boxes whose center coordinates are less than or equal to a preset error value into a new text box, and determining first text boxes whose center coordinates are greater than the preset error value as fixed text boxes;

extracting the text information in the new text box and the fixed text boxes, and determining the text information as the detection text of the target detection picture.
In order to solve the above technical problem, an embodiment of the present application further provides an image text detection apparatus, which adopts the following technical solution:

a detection module, configured to calculate the complexity of a target detection picture according to a preset detection model when the target detection picture is received;

a labeling module, configured to obtain the feature vector of the target detection picture according to a first labeling model in preset labeling models when the complexity is low complexity, and calculate the target text coordinates of first text boxes in the target detection picture according to the feature vector;

a confirmation module, configured to calculate the center coordinates of the target detection picture according to the target text coordinates, fuse first text boxes whose center coordinates are less than or equal to a preset error value into a new text box, and determine first text boxes whose center coordinates are greater than the preset error value as fixed text boxes;

an extraction module, configured to extract the text information in the new text box and the fixed text boxes, and determine the text information as the detection text of the target detection picture.
In order to solve the above technical problem, an embodiment of the present application further provides a computer device, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor, when executing the computer-readable instructions, further implements the following steps:

when a target detection picture is received, calculating the complexity of the target detection picture according to a preset detection model;

when the complexity is low complexity, obtaining the feature vector of the target detection picture according to a first labeling model in preset labeling models, and calculating the target text coordinates of first text boxes in the target detection picture according to the feature vector;

calculating the center coordinates of the target detection picture according to the target text coordinates, fusing first text boxes whose center coordinates are less than or equal to a preset error value into a new text box, and determining first text boxes whose center coordinates are greater than the preset error value as fixed text boxes;

extracting the text information in the new text box and the fixed text boxes, and determining the text information as the detection text of the target detection picture.
In order to solve the above technical problem, an embodiment of the present application further provides a computer-readable storage medium storing computer-readable instructions, wherein the computer-readable instructions, when executed by a processor, cause the processor to further perform the following steps:

when a target detection picture is received, calculating the complexity of the target detection picture according to a preset detection model;

when the complexity is low complexity, obtaining the feature vector of the target detection picture according to a first labeling model in preset labeling models, and calculating the target text coordinates of first text boxes in the target detection picture according to the feature vector;

calculating the center coordinates of the target detection picture according to the target text coordinates, fusing first text boxes whose center coordinates are less than or equal to a preset error value into a new text box, and determining first text boxes whose center coordinates are greater than the preset error value as fixed text boxes;

extracting the text information in the new text box and the fixed text boxes, and determining the text information as the detection text of the target detection picture.
In the above image text detection method, when a target detection picture is received, the complexity of the target detection picture is calculated according to a preset detection model; based on the complexity, a model can be selected for the target detection picture, so that targeted text detection is performed on the target detection picture, which improves the efficiency of image text detection. Then, when the complexity is low complexity, the feature vector of the target detection picture is obtained according to the first labeling model in the preset labeling models, and the target text coordinates of the first text boxes in the target detection picture are calculated according to the feature vector; through the target text coordinates, the text information of the target detection picture can be precisely located. Afterwards, the center coordinates of the target detection picture are calculated according to the target text coordinates, the first text boxes whose center coordinates are less than or equal to a preset error value are fused into a new text box, and the first text boxes whose center coordinates are greater than the preset error value are determined as fixed text boxes, thereby avoiding erroneous splitting of text in low-complexity pictures and improving the accuracy of image text detection. Finally, the text information in the new text box and the fixed text boxes is extracted and determined as the detection text of the target detection picture, which realizes text detection for pictures of different complexity, reduces manual labeling costs, saves the response time of model processing, and further improves the efficiency and accuracy of image text detection.
Brief Description of the Drawings

In order to illustrate the solutions in the present application more clearly, the accompanying drawings used in the description of the embodiments of the present application are briefly introduced below. Obviously, the drawings described below illustrate some embodiments of the present application, and for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.

FIG. 1 is an exemplary system architecture diagram to which the present application can be applied;

FIG. 2 is a flowchart of an embodiment of the image text detection method according to the present application;

FIG. 3 is a schematic structural diagram of an embodiment of the image text detection apparatus according to the present application;

FIG. 4 is a schematic structural diagram of an embodiment of the computer device according to the present application.

Reference numerals: image text detection apparatus 300, detection module 301, labeling module 302, confirmation module 303 and extraction module 304.
具体实施方式Detailed ways
除非另有定义,本文所使用的所有的技术和科学术语与属于本申请的技术领域的技术人员通常理解的含义相同;本文中在申请的说明书中所使用的术语只是为了描述具体的实施例的目的,不是旨在于限制本申请;本申请的说明书和权利要求书及上述附图说明中的术语“包括”和“具有”以及它们的任何变形,意图在于覆盖不排他的包含。本申请的说明书和权利要求书或上述附图中的术语“第一”、“第二”等是用于区别不同对象,而不是用于描述特定顺序。Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the technical field of this application; the terms used herein in the specification of the application are for the purpose of describing specific embodiments only It is not intended to limit the application; the terms "comprising" and "having" and any variations thereof in the description and claims of this application and the above description of the drawings are intended to cover non-exclusive inclusion. The terms "first", "second" and the like in the description and claims of the present application or the above drawings are used to distinguish different objects, rather than to describe a specific order.
在本文中提及“实施例”意味着,结合实施例描述的特定特征、结构或特性可以包含在本申请的至少一个实施例中。在说明书中的各个位置出现该短语并不一定均是指相同的实施例,也不是与其它实施例互斥的独立的或备选的实施例。本领域技术人员显式地和隐式地理解的是,本文所描述的实施例可以与其它实施例相结合。Reference herein to an "embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the present application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor a separate or alternative embodiment that is mutually exclusive of other embodiments. It is explicitly and implicitly understood by those skilled in the art that the embodiments described herein may be combined with other embodiments.
为了使本技术领域的人员更好地理解本申请方案,下面将结合附图,对本申请实施例中的技术方案进行清楚、完整地描述。In order to make those skilled in the art better understand the solutions of the present application, the technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings.
如图1所示,系统架构100可以包括终端设备101、102、103,网络104和服务器105。网络104用以在终端设备101、102、103和服务器105之间提供通信链路的介质。网络104可以包括各种连接类型,例如有线、无线通信链路或者光纤电缆等等。As shown in FIG. 1 , the system architecture 100 may include terminal devices 101 , 102 , and 103 , a network 104 and a server 105 . The network 104 is a medium used to provide a communication link between the terminal devices 101 , 102 , 103 and the server 105 . The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
用户可以使用终端设备101、102、103通过网络104与服务器105交互,以接收或发送消息等。终端设备101、102、103上可以安装有各种通讯客户端应用,例如网页浏览器应用、购物类应用、搜索类应用、即时通信工具、邮箱客户端、社交平台软件等。The user can use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to receive or send messages and the like. Various communication client applications may be installed on the terminal devices 101 , 102 and 103 , such as web browser applications, shopping applications, search applications, instant messaging tools, email clients, social platform software, and the like.
终端设备101、102、103可以是具有显示屏并且支持网页浏览的各种电子设备,包括但不限于智能手机、平板电脑、电子书阅读器、MP3播放器(Moving Picture Experts Group Audio Layer III,动态影像专家压缩标准音频层面3)、MP4(Moving Picture Experts Group Audio Layer IV,动态影像专家压缩标准音频层面4)播放器、膝上型便携计算机和台式计算机等等。The terminal devices 101, 102 and 103 may be various electronic devices that have a display screen and support web browsing, including but not limited to smart phones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop portable computers, desktop computers, and the like.
服务器105可以是提供各种服务的服务器,例如对终端设备101、102、103上显示的页面提供支持的后台服务器。The server 105 may be a server that provides various services, such as a background server that provides support for the pages displayed on the terminal devices 101 , 102 , and 103 .
需要说明的是,本申请实施例所提供的图片文字检测方法一般由服务器/终端设备执行,相应地,图片文字检测装置一般设置于服务器/终端设备中。It should be noted that the image and text detection methods provided in the embodiments of the present application are generally executed by a server/terminal device, and correspondingly, the image and text detection apparatus is generally set in the server/terminal device.
应该理解,图1中的终端设备、网络和服务器的数目仅仅是示意性的。根据实现需要,可以具有任意数目的终端设备、网络和服务器。It should be understood that the numbers of terminal devices, networks and servers in FIG. 1 are merely illustrative. There can be any number of terminal devices, networks and servers according to implementation needs.
继续参考图2,示出了根据本申请的图片文字检测的方法的一个实施例的流程图。所述的图片文字检测方法,包括以下步骤:Continuing to refer to FIG. 2 , a flowchart of an embodiment of a method for detecting text in pictures according to the present application is shown. The image and text detection method includes the following steps:
步骤S201,在接收到目标检测图片时,根据预设检测模型计算所述目标检测图片的复杂度;Step S201, when receiving the target detection picture, calculate the complexity of the target detection picture according to a preset detection model;
在本实施例中,目标检测图片为包括有目标文本的检测图片,根据预设检测模型计算该目标检测图片的复杂度;其中,预设检测模型为预先设定的图片复杂度检测模型,如基于VGG16的轻量级卷积神经网络判别模型。具体地,将目标检测图片输入至该预设检测模型中,基于该预设检测模型的卷积层、池化层和全连接层对该目标检测图片的长、宽、通道数进行计算,输出得到该目标检测图片的检测结果值;之后根据二分类损失函数对该检测结果值进行计算,即得到当前目标检测图片的复杂度。In this embodiment, the target detection picture is a detection picture that includes target text, and the complexity of the target detection picture is calculated according to a preset detection model, where the preset detection model is a preset picture-complexity detection model, such as a lightweight convolutional neural network discrimination model based on VGG16. Specifically, the target detection picture is input into the preset detection model; the length, width and number of channels of the target detection picture are processed by the convolution layers, pooling layers and fully connected layers of the preset detection model, and the detection result value of the target detection picture is output. The detection result value is then evaluated with the two-class classification loss function to obtain the complexity of the current target detection picture.
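The final step of step S201 can be sketched as follows. This is a minimal illustration, not the patent's actual model: it assumes the network's detection result value is a single scalar score, that the two-class loss function corresponds to a sigmoid probability, and that a hypothetical threshold of 0.5 separates low from high complexity (the patent only states that a preset value is used).

```python
import math

def complexity_from_score(score: float) -> float:
    # Map the raw detection result value output by the network to a
    # complexity p in [0, 1], as a two-class (sigmoid) classifier would.
    return 1.0 / (1.0 + math.exp(-score))

def is_low_complexity(p: float, preset_value: float = 0.5) -> bool:
    # Complexity less than or equal to the preset value counts as low
    # complexity; the value 0.5 is an assumed placeholder.
    return p <= preset_value
```

A picture whose score maps to p above the preset value would then be routed to the second labeling model instead of the first.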
步骤S202,在所述复杂度为低复杂度时,根据预设标注模型中的第一标注模型获取所述目标检测图片的特征向量,根据所述特征向量计算得到所述目标检测图片中第一文本框的目标文本坐标;Step S202, when the complexity is low complexity, obtain the feature vector of the target detection picture according to the first labeling model in the preset labeling model, and calculate the target text coordinates of the first text box in the target detection picture according to the feature vector;
在本实施例中,将复杂度按照预设值可以划分为低复杂度和高复杂度,小于等于该预设值的复杂度即为低复杂度,大于该预设值的复杂度即为高复杂度。在目标检测图片的复杂度为低复杂度时,根据预设标注模型中的第一标注模型获取目标检测图片的目标文本坐标。其中,预设标注模型为预先设定的文本坐标检测模型,包括第一标注模型和第二标注模型。根据第一标注模型对低复杂度的目标检测图片进行检测,可以得到目标检测图片的目标文本坐标;根据第二标注模型可以对高复杂度的目标检测图片进行检测,可以得到目标检测图片的检测文本坐标,根据该目标文本坐标和检测文本坐标则可以分别得到低复杂度和高复杂度目标检测图片的检测文本。具体地,目标文本坐标和检测文本坐标均由目标检测图片中各个文本框的左下角、右下角、左上角和右上角坐标组成。在目标检测图片的复杂度为低复杂度时,获取该目标检测图片的特征图,以及预设的检测特征框。基于第一标注模型对该特征图与检测特征框进行计算,得到目标检测图片的特征向量;将该特征向量经过该第一标注模型中的双向长短期记忆网络、全连接层和回归层,输出得到当前目标检测图片的目标文本坐标。In this embodiment, the complexity can be divided into low complexity and high complexity according to a preset value: a complexity less than or equal to the preset value is low complexity, and a complexity greater than the preset value is high complexity. When the complexity of the target detection picture is low complexity, the target text coordinates of the target detection picture are acquired according to the first labeling model in the preset labeling model. The preset labeling model is a preset text-coordinate detection model, and includes a first labeling model and a second labeling model. By detecting a low-complexity target detection picture with the first labeling model, the target text coordinates of the target detection picture can be obtained; by detecting a high-complexity target detection picture with the second labeling model, the detected text coordinates of the target detection picture can be obtained. From the target text coordinates and the detected text coordinates, the detection texts of low-complexity and high-complexity target detection pictures can be obtained respectively. Specifically, both the target text coordinates and the detected text coordinates consist of the lower-left, lower-right, upper-left and upper-right corner coordinates of each text box in the target detection picture. When the complexity of the target detection picture is low complexity, a feature map of the target detection picture and preset detection feature boxes are acquired. The feature map and the detection feature boxes are processed by the first labeling model to obtain the feature vector of the target detection picture; the feature vector is then passed through the bidirectional long short-term memory network, the fully connected layer and the regression layer in the first labeling model, and the target text coordinates of the current target detection picture are output.
步骤S203,根据所述目标文本坐标,计算所述目标检测图片的中心坐标,将所述中心坐标小于等于预设误差值的第一文本框融合为新文本框,将所述中心坐标大于所述预设误差值的第一文本框确定为固定文本框;Step S203: Calculate the center coordinates of the target detection picture according to the target text coordinates, fuse the first text box whose center coordinates are less than or equal to a preset error value into a new text box, and set the center coordinates greater than the The first text box of the preset error value is determined as a fixed text box;
在本实施例中,第一文本框为根据第一标注模型对目标图片进行检测得到的文本框,中心坐标为每个目标检测图片中第一文本框的均值坐标。计算目标检测图片中每个第一文本框的目标文本坐标的x均值和y均值,将x均值和y均值作为对应的第一文本框的中心坐标。在得到每个第一文本框对应的中心坐标时,将中心坐标小于等于预设误差值的第一文本框融合为一个新文本框。其中,新文本框的左下角坐标取所融合的第一文本框中目标文本坐标的最小x值和最小y值,新文本框的右上角坐标取所融合的第一文本框中目标文本坐标的最大x值和最大y值,新文本框的右下角坐标取所融合的第一文本框中目标文本坐标的最大x值和最小y值,新文本框的左上角坐标取所融合的第一文本框中目标文本坐标的最小x值和最大y值。将中心坐标大于预设误差值的第一文本框则确定为固定文本框。In this embodiment, the first text boxes are the text boxes obtained by detecting the target picture according to the first labeling model, and the center coordinates are the mean coordinates of each first text box in the target detection picture. The x mean and y mean of the target text coordinates of each first text box in the target detection picture are calculated, and the x mean and y mean are taken as the center coordinates of the corresponding first text box. When the center coordinates corresponding to each first text box are obtained, the first text boxes whose center coordinates differ by no more than the preset error value are fused into one new text box. The lower-left corner coordinates of the new text box take the minimum x value and minimum y value of the target text coordinates in the fused first text boxes; the upper-right corner takes the maximum x value and maximum y value; the lower-right corner takes the maximum x value and minimum y value; and the upper-left corner takes the minimum x value and maximum y value. A first text box whose center coordinates exceed the preset error value is determined as a fixed text box.
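The center calculation and corner rule described above can be sketched as follows. This is a simplified illustration assuming each text box is given as a list of its four (x, y) corner coordinates; which boxes belong in one fused group is decided elsewhere by the preset error value.

```python
def box_center(corners):
    # Center of a text box given its four corner coordinates [(x, y), ...]:
    # the mean x value and mean y value, as described above.
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    return (sum(xs) / len(xs), sum(ys) / len(ys))

def fuse_boxes(boxes):
    # Fuse a group of first text boxes into one new text box.  The new
    # corners follow the rule above: bottom-left = (min x, min y),
    # top-right = (max x, max y), bottom-right = (max x, min y),
    # top-left = (min x, max y).
    xs = [x for box in boxes for x, _ in box]
    ys = [y for box in boxes for _, y in box]
    return {
        "bottom_left": (min(xs), min(ys)),
        "bottom_right": (max(xs), min(ys)),
        "top_left": (min(xs), max(ys)),
        "top_right": (max(xs), max(ys)),
    }
```

A box that ends up in a group by itself is simply kept as a fixed text box.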
步骤S204,提取所述新文本框和所述固定文本框中的文本信息,确定所述文本信息为所述目标检测图片的检测文本。Step S204: Extract the text information in the new text box and the fixed text box, and determine that the text information is the detection text of the target detection picture.
在本实施例中,在得到新文本框和固定文本框时,提取该新文本框和固定文本框中的文本信息,将该文本信息按照文本框的排列顺序进行排列,即得到目标检测图片的检测文本。In this embodiment, when the new text boxes and the fixed text boxes are obtained, the text information in the new text boxes and the fixed text boxes is extracted, and the text information is arranged in the order of the text boxes to obtain the detection text of the target detection picture.
需要强调的是,为进一步保证上述检测文本的私密和安全性,上述检测文本还可以存储于一区块链的节点中。It should be emphasized that, in order to further ensure the privacy and security of the above detection text, the above detection text can also be stored in a node of a blockchain.
本申请所指区块链是分布式数据存储、点对点传输、共识机制、加密算法等计算机技术的新型应用模式。区块链(Blockchain),本质上是一个去中心化的数据库,是一串使用密码学方法相关联产生的数据块,每一个数据块中包含了一批次网络交易的信息,用于验证其信息的有效性(防伪)和生成下一个区块。区块链可以包括区块链底层平台、平台产品服务层以及应用服务层等。The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated in association with one another using cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity of its information (anti-counterfeiting) and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
本实施例实现了对不同复杂度图片的文字检测,减少了人工标注成本,节省了模型处理的响应时长,进一步提高了图片文字检测的效率和准确率。This embodiment realizes text detection for pictures of different complexity, reduces the cost of manual labeling, saves the response time of model processing, and further improves the efficiency and accuracy of text detection in pictures.
在本申请的一些实施例中,上述预设误差值包括第一误差值和第二误差值,所述将所述中心坐标小于等于预设误差值的第一文本框融合为新文本框包括:In some embodiments of the present application, the preset error value includes a first error value and a second error value, and the fusion of the first text box whose center coordinates are less than or equal to the preset error value into a new text box includes:
获取相邻的两个所述中心坐标的y轴坐标的第一像素差值,以及所述中心坐标的x轴坐标的第二像素差值;Obtain the first pixel difference value of the y-axis coordinates of the two adjacent center coordinates, and the second pixel difference value of the x-axis coordinates of the center coordinates;
将所述第一像素差值小于等于所述第一误差值,且所述第二像素差值小于等于所述第二误差值的第一文本框融合为新文本框。A first text box whose first pixel difference value is less than or equal to the first error value and whose second pixel difference value is less than or equal to the second error value is merged into a new text box.
在本实施例中,预设误差值包括第一误差值和第二误差值,在得到每个第一文本框对应的中心坐标时,依次获取相邻两个中心坐标的y轴坐标之间的第一像素差值,以及该两个中心坐标的x轴坐标之间的第二像素差值。该第一像素差值即为两个中心点坐标y轴坐标之间的像素差值,第二像素差值即为两个中心点坐标x轴坐标之间的像素差值。将中心坐标之间的第一像素差值小于等于第一误差值,且第二像素差值小于等于第二误差值的第一文本框进行融合即得到新文本框。In this embodiment, the preset error value includes a first error value and a second error value. When the center coordinates corresponding to each first text box are obtained, the first pixel difference between the y-axis coordinates of two adjacent center coordinates, and the second pixel difference between the x-axis coordinates of those two center coordinates, are obtained in turn. The first pixel difference is the pixel difference between the y-axis coordinates of the two center points, and the second pixel difference is the pixel difference between their x-axis coordinates. First text boxes whose first pixel difference is less than or equal to the first error value and whose second pixel difference is less than or equal to the second error value are fused to obtain a new text box.
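The two-threshold criterion above can be expressed as a small predicate. This is a sketch assuming centers are (x, y) tuples and both error values are given in pixels.

```python
def should_fuse(center_a, center_b, err_y, err_x):
    # Adjacent first text boxes are fused when the pixel difference of
    # their center y coordinates is within the first error value AND the
    # pixel difference of their center x coordinates is within the
    # second error value.
    dy = abs(center_a[1] - center_b[1])  # first pixel difference
    dx = abs(center_a[0] - center_b[0])  # second pixel difference
    return dy <= err_y and dx <= err_x
```

Boxes on the same text line typically differ little in y and moderately in x, which is why the y and x tolerances are separate values.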
本实施例通过对文本框进行融合,实现了对误差较小的文本的组合,避免了在对低复杂度的图片进行文字检测过程中文字的错误拆分,进一步提高了图片文字检测的精确度。By fusing text boxes, this embodiment realizes the combination of texts with small errors, avoids wrongly splitting text when performing text detection on low-complexity pictures, and further improves the accuracy of picture text detection.
在本申请的一些实施例中,在上述根据预设检测模型计算所述目标检测图片的复杂度之后包括:In some embodiments of the present application, after calculating the complexity of the target detection picture according to the preset detection model, the method includes:
在所述复杂度为高复杂度时,根据预设标注模型中的第二标注模型获取所述目标检测图片对应的最小图片,以及所述最小图片中第二文本框的最小文本坐标;When the complexity is high complexity, obtain the minimum picture corresponding to the target detection picture and the minimum text coordinates of the second text box in the minimum picture according to the second labeling model in the preset labeling model;
将所述最小文本坐标并行映射至所述目标检测图片对应的最大图片,得到所述目标检测图片的检测文本坐标,根据所述检测文本坐标计算得到所述目标检测图片对应的检测文本。The minimum text coordinates are mapped to the largest picture corresponding to the target detection picture in parallel to obtain the detected text coordinates of the target detection picture, and the detected text corresponding to the target detection picture is calculated according to the detected text coordinates.
在本实施例中,在目标检测图片的复杂度为高复杂度时,根据预设标注模型中的第二标注模型获取目标检测图片对应的最小图片中第二文本框的最小文本坐标。其中,第二文本框为根据第二标注模型对目标检测图片进行检测得到的文本框,第二标注模型则为预先训练完成的高复杂度的标注模型。根据该第二标注模型获取目标检测图片对应的最小图片,该最小图片为该目标检测图片缩放后的最小图片,通过第二标注模型可以对目标检测图片进行像素缩放,从而得到最小图片。在得到最小图片时,基于该第二标注模型对该最小图片中的第二文本框进行检测,由此即可得到该最小图片中第二文本框对应的最小文本坐标。在得到该最小文本坐标时,将该最小文本坐标映射至目标检测图片对应的最大图片,即同时将得到的所有最小文本坐标按照最小图片与最大图片之间的预设映射比例进行放大,则得到检测文本坐标。获取该检测文本坐标中的文本内容,即得到该目标检测图片的检测文本。In this embodiment, when the complexity of the target detection picture is high complexity, the minimum text coordinates of the second text boxes in the minimum picture corresponding to the target detection picture are obtained according to the second labeling model in the preset labeling model. The second text boxes are the text boxes obtained by detecting the target detection picture according to the second labeling model, and the second labeling model is a pre-trained high-complexity labeling model. The minimum picture corresponding to the target detection picture is obtained according to the second labeling model; the minimum picture is the scaled-down version of the target detection picture, and the second labeling model can perform pixel scaling on the target detection picture to obtain it. When the minimum picture is obtained, the second text boxes in the minimum picture are detected based on the second labeling model, thereby obtaining the minimum text coordinates corresponding to the second text boxes in the minimum picture. When the minimum text coordinates are obtained, they are mapped to the maximum picture corresponding to the target detection picture; that is, all the obtained minimum text coordinates are simultaneously enlarged according to the preset mapping ratio between the minimum picture and the maximum picture, so as to obtain the detected text coordinates. The text content at the detected text coordinates is then acquired to obtain the detection text of the target detection picture.
本实施例通过第二标注模型对复杂度高的图片进行文字检测,实现了对复杂度高图片文字的针对性检测,进一步提高了对复杂度高图片的检测效率及准确率。This embodiment uses the second labeling model to perform text detection on pictures with high complexity, thereby realizing targeted detection of text in pictures with high complexity, and further improving the detection efficiency and accuracy of pictures with high complexity.
在本申请的一些实施例中,上述将所述最小文本坐标并行映射至所述目标检测图片对应的最大图片,得到所述目标检测图片的检测文本坐标包括:In some embodiments of the present application, the above-mentioned parallel mapping of the minimum text coordinates to the maximum picture corresponding to the target detection picture, and obtaining the detected text coordinates of the target detection picture includes:
获取预设映射比例,按照所述预设映射比例将所述最小文本坐标并行放大,得到所述目标检测图片的检测文本坐标。A preset mapping ratio is acquired, and the minimum text coordinates are enlarged in parallel according to the preset mapping ratio to obtain the detected text coordinates of the target detection image.
在本实施例中,在获取复杂度高的目标检测图片对应的检测文本坐标时,可通过获取预设映射比例,根据该预设映射比例将最小文本坐标并行映射至最大图片,得到目标检测图片的检测文本坐标。具体地,预设映射比例为第二标注模型对目标检测图片进行缩放时的预设比例,比值范围为0至1,如将该预设映射比例取0.4。在得到该预设映射比例时,按照该预设映射比例,将得到的所有最小文本坐标同时进行放大,即得到目标检测图片的检测文本坐标。In this embodiment, when obtaining the text coordinates corresponding to a high-complexity target detection picture, a preset mapping ratio can be acquired, and the minimum text coordinates can be mapped to the maximum picture in parallel according to the preset mapping ratio to obtain the detected text coordinates of the target detection picture. Specifically, the preset mapping ratio is the preset ratio at which the second labeling model scales the target detection picture, with a value between 0 and 1; for example, the preset mapping ratio may be 0.4. When the preset mapping ratio is obtained, all the obtained minimum text coordinates are enlarged simultaneously according to the preset mapping ratio, so as to obtain the detected text coordinates of the target detection picture.
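The parallel mapping back to the full-size picture amounts to dividing every detected coordinate by the mapping ratio. This sketch assumes the minimum picture was produced by uniformly scaling the original by the preset mapping ratio (0.4 in the example above).

```python
def map_to_original(min_coords, mapping_ratio=0.4):
    # min_coords: text coordinates detected on the shrunken (minimum)
    # picture, as a list of boxes, each box a list of (x, y) points.
    # Since the minimum picture is the original scaled by mapping_ratio,
    # dividing each coordinate by the ratio maps every box back to the
    # full-size (maximum) picture in one pass ("in parallel").
    return [[(x / mapping_ratio, y / mapping_ratio) for x, y in box]
            for box in min_coords]
```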
本实施例通过按照预设映射比例对最小文本坐标进行放大,实现了对目标检测图片的检测文本坐标的精确获取,使得通过检测文本坐标能够对目标检测图片的文本信息进行精确定位,避免了在对复杂度高的图片进行文字检测时可能出现的文字错乱。In this embodiment, by enlarging the minimum text coordinates according to the preset mapping ratio, the detected text coordinates of the target detection picture are accurately acquired, so that the text information of the target detection picture can be precisely located through the detected text coordinates, avoiding the text confusion that may occur when text detection is performed on high-complexity pictures.
在本申请的一些实施例中,上述根据预设检测模型计算所述目标检测图片的复杂度包括:In some embodiments of the present application, the above calculation of the complexity of the target detection picture according to the preset detection model includes:
输入所述目标检测图片至所述预设检测模型的卷积层,经过池化层和全连接层,输出得到检测结果值;Inputting the target detection picture to the convolution layer of the preset detection model, and outputting the detection result value through the pooling layer and the fully connected layer;
根据预设的二分类损失函数对该检测结果值进行预测,得到所述目标检测图片的复杂度。The detection result value is predicted according to a preset two-class loss function to obtain the complexity of the target detection picture.
在本实施例中,预设检测模型包括卷积层、池化层和全连接层。在得到目标检测图片时,则获取该目标检测图片的长、宽和通道数。将该目标检测图片的长、宽和通道数输入至预设检测模型中的卷积层,之后经过池化层和全连接层,输出得到目标检测图片的检测结果值。在得到检测结果值时,将该检测结果值通过预设的二分类损失函数计算,即得到当前目标检测图片的复杂度。其中,复杂度可用p表示,p的范围为0至1之间,p越大,则表示目标检测图片中的文字越小,字与字之间的间隔越小,目标检测图片的复杂度越高;p越小,则表示目标检测图片中的文字越大,字与字之间的间隔越大,目标检测图片的复杂度越低。In this embodiment, the preset detection model includes convolution layers, pooling layers and fully connected layers. When the target detection picture is obtained, its length, width and number of channels are acquired and input into the convolution layers of the preset detection model; after the pooling layers and fully connected layers, the detection result value of the target detection picture is output. When the detection result value is obtained, it is evaluated with the preset two-class classification loss function to obtain the complexity of the current target detection picture. The complexity can be denoted by p, where p ranges from 0 to 1. The larger p is, the smaller the text in the target detection picture, the smaller the spacing between characters, and the higher the complexity of the target detection picture; the smaller p is, the larger the text, the larger the spacing between characters, and the lower the complexity of the target detection picture.
本实施例通过在得到目标检测图片时,计算该目标检测图片的复杂度,实现了根据复杂度对目标检测图片的分类检测,进一步提高了目标检测图片的检测效率。In this embodiment, when the target detection picture is obtained, the complexity of the target detection picture is calculated, so as to realize the classification and detection of the target detection picture according to the complexity, and further improve the detection efficiency of the target detection picture.
在本申请的一些实施例中,在上述根据预设标注模型中的第一标注模型获取所述目标检测图片的特征向量的步骤之前还包括:In some embodiments of the present application, before the above step of acquiring the feature vector of the target detection picture according to the first annotation model in the preset annotation model, the method further includes:
获取初始文本图片,划分所述初始文本图片为训练图片和测试图片,输入所述训练图片至预设的基础标注模型中,得到所述训练图片的标注文本坐标;Obtaining an initial text picture, dividing the initial text picture into a training picture and a test picture, inputting the training picture into a preset basic labeling model, and obtaining the labeling text coordinates of the training picture;
根据所述标注文本坐标计算所述基础标注模型的损失函数,在所述损失函数收敛时,确定所述基础标注模型为训练后的基础标注模型;Calculate the loss function of the basic labeling model according to the coordinates of the labeling text, and when the loss function converges, determine that the basic labeling model is the trained basic labeling model;
根据所述测试图片对所述训练后的基础标注模型进行验证,在所述训练后的基础标注模型对所述测试图片的验证通过率大于等于预设通过率时,确定所述训练后的基础标注模型为预设标注模型。The trained basic labeling model is verified according to the test pictures, and when the verification pass rate of the trained basic labeling model on the test pictures is greater than or equal to a preset pass rate, the trained basic labeling model is determined to be the preset labeling model.
在本实施例中,在根据预设标注模型对目标检测图片进行标注之前,需要预先建立基础标注模型,对该基础标注模型进行训练,得到预设标注模型。预设标注模型包括第一标注模型和第二标注模,第一标注模型用于处理低复杂度的目标检测图片,第二标注模型用于处理高复杂度的目标检测图片。第一标注模型和第二标注模型具有不同的网络结构,但第一标注模型和第二标注模型均可通过同样的训练方式训练得到。具体地,获取初始文本图片,初始文本图片为预先采集的多张文本图片,将该初始文本图片划分为训练图片和测试图片。基于基础标注模型对训练图片的初始文本坐标进行检测,该基础标注模型可以为第一标注模型的网络结构也可以为第二标注模型的网络结构。根据该基础标注模型检测得到训练图片的初始文本坐标,同时根据预设标注工具对训练图片进行标注,得到训练图片的标注文本坐标。根据该初始文本坐标和标注文本坐标对基础标注模型进行训练,即根据该标注文本坐标与初始文本坐标计算基础标注模型的损失函数,在该损失函数收敛时,则得到训练后的基础标注模型。在得到训练后的基础标注模型时,根据测试图片对该训练后的基础标注模型进行测试。若测试图片通过该训练后的基础标注模型检测得到的初始文本坐标,与该测试图片对应的标注文本坐标的相似度大于等于预设相似阈值,则确定该训练后的基础标注模型对该测试图片验证通过。在该训练后的基础标注模型对测试图片的验证通过率大于等于预设通过率时,确定训练后的基础标注模型为预设标注模型。In this embodiment, before annotating the target detection picture according to the preset annotation model, a basic annotation model needs to be established in advance, and the basic annotation model is trained to obtain the preset annotation model. The preset labeling model includes a first labeling model and a second labeling model. The first labeling model is used for processing low-complexity target detection pictures, and the second labeling model is used for processing high-complexity target detection pictures. The first labeling model and the second labeling model have different network structures, but both the first labeling model and the second labeling model can be trained by the same training method. Specifically, an initial text picture is obtained, where the initial text picture is a plurality of pre-collected text pictures, and the initial text picture is divided into a training picture and a test picture. The initial text coordinates of the training picture are detected based on a basic labeling model, where the basic labeling model may be the network structure of the first labeling model or the network structure of the second labeling model. 
According to the basic labeling model, the initial text coordinates of the training pictures are detected; at the same time, the training pictures are labeled with a preset labeling tool to obtain the labeled text coordinates of the training pictures. The basic labeling model is trained according to the initial text coordinates and the labeled text coordinates, that is, the loss function of the basic labeling model is calculated from the labeled text coordinates and the initial text coordinates; when the loss function converges, the trained basic labeling model is obtained. When the trained basic labeling model is obtained, it is tested with the test pictures. If the similarity between the initial text coordinates detected by the trained basic labeling model for a test picture and the labeled text coordinates corresponding to that test picture is greater than or equal to a preset similarity threshold, it is determined that the trained basic labeling model passes verification on that test picture. When the verification pass rate of the trained basic labeling model on the test pictures is greater than or equal to the preset pass rate, the trained basic labeling model is determined to be the preset labeling model.
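The acceptance check at the end of training can be sketched as follows. The per-picture similarity scores are assumed to be precomputed (the patent does not specify the similarity measure), and both threshold values are hypothetical placeholders for the preset similarity threshold and preset pass rate.

```python
def model_passes_validation(similarities, similarity_threshold=0.9,
                            preset_pass_rate=0.95):
    # similarities: one similarity score per test picture, comparing the
    # text coordinates detected by the trained model with the labeled
    # text coordinates for that picture.
    passed = sum(1 for s in similarities if s >= similarity_threshold)
    pass_rate = passed / len(similarities)
    # The trained model becomes the preset labeling model only when the
    # verification pass rate reaches the preset pass rate.
    return pass_rate >= preset_pass_rate
```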
本实施例通过预先对基础标注模型进行训练,使得训练得到的预设标注模型能够对图片进行精确地文字检测,节省了图片文字检测的标注时长,提高了图片文字检测的效率。In this embodiment, the basic labeling model is trained in advance, so that the preset labeling model obtained by training can accurately detect the text of the picture, save the labeling time of the picture and text detection, and improve the efficiency of the picture and text detection.
在本申请的一些实施例中,上述根据所述标注文本坐标计算所述基础标注模型的损失函数包括:In some embodiments of the present application, the above-mentioned calculation of the loss function of the basic annotation model according to the coordinates of the annotation text includes:
基于预设标注工具对所述训练图片进行标注,得到所述训练图片的初始文本坐标;Annotate the training picture based on a preset labeling tool to obtain initial text coordinates of the training picture;
计算所述初始文本坐标和所述标注文本坐标的平方差,根据所述平方差计算得到所述基础标注模型的损失函数。Calculate the squared difference between the initial text coordinates and the labeled text coordinates, and calculate the loss function of the basic labeling model according to the squared difference.
在本实施例中,在得到训练图片的初始文本坐标时,根据预设标注工具对训练图片进行标注,得到该训练图片的标注文本坐标。计算该初始文本坐标和标注文本坐标的平方差,根据平方差即可计算得到该基础标注模型的损失函数。该基础标注模型的损失函数的计算公式如下所示:In this embodiment, when the initial text coordinates of the training picture are obtained, the training picture is labeled according to a preset labeling tool, and the labeled text coordinates of the training picture are obtained. Calculate the squared difference between the initial text coordinates and the labeled text coordinates, and then calculate the loss function of the basic labeling model according to the squared difference. The calculation formula of the loss function of the basic annotation model is as follows:
L = ∑_k (ο_k − ô_k)²
其中,ο_k为初始文本坐标,ô_k为标注文本坐标。where ο_k is the initial text coordinate and ô_k is the labeled text coordinate.
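The squared-difference loss can be computed directly from the two coordinate lists. This is a sketch assuming both sets of text coordinates are flattened into equal-length sequences of numbers.

```python
def squared_difference_loss(initial_coords, labeled_coords):
    # Sum of squared differences between the initial text coordinates
    # detected by the basic labeling model and the labeled text
    # coordinates produced by the labeling tool.
    assert len(initial_coords) == len(labeled_coords)
    return sum((o - o_hat) ** 2
               for o, o_hat in zip(initial_coords, labeled_coords))
```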
本实施例通过对基础标注模型的损失函数进行计算,节省了基础标注模型的训练时长,提高了基础标注模型的训练效率。In this embodiment, by calculating the loss function of the basic labeling model, the training time of the basic labeling model is saved, and the training efficiency of the basic labeling model is improved.
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程,是可以通过计算机可读指令来指令相关的硬件来完成,该计算机可读指令可存储于一计算机可读取存储介质中,该计算机可读指令在执行时,可包括如上述各方法的实施例的流程。其中,前述的存储介质可为磁碟、光盘、只读存储记忆体(Read-Only Memory,ROM)等非易失性存储介质,或随机存储记忆体(Random Access Memory,RAM)等。Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by instructing relevant hardware through computer-readable instructions, and the computer-readable instructions can be stored in a computer-readable storage medium. , when the computer-readable instructions are executed, the processes of the above-mentioned method embodiments may be included. Wherein, the aforementioned storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, a read-only memory (Read-Only Memory, ROM), or a random access memory (Random Access Memory, RAM) or the like.
应该理解的是,虽然附图的流程图中的各个步骤按照箭头的指示依次显示,但是这些步骤并不是必然按照箭头指示的顺序依次执行。除非本文中有明确的说明,这些步骤的执行并没有严格的顺序限制,其可以以其他的顺序执行。而且,附图的流程图中的至少一部分步骤可以包括多个子步骤或者多个阶段,这些子步骤或者阶段并不必然是在同一时刻执行完成,而是可以在不同的时刻执行,其执行顺序也不必然是依次进行,而是可以与其他步骤或者其他步骤的子步骤或者阶段的至少一部分轮流或者交替地执行。It should be understood that although the various steps in the flowchart of the accompanying drawings are sequentially shown in the order indicated by the arrows, these steps are not necessarily executed in sequence in the order indicated by the arrows. Unless explicitly stated herein, the execution of these steps is not strictly limited to the order and may be performed in other orders. Moreover, at least a part of the steps in the flowchart of the accompanying drawings may include multiple sub-steps or multiple stages, and these sub-steps or stages are not necessarily executed at the same time, but may be executed at different times, and the execution sequence is also It does not have to be performed sequentially, but may be performed alternately or alternately with other steps or at least a portion of sub-steps or stages of other steps.
进一步参考图3,作为对上述图2所示方法的实现,本申请提供了一种图片文字检测装置的一个实施例,该装置实施例与图2所示的方法实施例相对应,该装置具体可以应用于各种电子设备中。Referring further to FIG. 3, as an implementation of the method shown in FIG. 2 above, the present application provides an embodiment of a picture text detection apparatus. The apparatus embodiment corresponds to the method embodiment shown in FIG. 2, and the apparatus can be specifically applied to various electronic devices.
如图3所示,本实施例所述的图片文字检测装置300包括:检测模块301、标注模块302、确认模块303以及提取模块304。其中:As shown in FIG. 3, the picture text detection apparatus 300 in this embodiment includes: a detection module 301, a labeling module 302, a confirmation module 303, and an extraction module 304. Wherein:
检测模块301,用于在接收到目标检测图片时,根据预设检测模型计算所述目标检测图片的复杂度;A detection module 301, configured to calculate the complexity of the target detection picture according to a preset detection model when the target detection picture is received;
其中,所述检测模块301包括:Wherein, the detection module 301 includes:
第一计算单元,用于输入所述目标检测图片至所述预设检测模型的卷积层,经过池化层和全连接层,输出得到检测结果值;a first computing unit, used for inputting the target detection picture to the convolution layer of the preset detection model, and outputting the detection result value through the pooling layer and the fully connected layer;
第二计算单元,用于根据预设的二分类损失函数对该检测结果值进行预测,得到所述目标检测图片的复杂度。The second calculation unit is configured to predict the detection result value according to the preset binary classification loss function, and obtain the complexity of the target detection picture.
在本实施例中,目标检测图片为包括有目标文本的检测图片,根据预设检测模型计算该目标检测图片的复杂度;其中,预设检测模型为预先设定的图片复杂度检测模型,如基于VGG16的轻量级卷积神经网络判别模型。具体地,将目标检测图片输入至该预设检测模型中,基于该预设检测模型的卷积层、池化层和全连接层对该目标检测图片的长、宽、通道数进行计算,输出得到该目标检测图片的检测结果值;之后根据二分类损失函数对该检测结果值进行计算,即得到当前目标检测图片的复杂度。In this embodiment, the target detection picture is a detection picture that includes target text, and the complexity of the target detection picture is calculated according to a preset detection model, where the preset detection model is a preset picture-complexity detection model, such as a lightweight convolutional neural network discrimination model based on VGG16. Specifically, the target detection picture is input into the preset detection model; the length, width and number of channels of the target detection picture are processed by the convolution layers, pooling layers and fully connected layers of the preset detection model, and the detection result value of the target detection picture is output. The detection result value is then evaluated with the two-class classification loss function to obtain the complexity of the current target detection picture.
The labeling module 302 is configured to, when the complexity is low, obtain a feature vector of the target detection picture according to a first labeling model among the preset labeling models, and to compute from this feature vector the target text coordinates of the first text boxes in the target detection picture.
In this embodiment, complexity is divided into low and high by a preset value: a complexity less than or equal to the preset value is low, and a complexity greater than the preset value is high. When the complexity of the target detection picture is low, the target text coordinates of the picture are obtained according to the first labeling model among the preset labeling models. A preset labeling model is a pre-configured text coordinate detection model; the preset labeling models include a first labeling model and a second labeling model. Detecting a low-complexity target detection picture with the first labeling model yields the target text coordinates of the picture; detecting a high-complexity target detection picture with the second labeling model yields the detected text coordinates of the picture. From the target text coordinates and the detected text coordinates, the detected text of low-complexity and high-complexity target detection pictures, respectively, can be obtained. Specifically, both the target text coordinates and the detected text coordinates consist of the lower-left, lower-right, upper-left, and upper-right corner coordinates of each text box in the target detection picture.

When the complexity of the target detection picture is low, a feature map of the target detection picture and preset detection anchor boxes are acquired. The feature map and the anchor boxes are processed by the first labeling model to obtain the feature vector of the target detection picture; this feature vector is then passed through the bidirectional long short-term memory network, the fully connected layer, and the regression layer of the first labeling model, which output the target text coordinates of the current target detection picture.
The confirmation module 303 is configured to calculate, according to the target text coordinates, the center coordinates of the first text boxes in the target detection picture; to fuse first text boxes whose center coordinates are within a preset error value of one another into a new text box; and to determine first text boxes whose center coordinates differ by more than the preset error value as fixed text boxes.
The preset error value includes a first error value and a second error value, and the confirmation module 303 includes:

an acquisition unit, configured to acquire a first pixel difference between the y-axis coordinates of two adjacent center coordinates, and a second pixel difference between the x-axis coordinates of those center coordinates;

a confirmation unit, configured to fuse first text boxes for which the first pixel difference is less than or equal to the first error value and the second pixel difference is less than or equal to the second error value into a new text box.
In this embodiment, a first text box is a text box obtained by detecting the target picture with the first labeling model, and a center coordinate is the mean coordinate of a first text box in the target detection picture. The x mean and y mean of the target text coordinates of each first text box are calculated, and this (x mean, y mean) pair is taken as the center coordinate of the corresponding first text box. Once the center coordinate of each first text box is obtained, first text boxes whose center coordinates are within the preset error value of one another are fused into one new text box. The lower-left corner of the new text box takes the minimum x value and minimum y value of the target text coordinates of the fused first text boxes; the upper-right corner takes the maximum x value and maximum y value; the lower-right corner takes the maximum x value and minimum y value; and the upper-left corner takes the minimum x value and maximum y value. First text boxes whose center coordinates differ by more than the preset error value are determined as fixed text boxes.
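A minimal sketch of this fusion step, under stated assumptions: it uses a greedy pass over boxes ordered by center coordinate as a stand-in for the embodiment's adjacent-center comparison, and the grouping strategy and function names are illustrative. Each box is a list of four (x, y) corners; a fused (or fixed) box is rebuilt from the minimum and maximum x and y values exactly as described.

```python
def box_center(box):
    # Center coordinate of a text box: mean x and mean y of its four
    # corner coordinates [(x, y), ...], as in the embodiment.
    xs = [p[0] for p in box]
    ys = [p[1] for p in box]
    return (sum(xs) / 4.0, sum(ys) / 4.0)

def fuse_text_boxes(boxes, err_x, err_y):
    # Greedy fusion: walk the boxes in order of center coordinate and
    # group consecutive boxes whose centers differ by at most err_x
    # pixels in x and err_y pixels in y.  Groups of one are the "fixed"
    # text boxes; larger groups become a single new text box.
    if not boxes:
        return []
    centers = [box_center(b) for b in boxes]
    order = sorted(range(len(boxes)), key=lambda i: centers[i])
    groups, current = [], [order[0]]
    for prev, cur in zip(order, order[1:]):
        dx = abs(centers[cur][0] - centers[prev][0])
        dy = abs(centers[cur][1] - centers[prev][1])
        if dx <= err_x and dy <= err_y:
            current.append(cur)
        else:
            groups.append(current)
            current = [cur]
    groups.append(current)
    fused = []
    for group in groups:
        pts = [p for i in group for p in boxes[i]]
        xs, ys = [p[0] for p in pts], [p[1] for p in pts]
        # Corners built from the extreme values as described:
        # lower-left, lower-right, upper-right, upper-left
        # (y is taken to grow upward in this sketch).
        fused.append([(min(xs), min(ys)), (max(xs), min(ys)),
                      (max(xs), max(ys)), (min(xs), max(ys))])
    return fused
```

For an axis-aligned fixed box, the min/max reconstruction reproduces the original corners unchanged, so fixed and fused boxes come out in one uniform representation.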
The extraction module 304 is configured to extract the text information in the new text boxes and the fixed text boxes, and to determine this text information as the detected text of the target detection picture.

In this embodiment, once the new text boxes and fixed text boxes are obtained, the text information in them is extracted and arranged in the order in which the text boxes appear, yielding the detected text of the target detection picture.
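The embodiment only states that the extracted text follows the order of the text boxes; a common reading-order convention (top-to-bottom lines, left-to-right within a line, with y growing downward as in image coordinates) is assumed in this sketch, and the line tolerance is illustrative.

```python
def arrange_detected_text(boxes_with_text, line_tol=10):
    # boxes_with_text: list of (box, text) pairs, where box is a list of
    # (x, y) corners.  Boxes whose top edges fall in the same horizontal
    # band of height line_tol are treated as one line and read left to
    # right; lines are read top to bottom (y grows downward here).
    def reading_order(item):
        box, _ = item
        top = min(p[1] for p in box)
        left = min(p[0] for p in box)
        return (top // line_tol, left)
    ordered = sorted(boxes_with_text, key=reading_order)
    return " ".join(text for _, text in ordered)
```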
It should be emphasized that, to further ensure the privacy and security of the detected text, the detected text may also be stored in a node of a blockchain.
The blockchain referred to in this application is a novel application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated in cryptographic association with one another, each data block containing a batch of network transaction information used to verify the validity of that information (anti-counterfeiting) and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer, an application service layer, and so on.
The image text detection apparatus proposed in this embodiment further includes:
an acquisition module, configured to, when the complexity is high, obtain a minimum picture corresponding to the target detection picture according to a second labeling model among the preset labeling models, together with the minimum text coordinates of the second text boxes in the minimum picture;

a mapping module, configured to map the minimum text coordinates in parallel onto the maximum picture corresponding to the target detection picture to obtain the detected text coordinates of the target detection picture, and to compute the detected text corresponding to the target detection picture from the detected text coordinates.

The mapping module includes:

a mapping unit, configured to acquire a preset mapping ratio and enlarge the minimum text coordinates in parallel according to the preset mapping ratio, obtaining the detected text coordinates of the target detection picture.
In this embodiment, when the complexity of the target detection picture is high, the minimum text coordinates of the second text boxes in the minimum picture corresponding to the target detection picture are obtained according to the second labeling model among the preset labeling models. A second text box is a text box obtained by detecting the target detection picture with the second labeling model, and the second labeling model is a pre-trained labeling model for high-complexity pictures. The minimum picture corresponding to the target detection picture is obtained according to the second labeling model: it is the smallest scaled-down version of the target detection picture, produced by pixel scaling through the second labeling model. Once the minimum picture is obtained, the second text boxes in it are detected with the second labeling model, yielding the minimum text coordinates corresponding to each second text box. The minimum text coordinates are then mapped onto the maximum picture corresponding to the target detection picture; that is, all the obtained minimum text coordinates are simultaneously enlarged according to the preset mapping ratio between the minimum picture and the maximum picture, yielding the detected text coordinates. Obtaining the text content at the detected text coordinates yields the detected text of the target detection picture.
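The parallel mapping back to the maximum picture is a uniform enlargement of every coordinate by the preset mapping ratio, which can be sketched directly. The ratio is assumed here to be a single scalar; in the embodiment it is fixed by the scaling between the minimum and maximum pictures.

```python
def map_to_max_picture(min_coords, ratio):
    # Enlarge every minimum text coordinate in parallel by the preset
    # mapping ratio between the minimum and the maximum picture, giving
    # the detected text coordinates on the full-size picture.
    # min_coords: list of boxes, each box a list of (x, y) corners.
    return [[(x * ratio, y * ratio) for (x, y) in box] for box in min_coords]
```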
a division module, configured to obtain initial text pictures, divide the initial text pictures into training pictures and test pictures, and input the training pictures into a preset basic labeling model to obtain the labeled text coordinates of the training pictures;

a training module, configured to calculate a loss function of the basic labeling model according to the labeled text coordinates and, when the loss function converges, determine the basic labeling model to be the trained basic labeling model;

a verification module, configured to verify the trained basic labeling model with the test pictures and, when the verification pass rate of the trained basic labeling model on the test pictures is greater than or equal to a preset pass rate, determine the trained basic labeling model to be the preset labeling model.

The training module includes:

a labeling unit, configured to label the training pictures with a preset labeling tool to obtain the initial text coordinates of the training pictures;

a third computing unit, configured to calculate the squared difference between the initial text coordinates and the labeled text coordinates, and to compute the loss function of the basic labeling model from the squared difference.
In this embodiment, before the target detection picture is labeled according to a preset labeling model, a basic labeling model is first established and trained to obtain the preset labeling models. The preset labeling models include a first labeling model and a second labeling model: the first labeling model handles low-complexity target detection pictures, and the second labeling model handles high-complexity target detection pictures. The two models have different network structures, but both can be obtained with the same training procedure. Specifically, initial text pictures, which are a set of pre-collected text pictures, are acquired and divided into training pictures and test pictures. The basic labeling model, whose network structure may be either that of the first labeling model or that of the second labeling model, is run on the training pictures to obtain their labeled text coordinates; at the same time, the training pictures are annotated with a preset labeling tool to obtain their initial text coordinates. The basic labeling model is trained with the initial text coordinates and the labeled text coordinates; that is, its loss function is calculated from these two sets of coordinates, and when the loss function converges, the trained basic labeling model is obtained. The trained basic labeling model is then verified with the test pictures. If the similarity between the labeled text coordinates the trained basic labeling model detects for a test picture and the initial text coordinates of that test picture obtained with the labeling tool is greater than or equal to a preset similarity threshold, the trained basic labeling model is determined to have passed verification on that test picture. When the verification pass rate of the trained basic labeling model on the test pictures is greater than or equal to the preset pass rate, the trained basic labeling model is determined to be the preset labeling model.
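The loss computation described above can be sketched as follows. The embodiment specifies only the squared difference between the two sets of coordinates; the mean reduction and the flattening of boxes into a coordinate list are assumptions made for illustration.

```python
def coordinate_loss(predicted, labeled):
    # Squared-difference loss between two sets of text-box coordinates:
    # each argument is a list of boxes, each box a list of (x, y)
    # corners.  The mean reduction is an assumption; the embodiment
    # specifies only the squared difference itself.
    flat_pred = [v for box in predicted for point in box for v in point]
    flat_ref = [v for box in labeled for point in box for v in point]
    if len(flat_pred) != len(flat_ref):
        raise ValueError("coordinate sets must align box-for-box")
    diffs = [(p - r) ** 2 for p, r in zip(flat_pred, flat_ref)]
    return sum(diffs) / len(diffs)
```

Training then amounts to minimizing this value until it stops decreasing, at which point the loss is considered converged and the trained basic labeling model is fixed.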
The image text detection apparatus proposed in this embodiment achieves text detection on pictures of different complexity, reduces manual labeling costs, shortens the response time of model processing, and further improves the efficiency and accuracy of image text detection.
To solve the technical problems above, an embodiment of the present application further provides a computer device. Refer to FIG. 4, which is a block diagram of the basic structure of the computer device of this embodiment.
The computer device 6 includes a memory 61, a processor 62, and a network interface 63 that communicate with one another via a system bus. It should be noted that the figure shows only a computer device 6 having components 61-63; it should be understood that implementing all of the illustrated components is not required, and more or fewer components may be implemented instead. Those skilled in the art will understand that the computer device here is a device capable of automatically performing numerical computation and/or information processing according to preset or stored instructions, and that its hardware includes, but is not limited to, microprocessors, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), embedded devices, and the like.

The computer device may be a desktop computer, a notebook, a palmtop computer, a cloud server, or other computing device. The computer device may interact with a user through a keyboard, mouse, remote control, touch pad, voice-controlled device, or the like.

The memory 61 includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card-type memory (e.g., SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disc, and the like. The computer-readable storage medium may be non-volatile or volatile. In some embodiments, the memory 61 may be an internal storage unit of the computer device 6, such as a hard disk or main memory of the computer device 6. In other embodiments, the memory 61 may also be an external storage device of the computer device 6, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on the computer device 6. Of course, the memory 61 may also include both an internal storage unit of the computer device 6 and an external storage device. In this embodiment, the memory 61 is generally used to store the operating system and various application software installed on the computer device 6, such as computer-readable instructions of the image text detection method. In addition, the memory 61 may also be used to temporarily store various data that have been output or are to be output.

In some embodiments, the processor 62 may be a central processing unit (CPU), controller, microcontroller, microprocessor, or other data processing chip. The processor 62 is generally used to control the overall operation of the computer device 6. In this embodiment, the processor 62 is configured to run the computer-readable instructions or process the data stored in the memory 61, for example, to run the computer-readable instructions of the image text detection method.

The network interface 63 may include a wireless network interface or a wired network interface and is generally used to establish communication connections between the computer device 6 and other electronic devices.
The computer device proposed in this embodiment achieves text detection on pictures of different complexity, reduces manual labeling costs, shortens the response time of model processing, and further improves the efficiency and accuracy of image text detection.

The present application further provides another embodiment: a computer-readable storage medium storing computer-readable instructions that can be executed by at least one processor to cause the at least one processor to perform the steps of the image text detection method described above.

The computer-readable storage medium proposed in this embodiment achieves text detection on pictures of different complexity, reduces manual labeling costs, shortens the response time of model processing, and further improves the efficiency and accuracy of image text detection.
From the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus the necessary general-purpose hardware platform, and of course also by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored on a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and including several instructions that cause a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the methods described in the various embodiments of the present application.

Obviously, the embodiments described above are only some, not all, of the embodiments of the present application. The accompanying drawings show preferred embodiments of the present application but do not limit the scope of its patent. The present application may be embodied in many different forms; rather, these embodiments are provided so that the disclosure of the present application will be understood thoroughly and completely. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in the foregoing specific embodiments or make equivalent substitutions for some of their technical features. Any equivalent structure made using the contents of the specification and drawings of the present application and applied, directly or indirectly, in other related technical fields falls likewise within the scope of patent protection of the present application.

Claims (20)

  1. An image text detection method, comprising the following steps:
    upon receiving a target detection picture, calculating the complexity of the target detection picture according to a preset detection model;
    when the complexity is low, obtaining a feature vector of the target detection picture according to a first labeling model among preset labeling models, and computing from the feature vector the target text coordinates of first text boxes in the target detection picture;
    calculating, according to the target text coordinates, the center coordinates of the first text boxes in the target detection picture, fusing first text boxes whose center coordinates are within a preset error value of one another into a new text box, and determining first text boxes whose center coordinates differ by more than the preset error value as fixed text boxes;
    extracting the text information in the new text box and the fixed text boxes, and determining the text information to be the detected text of the target detection picture.
  2. The image text detection method according to claim 1, wherein the preset error value comprises a first error value and a second error value, and the step of fusing first text boxes whose center coordinates are within the preset error value of one another into a new text box specifically comprises:
    acquiring a first pixel difference between the y-axis coordinates of two adjacent center coordinates, and a second pixel difference between the x-axis coordinates of those center coordinates;
    fusing first text boxes for which the first pixel difference is less than or equal to the first error value and the second pixel difference is less than or equal to the second error value into a new text box.
  3. The image text detection method according to claim 1, wherein after the step of calculating the complexity of the target detection picture according to the preset detection model, the method comprises:
    when the complexity is high, obtaining, according to a second labeling model among the preset labeling models, a minimum picture corresponding to the target detection picture and the minimum text coordinates of second text boxes in the minimum picture;
    mapping the minimum text coordinates in parallel onto a maximum picture corresponding to the target detection picture to obtain detected text coordinates of the target detection picture, and computing the detected text corresponding to the target detection picture from the detected text coordinates.
  4. The image text detection method according to claim 3, wherein the step of mapping the minimum text coordinates in parallel onto the maximum picture corresponding to the target detection picture to obtain the detected text coordinates of the target detection picture specifically comprises:
    acquiring a preset mapping ratio, and enlarging the minimum text coordinates in parallel according to the preset mapping ratio to obtain the detected text coordinates of the target detection picture.
  5. The image text detection method according to claim 1, wherein the step of calculating the complexity of the target detection picture according to the preset detection model specifically comprises:
    inputting the target detection picture into the convolutional layer of the preset detection model and, after the pooling layer and the fully connected layer, outputting a detection result value;
    predicting the complexity of the target detection picture from the detection result value according to a preset binary classification loss function.
  6. The image text detection method according to claim 1, wherein before the step of obtaining the feature vector of the target detection picture according to the first labeling model among the preset labeling models, the method further comprises:
    obtaining initial text pictures, dividing the initial text pictures into training pictures and test pictures, and inputting the training pictures into a preset basic labeling model to obtain labeled text coordinates of the training pictures;
    calculating a loss function of the basic labeling model according to the labeled text coordinates and, when the loss function converges, determining the basic labeling model to be a trained basic labeling model;
    verifying the trained basic labeling model with the test pictures and, when the verification pass rate of the trained basic labeling model on the test pictures is greater than or equal to a preset pass rate, determining the trained basic labeling model to be the preset labeling model.
  7. The image text detection method according to claim 6, wherein the step of calculating the loss function of the basic labeling model according to the labeled text coordinates specifically comprises:
    labeling the training pictures with a preset labeling tool to obtain initial text coordinates of the training pictures;
    calculating the squared difference between the initial text coordinates and the labeled text coordinates, and computing the loss function of the basic labeling model from the squared difference.
  8. An image text detection apparatus, comprising:
    a detection module, configured to calculate, upon receiving a target detection picture, the complexity of the target detection picture according to a preset detection model;
    a labeling module, configured to, when the complexity is low, obtain a feature vector of the target detection picture according to a first labeling model among preset labeling models, and compute from the feature vector the target text coordinates of first text boxes in the target detection picture;
    a confirmation module, configured to calculate, according to the target text coordinates, the center coordinates of the first text boxes in the target detection picture, fuse first text boxes whose center coordinates are within a preset error value of one another into a new text box, and determine first text boxes whose center coordinates differ by more than the preset error value as fixed text boxes;
    an extraction module, configured to extract the text information in the new text box and the fixed text boxes, and determine the text information to be the detected text of the target detection picture.
  9. A computer device, comprising a memory and a processor, the memory storing computer-readable instructions, wherein the processor, when executing the computer-readable instructions, further implements the following steps:
    when a target detection picture is received, calculating the complexity of the target detection picture according to a preset detection model;
    when the complexity is low, obtaining a feature vector of the target detection picture according to a first labeling model among preset labeling models, and calculating target text coordinates of first text boxes in the target detection picture from the feature vector;
    calculating center coordinates of the target detection picture according to the target text coordinates, fusing first text boxes whose center coordinates are less than or equal to a preset error value into a new text box, and determining first text boxes whose center coordinates are greater than the preset error value as fixed text boxes;
    extracting the text information in the new text box and the fixed text boxes, and determining the text information as the detected text of the target detection picture.
  10. The computer device according to claim 9, wherein the preset error value comprises a first error value and a second error value, and the step of fusing first text boxes whose center coordinates are less than or equal to the preset error value into a new text box specifically comprises:
    obtaining a first pixel difference between the y-axis coordinates of two adjacent center coordinates, and a second pixel difference between the x-axis coordinates of the two adjacent center coordinates;
    fusing the first text boxes for which the first pixel difference is less than or equal to the first error value and the second pixel difference is less than or equal to the second error value into a new text box.
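The merging rule of claim 10 can be sketched as follows. This is a minimal illustration only: the function name, the `(x1, y1, x2, y2)` box representation, and the threshold values are assumptions, not the patent's implementation.

```python
def merge_text_boxes(boxes, dx_max, dy_max):
    """Fuse adjacent text boxes whose center-coordinate pixel differences
    fall within the preset error values (a sketch of claim 10).
    boxes: list of (x1, y1, x2, y2) tuples, assumed in reading order."""
    def center(b):
        return ((b[0] + b[2]) / 2, (b[1] + b[3]) / 2)

    result = []
    current = boxes[0]
    for box in boxes[1:]:
        cx0, cy0 = center(current)
        cx1, cy1 = center(box)
        # first pixel difference (y-axis) and second pixel difference (x-axis)
        if abs(cy1 - cy0) <= dy_max and abs(cx1 - cx0) <= dx_max:
            # fuse into a new text box spanning both boxes
            current = (min(current[0], box[0]), min(current[1], box[1]),
                       max(current[2], box[2]), max(current[3], box[3]))
        else:
            # too far apart: keep the accumulated box as a fixed text box
            result.append(current)
            current = box
    result.append(current)
    return result
```

For example, with `dx_max=15, dy_max=5`, two boxes on the same line close together are fused, while a distant box is kept as a fixed text box.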
  11. The computer device according to claim 9, wherein, after the step of calculating the complexity of the target detection picture according to the preset detection model, the method comprises:
    when the complexity is high, obtaining, according to a second labeling model among the preset labeling models, a minimum picture corresponding to the target detection picture and minimum text coordinates of second text boxes in the minimum picture;
    mapping the minimum text coordinates in parallel onto a maximum picture corresponding to the target detection picture to obtain detected text coordinates of the target detection picture, and obtaining the detected text corresponding to the target detection picture from the detected text coordinates.
  12. The computer device according to claim 11, wherein the step of mapping the minimum text coordinates in parallel onto the maximum picture corresponding to the target detection picture to obtain the detected text coordinates of the target detection picture specifically comprises:
    obtaining a preset mapping ratio, and enlarging the minimum text coordinates in parallel according to the preset mapping ratio to obtain the detected text coordinates of the target detection picture.
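The parallel enlargement of claim 12 amounts to multiplying every coordinate detected on the reduced picture by the preset mapping ratio. A minimal sketch; the function name and the rounding choice are assumptions:

```python
def map_coordinates(min_coords, scale):
    """Enlarge text-box coordinates detected on the reduced (minimum)
    picture back onto the full-size (maximum) picture by a preset
    mapping ratio (a sketch of claim 12)."""
    return [tuple(round(v * scale) for v in box) for box in min_coords]
```

For instance, with a mapping ratio of 4.0, a box at `(1, 2, 3, 4)` on the minimum picture maps to `(4, 8, 12, 16)` on the maximum picture.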
  13. The computer device according to claim 9, wherein the step of calculating the complexity of the target detection picture according to the preset detection model specifically comprises:
    inputting the target detection picture into a convolutional layer of the preset detection model and, after a pooling layer and a fully connected layer, outputting a detection result value;
    predicting on the detection result value according to a preset binary classification loss function to obtain the complexity of the target detection picture.
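The convolution → pooling → fully-connected pipeline of claim 13 can be sketched in miniature as follows. All weights, the sigmoid decision, and the 0.5 threshold are illustrative stand-ins, not the patent's trained model:

```python
import math

def classify_complexity(img, kernel, weights, bias, threshold=0.5):
    """Toy version of the claim-13 pipeline: one valid convolution,
    one 2x2 max pool, one fully connected layer producing a single
    detection result value, then a binary (sigmoid) decision."""
    h, w = len(img), len(img[0])
    kh, kw = len(kernel), len(kernel[0])
    # valid convolution
    conv = [[sum(img[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(w - kw + 1)]
            for i in range(h - kh + 1)]
    # 2x2 max pooling with stride 2
    pooled = [max(conv[i][j], conv[i][j + 1],
                  conv[i + 1][j], conv[i + 1][j + 1])
              for i in range(0, len(conv) - 1, 2)
              for j in range(0, len(conv[0]) - 1, 2)]
    # fully connected layer -> single detection result value
    logit = sum(p * wgt for p, wgt in zip(pooled, weights)) + bias
    score = 1.0 / (1.0 + math.exp(-logit))  # sigmoid over the result value
    return "high" if score >= threshold else "low"
```

A production detection model would of course stack many such layers and learn its parameters with the binary classification loss; this sketch only shows the data flow named in the claim.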
  14. The computer device according to claim 9, wherein, before the step of obtaining the feature vector of the target detection picture according to the first labeling model among the preset labeling models, the method further comprises:
    obtaining initial text pictures, dividing the initial text pictures into training pictures and test pictures, and inputting the training pictures into a preset basic labeling model to obtain labeled text coordinates of the training pictures;
    calculating a loss function of the basic labeling model according to the labeled text coordinates, and determining, when the loss function converges, the basic labeling model as a trained basic labeling model;
    verifying the trained basic labeling model with the test pictures, and determining, when the verification pass rate of the trained basic labeling model on the test pictures is greater than or equal to a preset pass rate, the trained basic labeling model as the preset labeling model.
  15. The computer device according to claim 14, wherein the step of calculating the loss function of the basic labeling model according to the labeled text coordinates specifically comprises:
    annotating the training picture with a preset labeling tool to obtain initial text coordinates of the training picture;
    calculating the squared difference between the initial text coordinates and the labeled text coordinates, and obtaining the loss function of the basic labeling model from the squared difference.
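The squared-difference loss of claim 15 can be sketched as below. The mean reduction is an assumption; the claim fixes only the squared difference between the tool-annotated and model-predicted coordinates:

```python
def labeling_loss(initial_coords, labeled_coords):
    """Mean squared difference between the tool-annotated (initial) text
    coordinates and the model's labeled text coordinates (a sketch of
    claim 15; the reduction over coordinates is illustrative)."""
    if len(initial_coords) != len(labeled_coords):
        raise ValueError("coordinate lists must have equal length")
    return sum((a - b) ** 2
               for a, b in zip(initial_coords, labeled_coords)) / len(initial_coords)
```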
  16. A computer-readable storage medium having computer-readable instructions stored thereon, wherein the computer-readable instructions, when executed by a processor, cause the processor to further perform the following steps:
    when a target detection picture is received, calculating the complexity of the target detection picture according to a preset detection model;
    when the complexity is low, obtaining a feature vector of the target detection picture according to a first labeling model among preset labeling models, and calculating target text coordinates of first text boxes in the target detection picture from the feature vector;
    calculating center coordinates of the target detection picture according to the target text coordinates, fusing first text boxes whose center coordinates are less than or equal to a preset error value into a new text box, and determining first text boxes whose center coordinates are greater than the preset error value as fixed text boxes;
    extracting the text information in the new text box and the fixed text boxes, and determining the text information as the detected text of the target detection picture.
  17. The computer-readable storage medium according to claim 16, wherein the preset error value comprises a first error value and a second error value, and the step of fusing first text boxes whose center coordinates are less than or equal to the preset error value into a new text box specifically comprises:
    obtaining a first pixel difference between the y-axis coordinates of two adjacent center coordinates, and a second pixel difference between the x-axis coordinates of the two adjacent center coordinates;
    fusing the first text boxes for which the first pixel difference is less than or equal to the first error value and the second pixel difference is less than or equal to the second error value into a new text box.
  18. The computer-readable storage medium according to claim 16, wherein, after the step of calculating the complexity of the target detection picture according to the preset detection model, the method comprises:
    when the complexity is high, obtaining, according to a second labeling model among the preset labeling models, a minimum picture corresponding to the target detection picture and minimum text coordinates of second text boxes in the minimum picture;
    mapping the minimum text coordinates in parallel onto a maximum picture corresponding to the target detection picture to obtain detected text coordinates of the target detection picture, and obtaining the detected text corresponding to the target detection picture from the detected text coordinates.
  19. The computer-readable storage medium according to claim 18, wherein the step of mapping the minimum text coordinates in parallel onto the maximum picture corresponding to the target detection picture to obtain the detected text coordinates of the target detection picture specifically comprises:
    obtaining a preset mapping ratio, and enlarging the minimum text coordinates in parallel according to the preset mapping ratio to obtain the detected text coordinates of the target detection picture.
  20. The computer-readable storage medium according to claim 16, wherein the step of calculating the complexity of the target detection picture according to the preset detection model specifically comprises:
    inputting the target detection picture into a convolutional layer of the preset detection model and, after a pooling layer and a fully connected layer, outputting a detection result value;
    predicting on the detection result value according to a preset binary classification loss function to obtain the complexity of the target detection picture.
PCT/CN2021/090512 2020-11-17 2021-04-28 Text detection method and apparatus from image, computer device and storage medium WO2022105120A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011286320.XA CN112395450B (en) 2020-11-17 2020-11-17 Picture character detection method and device, computer equipment and storage medium
CN202011286320.X 2020-11-17

Publications (1)

Publication Number Publication Date
WO2022105120A1 (en) 2022-05-27

Family

ID=74600891

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/090512 WO2022105120A1 (en) 2020-11-17 2021-04-28 Text detection method and apparatus from image, computer device and storage medium

Country Status (2)

Country Link
CN (1) CN112395450B (en)
WO (1) WO2022105120A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112395450B (en) * 2020-11-17 2024-03-19 平安科技(深圳)有限公司 Picture character detection method and device, computer equipment and storage medium

Citations (4)

Publication number Priority date Publication date Assignee Title
CN101615252A (en) * 2008-06-25 2009-12-30 中国科学院自动化研究所 A kind of method for extracting text information from adaptive images
CN111340139A (en) * 2020-03-27 2020-06-26 中国科学院微电子研究所 Method and device for judging complexity of image content
CN111612003A (en) * 2019-02-22 2020-09-01 北京京东尚科信息技术有限公司 Method and device for extracting text in picture
CN112395450A (en) * 2020-11-17 2021-02-23 平安科技(深圳)有限公司 Picture character detection method and device, computer equipment and storage medium

Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
US9262699B2 (en) * 2012-07-19 2016-02-16 Qualcomm Incorporated Method of handling complex variants of words through prefix-tree based decoding for Devanagiri OCR
CN109685055B (en) * 2018-12-26 2021-11-12 北京金山数字娱乐科技有限公司 Method and device for detecting text area in image
CN110046616B (en) * 2019-03-04 2021-05-25 北京奇艺世纪科技有限公司 Image processing model generation method, image processing device, terminal device and storage medium
WO2020223859A1 (en) * 2019-05-05 2020-11-12 华为技术有限公司 Slanted text detection method, apparatus and device

Also Published As

Publication number Publication date
CN112395450A (en) 2021-02-23
CN112395450B (en) 2024-03-19

Legal Events

Code Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 21893276; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 21893276; Country of ref document: EP; Kind code of ref document: A1)