CN113673497A - Text detection method, terminal and computer readable storage medium thereof - Google Patents

Info

Publication number
CN113673497A
CN113673497A
Authority
CN
China
Prior art keywords
text
box
detection
text detection
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110827395.2A
Other languages
Chinese (zh)
Inventor
尹瑾 (Yin Jin)
熊剑平 (Xiong Jianping)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Dahua Technology Co Ltd
Original Assignee
Zhejiang Dahua Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Dahua Technology Co Ltd filed Critical Zhejiang Dahua Technology Co Ltd
Priority to CN202110827395.2A
Publication of CN113673497A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent

Abstract

The invention provides a text detection method, a terminal and a computer-readable storage medium. The text detection method includes: acquiring a text image to be detected and performing feature extraction on its feature region to obtain a feature map; performing region detection on the feature map to obtain a bounding box; performing text detection on the feature map to obtain the contour box of the text in the feature map; and determining the detection box of the text to be detected based on the detected bounding box and contour box, thereby improving detection accuracy for slanted text lines. In addition, the method requires no edge-reduction processing of the feature region, which greatly reduces the rate of missed text and improves the robustness of text detection.

Description

Text detection method, terminal and computer readable storage medium thereof
Technical Field
The present invention relates to the field of image recognition technologies, and in particular, to a text detection method, a terminal, and a computer-readable storage medium thereof.
Background
With the development of science and technology, it has become increasingly common to recognize information on bills and certificates automatically by means of OCR (Optical Character Recognition). For example, identity-card information must be recognized when handling business at a bank, and document information such as driver's licenses and vehicle licenses must be recognized during traffic-police law enforcement.
At present, card and certificate images are mostly captured with mobile phones and their content is transcribed manually. Such images are prone to complex backgrounds, slanted text lines, text lines that stick together, and similar problems, and traditional OCR detection algorithms are not robust when detecting card images in natural scenes.
Disclosure of Invention
The present invention mainly solves the technical problem of providing a text detection method, a terminal and a computer-readable storage medium, addressing the poor detection accuracy of slanted text lines in the prior art.
In order to solve the above technical problem, the first technical solution adopted by the present invention is to provide a text detection method, which includes: acquiring a text image to be detected, the image containing at least a feature region of the text to be detected; performing feature extraction on the feature region to obtain a feature map; performing region detection on the feature map to obtain a bounding box of the feature map; performing text detection on the feature map to obtain the contour box of the text in the feature map; and determining the detection box of the text to be detected based on the bounding box and the contour box.
Before the step of performing feature extraction on the feature region to obtain the feature map, the method further includes: segmenting the text image to be detected to obtain the feature region; and correcting the feature region so that it is at a preset angle.
The step of determining the detection box of the text to be detected based on the bounding box and the contour box specifically includes: if the distance between each corner point of the contour box and the corresponding corner point of the bounding box is determined to be smaller than a first threshold, taking the bounding box as the detection box of the text.
The step of determining the detection box of the text to be detected based on the bounding box and the contour box specifically includes: when the long side of the contour box is in the horizontal direction, taking the bounding box as the detection box of the text if the difference between the horizontal coordinates of the two corner points forming a short side of the contour box is greater than a second threshold; or, when the long side of the contour box is in the vertical direction, taking the bounding box as the detection box of the text if the difference between the vertical coordinates of the two corner points forming a short side of the contour box is greater than the second threshold.
The step of determining the detection box of the text to be detected based on the bounding box and the contour box specifically includes: performing weighted fusion of the bounding box and the contour box to determine the detection box of the text to be detected.
Performing feature extraction on the feature region to obtain the feature map includes: extracting features of the feature region based on a trained text detection model to obtain the feature map. The text detection model is obtained by training an initial text detection model, and the initial text detection model includes a feature extraction unit, a text detection unit and a text correction unit.
The text detection model is obtained as follows: acquiring a training sample set, wherein the training sample set includes a plurality of image samples containing text and text labeling boxes; segmenting a first feature region containing text out of an image sample; performing feature extraction on the first feature region through the feature extraction unit to obtain a first feature map; performing region detection on the first feature map through the text detection unit to obtain a first prediction box, and performing text detection on the first feature map through the text correction unit to obtain a second prediction box; constructing a first loss function from the first prediction box and the text labeling box and from the second prediction box and the text labeling box; and iteratively training the initial text detection model with the first loss function to obtain the text detection model.
The training sample set further includes the true categories of the texts. The method of obtaining the text detection model then further includes: predicting the category of the text through the initial text detection model; constructing a second loss function from the predicted category and the true category of the text; and iteratively training the initial text detection model with the second loss function to obtain the text detection model.
In order to solve the above technical problem, the second technical solution adopted by the present invention is to provide a terminal comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor, the processor being configured to implement the steps of the text detection method described above.
In order to solve the above technical problem, the third technical solution adopted by the present invention is to provide a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the text detection method described above.
The present invention has the following beneficial effects. Unlike the prior art, the text detection method, terminal and computer-readable storage medium provided by the present invention acquire a text image to be detected and perform feature extraction on its feature region to obtain a feature map; perform region detection on the feature map to obtain a bounding box; perform text detection on the feature map to obtain the contour box of the text in the feature map; and determine the detection box of the text to be detected based on the detected bounding box and contour box, thereby improving detection accuracy for slanted text lines. In addition, the method requires no edge-reduction processing of the feature region, which greatly reduces the rate of missed text and improves the robustness of text detection.
Drawings
FIG. 1 is a schematic flow chart of a text detection method provided by the present invention;
FIG. 2 is a flowchart illustrating a text detection method according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating an embodiment of step S21 in the text detection method provided in FIG. 2;
FIG. 4 is a partial image of a driver's license in one embodiment provided by the present invention;
FIG. 5 is a block diagram of an embodiment of a text detection model provided by the present invention;
FIG. 6 is a schematic block diagram of one embodiment of a terminal provided by the present invention;
FIG. 7 is a schematic block diagram of one embodiment of a computer-readable storage medium provided by the present invention.
Detailed Description
The following describes in detail the embodiments of the present application with reference to the drawings attached hereto.
In the following description, for purposes of explanation and not limitation, specific details are set forth such as particular system structures, interfaces, techniques, etc. in order to provide a thorough understanding of the present application.
In order to make those skilled in the art better understand the technical solution of the present invention, the text detection method provided by the present invention is further described in detail below with reference to the accompanying drawings and the detailed description.
Referring to fig. 1, fig. 1 is a schematic flow chart of a text detection method according to the present invention. The embodiment provides a text detection method, which includes the following steps.
S11: and acquiring a text image to be detected.
Specifically, the text image to be detected may be an image captured by a camera of a terminal device, or an image already stored on the terminal device. The terminal device here may be a mobile phone, a tablet computer, an in-vehicle device, or the like, which is not limited in the embodiments of the present invention.

The acquired text image to be detected may be any image requiring text detection, for example: natural-scene images; images of direction signs, location signs and the like captured while navigating for the blind; images of student homework; or images of documents such as identity cards and driver's licenses.
S12: and performing feature extraction on the feature area to obtain a feature map.
Specifically, the text detection model includes a segmentation-correction unit, a feature extraction unit, a text detection unit and a text correction unit. The text image to be detected is input to the text detection model; the segmentation-correction unit recognizes the image and segments out the feature region, and then corrects the extracted feature region according to the detected aspect ratio of the feature region so that it sits at a preset angle. The extracted feature region may also be corrected according to the positions at which the recognized entries in the feature region are distributed. Segmenting and correcting the text image to be detected reduces the interference of the background region with text detection while leaving the proportion of the text region relative to the whole image unchanged.
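To make the segmentation-correction step concrete, the following is a minimal sketch rather than the patent's implementation: it assumes segmentation yields a binary mask of the feature region, cuts the region out, and uses the aspect ratio of its minimum-area rectangle to rotate it to the preset angle. The function name, the mask input and the cropping details are assumptions.

```python
import cv2
import numpy as np

def segment_and_correct(image, region_mask, preset_angle=0.0):
    """Cut the feature (e.g. card) region out of `image` using a binary
    mask and rotate it so that its long side lies at `preset_angle`."""
    contours, _ = cv2.findContours(region_mask.astype(np.uint8),
                                   cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    # Minimum-area rectangle of the largest region: centre, size, angle.
    (cx, cy), (w, h), angle = cv2.minAreaRect(max(contours, key=cv2.contourArea))
    if w < h:  # use the aspect ratio to decide which side is the long one
        w, h, angle = h, w, angle + 90.0
    # Rotate the whole image so that the region sits at the preset angle ...
    rot = cv2.getRotationMatrix2D((cx, cy), angle - preset_angle, 1.0)
    rotated = cv2.warpAffine(image, rot, (image.shape[1], image.shape[0]))
    # ... then crop the now axis-aligned region around its centre.
    x0, y0 = int(round(cx - w / 2)), int(round(cy - h / 2))
    return rotated[max(y0, 0):y0 + int(h), max(x0, 0):x0 + int(w)]
```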
After the feature region has been adjusted to the preset angle, the feature extraction unit performs feature extraction on the feature region to obtain the feature map corresponding to the feature region. In a specific embodiment, when the feature region is a card region, the feature extraction unit performs feature extraction on the card region to obtain a card feature map.
S13: and carrying out region detection on the feature map to obtain a boundary frame of the feature map.
Specifically, the feature map corresponding to the feature region is input to the text detection unit, which performs feature extraction on the feature map and then region detection to obtain the bounding box of the text in the feature map. The bounding box of the text in the feature map is a rectangular detection box.
S14: and performing text detection on the feature map to obtain an outline border of the text in the feature map.
Specifically, the feature map corresponding to the feature region is input to the text correction unit, which performs feature extraction on the feature map and then text detection to obtain the offsets of the four corners of the text in the feature map relative to the four corners of the bounding box, from which the contour box of the text is derived. The contour box of the text in the feature map is a quadrilateral detection box.
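Purely for illustration, deriving the quadrilateral contour box from the rectangular bounding box and the predicted corner offsets reduces to one addition per corner. The (4, 2) corner arrays and their clockwise ordering from the top-left are assumptions, not something the text specifies.

```python
import numpy as np

def contour_box_from_offsets(bbox_corners, corner_offsets):
    """Add each predicted (dx, dy) corner offset to the matching
    bounding-box corner; both inputs have shape (4, 2)."""
    return np.asarray(bbox_corners, dtype=float) + np.asarray(corner_offsets, dtype=float)

# Example: a text line whose left edge is shifted down and right edge up.
bbox = [[0, 0], [100, 0], [100, 20], [0, 20]]   # TL, TR, BR, BL
offsets = [[0, 6], [0, -6], [0, -6], [0, 6]]
quad = contour_box_from_offsets(bbox, offsets)  # slanted quadrilateral
```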
S15: and determining a detection box of the text to be detected based on the boundary box and the outline border.
Specifically, in an optional embodiment, it is determined whether the distances between the corner points of the contour box and the corresponding corner points of the bounding box are all smaller than a first threshold; if so, the bounding box is taken as the detection box of the text. In another optional embodiment, when the long side of the contour box is in the horizontal direction, it is determined whether the difference between the horizontal coordinates of the two corner points forming a short side of the contour box is greater than a second threshold; if so, the bounding box is taken as the detection box of the text. Likewise, when the long side of the contour box is in the vertical direction, the bounding box is taken as the detection box of the text if the difference between the vertical coordinates of the two corner points forming a short side is greater than the second threshold.
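A minimal sketch of this fusion logic, assuming corners stored as (4, 2) arrays in clockwise order from the top-left and purely illustrative pixel thresholds:

```python
import numpy as np

def choose_detection_box(bbox, quad, first_threshold=5.0, second_threshold=3.0):
    """Fuse the rectangular bounding box and the quadrilateral contour box
    into the final detection box, following the rules described above.
    The threshold values are assumptions, not values from the patent."""
    bbox = np.asarray(bbox, dtype=float)
    quad = np.asarray(quad, dtype=float)

    # Rule 1: every contour-box corner lies close to its bounding-box
    # corner, so the text is essentially axis-aligned -> keep the rectangle.
    if np.all(np.linalg.norm(quad - bbox, axis=1) < first_threshold):
        return bbox

    # Rules 2 and 3: if the two corners of a short side differ too much
    # along the long-side direction, the quadrilateral is unreliable.
    axis = 0 if np.ptp(quad[:, 0]) >= np.ptp(quad[:, 1]) else 1  # long side
    left_short = abs(quad[0, axis] - quad[3, axis])    # corners TL and BL
    right_short = abs(quad[1, axis] - quad[2, axis])   # corners TR and BR
    if max(left_short, right_short) > second_threshold:
        return bbox

    # Otherwise the slanted quadrilateral is the better detection box.
    return quad
```

In this reading, the rectangle wins whenever the quadrilateral adds nothing (rule 1) or looks distorted (rules 2 and 3); only a well-formed slanted quadrilateral replaces it.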
In the text detection method provided by this embodiment, a text image to be detected is acquired, and a text detection model performs feature extraction on the feature region to obtain a feature map; region detection on the feature map yields a bounding box, and text detection on the feature map yields the contour box of the text; the detection box of the text to be detected is then determined by fusing the detected bounding box and contour box, which improves detection accuracy for slanted text lines. In addition, the method requires no edge-reduction processing of the feature region, which greatly reduces the rate of missed text and improves the robustness of text detection.
Referring to fig. 2, fig. 2 is a schematic flowchart illustrating a text detection method according to an embodiment of the present invention. The embodiment provides a text detection method, which includes the following steps.
S21: and training the initial text detection model to obtain a text detection model.
Specifically, the initial text detection model is built on the YOLOv3 network, which provides multi-scale outputs and thus effectively reduces the rate of missed text during detection. The initial text detection model comprises a segmentation-correction unit, a feature extraction unit, a text detection unit and a text correction unit connected in sequence. Referring to fig. 3, fig. 3 is a flowchart illustrating an embodiment of step S21 in the text detection method provided in fig. 2. Specifically, training the initial text detection model includes the following steps.
S211: and acquiring a training sample set, wherein the training sample set comprises a plurality of image samples containing texts, text annotation boxes and real categories of the texts.
Specifically, a plurality of image samples containing text are acquired; these may be images captured by a camera of a terminal device, for example a picture containing a driver's license, a picture containing an identity card, images of direction signs, location signs and the like in a scene where a blind person is located, or images of student homework. Each image sample also carries a text labeling box marking its text. In an alternative embodiment, the text labeling box may be given as the coordinates of the minimum bounding rectangle of the text region in the image sample.
Referring to fig. 4, fig. 4 is a partial image of a driver's license in one embodiment provided by the present invention. In another optional embodiment, the image sample containing text also carries the true category of the labeled text, which is either "key" or "value". For example, the true category of the text "name" on an identity card is "key", as are "brand model", "address" and "registration date" on the driver's license; the true category of the text "zhangxx" on the identity card is "value", as are "xxxx", "xxxi" and "2007-xxxx" on the driver's license. Labeling text categories makes it possible to distinguish effectively between text lines that stick together.
In another optional embodiment, the image samples containing text may be supplemented into the training sample set after rotation enhancement, affine enhancement and illumination enhancement, expanding the amount of data in the training sample set and thereby improving the robustness of the trained text detection model.
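For illustration, a minimal version of the three enhancements could look like the sketch below; all parameter ranges are assumptions rather than values from the patent.

```python
import cv2
import numpy as np

def enhance(image, rng=None):
    """Apply rotation, affine and illumination enhancement to one sample."""
    rng = rng or np.random.default_rng()
    h, w = image.shape[:2]
    # Rotation enhancement: small random rotation about the image centre.
    m = cv2.getRotationMatrix2D((w / 2, h / 2), rng.uniform(-15, 15), 1.0)
    image = cv2.warpAffine(image, m, (w, h))
    # Affine enhancement: jitter three anchor points by up to 5 percent.
    src = np.float32([[0, 0], [w - 1, 0], [0, h - 1]])
    dst = (src + rng.uniform(-0.05, 0.05, (3, 2)) * np.array([w, h])).astype(np.float32)
    image = cv2.warpAffine(image, cv2.getAffineTransform(src, dst), (w, h))
    # Illumination enhancement: random gain and bias on the pixel values.
    gain, bias = rng.uniform(0.7, 1.3), rng.uniform(-30, 30)
    return np.clip(image.astype(np.float32) * gain + bias, 0, 255).astype(np.uint8)
```

Note that the text labeling boxes would need to be transformed with the same rotation and affine matrices so that the labels stay aligned with the augmented images.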
In another alternative embodiment, the angle at which the text in the image sample lies should also be labeled, the angle of the text being the angle between the text and the horizontal axis.
S212: a first feature region of the image sample containing text is segmented.
Specifically, the image sample containing text is input to the initial text detection model. The segmentation-correction unit locates the first feature region containing text in the sample, separates it from the background region, and segments it out; the extracted first feature region is then corrected according to the detected aspect ratio of the text so that it sits at a preset angle. The extracted first feature region may also be corrected according to the positions at which the recognized entries in the region are distributed. Specifically, the angle of the first feature region is determined from the positional relationship between the recognized "key" and "value" entries in the region and the positional relationship between the "key" and "value" positions in a prestored template, and the region is rotated from that angle to the preset angle.
In one embodiment, the first feature region is rotated to the preset angle. For example, the preset angle may be 0°, 90°, 180° or 270° relative to the horizontal axis, which facilitates subsequent feature extraction on the first feature region and improves the accuracy of its text detection.

In another alternative embodiment, the first feature region may instead be rotated according to the labeled angle of the text so that the angle between the first feature region and the horizontal axis is 0°, 90°, 180° or 270°, likewise facilitating subsequent feature extraction and further improving the text detection accuracy of the first feature region.
S213: and performing feature extraction on the first feature region to obtain a first feature map.
Specifically, after the first feature region has been adjusted to the preset angle, the feature extraction unit performs feature extraction on the first feature region to obtain the corresponding first feature map. In a specific embodiment, when the first feature region is a card region, the feature extraction unit performs feature extraction on the card region to obtain a card feature map.
S214: carrying out region detection on the first feature map to obtain a first prediction frame; and performing text detection on the first feature map to obtain a second prediction box.
Specifically, the first feature map corresponding to the first feature region is input to the text detection unit, which performs feature extraction on the first feature map and then region detection to obtain the first prediction box of the text in the first feature map. The first prediction box of the text in the first feature map is a rectangular detection box.

The first feature map corresponding to the first feature region is likewise input to the text correction unit, which performs feature extraction on the first feature map and then text detection to obtain the offsets of the four corners of the text relative to the four corners of the first prediction box, yielding the second prediction box of the text. The second prediction box of the text in the first feature map is a quadrilateral detection box.
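For orientation only, a PyTorch-style skeleton of the trainable units might look as follows. The concrete layers (a YOLOv3-style backbone and 1x1 convolution heads), the class count and all names are assumptions; the text fixes only the unit decomposition and the data flow.

```python
import torch
import torch.nn as nn

class InitialTextDetectionModel(nn.Module):
    """Feature extraction unit feeding a text detection unit (rectangle),
    a text correction unit (corner offsets) and a key/value classifier.
    The segmentation-correction unit runs on the raw image beforehand."""

    def __init__(self, backbone: nn.Module, channels: int, num_classes: int = 2):
        super().__init__()
        self.feature_extraction = backbone                     # e.g. a Darknet-53 trunk
        self.text_detection = nn.Conv2d(channels, 4, 1)        # rectangle: x, y, w, h
        self.text_correction = nn.Conv2d(channels, 8, 1)       # four (dx, dy) corner offsets
        self.classifier = nn.Conv2d(channels, num_classes, 1)  # "key" / "value" scores

    def forward(self, region: torch.Tensor):
        fmap = self.feature_extraction(region)
        return (self.text_detection(fmap),   # first prediction box
                self.text_correction(fmap),  # offsets -> second prediction box
                self.classifier(fmap))       # predicted category
```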
S215: and constructing a first loss function through the first prediction box and the text marking box and the second prediction box and the text marking box.
Specifically, a regression loss function is used to compute the errors between the first prediction box and the text labeling box and between the second prediction box and the text labeling box. In one embodiment, the first loss function is the regression loss shown in equation (1) below.
L = a0 × exp(-kt) × L1 + (1 - a0 × exp(-kt)) × L2   (1)

where L1 is the regression loss of the rectangular detection box, L2 is the regression loss of the quadrilateral detection box, a0 has an initial value of 1, t is the number of iterations, and k is a hyper-parameter; under the initial condition the coefficient of L1 is 1 and the coefficient of L2 is 0.

As training proceeds, the coefficient in front of L1 decays gradually while the coefficient of L2 grows, i.e. the weight shifts from the rectangular detection box to the quadrilateral detection box.
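Equation (1) transcribes directly into code; the value of the hyper-parameter k below is an illustrative assumption.

```python
import math

def first_loss(l1, l2, t, k=0.01, a0=1.0):
    """Equation (1): L = a0*exp(-k*t)*L1 + (1 - a0*exp(-k*t))*L2.
    At t = 0 the rectangular-box loss l1 has coefficient 1 and the
    quadrilateral-box loss l2 has coefficient 0; as the iteration count
    t grows, the weight shifts toward the quadrilateral branch."""
    w = a0 * math.exp(-k * t)
    return w * l1 + (1.0 - w) * l2
```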
S216: and performing iterative training on the initial text detection model by using the first loss function to obtain a text detection model.
Specifically, the initial text detection model is trained iteratively using the errors between the first prediction box and the text labeling box and between the second prediction box and the text labeling box, yielding the text detection model.
In an alternative embodiment, the output of the initial text detection model is back-propagated, and the weights of the initial text detection model are updated according to the loss value fed back by the first loss function. The parameters of the initial text detection model may also be adjusted, thereby training it.

An image sample containing text is input to the initial text detection model, which predicts its text. When the errors between the first prediction box and the text labeling box and between the second prediction box and the text labeling box fall below a preset threshold, which may be set as needed (for example 1% or 5%), training of the initial text detection model stops and the text detection model is obtained.
S217: and identifying the text detected in the first feature map to obtain the prediction type of the text.
Specifically, an image sample containing text is input to the initial text detection model, which predicts the category of the text in the sample to obtain the predicted category of the text.
S218: and constructing a second loss function according to the prediction category and the real category of the text.
Specifically, a cross-entropy loss function is used to compute the error between the predicted category and the true category. In one embodiment, the second loss function is a cross-entropy loss.
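A minimal sketch of the second loss, assuming the two true categories are encoded as 0 = "key" and 1 = "value":

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()           # the second loss function
logits = torch.randn(8, 2)                  # scores for 8 detected texts, 2 classes
true_category = torch.randint(0, 2, (8,))   # labelled true categories
second_loss = criterion(logits, true_category)
```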
S219: and performing iterative training on the text detection model by using a second loss function.
Specifically, the text detection model is trained iteratively using the errors between the predicted categories and the true categories.
In an optional embodiment, the output of the text detection model is back-propagated, and the weights of the text detection model are updated according to the loss value fed back by the second loss function. The parameters of the text detection model may also be adjusted, thereby training it.

When the error between the predicted category and the true category of the text falls below a preset threshold, which may be set as needed (for example 1% or 5%), training of the text detection model stops.
S22: and acquiring a text image to be detected.
Specifically, the text image to be detected may be an image captured by a camera of a terminal device, or an image already stored on the terminal device. The terminal device here may be a mobile phone, a tablet computer, an in-vehicle device, or the like, which is not limited in the embodiments of the present invention.

The acquired text image to be detected may be any image requiring text detection, for example: natural-scene images; images of direction signs, location signs and the like captured while navigating for the blind; images of student homework; or images of documents such as identity cards and driver's licenses.
S23: and segmenting the image to be detected to obtain the characteristic region.
Specifically, the text image to be detected is input to the text detection model; the segmentation-correction unit locates the feature region in the image, separates it from the background region, and segments the feature region out of the text image to be detected.
S24: and correcting the characteristic region so that the characteristic region is at a preset angle.
Specifically, the extracted feature region is corrected according to the detected aspect ratio of the feature region so that it sits at the preset angle; it may also be corrected according to the positions at which the recognized entries in the feature region are distributed. In one embodiment, the feature region is rotated to the preset angle, which may for example be 0°, 90°, 180° or 270° relative to the horizontal axis, facilitating subsequent feature extraction on the feature region and improving the accuracy of its text detection. Segmenting and correcting the text image to be detected reduces the interference of the background region with text detection while leaving the proportion of the text region relative to the whole image unchanged.
S25: and performing feature extraction on the feature region through a text detection model to obtain a feature map.
Referring to fig. 5, fig. 5 is a schematic diagram of the framework of an embodiment of the text detection model provided by the present invention. Specifically, after the feature region has been adjusted to the preset angle, the feature extraction unit performs feature extraction on the feature region to obtain the feature map corresponding to the feature region. In a specific embodiment, when the feature region is a card region, the feature extraction unit performs feature extraction on the card region to obtain a card feature map.
S26: and carrying out region detection on the feature map to obtain a boundary frame of the feature map.
Specifically, the feature map corresponding to the feature region is input to the text detection unit, which performs feature extraction on the feature map and then region detection to obtain the bounding box of the text in the feature map. The bounding box of the text in the feature map is a rectangular detection box.
S27: and performing text detection on the feature map to obtain an outline border of the text in the feature map.
Specifically, the feature map corresponding to the feature region is input to the text correction unit, which performs feature extraction on the feature map and then text detection to obtain the offsets of the four corners of the text relative to the four corners of the bounding box, from which the quadrilateral contour box of the text is derived.
S28: and performing weighted fusion according to the boundary frame and the outline frame to determine a detection frame of the text to be detected.
Specifically, in an optional embodiment, it is determined whether the distances between the corner points of the contour box and the corresponding corner points of the bounding box are all smaller than a first threshold; if so, the bounding box is taken as the detection box of the text. In another optional embodiment, when the long side of the contour box is in the horizontal direction, it is determined whether the difference between the horizontal coordinates of the two corner points forming a short side of the contour box is greater than a second threshold; if so, the bounding box is taken as the detection box of the text. Likewise, when the long side of the contour box is in the vertical direction, the bounding box is taken as the detection box of the text if the difference between the vertical coordinates of the two corner points forming a short side is greater than the second threshold.
In the text detection method provided by this embodiment, a text image to be detected is acquired, and a text detection model performs feature extraction on the feature region to obtain a feature map; region detection on the feature map yields a bounding box, and text detection on the feature map yields the contour box of the text; the detection box of the text to be detected is then determined by fusing the detected bounding box and contour box, which improves detection accuracy for slanted text lines. In addition, the method requires no edge-reduction processing of the feature region, which greatly reduces the rate of missed text and improves the robustness of text detection.
Referring to fig. 6, fig. 6 is a schematic block diagram of an embodiment of the terminal provided by the present invention. As shown in fig. 6, the terminal 70 in this embodiment includes a processor 71, a memory 72, and a computer program stored in the memory 72 and runnable on the processor 71. When executed by the processor 71, the computer program implements the text detection method described above, which, to avoid repetition, is not described again here.
Referring to fig. 7, fig. 7 is a schematic block diagram of an embodiment of a computer-readable storage medium provided by the present invention.
An embodiment of the present application further provides a computer-readable storage medium 90 storing a computer program 901. The computer program 901 includes program instructions, and a processor executing those instructions implements any text detection method provided in the embodiments of the present application.

The computer-readable storage medium 90 may be an internal storage unit of the computer device of the foregoing embodiments, such as a hard disk or memory of the computer device. The computer-readable storage medium 90 may also be an external storage device of the computer device, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the computer device.
The above description is only an embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes performed by the present specification and drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A text detection method, comprising:
acquiring a text image to be detected, wherein the text image to be detected at least comprises a feature region of a text to be detected;
extracting the features of the feature region to obtain a feature map;
performing region detection on the feature map to obtain a bounding box of the feature map;
performing text detection on the feature map to obtain a contour box of the text in the feature map;
and determining the detection box of the text to be detected based on the bounding box and the contour box.
2. The text detection method according to claim 1,
the step of extracting the features of the feature region to obtain the feature map being preceded by the following steps:
segmenting the text image to be detected to obtain the feature region;
and correcting the feature region so that the feature region is at a preset angle.
3. The text detection method according to claim 2,
the step of determining the detection box of the text to be detected based on the bounding box and the contour box specifically comprises:
if the distance between each corner point of the contour box and the corresponding corner point of the bounding box is determined to be smaller than a first threshold, taking the bounding box as the detection box of the text.
4. The text detection method according to claim 2,
the step of determining the detection box of the text to be detected based on the bounding box and the contour box specifically comprises:
when the long side of the contour box is in the horizontal direction, if the difference between the horizontal coordinates of the two corner points forming a short side of the contour box is greater than a second threshold, taking the bounding box as the detection box of the text;
or, when the long side of the contour box is in the vertical direction, if the difference between the vertical coordinates of the two corner points forming a short side of the contour box is greater than the second threshold, taking the bounding box as the detection box of the text.
5. The text detection method according to claim 1,
the step of determining the detection box of the text to be detected based on the bounding box and the contour box specifically comprises:
performing weighted fusion of the bounding box and the contour box to determine the detection box of the text to be detected.
6. The text detection method according to claim 2, wherein extracting the features of the feature region to obtain the feature map comprises:
extracting the features of the feature region based on a trained text detection model to obtain the feature map, wherein the text detection model is obtained by training an initial text detection model, and the initial text detection model comprises a feature extraction unit, a text detection unit and a text correction unit.
7. The text detection method of claim 6, wherein the text detection model is obtained by:
acquiring a training sample set, wherein the training sample set comprises a plurality of image samples containing texts and text labeling boxes;
segmenting a first feature region containing the text out of the image sample;
performing feature extraction on the first feature region through the feature extraction unit to obtain a first feature map;
performing region detection on the first feature map through the text detection unit to obtain a first prediction box, and performing text detection on the first feature map through the text correction unit to obtain a second prediction box;
constructing a first loss function from the first prediction box and the text labeling box and from the second prediction box and the text labeling box;
and performing iterative training on the initial text detection model by using the first loss function to obtain the text detection model.
8. The text detection method of claim 7, wherein the training sample set further comprises a true category of the text;
the method for obtaining the text detection model further comprises:
predicting the category of the text through the initial text detection model to obtain a predicted category of the text;
constructing a second loss function from the predicted category and the true category of the text;
and performing iterative training on the initial text detection model by using the second loss function to obtain the text detection model.
9. A terminal, characterized in that the terminal comprises a memory, a processor, and a computer program stored in the memory and runnable on the processor, the processor being configured to execute the computer program to implement the steps of the text detection method according to any one of claims 1-8.
10. A computer-readable storage medium, having stored thereon a computer program which, when being executed by a processor, carries out the steps of the text detection method according to any one of claims 1 to 8.
CN202110827395.2A 2021-07-21 2021-07-21 Text detection method, terminal and computer readable storage medium thereof Pending CN113673497A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110827395.2A CN113673497A (en) 2021-07-21 2021-07-21 Text detection method, terminal and computer readable storage medium thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110827395.2A CN113673497A (en) 2021-07-21 2021-07-21 Text detection method, terminal and computer readable storage medium thereof

Publications (1)

Publication Number Publication Date
CN113673497A (en) 2021-11-19

Family

ID=78539992

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110827395.2A Pending CN113673497A (en) 2021-07-21 2021-07-21 Text detection method, terminal and computer readable storage medium thereof

Country Status (1)

Country Link
CN (1) CN113673497A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110674804A (en) * 2019-09-24 2020-01-10 上海眼控科技股份有限公司 Text image detection method and device, computer equipment and storage medium
CN111079632A (en) * 2019-12-12 2020-04-28 上海眼控科技股份有限公司 Training method and device of text detection model, computer equipment and storage medium
CN111753812A (en) * 2020-07-30 2020-10-09 上海眼控科技股份有限公司 Text recognition method and equipment
CN111814785A (en) * 2020-06-11 2020-10-23 浙江大华技术股份有限公司 Invoice recognition method, training method of related model, related equipment and device
WO2020223859A1 (en) * 2019-05-05 2020-11-12 华为技术有限公司 Slanted text detection method, apparatus and device
WO2021056255A1 (en) * 2019-09-25 2021-04-01 Apple Inc. Text detection using global geometry estimators
CN113033346A (en) * 2021-03-10 2021-06-25 北京百度网讯科技有限公司 Text detection method and device and electronic equipment
CN113065404A (en) * 2021-03-08 2021-07-02 国网河北省电力有限公司 Method and system for detecting train ticket content based on equal-width character segments
CN113076814A (en) * 2021-03-15 2021-07-06 腾讯科技(深圳)有限公司 Text area determination method, device, equipment and readable storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
M. Liao et al.: "TextBoxes++: A Single-Shot Oriented Scene Text Detector", IEEE Transactions on Image Processing, vol. 27, no. 8, pages 3676-3690 *
Wei Jia et al.: "A CNN-Based Approach to Detecting Text from Images of Whiteboards and Handwritten Notes", 2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR) Proceedings, vol. 2018, pages 1-6, XP033468553, DOI: 10.1109/ICFHR-2018.2018.00010 *
Li Yue: "Research on Text Detection and Recognition Algorithms in Natural Scene Images" (自然场景图像中的文本检测与识别算法研究), China Masters' Theses Full-text Database, Information Science and Technology, vol. 2018, pages 138-476 *

Similar Documents

Publication Publication Date Title
Gonçalves et al. Benchmark for license plate character segmentation
US10095947B2 (en) Methods for mobile image capture of vehicle identification numbers in a non-document
US9082038B2 (en) Dram c adjustment of automatic license plate recognition processing based on vehicle class information
US11657631B2 (en) Scalable, flexible and robust template-based data extraction pipeline
WO2018233038A1 (en) Deep learning-based method, apparatus and device for recognizing license plate, and storage medium
CN110008909B (en) Real-name system business real-time auditing system based on AI
CN105260733A (en) Method and device for processing image information
KR20140010164A (en) System and method for recognizing text information in object
CN111079571A (en) Identification card information identification and edge detection model training method and device
US11144752B1 (en) Physical document verification in uncontrolled environments
CN108323209B (en) Information processing method, system, cloud processing device and computer storage medium
CN112052845A (en) Image recognition method, device, equipment and storage medium
CN111160395A (en) Image recognition method and device, electronic equipment and storage medium
Parvin et al. Vehicle number plate detection and recognition techniques: a review
CN113673500A (en) Certificate image recognition method and device, electronic equipment and storage medium
Hossain et al. Bangla digital number plate recognition using template matching for higher accuracy and less time complexity
Chaturvedi et al. Automatic license plate recognition system using surf features and rbf neural network
Srikanth et al. Automatic vehicle number plate detection and recognition systems: Survey and implementation
CN107292303A (en) Method and apparatus for license plate inspection with edge type sliding concentric window
Jain et al. Number plate detection using drone surveillance
CN113673497A (en) Text detection method, terminal and computer readable storage medium thereof
CN114842198A (en) Intelligent loss assessment method, device and equipment for vehicle and storage medium
CN111242112A (en) Image processing method, identity information processing method and device
Girinath et al. Automatic Number Plate Detection using Deep Learning
CN112396057A (en) Character recognition method and device and electronic equipment

Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination