CN108446698B - Method, device, medium and electronic equipment for detecting text in image - Google Patents

Method, device, medium and electronic equipment for detecting text in image

Info

Publication number
CN108446698B
Authority
CN
China
Prior art keywords
text
text detection
image
detection boxes
unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810213160.2A
Other languages
Chinese (zh)
Other versions
CN108446698A (en)
Inventor
李玉梅
杨学行
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Tencent Dadi Tongtu Beijing Technology Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Tencent Dadi Tongtu Beijing Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd and Tencent Dadi Tongtu Beijing Technology Co Ltd
Priority to CN201810213160.2A
Publication of CN108446698A
Application granted
Publication of CN108446698B
Active legal status (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/50 Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/50 Depth or shape recovery
    • G06T7/536 Depth or shape recovery from perspective effects, e.g. by using vanishing points
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Character Input (AREA)

Abstract

The embodiment of the invention provides a method, a device, a medium and electronic equipment for detecting text in images. The method for detecting the text comprises the following steps: acquiring an image to be processed; performing perspective transformation processing on the image to be processed so as to adjust it into a front view and obtain a processed corrected image; and performing text detection based on the corrected image. According to the technical scheme of the embodiment of the invention, a front view can be obtained by adjusting the image to be processed, and text detection can then be carried out on the basis of the obtained front view, so that the accuracy of text detection is improved and the problems of difficult text detection and low accuracy caused by image deformation are avoided.

Description

Method, device, medium and electronic equipment for detecting text in image
Technical Field
The invention relates to the technical field of computers, in particular to a method, a device, a medium and electronic equipment for detecting texts in images.
Background
A natural scene image is an image of a scene actually existing in life, photographed directly by various photographing devices (e.g., a camera, a mobile phone with a photographing function, etc.). Text in a natural scene image can provide rich semantic information: for example, identifying text on streets, license plates, menus and the like in a natural scene image can help people understand the scene conveniently, so it is necessary to detect text in natural scene images accurately.
However, due to the complexity of natural scene images, recognizing text in them is difficult and the recognition accuracy is low, so how to effectively detect text in natural scene images and thereby improve text detection accuracy has become a technical problem to be solved urgently.
It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the invention and therefore may include information that does not constitute prior art already known to a person of ordinary skill in the art.
Disclosure of Invention
Embodiments of the present invention provide a method, an apparatus, a medium, and an electronic device for detecting text in an image, so as to overcome, at least to a certain extent, the problems of difficult text recognition in images and low detection accuracy.
Additional features and advantages of the invention will be set forth in the detailed description which follows, or may be learned by practice of the invention.
According to an aspect of an embodiment of the present invention, there is provided a method for detecting text in an image, including: acquiring an image to be processed; carrying out perspective transformation processing on the image to be processed so as to adjust the image to be processed into a front view and obtain a processed corrected image; text detection is performed based on the corrected image.
According to an aspect of an embodiment of the present invention, there is provided an apparatus for detecting text in an image, including: the image acquisition unit is used for acquiring an image to be processed; the first processing unit is used for carrying out perspective transformation processing on the image to be processed so as to adjust the image to be processed into a front view and obtain a processed corrected image; a second processing unit for performing text detection based on the corrected image.
In some embodiments of the present invention, based on the foregoing solution, the first processing unit includes: the matrix construction unit is used for constructing a perspective transformation matrix; and the perspective transformation unit is used for carrying out perspective transformation processing on the image to be processed according to the perspective transformation matrix.
In some embodiments of the present invention, based on the foregoing solution, the matrix construction unit includes: the straight line segment detection unit is used for detecting a straight line segment in the image to be processed; the straight line segment selection unit is used for selecting a target straight line segment meeting the conditions from the detected straight line segments; the quadrangle determining unit is used for determining a quadrangle with the largest area formed by straight lines where the target straight line segments are located; a first generating unit, configured to generate a rectangular frame corresponding to the quadrangle; and the construction unit is used for constructing the perspective transformation matrix according to the corresponding relation between each vertex of the quadrangle and each vertex of the rectangular frame.
In some embodiments of the present invention, based on the foregoing scheme, the straight line segment detecting unit includes: the merging unit is used for determining the included angle between each pixel point in the image to be processed and a horizontal line, and merging the pixel points whose included-angle difference is within a preset range to obtain at least one region; a second generating unit, configured to generate a minimum circumscribed rectangle for each of the regions; the pixel point selecting unit is used for selecting, for each region, target pixel points for which the angle difference between the included angle and the main direction of the minimum circumscribed rectangle is smaller than or equal to a preset value; and the straight line segment determining unit is used for determining whether each region is a straight line segment according to the number of pixel points in the minimum circumscribed rectangle of each region and the number of the target pixel points.
In some embodiments of the present invention, based on the foregoing scheme, the straight line segment selecting unit is configured to: and filtering out straight line segments with the length smaller than or equal to a preset length from the detected straight line segments, and/or filtering out straight line segments with the included angle with the vertical direction and/or the horizontal direction larger than or equal to a preset angle to obtain the target straight line segment.
In some embodiments of the present invention, based on the foregoing scheme, the first generating unit is configured to: and generating the rectangular frame by taking two nonadjacent vertexes of the quadrangle as two nonadjacent vertexes of the rectangular frame.
In some embodiments of the present invention, based on the foregoing solution, the second processing unit includes: a third generating unit configured to generate a plurality of images of different sizes based on the corrected image; the first detection unit is used for respectively detecting texts in the images with different sizes so as to obtain text detection frames in the images with different sizes; and the second detection unit is used for carrying out text detection on the corrected image according to the text detection boxes in the images with different sizes.
In some embodiments of the present invention, based on the foregoing scheme, the second detection unit includes: the mapping unit is used for mapping the text detection boxes in the images with different sizes to the corrected image according to the size relationship between the images with different sizes and the corrected image to obtain a plurality of text detection boxes; and the text line determining unit is used for determining the text lines in the corrected image according to the position relation among the text detection boxes.
In some embodiments of the present invention, based on the foregoing scheme, the text line determination unit includes: the fusion unit is used for performing fusion processing on the text detection boxes according to the position relation among the text detection boxes to obtain the text detection boxes after the fusion processing; and the result determining unit is used for taking the text lines contained in the text detection boxes after the fusion processing as the detected text lines in the corrected image.
In some embodiments of the invention, based on the foregoing, the fusion unit is configured to: if, of any two text detection boxes among the plurality of text detection boxes, one contains the other, or the ratio of the overlapping area of the two boxes to the area of either of them exceeds a first preset value, fuse the two text detection boxes; and if the directions of any two text detection boxes among the plurality of text detection boxes are consistent, and the ratio of the overlapping area of the two boxes to the area of either of them exceeds a second preset value, fuse the two text detection boxes.
In some embodiments of the invention, based on the foregoing, the fusion unit is configured to: and generating the minimum circumscribed rectangle of any two text detection boxes, and taking the minimum circumscribed rectangle of any two text detection boxes as the fusion result of any two text detection boxes.
In some embodiments of the present invention, based on the foregoing solution, the apparatus for detecting text in an image further includes: a text line acquisition unit configured to acquire a text line detected in the corrected image; and the identification unit is used for identifying the text content in the text line.
According to an aspect of an embodiment of the present invention, there is provided a computer readable medium, on which a computer program is stored, which when executed by a processor, implements the method of detecting text in an image as described in the above embodiments.
According to an aspect of an embodiment of the present invention, there is provided an electronic apparatus including: one or more processors; a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement a method of detecting text in an image as described in the above embodiments.
The technical scheme provided by the embodiment of the invention has the following beneficial effects:
in the technical solutions provided in some embodiments of the present invention, through performing perspective transformation processing on an image to be processed and performing text detection based on a corrected image obtained after the processing, a front view can be obtained by adjusting an angle of the image to be processed before performing the text detection, and further the text detection can be performed based on the obtained front view, so that accuracy of the text detection is improved, and problems of difficulty in text detection and low accuracy caused by image deformation (for example, deviation of character strokes, change of stroke width and relative positions between strokes, and the like) are avoided.
In the technical solutions provided in some embodiments of the present invention, by detecting a straight line segment in an image to be processed, a target straight line segment meeting a condition is selected, a quadrangle with the largest area that can be formed by the straight line where the target straight line segment is located is determined, and a rectangular frame corresponding to the quadrangle is generated at the same time, so as to construct a perspective transformation matrix according to a correspondence between each vertex of the quadrangle and each vertex of the rectangular frame, so that an accurate perspective transformation matrix can be constructed through the generated quadrangle and the rectangular frame, and further, the image to be processed can be processed based on the constructed perspective transformation matrix, so as to ensure that a corrected image convenient for text detection is obtained.
In the technical solutions provided by some embodiments of the present invention, a plurality of images of different sizes are generated from the corrected image and text is detected in each of them, so that text detection can be performed on the corrected image according to the text detection boxes in these images. Detecting and processing text across the multiple images ensures that the boundaries of the text detection boxes in the corrected image are more accurate, which further improves the accuracy of text detection.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
Drawings
Fig. 1 is a schematic diagram illustrating an exemplary system architecture of a method of detecting text in an image or an apparatus for detecting text in an image to which an embodiment of the present invention may be applied;
FIG. 2 shows a schematic structural diagram of a computer system suitable for implementing the electronic device of an embodiment of the invention;
FIG. 3 schematically shows a flow diagram of a method of detecting text in an image according to one embodiment of the invention;
FIG. 4 schematically shows a flow chart of one implementation of step S320 shown in FIG. 3;
FIG. 5 schematically shows a flow chart of one implementation of step S410 shown in FIG. 4;
FIG. 6 schematically shows a flowchart of one implementation of step S510 shown in FIG. 5;
FIG. 7 schematically illustrates a flow chart of one implementation of step S330 shown in FIG. 3;
FIG. 8 schematically illustrates a flow chart of one implementation of step S730 shown in FIG. 7;
FIG. 9 schematically illustrates a flow diagram of a method of detecting text in an image according to another embodiment of the invention;
FIG. 10 schematically illustrates a flow diagram of a text detection scheme in accordance with one embodiment of the present invention;
FIG. 11 schematically shows a flow diagram of a process of perspective transformation of an input image according to one embodiment of the invention;
FIG. 12 shows a schematic diagram of determining horizontal lines in an input image according to an embodiment of the invention;
FIG. 13 shows a schematic diagram of determining a line support area according to an embodiment of the invention;
FIG. 14 shows a schematic diagram of an input image according to an embodiment of the invention;
fig. 15 is a diagram illustrating a result of straight line detection on an input image according to an embodiment of the present invention;
FIG. 16 is a diagram illustrating the determination of quadrilateral and rectangular frames based on the results of line detection according to an embodiment of the present invention;
fig. 17 is a diagram illustrating an effect of perspective transformation processing on an input image according to an embodiment of the present invention;
FIG. 18 schematically shows a flow diagram for text detection of a corrected image according to an embodiment of the invention;
FIG. 19 is a diagram illustrating a specific detection process of a text detection network according to an embodiment of the present invention;
FIG. 20 is a diagram illustrating the results of text detection on a corrected image according to an embodiment of the invention;
FIG. 21 is a graph showing a comparison of the detection effect of the text detection scheme of the embodiment of the present invention and the existing text detection scheme;
FIG. 22 schematically shows a block diagram of an apparatus for detecting text in an image according to an embodiment of the present invention;
fig. 23 schematically shows a block diagram of an apparatus for detecting text in an image according to another embodiment of the present invention;
FIG. 24 schematically shows a block diagram of a first processing unit according to an embodiment of the invention;
FIG. 25 schematically shows a block diagram of a matrix building unit according to an embodiment of the invention;
FIG. 26 schematically shows a block diagram of a detection unit according to an embodiment of the invention;
FIG. 27 schematically shows a block diagram of a second processing unit according to an embodiment of the invention;
FIG. 28 schematically shows a block diagram of a second detection unit according to an embodiment of the invention;
fig. 29 schematically shows a block diagram of a text line determination unit according to an embodiment of the present invention.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to provide a thorough understanding of embodiments of the invention. One skilled in the relevant art will recognize, however, that the invention may be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, implementations or operations have not been shown or described in detail to avoid obscuring aspects of the invention.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
Fig. 1 shows a schematic diagram of an exemplary system architecture 100 of a method of detecting text in an image or an apparatus for detecting text in an image to which embodiments of the invention may be applied.
As shown in fig. 1, the system architecture 100 may include one or more of terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, and so forth.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. For example, server 105 may be a server cluster comprised of multiple servers, or the like.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may be various electronic devices having a display screen, including but not limited to smart phones, tablet computers, portable computers, desktop computers, and the like.
The server 105 may be a server that provides various services. For example, a user uploads an image to be processed containing text content to the server 105 by using the terminal device 103 (or the terminal device 101 or 102). After acquiring the image to be processed, the server 105 performs perspective transformation processing on it to adjust it into a front view and obtain a processed corrected image, and text detection can then be performed based on the corrected image. According to the embodiment of the invention, text detection is carried out on the basis of the obtained front view, which improves the accuracy of text detection and overcomes the problems of difficult text detection and low accuracy caused by image deformation.
It should be noted that the method for detecting a text in an image provided by the embodiment of the present invention is generally executed by the server 105, and accordingly, the apparatus for detecting a text in an image is generally disposed in the server 105. However, in other embodiments of the present invention, the terminal may also have a similar function as the server, so as to execute the scheme of detecting text in an image provided by the embodiments of the present invention.
FIG. 2 shows a schematic structural diagram of a computer system suitable for implementing the electronic device of an embodiment of the invention.
It should be noted that the computer system 200 of the electronic device shown in fig. 2 is only an example, and should not bring any limitation to the functions and the scope of the application of the embodiment of the present invention.
As shown in fig. 2, the computer system 200 includes a Central Processing Unit (CPU) 201 that can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 202 or a program loaded from a storage section 208 into a Random Access Memory (RAM) 203. In the RAM 203, various programs and data necessary for system operation are also stored. The CPU 201, ROM 202, and RAM 203 are connected to each other via a bus 204. An input/output (I/O) interface 205 is also connected to bus 204.
The following components are connected to the I/O interface 205: an input portion 206 including a keyboard, a mouse, and the like; an output section 207 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section 208 including a hard disk and the like; and a communication section 209 including a network interface card such as a modem or the like. The communication section 209 performs communication processing via a network such as the internet. A drive 210 is also connected to the I/O interface 205 as needed. A removable medium 211 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 210 as necessary, so that a computer program read out therefrom is mounted into the storage section 208 as necessary.
In particular, according to an embodiment of the present invention, the processes described below with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the invention include a computer program product comprising a computer program embodied on a computer-readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 209 and/or installed from the removable medium 211. The computer program executes various functions defined in the system of the present application when executed by a Central Processing Unit (CPU) 201.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present invention may be implemented by software, or may be implemented by hardware, and the described units may also be disposed in a processor. As another aspect, the present application also provides a computer-readable medium, which may be contained in the electronic device described in the above embodiments; or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by an electronic device, cause the electronic device to implement the method as described in the embodiments below. For example, the electronic device may implement the steps shown in fig. 3 to 11 and fig. 18.
The implementation details of the technical scheme of the embodiment of the invention are explained in detail as follows:
fig. 3 schematically shows a flow chart of a method of detecting text in an image according to an embodiment of the invention, which is applicable to the electronic device described in the previous embodiment. Referring to fig. 3, the method at least includes steps S310 to S330, which are described in detail as follows:
in step S310, an image to be processed is acquired.
In an embodiment of the present invention, the image to be processed is an image containing text content, for example, the image to be processed may be an image of a natural scene captured by various capturing devices (such as a camera, a mobile phone with a capturing function, and the like).
In step S320, perspective transformation processing is performed on the image to be processed to adjust the image to be processed into a front view, so as to obtain a processed corrected image.
In the embodiment of the present invention, adjusting the image to be processed into a front view means adjusting the view of the objects contained in the image to be processed into a front view, so as to facilitate subsequent text detection. In an embodiment of the present invention, as shown in fig. 4, a processing procedure of step S320 includes the following steps:
step S410, a perspective transformation matrix is constructed.
In an embodiment of the present invention, the perspective transformation matrix represents a transformation relationship between the original image and the corrected image so as to convert the image to be processed into a front view.
And step S420, performing perspective transformation processing on the image to be processed according to the perspective transformation matrix.
In one embodiment of the present invention, referring to fig. 5, a processing procedure of step S410 shown in fig. 4 includes the following steps S510 to S550, which are described in detail as follows:
in step S510, a straight line segment in the image to be processed is detected.
In one embodiment of the present invention, as shown in fig. 6, one process of step S510 includes the following steps:
step S610, determining the included angle between each pixel point in the image to be processed and a horizontal line, and merging the pixel points whose included-angle difference is within a preset range to obtain at least one region;
step S620, generating the minimum circumscribed rectangle of each region;
step S630, aiming at each region, selecting a target pixel point of which the angle difference between the included angle and the main direction of the minimum circumscribed rectangle is smaller than or equal to a preset value;
and step S640, determining whether each region is a straight line segment according to the number of pixel points in the minimum circumscribed rectangle of each region and the number of target pixel points.
In an embodiment of the present invention, if the number of target pixels in a region is larger, the region is more likely to be a straight line segment.
Continuing to refer to fig. 5, in step S520, a qualified target straight-line segment is selected from the detected straight-line segments.
In an embodiment of the present invention, the step S520 of selecting a qualified target straight-line segment from the detected straight-line segments includes: and filtering out straight line segments with the length smaller than or equal to a preset length from the detected straight line segments, and/or filtering out straight line segments with the included angle with the vertical direction and/or the horizontal direction larger than or equal to a preset angle to obtain the target straight line segment.
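As an illustration of this filtering rule, the following Python sketch keeps only segments that are long enough and roughly aligned with the horizontal or vertical axis. The function name and the thresholds min_len and max_axis_angle_deg are assumptions for illustration, not values prescribed by the embodiment:

```python
import numpy as np

def filter_segments(segments, min_len=50.0, max_axis_angle_deg=20.0):
    """Keep straight line segments that are long enough and roughly
    axis-aligned; segments are (x1, y1, x2, y2) tuples."""
    kept = []
    for x1, y1, x2, y2 in segments:
        if np.hypot(x2 - x1, y2 - y1) <= min_len:
            continue  # filter out segments at or below the preset length
        # angle of the segment with the horizontal axis, folded into [0, 90]
        angle = abs(np.degrees(np.arctan2(y2 - y1, x2 - x1))) % 180.0
        angle = min(angle, 180.0 - angle)
        # keep segments close to the horizontal or the vertical direction
        if angle <= max_axis_angle_deg or angle >= 90.0 - max_axis_angle_deg:
            kept.append((x1, y1, x2, y2))
    return kept
```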
Continuing to refer to fig. 5, in step S530, a quadrangle with the largest area formed by the straight lines of the target straight-line segment is determined.
In one embodiment of the present invention, the target straight-line segments may be extended, and the quadrangle with the largest area may be determined from the pairwise intersections of the resulting straight lines.
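One way this search could be realized is sketched below: each target segment is extended to an infinite line in homogeneous coordinates, corner candidates are taken from pairwise line intersections, and the quadrangle with the largest shoelace area wins. Splitting the candidates into roughly horizontal and roughly vertical lines, and the corner ordering, are simplifying assumptions of this sketch:

```python
import itertools
import numpy as np

def line_coeffs(seg):
    """Homogeneous line through a segment's endpoints (extends it infinitely)."""
    x1, y1, x2, y2 = seg
    return np.cross([x1, y1, 1.0], [x2, y2, 1.0])

def intersect(l1, l2):
    """Intersection of two homogeneous lines, or None if (near) parallel."""
    p = np.cross(l1, l2)
    return None if abs(p[2]) < 1e-9 else (p[0] / p[2], p[1] / p[2])

def shoelace_area(pts):
    x, y = np.array([p[0] for p in pts]), np.array([p[1] for p in pts])
    return 0.5 * abs(np.dot(x, np.roll(y, 1)) - np.dot(y, np.roll(x, 1)))

def largest_quadrangle(h_segs, v_segs):
    """Largest-area quadrangle bounded by two 'horizontal' and two
    'vertical' extended target segments."""
    h_lines = [line_coeffs(s) for s in h_segs]
    v_lines = [line_coeffs(s) for s in v_segs]
    best, best_area = None, 0.0
    for (h1, h2), (v1, v2) in itertools.product(
            itertools.combinations(h_lines, 2),
            itertools.combinations(v_lines, 2)):
        corners = [intersect(h, v)
                   for h, v in ((h1, v1), (h1, v2), (h2, v2), (h2, v1))]
        if any(c is None for c in corners):
            continue
        area = shoelace_area(corners)
        if area > best_area:
            best, best_area = corners, area
    return best
```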
Continuing to refer to fig. 5, in step S540, a rectangular frame corresponding to the quadrangle is generated.
In an embodiment of the present invention, the generating a rectangular frame corresponding to the quadrangle in step S540 includes: and generating the rectangular frame by taking two nonadjacent vertexes of the quadrangle as two nonadjacent vertexes of the rectangular frame.
In this embodiment, the rectangular frame may be generated with the upper left vertex and the lower right vertex of the quadrangle as the upper left vertex and the lower right vertex of the rectangular frame, or with the lower left vertex and the upper right vertex of the quadrangle as the lower left vertex and the upper right vertex of the rectangular frame.
With continued reference to fig. 5, in step S550, the perspective transformation matrix is constructed according to the corresponding relationship between each vertex of the quadrangle and each vertex of the rectangular frame.
According to the technical scheme of the embodiment shown in fig. 5, an accurate perspective transformation matrix can be constructed through the generated quadrangle and the generated rectangular frame, and then the image to be processed can be processed based on the constructed perspective transformation matrix, so that a corrected image convenient for text detection can be obtained.
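A minimal OpenCV sketch of this construction, assuming the quadrangle's vertices come ordered as (top-left, top-right, bottom-right, bottom-left) and the rectangle is built from the quadrangle's top-left and bottom-right vertices as described above; cv2.getPerspectiveTransform and cv2.warpPerspective are one possible realization, not the implementation mandated by the embodiment:

```python
import cv2
import numpy as np

def correct_perspective(image, quad):
    """quad: four (x, y) vertices of the detected quadrangle, ordered
    (top-left, top-right, bottom-right, bottom-left)."""
    (tlx, tly), _, (brx, bry), _ = quad
    # rectangle sharing the quad's top-left and bottom-right vertices
    dst = np.float32([[tlx, tly], [brx, tly], [brx, bry], [tlx, bry]])
    src = np.float32(quad)
    matrix = cv2.getPerspectiveTransform(src, dst)  # 3x3 perspective matrix
    h, w = image.shape[:2]
    return cv2.warpPerspective(image, matrix, (w, h))
```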
As shown with continued reference to fig. 3, in step S330, text detection is performed based on the corrected image.
In an embodiment of the present invention, referring to fig. 7, a process of step S330 includes the following steps:
in step S710, a plurality of images with different sizes are generated based on the corrected image.
In one embodiment of the present invention, a plurality of images of different sizes may be generated by performing reduction and/or enlargement processing on the corrected image.
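A short sketch of this multi-scale generation with OpenCV; the scale factors are illustrative (the embodiment of fig. 18 below uses the original size and a 1/2 reduction):

```python
import cv2

def multiscale_images(corrected, scales=(1.0, 0.5)):
    """Return (scale, image) pairs of differently sized copies of the
    corrected image, produced by reduction and/or enlargement."""
    return [(s, cv2.resize(corrected, None, fx=s, fy=s)) for s in scales]
```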
Step S720, detecting texts in the images with different sizes respectively to obtain text detection boxes in the images with different sizes.
In an embodiment of the present invention, the text in the image may be detected based on a Fully Convolutional Network (FCN) and Non-Maximum Suppression (NMS), where a text detection box labels the detected text content in the form of a box.
Step S730, performing text detection on the corrected image according to the text detection boxes in the images of different sizes.
In an embodiment of the present invention, as shown in fig. 8, a specific processing procedure of step S730 includes the following steps:
step S810, mapping the text detection boxes in the plurality of images with different sizes to the corrected image according to the size relationship between the plurality of images with different sizes and the corrected image, so as to obtain a plurality of text detection boxes.
In an embodiment of the present invention, since there is a size correspondence between a plurality of images with different sizes and the corrected image, for example, the corrected image is reduced and/or enlarged to obtain a plurality of images with different sizes as proposed in the above embodiment, the text detection boxes in the plurality of images with different sizes can be mapped to the corrected image according to the size correspondence, and then a plurality of text detection boxes appear in the corrected image.
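Because each detection was produced at a known scale factor relative to the corrected image, mapping a box back is a division by that factor. A minimal sketch, assuming axis-aligned boxes in (x1, y1, x2, y2) form:

```python
def map_boxes_to_corrected(boxes, scale):
    """Map boxes detected in an image scaled by `scale` back into the
    coordinate system of the corrected image."""
    return [tuple(coord / scale for coord in box) for box in boxes]
```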
Step S820, determining text lines in the corrected image according to the position relationship among the text detection boxes.
In an embodiment of the present invention, step S820 specifically includes: according to the position relation among the text detection boxes, carrying out fusion processing on the text detection boxes to obtain the text detection boxes after the fusion processing; and taking the text line contained in the text detection box after the fusion processing as the detected text line in the corrected image.
In an embodiment of the present invention, the fusion processing may be performed on the plurality of text detection boxes according to the following manner:
fusion method 1: and if any two text detection boxes in the plurality of text detection boxes are mutually contained, or the ratio of the overlapping area of any two text detection boxes to the area of one text detection box in any two text detection boxes exceeds a first preset value, fusing any two text detection boxes.
Fusion mode 2: and if the directions of any two text detection boxes in the plurality of text detection boxes are consistent, and the ratio of the overlapping area of any two text detection boxes to the area of one text detection box in any two text detection boxes exceeds a second preset value, fusing any two text detection boxes.
In an embodiment of the present invention, the second predetermined value in the fusion mode 2 may be smaller than the first predetermined value in the fusion mode 1, for example, the first predetermined value may be 0.8, the second predetermined value may be 0.5, and the like.
In an embodiment of the present invention, in the fusion manner 1 and the fusion manner 2, the process of fusing any two text detection boxes may be to generate a minimum bounding rectangle of the two text detection boxes, and use the minimum bounding rectangle as a fusion result of the two text detection boxes.
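For axis-aligned boxes, the two fusion modes and the minimum-circumscribed-rectangle fusion can be sketched as follows. Reading "the area of one text detection box" as the smaller of the two areas is an assumption of this sketch; inclined boxes would instead need their minimum circumscribed rectangle computed from corner points (e.g., via cv2.minAreaRect):

```python
def overlap_ratio(a, b):
    """Overlapping area divided by the smaller box's area;
    boxes are axis-aligned (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter_w * inter_h / min(area(a), area(b))

def contains(a, b):
    """True if box a contains box b."""
    return a[0] <= b[0] and a[1] <= b[1] and a[2] >= b[2] and a[3] >= b[3]

def should_fuse(a, b, same_direction, first_preset=0.8, second_preset=0.5):
    # fusion mode 1: containment, or overlap above the first preset value
    if contains(a, b) or contains(b, a) or overlap_ratio(a, b) > first_preset:
        return True
    # fusion mode 2: consistent direction plus a looser overlap threshold
    return same_direction and overlap_ratio(a, b) > second_preset

def fuse(a, b):
    """Fusion result: the minimum circumscribed rectangle of both boxes."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))
```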
In an embodiment of the present invention, as shown in fig. 9, a method for detecting a text in an image according to another embodiment of the present invention further includes the following steps on the basis of step S310, step S320 and step S330 shown in fig. 3:
in step S340, the text lines detected in the corrected image are acquired.
In one embodiment of the invention, the text lines detected in the corrected image may be text lines framed by a text detection box.
In step S350, the text content in the text line is identified.
In one embodiment of the present invention, the text content in the text line may be recognized by a text recognition technique, such as an Optical Character Recognition (OCR) technique.
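A minimal recognition sketch follows; pytesseract (a Python wrapper around the Tesseract OCR engine) is an assumed backend chosen for illustration, not the recognition engine specified by the embodiment:

```python
import pytesseract
from PIL import Image

def recognize_text_line(line_image_path):
    """OCR a cropped text-line image; lang='chi_sim+eng' assumes
    mixed Chinese/English content."""
    return pytesseract.image_to_string(Image.open(line_image_path),
                                       lang='chi_sim+eng')
```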
A specific application scenario of the embodiment of the present invention and how to implement the technical solution of the embodiment of the present invention in combination with the application scenario are explained in detail below.
In an application scenario of the invention, natural scene images can be processed. A natural scene image is influenced by factors such as the shooting angle and illumination of the mobile terminal device and can exhibit geometric and perspective deformation; moreover, it contains little text content, a limited number of text lines, complex typesetting, no obvious paragraph characteristics, and sometimes incomplete edge information, so text in a natural scene image is difficult to detect. To address these problems, the flow of the text detection scheme provided by the embodiment of the present invention is shown in fig. 10 and includes the following steps:
step S1010, inputting an image, that is, inputting a natural scene image to be processed.
In step S1020, perspective transformation is performed, i.e., perspective transformation processing is applied to the input natural scene image to convert it into a front view.
Step S1030, text detection is performed based on the converted image.
And step S1040, performing text recognition based on the text detection result.
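Chained together, the four steps above form the pipeline sketched below; the helper functions are hypothetical stand-ins for steps S1020 to S1040, which the remainder of this section describes in detail:

```python
import cv2

def detect_text_in_image(path):
    """End-to-end flow of fig. 10 (helper names are hypothetical)."""
    image = cv2.imread(path)                      # S1010: input image
    corrected = correct_perspective_auto(image)   # S1020: perspective transform
    text_lines = detect_text_lines(corrected)     # S1030: text detection
    return [recognize_line(corrected, box)        # S1040: text recognition
            for box in text_lines]
```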
Among the above steps, step S1020 and step S1030 are the main points of the embodiment of the present invention, and are described in detail below.
In an embodiment of the present invention, the perspective transformation processing performed in step S1020 mainly includes forming a quadrangle by using edges such as windows and billboards appearing in the image, then calculating a perspective transformation matrix based on the quadrangle, and further obtaining a corrected image based on the perspective transformation matrix, where a specific processing procedure is as shown in fig. 11, and includes the following steps:
in step S1021, a Line Segment Detector (LSD) detects a Line Segment in the input image by a Line detection algorithm.
In an embodiment of the present invention, detecting a straight line segment in an image is to find points in the image where brightness changes are obvious, and when the positions of the points are adjacent and the gradient directions are close, a special edge-straight line segment in the image is formed.
During detection, an LSD line detection algorithm can be adopted, and the specific process is as follows:
(1) Calculate the included angle between each pixel in the input image and a level-line (horizontal line) to form a level-line field.
In one embodiment of the present invention, as shown in fig. 12, a point where the brightness change is obvious is found in the image, then the gradient direction is determined based on the brightness change, and a line perpendicular to the gradient direction is taken as a horizontal line.
(2) Merge pixels with approximately the same direction in the level-line field; this results in a series of regions called line support regions.
Specifically, as shown in fig. 13, diagram (a) in fig. 13 is a part of an input image; diagram (b) in fig. 13 is the level-line field, in which each line segment represents the direction of the angle between a pixel and the level-line; diagram (c) in fig. 13 is a schematic diagram of the regions obtained after merging pixels having approximately the same direction in the level-line field (i.e., the shaded portions in diagram (c)).
In one embodiment of the present invention, each line support region is a group of pixels that is also a candidate for a straight-line segment. Intuitively, when a group of pixels is particularly thin, that group is more likely to be a straight-line segment. Based on this, the main direction of the minimum bounding rectangle of the line support region may be determined, and if the angle difference between the level-line angle of a pixel in the line support region and the main direction of the minimum bounding rectangle is within the tolerance 2τ, that pixel is called an "aligned point". In the embodiment of the present invention, the total number of pixels in the minimum bounding rectangle of the line support region and the number of aligned points therein may be counted to determine whether the line support region is a straight-line segment; for example, the more aligned points there are, the more likely the line support region is a straight-line segment.
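The level-line field and the aligned-point criterion can be sketched as follows. Testing the angle difference against τ (so the accepted band has total width 2τ), and counting over a boolean region mask instead of the minimum bounding rectangle, are simplifications of the LSD computation described above:

```python
import numpy as np

def level_line_field(gray):
    """Per-pixel level-line angle: the direction perpendicular to the
    intensity gradient, plus the gradient magnitude."""
    gy, gx = np.gradient(gray.astype(np.float64))
    grad_angle = np.arctan2(gy, gx)
    magnitude = np.hypot(gx, gy)
    return grad_angle + np.pi / 2.0, magnitude

def aligned_point_ratio(level_angles, region_mask, main_direction, tau):
    """Fraction of pixels in a line support region whose level-line angle
    lies within tau of the region's main direction (a 2*tau band)."""
    delta = level_angles[region_mask] - main_direction
    delta = np.abs(np.angle(np.exp(1j * delta)))  # wrap into (-pi, pi]
    aligned = np.count_nonzero(delta <= tau)
    return aligned / max(1, np.count_nonzero(region_mask))
```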
With continued reference to fig. 11, the method further includes the steps of:
in step S1022, a quadrangle is found in the input image based on the detected straight line segments.
In step S1023, a perspective transformation matrix is calculated based on the found quadrangle.
Step S1024 is performed to perform perspective transformation processing on the input image.
Step S1025 outputs the processed image.
In an embodiment of the present invention, referring to fig. 14 to 17, fig. 14 is an input image that, affected by factors such as the shooting angle and illumination of the mobile terminal device, exhibits geometric deformation and perspective deformation. After straight line segments are detected in the input image based on the above step S1021, the detection result is as shown in fig. 15, which includes a plurality of straight line segments 151. Once the detection result of the straight line segments is obtained, short straight line segments and straight line segments with large angles can be filtered out through threshold values.
The filtered straight line segments can then be appropriately extended, and the quadrangle with the largest area can be found from the pairwise intersections of the straight lines, as shown by the quadrangle 161 in fig. 16. Meanwhile, a rectangular frame 162 is generated based on the vertices of the quadrangle 161. For example, a rectangular frame may be generated with the upper left vertex and the lower right vertex of the quadrangle as the upper left vertex and the lower right vertex of the rectangular frame, as shown in fig. 16; alternatively, the rectangular frame may be generated by using the lower left vertex and the upper right vertex of the quadrangle as the lower left vertex and the upper right vertex of the rectangular frame.
After the rectangular frame is generated, a perspective transformation matrix may be constructed from the four vertices of the quadrangle 161 and the four vertices of the rectangular frame 162, and then the input image (i.e., the image shown in fig. 14) may be subjected to perspective transformation processing according to the perspective transformation matrix to obtain a processed corrected image, as shown in fig. 17.
In an embodiment of the present invention, in step S1030, text detection is performed based on the converted image. Multi-scale feature fusion and multi-scale image input may be adopted to improve the detection network's capability on text of multiple scales, and a rule-based text line merging method is used to fuse the detection results of the multi-scale images. The specific process is shown in fig. 18 and includes the following steps:
step S181a, inputting the corrected image into the detection network; in step S181b, the 1/2 image with the corrected image reduced is input to the detection network.
In addition, in step S181b, the 1/2 image obtained by reducing the correction image is input to the detection network, but in another embodiment of the present invention, reduction or enlargement processing in an arbitrary ratio may be performed on the correction image.
In step S182, the result of file detection of the corrected image and the 1/2 image after the correction image reduction is output through the detection network.
In one embodiment of the invention, the detection network may employ Full Convolution (FCN) and non-maximum suppression (NMS) to obtain the final text detection result. Specifically, a specific detection process of detecting the network is shown in fig. 19, and mainly includes a feature extraction process, a feature merging process, and a result output process.
The Feature extractor process is used to extract features; a ResNet-50 structure (i.e., a ResNet network with 50 layers) may be adopted in the embodiment of the present invention. The four ResNet blocks in the text detection network respectively correspond to the four block structures conv2_x to conv5_x in ResNet-50. Unlike the standard ResNet-50, the feature extraction process in the embodiment of the present invention discards the final average pooling layer. In ResNet-50, conv2_x comprises 3 three-layer convolution blocks, that is, 9 convolution operations in total.
The role of the Feature-merging process is to perform feature fusion. Since text in a natural scene image appears at multiple scales, features at multiple scales must be fused effectively for text detection. Therefore, in the embodiment of the present invention, the feature f4 obtained after the 3 × 3 max pooling in conv2_x is selected and fused with the outputs f3, f2 and f1 of conv3_x, conv4_x and conv5_x. It should be noted that f4, f3, f2 and f1 are feature maps of different sizes, so in embodiments of the present invention the features are first up-sampled, then concatenated, and then passed through two convolutional layers (1 × 1 and 3 × 3).
The result output process outputs a confidence score map and a coordinate-regression geometry map, both at 1/4 the scale of the input image; i.e., assuming the input image is 512 × 512, a 128 × 128 feature map is output. The score map represents the probability that each pixel belongs to text, and the geometry map describes the regression rectangle at each pixel. In the embodiment of the present invention, the rectangular box has three expressions: a standard rectangular box {x, y, w, h}, an inclined rectangular box {x, y, w, h, theta}, and an arbitrary quadrilateral {x1, y1, x2, y2, x3, y3, x4, y4}. RBOX (rotated box) and QUAD (quadrilateral) in the text detection network correspond to the inclined rectangular box and the arbitrary quadrilateral respectively, and different output expressions can be selected for different text detection tasks. For scene text detection in arbitrary directions, either RBOX or QUAD can be selected.
It should be noted that "7 × 7, 64, /2" in fig. 19 indicates a convolution layer with a 7 × 7 kernel, 64 channels, and a stride of 2; "resnet block 1, 256, /2" indicates that the last convolutional layer of block 1 has 256 channels and a stride of 2; and "3 × 3, 32" indicates a convolution layer with a 3 × 3 kernel and 32 channels (other similar notations in fig. 19 are read analogously). In other embodiments of the present invention, these parameters can be adjusted as needed according to actual requirements.
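The merging step (upsample, concatenate, then 1 × 1 and 3 × 3 convolutions) and the dual score/geometry output can be sketched in PyTorch as below. The channel counts and the 5-channel RBOX geometry are illustrative assumptions, and this is a simplified fragment rather than the full detection network of fig. 19:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MergeBlock(nn.Module):
    """Upsample the coarser feature map, concatenate it with the skip
    feature, then apply 1x1 and 3x3 convolutions (cf. the merging branch)."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch + skip_ch, out_ch, kernel_size=1)
        self.conv3 = nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, coarse, skip):
        up = F.interpolate(coarse, size=skip.shape[2:],
                           mode='bilinear', align_corners=False)
        x = torch.cat([up, skip], dim=1)
        return F.relu(self.conv3(F.relu(self.conv1(x))))

class DetectionHead(nn.Module):
    """Score map (per-pixel text probability) and geometry map (box
    regression), both at 1/4 of the input resolution."""
    def __init__(self, in_ch, geo_ch=5):  # 5 channels ~ RBOX {x, y, w, h, theta}
        super().__init__()
        self.score = nn.Conv2d(in_ch, 1, kernel_size=1)
        self.geometry = nn.Conv2d(in_ch, geo_ch, kernel_size=1)

    def forward(self, merged):
        return torch.sigmoid(self.score(merged)), self.geometry(merged)
```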
With continued reference to FIG. 18, the method further includes the following steps:
in step S183, the detection frames output in step S182 are merged.
In an embodiment of the present invention, the merging process performed on the detection boxes output in step S182 mainly includes the following steps: first, the detection boxes output by the detection network for each image are mapped into the original corrected image according to the proportional relationship; then, the detection boxes are merged in the corrected image based on the following strategies:
(1) If one of the two detection boxes contains the other, or the ratio of the overlapping area of the two boxes to the area of either of them exceeds 0.8, the two detection boxes are fused;
(2) If the directions of the two detection boxes are consistent and the ratio of the overlapping area of the two boxes to the area of either of them exceeds 0.5, the two detection boxes are fused.
It should be noted that the numerical values in this embodiment are merely examples, and the numerical values may be set as needed in actual applications.
With continued reference to FIG. 18, the method further includes the following steps:
step S184 outputs the detection result.
In an embodiment of the present invention, if the corrected image is the image shown in fig. 17, the detection result output after the processing by the flow shown in fig. 18 is as shown in fig. 20, and it can be seen in fig. 20 that the boundary of the detection frame 2010 very accurately contains the text content to be detected.
In addition, although fig. 18 describes the technical solution of the embodiment of the present invention by taking two images as an example, in other embodiments of the present invention the corrected image may be reduced and/or enlarged to obtain a larger number of images, which are then input into the detection network for text detection, and the final detection result is output after merging the text detection results of these images.
The technical solution of the embodiment of the present invention can improve the accuracy of text detection. Specifically, as shown in fig. 21, diagram (a) in fig. 21 is the detection result of an existing text detection scheme, and diagram (b) in fig. 21 is the result of detecting text through the technical solution of the embodiment of the present invention. As can be seen from fig. 21, the technical solution of the embodiment of the present invention can accurately detect text lines in an image, solving the problem of inaccurate detection-box boundaries in existing text detection schemes and thereby improving the accuracy of text detection and text recognition.
The technical solution of the embodiment of the present invention can be applied to a specific scenario of constructing basic data for map query software. For example, images of natural scenes are acquired by manual collection, text detection is then performed on them using the text detection scheme of the embodiment of the present invention, and POI (Point of Interest) data, such as restaurants, hotels, stations and parking lots, is obtained based on the text detection results.
Embodiments of the apparatus of the present invention will be described below, which can be used to perform the method for detecting text in an image in the above-described embodiments of the present invention. For details that are not disclosed in the embodiments of the apparatus of the present invention, please refer to the embodiments of the method for detecting text in image of the present invention.
Fig. 22 schematically shows a block diagram of an apparatus for detecting text in an image according to an embodiment of the present invention.
Referring to fig. 22, an apparatus 220 for detecting text in an image according to an embodiment of the present invention includes: an image acquisition unit 221, a first processing unit 222, and a second processing unit 223.
The image acquiring unit 221 is configured to acquire an image to be processed; the first processing unit 222 is configured to perform perspective transformation processing on the image to be processed, so as to adjust the image to be processed into a front view, and obtain a processed corrected image; the second processing unit 223 is used for text detection based on the corrected image.
Fig. 23 schematically shows a block diagram of an apparatus for detecting text in an image according to another embodiment of the present invention.
Referring to fig. 23, an apparatus 230 for detecting a text in an image according to another embodiment of the present invention, based on the apparatus for detecting a text in an image shown in fig. 22, further includes: a text line acquisition unit 224 and a recognition unit 225.
Wherein the text line acquiring unit 224 is configured to acquire a text line detected in the corrected image; the recognition unit 225 is configured to recognize the text content in the text line.
Referring to fig. 24, in one embodiment of the present invention, the first processing unit 222 shown in fig. 22 and 23 includes: a matrix construction unit 2221 and a perspective transformation unit 2222.
The matrix construction unit 2221 is configured to construct a perspective transformation matrix; the perspective transformation unit 2222 is configured to perform perspective transformation processing on the image to be processed according to the perspective transformation matrix.
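A minimal sketch of these two units using OpenCV; the library choice, the placeholder vertex coordinates, and the output size are assumptions of the sketch rather than details of the embodiment:

```python
import cv2
import numpy as np

# Placeholder vertices: quad stands for the four vertices of the
# quadrangle found in the image to be processed, rect for the four
# vertices of the generated rectangular frame.
quad = np.float32([[120, 80], [520, 60], [560, 400], [100, 420]])
rect = np.float32([[100, 60], [560, 60], [560, 420], [100, 420]])

# Matrix construction unit: build the 3x3 perspective transformation
# matrix from the vertex-to-vertex correspondence.
matrix = cv2.getPerspectiveTransform(quad, rect)

# Perspective transformation unit: warp the image into a front view.
image = np.zeros((480, 640, 3), dtype=np.uint8)  # stand-in input image
corrected = cv2.warpPerspective(image, matrix, (640, 480))
```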
Referring to fig. 25, in one embodiment of the present invention, the matrix construction unit 2221 includes: a straight line segment detecting unit 251, a straight line segment selecting unit 252, a quadrangle determining unit 253, a first generating unit 254, and a constructing unit 255.
The straight line segment detection unit 251 is used for detecting straight line segments in the image to be processed; the straight line segment selection unit 252 is configured to select a target straight line segment that meets the condition from the detected straight line segments; the quadrangle determining unit 253 is configured to determine a quadrangle with a largest area that can be formed by the straight lines where the target straight line segments are located; the first generating unit 254 is configured to generate a rectangular frame corresponding to the quadrangle; the constructing unit 255 is configured to construct the perspective transformation matrix according to a corresponding relationship between each vertex of the quadrangle and each vertex of the rectangular frame.
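How the quadrangle of largest area is searched for is not spelled out in this passage; the brute-force sketch below, which pairs roughly horizontal with roughly vertical candidate lines and keeps the largest enclosed area, is one possible reading, not the embodiment's prescribed procedure:

```python
import itertools
import numpy as np

def to_line(seg):
    """Convert a segment (x1, y1, x2, y2) to a (point, direction) pair."""
    p = np.array(seg[:2], dtype=float)
    return p, np.array(seg[2:], dtype=float) - p

def intersect(l1, l2):
    """Intersection of two infinite lines; None if nearly parallel."""
    (p1, d1), (p2, d2) = l1, l2
    a = np.column_stack((d1, -d2))
    if abs(np.linalg.det(a)) < 1e-9:
        return None
    t = np.linalg.solve(a, p2 - p1)[0]
    return p1 + t * d1

def largest_quadrilateral(segments):
    """Try every pair of horizontal-ish and vertical-ish lines and
    keep the quadrangle of largest area (shoelace formula)."""
    lines = [to_line(s) for s in segments]
    horiz = [l for l in lines if abs(l[1][1]) < abs(l[1][0])]
    vert = [l for l in lines if abs(l[1][1]) >= abs(l[1][0])]
    best, best_area = None, 0.0
    for h1, h2 in itertools.combinations(horiz, 2):
        for v1, v2 in itertools.combinations(vert, 2):
            corners = [intersect(a, b) for a, b in
                       ((h1, v1), (h1, v2), (h2, v2), (h2, v1))]
            if any(c is None for c in corners):
                continue
            pts = np.array(corners)
            x, y = pts[:, 0], pts[:, 1]
            area = 0.5 * abs(np.dot(x, np.roll(y, 1))
                             - np.dot(y, np.roll(x, 1)))
            if area > best_area:
                best, best_area = pts, area
    return best  # 4x2 array of vertices, or None
```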
Referring to fig. 26, in one embodiment of the present invention, the straight line segment detecting unit 251 includes: a merging unit 2511, a second generating unit 2512, a pixel point selecting unit 2513, and a straight line segment determining unit 2514.
The merging unit 2511 is configured to determine an included angle between each pixel point in the image to be processed and a horizontal line, and merge pixel points whose difference values of the included angles are within a predetermined range to obtain at least one region; a second generating unit 2512 for generating a minimum bounding rectangle for each of the regions; the pixel point selecting unit 2513 is configured to select, for each of the regions, a target pixel point for which an angle difference between the included angle and the main direction of the minimum circumscribed rectangle is smaller than or equal to a predetermined value; the straight line segment determining unit 2514 is configured to determine whether each of the regions is a straight line segment according to the number of pixels in the minimum bounding rectangle of each of the regions and the number of the target pixels.
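The region-growing procedure just described closely resembles the published LSD (Line Segment Detector) algorithm. As a practical stand-in — a substitution for the patented unit, not the unit itself — OpenCV ships an LSD implementation (availability varies across OpenCV versions for licensing reasons):

```python
import cv2
import numpy as np

gray = np.zeros((480, 640), dtype=np.uint8)  # grayscale stand-in image
lsd = cv2.createLineSegmentDetector()
segments = lsd.detect(gray)[0]  # None, or an array of (x1, y1, x2, y2) rows
```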
In some embodiments of the present invention, based on the foregoing scheme, the straight line segment selecting unit 252 is configured to: filter out, from the detected straight line segments, straight line segments whose length is smaller than or equal to a preset length, and/or straight line segments whose included angle with the vertical direction and/or the horizontal direction is larger than or equal to a preset angle, to obtain the target straight line segment.
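A minimal sketch of this selection rule, with placeholder values for the preset length and preset angle:

```python
import math

def select_target_segments(segments, min_len=20.0, max_angle_deg=30.0):
    """Keep segments that are long enough and close to the vertical
    or horizontal direction; min_len and max_angle_deg are
    illustrative placeholders for the preset length and angle."""
    kept = []
    for x1, y1, x2, y2 in segments:
        if math.hypot(x2 - x1, y2 - y1) <= min_len:
            continue  # filter out segments at or below the preset length
        angle = abs(math.degrees(math.atan2(y2 - y1, x2 - x1))) % 180.0
        off_horizontal = min(angle, 180.0 - angle)
        off_vertical = abs(angle - 90.0)
        if min(off_horizontal, off_vertical) >= max_angle_deg:
            continue  # filter out segments far from both axes
        kept.append((x1, y1, x2, y2))
    return kept
```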
In some embodiments of the present invention, based on the foregoing scheme, the first generating unit 254 is configured to: generate the rectangular frame by taking two non-adjacent vertices of the quadrangle as two non-adjacent vertices of the rectangular frame.
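For illustration, assuming the output rectangle is axis-aligned (an assumption of this sketch), the construction reduces to completing a rectangle from a diagonal:

```python
def rectangle_from_diagonal(v1, v2):
    """Build an axis-aligned rectangle whose two non-adjacent
    (diagonal) vertices are the given quadrangle vertices v1, v2."""
    (x1, y1), (x2, y2) = v1, v2
    return [(x1, y1), (x2, y1), (x2, y2), (x1, y2)]
```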
Referring to fig. 27, in one embodiment of the present invention, the second processing unit 223 includes: a third generation unit 2231, a first detection unit 2232, and a second detection unit 2233.
Wherein the third generating unit 2231 is configured to generate a plurality of images of different sizes based on the corrected image; the first detecting unit 2232 is configured to detect texts in the images with different sizes respectively to obtain text detection boxes in the images with different sizes; the second detecting unit 2233 is configured to perform text detection on the corrected image according to the text detection boxes in the multiple images with different sizes.
Referring to fig. 28, in one embodiment of the present invention, the second detection unit 2233 includes: a mapping unit 281 and a text line determination unit 282.
The mapping unit 281 is configured to map the text detection boxes in the multiple images with different sizes to the corrected image according to the size relationship between the multiple images with different sizes and the corrected image, so as to obtain multiple text detection boxes; the text line determination unit 282 is configured to determine a text line in the corrected image according to the positional relationship between the plurality of text detection boxes.
Referring to fig. 29, in one embodiment of the present invention, the text line determination unit 282 includes: a fusion unit 2821 and a result determination unit 2822.
The fusing unit 2821 is configured to perform fusing processing on the multiple text detection boxes according to the position relationships among the multiple text detection boxes, so as to obtain a text detection box after the fusing processing; the result determining unit 2822 is configured to use the text line included in the text detection box after the fusion processing as the detected text line in the corrected image.
In some embodiments of the present invention, based on the foregoing scheme, the fusion unit 2821 is configured to: fuse any two text detection boxes in the plurality of text detection boxes if one of them contains the other, or if the ratio of their overlapping area to the area of one of the two text detection boxes exceeds a first preset value; and fuse any two text detection boxes in the plurality of text detection boxes if their directions are consistent and the ratio of their overlapping area to the area of one of the two text detection boxes exceeds a second preset value.
In some embodiments of the present invention, based on the foregoing scheme, the fusion unit 2821 is configured to: generate the minimum circumscribed rectangle of the two text detection boxes to be fused, and take the minimum circumscribed rectangle as the fusion result of the two text detection boxes.
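A sketch of this fusion result using OpenCV's rotated-rectangle utilities; representing each box as a 4 x 2 array of corner points is an assumption of the sketch:

```python
import cv2
import numpy as np

def fuse_boxes(corners_a, corners_b):
    """Minimum circumscribed rectangle enclosing two detection boxes."""
    pts = np.vstack((corners_a, corners_b)).astype(np.float32)
    rect = cv2.minAreaRect(pts)   # ((cx, cy), (w, h), angle)
    return cv2.boxPoints(rect)    # 4 x 2 array: corners of the fused box
```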
It should be noted that although several modules or units of the device for action execution are mentioned in the above detailed description, such a division is not mandatory. Indeed, according to embodiments of the present invention, the features and functions of two or more modules or units described above may be embodied in one module or unit; conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present invention can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and which includes several instructions to cause a computing device (which can be a personal computer, a server, a touch terminal, a network device, etc.) to execute the method according to the embodiments of the present invention.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims (20)

1. A method of detecting text in an image, comprising:
acquiring an image to be processed;
carrying out perspective transformation processing on the image to be processed so as to adjust the image to be processed into a front view and obtain a processed corrected image;
generating a plurality of images of different sizes based on the corrected image;
respectively detecting texts in the images with different sizes to obtain text detection boxes in the images with different sizes;
mapping the text detection boxes in the images with different sizes to the corrected image according to the size relation between the images with different sizes and the corrected image to obtain a plurality of text detection boxes;
according to the position relation among the text detection boxes, carrying out fusion processing on the text detection boxes to obtain the text detection boxes after the fusion processing;
taking the text lines contained in the text detection boxes after the fusion processing as the text lines detected in the corrected image;
the fusing processing of the plurality of text detection boxes according to the position relationship among the plurality of text detection boxes includes:
if the ratio of the overlapping area of any two text detection boxes in the plurality of text detection boxes to the area of one of the two text detection boxes exceeds a first preset value, generating a minimum circumscribed rectangle of the two text detection boxes, and taking the minimum circumscribed rectangle as the fusion result of the two text detection boxes;
if the directions of any two text detection boxes in the plurality of text detection boxes are consistent and the ratio of their overlapping area to the area of one of the two text detection boxes exceeds a second preset value, taking the minimum circumscribed rectangle of the two text detection boxes as the fusion result of the two text detection boxes;
wherein the first predetermined value is greater than the second predetermined value.
2. The method of claim 1, wherein performing the perspective transformation processing on the image to be processed comprises:
constructing a perspective transformation matrix;
and performing perspective transformation processing on the image to be processed according to the perspective transformation matrix.
3. The method of claim 2, wherein constructing a perspective transformation matrix comprises:
detecting straight line segments in the image to be processed;
selecting a target straight line segment meeting the conditions from the detected straight line segments;
determining a quadrangle with the largest area formed by straight lines where the target straight line segments are located;
generating a rectangular frame corresponding to the quadrangle;
and constructing the perspective transformation matrix according to the corresponding relation between each vertex of the quadrangle and each vertex of the rectangular frame.
4. The method of claim 3, wherein detecting straight line segments in the image to be processed comprises:
determining an included angle between each pixel point in the image to be processed and a horizontal line, and combining pixel points whose included-angle differences are within a preset range, to obtain at least one region;
generating a minimum bounding rectangle of each region;
selecting, for each region, target pixel points of which the angle difference between the included angle and the main direction of the minimum circumscribed rectangle is smaller than or equal to a preset value;
and determining whether each region is a straight line segment according to the number of pixel points in the minimum circumscribed rectangle of each region and the number of the target pixel points.
5. The method of claim 3, wherein selecting a qualified target straight-line segment from the detected straight-line segments comprises:
filtering out, from the detected straight line segments, straight line segments whose length is smaller than or equal to a preset length, and/or straight line segments whose included angle with the vertical direction and/or the horizontal direction is larger than or equal to a preset angle, to obtain the target straight line segment.
6. The method of claim 3, wherein generating the rectangular box corresponding to the quadrangle comprises:
generating the rectangular frame by taking two non-adjacent vertices of the quadrangle as two non-adjacent vertices of the rectangular frame.
7. The method of claim 1, wherein the fusing processing of the plurality of text detection boxes according to the position relationship among the plurality of text detection boxes further comprises:
if one of any two text detection boxes in the plurality of text detection boxes contains the other, fusing the two text detection boxes.
8. The method of claim 7, wherein fusing any two text detection boxes comprises:
generating the minimum circumscribed rectangle of the two text detection boxes, and taking the minimum circumscribed rectangle as the fusion result of the two text detection boxes.
9. The method of detecting text in an image of any one of claims 1 to 8, further comprising:
acquiring text lines detected in the corrected image;
identifying the text content in the text line.
10. An apparatus for detecting text in an image, comprising:
the image acquisition unit is used for acquiring an image to be processed;
the first processing unit is used for carrying out perspective transformation processing on the image to be processed so as to adjust the image to be processed into a front view and obtain a processed corrected image;
a second processing unit configured to perform text detection based on the corrected image;
wherein the second processing unit comprises: a third generating unit configured to generate a plurality of images of different sizes based on the corrected image; a first detection unit configured to respectively detect text in the images of different sizes to obtain text detection boxes in the images of different sizes; and a second detection unit configured to perform text detection on the corrected image according to the text detection boxes in the images of different sizes;
the second detection unit includes: the mapping unit is used for mapping the text detection boxes in the images with different sizes to the corrected image according to the size relationship between the images with different sizes and the corrected image to obtain a plurality of text detection boxes; a text line determining unit, configured to determine a text line in the corrected image according to a positional relationship between the plurality of text detection boxes;
the text line determination unit includes: the fusion unit is used for performing fusion processing on the text detection boxes according to the position relation among the text detection boxes to obtain the text detection boxes after the fusion processing; a result determining unit, configured to use a text line included in the text detection box after the fusion processing as the detected text line in the corrected image;
the fusion unit is used for: if the ratio of the overlapping area of any two text detection boxes in the plurality of text detection boxes to the area of one text detection box in the any two text detection boxes exceeds a first preset value, generating a minimum circumscribed rectangle of the any two text detection boxes, and taking the minimum circumscribed rectangle of the any two text detection boxes as a fusion result of the any two text detection boxes; and if the directions of any two text detection boxes in the plurality of text detection boxes are consistent, and the ratio of the overlapping area of any two text detection boxes to the area of one text detection box in any two text detection boxes exceeds a second preset value, fusing any two text detection boxes, wherein the first preset value is larger than the second preset value.
11. The apparatus of claim 10, wherein the first processing unit comprises:
the matrix construction unit is used for constructing a perspective transformation matrix;
and the perspective transformation unit is used for carrying out perspective transformation processing on the image to be processed according to the perspective transformation matrix.
12. The apparatus for detecting text in an image according to claim 11, wherein the matrix construction unit comprises:
the straight line segment detection unit is used for detecting a straight line segment in the image to be processed;
the straight line segment selection unit is used for selecting a target straight line segment meeting the conditions from the detected straight line segments;
the quadrangle determining unit is used for determining a quadrangle with the largest area formed by straight lines where the target straight line segments are located;
a first generating unit, configured to generate a rectangular frame corresponding to the quadrangle;
and the construction unit is used for constructing the perspective transformation matrix according to the corresponding relation between each vertex of the quadrangle and each vertex of the rectangular frame.
13. The apparatus of claim 12, wherein the straight line segment detecting unit comprises:
the merging unit is used for determining the included angle between each pixel point in the image to be processed and a horizontal line, and merging pixel points whose included-angle differences are within a preset range, to obtain at least one region;
a second generating unit, configured to generate a minimum bounding rectangle for each of the regions;
the pixel point selecting unit is used for selecting, for each region, target pixel points of which the angle difference between the included angle and the main direction of the minimum circumscribed rectangle is smaller than or equal to a preset value;
and the straight line segment determining unit is used for determining whether each region is a straight line segment according to the number of pixel points in the minimum circumscribed rectangle of each region and the number of the target pixel points.
14. The apparatus of claim 12, wherein the straight line segment selection unit is configured to:
filter out, from the detected straight line segments, straight line segments whose length is smaller than or equal to a preset length, and/or straight line segments whose included angle with the vertical direction and/or the horizontal direction is larger than or equal to a preset angle, to obtain the target straight line segment.
15. The apparatus for detecting text in an image according to claim 12, wherein the first generating unit is configured to:
generate the rectangular frame by taking two non-adjacent vertices of the quadrangle as two non-adjacent vertices of the rectangular frame.
16. The apparatus for detecting text in an image according to claim 10, wherein the fusing unit is further configured to:
if one of any two text detection boxes in the plurality of text detection boxes contains the other, fuse the two text detection boxes.
17. The apparatus for detecting text in an image according to claim 16, wherein the fusion unit is configured to:
generate the minimum circumscribed rectangle of the two text detection boxes, and take the minimum circumscribed rectangle as the fusion result of the two text detection boxes.
18. The apparatus for detecting text in an image according to any one of claims 10 to 17, further comprising:
a text line acquisition unit configured to acquire a text line detected in the corrected image;
and the identification unit is used for identifying the text content in the text line.
19. A computer-readable medium, on which a computer program is stored which, when being executed by a processor, carries out the method of detecting text in an image according to any one of claims 1 to 9.
20. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to carry out a method of detecting text in an image as claimed in any one of claims 1 to 9.
CN201810213160.2A 2018-03-15 2018-03-15 Method, device, medium and electronic equipment for detecting text in image Active CN108446698B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810213160.2A CN108446698B (en) 2018-03-15 2018-03-15 Method, device, medium and electronic equipment for detecting text in image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810213160.2A CN108446698B (en) 2018-03-15 2018-03-15 Method, device, medium and electronic equipment for detecting text in image

Publications (2)

Publication Number Publication Date
CN108446698A CN108446698A (en) 2018-08-24
CN108446698B (en) 2020-08-21

Family

ID=63195391

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810213160.2A Active CN108446698B (en) 2018-03-15 2018-03-15 Method, device, medium and electronic equipment for detecting text in image

Country Status (1)

Country Link
CN (1) CN108446698B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022164031A1 (en) * 2021-01-28 2022-08-04 네이버 주식회사 Method and system for detecting character string by using high-dimensional polynomial regression

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109271981A (en) * 2018-09-05 2019-01-25 广东小天才科技有限公司 A kind of image processing method, device and terminal device
CN110097561B (en) * 2019-03-14 2022-07-15 长安大学 Rapid paper detection and segmentation method based on space constraint conditions
CN110032998B (en) * 2019-03-18 2021-03-23 华南师范大学 Method, system, device and storage medium for detecting characters of natural scene picture
CN110032969B (en) * 2019-04-11 2021-11-05 北京百度网讯科技有限公司 Method, apparatus, device, and medium for detecting text region in image
CN110097114B (en) * 2019-04-26 2021-06-29 新华三技术有限公司 Priori frame determination method and device applied to neural network
CN110288612B (en) * 2019-06-18 2020-08-25 上海眼控科技股份有限公司 Nameplate positioning and correcting method and device
CN110363196B (en) * 2019-06-20 2022-02-08 吴晓东 Method for accurately recognizing characters of inclined text
CN110263877B (en) * 2019-06-27 2022-07-08 中国科学技术大学 Scene character detection method
CN110414502B (en) * 2019-08-02 2022-04-01 泰康保险集团股份有限公司 Image processing method and device, electronic equipment and computer readable medium
CN110458164A (en) * 2019-08-07 2019-11-15 深圳市商汤科技有限公司 Image processing method, device, equipment and computer readable storage medium
CN111104941B (en) * 2019-11-14 2023-06-13 腾讯科技(深圳)有限公司 Image direction correction method and device and electronic equipment
CN110866871A (en) * 2019-11-15 2020-03-06 深圳市华云中盛科技股份有限公司 Text image correction method and device, computer equipment and storage medium
CN111062874B (en) * 2019-12-12 2023-03-31 腾讯科技(深圳)有限公司 Text image display method, device, equipment and storage medium
CN111027554B (en) * 2019-12-27 2023-05-23 创新奇智(重庆)科技有限公司 Commodity price tag text accurate detection positioning system and positioning method
CN111291753B (en) * 2020-01-22 2024-05-28 平安科技(深圳)有限公司 Text recognition method and device based on image and storage medium
CN113343970B (en) * 2021-06-24 2024-03-08 中国平安人寿保险股份有限公司 Text image detection method, device, equipment and storage medium
CN114092947B (en) * 2022-01-04 2022-05-20 湖南师范大学 Text detection method and device, electronic equipment and readable storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6873732B2 (en) * 2001-07-09 2005-03-29 Xerox Corporation Method and apparatus for resolving perspective distortion in a document image and for calculating line sums in images
CN102236789B (en) * 2010-04-26 2017-06-13 富士通株式会社 The method and device being corrected to tabular drawing picture
CN103279756B (en) * 2013-06-13 2016-06-22 苏州市公安局苏州工业园区分局 Vehicle detection based on integrated classifier analyzes system and determination method thereof
CN104239853B (en) * 2014-08-27 2018-04-27 北京捷通华声语音技术有限公司 A kind for the treatment of method and apparatus of image
CN104504387B (en) * 2014-12-16 2018-07-20 杭州华为数字技术有限公司 The bearing calibration of text image and device
CN106327532B (en) * 2016-08-31 2019-06-11 北京天睿空间科技股份有限公司 A kind of three-dimensional registration method of single image

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022164031A1 (en) * 2021-01-28 2022-08-04 네이버 주식회사 Method and system for detecting character string by using high-dimensional polynomial regression
KR102560051B1 (en) 2021-01-28 2023-07-27 네이버 주식회사 Method and system for detecting string using high order polynomial regression

Also Published As

Publication number Publication date
CN108446698A (en) 2018-08-24

Similar Documents

Publication Publication Date Title
CN108446698B (en) Method, device, medium and electronic equipment for detecting text in image
CN109255352B (en) Target detection method, device and system
CN108961303B (en) Image processing method and device, electronic equipment and computer readable medium
CN110348294B (en) Method and device for positioning chart in PDF document and computer equipment
JP2023541532A (en) Text detection model training method and apparatus, text detection method and apparatus, electronic equipment, storage medium, and computer program
CN114550177B (en) Image processing method, text recognition method and device
CN111008935B (en) Face image enhancement method, device, system and storage medium
CN114943936B (en) Target behavior recognition method and device, electronic equipment and storage medium
CN110619656B (en) Face detection tracking method and device based on binocular camera and electronic equipment
CN114429637B (en) Document classification method, device, equipment and storage medium
CN110796664A (en) Image processing method, image processing device, electronic equipment and computer readable storage medium
WO2022227218A1 (en) Drug name recognition method and apparatus, and computer device and storage medium
CN112163577A (en) Character recognition method and device in game picture, electronic equipment and storage medium
CN113537153A (en) Meter image identification method and device, electronic equipment and computer readable medium
CN110827301B (en) Method and apparatus for processing image
CN115311178A (en) Image splicing method, device, equipment and medium
WO2022095318A1 (en) Character detection method and apparatus, electronic device, storage medium, and program
CN113033346B (en) Text detection method and device and electronic equipment
WO2024041235A1 (en) Image processing method and apparatus, device, storage medium and program product
CN113762266B (en) Target detection method, device, electronic equipment and computer readable medium
CN114155545A (en) Form identification method and device, readable medium and electronic equipment
CN110633595B (en) Target detection method and device by utilizing bilinear interpolation
CN112989992B (en) Target detection method and device, road side equipment and cloud control platform
WO2023056833A1 (en) Background picture generation method and apparatus, image fusion method and apparatus, and electronic device and readable medium
US20230290019A1 (en) Perspective method for physical whiteboard and generation method for virtual whiteboard

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant