CN111582267A - Text detection method, computing device and readable storage medium - Google Patents

Text detection method, computing device and readable storage medium

Info

Publication number
CN111582267A
CN111582267A
Authority
CN
China
Prior art keywords
single character
detection
text
character detection
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010269719.0A
Other languages
Chinese (zh)
Other versions
CN111582267B (en)
Inventor
徐丞申
李林
叶明登
刘荣
黄萧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Pierbulaini Software Co ltd
Original Assignee
Beijing Pierbulaini Software Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Pierbulaini Software Co ltd filed Critical Beijing Pierbulaini Software Co ltd
Priority to CN202010269719.0A priority Critical patent/CN111582267B/en
Publication of CN111582267A publication Critical patent/CN111582267A/en
Application granted granted Critical
Publication of CN111582267B publication Critical patent/CN111582267B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a text detection method suitable for execution in a computing device, comprising the following steps: acquiring an image to be processed, wherein the image to be processed contains text information and the text information contains one or more lines of text; inputting the image to be processed into a first target detection model for detection, and acquiring a text image containing the text information; inputting the text image into a second target detection model for detection, and acquiring single character detection boxes containing single character information; and sorting all the single character information according to the coordinates of the single character detection boxes to obtain the complete text information. The invention also discloses a corresponding computing device and a readable storage medium.

Description

Text detection method, computing device and readable storage medium
Technical Field
The present invention relates to the field of image processing, and in particular, to a text detection method, a computing device, and a readable storage medium.
Background
With the development of deep learning for detection and recognition, a typical text detection approach combines a target detection algorithm (such as Faster R-CNN or YOLOv3) with sequence recognition by a Convolutional Recurrent Neural Network (CRNN): the text region is first located by the target detection algorithm, and sequence recognition is then performed on that region by the CRNN. Because the CRNN needs a large amount of data to be trained accurately, multi-stage detection algorithms have recently been proposed, in which a character region is detected by a detection algorithm in a first step and the single characters within it are detected and recognized by a detection algorithm in a second step.
The scheme currently adopted for multi-line text detection is such a multi-stage scheme: different lines within the same identification item of the text are labelled as different categories, single character recognition is then performed on each region, and finally the results are spliced together to recognize the complete text information.
In another method, the different lines of the same identification item are labelled within a single labelling box, the region is then labelled a second time with the different lines marked as different categories, each single line is detected and recognized, single character recognition is performed on each line, and finally the results are spliced together to recognize the complete text information.
Disclosure of Invention
To this end, the present invention provides a text detection method, a computing device and a readable storage medium in an attempt to solve or at least alleviate the problems presented above.
According to an aspect of the present invention, there is provided a text detection method adapted to be executed in a computing device, the method comprising the following steps: acquiring an image to be processed, wherein the image to be processed contains text information and the text information contains one or more lines of text; inputting the image to be processed into a first target detection model for detection, and acquiring a text image containing the text information; inputting the text image into a second target detection model for detection, and acquiring single character detection boxes containing single character information; and sorting all the single character information according to the coordinates of the single character detection boxes to obtain the complete text information.
Optionally, in the text detection method according to the present invention, inputting the image to be processed into the first target detection model for detection and acquiring the text image containing the text information includes: inputting the image to be processed into the first target detection model to detect a text region containing the text information; and cutting out the text region and outputting a text image containing the text information.
Optionally, in the text detection method according to the present invention, the second target detection model further outputs the single characters themselves.
Optionally, in the text detection method according to the present invention, before the step of sorting all the single character information according to the coordinates of the single character detection boxes to obtain the complete text information, the method further includes: performing character recognition on each single character detection box to obtain the single character contained in it.
Optionally, in the text detection method according to the present invention, sorting all the single character information according to the coordinates of the single character detection boxes and obtaining the complete text information includes: sorting all the single character detection boxes from left to right according to the abscissa of the top left vertex of each box; calculating the overlap ratio between single character detection boxes and grouping all the single character detection boxes according to the overlap ratio; acquiring the ordinate of the top left vertex of the first single character detection box in each group and sorting the groups from top to bottom according to the ordinate; and connecting the sorted groups end to end to obtain the final text information.
Optionally, in the text detection method according to the present invention, calculating the overlap ratio between single character detection boxes and grouping all the single character detection boxes according to the overlap ratio includes: acquiring a single character detection box to be processed; calculating the overlap ratio of the single character detection box to be processed and the last single character detection box in each existing group; placing single character detection boxes whose overlap ratio is greater than a predetermined value into the same group; and if the overlap ratio of the single character detection box to be processed and the last single character detection box in every existing group is not greater than the predetermined value, creating a new group.
Optionally, in the text detection method according to the present invention, calculating the overlap ratio of the single character detection box to be processed and the last single character detection box in an existing group includes: acquiring the coordinates of the upper left vertex and the lower right vertex of the single character detection box to be processed and of the last single character detection box in the existing group; and calculating the overlap ratio r according to the following formula:
r = (min(y12, y22) - max(y11, y21)) / min(y12 - y11, y22 - y21)
where the upper left vertex of the single character detection box to be processed is (x11, y11) and its lower right vertex is (x12, y12), the upper left vertex of the last single character detection box in the group is (x21, y21) and its lower right vertex is (x22, y22), and x11 < x12, y11 < y12.
Optionally, in the text detection method according to the present invention, the first target detection model and the second target detection model are the convolutional neural network Faster R-CNN.
Optionally, in the text detection method according to the present invention, the image to be processed is an identification card image, and the text information is address information.
According to another aspect of the present invention, there is provided a computing device comprising: at least one processor; and a memory storing program instructions, wherein the program instructions are configured to be executed by the at least one processor, the program instructions comprising instructions for performing the text detection method as above.
According to still another aspect of the present invention, there is provided a readable storage medium storing program instructions that, when read and executed by a computing device, cause the computing device to perform the text detection method as above.
According to the text detection method, single character detection boxes are obtained by performing target detection on a text image containing the text information; the single character detection boxes are sorted according to their coordinate information and spliced into the complete text. In this process the multiple lines of text in the image are treated as a whole: the text does not need to be cut into lines, and the lines do not need to be labelled as different categories. This saves a detection and recognition pass, reduces the time complexity of the computation and the memory footprint, and improves detection efficiency, while avoiding missed detections.
Drawings
To the accomplishment of the foregoing and related ends, certain illustrative aspects are described herein in connection with the following description and the annexed drawings, which are indicative of various ways in which the principles disclosed herein may be practiced, and all aspects and equivalents thereof are intended to be within the scope of the claimed subject matter. The above and other objects, features and advantages of the present disclosure will become more apparent from the following detailed description read in conjunction with the accompanying drawings. Throughout this disclosure, like reference numerals generally refer to like parts or elements.
FIG. 1 shows a block diagram of a computing device 100, according to one embodiment of the invention;
FIG. 2 illustrates a flow diagram of a text detection method 200 according to one embodiment of the invention;
FIG. 3 illustrates the text region detection result for identity card address detection according to one embodiment of the invention;
FIG. 4 illustrates the single character detection box result for identity card address detection according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
FIG. 1 shows a schematic diagram of a computing device 100, according to one embodiment of the invention. In a basic configuration 102, computing device 100 typically includes system memory 106 and one or more processors 104. A memory bus 108 may be used for communication between the processor 104 and the system memory 106.
Depending on the desired configuration, the processor 104 may be any type of processor, including but not limited to: a microprocessor (μP), a microcontroller (μC), a Digital Signal Processor (DSP), or any combination thereof. The processor 104 may include one or more levels of cache, such as a level one cache 110 and a level two cache 112, a processor core 114, and registers 116. The example processor core 114 may include an Arithmetic Logic Unit (ALU), a Floating Point Unit (FPU), a digital signal processing core (DSP core), or any combination thereof. The example memory controller 118 may be used with the processor 104, or in some implementations the memory controller 118 may be an internal part of the processor 104.
Depending on the desired configuration, system memory 106 may be any type of memory, including but not limited to: volatile memory (such as RAM), non-volatile memory (such as ROM, flash memory, etc.), or any combination thereof. System memory 106 may include an operating system 120, one or more applications 122, and program data 124. The application 122 is actually a plurality of program instructions that direct the processor 104 to perform corresponding operations. In some embodiments, the application 122 may be arranged to cause the processor 104 to operate with the program data 124 on an operating system.
Computing device 100 may also include an interface bus 140 that facilitates communication from various interface devices (e.g., output devices 142, peripheral interfaces 144, and communication devices 146) to the basic configuration 102 via the bus/interface controller 130. The example output device 142 includes a graphics processing unit 148 and an audio processing unit 150. They may be configured to facilitate communication with various external devices, such as a display or speakers, via one or more a/V ports 152. Example peripheral interfaces 144 may include a serial interface controller 154 and a parallel interface controller 156, which may be configured to facilitate communication with external devices such as input devices (e.g., keyboard, mouse, pen, voice input device, touch input device) or other peripherals (e.g., printer, scanner, etc.) via one or more I/O ports 158. An example communication device 146 may include a network controller 160, which may be arranged to facilitate communications with one or more other computing devices 162 over a network communication link via one or more communication ports 164.
A network communication link may be one example of a communication medium. Communication media may typically be embodied by computer readable instructions, data structures, or program modules in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media. A "modulated data signal" is a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of non-limiting example, communication media may include wired media such as a wired network or direct-wired connection, and various wireless media such as acoustic, Radio Frequency (RF), microwave, Infrared (IR), or other wireless media. The term computer readable media as used herein may include both storage media and communication media.
In the computing device 100 according to the present invention, the applications 122 include a text detection application comprising a plurality of program instructions that may instruct the processor 104 to execute the text detection method 200, so as to perform analysis processing on the program data 124 (for example, the images to be processed) and obtain the text information contained in them.
Computing device 100 may be implemented as part of a small-form-factor portable (or mobile) electronic device such as a cellular telephone, a Personal Digital Assistant (PDA), a personal media player device, a wireless web-watch device, a personal headset device, an application-specific device, or a hybrid device that includes any of the above functions. Computing device 100 may also be implemented as a personal computer including both desktop and notebook computer configurations. In some embodiments, the computing device 100 is configured to perform the text detection method of an embodiment of the present invention.
FIG. 2 illustrates a flow diagram of a text detection method 200 according to one embodiment of the invention, the method 200 being performed in a computing device.
The method 200 starts in step S210, where an image to be processed is obtained, where the image to be processed includes text information, and the text information includes one or more lines of text.
According to a specific embodiment of the present invention, the image to be processed is an image containing text information, such as an identity card image, a ticket image, or another document image; the method may also be applied to detection and recognition scenarios involving non-text objects arranged in rows or columns. For example, in an identity card image the text information to be detected is the address information.
Then, step S220 is performed: the image to be processed is input into the first target detection model for detection, and a text image containing the text information is acquired. The first target detection model may be the convolutional neural network Faster R-CNN, or another target detection model capable of detecting text regions in an image. Step S220 specifically includes the following two steps:
A. Inputting the image to be processed into the first target detection model to detect the text region containing the text information. According to an embodiment of the present invention, taking the detection of the address information on an identity card as an example, this step detects the text region containing the address information, as shown in fig. 3.
B. Cutting out the text region and outputting a text image containing the text information. The region containing the address information on the identity card is cut out and output as a text image.
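Steps A and B can be sketched as follows. This is a minimal illustration, not the patent's implementation: `detect_text_region` is a hypothetical stand-in for the trained first target detection model (its output box is hard-coded here), and the image is a plain nested list of pixels rather than a decoded photograph.

```python
def detect_text_region(image):
    """Hypothetical stand-in for the first target detection model.

    Returns the detected text region as (x1, y1, x2, y2), where (x1, y1)
    is the top left vertex and (x2, y2) the bottom right vertex. A trained
    model would predict this box; here it is hard-coded for illustration.
    """
    return (2, 1, 6, 3)


def crop_text_image(image, box):
    """Step B: cut the detected text region out of the image.

    `image` is a list of pixel rows; the crop keeps rows y1..y2-1 and
    columns x1..x2-1, mirroring half-open array slicing in common image
    libraries.
    """
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image[y1:y2]]


# 8x5 dummy image whose "pixels" record their own coordinates.
image = [[(x, y) for x in range(8)] for y in range(5)]
text_image = crop_text_image(image, detect_text_region(image))
```

The cropped `text_image` is what step S230 below receives as input.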
The first target detection model is a pre-trained target detection model. The training process comprises the following steps: acquiring a training sample set, wherein each training sample in the training sample set is a certificate image marked with a character area (text box); inputting the certificate images in the training sample set into a first target detection model to be trained, and outputting a predicted text box by the first target detection model; calculating a loss function according to the difference between the predicted text box and the labeled text box; adjusting model parameters of the first target detection model according to the loss function, for example, adjusting the model parameters by using a gradient descent method; and when the iteration times reach the preset times or the model converges, stopping training and outputting the trained first target detection model.
Then, step S230 is executed: the text image is input into the second target detection model for detection, and single character detection boxes containing the single character information are acquired. The second target detection model in this step may be the convolutional neural network Faster R-CNN, or another target detection model capable of detecting text regions in an image.
The second target detection model may also perform character recognition and output the single characters themselves, or it may output the character category of each single character detection box, i.e. classify the characters into categories such as Chinese characters, digits and letters. If the second target detection model outputs only character categories, the method further comprises a character recognition step, which may be implemented with a convolutional neural network. The single character detection result for the identity card address information according to an embodiment of the present invention is shown in fig. 4.
The second target detection model is a pre-trained target detection model. The training process comprises the following steps: acquiring a training sample set, wherein each training sample in the training sample set is a text image marked with a single character area (single character frame); inputting the text images in the training sample set into a second target detection model to be trained, and outputting predicted single character frames by the second target detection model; calculating a loss function according to the difference between the predicted single character frame and the marked single character frame; adjusting model parameters of the second target detection model according to the loss function, for example, adjusting the model parameters by using a gradient descent method; and when the iteration times reach the preset times or the model converges, stopping training and outputting the trained second target detection model.
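The loss-and-gradient-descent loop described above can be illustrated with a deliberately tiny toy: a single scalar parameter standing in for the model weights, fitted to labelled box-edge positions by gradient descent on a squared-error loss. This illustrates only the update rule ("calculate a loss from the difference between prediction and label, then adjust parameters"), not the patent's actual Faster R-CNN training code.

```python
def train_edge_predictor(labelled_edges, lr=0.1, iterations=100):
    """Fit a scalar parameter w so the predicted edge position w matches
    the labelled positions, minimising mean squared error by gradient
    descent (the same kind of update a real detector's optimizer makes).
    """
    w = 0.0  # initial model parameter
    for _ in range(iterations):
        # loss = mean((w - label)^2); d(loss)/dw = 2 * mean(w - label)
        grad = 2 * sum(w - e for e in labelled_edges) / len(labelled_edges)
        w -= lr * grad  # gradient descent step
    return w


# Labelled samples: ground-truth positions of one box edge.
w = train_edge_predictor([10.0, 10.5, 9.5])  # converges to the mean, 10.0
```

Training stops here after a fixed number of iterations, mirroring the "preset number of iterations or model convergence" criterion above.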
In addition, the text images in the training sample set may be labeled with single characters or character types corresponding to the single character frames, in addition to the single character frames, so that the trained second target detection model may output the single character frames + the single characters, or output the single character frames + the character types.
Subsequently, the process proceeds to step S240: all the single character detection boxes are sorted from left to right according to the abscissa of the top left vertex of each box, the abscissa increasing from left to right. According to an embodiment of the invention, for an identity card address '甘肃省兰州市城关区天水南路222号' (No. 222 Tianshui South Road, Chengguan District, Lanzhou City, Gansu Province) printed in two lines, the sorted result interleaves the characters of the two lines according to their horizontal positions.
Then, step S250 is performed to calculate the overlapping ratio between the single character detection boxes, and all the single character detection boxes are grouped according to the overlapping ratio. The method comprises the following specific steps:
Processing the sorted result of step S240 from left to right: acquiring the coordinates of the upper left vertex and the lower right vertex of the single character detection box to be processed and of the last single character detection box in each existing group; calculating the overlap ratio of the single character detection box to be processed and the last single character detection box in the existing group; placing single character detection boxes whose overlap ratio is greater than a predetermined value into the same group; and if the overlap ratio of the single character detection box to be processed and the last single character detection box in every existing group is not greater than the predetermined value, creating a new group.
According to one embodiment of the present invention, the overlap ratio r of two single character detection boxes is calculated as:
r = (min(y12, y22) - max(y11, y21)) / min(y12 - y11, y22 - y21)
where the upper left vertex of the single character detection box to be processed is (x11, y11) and its lower right vertex is (x12, y12), the upper left vertex of the last single character detection box in the group is (x21, y21) and its lower right vertex is (x22, y22), and x11 < x12, y11 < y12.
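One plausible reading of this overlap ratio, consistent with its use for grouping left-to-right sorted boxes into text lines, is the vertical overlap of the two boxes normalized by the shorter box height. A sketch under that assumption (boxes are (x1, y1, x2, y2) tuples with x1 < x2 and y1 < y2):

```python
def overlap_ratio(box_a, box_b):
    """Vertical overlap of two boxes divided by the shorter box height.

    Boxes on the same text line sit side by side, so their y-intervals
    overlap strongly; boxes on different lines yield a small or negative
    ratio.
    """
    _, ay1, _, ay2 = box_a
    _, by1, _, by2 = box_b
    intersection = min(ay2, by2) - max(ay1, by1)  # shared vertical extent
    shorter = min(ay2 - ay1, by2 - by1)           # height of the shorter box
    return intersection / shorter


# Two boxes on the same line, slightly offset vertically.
r = overlap_ratio((0, 0, 10, 10), (12, 1, 22, 11))
```

For boxes on different lines the y-intervals are disjoint and the ratio is negative, so it never exceeds the 0.7 threshold used below.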
According to an embodiment of the invention, the predetermined value of the overlap ratio is 0.7: when the overlap ratio of two single character detection boxes is greater than 0.7, they are placed in the same group. The sorted identity card address characters are grouped as follows.
Traversing each character of the sorted sequence from left to right:
1) '甘' (Gan) enters a newly created group 1; the result of group 1 is '甘';
2) the overlap ratio of '南' (Nan) with the last character of group 1 is found to be less than 0.7, so group 2 is created; '南' enters group 2, and the result of group 2 is '南';
3) the overlap ratio of '肃' (Su) with the last character '甘' of group 1 is found to be greater than 0.7, so '肃' enters group 1, and group 1 is updated to '甘肃';
4) the overlap ratio of '路' (Lu) with the last character '肃' of group 1 is found to be less than 0.7, but with the last character '南' of group 2 it is greater than 0.7, so group 2 is updated to '南路';
and so on, until all characters have been processed: the result of group 1 is '甘肃省兰州市城关区天水' and the result of group 2 is '南路222号'.
Then, step S260 is performed: the ordinate of the top left vertex of the first single character detection box in each group is acquired, the groups are sorted from top to bottom according to the ordinate, and the sorted groups are connected end to end to obtain the final text information.
Continuing with the identity card address example: after grouping, the result of group 1 is '甘肃省兰州市城关区天水' and the result of group 2 is '南路222号'. The ordinate of the top left vertex of '甘' in group 1 is compared with that of '南' in group 2 and the groups are sorted from top to bottom by ordinate, so group 1 precedes group 2; connecting the two groups end to end yields the final detection result '甘肃省兰州市城关区天水南路222号' (No. 222 Tianshui South Road, Chengguan District, Lanzhou City, Gansu Province).
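Step S260 can be sketched as follows, reusing the (char, x1, y1, x2, y2) box representation from the grouping step; the sample groups are illustrative values, listed deliberately out of order to show the sort.

```python
def join_groups(groups):
    """Sort groups by the ordinate (y1) of the top left vertex of their
    first box (top to bottom) and concatenate their characters."""
    # Smaller y1 of the first box means the line sits higher in the image.
    ordered = sorted(groups, key=lambda g: g[0][2])
    return ''.join(box[0] for group in ordered for box in group)


groups = [
    [('南', 2, 20, 12, 30), ('路', 16, 20, 26, 30)],  # lower line
    [('甘', 0, 0, 10, 10), ('肃', 14, 0, 24, 10)],    # upper line
]
text = join_groups(groups)
```

Even though the lower line appears first in `groups`, sorting by ordinate puts the upper line first, matching the end-to-end connection described above.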
According to the text detection method, single character detection boxes are obtained by performing target detection on a text image containing the text information; the single character detection boxes are sorted according to their coordinate information and spliced into the complete text. In this process the multiple lines of text in the image are treated as a whole: the text does not need to be cut into lines, and the lines do not need to be labelled as different categories. This saves a detection and recognition pass, reduces the time complexity of the computation and the memory footprint, and improves detection efficiency.
The various techniques described herein may be implemented in connection with hardware or software or, alternatively, with a combination of both. Thus, the methods and apparatus of the present invention, or certain aspects or portions thereof, may take the form of program code (i.e., instructions) embodied in tangible media, such as removable hard drives, USB flash drives, floppy disks, CD-ROMs, or any other machine-readable storage medium, wherein, when the program is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the invention.
In the case of program code execution on programmable computers, the computing device will generally include a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. Wherein the memory is configured to store program code; the processor is configured to execute the text detection method of the present invention according to instructions in the program code stored in the memory.
By way of example, and not limitation, readable media may comprise readable storage media and communication media. Readable storage media store information such as computer readable instructions, data structures, program modules or other data. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. Combinations of any of the above are also included within the scope of readable media.
In the description provided herein, algorithms and displays are not inherently related to any particular computer, virtual system, or other apparatus. Various general purpose systems may also be used with examples of this invention. The required structure for constructing such a system will be apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
It should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules, units, or components of the devices in the examples disclosed herein may be arranged in a device as described in the embodiment, or may alternatively be located in one or more devices different from those in the example; they may be combined into one module, unit, or component, or further divided into a plurality of sub-modules, sub-units, or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
A9, the method according to any one of A1-A8, wherein the image to be processed is an ID card image, and the text information is address information.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some but not other features included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
Furthermore, some of the embodiments are described herein as a method or combination of method elements that can be implemented by a processor of a computer system or by other means of carrying out the described functions. Thus, a processor having the necessary instructions for carrying out such a method or method element forms a means for carrying out the method or method element. Further, an element of an apparatus embodiment described herein is an example of a means for carrying out the function performed by the element for the purpose of carrying out the invention.
As used herein, unless otherwise specified, the use of the ordinal adjectives "first", "second", "third", etc., to describe a common object merely indicates that different instances of like objects are being referred to, and is not intended to imply that the objects so described must be in a given sequence, either temporally, spatially, in ranking, or in any other manner.
While the invention has been described with respect to a limited number of embodiments, those skilled in the art, having the benefit of this description, will appreciate that other embodiments can be devised which do not depart from the scope of the invention as described herein. Furthermore, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The present invention has been disclosed in an illustrative rather than a restrictive sense, and the scope of the present invention is defined by the appended claims.

Claims (10)

1. A text detection method adapted to be executed in a computing device, the method comprising the steps of:
acquiring an image to be processed, wherein the image to be processed contains text information, and the text information contains one or more lines of texts;
inputting the image to be processed into a first target detection model for detection, and acquiring a text image containing the text information;
inputting the text image into a second target detection model for detection, and acquiring a single character detection frame containing single character information;
and sorting all the single character information according to the coordinates of the single character detection boxes to obtain the complete text information.
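As a purely illustrative sketch of the claim 1 pipeline (not the claimed implementation), the two model calls below are hypothetical stubs standing in for the first and second target detection models; only the control flow is meaningful:

```python
def detect_text_region(image):
    # Hypothetical stub for the first target detection model:
    # returns the text image containing the text information.
    return image

def detect_characters(text_image):
    # Hypothetical stub for the second target detection model:
    # returns (box, character) pairs, box = (x1, y1, x2, y2).
    return [((12, 0, 22, 10), "B"), ((0, 0, 10, 10), "A")]

def detect_text(image):
    text_image = detect_text_region(image)
    detections = detect_characters(text_image)
    # For a single line of text, reading order is left to right by the
    # box's top-left x coordinate; the multi-line case is refined in
    # claims 5-7.
    detections.sort(key=lambda d: d[0][0])
    return "".join(ch for _, ch in detections)

print(detect_text(None))  # prints AB
```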
2. The method of claim 1, wherein the inputting the image to be processed into a first target detection model for detection and acquiring a text image containing the text information comprises:
inputting the image to be processed into the first target detection model to detect a text region containing the text information;
and cutting out the text region, and outputting a text image containing the text information.
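A minimal sketch of the cutting step in claim 2, assuming the image is a row-major array of pixel rows and the detected region is an axis-aligned box (x1, y1, x2, y2); the function name and data layout are illustrative, not from the patent:

```python
def cut_text_region(image, box):
    # Cut the detected text region out of the image to obtain the
    # text image; box = (x1, y1, x2, y2) in pixel coordinates.
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image[y1:y2]]

image = [[0] * 200 for _ in range(100)]     # a 100 x 200 image
text_image = cut_text_region(image, (10, 20, 110, 60))
print(len(text_image), len(text_image[0]))  # prints 40 100
```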
3. The method of claim 1 or 2, wherein the second target detection model further outputs the single characters.
4. The method of claim 1 or 2, wherein before the step of sorting all the single character information according to the coordinates of the single character detection boxes to obtain complete text information, the method further comprises:
and performing character recognition on the single character detection boxes to obtain the single characters contained in the single character detection boxes.
5. The method according to any one of claims 1-4, wherein the sorting all the single character information according to the coordinates of the single character detection boxes to obtain complete text information comprises:
sorting all the single character detection boxes by the horizontal coordinate of their top left vertex, in order from left to right;
calculating the overlapping ratio between the single character detection boxes, and grouping all the single character detection boxes according to the overlapping ratio;
acquiring the vertical coordinate of the top left vertex of the first single character detection box in each group, and sorting the groups by that vertical coordinate, in order from top to bottom;
and connecting the sorted groups end to end to obtain the final text information.
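The sorting and grouping of claims 5 and 6 can be sketched as follows. The vertical-overlap test and the 0.5 threshold are assumptions for illustration only — the claimed formula of claim 7 is published as an image in this document — so this is a plausible reading, not the patented computation:

```python
def vertical_overlap_ratio(a, b):
    # Assumed same-line test: vertical intersection of the two boxes
    # divided by the smaller box height. Not the claimed formula.
    inter = min(a[3], b[3]) - max(a[1], b[1])
    return inter / min(a[3] - a[1], b[3] - b[1])

def sort_and_group(boxes, threshold=0.5):
    # Boxes are (x1, y1, x2, y2). Sort left to right by top-left x,
    # then assign each box to the first group whose last box it
    # overlaps vertically; otherwise start a new group (a new line).
    groups = []
    for box in sorted(boxes, key=lambda b: b[0]):
        for group in groups:
            if vertical_overlap_ratio(box, group[-1]) > threshold:
                group.append(box)
                break
        else:
            groups.append([box])
    # Order the lines top to bottom by each group's first box, then
    # the caller connects the lines end to end.
    groups.sort(key=lambda g: g[0][1])
    return groups

# Two lines of two characters each, given in scrambled order.
lines = sort_and_group([(0, 20, 10, 30), (12, 1, 22, 11),
                        (0, 0, 10, 10), (12, 21, 22, 31)])
print(lines)
```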
6. The method of claim 5, wherein the calculating the overlapping ratio between single character detection boxes and grouping all the single character detection boxes according to the overlapping ratio comprises:
acquiring a single character detection box to be detected;
calculating the overlapping ratio of the single character detection box to be detected and the last single character detection box in each existing group;
dividing single character detection boxes whose overlapping ratio is larger than a preset value into the same group;
and if the overlapping ratio of the single character detection box to be detected with the last single character detection box in every existing group is not larger than the preset value, adding a new group.
7. The method of claim 6, wherein the calculating the overlapping ratio of the single character detection box to be detected and the last single character detection box in an existing group comprises:
acquiring the coordinates of the upper left vertex and the lower right vertex of the single character detection box to be detected and of the last single character detection box in the existing group;
and calculating the overlapping ratio r of the single character detection box to be detected and the last single character detection box in the existing group according to the following formula:
[the formula for r is reproduced only as image FDA0002442671450000021 in this publication]
wherein the coordinates of the upper left point of the single character detection box to be detected are (x1₁, y1₁), the coordinates of its lower right point are (x1₂, y1₂), the coordinates of the upper left point of the last single character detection box in the group are (x2₁, y2₁), the coordinates of its lower right point are (x2₂, y2₂), and x1₁ < x1₂, y1₁ < y1₂.
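Since the formula for r appears only as an image in this text, it cannot be quoted here. A common form for such a same-line overlap ratio — offered only as a plausible reconstruction in the claim's notation, not as the claimed formula — is the vertical intersection of the two boxes divided by the smaller box height:

```latex
r = \frac{\min(y1_2,\, y2_2) - \max(y1_1,\, y2_1)}{\min(y1_2 - y1_1,\; y2_2 - y2_1)}
```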
8. The method of any one of claims 1-7, wherein the first and second target detection models are convolutional neural networks, Fast R-CNN.
9. A computing device, comprising:
at least one processor; and
a memory storing program instructions, wherein the program instructions are configured to be executed by the at least one processor, the program instructions comprising instructions for performing the method of any of claims 1-8.
10. A readable storage medium storing program instructions that, when read and executed by a computing device, cause the computing device to perform the method of any of claims 1-8.
CN202010269719.0A 2020-04-08 2020-04-08 Text detection method, computing device and readable storage medium Active CN111582267B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010269719.0A CN111582267B (en) 2020-04-08 2020-04-08 Text detection method, computing device and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010269719.0A CN111582267B (en) 2020-04-08 2020-04-08 Text detection method, computing device and readable storage medium

Publications (2)

Publication Number Publication Date
CN111582267A true CN111582267A (en) 2020-08-25
CN111582267B CN111582267B (en) 2023-06-02

Family

ID=72124309

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010269719.0A Active CN111582267B (en) 2020-04-08 2020-04-08 Text detection method, computing device and readable storage medium

Country Status (1)

Country Link
CN (1) CN111582267B (en)


Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108304835A (en) * 2018-01-30 2018-07-20 百度在线网络技术(北京)有限公司 character detecting method and device
CN108304814A (en) * 2018-02-08 2018-07-20 海南云江科技有限公司 A kind of construction method and computing device of literal type detection model
CN109829453A (en) * 2018-12-29 2019-05-31 天津车之家数据信息技术有限公司 It is a kind of to block the recognition methods of text in card, device and calculate equipment
CN110046616A (en) * 2019-03-04 2019-07-23 北京奇艺世纪科技有限公司 Image processing model generation, image processing method, device, terminal device and storage medium
US10445569B1 (en) * 2016-08-30 2019-10-15 A9.Com, Inc. Combination of heterogeneous recognizer for image-based character recognition
CN110717366A (en) * 2018-07-13 2020-01-21 杭州海康威视数字技术股份有限公司 Text information identification method, device, equipment and storage medium


Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112733639A (en) * 2020-12-28 2021-04-30 贝壳技术有限公司 Text information structured extraction method and device
CN112926420A (en) * 2021-02-09 2021-06-08 海信视像科技股份有限公司 Display device and menu character recognition method
CN112926420B (en) * 2021-02-09 2022-11-08 海信视像科技股份有限公司 Display device and menu character recognition method
CN113239227A (en) * 2021-06-02 2021-08-10 泰康保险集团股份有限公司 Image data structuring method and device, electronic equipment and computer readable medium
CN113239227B (en) * 2021-06-02 2023-11-17 泰康保险集团股份有限公司 Image data structuring method, device, electronic equipment and computer readable medium
CN114356189A (en) * 2021-12-16 2022-04-15 苏州镁伽科技有限公司 Editing method and device for panel image to be detected, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN111582267B (en) 2023-06-02

Similar Documents

Publication Publication Date Title
CN109829453B (en) Method and device for recognizing characters in card and computing equipment
CN111582267B (en) Text detection method, computing device and readable storage medium
CN107977707B (en) Method and computing equipment for resisting distillation neural network model
CN107798321B (en) Test paper analysis method and computing device
CN108898142B (en) Recognition method of handwritten formula and computing device
CN108304814B (en) Method for constructing character type detection model and computing equipment
CN110390269A (en) PDF document table extracting method, device, equipment and computer readable storage medium
CN108027876B (en) System for recognizing multiple object inputs, method and product thereof
CN109978063B (en) Method for generating alignment model of target object
CN107832794B (en) Convolutional neural network generation method, vehicle system identification method and computing device
CN109902716B (en) Training method for alignment classification model and image classification method
CN108416345B (en) Answer sheet area identification method and computing device
CN107784321B (en) Method and system for quickly identifying digital picture books and computer readable storage medium
CN110427946B (en) Document image binarization method and device and computing equipment
KR20130058053A (en) Method and apparatus for segmenting strokes of overlapped handwriting into one or more groups
CN107886082B (en) Method and device for detecting mathematical formulas in images, computer equipment and storage medium
CN112990887B (en) Resume and post matching method and computing device
CN111651990A (en) Entity identification method, computing equipment and readable storage medium
CN111488732B (en) Method, system and related equipment for detecting deformed keywords
CN112926565B (en) Picture text recognition method, system, equipment and storage medium
CN112183296A (en) Simulated bill image generation and bill image recognition method and device
CN111639566A (en) Method and device for extracting form information
CN110766068B (en) Verification code identification method and computing equipment
CN116597466A (en) Engineering drawing text detection and recognition method and system based on improved YOLOv5s
CN112949649B (en) Text image identification method and device and computing equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant