CN117037175A - Text detection method, device, storage medium, electronic equipment and product - Google Patents

Text detection method, device, storage medium, electronic equipment and product

Info

Publication number
CN117037175A
CN117037175A (Application CN202211190875.3A)
Authority
CN
China
Prior art keywords
text
reading direction
word
line
region
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211190875.3A
Other languages
Chinese (zh)
Inventor
包志敏
王斌
曹浩宇
姜德强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202211190875.3A priority Critical patent/CN117037175A/en
Publication of CN117037175A publication Critical patent/CN117037175A/en

Classifications

    • G06V30/18 Extraction of features or characteristics of the image
    • G06V30/147 Determination of region of interest
    • G06V30/148 Segmentation of character regions
    • G06V30/19093 Proximity measures, i.e. similarity or distance measures
    • G06V30/19173 Classification techniques
    • G06V30/1918 Fusion techniques, i.e. combining data from various sources, e.g. sensor fusion
    • G06N3/08 Learning methods (computing arrangements based on neural-network models)

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Character Input (AREA)

Abstract

The application discloses a text detection method, a device, a storage medium, electronic equipment and a product, which relate to the technical field of artificial intelligence and can be applied to the fields of blockchain, cloud technology, maps, the Internet of Vehicles and the like, wherein the method comprises the following steps: extracting a feature map corresponding to an image to be detected, wherein the image to be detected comprises a text; performing text detection processing on the feature map to obtain a text region corresponding to the text; performing reading direction prediction processing on the feature map to obtain character reading direction vectors corresponding to pixels in the text region; calculating based on the character reading direction vectors corresponding to pixels in the text region to obtain single word reading direction vectors corresponding to the text; and determining the reading direction of the text according to the single word reading direction vector, wherein the reading direction is used for cutting a line image corresponding to the text to detect the text. The text detection method and device can effectively improve text detection accuracy.

Description

Text detection method, device, storage medium, electronic equipment and product
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a text detection method, a text detection device, a storage medium, electronic equipment and a product.
Background
Text detection tasks typically require text to be detected from images; for example, in some scenarios the text detected from ticket images needs to be recorded.
Currently, a text detection process generally includes text line detection, cropping of line images from the detected text lines, and text recognition on the line images. In current schemes, when the line image corresponding to a text line is cut from the image to be detected, the orientation of the text is difficult to judge accurately, so the cut line image is prone to errors such as flipped text, which leads to erroneous text recognition results; this problem is particularly pronounced in complex scenes.
Therefore, current text detection suffers from poor detection accuracy.
Disclosure of Invention
The embodiment of the application provides a text detection method and a related device, which can effectively improve the accuracy of text detection.
In order to solve the technical problems, the embodiment of the application provides the following technical scheme:
according to one embodiment of the present application, a text detection method includes: extracting a feature map corresponding to an image to be detected, wherein the image to be detected comprises a text; performing text detection processing on the feature map to obtain a text region corresponding to the text; carrying out reading direction prediction processing on the feature map to obtain character reading direction vectors corresponding to pixels in the text region; calculating based on character reading direction vectors corresponding to pixels in the text region to obtain single character reading direction vectors corresponding to the text; and determining the reading direction of the text according to the single word reading direction vector, wherein the reading direction is used for cutting a line image corresponding to the text to detect the text.
According to one embodiment of the present application, a text detection apparatus includes: the extraction module is used for extracting a feature map corresponding to an image to be detected, wherein the image to be detected comprises a text; the detection module is used for carrying out text detection processing on the feature map to obtain a text region corresponding to the text; the prediction module is used for carrying out reading direction prediction processing on the feature map to obtain a character reading direction vector corresponding to the pixels in the text region; the calculation module is used for calculating based on the character reading direction vector corresponding to the pixels in the text region to obtain a single character reading direction vector corresponding to the text; and the cutting module is used for determining the reading direction of the text according to the single word reading direction vector, and the reading direction is used for cutting the line image corresponding to the text to detect the text.
In some embodiments of the application, the detection module comprises a first detection unit for: performing text detection processing based on the feature map to obtain a text line area in which each text line is located, wherein the text lines are lines formed by a row of characters in the text; and obtaining the text region according to the text line region where each text line is located in the feature map, wherein the text region comprises the text line region where each text line is located.
In some embodiments of the application, the computing module includes a first computing unit to: averaging the character reading direction vectors corresponding to the pixels in the text line area where each text line is located to obtain an average vector corresponding to each text line; and obtaining a single word reading direction vector corresponding to each text line according to the average vector corresponding to each text line.
In some embodiments of the application, the detection module comprises a second detection unit for: performing text detection processing based on the feature map to obtain a text region in which each word in the text is located; and obtaining the text region according to the text region where each word is located in the text, wherein the text region comprises the text region where each word is located.
In some embodiments of the application, the computing module includes a second computing unit for: averaging the character reading direction vectors corresponding to the pixels in the text region where each word is located to obtain an average vector corresponding to each word; and obtaining a single word reading direction vector corresponding to each word according to the average vector corresponding to each word.
In some embodiments of the present application, each text line in the text corresponds to one of the word reading direction vectors; the cutting module comprises a first cutting unit and is used for: for each text line, calculating cosine similarity between a single word reading direction vector corresponding to the text line and each line boundary vector, wherein each line boundary vector is a vector formed by four clockwise edges of a text line region where the text line is located; and obtaining the reading direction of the text line according to the cosine similarity corresponding to each line boundary vector, wherein the reading direction comprises the reading sequence numbers of the four vertexes of the text line area where the text line is located.
In some embodiments of the application, each word in the text corresponds to one of the single word reading direction vectors; the cutting module comprises a second cutting unit and is used for: for each word, calculating cosine similarity between a single word reading direction vector corresponding to the word and each word boundary vector, wherein each word boundary vector is a vector formed by four clockwise sides of a word area in which the word is located; and obtaining the reading direction of the word according to the cosine similarity corresponding to each word boundary vector, wherein the reading direction comprises the reading sequence numbers of the four vertexes of the word area where the word is located.
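As an illustrative sketch (plain NumPy; the function names and the clockwise-vertex convention are assumptions, not the patent's implementation), the boundary-vector matching above amounts to picking, among the four clockwise edge vectors of a region, the edge most similar in cosine similarity to the single word reading direction vector; the index of that edge fixes the reading sequence numbers of the four vertices:

```python
import numpy as np

def cosine(a, b):
    # cosine similarity between two 2-D vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def reading_order(box, direction):
    """box: four vertices in clockwise order; direction: single word reading
    direction vector. Returns the vertex indices reordered so that the first
    edge runs along the reading direction (i.e. the reading sequence)."""
    box = np.asarray(box, dtype=float)
    edges = [box[(i + 1) % 4] - box[i] for i in range(4)]  # four clockwise edge vectors
    d = np.asarray(direction, dtype=float)
    start = int(np.argmax([cosine(e, d) for e in edges]))
    return [(start + i) % 4 for i in range(4)]
```

For an axis-aligned box whose vertices are listed clockwise in image coordinates, a rightward direction vector keeps the original vertex order, while a leftward vector (upside-down text) starts the reading sequence at the opposite corner.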
In some embodiments of the present application, the reading direction includes reading sequence numbers of the four vertices of the text region; the apparatus further comprises a classification module for: cutting out the line image corresponding to the text according to the reading sequence numbers; determining the arrangement type corresponding to the line image according to the edge between the two vertices corresponding to the first two reading sequence numbers; and sending the line image to a text recognizer of the corresponding arrangement type to perform text recognition, so as to obtain the recognized text content.
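The arrangement-type decision can be sketched in the same illustrative style (assumed names; `order` is the reading sequence produced by boundary-vector matching): the edge between the two vertices holding the first two reading sequence numbers is roughly horizontal for horizontally arranged lines and roughly vertical for vertically arranged ones:

```python
import numpy as np

def arrangement_type(box, order):
    """Classify a line as 'horizontal' or 'vertical' from the edge between
    the two vertices carrying the first two reading sequence numbers."""
    box = np.asarray(box, dtype=float)
    edge = box[order[1]] - box[order[0]]
    return "horizontal" if abs(edge[0]) >= abs(edge[1]) else "vertical"
```

The returned label would then route the cropped line image to the recognizer trained for that arrangement type.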
According to another embodiment of the application, a computer readable storage medium has stored thereon a computer program which, when executed by a processor of a computer, causes the computer to perform the method according to the embodiment of the application.
According to another embodiment of the present application, an electronic device includes: a memory storing a computer program; and the processor reads the computer program stored in the memory to execute the method according to the embodiment of the application.
According to another embodiment of the application, a computer program product or computer program includes computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium and executes the computer instructions to cause the computer device to perform the methods provided in the various alternative implementations described in the embodiments of the present application.
In the embodiment of the application, a feature map corresponding to an image to be detected is extracted, wherein the image to be detected comprises texts; performing text detection processing on the feature map to obtain a text region corresponding to the text; carrying out reading direction prediction processing on the feature map to obtain character reading direction vectors corresponding to pixels in the text region; calculating based on character reading direction vectors corresponding to pixels in the text region to obtain single character reading direction vectors corresponding to the text; and determining the reading direction of the text according to the single word reading direction vector, wherein the reading direction is used for cutting a line image corresponding to the text to detect the text.
In this way, text detection and reading direction prediction are respectively carried out on the feature map through two branches: the reading direction prediction yields character reading direction vectors corresponding to the pixels in the image to be detected, calculation is carried out according to the character reading direction vectors corresponding to the pixels in the text region to obtain single word reading direction vectors corresponding to the text in the image to be detected, the reading direction of the text can then be accurately determined according to the single word reading direction vectors, and the text can be accurately cut according to its reading direction so as to obtain line images free of errors such as text flipping; text recognition based on these line images can then be carried out accurately, effectively improving the text detection accuracy.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 shows a schematic diagram of a system to which embodiments of the application may be applied.
Fig. 2 shows a flow chart of a text detection method according to an embodiment of the application.
Fig. 3 shows a detection result image of a text line according to an embodiment of the present application.
Fig. 4 shows a schematic diagram of a character reading direction vector according to one embodiment of the application.
Fig. 5 shows a schematic diagram of a character reading direction vector according to another embodiment of the application.
Fig. 6 shows a schematic diagram of text line clipping according to another embodiment of the application.
FIG. 7 illustrates a framework diagram of text detection in a scenario according to an embodiment of the present application.
Fig. 8 shows a schematic diagram of a text detection result according to the present application in a scenario.
Fig. 9 shows a schematic diagram of a text detection result according to the present application in another scenario.
Fig. 10 shows a block diagram of a text detection device according to an embodiment of the application.
FIG. 11 shows a block diagram of an electronic device according to one embodiment of the application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to fall within the scope of the application.
It will be appreciated that in the specific embodiment of the present application, related data such as images to be detected are involved, when the embodiments of the present application are applied to specific products or technologies, user permission or consent is required, and the collection, use and processing of related data is required to comply with related laws and regulations and standards of related countries and regions.
Fig. 1 shows a schematic diagram of a system 100 in which embodiments of the application may be applied. As shown in fig. 1, the system 100 may include a server 101 and a terminal 102.
The server 101 may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, big data and artificial intelligence platforms. In one implementation of this example, the server 101 is a cloud server, and the server 101 may provide artificial intelligence cloud services, such as a cloud service for text detection.
The terminal 102 may be any device, including but not limited to a mobile phone, a computer, a smart voice interaction device, a smart home appliance, a vehicle-mounted terminal, a VR/AR device, a smart watch, and the like. In one embodiment, the server 101 or the terminal 102 may be a node device in a blockchain network or an Internet of Vehicles map platform.
In one implementation of this example, the server 101 or the terminal 102 may: extracting a feature map corresponding to an image to be detected, wherein the image to be detected comprises a text; performing text detection processing on the feature map to obtain a text region corresponding to the text; carrying out reading direction prediction processing on the feature map to obtain character reading direction vectors corresponding to pixels in the text region; calculating based on character reading direction vectors corresponding to pixels in the text region to obtain single character reading direction vectors corresponding to the text; and determining the reading direction of the text according to the single word reading direction vector, wherein the reading direction is used for cutting a line image corresponding to the text to detect the text.
Fig. 2 schematically shows a flow chart of a text detection method according to an embodiment of the application. The execution subject of the text detection method may be any device, such as the server 101 or the terminal 102 shown in fig. 1.
As shown in fig. 2, the text detection method may include steps S210 to S250.
Step S210, extracting a feature map corresponding to an image to be detected, wherein the image to be detected comprises a text; step S220, performing text detection processing on the feature map to obtain a text region corresponding to the text; step S230, carrying out reading direction prediction processing on the feature map to obtain character reading direction vectors corresponding to pixels in the text region; step S240, calculating based on character reading direction vectors corresponding to pixels in the text region to obtain single word reading direction vectors corresponding to the text; step S250, determining the reading direction of the text according to the single-word reading direction vector, wherein the reading direction is used for cutting the line image corresponding to the text to detect the text.
The image to be detected is an image that includes text, such as Chinese or English, and the text in the image to be detected may include at least one word, such as a Chinese character or an English word. A feature extraction network can be used to extract features from the image to be detected to obtain a corresponding feature map.
The feature map can be subjected to text detection processing through a text detection network (such as a pre-trained fully convolutional neural network) to obtain the text region corresponding to the text; the feature map can likewise be subjected to reading direction prediction processing through a reading direction prediction network (such as a pre-trained fully convolutional neural network) to obtain a character reading direction vector corresponding to each pixel in the text region of the image to be detected, where the character reading direction vector reflects the reading direction of the text at the pixel level.
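As a toy stand-in for the two branches (per-pixel linear heads playing the role of the output convolutions of trained fully convolutional networks; the weights here are random and purely illustrative), both heads read the same shared feature map: one emits a per-pixel text probability, the other a per-pixel unit reading direction vector:

```python
import numpy as np

rng = np.random.default_rng(0)

def text_head(feat, w):
    # per-pixel linear projection + sigmoid -> text probability map
    score = feat @ w                         # (H, W, C) @ (C,) -> (H, W)
    return 1.0 / (1.0 + np.exp(-score))

def direction_head(feat, w):
    # per-pixel linear projection -> 2 channels, L2-normalised direction
    vec = feat @ w                           # (H, W, C) @ (C, 2) -> (H, W, 2)
    norm = np.linalg.norm(vec, axis=-1, keepdims=True) + 1e-8
    return vec / norm

feat = rng.standard_normal((4, 6, 8))        # toy shared feature map, H=4, W=6, C=8
prob = text_head(feat, rng.standard_normal(8))
vecs = direction_head(feat, rng.standard_normal((8, 2)))
```

A trained segmentation head would threshold `prob` to get the text region; `vecs` supplies the per-pixel character reading direction vectors used in the later averaging step.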
And calculating based on the character reading direction vector corresponding to the pixel in the text region corresponding to the text, so that a single word reading direction vector corresponding to the text can be obtained, and the single word reading direction vector can reflect the reading direction vector of the text from the text dimension.
The reading direction of the text can be determined according to the single word reading direction vector corresponding to the text; operations such as flipping and transformation can then be applied accurately during image cutting according to the reading direction of the text, so that a line image free of errors such as text flipping is cut out for each line in the text, and text detection on the text in the line image can then be performed accurately.
In this way, based on steps S210 to S250, text detection and reading direction prediction are performed on the feature map through two branches: the reading direction prediction yields character reading direction vectors corresponding to the pixels in the image to be detected, calculation according to the character reading direction vectors corresponding to the pixels in the text region yields single word reading direction vectors corresponding to the text, the reading direction of the text can be accurately determined according to the single word reading direction vectors, and the text can be accurately cut according to its reading direction to obtain line images free of errors such as text flipping, so that subsequent text recognition based on the line images can be performed accurately, effectively improving the text detection accuracy.
Further embodiments of the steps of the text detection method described above are set out below.
In one embodiment, step S210, extracting a feature map corresponding to an image to be detected, where the image to be detected includes text, includes: sequentially extracting features from the image to be detected through a feature extraction network of at least two stages (such as a ResNet-50 backbone network) to obtain stage feature maps of at least two stages; and fusing the stage feature maps of the different stages to obtain the feature map corresponding to the image to be detected. The stage feature maps of different stages may differ in size, and can be fused by element-wise addition or by concatenation.
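A toy NumPy sketch of the stage-wise extraction and fusion (2x average pooling stands in for the downsampling backbone stages, nearest-neighbour upsampling plus summation stands in for the fusion; a real ResNet-50 pipeline would of course use learned convolutions, and all names here are illustrative):

```python
import numpy as np

def pool2x(x):
    # 2x2 average pooling: stands in for one downsampling backbone stage
    h, w = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2
    x = x[:h, :w]
    return 0.25 * (x[0::2, 0::2] + x[1::2, 0::2] + x[0::2, 1::2] + x[1::2, 1::2])

def upsample(x, k):
    # nearest-neighbour upsampling by factor k
    return np.kron(x, np.ones((k, k)))

img = np.arange(64, dtype=float).reshape(8, 8)  # toy single-channel "image"
s1 = img                                        # stage 1 feature map (stride 1)
s2 = pool2x(s1)                                 # stage 2 feature map (stride 2)
s3 = pool2x(s2)                                 # stage 3 feature map (stride 4)
fused = s1 + upsample(s2, 2) + upsample(s3, 4)  # fuse stages by upsample-and-add
```

The key point the sketch illustrates is that the stage maps have different sizes, so they must be brought to a common resolution before addition (or concatenation along the channel axis).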
In one embodiment, step S210 extracts a feature map corresponding to an image to be detected, where the image to be detected includes text, and includes: and carrying out feature extraction on the image to be detected through a single-layer feature extraction network to obtain an extracted feature map, wherein the extracted feature map is directly used as a feature map corresponding to the image to be detected.
In one embodiment, step S220 of performing text detection processing on the feature map to obtain a text region corresponding to the text, includes: performing text detection processing based on the feature map to obtain a text line area in which each text line is located, wherein the text lines are lines formed by one row of characters in the text; and obtaining a text region according to the text line region where each text line is located in the feature map, wherein the text region comprises the text line region where each text line is located.
In this embodiment, text detection processing is performed in units of text lines; performing text detection processing on the feature map yields the text line region where each text line is located, where the text line region may be the rectangular region corresponding to the circumscribed rectangle of the text line. Referring to fig. 3, in one example, text detection processing on the feature map produces the detection result image shown in fig. 3, where each white rectangular area in the detection result image is a text line region where a text line is located.
Further, the text region corresponding to the text in the image to be detected may include the text line region where each text line is located. Each text line may include at least one word; for example, the text line "temptation of food" contains five words, and "temptation of food" as a whole corresponds to one text line region.
In one embodiment, step S220, performing text detection processing on the feature map to obtain a text region where the text is located, includes: performing text detection processing based on the feature map to obtain a text region in which each word in the text is located; and obtaining a text region according to the text region where each word is located in the text, wherein the text region comprises the text region where each word is located.
In this embodiment, text detection processing is performed in units of single words; performing text detection processing on the feature map yields the text region where each word in the text is located, which may be the rectangular region corresponding to the circumscribed rectangle of the word. In one example, text detection processing on the feature map produces a detection result image in which each white rectangular area is a text region where a word is located.
Furthermore, the text region corresponding to the text in the image to be detected may include the text region where each word is located. For example, the text line "temptation of food" contains five words in total, and a corresponding text region can be detected for each word.
In one embodiment, step S240, calculating based on the character reading direction vector corresponding to the pixel in the text region, obtains a word reading direction vector corresponding to the text, includes: averaging the character reading direction vectors corresponding to the pixels in the text line area where each text line is located to obtain an average vector corresponding to each text line; and obtaining the word reading direction vector corresponding to each text line according to the average vector corresponding to each text line.
Referring to fig. 4 and fig. 5, for example, for each pixel in the text line regions where the text lines "temptation of food" and "food still Tastyle" are located, a corresponding character reading direction vector can be predicted; for example, each small right-pointing arrow in the text line region of "temptation of food" in the right image of fig. 5 indicates a character reading direction vector (x=1, y=0).
Further, small arrows in different directions, such as those over the differently oriented words in fig. 4, represent different character reading direction vectors. That is, the character reading direction vectors of the pixels within one text line region may be identical or may differ; for example, in some scenes one portion of the pixels in the text line region where "temptation of food" is located may be detected as (x=1, y=0) while another portion is detected as something else (e.g., (x=0, y=1)).
Averaging the character reading direction vectors corresponding to the pixels in the text line region where each text line is located yields the average vector corresponding to that text line. The single word reading direction vector corresponding to each text line is then obtained from its average vector: it may be the average vector itself, or the average vector after rounding. As shown in the middle image of fig. 5, the single word reading direction vector corresponding to the text line "food still Tastyle" is indicated by the right-pointing arrow over it, and likewise for the text line "temptation of food".
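A minimal sketch of this averaging-and-rounding step (NumPy; the array layout, mask convention and names are assumptions): the per-pixel vectors inside a text line mask are averaged, and the average is optionally rounded to the nearest integer vector:

```python
import numpy as np

def line_direction(per_pixel_vecs, mask, round_vec=True):
    """per_pixel_vecs: (H, W, 2) predicted character reading direction vectors;
    mask: (H, W) boolean text line region. Returns the line's single word
    reading direction vector (the average, optionally rounded)."""
    v = per_pixel_vecs[mask].mean(axis=0)
    return np.rint(v) if round_vec else v
```

With seven pixels predicted as (1, 0) and one outlier (0, 1), the average (0.875, 0.125) rounds back to (1, 0), so a few mispredicted pixels do not change the line's direction.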
In other embodiments, step S240, calculating based on the character reading direction vectors corresponding to the pixels in the text region to obtain the single word reading direction vector corresponding to the text, includes: finding the character reading direction vector that occurs most frequently among the character reading direction vectors corresponding to the pixels in the text line region where each text line is located, and taking it as the single word reading direction vector corresponding to that text line; for example, if (x=1, y=0) occurs most frequently, (x=1, y=0) is taken as the single word reading direction vector corresponding to the text line.
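The majority-vote variant can be sketched in a few lines (illustrative; assumes the per-pixel vectors have already been quantised to integer tuples):

```python
from collections import Counter

def majority_direction(vectors):
    """Return the most frequent per-pixel character reading direction vector."""
    return Counter(map(tuple, vectors)).most_common(1)[0][0]
```

Compared with averaging, a majority vote is insensitive to a minority of wildly wrong per-pixel predictions but discards the sub-vote information that averaging retains.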
In one embodiment, step S240, calculating based on the character reading direction vectors corresponding to the pixels in the text region to obtain the single word reading direction vector corresponding to the text, includes: averaging the character reading direction vectors corresponding to the pixels in the text region where each word is located to obtain an average vector corresponding to each word; and obtaining the single word reading direction vector corresponding to each word according to the average vector corresponding to each word.
Referring to fig. 4, a corresponding character reading direction vector may be predicted for each pixel in the text region where the word "Tastyle" is located; in fig. 4, the character reading direction vector of each pixel in that text region is (x=1, y=0). It may be understood that in some situations the detected character reading direction vectors corresponding to some pixels may take other values (e.g., (x=0, y=1)). The character reading direction vectors corresponding to the pixels in the text region where each word is located are then averaged to obtain an average vector corresponding to each word.
The single word reading direction vector corresponding to each word is obtained according to the average vector corresponding to that word: it may be the average vector itself, or the average vector after rounding. For example, a corresponding single word reading direction vector may be calculated for the word "Tastyle", and likewise for each other word.
In this way, even when a text line forms an irregular shape such as a wave, the reading direction of each word can still be accurately determined according to the single word reading direction vector corresponding to that word, further improving the accuracy of text detection.
In one embodiment, each text line in the text corresponds to one single word reading direction vector; step S240, determining the reading direction of the text according to the single word reading direction vector, includes:
for each text line, calculating the cosine similarity between the single word reading direction vector corresponding to the text line and each line boundary vector, where each line boundary vector is a vector formed by one of the four clockwise sides of the text line region where the text line is located; and obtaining the reading direction of the text line according to the cosine similarity corresponding to each line boundary vector, the reading direction including the reading sequence numbers of the four vertices of the text line region where the text line is located.
Referring to fig. 5 and fig. 6, for the four sides of the text line region where the text line "fashion tab" is located, an initial vector corresponding to each side may be calculated in clockwise order (i.e., for each side taken clockwise, the coordinates of its starting vertex are subtracted from those of its ending vertex to obtain the initial vector), and the initial vectors of the four sides are normalized to form the line boundary vector of each side.
The single word reading direction vector corresponding to the text line is also a unit vector, so the cosine similarity between it and the line boundary vector of each side can be calculated, yielding the cosine similarities corresponding to the line boundary vectors of the 4 sides.
According to the cosine similarity corresponding to each line boundary vector, the reading direction of the text line can be obtained, including the reading sequence numbers of the four vertices of the text line region where the text line is located; these reading sequence numbers can be assigned in descending order of the cosine similarities corresponding to the line boundary vectors of the 4 sides. For example, as shown in the left image of fig. 5, the reading sequence numbers of the four vertices of the text line region where the text line "fashion style" is located are 0, 1, 2 and 3 in clockwise order from the vertex in the upper left corner to the vertex in the lower left corner. As shown in the 3rd image of fig. 6, the reading sequence numbers of the four vertices of the text line region where the text line "fashion style" is located are 0, 1, 2 and 3 in clockwise order from the vertex at the lower right corner to the vertex at the upper right corner.
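One plausible way to realize the vertex numbering above is to relabel the clockwise vertices starting from the edge whose boundary vector is most similar to the reading direction (a sketch under that assumption; the embodiment only states that the numbers follow the descending cosine similarities):

```python
import numpy as np

def reading_order(quad, direction):
    """quad: the 4 vertices of a text-line region in clockwise order.
    direction: the line's single word reading direction (unit vector).
    Returns the vertices relabeled so that reading sequence numbers 0..3
    run clockwise starting from the edge best aligned with direction."""
    quad = np.asarray(quad, dtype=float)
    # clockwise edge vectors: end vertex minus start vertex for each side
    edges = np.roll(quad, -1, axis=0) - quad
    edges /= np.linalg.norm(edges, axis=1, keepdims=True)
    # cosine similarity of each line boundary vector with the direction
    sims = edges @ np.asarray(direction, dtype=float)
    k = int(np.argmax(sims))            # best-aligned edge starts at number 0
    return np.roll(quad, -k, axis=0)

# A box flipped 180 degrees: clockwise vertices starting at bottom-right.
quad = [(100, 50), (0, 50), (0, 0), (100, 0)]
ordered = reading_order(quad, (1, 0))   # number 0 lands on the (0, 0) corner
```

For this flipped box, the edge from (0, 0) to (100, 0) has cosine similarity 1 with the rightward reading direction, so the numbering restarts there.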
In one embodiment, each word in the text corresponds to one single word reading direction vector; step S240, determining the reading direction of the text according to the single word reading direction vector, includes:
for each word, calculating the cosine similarity between the single word reading direction vector corresponding to the word and each word boundary vector, where each word boundary vector is a vector formed by one of the four clockwise sides of the text region in which the word is located; and obtaining the reading direction of the word according to the cosine similarity corresponding to each word boundary vector, the reading direction including the reading sequence numbers of the four vertices of the text region where the word is located.
For example, for the four sides of the text region where a word is located, an initial vector corresponding to each side may be calculated in clockwise order (i.e., for each side taken clockwise, the coordinates of its starting vertex are subtracted from those of its ending vertex to obtain the initial vector), and the initial vectors of the four sides are normalized to form the word boundary vector of each side.
The single word reading direction vector corresponding to the word is also a unit vector, so the cosine similarity between it and the word boundary vector of each side can be calculated, yielding the cosine similarities corresponding to the word boundary vectors of the 4 sides.
According to the cosine similarity corresponding to each word boundary vector, the reading direction of the word can be obtained, including the reading sequence numbers of the four vertices of the text region where the word is located; these reading sequence numbers can be assigned in descending order of the cosine similarities corresponding to the word boundary vectors of the 4 sides.
Further, in one embodiment, the reading direction includes the reading sequence numbers of the four vertices of the text region; after step S250, determining the reading direction of the text according to the single word reading direction vector, the method further includes: cutting out a line image corresponding to the text according to the reading sequence numbers; determining the arrangement type corresponding to the line image according to the side between the two vertices corresponding to the first two reading sequence numbers; and sending the line image to a text recognizer of the corresponding arrangement type for text recognition to obtain the recognized text content.
In some implementations, the text region is the text line region in which each text line is located. Cutting out the line image corresponding to the text according to the reading sequence numbers includes: cutting the text line region where a text line is located through perspective transformation according to the reading sequence numbers of the four vertices of that region, so as to obtain the line image corresponding to the text line. Specifically, as shown in the 3rd image of fig. 6, the text line region is cut according to the reading sequence numbers of its four vertices to obtain the line image of the text line; cutting according to the reading sequence numbers allows even a flipped text line to be cut into a line image that can be effectively recognized. Further, the arrangement type corresponding to the line image is determined according to the side between the two vertices corresponding to the first two reading sequence numbers. For example, as shown in the 3rd image of fig. 6, the reading sequence numbers of the four vertices of the text line region where the text line "food style" is located are 0, 1, 2 and 3 in clockwise order starting from the vertex at the lower right corner. The side between the two vertices corresponding to the first two reading sequence numbers is the side between reading sequence numbers 0 and 1. If the side between 0 and 1 is the long side, the arrangement type corresponding to the line image can be determined to be "horizontal"; as shown in the left image of fig. 6, the arrangement type of the line image corresponding to the text line "food style" is "horizontal", because the words of "food style" form horizontal text read horizontally.
If the side between 0 and 1 is the short side, the arrangement type corresponding to the line image can be determined to be "vertical"; as shown in the 4th image of fig. 6, the arrangement type corresponding to the line image of the text line "temptation of food" is "vertical", because the words of "temptation of food" form vertical text read vertically. Further, each line image is sent to the text recognizer of the corresponding arrangement type for text recognition to obtain the recognized text content. For example, the line image corresponding to the text line "temptation of food" in the 4th image of fig. 6 can be sent to the text recognizer of the "vertical" arrangement type, obtaining the recognized "temptation of food", and the line image corresponding to the text line "food Tastyle" in the 4th image of fig. 6 can be sent to the text recognizer of the "horizontal" arrangement type, obtaining the recognized "food Tastyle".
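The long-side/short-side decision described above reduces to comparing two side lengths once the vertices carry reading sequence numbers (a minimal sketch; the function name and the tie-breaking toward "horizontal" are assumptions):

```python
import math

def arrangement_type(ordered_quad):
    """ordered_quad: the four vertices already labeled by reading
    sequence number 0..3. If the side between numbers 0 and 1 is the
    longer one, the line reads horizontally; otherwise vertically."""
    top = math.dist(ordered_quad[0], ordered_quad[1])    # side 0 -> 1
    side = math.dist(ordered_quad[1], ordered_quad[2])   # side 1 -> 2
    return "horizontal" if top >= side else "vertical"

wide = arrangement_type([(0, 0), (100, 0), (100, 30), (0, 30)])  # "horizontal"
tall = arrangement_type([(0, 0), (30, 0), (30, 100), (0, 100)])  # "vertical"
```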
In another approach, the text region is the text region in which each word is located. Cutting out the line image corresponding to the text according to the reading sequence numbers includes: cutting the text region where a word is located through perspective transformation according to the reading sequence numbers of the four vertices of that region, so as to obtain a word image of the word; and splicing the word images of the words of the same text line to obtain the line image corresponding to each text line in the text. Further, according to the side between the two vertices corresponding to the first two reading sequence numbers in the text region where a word included in text line A is located: if that side corresponds to the long side of the line image of text line A, the arrangement type of the line image of text line A is determined to be "horizontal"; if that side corresponds to the short side of the line image of text line A, the arrangement type is determined to be "vertical". The line image is then sent to the text recognizer of the corresponding arrangement type for text recognition, so that the recognized text content can be accurately obtained.
The foregoing embodiments are further described below in connection with the text detection process in a scenario to which the foregoing embodiments of the present application are applied.
The text detection process in this scenario may include steps (1) to (5):
Step (1): extracting a feature map corresponding to an image to be detected (Input image), wherein the image to be detected includes text. Extracting the feature map corresponding to the image to be detected includes: sequentially extracting features of the image to be detected through at least two levels of feature extraction networks to obtain stage feature maps of at least two stages; and fusing the stage feature maps of different stages to obtain the feature map corresponding to the image to be detected. The stage feature maps of different stages may differ in size, and may be fused by superposition or splicing.
In this scenario, referring to fig. 7, the at least two levels of feature extraction networks (Resnet-4 stages) form a 4-level cascaded ResNet50 base network, producing feature maps of 4 stages (each passed through a conv 1x1), and the 4 stage feature maps are fused to obtain the feature map corresponding to the image to be detected (Input image).
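The splicing-style fusion of unequally sized stage feature maps can be sketched as follows (nearest-neighbour upsampling and channel concatenation are illustrative assumptions standing in for whatever fusion the network in fig. 7 actually uses):

```python
import numpy as np

def upsample_nn(feat, size):
    """Nearest-neighbour upsample a (C, H, W) feature map to (C, *size)."""
    _, h, w = feat.shape
    ys = np.arange(size[0]) * h // size[0]   # source row for each target row
    xs = np.arange(size[1]) * w // size[1]   # source col for each target col
    return feat[:, ys][:, :, xs]

def fuse_stages(stage_feats):
    """Bring every stage feature map to the largest spatial size and
    concatenate along the channel axis (the splicing fusion mode)."""
    target = max((f.shape[1], f.shape[2]) for f in stage_feats)
    return np.concatenate([upsample_nn(f, target) for f in stage_feats], axis=0)

# Four pyramid stages of halving resolution, e.g. from a ResNet50 backbone.
stages = [np.random.rand(8, 64 // 2 ** i, 64 // 2 ** i) for i in range(4)]
fused = fuse_stages(stages)             # shape (32, 64, 64)
```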
Step (2): performing text detection processing on the feature map to obtain the text region corresponding to the text. In this scenario, performing text detection processing based on the feature map to obtain the text region corresponding to the text includes: performing text detection processing based on the feature map to obtain the text line region in which each text line is located, where a text line is a line formed by one row of characters in the text; and obtaining the text region according to the text line region where each text line is located in the feature map, the text region including the text line region where each text line is located. Referring to the upper right branch of fig. 6, text detection processing is performed on the feature map to obtain the detection result image shown in the upper right of fig. 6, where each white rectangular area in the detection result image is a text line region where a text line is located.
Step (3): performing reading direction prediction processing on the feature map to obtain the character reading direction vectors corresponding to the pixels in the image to be detected; and calculating based on the character reading direction vectors corresponding to the pixels in the text region to obtain the single word reading direction vector corresponding to the text.
In this scenario, calculating based on the character reading direction vectors corresponding to the pixels in the text region to obtain the single word reading direction vector corresponding to the text includes: averaging the character reading direction vectors corresponding to the pixels in the text line region where each text line is located to obtain an average vector corresponding to each text line; and obtaining the single word reading direction vector corresponding to each text line according to the average vector corresponding to each text line. Referring to the lower right branch of fig. 6, the single word reading direction vector corresponding to each text line in the prediction image at the lower right of fig. 6 is shown by the arrow over that text line.
Step (4): determining the reading direction of the text according to the single word reading direction vector, and cutting out the line image corresponding to the text according to the reading direction, where the line image is used for text detection.
In this scenario, each text line in the text corresponds to one single word reading direction vector, and the text image includes the line image corresponding to each text line in the text. Determining the reading direction of the text according to the single word reading direction vector and cutting out the text image corresponding to the text according to the reading direction includes: for each text line, calculating the cosine similarity between the single word reading direction vector corresponding to the text line and each line boundary vector, where each line boundary vector is a vector formed by one of the four clockwise sides of the text line region where the text line is located; obtaining the reading direction of the text line according to the cosine similarity corresponding to each line boundary vector, the reading direction including the reading sequence numbers of the four vertices of the text line region where the text line is located; and cutting the text line region where the text line is located through perspective transformation according to the reading sequence numbers of the four vertices of that region, so as to obtain the line image corresponding to the text line.
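The perspective transformation used to rectify a text-line region can be sketched by solving the 3x3 homography from the four reading-numbered vertices to the corners of an upright rectangle (a minimal illustration of the underlying math; in practice a library routine such as an OpenCV perspective transform would typically be used):

```python
import numpy as np

def perspective_matrix(src, dst):
    """Solve the 3x3 homography H mapping four src points to four dst
    points, with H[2, 2] fixed to 1."""
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y]); b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y]); b.append(v)
    h = np.linalg.solve(np.array(A, dtype=float), np.array(b, dtype=float))
    return np.append(h, 1.0).reshape(3, 3)

# Rectify a 180-degree-flipped line: vertices listed by reading sequence
# number map onto the corners of an upright 100 x 30 output image.
src = [(100, 30), (0, 30), (0, 0), (100, 0)]
dst = [(0, 0), (100, 0), (100, 30), (0, 30)]
H = perspective_matrix(src, dst)
p = H @ np.array([100.0, 30.0, 1.0])    # vertex with reading number 0
corner = p[:2] / p[2]                   # maps to the output origin (0, 0)
```

Sampling the source image through this homography then yields a line image with the text upright, whatever the original flip.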
Step (5): determining the arrangement type corresponding to the text image according to the side between the two vertices corresponding to the first two reading sequence numbers; and sending the text image to a text recognizer of the corresponding arrangement type for text recognition to obtain the recognized text content.
In this scenario, the text region is the text line region in which each text line is located. For example, as shown in the 3rd image of fig. 6, the reading sequence numbers of the four vertices of the text line region where the text line "fashion style" is located are 0, 1, 2 and 3 in clockwise order starting from the vertex at the lower right corner. The side between the two vertices corresponding to the first two reading sequence numbers is the side between reading sequence numbers 0 and 1. If the side between 0 and 1 is the long side, the arrangement type corresponding to the line image can be determined to be "horizontal"; as shown in the left image of fig. 6, the arrangement type of the line image corresponding to the text line "food style" is "horizontal", because the words of "food style" form horizontal text read horizontally. If the side between 0 and 1 is the short side, the arrangement type corresponding to the line image can be determined to be "vertical"; as shown in the 4th image of fig. 6, the arrangement type corresponding to the line image of the text line "temptation of food" is "vertical", because the words of "temptation of food" form vertical text read vertically. Further, the text image is sent to the text recognizer of the corresponding arrangement type for text recognition to obtain the recognized text content. For example, the line image corresponding to the text line "temptation of food" in the 4th image of fig. 6 may be sent to the text recognizer of the "vertical" arrangement type, obtaining the recognized "temptation of food", and the line image corresponding to the text line "food Tastyle" in the 4th image of fig. 6 may be sent to the text recognizer of the "horizontal" arrangement type, obtaining the recognized "food Tastyle".
Text detection in the related art has the following drawbacks. Drawback 1: a horizontal-text or multi-angle text detector has no information about the text reading direction and cannot determine whether a line image is flipped by 90, 180 or 270 degrees, so the cropped line image may be wrongly flipped, yielding a wrong text recognition result. Drawback 2: complex layouts mix horizontal and vertical text lines, and a conventional text detector cannot judge whether a given text line in a specific image is horizontal or vertical, so when the text is sent to a text recognizer, horizontal text may be sent to the vertical-line recognizer, or vertical text to the horizontal-line recognizer. Drawback 3: determining the arrangement type by classifying each single text line with a neural network easily produces results that are inconsistent with the context, lacks consistency constraints over the whole image, and has limited classification accuracy.
In this scenario, by applying the foregoing embodiments of the present application: first, the concept of the single word reading direction of text is introduced; using the segmentation-based PSENet text detector shown in fig. 7, a reading direction prediction branch is added, a character reading direction vector is predicted for each pixel of a text line, and the single word reading direction vector is obtained by averaging over the entire text line, so that at cropping time it can be determined whether each text line is flipped by 90, 180 or 270 degrees. Second, the horizontal or vertical arrangement type can be judged from the side between the two vertices corresponding to the first two reading sequence numbers, and line images of different arrangement types are sent to the corresponding text recognizers. Third, adding reading direction header information (e.g., the reading sequence numbers) to the network shown in fig. 7 reduces computational overhead and captures more accurate text context predictions. In this scenario, applying the embodiments of the present application effectively improves the accuracy of text detection.
In the scenarios shown in fig. 8 and fig. 9: in one example, as shown in fig. 8, the text in the image to be detected on the left of fig. 8 is rotated by 180 degrees; by applying the embodiments of the present application, the text in the image to be detected can be accurately recognized, whereas the related art is prone to error. In another example, as shown in fig. 9, the text in the image to be detected on the left of fig. 9 mixes horizontal and vertical arrangements; by applying the embodiments of the present application, the text in the image to be detected can be accurately recognized, whereas the related art is prone to error.
In order to facilitate better implementation of the text detection method provided by the embodiments of the present application, an embodiment of the present application also provides a text detection apparatus based on the text detection method. Where the meaning of a term is the same as in the text detection method described above, specific implementation details may be found in the description of the method embodiments. Fig. 10 shows a block diagram of a text detection apparatus according to an embodiment of the present application.
As shown in fig. 10, the text detection apparatus 300 may include: the extracting module 310 may be configured to extract a feature map corresponding to an image to be detected, where the image to be detected includes text; the detection module 320 may be configured to perform text detection processing on the feature map to obtain a text region corresponding to the text; the prediction module 330 may be configured to perform a reading direction prediction process on the feature map to obtain a character reading direction vector corresponding to a pixel in the text region; the calculation module 340 may be configured to calculate, based on the character reading direction vector corresponding to the pixel in the text region, a word reading direction vector corresponding to the text; the clipping module 350 may be configured to determine a reading direction of the text according to the single word reading direction vector, where the reading direction is used to clip a line image corresponding to the text for text detection.
In some embodiments of the application, the detection module comprises a first detection unit for: performing text detection processing based on the feature map to obtain a text line area in which each text line is located, wherein the text lines are lines formed by a row of characters in the text; and obtaining the text region according to the text line region where each text line is located in the feature map, wherein the text region comprises the text line region where each text line is located.
In some embodiments of the application, the computing module includes a first computing unit to: averaging the character reading direction vectors corresponding to the pixels in the text line area where each text line is located to obtain an average vector corresponding to each text line; and obtaining a single word reading direction vector corresponding to each text line according to the average vector corresponding to each text line.
In some embodiments of the application, the detection module comprises a second detection unit for: performing text detection processing based on the feature map to obtain a text region in which each word in the text is located; and obtaining the text region according to the text region where each word is located in the text, wherein the text region comprises the text region where each word is located.
In some embodiments of the application, the computing module includes a second computing unit for: averaging the character reading direction vectors corresponding to the pixels in the text region where each word is located to obtain an average vector corresponding to each word; and obtaining a single word reading direction vector corresponding to each word according to the average vector corresponding to each word.
In some embodiments of the present application, each text line in the text corresponds to one of the word reading direction vectors; the cutting module comprises a first cutting unit and is used for: for each text line, calculating cosine similarity between a single word reading direction vector corresponding to the text line and each line boundary vector, wherein each line boundary vector is a vector formed by four clockwise edges of a text line region where the text line is located; and obtaining the reading direction of the text line according to the cosine similarity corresponding to each line boundary vector, wherein the reading direction comprises the reading sequence numbers of the four vertexes of the text line area where the text line is located.
In some embodiments of the application, each word in the text corresponds to one of the single word reading direction vectors; the cutting module comprises a second cutting unit and is used for: for each word, calculating cosine similarity between a single word reading direction vector corresponding to the word and each word boundary vector, wherein each word boundary vector is a vector formed by four clockwise sides of a word area in which the word is located; and obtaining the reading direction of the word according to the cosine similarity corresponding to each word boundary vector, wherein the reading direction comprises the reading sequence numbers of the four vertexes of the word area where the word is located.
In some embodiments of the present application, the reading direction includes the reading sequence numbers of the four vertices of the text region, and the apparatus further includes a classification module configured to: cut out a line image corresponding to the text according to the reading sequence numbers; determine the arrangement type corresponding to the line image according to the side between the two vertices corresponding to the first two reading sequence numbers; and send the line image to a text recognizer of the corresponding arrangement type for text recognition to obtain the recognized text content.
It should be noted that although in the above detailed description several modules or units of a device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functions of two or more modules or units described above may be embodied in one module or unit in accordance with embodiments of the application. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
In addition, the embodiment of the present application further provides an electronic device, which may be a terminal or a server, as shown in fig. 11, which shows a schematic structural diagram of the electronic device according to the embodiment of the present application, specifically:
The electronic device may include a processor 401 having one or more processing cores, a memory 402 having one or more computer-readable storage media, a power supply 403, an input unit 404, and other components. Those skilled in the art will appreciate that the electronic device structure shown in fig. 11 does not limit the electronic device, which may include more or fewer components than shown, combine certain components, or use a different arrangement of components. Wherein:
the processor 401 is the control center of the electronic device; it connects the various parts of the entire computer device using various interfaces and lines, and performs the various functions of the computer device and processes data by running or executing the software programs and/or modules stored in the memory 402 and calling the data stored in the memory 402, thereby monitoring the electronic device as a whole. Optionally, the processor 401 may include one or more processing cores; preferably, the processor 401 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interface, application programs, etc., and the modem processor mainly handles wireless communication. It will be appreciated that the modem processor may alternatively not be integrated into the processor 401.
The memory 402 may be used to store software programs and modules; the processor 401 executes various functional applications and performs data processing by running the software programs and modules stored in the memory 402. The memory 402 may mainly include a program storage area and a data storage area, where the program storage area may store the operating system, an application program required for at least one function (such as a sound playing function or an image playing function), and the like, and the data storage area may store data created according to the use of the computer device, etc. In addition, the memory 402 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device. Accordingly, the memory 402 may also include a memory controller to provide the processor 401 with access to the memory 402.
The electronic device further comprises a power supply 403 for supplying power to the various components, preferably the power supply 403 may be logically connected to the processor 401 by a power management system, so that functions of managing charging, discharging, and power consumption are performed by the power management system. The power supply 403 may also include one or more of any of a direct current or alternating current power supply, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and the like.
The electronic device may further comprise an input unit 404, which input unit 404 may be used for receiving input digital or character information and generating keyboard, mouse, joystick, optical or trackball signal inputs in connection with user settings and function control.
Although not shown, the electronic device may further include a display unit or the like, which is not described herein. In particular, in this embodiment, the processor 401 in the electronic device loads executable files corresponding to the processes of one or more computer programs into the memory 402 according to the following instructions, and the processor 401 executes the computer programs stored in the memory 402, so as to implement the functions of the foregoing embodiments of the present application.
For example, the processor 401 may execute: extracting a feature map corresponding to an image to be detected, where the image to be detected includes text; performing text detection processing on the feature map to obtain a text region corresponding to the text; performing reading direction prediction processing on the feature map to obtain character reading direction vectors corresponding to the pixels in the text region; calculating based on the character reading direction vectors corresponding to the pixels in the text region to obtain the single word reading direction vector corresponding to the text; and determining the reading direction of the text according to the single word reading direction vector, where the reading direction is used to cut out a line image corresponding to the text for text detection.
It will be appreciated by those of ordinary skill in the art that all or part of the steps of the various methods of the above embodiments may be completed by a computer program, or by the computer program controlling related hardware; the computer program may be stored in a computer-readable storage medium and loaded and executed by a processor.
To this end, embodiments of the present application also provide a computer readable storage medium having stored therein a computer program that can be loaded by a processor to perform the steps of any of the methods provided by the embodiments of the present application.
The computer-readable storage medium may comprise: a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc, and the like.
Since the computer program stored in the computer-readable storage medium can execute the steps of any method provided in the embodiments of the present application, it can achieve the beneficial effects achievable by those methods, which are detailed in the previous embodiments and are not repeated here.
According to one aspect of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium and executes the computer instructions to cause the computer device to perform the methods provided in the various alternative implementations of the application described above.
Other embodiments of the application will be apparent to those skilled in the art from consideration of the specification and practice of the embodiments disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains.
It will be understood that the application is not limited to the embodiments which have been described above and shown in the drawings, but that various modifications and changes can be made without departing from the scope thereof.

Claims (12)

1. A text detection method, comprising:
extracting a feature map corresponding to an image to be detected, wherein the image to be detected comprises a text;
performing text detection processing on the feature map to obtain a text region corresponding to the text;
carrying out reading direction prediction processing on the feature map to obtain character reading direction vectors corresponding to pixels in the text region;
calculating based on the character reading direction vectors corresponding to the pixels in the text region to obtain a single word reading direction vector corresponding to the text;
and determining the reading direction of the text according to the single word reading direction vector, wherein the reading direction is used for cutting a line image corresponding to the text to detect the text.
2. The method of claim 1, wherein the performing text detection processing on the feature map to obtain a text region corresponding to the text comprises:
performing text detection processing based on the feature map to obtain a text line area in which each text line is located, wherein the text lines are lines formed by a row of characters in the text;
and obtaining the text region according to the text line region where each text line is located in the feature map, wherein the text region comprises the text line region where each text line is located.
3. The method according to claim 2, wherein the calculating based on the character reading direction vector corresponding to the pixel in the text region to obtain the word reading direction vector corresponding to the text comprises:
averaging the character reading direction vectors corresponding to the pixels in the text line area where each text line is located to obtain an average vector corresponding to each text line;
and obtaining a single word reading direction vector corresponding to each text line according to the average vector corresponding to each text line.
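The averaging step of claim 3 can be sketched as follows, under the assumption that the per-pixel predictions inside one text-line region are available as a list of (dx, dy) vectors (the helper name is hypothetical):

```python
import numpy as np

def line_word_direction(pixel_vectors):
    # Average the character reading direction vectors of all pixels in the
    # text-line region, then normalise the average into a unit vector.
    avg = np.asarray(pixel_vectors, dtype=float).mean(axis=0)
    norm = np.linalg.norm(avg)
    return avg / norm if norm > 0 else avg

# Noisy per-pixel predictions that roughly agree on "left to right":
noisy = [[0.9, 0.1], [1.1, -0.1], [1.0, 0.0]]
direction = line_word_direction(noisy)  # → array([1., 0.])
```

Averaging over the whole region smooths out per-pixel prediction noise before a single direction is assigned to the line.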
4. The method of claim 1, wherein the performing text detection processing on the feature map to obtain a text region in which the text is located includes:
performing text detection processing based on the feature map to obtain a text region in which each word in the text is located;
and obtaining the text region according to the text region where each word is located in the text, wherein the text region comprises the text region where each word is located.
5. The method of claim 4, wherein the calculating based on the character reading direction vector corresponding to the pixel in the text region to obtain the word reading direction vector corresponding to the text comprises:
averaging the character reading direction vectors corresponding to the pixels in the text region where each word is located to obtain an average vector corresponding to each word;
and obtaining a single word reading direction vector corresponding to each word according to the average vector corresponding to each word.
6. The method of claim 1, wherein each text line in the text corresponds to one of the word reading direction vectors;
the determining the reading direction of the text according to the single word reading direction vector comprises the following steps:
For each text line, calculating cosine similarity between a single word reading direction vector corresponding to the text line and each line boundary vector, wherein each line boundary vector is a vector formed by four clockwise edges of a text line region where the text line is located;
and obtaining the reading direction of the text line according to the cosine similarity corresponding to each line boundary vector, wherein the reading direction comprises the reading sequence numbers of the four vertexes of the text line area where the text line is located.
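The cosine-similarity comparison of claim 6 can be sketched as follows, assuming the text-line region is a quadrilateral with vertices given in clockwise order and that the start vertex of the best-aligned edge receives reading sequence number 1 (a hypothetical numbering scheme for illustration):

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two 2-D vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def reading_numbers(quad, word_dir):
    # quad: 4x2 array of vertices in clockwise order.
    # Build the four clockwise edge vectors of the text-line region.
    edges = [quad[(i + 1) % 4] - quad[i] for i in range(4)]
    # The edge most aligned with the single word reading direction vector
    # starts at the vertex that is read first.
    start = int(np.argmax([cosine(e, word_dir) for e in edges]))
    # Assign reading sequence numbers 1..4 clockwise from that vertex.
    return [(i - start) % 4 + 1 for i in range(4)]

quad = np.array([[0, 0], [4, 0], [4, 1], [0, 1]], dtype=float)
upright = reading_numbers(quad, np.array([1.0, 0.0]))   # → [1, 2, 3, 4]
flipped = reading_numbers(quad, np.array([-1.0, 0.0]))  # → [3, 4, 1, 2]
```

For upright text the top-left vertex is read first; for text rotated 180° the same quadrilateral yields the opposite vertex order, which is the behaviour the direction vector is meant to capture.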
7. The method of claim 1, wherein each word in the text corresponds to one of the single word reading direction vectors;
the determining the reading direction of the text according to the single word reading direction vector comprises the following steps:
for each word, calculating cosine similarity between a single word reading direction vector corresponding to the word and each word boundary vector, wherein each word boundary vector is a vector formed by four clockwise sides of a word area in which the word is located;
and obtaining the reading direction of the word according to the cosine similarity corresponding to each word boundary vector, wherein the reading direction comprises the reading sequence numbers of the four vertexes of the word area where the word is located.
8. The method of any one of claims 1 to 7, wherein the reading direction comprises the reading sequence numbers of the four vertexes of the text region; after the reading direction of the text is determined according to the single word reading direction vector, the method further comprises:
cutting out a line image corresponding to the text according to the reading sequence numbers;
determining the arrangement type corresponding to the line image according to the edge between the two vertexes corresponding to the first two reading sequence numbers;
and sending the line image to a text recognizer of the corresponding arrangement type to perform text recognition, and obtaining recognized text content.
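The arrangement-type decision of claim 8 can be sketched as follows, assuming the edge between the vertices holding reading sequence numbers 1 and 2 determines whether the cropped line image goes to a horizontal or a vertical recognizer (function and labels are hypothetical):

```python
import numpy as np

def arrangement_type(quad, numbers):
    # quad: 4x2 array of vertices; numbers: reading sequence number of each.
    idx = {n: i for i, n in enumerate(numbers)}
    # Edge from the vertex read first to the vertex read second.
    edge = quad[idx[2]] - quad[idx[1]]
    # A mostly-horizontal first edge means a horizontally arranged line.
    return "horizontal" if abs(edge[0]) >= abs(edge[1]) else "vertical"

wide = np.array([[0, 0], [4, 0], [4, 1], [0, 1]], dtype=float)
tall = np.array([[0, 0], [1, 0], [1, 4], [0, 4]], dtype=float)
wide_type = arrangement_type(wide, [1, 2, 3, 4])  # → "horizontal"
tall_type = arrangement_type(tall, [1, 4, 3, 2])  # → "vertical"
```

Routing by arrangement type lets a horizontally trained and a vertically trained recognizer each receive only the line images they were trained for.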
9. A text detection device, comprising:
the extraction module is used for extracting a feature map corresponding to an image to be detected, wherein the image to be detected comprises a text;
the detection module is used for carrying out text detection processing on the feature map to obtain a text region corresponding to the text;
the prediction module is used for performing reading direction prediction processing on the feature map to obtain character reading direction vectors corresponding to the pixels in the text region;
the calculation module is used for calculating based on the character reading direction vectors corresponding to the pixels in the text region to obtain a single word reading direction vector corresponding to the text;
and the cutting module is used for determining the reading direction of the text according to the single word reading direction vector, wherein the reading direction is used for cutting a line image corresponding to the text to detect the text.
10. A computer-readable storage medium, having stored thereon a computer program which, when executed by a processor of a computer, causes the computer to perform the method of any of claims 1 to 8.
11. An electronic device, comprising: a memory storing a computer program; a processor reading a computer program stored in a memory to perform the method of any one of claims 1 to 8.
12. A computer program product, characterized in that the computer program product comprises a computer program which, when executed by a processor, implements the method of any one of claims 1 to 8.
CN202211190875.3A 2022-09-28 2022-09-28 Text detection method, device, storage medium, electronic equipment and product Pending CN117037175A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211190875.3A CN117037175A (en) 2022-09-28 2022-09-28 Text detection method, device, storage medium, electronic equipment and product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211190875.3A CN117037175A (en) 2022-09-28 2022-09-28 Text detection method, device, storage medium, electronic equipment and product

Publications (1)

Publication Number Publication Date
CN117037175A true CN117037175A (en) 2023-11-10

Family

ID=88625005

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211190875.3A Pending CN117037175A (en) 2022-09-28 2022-09-28 Text detection method, device, storage medium, electronic equipment and product

Country Status (1)

Country Link
CN (1) CN117037175A (en)

Similar Documents

Publication Publication Date Title
CN110008809B (en) Method and device for acquiring form data and server
EP3620981A1 (en) Object detection method, device, apparatus and computer-readable storage medium
CN116168351B (en) Inspection method and device for power equipment
CN108021863B (en) Electronic device, age classification method based on image and storage medium
CN111459964A (en) Template-oriented log anomaly detection method and device based on Word2vec
CN113159026A (en) Image processing method, image processing apparatus, electronic device, and medium
CN111160234A (en) Table recognition method, electronic device and computer storage medium
CN114299030A (en) Object detection model processing method, device, equipment and storage medium
CN113129298A (en) Definition recognition method of text image
CN113591746A (en) Document table structure detection method and device
CN111340139B (en) Method and device for judging complexity of image content
CN117037175A (en) Text detection method, device, storage medium, electronic equipment and product
CN115620315A (en) Handwritten text detection method, device, server and storage medium
CN112580638B (en) Text detection method and device, storage medium and electronic equipment
CN115374517A (en) Testing method and device for wiring software, electronic equipment and storage medium
US11281935B2 (en) 3D object detection from calibrated 2D images
CN114329030A (en) Information processing method and device, computer equipment and storage medium
CN112906708A (en) Picture processing method and device, electronic equipment and computer storage medium
CN113591862A (en) Text recognition method and device
CN112580655A (en) Text detection method and device based on improved CRAFT
CN111860502B (en) Picture form identification method and device, electronic equipment and storage medium
CN117540306B (en) Label classification method, device, equipment and medium for multimedia data
CN113486867B (en) Face micro-expression recognition method and device, electronic equipment and storage medium
CN113435339B (en) Vehicle attribute detection method, device and storage medium
CN115690131A (en) Matting method, matting device, electronic device, storage medium, and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination